A Bytecode-based Approach for Smart Contract Classification

05/31/2021 ∙ by Chaochen Shi, et al. ∙ Monash University

With the development of blockchain technologies, the number of smart contracts deployed on blockchain platforms is growing exponentially, which makes it difficult for users to find desired services by manual screening. The automatic classification of smart contracts can provide blockchain users with keyword-based contract searching and helps to manage smart contracts effectively. Current research on smart contract classification focuses on Natural Language Processing (NLP) solutions which are based on contract source code. However, more than 94% of smart contracts are not open-source, so the application scenarios of NLP methods are very limited. Meanwhile, NLP models are vulnerable to adversarial attacks. This paper proposes a classification model based on features from contract bytecode instead of source code to solve these problems. We also use feature selection and ensemble learning to optimize the model. Our experimental studies on over 3,300 real-world Ethereum smart contracts show that our model can classify smart contracts without source code and has better performance than baseline models. Our model also has good resistance to adversarial attacks compared with NLP-based models. In addition, our analysis reveals that account features used in many smart contract classification models have little effect on classification and can be excluded.




1 Introduction

A smart contract is an event-driven program running on distributed ledgers. The concept of the smart contract was originally introduced by Szabo [szabo1996smart], providing a commitment defined in a digital form. As of April 2020, the number of smart contracts on Ethereum [buterin2014next] exceeds two million [etherscan]. As the number of contracts increases, helping users find the services they need among a massive number of contracts has become an important issue. The primary query APIs of smart contracts provided by blockchain platforms are based on contract address, block number, transaction hash, and timestamp. Some commercial tools such as Google BigQuery [bigquery] and Dfuse [dfuse] provide blockchain databases with SQL and GraphQL support to realize complex queries on raw data. However, as blockchain platforms are gradually evolving into distributed data centers, users desire a more convenient searching experience, e.g., searching by keywords [jiang2020searchain] or categories [huang2017towards]. An essential step to conduct such searching is labeling smart contracts accurately. Currently, the identification of smart contracts relies on manual labeling, which is costly and inefficient. Therefore, it is necessary to design an effective classification model to classify and label existing or newly uploaded contracts automatically. The goal of the classification is as follows:

The dataset is defined as $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ refers to a smart contract and $y_i$ belongs to $C$, which is a predefined collection of categories, $C = \{C_1, \ldots, C_k\}$. The goal is to learn a mapping function $f: x \mapsto y$ which maps an input $x_i$ to the category $y_i$ it belongs to.

Fig. 1: The composition of the smart contract.

As shown in Fig. 1, a smart contract consists of source code and comments. Currently, the prevalent method to classify smart contracts is using NLP techniques such as the attention-based LSTM network [tang2015effective] to capture the semantic features from source code and comments (described in Section 2). However, there are two main problems in the existing NLP-based models:

  1. NLP-based models have limited application scenarios. They can only classify open-source contracts [tian2020smart]. However, open-source code is not mandatory for contract developers. Fewer than 4% of contracts on Ethereum are open-source [etherscan], which means that NLP models cannot classify more than 96% of smart contracts;

  2. NLP-based models can be easily attacked by adversarial examples [zhang2020adversarial]. Context features like comments, variable names, and function names can be easily modified in the source code without changing the logic of the code. Developers can write their source code in different ways to perform the same functions; however, any addition, deletion, or modification of the source code may fool the NLP-based classifier.

To solve these problems, we need models which can classify smart contracts without the source code. Inspired by the wide use of bytecode features in other areas of smart contracts, such as contract vulnerability and fraud detection, we found that bytecodes can reflect the functional features from the logical aspect. We expect that bytecode features can also be successfully applied to smart contract classifications.

In this paper, we propose a multi-classification model for smart contracts based on bytecode features. Compared with NLP-based models, our bytecode-based approach can effectively classify smart contracts without source code, which significantly expands the application scenarios. In addition, adversarial attacks against the source code have little effect on the bytecode-based model because the attacked semantic information is discarded after being compiled. Considering that the categories of smart contracts are unevenly distributed in the blockchain platforms, we propose employing feature selection (Binary Particle Swarm Optimization [kennedy1997discrete]) and ensemble learning techniques (Adaboost [sun2007cost]) to solve the problem of data imbalance in our models. This paper focuses on the Ethereum platform, but the approach can easily be expanded to other blockchain platforms. The key contributions of this paper are as follows.

  1. We propose a bytecode-based approach, which is the first approach to classify smart contracts when their source code is intentionally hidden.

  2. We demonstrate that our bytecode-based approach has better resistance to adversarial attacks than state-of-the-art source code-based approaches.

  3. We prove that feature selection and ensemble learning are competitive alternatives to solve data imbalance problems in smart contracts classification.

  4. We determine that account features have little effect on classifications compared with code features and explain the reason. It has guiding significance for future research on the classification model of smart contracts.

2 Background and Related Work

2.1 Background

Ethereum is one of the most popular programmable blockchains with a built-in Turing-complete instruction set. Users can develop customized cryptocurrencies or decentralized applications (Dapps) built on smart contracts on the Ethereum platform. As the core of Ethereum, the Ethereum virtual machine (EVM) executes bytecode compiled from high-level programming languages such as Solidity. The bytecode consists of a series of bytes, and each byte refers to a specific operation represented by a corresponding mnemonic form predefined in the Ethereum yellow paper [wood2014ethereum]. For example, the mnemonic of value 0x01 is ADD, which means the add operation. These mnemonic forms are called opcodes, which reflect the operational logic of programs directly at the EVM level. Table I lists some frequently-used opcodes and their meanings.






Opcode  Description
ADDMOD  Modulo addition operation.
EXP     Exponential operation.
LT      Less than operation.
CALLER  Get the caller address.
GAS     Get the amount of available gas.
PUSH1   Place a 1 byte item on the stack.
SLOAD   Load the first word from storage.
TABLE I: Examples of Ethereum opcodes.
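
The byte-to-mnemonic mapping described above can be illustrated with a minimal Python sketch. It is not the disassembler used in this paper: the `OPCODES` dictionary is a deliberately partial table chosen for illustration, and the function name `disassemble` is hypothetical. The only structural rule it relies on is that PUSH1 to PUSH32 (values 0x60 to 0x7f) are followed by 1 to 32 bytes of immediate data that must be skipped.

```python
# Partial opcode table for illustration only; a real tool needs the full
# instruction set from the Ethereum yellow paper.
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x08: "ADDMOD", 0x0A: "EXP", 0x10: "LT",
    0x33: "CALLER", 0x54: "SLOAD", 0x5A: "GAS",
}


def disassemble(bytecode_hex):
    """Turn a hex bytecode string into a list of opcode mnemonics."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    ops = []
    i = 0
    while i < len(code):
        byte = code[i]
        if 0x60 <= byte <= 0x7F:        # PUSH1..PUSH32 carry immediate data
            n = byte - 0x5F
            ops.append(f"PUSH{n}")
            i += n                      # skip the n immediate bytes
        elif 0x80 <= byte <= 0x8F:      # DUP1..DUP16
            ops.append(f"DUP{byte - 0x7F}")
        elif 0x90 <= byte <= 0x9F:      # SWAP1..SWAP16
            ops.append(f"SWAP{byte - 0x8F}")
        else:                           # anything outside the partial table
            ops.append(OPCODES.get(byte, f"UNKNOWN_0x{byte:02x}"))
        i += 1
    return ops
```

For instance, `disassemble("0x6001600201")` decodes two PUSH1 instructions (with immediates 0x01 and 0x02) followed by ADD.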

Based on the EVM, developers can deploy smart contracts on the Ethereum platform easily. The process can be divided into three steps: first, use a high-level language like Solidity to write the smart contract source code; second, compile the source code into EVM bytecode; and finally, deploy the compiled contract through Ethereum clients.

Every user of Ethereum can hold an account. An Ethereum account is identified by a 20-byte address and has four fields: nonce, balance, contract bytecode (if any), and storage (usually empty). Only contract accounts have a code field, which stores the codeHash (the hash value of the EVM code for this account). This field cannot be modified after creation, which means that the smart contract is immutable. When the contract account receives a message, the contract is activated. This allows it to read and write to the internal storage, send messages out, or create a new contract. We use both bytecode features and account information in training the smart contract classifier.

2.2 Related Work

There are few studies on the classification of smart contracts. Huang et al. [huang2017towards] have introduced a smart contract classification method based on the word embedding model. This method captures the semantics of the contract source code through the LSTM network and obtains word vectors. Finally, word vectors and account characteristics are input into the feedforward neural network; the probability distribution of the category labels is output. Gang et al. [tian2020smart] have proposed a novel classification model called SCC-BiLSTM. It employs the Gaussian LDA (GLDA) model and attention mechanism to improve the classifier’s performance. This model solves the sparse semantic problem of annotations in the source code, and the attention mechanism is used to capture vital code features. The experimental results show that this model achieves superior effectiveness on smart contract classification tasks, but it still relies on open-source contracts.

Studies have used bytecode or opcode to analyze smart contracts. Oyente [luu2016making] is a symbolic execution tool released by Melonport to detect potential security vulnerabilities such as reentrancy, timestamp dependence, and logic errors in smart contracts. Oyente works directly with the EVM bytecode and opcodes without access to high-level programming languages like Solidity or Python. The research by Chen et al. [chen2018detecting] has used features extracted from bytecode to detect Ponzi schemes in Ethereum smart contracts. This model extracts features from bytecode in manually labeled contract samples and trains the regression tree model with the XGBoost [chen2016xgboost] algorithm. The most significant innovation is that by using this bytecode-based model Ponzi schemes can be detected once contracts are created. Chen et al. [chen2018understanding] conducted an investigation on Ethereum through graph analysis. They collected all transaction data by customizing an Ethereum client using opcodes. Barati et al. [bar20] show that some data privacy rules can be translated into smart contracts and appear as opcodes to automatically verify the way providers operate on user data.

Unlike source code-based approaches, we use features from contract bytecode to train the classification model. Since bytecode is immutable and openly accessible, the bytecode-based classifier applies to all contracts regardless of whether they are open-source. This is the main difference between our approach and NLP-based approaches.

3 Proposed Methodology

3.1 Framework

The overall framework is illustrated in Fig. 2. We first collected verified smart contracts by crawling Ethereum explorers (etherscan.io and stateofthedapps.com), including the contract bytecode and related account information. The second step converts the bytecode into opcodes and extracts the code features to train the 0-day model, which classifies contracts once they are uploaded. The third step integrates the contract behavior features from the transaction history to train the full-feature model. To solve the problems of feature redundancy and sample imbalance in the model training, we also propose an ensemble learning-based multi-classification algorithm with a binary particle swarm optimization (BPSO) method.

Fig. 2: The framework of the smart contracts classification approach.

3.2 Data

To train and test our model, we collected 11,000 smart contracts from the top 100 Ethereum Dapps ranked by their user activities (unique source addresses in transactions to Dapp contracts) over the past 30 days, as of May 1, 2020. All smart contracts are collected from the Ethereum explorers etherscan.io and stateofthedapps.com through web crawlers. After deleting duplicate contracts and contracts which have never been triggered, 3,381 contracts are left, 1,501 of which are open-source. Each contract contains full information, including the bytecode and account information. We also collected all of the transaction histories of these contracts, such as the number of transactions and the amount of transferred Ether, for further feature extraction. The collected contracts are manually divided into six categories (Governance, Finance, Gambling, Game, Wallet, and Social) according to the Dapps to which they belong. The distribution of the collected contracts is shown in Fig. 3. The imbalance ratio of the samples is 19, which is similar to the current Ethereum environment; game and gambling contracts appear more frequently than other categories.
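
As a small illustration of the imbalance ratio mentioned above, the sketch below assumes the common definition (size of the largest class divided by the size of the smallest class). The function name and the per-category counts in the usage note are hypothetical and are not the actual counts of our data set.

```python
def imbalance_ratio(class_counts):
    """Ratio between the largest and the smallest non-empty class.

    Assumes the usual majority/minority definition of the imbalance ratio.
    """
    counts = [c for c in class_counts.values() if c > 0]
    if not counts:
        raise ValueError("at least one non-empty class is required")
    return max(counts) / min(counts)
```

With hypothetical counts such as `{"Game": 190, "Wallet": 10, "Finance": 50}`, the function returns 19.0.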

Fig. 3: The distribution of collected smart contracts.

3.3 Feature Extraction

Feature extraction and selection are key upstream capabilities for building a high-performance classifier. Previous work, e.g., [luu2016making, pham2016anomaly], has extracted code features mainly from bytecode for vulnerability detection and pattern recognition. In other proposals, e.g., [tian2020smart, 8525395], account and transaction information have been selected as account features to characterize contracts. In our work, we build a 0-day model that is based on code features to classify contracts as early as day 0. This is possible because the code features are available immediately and are immutable once the contracts are uploaded. We integrate code features and account features to train the full-feature model to improve the classification accuracy for the already deployed contracts.

3.3.1 Code Features

As the main body of a smart contract, the bytecode is stored as a string of hexadecimal numbers with the contract account in a Merkle Patricia tree. Unlike source code, bytecode is transparent and can be easily obtained from every contract. As mentioned in Section 2.1, each byte represents a certain opcode, so we disassemble the bytecode into equivalent opcodes with evmdis (https://github.com/Arachnid/evmdis, accessed May 1, 2020) to facilitate the feature extraction. The opcode features can be directly used in classification without any modifications because they reflect all of the logical behaviors of contracts [atzei2017survey] from the perspective of the EVM.

After disassembling bytecodes into opcodes, the frequency of each kind of opcode is calculated and regarded as a feature. Please note that for some opcodes with the same functions, we merge them into one category. For example, both DUP1 and DUP2 are considered as DUP; both PUSH1 and PUSH2 are considered as PUSH. Finally, we find 61 different kinds of opcodes from all 1,501 contracts, which means the dimension of the code feature is 62, including the size of the bytecode. Table II shows the top 10 code features (except size feature) ranked by their average values in three categories.

Rank  Game contracts         Social contracts       Financial contracts
      Feature  Value (avg)   Feature  Value (avg)   Feature  Value (avg)
1     PUSH     134.18        PUSH     84.13         PUSH     57.36
2     DUP      97.44         JUMP     52.12         JUMP     50.21
3     SWAP     89.15         DUP      45.32         SWAP     18.35
4     JUMP     53.24         MSTORE   32.87         RETURN   14.23
5     POP      39.23         SSTORE   31.25         DUP      9.09
6     RETURN   11.21         SWAP     7.62          MUL      4.23
7     MLOAD    4.35          CALL     3.38          MSTORE   2.21
8     CALL     2.31          POP      1.09          SUB      0.77
9     MSTORE   0.89          AND      0.43          STOP     0.34
10    ADD      0.72          ISZERO   0.31          CREATE   0.15
TABLE II: Top 10 code features ranked by average values.
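
The frequency features described in this section can be sketched as follows. This is a minimal illustration rather than our extraction code: it assumes the opcodes have already been disassembled into a list of mnemonics, that "frequency" means a raw occurrence count, and that only the numbered PUSH, DUP, and SWAP variants are merged. The names `opcode_family`, `code_features`, and `vocabulary` are hypothetical.

```python
import re
from collections import Counter


def opcode_family(mnemonic):
    """Merge numbered variants, e.g. PUSH1..PUSH32 -> PUSH, DUP1..DUP16 -> DUP."""
    match = re.fullmatch(r"(PUSH|DUP|SWAP)\d+", mnemonic)
    return match.group(1) if match else mnemonic


def code_features(mnemonics, bytecode_size, vocabulary):
    """Build one feature vector: a count per opcode family plus the bytecode size.

    `vocabulary` fixes the order of the opcode columns so that every
    contract is mapped to a vector of the same dimension.
    """
    counts = Counter(opcode_family(m) for m in mnemonics)
    return [counts.get(name, 0) for name in vocabulary] + [bytecode_size]
```

With a vocabulary of 61 opcode families, the resulting vector has 62 entries, the last one being the size feature.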

According to Table II, the distributions of code features are different among the three categories. The most frequently used features are PUSH, DUP, SWAP, and JUMP. These features all relate to stack operations. That is because almost any operation, such as defining variables and functions, involves stack operations on the EVM. We also found that MSTORE and MLOAD are more frequently used in game contracts than in others. This outcome is reasonable because some game data needs to be stored in and loaded from memory. Other categories of contracts also have feature distributions that reflect their unique characteristics. Thus, we believe code features can be used in contract classification. Although some frequently appearing opcodes are common to different categories, we still use all 61 opcodes as features because they may have hidden connections with each other and cannot be excluded by a simple standard.

3.3.2 Account Features

Account features are selected from account attributes and related transaction history information. These features are only available after the deployment of contracts and may change over time, reflecting how contracts work in a real environment. Thus, we can extract these features from already deployed contracts and combine them with code features to train a full-feature model that classifies deployed contracts.

Previous research [tian2020smart, pham2016anomaly, 8525395] provides a variety of account features. From these, we select features to model smart contracts as follows:

  • Balance: the balance of the contract account, measured in wei.

  • Nonce: the nonce records the sequence of contract creation.

  • Nbr_trans_act and Nbr_trans_psv: the number of active and passive transactions involving the contract.

  • Eth_in and Eth_out: the total amount of income and output Ether of the contract.

  • Eth_avg and Eth_sdev: the mean and standard deviation of the Ether transferred by the contract.

  • Lifetime: the time gap between the initial and the last transaction.

  • Trs_gap_avg and Trs_gap_sdev: the average and standard deviation of the time gap between every two transactions.

  • Nbr_addr: the total number of addresses the contract interacted with.
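
The transaction-based features in the list above can be computed along the following lines. This is a sketch under several assumptions: each transaction is a dictionary with the hypothetical keys `from`, `to`, `value`, and `timestamp`; "active" transactions are those sent by the contract and "passive" ones are those it receives; and the population standard deviation is used. Balance and Nonce are read directly from the account and are therefore omitted.

```python
from statistics import mean, pstdev


def account_features(contract, txs):
    """Derive transaction-history features for one contract address."""
    txs = sorted(txs, key=lambda t: t["timestamp"])
    sent = [t for t in txs if t["from"] == contract]      # assumed "active"
    received = [t for t in txs if t["to"] == contract]    # assumed "passive"
    values = [t["value"] for t in txs]
    times = [t["timestamp"] for t in txs]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    peers = {t["from"] for t in txs} | {t["to"] for t in txs}
    peers.discard(contract)
    return {
        "Nbr_trans_act": len(sent),
        "Nbr_trans_psv": len(received),
        "Eth_in": sum(t["value"] for t in received),
        "Eth_out": sum(t["value"] for t in sent),
        "Eth_avg": mean(values) if values else 0.0,
        "Eth_sdev": pstdev(values) if values else 0.0,
        "Lifetime": times[-1] - times[0] if times else 0,
        "Trs_gap_avg": mean(gaps) if gaps else 0.0,
        "Trs_gap_sdev": pstdev(gaps) if gaps else 0.0,
        "Nbr_addr": len(peers),
    }
```

Because these values depend on the transaction history, recomputing them at a later date can yield a different feature vector for the same contract.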

3.4 Feature Selection

According to the statistics from stateofthedapps.com, the distribution of the smart contract categories on Ethereum is very uneven. The number of game and gambling contracts is much larger than other contracts, while wallet and governance contracts are rare. Thus, the smart contract classification problem can be regarded as a multi-classification problem on an imbalanced data set. Traditional classification algorithms such as Decision Tree, K-Nearest Neighbor, and Support Vector Machine struggle to achieve the desired performance on an imbalanced data set because of their bias towards the majority class. They may treat the minority class samples as noise [sun2007cost], which results in poor classification performance on the minority class.

To improve the classification performance, we integrate feature selection in our classification model. The feature selection process can eliminate the irrelevant and redundant features to reduce the noise in the sample space [chandrashekar2014survey], thereby improving the classification performance of minority classes. In addition, feature selection also helps us find critical features and hidden relationships among a massive number of original features and decreases the time complexity.

When we eliminate the irrelevant features, there is also a risk of losing useful information because the feature selection procedure may alter the original data distribution [sun2015novel]. For imbalanced data classification, we prefer wrapper methods to filter methods such as Mutual Information [guyon2003introduction] or Relief-based algorithms [urbanowicz2018relief], since the correlation between features and targets is not clear. BPSO [kennedy1997discrete] is a stochastic evolutionary algorithm which is widely used for solving optimization problems in binary space. Compared with other wrapper methods such as Genetic Algorithm, Differential Evolution, etc., the complexity of BPSO is much lower since it does not contain crossover and mutation operations. In this paper, we choose BPSO as the feature selection method. Our method follows the original BPSO algorithm and only changes the particle representation and fitness values. We encode binary particles as a multi-dimensional vector with values in $\{0, 1\}$, and each bit of the vector represents a feature which is selected (value 1) or not (value 0). The fitness value of a particle is usually the classification accuracy of the sample subset indicated by the particle. Here, we choose the normalized AUC_area shown as Eq. (6) (described in Section 4.1) as the fitness value instead.
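
The binary particle encoding can be illustrated with one update step of a standard BPSO with inertia weight. This is a sketch, not our implementation: the default values of `w`, `c1`, `c2`, and `v_max` are placeholders chosen for illustration, and the function names are hypothetical. Each velocity component is squashed by a sigmoid, and the result is used as the probability that the corresponding feature bit is set to 1.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def bpso_step(position, velocity, personal_best, global_best,
              w=0.7, c1=2.0, c2=2.0, v_max=4.0, rng=random):
    """One BPSO update of a single particle (a 0/1 feature mask)."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        # Velocity update: inertia + pull towards personal and global best.
        v = w * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
        v = max(-v_max, min(v_max, v))          # clamp to [-v_max, v_max]
        new_velocity.append(v)
        # Position update: the bit is 1 with probability sigmoid(v).
        new_position.append(1 if rng.random() < sigmoid(v) else 0)
    return new_position, new_velocity


def select_features(row, mask):
    """Keep only the feature values whose mask bit is 1."""
    return [value for value, bit in zip(row, mask) if bit == 1]
```

For example, `select_features([10, 20, 30], [1, 0, 1])` keeps the first and third features; the classifier is then trained on the reduced vectors and its score becomes the fitness of that particle.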

3.5 Classification Model

Ensemble learning [dietterich2002ensemble] is a machine learning method that uses a specific rule to combine multiple classifiers as a collection to achieve better predictive performance than an individual classifier. The idea of ensemble learning is that even if a weak classifier obtains an incorrect prediction, other classifiers can correct it. Adaboost.M1 [freund1996experiments] is a typical ensemble learning algorithm that has been widely used to solve multi-classification problems because of its good performance, low complexity, and good resistance to overfitting. Adaboost.M1 creates a simple weak learner for each feature. Weak learners do not need high accuracy in the initial stages, as long as their accuracy is higher than random classification. The weight of the correctly classified samples decreases, and the weight of the incorrectly classified samples remains unchanged after each iteration. Suppose $N$ is the number of samples and $C$ is the collection of categories, $C = \{C_1, \ldots, C_k\}$. Then the classification error rate is

$$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} w_t(i) \tag{1}$$

where $w_t(i)$ is the weight of sample $x_i$, $i = 1, \ldots, N$, and $h_t$ is the weak hypothesis $h_t: x \mapsto y$. Setting a parameter

$$\beta_t = \frac{\epsilon_t}{1 - \epsilon_t} \tag{2}$$

the weight would be updated as

$$w_{t+1}(i) = \frac{w_t(i)}{Z_t} \times \begin{cases} \beta_t, & h_t(x_i) = y_i \\ 1, & \text{otherwise} \end{cases} \tag{3}$$

where $t$ is the current number of iterations and $Z_t$ is a normalization factor that makes the weights sum to 1. In this way, the distribution of samples becomes more balanced after each iteration. Finally, we can obtain a strong classifier with a superior predictive performance by combining the weak learners obtained in iterations. The final strong hypothesis is

$$H(x) = \arg\max_{y \in C} \sum_{t:\, h_t(x) = y} \log \frac{1}{\beta_t}, \quad t = 1, \ldots, T \tag{4}$$

where $T$ is the total number of iterations.
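
The reweighting and voting steps above can be sketched in a few lines, following the standard Adaboost.M1 formulation of [freund1996experiments]. The sketch leaves the weak learner out and works directly on its predictions; the function names are hypothetical, and raising an error when the weighted error is 0 or at least 0.5 is a simplification of how the algorithm handles those cases.

```python
import math


def adaboost_m1_round(weights, predictions, labels):
    """One Adaboost.M1 round: returns (epsilon, beta, updated_weights)."""
    # Weighted error: total weight of the misclassified samples.
    epsilon = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    if not 0.0 < epsilon < 0.5:
        raise ValueError("this sketch requires a weak learner with 0 < error < 0.5")
    beta = epsilon / (1.0 - epsilon)
    # Correctly classified samples are scaled down by beta (< 1);
    # misclassified samples keep their weight before normalization.
    scaled = [w * (beta if p == y else 1.0)
              for w, p, y in zip(weights, predictions, labels)]
    total = sum(scaled)
    return epsilon, beta, [w / total for w in scaled]


def strong_hypothesis(round_predictions, betas, categories):
    """Weighted vote for one sample: each round votes with weight log(1/beta)."""
    votes = {c: 0.0 for c in categories}
    for prediction, beta in zip(round_predictions, betas):
        votes[prediction] += math.log(1.0 / beta)
    return max(votes, key=votes.get)
```

After normalization the misclassified samples carry half of the total weight, which is what pushes the next weak learner towards the samples that are currently hard to classify, often those of minority categories.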

Adaboost.M1 requires relatively high-performance weak learners. We choose C4.5 [quinlan2014c4] as the algorithm of the weak learner for two reasons:

  1. There are numerous missing values in our code features. Thus, C4.5 is suitable for our case as it has good performance and low sensitivity to missing values.

  2. Although all of the binary classification algorithms can be expanded to multi-classification versions via the OvO or OvA strategy [fernandez2013analysing], this significantly increases the complexity of the algorithm. So, we choose C4.5, which can be directly used in multi-classification problems.

We use the BPSO algorithm for feature selection and then put the sample subset selected from samples by the particles into the Adaboost.M1 algorithm. To compute the AUC_area as Eq. (6), the classifier needs to output a $k$-dimensional probability vector for each sample in $k$-class classification. Values in the probability vector are the probabilities that a sample belongs to each class. So there are two probability matrices $P_t$ and $P$ with size $N \times k$ belonging to the weak learner and the final strong classifier respectively. Since $P_t$ and $\beta_t$ are updated in each iteration, $P$ can be computed as the weighted average of $P_t$:

$$P = \frac{\sum_{t=1}^{T} \log(1/\beta_t)\, P_t}{\sum_{t=1}^{T} \log(1/\beta_t)} \tag{5}$$

Algorithm 1 presents the pseudocode of the whole classification algorithm, including the training and classification process. In addition to $P$, we can also obtain the prediction results $H(x)$ and the optimal subset of features $F_{best}$. The notations used in the algorithm are listed in Table III.

Notation      Explanation
$D$           the training set;
$N$           the size of $D$;
$C$           the collection of categories, $C = \{C_1, \ldots, C_k\}$;
$T$           the number of iterations of Adaboost.M1;
$M$           the scale of particles;
$G$           the generation limit of BPSO;
$\omega$      the inertia weight of BPSO;
$c_1$, $c_2$  the acceleration factors of BPSO;
$H$           the strong hypothesis obtained by ensemble learning;
$F_{best}$    the optimal subset of features.
TABLE III: The notations used in Algorithm 1.

Input:  $D$ with $N$ samples and $k$ categories; $T$, $M$, $G$, and $\omega$, $c_1$, $c_2$.
Output: $H$, $F_{best}$ and the classification results;
1:  Initialize each particle $p_j$ randomly as [kennedy1997discrete], $g = 0$;
2:  while (number of generations $g < G$ and BPSO has not converged) do
3:     for $j = 1$ to $M$ do
4:        Select sample subset $S_j$ according to particle $p_j$;
5:        Initialize distribution $w_1(i) = 1/N$;
6:        for $t = 1$ to $T$ do
7:           Train C4.5 classifier with $S_j$ and $w_t$ to obtain weak hypothesis $h_t$ and $P_t$;
8:           Compute the classification error rate $\epsilon_t$ as Eq. (1);
9:           Set parameter $\beta_t$ as Eq. (2);
10:          Update $w_{t+1}$ as Eq. (3);
11:       end for
12:       Obtain $H$ as Eq. (4);
13:       Compute $P$ as Eq. (5);
14:       Set $fit_j$ to the AUC_area of $P$ as Eq. (6), update the fitness value of each particle and the global fitness value based on $fit_j$;
15:       Update the position and velocity of particles with parameters $\omega$, $c_1$ and $c_2$;
16:    end for
17:    $g = g + 1$;
18: end while
Algorithm 1 Framework of the BPSO-Adaboost algorithm.

4 Experiments and Analysis

4.1 Evaluation Metrics

For imbalanced data, the Receiver Operating Characteristic (ROC) curve [fawcett2006introduction] is a well-recognized evaluation metric of classifier performance. The ROC curve is derived from the confusion matrix in Table IV, taking FPR ($FP/(FP+TN)$) as the X-axis and TPR ($TP/(TP+FN)$) as the Y-axis. However, the ROC curve cannot quantitatively evaluate the performance of classifiers. Thus the area under the ROC curve (AUC) is widely used as the evaluation metric. The bigger the AUC is, the better the classifier performance is.
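
As a brief illustration of these quantities, the sketch below computes FPR and TPR from confusion-matrix counts and the binary AUC through its rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half). The function names are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def rates(tp, fp, fn, tn):
    """Return (FPR, TPR) from confusion-matrix counts."""
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    return fpr, tpr


def auc(scores, labels):
    """Binary AUC via pairwise comparison of positive and negative scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

For the scores `[0.9, 0.8, 0.3, 0.1]` with labels `[1, 0, 1, 0]`, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.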

Traditional AUC values can only be used in binary classifiers. For $n$-class classification problems, we can combine classes in pairs and find the AUC of each pair individually. Finally, there are $n(n-1)/2$ AUC values. We put all AUC values in a polar coordinate system and calculate the area of the graph covered by all AUC values as the metric, called AUC_area [hassan2010novel]. A larger AUC_area means better classification performance. Assuming there are $q$ AUC values $r_1, r_2, \ldots, r_q$ respectively, where $q = n(n-1)/2$, the normalized AUC_area is:

$$\text{AUC\_area} = \frac{1}{q}\left(\sum_{i=1}^{q-1} r_i\, r_{i+1} + r_q\, r_1\right) \tag{6}$$

AUC_area is sensitive to categories which have poor AUC values. If there is a poor AUC, the AUC_area will also be poor. Therefore, a classification model needs to obtain high AUC values on all categories to keep a high AUC_area. Compared with average AUC value, AUC_area is more suitable for our case since it has no bias toward the majority category. In this paper, we use normalized AUC_area, accuracy, and Micro-F1 score as evaluation metrics.
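
The polar-area idea can be sketched as follows, assuming the pairwise AUC values are placed at equal angular spacing and that neighbouring points are joined by straight lines, so that the covered region is a polygon made of triangles with area $\frac{1}{2} r_i r_{i+1} \sin\theta$. The function names are hypothetical, and the ordering of the AUC values around the circle is an assumption that affects the result.

```python
import math
from itertools import combinations


def class_pairs(n_classes):
    """All unordered class pairs; there are n(n-1)/2 of them."""
    return list(combinations(range(n_classes), 2))


def normalized_auc_area(auc_values):
    """Polygon area spanned by the AUC values, divided by the maximum area."""
    q = len(auc_values)
    if q < 3:
        raise ValueError("at least three AUC values are needed to span an area")
    theta = 2.0 * math.pi / q                  # angle between neighbouring axes
    area = 0.5 * math.sin(theta) * sum(
        auc_values[i] * auc_values[(i + 1) % q] for i in range(q))
    max_area = 0.5 * math.sin(theta) * q       # all AUC values equal to 1
    return area / max_area
```

With six categories there are 15 pairwise AUC values. Because each term is a product of two neighbouring values, a single poor AUC lowers two terms at once, which is why the metric penalizes a weak category more strongly than a plain average would.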

                      Actual Values
                      Positive  Negative
Predicted  Positive   TP        FP
Values     Negative   FN        TN
TABLE IV: Confusion matrix of binary classification.

4.2 Experiment Settings

We train and test our 0-day model and full-feature model (mentioned in Section 3.3) with 10-fold cross-validation [kohavi1995study] on all 3,381 contracts. We fix the number of Adaboost.M1 iterations and set the BPSO-related parameters to the standard PSO algorithm settings given in Table V. To evaluate the effect of feature selection and ensemble learning in smart contract classification, we compared our algorithm with C4.5, Adaboost.M1 and BPSO-based C4.5.

Parameter Value
TABLE V: BPSO parameters used in our model.

We train models with the same data set and test them on different test sets to compare the performance and robustness of our approach and state-of-the-art NLP-based approaches. Please note that although the data sets are the same, the code features the two approaches use come from bytecode and source code, respectively. Thus we train models on 1,301 verified contracts that have both source code and bytecode. We test these models on three different test sets: verified contracts (by employing 10-fold cross-validation), 1,880 unverified contracts, and 1,020 contracts with adversarial source code. Adversarial source code means that each comment, variable name, and function name of the source code is attacked by one of four operations chosen at random: add, drop, swap, and replacement. This is similar to real-world attack settings.
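
The four attack operations can be illustrated on a single identifier. The sketch below assumes character-level edits (insert a letter, delete a letter, swap two neighbouring letters, replace a letter); the granularity of the perturbation and the function names `apply_op` and `perturb` are assumptions made for illustration only.

```python
import random
import string


def apply_op(name, op, i, letter):
    """Apply one character-level edit to `name` at index `i`."""
    if op == "add":
        return name[:i] + letter + name[i:]
    if op == "drop":
        return name[:i] + name[i + 1:]
    if op == "swap":
        if len(name) < 2:
            return name                      # nothing to swap
        j = min(i, len(name) - 2)            # keep j + 1 inside the string
        return name[:j] + name[j + 1] + name[j] + name[j + 2:]
    if op == "replace":
        return name[:i] + letter + name[i + 1:]
    raise ValueError(f"unknown operation: {op}")


def perturb(name, rng):
    """Attack one identifier with a randomly chosen operation."""
    if not name:
        return name
    op = rng.choice(["add", "drop", "swap", "replace"])
    i = rng.randrange(len(name))
    letter = rng.choice(string.ascii_lowercase)
    return apply_op(name, op, i, letter)
```

For example, `apply_op("balance", "swap", 0, "x")` yields `"ablance"`. Such an edit changes the tokens seen by a source-code-based model while leaving the program logic untouched.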

4.3 Performance Evaluation

4.3.1 Performance comparison between 0-day model and full-feature model

Table VI and Table VII show the performance of each algorithm on the 0-day model and the full-feature model respectively over our data set. To compare the classification performance of different algorithms intuitively, Fig. 4 and Fig. 5 show the polar graphs of the AUC values of these algorithms. In the following figures, the six categories (Governance, Finance, Gambling, Game, Wallet, Social) are numbered from 1 to 6 in order. The results show that the gap between the two models is tiny. In fact, 23 code features and only 2 account features (Balance and Trs_gap_avg) were selected in the best feature subset of the full-feature model. This means that account features have little impact on classification performance. That is reasonable because account features are not stable: they change over time and can be influenced by many external factors. For example, market volatility and policy changes may lead to a substantial increase or decrease in Eth_avg and Eth_sdev. Thus account features are not robust enough to serve as key features. In most cases, our 0-day model is sufficient for both newly uploaded and deployed contracts because contract bytecode is immutable after uploading.

Algorithm      AUC_area  Micro-F1  Accuracy  Accuracy for each category
                                             Governance  Finance  Gambling  Game   Wallet  Social
BPSO-Adaboost  0.923     0.955     0.932     0.917       0.923    0.910     0.951  0.885   0.919
Adaboost.M1    0.894     0.878     0.904     0.792       0.897    0.912     0.946  0.779   0.871
BPSO-C4.5      0.829     0.911     0.851     0.788       0.834    0.864     0.887  0.795   0.828
C4.5           0.797     0.721     0.780     0.745       0.803    0.852     0.893  0.731   0.778
TABLE VI: Performance of different algorithms on 0-day model.

Algorithm      AUC_area  Micro-F1  Accuracy  Accuracy for each category
                                             Governance  Finance  Gambling  Game   Wallet  Social
BPSO-Adaboost  0.931     0.964     0.939     0.922       0.928    0.915     0.960  0.891   0.924
Adaboost.M1    0.904     0.893     0.918     0.799       0.904    0.911     0.934  0.792   0.841
BPSO-C4.5      0.845     0.920     0.877     0.796       0.857    0.875     0.899  0.795   0.833
C4.5           0.803     0.738     0.774     0.747       0.812    0.861     0.897  0.723   0.769
TABLE VII: Performance of different algorithms on a full-feature model.

Algorithm      Verified Contracts             Unverified Contracts           Adversarial Examples
               AUC_area  Micro-F1  Accuracy   AUC_area  Micro-F1  Accuracy   AUC_area  Micro-F1  Accuracy
BPSO-Adaboost  0.916     0.943     0.920      0.904     0.938     0.934      0.912     0.946     0.929
SCC-BiLSTM     0.925     0.957     0.933      0.242     0.457     0.489      0.389     0.638     0.664
TABLE VIII: The performance of the BPSO-Adaboost and SCC-BiLSTM algorithms on different test sets.
Fig. 4: The AUC_area of algorithms on 0-day model
Fig. 5: The AUC_area of algorithms on full-feature model

4.3.2 The Effect of Feature Selection and Ensemble Learning

According to the experimental results, the relationship between the two evaluation metrics remains consistent: higher AUC_area values come with higher accuracies. Overall, the performance of the BPSO-Adaboost algorithm is superior to the other three algorithms. The Adaboost.M1 algorithm performs much better than C4.5 over all categories, demonstrating that ensemble learning improves the performance of individual weak classifiers on the imbalanced data set. Besides, we found that algorithms with the BPSO algorithm exhibit better performance than those without in terms of both overall accuracy and the accuracy of minority categories. This confirms our previous assumption that the BPSO algorithm can exclude redundant features, which reduces the noise in minority samples and improves the overall performance of the model. In conclusion, both feature selection and ensemble learning have positive contributions to our model; they play essential roles in classifying samples of minority categories, which are of particular concern in imbalanced data.

4.3.3 Robustness Comparison Between Our Approach and the State-of-the-art NLP-based Approach

Table VIII and Fig. 6 show the performance of our approach and a representative state-of-the-art NLP-based approach, the SCC-BiLSTM algorithm [tian2020smart], under the different test sets mentioned in Section 4.2. For verified contracts, the SCC-BiLSTM algorithm has a slightly better performance than the proposed BPSO-Adaboost algorithm, but the results are very close. For unverified contracts, the BPSO-Adaboost algorithm retains its high performance; in contrast, the SCC-BiLSTM algorithm degenerates into a random classifier due to a lack of code features. The performance of the SCC-BiLSTM algorithm is also poor on adversarial examples, but the BPSO-Adaboost algorithm is barely affected. The reason is that the key features of the NLP-based approach are mainly semantic features that are not robust at all once attacked. For our bytecode-based approach, the attack on the source code also causes bytecode changes. However, the changes to equivalent opcodes happen only after KECCAK256 [buterin2014next] because the opcodes of all the variable names, function names, and comments need to be computed as keccak-256 hashes. Thus the weights of opcodes after KECCAK256 are significantly lower than those of opcodes that relate to core operations and have little impact on classification. These results indicate that our bytecode-based approach has performance similar to the state-of-the-art NLP-based approach and has much better robustness.

Fig. 6: The AUC_area of BPSO-Adaboost and SCC-BiLSTM under different test sets.

5 Conclusion

This paper proposes a novel bytecode-based classification approach designed to effectively classify smart contracts of blockchain platforms. Considering that traditional classifiers have poor performance on imbalanced data sets, we use a feature selection method to reduce the noise of the samples as well as an ensemble learning approach to improve the overall performance of the classifier. Comparative experiments prove the superiority of each element in our algorithm. The result of feature selection also reveals why the full-feature model has little improvement over the 0-day model. Compared with a state-of-the-art NLP-based approach, our bytecode-based approach provides good performance and offers two key advantages. First, it dramatically expands the application scenarios for which the classifier can be used (i.e., both open-source and non-open-source contracts). Second, our method can defend against semantic attacks. These results demonstrate that our bytecode-based approach has better robustness than approaches that depend on contract source code.

This paper focuses on demonstrating the bytecode-based model’s advantages in the classification of smart contracts. In the future, we plan to further study this problem from three aspects. The first is to extend the data set as the number of smart contracts proliferates. We will continually improve the model with more ground truth data, including improving the classification accuracy and increasing the number of categories that the model can classify. However, some smart contracts belong to categories which are hard to classify clearly, such as contracts used for identification [8951253] or monitoring [8873576]. We expect that more detailed definitions of existing smart contracts will become available. The second is to expand our smart contract classification model to other blockchain platforms. The potential targets are platforms with similar virtual machine architecture to EVM, e.g., Hyperledger Fabric [androulaki2018hyperledger]. Finally, we plan to explore more derivative functions based on the results of smart contract classification. For example, taking the popularity and gas efficiency of contracts into account to do top-k searching, or recommending specific categories of contracts to users based on their preferences.