Recent developments in digital currencies gave birth not only to a completely new way of exchanging value, but also to areas such as distributed trust management. These advances may replace traditional notary services or payment processing companies in the near future. They are made possible by a technology called blockchain which, at its core, is an immutable, distributed database. The first public blockchain, Bitcoin, was launched in 2009 and, not surprisingly, attracted from the very beginning fraudulent actors who tried to take advantage of other participants. These actors often try to convince others to send digital currency to their accounts, using techniques such as malware or fake emails. Because blockchain data is publicly available, information about an account once flagged as fraudulent can be shared without limitation. In contrast to traditional financial systems, all transfers to and from such an account can be freely viewed and analyzed. The availability of this data gives us an opportunity to verify whether there is a meaningful relation between the operations performed on an account and the account being fraudulent.
In this paper, we propose a novel approach for detecting fraudulent accounts on the Ethereum network. Ethereum is a blockchain that offers some significant improvements over Bitcoin. Those improvements make it easier to write and execute contracts (called smart contracts). These contracts allow many different actors to engage in complex agreements that are fully executable and can be verified using the underlying protocol. More details on Ethereum can be found in the Ethereum white paper and yellow paper listed in the references.
In the first stage, we automatically gathered the available data about accounts and transactions. Then, we derived explanatory variables from the raw data; they represent aggregates and statistics computed over volumes and time. In the next stage, we tested three classifiers and compared their results in the context of possible applications. Which result is preferable depends strongly on the use case, which may put more weight on precision than on recall or the other way round. The contributions of this study can be summarized as follows:
- We proposed a novel approach for identifying fraudulent accounts on the Ethereum blockchain that is easily transferable to other blockchains, such as Bitcoin.
- We conducted a thorough analysis of three different machine learning algorithms for the task of classifying accounts as “fraudulent” or “not fraudulent”.
- We conducted a sensitivity analysis in order to verify how much the results depend on particular explanatory variables. This test allows us to address the potential look-ahead bias that may exist in the data we gathered.
2 Related work
Detecting fraudulent activity in financial operations is a well-known problem. Both researchers and practitioners devote a lot of attention to developing new tools that correctly identify new attack vectors. This is an endless battle in which both sides use their creativity and new technologies. A comprehensive survey of fraud detection techniques can be found in “Survey of fraud detection techniques” (2004). More recent surveys on fraud prevention systems and on detecting financial fraud through data mining algorithms can be found in “Fraud detection system: a survey” (2016) and “Financial frauds: data mining based detection – a comprehensive survey” (2016), respectively.
Quah and Sriganesh used Self-Organizing Maps (SOM) to detect credit card fraud. Their approach is that a transaction similar to all transactions in a set of genuine transactions is itself considered genuine. Conversely, a transaction that resembles any of the transactions in a set of fraudulent ones is considered fraudulent. In addition to the basic task of clustering input data, Self-Organizing Maps are also used to detect and extract hidden patterns. According to the authors, in real financial systems that verify each transaction on multiple layers, a SOM may also serve as a filter for the layers that follow it. In the case described by the authors, the SOM receives an input data vector consisting of client, account and transaction features.
In a related study, the authors used supervised learning methods to tackle a similar problem. They used logistic regression, Support Vector Machine (SVM) and random forest. Apart from using typical transaction features as the algorithm's input (e.g. order value, type of items ordered, payment method), they engineered several new variables through abstraction and combination, such as a binary variable indicating whether the country of the card transaction matches the country to which the purchased items are to be delivered. Eventually, the authors used 71 features to describe each transaction. The best results were obtained with the random forest method, which is why it was used in their further analysis. As it turned out, although the results in recognizing fraud were quite good, they were not good enough to fully automate the verification of transactions.
In the case of transfers made through blockchain transactions, fraud detection can be more complicated, as most of the time we do not have geographical or personal data about the participants. Pham and Lee dealt with detecting fraud in the Bitcoin network. The network data was modeled as two graphs, a user graph and a transaction graph, which were used to detect anomalies (e.g. fraudulent and suspicious users). They had information about 30 cases of theft in the Bitcoin network, which were later used to verify their results. In both graphs, each vertex was represented with 12 features, such as the in-degree and out-degree, the average time between transactions, the creation date and the activity time. As the first step in the analysis they applied the k-means algorithm to group all graph nodes. As the authors pointed out, this algorithm is not designed to find anomalies, but it may be useful because points that diverge from the rest are expected to lie far from the centroids calculated by k-means. They wanted to investigate whether anomalies in the user graph correspond to anomalies in the transaction graph, i.e. whether “suspicious” users were involved in “suspicious” transactions. To find anomalies in these groups the authors used methods based on the Mahalanobis distance and on the Support Vector Machine (SVM). The suspected users and transactions indicated by the two algorithms overlapped to a large degree. Both methods flagged extreme values as suspicious, i.e. vertices with the largest or smallest degrees. That approach made it possible to detect two authentic anomalies: one theft (detected by the Mahalanobis-distance-based method) and one loss caused by a corruption in a hashing function (detected by the SVM). These results do not seem to be statistically significant, primarily due to the limited number of known thefts (or anomalies in general).
3.1 Data preparation
The data used in the analysis came from the Etherscan.io website, one of the most popular Ethereum blockchain explorers. It provides information about all transactions in the network, mined blocks and user accounts. Over 2 500 wallets had been reported by users as related to illegal activities and marked as “Hack/Phishing”. Using the Etherscan API it was possible to download information about all transactions in which a given wallet participated. Some of the wallets tagged as fraudulent had no transactions at all or were involved mostly in ERC20 token trading; they were not included in the dataset. After this correction we analyzed 2 200 wallets marked as involved in illegal activity. In addition to the data on fraudulent wallets, we also collected information about the transactions of 349 999 wallets randomly selected from the 65 564 460 wallets existing in the Ethereum network (as of 28 May 2019). They were not marked as suspicious and were considered non-fraudulent.
|Variable name|Variable description|
|---|---|
|IT|number of incoming transactions|
|OT|number of outgoing transactions|
|UIT|number of unique incoming transactions|
|UOT|number of unique outgoing transactions|
|AVIT|average value of an incoming transaction|
|AVOT|average value of an outgoing transaction|
|VIT|total value of all incoming transactions|
|VOT|total value of all outgoing transactions|
|ATIT|average time between incoming transactions|
|ATOT|average time between outgoing transactions|
|AGP|average gas price|
|AGL|average gas limit|
|DUR|active duration (time in days from the first to the last transaction)|
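The aggregation of raw transactions into the 13 variables can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `account_features` is hypothetical, the field names (`from`, `to`, `value`, `timeStamp`, `gasPrice`, `gas`) are assumed to follow the Etherscan transaction-list format with values given in wei, and the readings of "unique" and of the gas averages are assumptions marked in the comments.

```python
from statistics import mean

WEI_PER_ETHER = 10**18
SECONDS_PER_DAY = 86_400


def _mean_gap(timestamps):
    """Mean gap in seconds between consecutive timestamps (0.0 if fewer than two)."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return mean(gaps) if gaps else 0.0


def _ether(txs):
    """Transaction values converted from wei to ether."""
    return [int(t["value"]) / WEI_PER_ETHER for t in txs]


def account_features(address, txs):
    """Aggregate one account's transactions into the 13 variables of Table 1."""
    address = address.lower()
    incoming = [t for t in txs if t["to"].lower() == address]
    outgoing = [t for t in txs if t["from"].lower() == address]
    times = [int(t["timeStamp"]) for t in txs]
    vit, vot = sum(_ether(incoming)), sum(_ether(outgoing))
    return {
        "IT": len(incoming),
        "OT": len(outgoing),
        # Assumption: "unique" is read as the number of distinct counterparties.
        "UIT": len({t["from"].lower() for t in incoming}),
        "UOT": len({t["to"].lower() for t in outgoing}),
        "AVIT": vit / len(incoming) if incoming else 0.0,
        "AVOT": vot / len(outgoing) if outgoing else 0.0,
        "VIT": vit,
        "VOT": vot,
        "ATIT": _mean_gap(int(t["timeStamp"]) for t in incoming),
        "ATOT": _mean_gap(int(t["timeStamp"]) for t in outgoing),
        # Assumption: gas statistics are averaged over all of the account's transactions.
        "AGP": mean(int(t["gasPrice"]) for t in txs) if txs else 0.0,
        "AGL": mean(int(t["gas"]) for t in txs) if txs else 0.0,
        "DUR": (max(times) - min(times)) / SECONDS_PER_DAY if times else 0.0,
    }
```

Accounts with fewer than two transactions in a direction get a time gap of 0.0 here; another default (for example a missing value) may be more appropriate depending on the classifier.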
The dataset was divided into two parts: a training set with 281 760 samples and a validation set with 70 439 samples.
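As a quick arithmetic check, the reported set sizes are consistent with an 80/20 split performed separately for each class, with the training share rounded up. The text does not state how the split was made, so the stratification and the rounding rule below are assumptions.

```python
def ceil_fraction(n, num, den):
    """Return ceil(n * num / den) using integer arithmetic only."""
    return -(-n * num // den)


FRAUD, NON_FRAUD = 2_200, 349_999

# Assumed: 80% of each class goes to the training set, rounded up per class.
train_size = ceil_fraction(FRAUD, 4, 5) + ceil_fraction(NON_FRAUD, 4, 5)
validation_size = FRAUD + NON_FRAUD - train_size

print(train_size, validation_size)  # 281760 70439
```

Under this reading, the training set would contain 1 760 fraudulent wallets and the validation set 440.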
3.2 Experiment setup
The prediction problem is a classic example of binary classification. We examined the following classifiers: Random Forest, Support Vector Machine and XGBoost, in order to determine how accurately they can make predictions for the given dataset. Figure 1 presents the data and system architecture for the conducted experiment. As a first step we downloaded data using the Etherscan API; the data was then aggregated into the 13 variables presented in Table 1. In the next step, using grid search with 10-fold cross-validation, we searched for the set of parameters giving the best results for each of the three supervised learning algorithms we chose.
The data gathered from Etherscan did not allow us to determine accurately the moment at which a particular account was marked as fraudulent. It is therefore possible that some of the aggregates we use for training are biased, because the data used to compute them was gathered after the account had been marked as fraudulent. It is also possible that some of the transactions are a result of the public exposure of an account. This would not be a problem if we were only interested in devising a method for simple classification of accounts. However, if we want to use the proposed method as an early warning system, we have to take the moment of exposure into consideration. We address this issue by analyzing performance after removing the most important explanatory variables. As the final step we performed a validation check on the part of the dataset that was not used for training. The results of this step are reported in the following sections.
3.3 Prediction models
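The Support Vector Machine (SVM) classifier is a binary classification algorithm that looks for an optimal hyperplane as a decision function in a high-dimensional space. Given a training dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^{m}$ are the training examples and $y_i \in \{-1, +1\}$ are the class labels, we first map $x$ into a higher-dimensional space via a function $\phi$ and then compute a decision function of the form:

$$f(x) = w \cdot \phi(x) + b \tag{1}$$

by maximizing the distance between the set of points $\phi(x_i)$ and the hyperplane parameterized by $(w, b)$. The class label of $x$ is given by the sign of $f(x)$. The optimization problem for the SVM classifier with penalized misclassified examples can be written as:

$$\min_{w,\, b,\, \xi} \; \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \tag{2}$$

with the slack variables $\xi_i$ defined such that:

$$y_i \left( w \cdot \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \ldots, n$$

By solving the Lagrangian dual of problem (2), we obtain the simplified problem:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

where $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is the kernel function.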
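Once the dual problem is solved, the decision function can be evaluated through kernel values alone, as $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$; this is the standard kernel expansion and is not spelled out in the text. The sketch below evaluates it with the radial basis function kernel used later in Section 4.2. The function names and the toy multipliers are hypothetical, and no solver for the dual problem is included.

```python
import math


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svm_decision(x, support_vectors, labels, alphas, b, gamma):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(
        alpha * y * rbf_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + b


def svm_predict(x, support_vectors, labels, alphas, b, gamma):
    """Class label in {-1, +1}, given by the sign of the decision function."""
    return 1 if svm_decision(x, support_vectors, labels, alphas, b, gamma) >= 0 else -1


# Toy model: one support vector per class, with made-up multipliers.
SUPPORT_VECTORS = [(0.0, 0.0), (2.0, 0.0)]
LABELS = [1, -1]
ALPHAS = [1.0, 1.0]
```

Only training points with non-zero $\alpha_i$ (the support vectors) contribute to the sum, which is why they are the only ones that need to be stored.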
Random Forest is a classifier consisting of a collection of tree-structured classifiers $\{h(x, \Theta_k),\ k = 1, \ldots\}$, where the $\{\Theta_k\}$ are independent, identically distributed random vectors and each tree casts a unit vote for the most popular class at input $x$.
For each tree in the random forest, a new training set is generated by drawing with replacement from the original training set. The tree is grown on the new training set using random feature selection at each node. The resulting trees are not pruned.
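The three ingredients described above (bootstrap sampling, random feature selection at each node, and unit votes) can be sketched in isolation. This is only an illustration with hypothetical function names; growing the trees themselves is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement from the original training set."""
    return [rng.choice(data) for _ in data]


def candidate_features(features, mtry, rng):
    """Pick mtry distinct features to consider as split candidates at one node."""
    return rng.sample(features, mtry)


def majority_vote(votes):
    """Each tree casts a unit vote; the most frequent class wins."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = bootstrap_sample(list(range(10)), rng)
split_candidates = candidate_features(["IT", "OT", "VIT", "VOT", "ATIT"], 2, rng)
```

Because sampling is done with replacement, some rows appear several times in a bootstrap sample and others not at all, which is what makes the individual trees differ.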
XGBoost is a scalable machine learning system for tree boosting proposed by Chen and Guestrin. Its impact has been widely recognized in a number of machine learning and data mining challenges. For example, among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost.
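Consider a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^{m}$, where $x_i$ are the training examples, $y_i$ are the class labels and $m$ is the number of features. The output of the model is voted or averaged by a collection of $K$ regression trees:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

Each regression tree $f_k$ contains a continuous score on each of its leaves ($w_j$ represents the score on the $j$-th leaf). To learn the set of functions used in the model, the following objective needs to be minimized:

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

Here $l$ is the training loss function, which measures how well the model fits the training data. The second term, $\Omega$, penalizes the complexity of the model and is defined as:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}$$

where $\gamma$ is the complexity of each leaf, $T$ is the number of leaves in a decision tree and $\lambda$ is a parameter that scales the penalty. If we apply the second-order Taylor expansion to the loss function and remove the constant terms, we obtain the objective at the $t$-th iteration in the form of:

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) \tag{11}$$

where $g_i = \partial_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})$ are respectively the first and second derivatives of the loss function.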
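To make the first and second derivatives concrete, the sketch below computes them for the logistic loss and derives the leaf score that minimizes the second-order objective for a fixed tree structure, $w^{*} = -\sum g_i / (\sum h_i + \lambda)$, a result from the XGBoost paper. The text does not name the loss that was used, so the logistic loss with labels in {0, 1} is an assumption, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_hess(y_true, raw_score):
    """First and second derivative of the logistic loss w.r.t. the raw score.

    Assumes y_true is 0 or 1 and raw_score is the model output before the sigmoid.
    """
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)


def optimal_leaf_weight(grads, hessians, lam):
    """Leaf score minimizing the second-order objective: -G / (H + lambda)."""
    return -sum(grads) / (sum(hessians) + lam)


# Two positive examples that the current model scores at 0.0 (probability 0.5).
pairs = [logistic_grad_hess(1, 0.0), logistic_grad_hess(1, 0.0)]
weight = optimal_leaf_weight([g for g, _ in pairs], [h for _, h in pairs], lam=1.0)
```

In this toy case the leaf receives a positive score, pushing both examples toward the positive class, and a larger $\lambda$ shrinks that score toward zero.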
4 Empirical results
Our objective was to find a prediction model that could be used as a real-world fraud detection system. Due to the high class imbalance we decided to base our assessment of each algorithm on the recall and precision statistics. For different parameter configurations we obtained results with either high recall and low precision or low recall and high precision. The former has the obvious advantage of capturing most of the frauds present in the dataset. On the other hand, it is impractical for real-world applications in which every alert has to be analyzed manually by a human.
As we included almost all of the fraudulent accounts and only a small sample of the non-fraudulent ones, the probability of a random account being fraudulent was significantly higher in our dataset than in the real world. Because of that, we could not rely on the precision statistic, which is sensitive to this distortion. Instead of using precision to estimate the cost of verifying false alarms, we decided to use the false positive rate (FPR). It fits our purpose since it does not depend on the total number of frauds in the dataset.
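The dependence of precision on the share of frauds can be illustrated with a short calculation. The sketch takes the recall (23.67%) and false positive rate (0.02%) reported for random forest Conf. 3 in Section 5 and evaluates precision at two prevalences: the one in the sampled dataset and a much lower one obtained by assuming, purely for illustration, that the 2 200 flagged wallets are the only fraudulent ones among all 65 564 460 wallets.

```python
def precision_from_rates(recall, fpr, prevalence):
    """Precision implied by recall, false positive rate and the share of frauds."""
    true_positives = recall * prevalence
    false_positives = fpr * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


RECALL, FPR = 0.2367, 0.0002  # random forest Conf. 3, as reported in Section 5

sample_prevalence = 2_200 / (2_200 + 349_999)
network_prevalence = 2_200 / 65_564_460  # illustrative assumption, see above

precision_in_sample = precision_from_rates(RECALL, FPR, sample_prevalence)
precision_in_network = precision_from_rates(RECALL, FPR, network_prevalence)
```

With the same classifier, precision falls from roughly 0.88 in the sample to below 0.05 under the illustrative network-wide prevalence, while recall and FPR stay unchanged by construction.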
4.1 Random forest results
For random forest we decided to tune the number of variables randomly sampled as candidates at each split (mtry), the minimum size of terminal nodes (min.node.size) and the cut-off probability, i.e. the probability above which a sample is predicted as non-fraud.
As we can see in Table 2, the threshold that determines the final predicted class has the biggest impact on the results. A larger threshold causes fewer samples to be classified as non-fraud, which increases recall but at the same time increases the FPR, which we would like to keep low.
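The effect of the cut-off can be reproduced on a toy example. In the sketch below a sample is predicted as non-fraud only when its predicted non-fraud probability exceeds the cut-off; the seven probabilities and labels are made up for illustration and are not taken from the dataset.

```python
def rates_at_cutoff(p_non_fraud, is_fraud, cutoff):
    """Return (recall, FPR) when non-fraud is predicted only if P(non-fraud) > cutoff."""
    tp = fp = fn = tn = 0
    for p, fraud in zip(p_non_fraud, is_fraud):
        predicted_fraud = not (p > cutoff)
        if fraud and predicted_fraud:
            tp += 1
        elif fraud:
            fn += 1
        elif predicted_fraud:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Made-up scores: four non-fraudulent accounts followed by three fraudulent ones.
P_NON_FRAUD = [0.95, 0.90, 0.80, 0.60, 0.70, 0.30, 0.10]
IS_FRAUD = [False, False, False, False, True, True, True]

low_cutoff = rates_at_cutoff(P_NON_FRAUD, IS_FRAUD, 0.50)   # recall 2/3, FPR 0.0
high_cutoff = rates_at_cutoff(P_NON_FRAUD, IS_FRAUD, 0.75)  # recall 1.0, FPR 0.25
```

Raising the cut-off from 0.50 to 0.75 catches the remaining fraud but also flags one legitimate account, which is the trade-off between recall and FPR described above.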
Instead of choosing one configuration as a trade-off between recall and false positive rate, we decided to distinguish between classifiers that find as many actual fraudulent accounts as possible (maximizing recall) and classifiers that make as few mistakes as possible when predicting the fraud class (minimizing the false positive rate). The validation results presented in Table 2 are similar to the ones we obtained with cross-validation and confirm that the best configurations are Conf. 3 in terms of FPR and Conf. 19 in terms of recall. For the chosen random forest configurations we created confusion matrices (presented in Tables 3 and 4), which help to better analyze the performance of this classifier on a highly imbalanced dataset.
|Configuration Value||Cross-validation results [%]|
4.2 Support Vector Machine results
For training the Support Vector Machines we chose the radial basis function as the kernel, and we additionally increased the cost of misclassifying samples to better address the class imbalance in the dataset. The tuned parameters were the cost of constraint violation (cost) and the kernel parameter gamma. As shown in Table 5, SVM achieved high recall, but with quite low precision, for almost all configurations. If we consider only recall, Conf. 1 was better than random forest's Conf. 19, but at the cost of a significantly higher false positive rate. In fact, no set of parameters achieved a false positive rate lower than 10%. If we also had to choose the configuration with the lowest FPR, Conf. 20 would be the best candidate.
|Configuration Value||Cross-validation results [%]|
4.3 XGBoost results
In the case of XGBoost we analyzed the following hyperparameters in different configurations: the maximum depth of a tree (max.depth), the minimum sum of instance weight needed in a child (min.child.weight), the subsample ratio of columns when constructing each tree (colsample) and, as for random forest, the cut-off probability. As for the training itself, we set the maximum number of iterations to 2000 and the learning rate to 0.1, with early stopping if the error does not decrease in 100 consecutive iterations.
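The stopping rule described above can be sketched generically. The function below only illustrates the rule as stated (stop once the error has not decreased for a given number of consecutive iterations, or at the iteration cap); it is not XGBoost's own implementation, whose details may differ, and the function name is hypothetical.

```python
def stopping_iteration(errors, patience=100, max_iter=2000):
    """Return the 1-based iteration at which training would stop.

    errors: evaluation error recorded after each boosting iteration.
    """
    best = float("inf")
    since_improvement = 0
    for iteration, error in enumerate(errors[:max_iter], start=1):
        if error < best:
            best = error
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                return iteration
    return min(len(errors), max_iter)


# The error improves for three iterations and then stays flat.
flat_after_three = [1.0, 0.9, 0.8] + [0.8] * 10
```

With `patience=3`, the flat sequence above stops at iteration 6, three iterations after the last improvement.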
Even though we built classifiers for 240 combinations of hyperparameters, we decided to present only the 20 most interesting ones. In Table 6, Conf. 1 to Conf. 10 have the smallest false positive rates and the other 10 configurations have significantly larger recall. Looking at the classification results, we can draw a conclusion similar to the one for random forest: the cut-off probability is the most important parameter for the outcome. After examining the other parameters we were not able to clearly describe their exact impact on the results. As shown in Table 6, the validation results confirmed that Conf. 1 and Conf. 16 are the best in their categories, but slightly worse than the best two random forest configurations.
|Configuration Value||Cross-validation results [%]|
4.4 Sensitivity analysis
The decision to conduct a sensitivity analysis was motivated by our inability to determine the exact moment at which a particular account was marked as fraudulent. As a result, the aggregated transaction data might be contaminated with transactions that happened after an alert had been raised on Etherscan for that account. This may lead to look-ahead bias, since we would be using data that was unknown at the moment the fraudulent account was detected. We therefore investigated how excluding the most important, and potentially biased, variables affects the quality of the classifiers.
The importance of individual variables is not as easily determined for SVM as for random forest or XGBoost. Furthermore, none of the SVM results was as satisfactory (in terms of recall) as the best random forest or XGBoost results. These two observations led us to omit SVM from the sensitivity analysis.
The importance of the explanatory variables was calculated separately for each of the best configurations and is presented in Figure 2.
For random forests, variable importance (sometimes called “Gini importance”) is defined as the total decrease in node impurity, weighted by the probability of reaching that node, averaged over all trees in the forest. Impurity is defined as:

$$I_G = \sum_{i=1}^{C} p(i)\,\bigl(1 - p(i)\bigr)$$

with $C$ being the number of classes and $p(i)$ being the probability of picking a data point with class $i$.
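A minimal sketch of the Gini impurity and of the weighted impurity decrease produced by a single split is given below; summing such decreases per variable over all nodes and trees gives the importance described above. The function names are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Sum over classes of p(i) * (1 - p(i)) for the labels in one node."""
    n = len(labels)
    return sum((count / n) * (1 - count / n) for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return (
        gini_impurity(parent)
        - len(left) / n * gini_impurity(left)
        - len(right) / n * gini_impurity(right)
    )
```

For a perfectly mixed two-class node such as `[0, 0, 1, 1]` the impurity is 0.5, and a split that separates the classes completely removes all of it.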
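In the case of XGBoost, relative variable importance is measured as the Gain, which is the contribution of the corresponding feature to the model, calculated by taking each feature's contribution for each tree in the model. If we define $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ (based on Equation 11), where $I_j$ is the set of indices of the data points assigned to the $j$-th leaf, we can express the Gain of splitting a leaf into a left leaf $L$ and a right leaf $R$ as:

$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma$$

This formula can be decomposed into 1) the score on the new left leaf, 2) the score on the new right leaf, 3) the score on the original leaf and 4) the regularization on the additional leaf.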
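The four-part decomposition translates directly into code. The sketch below follows the split gain from the XGBoost paper, with each leaf scored as $G^2 / (H + \lambda)$; the function names and the numbers in the test values are hypothetical.

```python
def leaf_score(grad_sum, hess_sum, lam):
    """Score of one leaf: G^2 / (H + lambda)."""
    return grad_sum * grad_sum / (hess_sum + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain of splitting one leaf into a left and a right leaf."""
    left = leaf_score(g_left, h_left, lam)            # 1) new left leaf
    right = leaf_score(g_right, h_right, lam)         # 2) new right leaf
    original = leaf_score(g_left + g_right, h_left + h_right, lam)  # 3) original leaf
    return 0.5 * (left + right - original) - gamma    # 4) penalty for the extra leaf
```

A split is only worthwhile when the gain is positive, so a larger $\gamma$ makes the trees more conservative.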
As we can see in Fig. 2, the most important variables for each classifier are usually connected with incoming transactions and the least important with outgoing transactions. The only variable that ranks first or second in importance for all three classifiers is the average time between incoming transactions. For XGBoost we decided to apply a minor change to the chosen configurations: instead of stopping after the error had not decreased in 100 consecutive iterations, XGBoost ran all 2000 iterations regardless of the results.
Validation results [%]

|Configuration|Specificity|Recall|Precision|FPR|F1|
|---|---|---|---|---|---|
|Conf. 3 ( = 2)|99.98|15.55|81.71|0.02|26.12|
|Conf. 3 ( = 4)|99.98|14.62|84|0.02|24.90|
|Conf. 3 ( = 8)|99.98|7.66|71.74|0.02|13.84|
|Conf. 19 ( = 2)|89.52|82.37|4.62|10.48|8.74|
|Conf. 19 ( = 4)|89.38|81.67|4.52|10.62|8.57|
|Conf. 19 ( = 8)|88.66|68.91|3.60|11.34|6.86|
Validation results [%]

|Configuration|Specificity|Recall|Precision|FPR|F1|
|---|---|---|---|---|---|
|Conf. 1 ( = 2)|99.95|26.68|75.16|0.05|39.38|
|Conf. 1 ( = 4)|99.95|17.63|68.46|0.05|28.04|
|Conf. 1 ( = 8)|99.98|2.78|54.55|0.02|5.30|
|Conf. 16 ( = 2)|92.66|76.33|6.02|7.34|11.15|
|Conf. 16 ( = 4)|90.69|71.46|4.51|9.31|8.49|
|Conf. 16 ( = 8)|87.03|62.41|2.88|12.97|5.50|
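The five numeric columns of the two tables above can be cross-checked against each other. Reading them as specificity, recall, precision, false positive rate and F1 score is an interpretation of the values rather than something the tables state; under that reading, the first and fourth columns should sum to 100 and the last column should equal the harmonic mean of the second and third.

```python
# (specificity, recall, precision, FPR, F1) in percent, copied from the two tables above.
ROWS = [
    (99.98, 15.55, 81.71, 0.02, 26.12),
    (99.98, 14.62, 84.00, 0.02, 24.90),
    (99.98, 7.66, 71.74, 0.02, 13.84),
    (89.52, 82.37, 4.62, 10.48, 8.74),
    (89.38, 81.67, 4.52, 10.62, 8.57),
    (88.66, 68.91, 3.60, 11.34, 6.86),
    (99.95, 26.68, 75.16, 0.05, 39.38),
    (99.95, 17.63, 68.46, 0.05, 28.04),
    (99.98, 2.78, 54.55, 0.02, 5.30),
    (92.66, 76.33, 6.02, 7.34, 11.15),
    (90.69, 71.46, 4.51, 9.31, 8.49),
    (87.03, 62.41, 2.88, 12.97, 5.50),
]


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def max_deviation(rows):
    """Largest gaps between reported and recomputed values, as (F1 gap, specificity gap)."""
    f1_gap = max(abs(f1(p, r) - reported) for _, r, p, _, reported in rows)
    spec_gap = max(abs(spec + fpr - 100.0) for spec, _, _, fpr, _ in rows)
    return f1_gap, spec_gap
```

All twelve rows agree with this reading to within a few hundredths of a percentage point, which is the precision to which the inputs are rounded.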
5 Conclusions and future work
Due to the significant developments in blockchain technology, dedicated fraud prevention systems are an important area of research. We proposed a machine-learning-based method for predicting whether a particular account on the Ethereum blockchain might be fraudulent.
Three different classifiers were analyzed. Of these, Random Forest obtained the best results in terms of recall and of false positive rate (in separate configurations), while keeping the other statistics at a reasonable level. In one of its configurations SVM had the best recall on the validation set, but at the same time its false positive rate was three times worse.
The best recall for Random Forest was 84.92%. This does not justify using the model in any real-world anti-fraud system, because the classifier makes a significant number of type I errors: almost 10% of all accounts would raise an alert.
Configuration 3 for Random Forest, which achieved a false positive rate of 0.02%, was still able to detect 23.67% of all frauds. This result can be seen as a good candidate for an automated anti-fraud system. If we deployed such a system on a cryptocurrency exchange or within a cryptocurrency wallet, we would wrongly mark as fraudulent about one in five thousand accounts.
As for future work, we would like to obtain data from exchanges that will help determine whether the proposed method can be applied in its current form or needs further enhancement.
The sensitivity analysis showed that the proposed models are not overly sensitive to particular explanatory variables. Nevertheless, one direction for future research is estimating the exact moment at which a particular account was marked as fraudulent. We would then no longer risk our training set being affected by look-ahead bias.
- (2016) Fraud detection system: a survey. Journal of Network and Computer Applications 68, pp. 90–113.
- (2016) Financial frauds: data mining based detection – a comprehensive survey. International Journal of Computer Applications 156 (10).
- A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152.
- (2001) Random forests. Machine Learning 45 (1), pp. 5–32.
- (2014) A next-generation smart contract and decentralized application platform. White paper.
- (2017) A data mining based system for credit-card fraud detection in e-tail. Decision Support Systems 95, pp. 91–101.
- Chen and Guestrin (2016) XGBoost: a scalable tree boosting system. CoRR abs/1603.02754.
- (2004) Survey of fraud detection techniques. In IEEE International Conference on Networking, Sensing and Control, 2004, Vol. 2, pp. 749–754.
- Pham and Lee (2016) Anomaly detection in bitcoin network using unsupervised learning methods. CoRR abs/1611.03941.
- Quah and Sriganesh (2008) Real-time credit card fraud detection using computational intelligence. Expert Systems with Applications 35 (4), pp. 1721–1732.
- (2014) Ethereum: a secure decentralised generalised transaction ledger. Ethereum project yellow paper 151, pp. 1–32.
- (2016) The bitcoin ecosystem: disruption beyond financial services?