1 Introduction
The past decade has witnessed explosive growth of graph data, and the analysis of large-scale networks has attracted increasing attention from both academia and industry [Volpp2006]. However, although financial transaction networks exist widely in the real world, there are relatively few analytical studies of them, because transaction data are usually kept private for reasons of security and commercial interest. Fortunately, the recent emergence of blockchain technology makes transaction data mining more feasible and reliable. Generally speaking, blockchain is an open, distributed ledger technology managed by a peer-to-peer network through a special consensus mechanism, and all transaction records on the blockchain are publicly accessible [Swan2015]. The open nature of blockchain data provides researchers with unprecedented opportunities for data mining in this area [Tasca et al.2018, Feder et al.2018, Atzei et al.2017, Möser et al.2013].
Being the largest public blockchain-based platform that supports smart contracts, Ethereum [Wood2014] has attracted wide attention, and its market capitalization has reached 20 billion USD [Chen et al.2018]. To facilitate the implementation of smart contracts, Ethereum introduces the concept of an account, which is formally an address but adds storage space for recording account balances, transactions, code, etc. (An Ethereum address consists of the prefix "0x", a common identifier for hexadecimal, followed by the rightmost 20 bytes of the Keccak-256 hash of the public key; one example is "0x00b2ed34791c97206943314ee9cbd9530762a320".) The corresponding cryptocurrency on Ethereum, known as Ether, can be transferred between accounts and is used to compensate participating mining nodes. Since its debut in 2014, Ethereum has accumulated a large number of user transaction records. Using these records, [Chen et al.2018] conducted the first systematic study to characterize Ethereum and obtained new observations via traditional network analysis. Unlike other large-scale complex networks, the Ethereum transaction network, where each edge represents a particular Ether transaction, contains unique information such as the direction, amount, and timestamp of each transaction. It is essential to incorporate such information for accurate modeling, characterization, and understanding of transaction network data. In addition, multiple transactions between two users are to be expected, so it is more comprehensive to model a transaction network as a multidigraph rather than a simple graph. (In graph theory, a multigraph, in contrast to a simple graph, is a graph that is permitted to have self-loops and multiple edges, also called parallel edges. A multidigraph is a directed multigraph.)
Therefore, in this work, we model the Ethereum transaction network as a Temporal Weighted Multidigraph, where a node is a unique address and an edge represents a transaction that is weighted by its amount and assigned a timestamp.
In recent years, researchers have extensively investigated a variety of machine learning applications on large-scale complex networks, and the performance of these machine learning tasks depends heavily on the choice of data representation. Graph embedding is an effective method for representing node features in a low-dimensional space for network analysis and downstream machine learning tasks [Cai et al.2018]. Among various graph embedding methods, a series of random walk based approaches have been proposed to learn a mapping function from an original graph to a low-dimensional vector space by maximizing the likelihood of co-occurrence of neighboring nodes [Perozzi et al.2014, Grover and Leskovec2016]. Inspired by the algorithm that [Mikolov et al.2013a] proposed for natural language processing, these random walk based embedding methods are especially useful when the network is too large to be measured entirely [Goyal and Ferrara2018]. Recently, to better extract temporal information from dynamic networks, [Nguyen et al.2018] proposed a general framework called Continuous-Time Dynamic Network Embeddings (CTDNE) to incorporate temporal dependencies into existing random walk based network embedding models.
Considering the rules and characteristics of real transaction networks such as Ethereum, the challenges of transaction network embedding are as follows: (1) Transaction networks evolve continuously over time as links are added, which is overlooked by most existing graph embedding algorithms; (2) A connection between accounts is not a one-off established relationship but a time-dependent event, hence multiple edges need to be considered in transaction network embedding; (3) Unlike in social networks, random walks on the Ethereum transaction network have a concrete meaning: they represent money transfer flows in the real world; (4) The amount of a transaction reflects, to some extent, the similarity between two accounts. In most cases, the larger the transaction amount, the closer the relationship between the two accounts. Figure 1 is a microcosm of transaction activities on Ethereum.
To this end, we propose a novel framework named Temporal WEighted MultiDiGraph Embedding (TEDGE), which aims to capture the non-negligible temporal properties and important money-transfer tendencies of time-sensitive transaction networks. For the transaction networks discussed here, existing methods that ignore temporal information may sample a large number of invalid transaction sequences when deriving node embeddings. For example, in Figure 1, a traditional method may sample a random walk sequence that is infeasible in a temporal graph, because one of its transactions happens earlier than the transaction that precedes it in the walk. CTDNE [Nguyen et al.2018] does consider temporal information, but it neglects the existence of multiple edges between a pair of nodes. According to CTDNE, a temporal walk is represented only as a sequence of nodes. However, in Figure 1, whether a given node is a feasible next step depends on whether transaction path ① or ③ was sampled by the previous step of the walk.
In this work, we represent a temporal walk of a given length as a sequence of nodes together with a sequence of edges traversed in non-decreasing timestamp order. This kind of temporal walk represents an actually feasible path for money flow in the transaction network. Therefore, the proposed method is expected to learn more meaningful and accurate time-dependent node embeddings that capture more comprehensive properties of dynamic transaction networks.
The main contributions of our paper are as follows:

To the best of our knowledge, this is the first work to understand Ethereum transaction records via graph embedding. In particular, we consider two important and practical machine learning tasks, namely link prediction and node classification.

We refine the definition of a temporal walk for transaction networks by considering both temporal dependencies and the multiplicity of edges. Such random walk sequences carry the practical meaning of money flow in transaction networks.

We propose a novel graph embedding method called Temporal Weighted Multidigraph Embedding (TEDGE) which incorporates transaction information from both time and amount domains, and experiments on realistic Ethereum data demonstrate its superiority over existing methods.
2 Framework
Figure 2 demonstrates the four main steps of the proposed framework for Ethereum transaction network analysis, including data collection, network construction, graph embedding and downstream applications. The parts of network construction and graph embedding are described in the rest of this section, and the parts of data collection and applications will be explained later in Section 3.
2.1 Network Construction
Ether transfer is one of the major activities on Ethereum. Here we abstract an Ether transfer transaction as a four-tuple (src, dst, w, t), which means that the sender src transfers w Ether to the recipient dst at time t. To investigate Ether transfers on Ethereum, we abstract the Ethereum transaction network as a Temporal Weighted Multidigraph:
Definition 1 (Temporal Weighted Multidigraph (TWMDG)).
Given a graph $G = (V, E)$, let $V$ be the set of nodes and $E$ be the set of edges. Each edge is unique and is represented as $e = (u, v, w, t)$, where $u$ is the source node, $v$ is the target node, $w$ is the weight value and $t$ is the timestamp. For the sake of simplicity, we define mapping functions $Src(e) = u$, $Dst(e) = v$, $W(e) = w$, $T(e) = t$ for $e \in E$.
Based on the four-tuples collected from Ethereum transaction records, we can build a Temporal Weighted Multidigraph, where each node represents a unique account and each edge represents a unique Ether transfer transaction.
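As a minimal sketch of this construction, the following Python snippet stores four-tuples in a multidigraph that keeps parallel edges. The class name `TWMDG`, the `Edge` tuple, and the per-node outgoing edge lists are illustrative assumptions, not the data structure used in our implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    """One Ether transfer: src sends w Ether to dst at time t."""
    src: str
    dst: str
    w: float
    t: int


class TWMDG:
    """Temporal weighted multidigraph stored as per-node outgoing edge lists."""

    def __init__(self, edges):
        self.edges = list(edges)            # parallel edges are kept, not merged
        self.nodes = set()
        self.out_edges = defaultdict(list)
        for e in self.edges:
            self.nodes.update((e.src, e.dst))
            self.out_edges[e.src].append(e)
        # Sorting each list by timestamp simplifies later temporal filtering.
        for lst in self.out_edges.values():
            lst.sort(key=lambda e: e.t)


g = TWMDG([
    Edge("a", "b", 1.5, 10),
    Edge("a", "b", 0.2, 30),   # a second transfer between the same pair
    Edge("b", "c", 3.0, 20),
])
print(len(g.nodes), len(g.edges), len(g.out_edges["a"]))
```

Keeping every transaction as its own edge is what distinguishes this structure from a simple weighted digraph, in which the two transfers from "a" to "b" would be collapsed into one.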
2.2 Temporal Weighted Multidigraph Embedding
We now define the problem of Temporal WEighted MultiDiGraph Embedding (TEDGE) as follows: Given a temporal weighted multidigraph $G = (V, E)$, our principal goal is to learn an embedding function $\Phi: V \to \mathbb{R}^{d}$ ($d \ll |V|$) which preserves the original network information, including node similarity as well as the temporal and weighting properties specific to financial transaction networks, thus enhancing predictive performance on downstream machine learning tasks. The proposed method aims to learn more appropriate and meaningful dynamic node representations using a general embedding framework consisting of two main parts. The first part is a random walk generator, which samples a set of walks under the temporal constraint with flexible biased strategies; the second part is an update procedure based on Skip-Gram [Mikolov et al.2013a, Mikolov et al.2013b], which learns node embeddings by solving a maximum likelihood optimization problem.
The random walk mechanism has been widely shown to be an effective technique for measuring local similarity in networks across a variety of domains [Spitzer2013]. For the temporal weighted multidigraph discussed here, we define the concept of a Temporal Walk as follows:
Definition 2 (Temporal Walk).
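In a TWMDG, a temporal walk from node $v_1$ to node $v_{l+1}$ is a length-$l$ path traversed in non-decreasing timestamp order. Such a temporal walk is represented as a sequence of nodes $(v_1, v_2, \dots, v_{l+1})$ together with a sequence of edges $(e_1, e_2, \dots, e_l)$, where $Src(e_i) = v_i$ and $Dst(e_i) = v_{i+1}$ for $1 \le i \le l$, and $T(e_i) \le T(e_{i+1})$ for $1 \le i \le l-1$. Here $Src(e)$, $Dst(e)$ and $T(e)$ denote the source node, target node and timestamp of edge $e$. We say that nodes $u$ and $v$ are temporally connected if there exists a temporal path from $u$ to $v$.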
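The two conditions in this definition (edges connect consecutive nodes, timestamps never decrease) can be checked mechanically. The helper below is a small sketch; the function name `is_temporal_walk` and the `Edge` tuple are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple, Sequence


class Edge(NamedTuple):
    src: str
    dst: str
    w: float
    t: int


def is_temporal_walk(nodes: Sequence[str], edges: Sequence[Edge]) -> bool:
    """Return True if edges link consecutive nodes and timestamps never decrease."""
    if len(nodes) != len(edges) + 1:
        return False
    for i, e in enumerate(edges):
        if e.src != nodes[i] or e.dst != nodes[i + 1]:
            return False
    return all(a.t <= b.t for a, b in zip(edges, edges[1:]))


valid = is_temporal_walk(
    ["a", "b", "c"], [Edge("a", "b", 1.0, 10), Edge("b", "c", 2.0, 10)]
)
invalid = is_temporal_walk(
    ["a", "b", "c"], [Edge("a", "b", 1.0, 10), Edge("b", "c", 2.0, 5)]
)
print(valid, invalid)
```

The second walk is rejected because the money would have to leave "b" before it arrived there.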
In order to sample valid random walks which obey the temporal constraint, we introduce a new concept called Temporal Successive Edges in TWMDG.
Definition 3 (Temporal Successive Edges).
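Given a temporal weighted multidigraph $G = (V, E)$, the temporal successive edges of a node $u$ at time $t$ are defined as follows:
$$S_t(u) = \{\, e \in E \mid Src(e) = u,\ T(e) \ge t \,\},$$
where $Src(e)$ and $T(e)$ denote the source node and the timestamp of edge $e$.
For instance, in Figure 1, fixing a node and a time determines this set as the outgoing transactions of that node that occur no earlier than that time. The set of temporal successive edges serves as the candidate set from which a walker selects its next edge.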
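A minimal sketch of this candidate set is given below. Edges are plain `(src, dst, w, t)` tuples, `out_edges` is a hypothetical dictionary from a node to its outgoing edges, and the use of "greater than or equal" follows from the non-decreasing timestamp constraint.

```python
def temporal_successive_edges(out_edges, u, t):
    """Outgoing edges of u whose timestamp is not earlier than t."""
    return [e for e in out_edges.get(u, []) if e[3] >= t]


out_edges = {
    "b": [("b", "c", 2.0, 5), ("b", "d", 1.0, 20), ("b", "c", 0.5, 30)],
}
candidates = temporal_successive_edges(out_edges, "b", 10)
print(candidates)
```

The edge with timestamp 5 is excluded, while both later edges remain candidates, including the second parallel edge from "b" to "c".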
Apart from the temporal constraint, we further develop biased search strategies by considering more detailed transaction information. For the Ethereum transaction network discussed here, we abstract the transaction time and amount as the temporal and weight information of a TWMDG. Consider a random walk that has just traversed edge $e_{i-1}$ and is now at node $u = Dst(e_{i-1})$ at time $t = T(e_{i-1})$. The next node of the random walk is decided by selecting a temporally valid edge $e \in S_t(u)$, where $S_t(u)$ is the set of temporal successive edges of $u$ at time $t$. We describe different sampling biases by formulating the selection probability of each temporal successive edge $e \in S_t(u)$.
From the perspective of the temporal domain, we consider both unbiased and biased sampling strategies as follows.

Temporal Unbiased Sampling (TUS). This is the default setting in the time domain, which assumes that each temporal successive edge of node $u$ at time $t$ has the same probability of being selected:
$$P_{TUS}(e) = \frac{1}{|S_t(u)|}, \qquad (1)$$
where $S_t(u)$ denotes the set of temporal successive edges of $u$ at time $t$.

Temporal Biased Sampling (TBS). For financial transaction networks, the similarity between accounts is time-dependent and dynamic.
On the one hand, accounts with frequent interactions are supposed to have a stronger relationship. Therefore, we let $\eta_{-}$ be a function that maps the timestamps of edges to a descending ranking. In this case, each edge $e \in S_t(u)$ is assigned the selection probability
$$P_{TBS}^{-}(e) = \frac{\eta_{-}(T(e))}{\sum_{e' \in S_t(u)} \eta_{-}(T(e'))}, \qquad (2)$$
where $T(e)$ denotes the timestamp of the edge $e$. This sampling method biases the selection towards edges that are closer in time to the previous edge.
On the other hand, sampling interactions among accounts over a large time interval may also be important in some network domains, for the purpose of preserving global similarity in the time domain. For such scenarios, we propose another strategy that favors edges occurring later relative to the previous timestamp. Let $\eta_{+}$ be a function that maps the timestamps of edges to an ascending ranking. The probability of selecting each edge $e \in S_t(u)$ is then given by
$$P_{TBS}^{+}(e) = \frac{\eta_{+}(T(e))}{\sum_{e' \in S_t(u)} \eta_{+}(T(e'))}. \qquad (3)$$
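The time-domain strategies can be sketched as follows. We assume here that a "ranking" assigns consecutive integers to distinct timestamps and that tied timestamps share a rank; the function names and the `favor` argument are illustrative choices rather than part of the method definition.

```python
def tus_probs(timestamps):
    """Uniform probabilities over the candidate edges (unbiased case)."""
    n = len(timestamps)
    return [1.0 / n] * n


def ascending_ranks(values):
    """Rank 1 for the smallest value; equal values share a rank."""
    order = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [order[v] for v in values]


def tbs_probs(timestamps, favor="near"):
    """Rank-based probabilities; 'near' favors early edges, 'far' favors late ones."""
    ranks = ascending_ranks(timestamps)
    if favor == "near":
        top = max(ranks)
        ranks = [top - r + 1 for r in ranks]   # descending ranking
    total = sum(ranks)
    return [r / total for r in ranks]


ts = [10, 20, 30]                  # timestamps of three candidate edges
print(tus_probs(ts))
print(tbs_probs(ts, favor="near"))
print(tbs_probs(ts, favor="far"))
```

Using ranks instead of raw timestamps makes the probabilities insensitive to the absolute scale of the time axis, which matters when timestamps are Unix epoch seconds.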
Apart from the transaction time, the amount values of the edges (edge weights) also play an essential role in financial transaction networks. In the following, we present unbiased and biased strategies in the amount domain.

Weighted Unbiased Sampling (WUS). Similar to TUS, this is the default setting in the amount domain, and each edge $e \in S_t(u)$ has the same probability of being sampled:
$$P_{WUS}(e) = \frac{1}{|S_t(u)|}. \qquad (4)$$

Weighted Biased Sampling (WBS). As illustrated in the Introduction, the weight value of each transaction indicates the significance of the interaction between the two accounts involved. In most instances, a higher transaction amount implies a larger similarity between the two accounts. Thus each edge $e \in S_t(u)$ can be assigned the selection probability
$$P_{WBS}(e) = \frac{W(e)}{\sum_{e' \in S_t(u)} W(e')}, \qquad (5)$$
where $W(e)$ denotes the weight of edge $e$. To prevent the extreme situation where edges with small weights would almost never be sampled, we consider a linear mapping function $\zeta(\cdot)$ that weakens the effect of edge weights. Thus we have
$$P_{WBS}^{\zeta}(e) = \frac{\zeta(W(e))}{\sum_{e' \in S_t(u)} \zeta(W(e'))}. \qquad (6)$$
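The following sketch contrasts weight-proportional sampling with a damped variant. The particular linear mapping used here (min-max scaling of the weights into the interval from `lo` to `hi`) is only an assumed example of a linear mapping function; the bounds 1 and 10 are arbitrary illustrative values.

```python
def wbs_probs(weights):
    """Selection probability proportional to the raw transaction amount."""
    total = sum(weights)
    return [w / total for w in weights]


def wbs_linear_probs(weights, lo=1.0, hi=10.0):
    """Probability proportional to a linearly rescaled amount (assumed mapping)."""
    w_min, w_max = min(weights), max(weights)
    if w_max == w_min:
        return [1.0 / len(weights)] * len(weights)
    scaled = [lo + (hi - lo) * (w - w_min) / (w_max - w_min) for w in weights]
    total = sum(scaled)
    return [s / total for s in scaled]


amounts = [0.001, 1.0, 100.0]      # Ether amounts of three candidate edges
print(wbs_probs(amounts))
print(wbs_linear_probs(amounts))
```

With raw amounts, the smallest transfer is selected with probability of roughly one in a hundred thousand, whereas the rescaled variant keeps it at a few percent while still preferring the largest transfer.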
Furthermore, we combine the aforementioned sampling probabilities from the temporal and amount domains into a single selection probability for each candidate edge, using a balancing parameter whose default value trades off the time domain against the amount domain. Note that TEDGE with the default settings TUS and WUS can be regarded as a specific version of DeepWalk for temporal and directed multigraphs such as transaction networks. In other words, under the temporal constraint, all candidate edges (temporal successive edges) are equally likely to be selected by TEDGE, while TEDGE (TBS), TEDGE (WBS) and TEDGE (TBS+WBS) select edges with temporal and/or weighted biases.
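To make the walk generator concrete, the sketch below combines one time-domain and one amount-domain probability and then samples a temporal walk edge by edge. The combination rule (a normalized weighted geometric mean) and the parameter name `alpha` with value 0.5 are assumptions chosen for illustration; the uniform time-domain probability and the weight-proportional amount-domain probability are likewise just one possible pairing.

```python
import random


def combine(p_time, p_amount, alpha=0.5):
    """Assumed rule: normalized p_time**alpha * p_amount**(1 - alpha)."""
    raw = [pt ** alpha * pw ** (1.0 - alpha) for pt, pw in zip(p_time, p_amount)]
    total = sum(raw)
    return [r / total for r in raw]


def temporal_walk(out_edges, start_edge, max_len, rng, alpha=0.5):
    """Sample a walk of at most max_len edges; edges are (src, dst, w, t)."""
    edges = [start_edge]
    while len(edges) < max_len:
        _, u, _, t = edges[-1]
        # Candidates: outgoing edges of u that are not earlier than t.
        cands = [e for e in out_edges.get(u, []) if e[3] >= t]
        if not cands:
            break
        p_time = [1.0 / len(cands)] * len(cands)        # uniform in time
        total_w = sum(e[2] for e in cands)
        p_amount = [e[2] / total_w for e in cands]      # proportional to amount
        probs = combine(p_time, p_amount, alpha)
        edges.append(rng.choices(cands, weights=probs, k=1)[0])
    nodes = [edges[0][0]] + [e[1] for e in edges]
    return nodes, edges


out_edges = {
    "a": [("a", "b", 1.0, 10)],
    "b": [("b", "c", 2.0, 5), ("b", "d", 1.0, 20)],
}
nodes, edges = temporal_walk(out_edges, ("a", "b", 1.0, 10), 3, random.Random(0))
print(nodes)
```

In this toy graph the walk can only continue to "d", because the transfer from "b" to "c" happened before the money reached "b".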
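Given the sampled temporal random walks, we formulate the task of learning time- and weight-dependent graph embeddings in a TWMDG as an optimization problem. This optimization aims to maximize the log-probability of observing a node's neighborhood conditioned on its embedding vector:
$$\max_{\Phi} \; \log \Pr\big(\{v_{i-\omega}, \dots, v_{i+\omega}\} \setminus v_i \mid \Phi(v_i)\big), \qquad (7)$$
where $\Phi(v_i)$ is the embedding vector of node $v_i$ and $\omega$ is the window size, which restricts the size of the random walk context. According to the conditional independence assumption in Skip-Gram, Eq. 7 can be transformed to
$$\max_{\Phi} \; \sum_{j = i-\omega,\ j \ne i}^{i+\omega} \log \Pr\big(v_j \mid \Phi(v_i)\big). \qquad (8)$$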
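In practice, this objective is optimized over (center, context) pairs extracted from each walk. The helper below is a small sketch of that extraction step; the function name `skipgram_pairs` is hypothetical, and the actual parameter updates are left to any standard Skip-Gram implementation.

```python
def skipgram_pairs(walk, window):
    """List (center, context) node pairs within `window` steps of each other."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


pairs = skipgram_pairs(["a", "b", "c"], window=1)
print(pairs)
```

Each pair contributes one log-probability term to the sum, so a larger window increases the number of terms per walk.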
3 Experiments on Ethereum
3.1 Data Collection
On Ethereum, accounts can be divided into two categories: externally owned accounts (EOAs), which are similar to ordinary bank accounts [Weili and Zibin2018], and smart contract accounts, which hold contract code. In this work, we focus on transactions among EOAs, because the Ether transfer records between them are publicly available on the blockchain. In addition, we only include successful transactions among EOAs with a non-zero amount in our dataset.
Since it is extremely time-consuming to process the whole Ethereum transaction network, which has more than two million EOAs [Chen et al.2018], we first select a number of objective accounts and then obtain their transaction data through the APIs of Etherscan (https://etherscan.io/). Centered on each objective account, we obtain a directed subgraph of a chosen order (see an example in Figure 4), where in and out are two parameters that control the depth of sampling inward and outward from the center, respectively.
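The subgraph sampling around an objective account can be sketched as two breadth-first expansions, one along outgoing edges and one along incoming edges. The function name `sample_subgraph` and the arguments `k_in` and `k_out` are illustrative, and the sketch operates on an in-memory edge list rather than on live API calls.

```python
from collections import defaultdict


def sample_subgraph(edges, center, k_in, k_out):
    """Collect edges within k_out hops downstream and k_in hops upstream."""
    out_adj, in_adj = defaultdict(list), defaultdict(list)
    for e in edges:                      # e = (src, dst, w, t)
        out_adj[e[0]].append(e)
        in_adj[e[1]].append(e)

    def expand(adj, depth, next_idx):
        frontier, seen, kept = {center}, {center}, []
        for _ in range(depth):
            nxt = set()
            for node in frontier:
                for e in adj.get(node, []):
                    kept.append(e)
                    neighbor = e[next_idx]
                    if neighbor not in seen:
                        seen.add(neighbor)
                        nxt.add(neighbor)
            frontier = nxt
        return kept

    kept = expand(out_adj, k_out, 1) + expand(in_adj, k_in, 0)
    # Remove edges reached from both directions; assumes edge tuples are unique.
    return list(dict.fromkeys(kept))


edges = [
    ("x", "c", 1.0, 1), ("y", "x", 1.0, 2),
    ("c", "d", 1.0, 3), ("d", "e", 1.0, 4), ("e", "f", 1.0, 5),
]
sub = sample_subgraph(edges, "c", k_in=1, k_out=2)
print(sorted(sub))
```

With one inward hop and two outward hops from "c", the transfers from "y" and to "f" fall outside the sampled subgraph.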
On Ethereum, the information related to an Ether transaction is stored as a data package. In detail, the TxHash field is the unique identifier of a transaction, the Value field refers to the amount of money transferred, and the Timestamp field indicates when the transaction happened. In addition, the From and To fields denote the sender and the recipient of the transaction. With the collected four-tuples (src, dst, w, t), we can easily construct a temporal weighted multidigraph.
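A possible conversion from raw transaction records to four-tuples is sketched below. The dictionary keys (`from`, `to`, `value`, `timeStamp`, `isError`) and the encoding of the value as a string in wei are assumptions about the shape of the records returned by a block explorer API, and the function name `to_four_tuples` is hypothetical. Distinguishing EOAs from contract accounts requires information beyond these fields and is not handled here.

```python
WEI_PER_ETHER = 10 ** 18


def to_four_tuples(records):
    """Keep successful, non-zero transfers and return (src, dst, w, t) tuples."""
    tuples = []
    for r in records:
        value_wei = int(r["value"])
        failed = r.get("isError", "0") != "0"
        if failed or value_wei == 0 or not r.get("to"):
            continue                      # skip failed, empty, or recipient-less records
        tuples.append((
            r["from"].lower(),
            r["to"].lower(),
            value_wei / WEI_PER_ETHER,    # amount in Ether
            int(r["timeStamp"]),          # Unix time in seconds
        ))
    tuples.sort(key=lambda x: x[3])
    return tuples


records = [
    {"hash": "0x1", "from": "0xAa", "to": "0xBb",
     "value": "1000000000000000000", "timeStamp": "1500000000", "isError": "0"},
    {"hash": "0x2", "from": "0xAa", "to": "0xBb",
     "value": "0", "timeStamp": "1500000100", "isError": "0"},
    {"hash": "0x3", "from": "0xBb", "to": "0xCc",
     "value": "500000000000000000", "timeStamp": "1400000000", "isError": "1"},
]
four_tuples = to_four_tuples(records)
print(four_tuples)
```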
3.2 Link Prediction
The link prediction problem is to predict the occurrence of links in a given graph on the basis of observed information. In this work, we first evaluate the performance of the proposed TEDGE method on a temporal directed link prediction task formulated as binary classification.
First, we sort all collected edges by timestamp and treat the earlier edges (those with smaller timestamps) as known links. These known links, together with the nodes they involve, constitute the current network, on which we learn node representations via graph embedding methods. Second, for the binary classifier, node pairs connected by a known link in the current network act as positive samples of the training set, and we randomly sample an equal number of node pairs with no link as negative samples. We obtain the features of a directed link from a source node to a target node by concatenating the embeddings of the two nodes. Finally, we train a support vector classifier and use it to classify the links in the test set, where the remaining links (those with larger timestamps) are treated as positive samples.

Table 1: Summary of the datasets.
| Dataset | Current network: #nodes | Current network: #edges | Node pairs: #train | Node pairs: #test | test/train |
| --- | --- | --- | --- | --- | --- |
| EthereumG1 | 3,832 | 208,927 | 13,658 | 1,140 | 8.35% |
| EthereumG2 | 10,628 | 208,533 | 26,958 | 7,510 | 27.86% |
| EthereumG3 | 26,175 | 677,785 | 66,102 | 11,502 | 17.40% |

Table 2: Link prediction performance (%).
| Method | EthereumG1 AUC | EthereumG1 AP | EthereumG2 AUC | EthereumG2 AP | EthereumG3 AUC | EthereumG3 AP |
| --- | --- | --- | --- | --- | --- | --- |
| DeepWalk | 82.71 | 76.69 | 85.91 | 82.13 | 79.92 | 77.72 |
| node2vec | 83.03 | 76.94 | 86.30 | 82.47 | 82.20 | 79.99 |
| TEDGE | 87.73 | 83.73 | 92.85 | 90.29 | 93.00 | 90.78 |
| TEDGE(TBS+WBS) | 89.55 | 85.58 | 93.36 | 90.94 | 93.83 | 91.89 |
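The temporal split, negative sampling, and feature concatenation used for link prediction can be sketched as follows. The split ratio argument `train_ratio`, the function names, and the toy embeddings are hypothetical; the classifier itself is omitted, and any support vector classifier implementation could consume the resulting feature vectors.

```python
import random


def temporal_split(edges, train_ratio):
    """Earlier edges become known links; later edges are held out for testing."""
    ordered = sorted(edges, key=lambda e: e[3])      # e = (src, dst, w, t)
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


def sample_negative_pairs(nodes, linked_pairs, n, rng):
    """Draw n directed node pairs with no known link.

    Enumerates all pairs, which is fine for a toy graph; rejection
    sampling would scale better on a large network.
    """
    ordered = sorted(nodes)
    candidates = [(u, v) for u in ordered for v in ordered
                  if u != v and (u, v) not in linked_pairs]
    return rng.sample(candidates, n)


def link_features(emb, u, v):
    """Feature vector of the directed link u -> v: source then target embedding."""
    return list(emb[u]) + list(emb[v])


edges = [("a", "b", 1.0, 1), ("b", "c", 1.0, 2), ("a", "b", 2.0, 3), ("c", "a", 1.0, 4)]
known, future = temporal_split(edges, train_ratio=0.75)
positives = sorted({(e[0], e[1]) for e in known})
nodes = {n for e in known for n in e[:2]}
negatives = sample_negative_pairs(nodes, set(positives), len(positives), random.Random(0))
emb = {"a": [0.1, 0.2], "b": [0.3, 0.4], "c": [0.5, 0.6]}   # toy embeddings
print(positives, negatives, link_features(emb, "a", "b"))
```

Because the features are concatenated in source-then-target order, the pairs ("a", "b") and ("b", "a") receive different feature vectors, which is what allows the classifier to predict link direction.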
Dataset
In this work, we collect three subgraphs of different sizes from Ethereum for the experiments. EthereumG1 is centered on account “0x51faeda318982f439e80012fb45d2b017ddccdbe” with in = out = 3; EthereumG2 is centered on account “0x5e247060f48eeb64367250ed03ff5091bba47fd1” with in = out = 4; EthereumG3 is centered on the same account as EthereumG1 with in = out = 4. A summary of the datasets is given in Table 1.
Settings
In the experiments, we compare the proposed TEDGE with two baseline random walk based graph embedding methods, DeepWalk [Perozzi et al.2014] and node2vec [Grover and Leskovec2016]. To ensure a fair comparison, we implement the directed versions of DeepWalk and node2vec using OpenNE [THUNLP2017], an open source toolkit for graph embedding. These random walk based embedding methods share several hyperparameters: the node embedding dimension, the window size, the walk length, and the number of walks per node. Some of these are fixed across all datasets, while the others are set separately for EthereumG1, EthereumG2, and EthereumG3. For node2vec, we grid search over its return parameter p and in-out parameter q following [Grover and Leskovec2016]. For DeepWalk, we set p = q = 1, as it is a special case of node2vec.
Discussion of results
Table 2 compares the performance of the various methods on temporal directed link prediction in terms of Area Under Curve (AUC) and Average Precision (AP). For a clearer illustration, we only show two extreme sampling strategies of the proposed algorithm: TEDGE, which does not apply any bias, and TEDGE (TBS+WBS), which combines biases from both the time domain and the amount domain with the default balancing parameter. As discussed in Section 2.2, there are two kinds of TBS, defined in Eqs. 2 and 3, as well as two kinds of WBS, defined in Eqs. 5 and 6. Here we implement all four possible combinations for TEDGE (TBS+WBS) and report the best result in Table 2.
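For reference, the two metrics can be computed from classifier scores as in the sketch below. This is a plain Python illustration written for clarity rather than speed; it assumes binary labels in {0, 1} with at least one positive and one negative sample, and it breaks score ties in AP by input order.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at each positive in the ranking."""
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auc(labels, scores), average_precision(labels, scores))
```

In this toy example, one negative pair is ranked above one positive pair, so three of the four positive-negative comparisons are won and the AUC is 0.75.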
According to Table 2, we have the following observations: (1) TEDGE without any bias substantially outperforms DeepWalk and node2vec on all three datasets (for example, on EthereumG3 its AUC is 93.00, compared with 82.20 for node2vec), which indicates that the temporal information and the multiplicity of edges in a TWMDG are important and meaningful for the analysis and understanding of financial transaction networks; (2) With biases in both the time and amount domains, TEDGE (TBS+WBS) attains better performance than unbiased TEDGE, demonstrating that the rich information from the time and amount domains does help us obtain a more comprehensive representation for predictive tasks.
To further illustrate the superiority of the TEDGE methods, we compare the performance of the embedding methods on EthereumG1 with varying values of the node embedding dimension, the walk length, the number of walks per node, and the window size. The results in Figure 5 show that: (1) TEDGE, with or without additional biases, consistently outperforms DeepWalk and node2vec under the different hyperparameter settings; (2) DeepWalk and node2vec are more sensitive to two hyperparameters, the walk length and the number of walks per node, while the TEDGE methods achieve promising results over a wide range of both; (3) Interestingly, as the embedding dimension increases, the performance of the TEDGE methods improves monotonically, whereas the performance of DeepWalk and node2vec degrades once the dimension exceeds 64, which implies that the TEDGE methods embed richer useful information and thus require a larger embedding dimension for data representation.
To further investigate the effects of different sampling strategies on the TEDGE methods, we provide results for all possible combinations of the three time domain strategies defined in Eqs. 1, 2, 3 and the three amount domain strategies described in Eqs. 4, 5, 6. Figure 6 shows that, on average, the biased methods {TEDGE (TBS), TEDGE (WBS), TEDGE (TBS+WBS)} outperform the unbiased method TEDGE, and that the methods adding bias in both the time and amount domains, TEDGE (TBS+WBS), surpass the methods adding only one bias {TEDGE (TBS), TEDGE (WBS)}.
3.3 Node Classification
Phishing scams are a type of cybercrime that has arisen along with the emergence of online business [Liu and Ye2001]. Phishing is reported to account for more than 50% of all cybercrimes on Ethereum since 2017 [Konradt et al.2016]. To further evaluate the performance of the proposed TEDGE strategies, we also conduct node classification experiments on Ethereum to classify labeled phishing nodes and unlabeled nodes (treated as non-phishing nodes). In this part, we consider 445 phishing nodes labeled by Etherscan and the same number of randomly selected unlabeled nodes as our objective nodes; a detailed list of these nodes is given in [Authors2019]. We assume that, for a typical Ether transfer flow centered on a phishing node, the node preceding the phishing node may be a victim, and the next one to three nodes may be bridge nodes with money laundering behaviors. Therefore, we collect subgraphs with in = 1, out = 3 for each of the 890 objective nodes and then splice them into a large-scale network with 86,623 nodes.
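Table 3: Node classification performance (%) under different training ratios.
| Method | 60% MiF1 | 60% MaF1 | 70% MiF1 | 70% MaF1 | 80% MiF1 | 80% MaF1 |
| --- | --- | --- | --- | --- | --- | --- |
| DeepWalk | 79.33 | 79.17 | 80.30 | 80.19 | 80.79 | 80.67 |
| node2vec | 79.72 | 79.56 | 80.15 | 80.05 | 80.56 | 80.36 |
| TEDGE | 81.97 | 81.95 | 82.17 | 82.15 | 82.81 | 82.78 |
| TEDGE(TBS+WBS) | 81.97 | 81.94 | 83.37 | 83.37 | 85.06 | 85.05 |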
For all embedding methods, we use the same hyperparameter setting (embedding dimension, window size, walk length, and number of walks per node), and the specific settings for node2vec are the same as those in the link prediction experiments. For a comprehensive evaluation, we randomly select {60%, 70%, 80%} of the objective nodes as the training set and use the remaining objective nodes as the test set. We use five-fold cross-validation to train the classifier and evaluate it on the test set. The results in terms of Micro-F1 (MiF1) and Macro-F1 (MaF1) are shown in Table 3. These results further support our assumption and motivation in Section 1 that, by taking temporal properties and money-transfer information into account, we can obtain a more meaningful representation of transaction networks, which effectively boosts predictive performance.
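The two F1 variants can be computed as in the following sketch, which is a plain Python illustration for the binary phishing versus non-phishing setting; the function name `f1_scores` and the label encoding (1 for phishing, 0 for non-phishing) are assumptions made for the example.

```python
def f1_scores(y_true, y_pred, classes=(0, 1)):
    """Return (micro-F1, macro-F1) over the given classes."""
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)   # pooled counts
    macro = sum(per_class) / len(per_class)               # mean of class F1
    return micro, macro


y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
micro, macro = f1_scores(y_true, y_pred)
print(micro, macro)
```

When every sample receives exactly one label, Micro-F1 equals accuracy, while Macro-F1 weights the two classes equally regardless of their sizes.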
4 Conclusion
In this work, we proposed a novel framework for Ethereum analysis via network embedding. In particular, we constructed a temporal weighted multidigraph to retain as much information as possible and presented a graph embedding method called TEDGE, which incorporates the temporal and weight information of financial transaction networks into node embeddings. We implemented the proposed method and two baseline embedding methods on real Ethereum networks for two predictive tasks of practical relevance, namely temporal link prediction and phishing/non-phishing node classification. Experimental results demonstrated the effectiveness of the proposed TEDGE embedding method and indicated that a temporal weighted multidigraph can more comprehensively represent the temporal and financial properties of dynamic transaction networks. In future work, we can use the proposed embedding method to investigate more applications on Ethereum or extend the current framework to analyze other large-scale temporal or domain-dependent networks.
References
 [Atzei et al.2017] Nicola Atzei, Massimo Bartoletti, and Tiziana Cimoli. A survey of attacks on ethereum smart contracts (sok). In Principles of Security and Trust, pages 164–186, Berlin, Heidelberg, March 2017. Springer Berlin Heidelberg.
 [Authors2019] Anonymous Authors. Objective accounts in node classification. https://anonfiles.com/3cl8X9ufb4/nodeClassification_xlsx, 2019.
 [Cai et al.2018] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. A comprehensive survey of graph embedding: problems, techniques and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616–1637, 2018.
 [Chen et al.2018] Ting Chen, Yuxiao Zhu, Zihao Li, Jiachi Chen, Xiaoqi Li, Xiapu Luo, Xiaodong Lin, and Xiaosong Zhang. Understanding ethereum via graph analysis. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pages 1484–1492, Honolulu, HI, USA, April 2018. IEEE.
 [Feder et al.2018] Amir Feder, Neil Gandal, JT Hamrick, and Tyler Moore. The impact of ddos and other security shocks on bitcoin currency exchanges: Evidence from mt. gox. Journal of Cybersecurity, 3(2):137–144, 2018.
 [Goyal and Ferrara2018] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
 [Grover and Leskovec2016] Aditya Grover and Jure Leskovec. Node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, New York, NY, USA, August 2016. ACM.
 [Konradt et al.2016] Christian Konradt, Andreas Schilling, and Brigitte Werners. Phishing: An economic analysis of cybercrime perpetrators. Computers & Security, 58:39–46, 2016.
 [Liu and Ye2001] Jiming Liu and Yiming Ye. Introduction to E-Commerce Agents: Marketplace Solutions, Security Issues, and Supply and Demand. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001.
 [Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 [Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, Lake Tahoe, Nevada, USA, December 2013. Curran Associates, Inc.
 [Möser et al.2013] Malte Möser, Rainer Böhme, and Dominic Breuker. An inquiry into money laundering tools in the bitcoin ecosystem. In 2013 APWG eCrime Researchers Summit, pages 1–14, San Francisco, CA, USA, September 2013. IEEE.
 [Nguyen et al.2018] Giang Hoang Nguyen, John Boaz Lee, Ryan A. Rossi, Nesreen K. Ahmed, Eunyee Koh, and Sungchul Kim. Continuous-time dynamic network embeddings. In Companion Proceedings of the The Web Conference 2018, pages 969–976, Republic and Canton of Geneva, Switzerland, April 2018. International World Wide Web Conferences Steering Committee.
 [Perozzi et al.2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, New York, NY, USA, August 2014. ACM.
 [Spitzer2013] Frank Spitzer. Principles of Random Walk. Springer Science & Business Media, 2013.
 [Swan2015] Melanie Swan. Blockchain: Blueprint for a new economy. O’Reilly Media, Inc., Cambridge, Massachusetts, 2015.
 [Tasca et al.2018] Paolo Tasca, Adam Hayes, and Shaowen Liu. The evolution of the bitcoin economy: Extracting and analyzing the network of payment relationships. The Journal of Risk Finance, 19(19):94–126, 2018.
 [THUNLP2017] THUNLP. Openne: An open source toolkit for network embedding. https://github.com/thunlp/openne, 2017.
 [Volpp2006] Leti Volpp. Complex networks: structure and dynamics. Physics Reports, 424(4):175–308, 2006.
 [Weili and Zibin2018] Chen Weili and Zheng Zibin. Blockchain data analysis: A review of status, trends and challenges. Journal of Computer Research and Development, 55(9):1853–1870, 2018.
 [Wood2014] Gavin Wood. Ethereum: A secure decentralised generalised transaction ledger. Ethereum project yellow paper, 151:1–32, 2014.