The narrative around cryptocurrency privacy provisions has dramatically changed since the inception of Bitcoin. Initially, many users, especially criminals, believed that Bitcoin and other cryptocurrencies provided enough privacy to hide their illicit business activities. The first extensive study of Bitcoin’s privacy provisions was done by Meiklejohn et al., who provide several powerful heuristics allowing one to cluster Bitcoin addresses. The revelation of Bitcoin’s privacy shortcomings spurred the creation and implementation of many privacy-enhancing overlays for Bitcoin [55, 9, 47, 64]. As of today, several Bitcoin wallets, e.g. the Wasabi and Samourai wallets, provide privacy-enhancing solutions to their users.
Previous work has focused on assessing the privacy guarantees provided by several UTXO-based (unspent transaction output) cryptocurrencies, such as Bitcoin [2, 32], Monero [13, 33, 7] or Zcash [6, 5, 7, 24, 53].
However, perhaps surprisingly, there have been no similar studies on the privacy provisions of account-based cryptocurrencies until now. Therefore, in this work, we put forth the problem of studying the privacy guarantees of Ethereum’s account-based model. Assessing and understanding the privacy guarantees of cryptocurrencies is essential, as the lack of financial privacy is detrimental to most cryptocurrency use cases. Furthermore, there are state-sponsored companies and other entities, e.g. Chainalysis, performing large-scale deanonymization tasks on cryptocurrency users.
In contrast to the UTXO-model, many cryptocurrencies apply the account model. In an account-based cryptocurrency, users store their assets not in UTXOs but in accounts. Already in the Bitcoin whitepaper, Nakamoto  suggested that “a new key pair should be used for each transaction to keep them from being linked to a common owner.” Despite this suggestion, account-based cryptocurrency users tend to use only a handful of addresses for their activities. The account-based model reinforces address-reuse on the protocol level. This behavior practically makes the account-based cryptocurrencies inferior to UTXO-based currencies from a privacy point of view.
Previously, several works have identified the privacy shortcomings of the account-based model, specifically in Ethereum. Those works have proposed trustless coin mixers [31, 49, 51] and confidential transactions [61, 10, 12]. However, until recently, none of these schemes has been deployed on Ethereum. Even today, Ethereum’s privacy-enhancing overlays are still in a nascent, immature phase especially in comparison with Bitcoin’s well-established coin mixer scene.
We identify and apply several quasi-identifiers stemming from address reuse (time-of-day activity, transaction fee, transaction graph), which allow us to profile and deanonymize Ethereum users.
We establish several heuristics to decrease the privacy guarantees of non-custodial mixers on Ethereum.
We describe a version of the Danaan-gift attack  applicable in Ethereum.
We collect and analyze a wide range of Ethereum-related data, including the Ethereum Name Service, the Etherscan blockchain explorer, Tornado Cash mixer contracts, and Twitter.
We release the collected data as well as our source code for further research (https://github.com/ferencberes/ethereum-privacy).
The rest of this paper is organized as follows. In Section 2, we review related work. In Section 3, we give a brief background on Ethereum and its inner workings. In Section 4, we describe our collected data. In Section 5, we overview the literature on quantifying deanonymization methods and propose our evaluation metrics. In Sections 6 and 7, we describe our main methods to pair Ethereum accounts that belong to the same user and link Tornado deposits and withdrawals. We describe a variant of the Danaan-gift attack in Section 8. Finally, we conclude our paper in Section 9 by pointing out promising directions for future work.
2 Related Work
First results on Ethereum deanonymization  attempted to directly apply both on-chain and peer-to-peer (P2P) Bitcoin deanonymization techniques. The starting point of our work is that common deanonymization methods for Bitcoin may not be applicable to Ethereum due to differences in Ethereum’s P2P stack and account-based model.
The relevant body of more recent literature takes two different approaches. The first analyzes and clusters Ethereum smart contracts with unsupervised clustering techniques. Kiffer et al. assert a large degree of code reuse, which might be problematic in case of vulnerable and buggy contracts.
The second and more relevant branch of literature analyzes and clusters addresses in Ethereum. A crude initial analysis was made by Payette et al., who cluster the Ethereum address space into only four different clusters. More interestingly, Friedhelm Victor proposes address clustering techniques based on participation in certain airdrops and ICOs. These techniques are indeed powerful; however, they do not generalize well, as they assume participation in certain on-chain events. Our techniques are more general and are applicable to all Ethereum addresses. Victor et al. gave a comprehensive measurement study of Ethereum’s ERC-20 token networks, which further facilitates the deanonymization of ERC-20 token holders.
A completely different and unique approach uses stylometry to deanonymize smart contract authors and their respective accounts. This work has been used to identify scams on Ethereum.
In this section we provide some background on cryptocurrency privacy-enhancing technologies. We provide more elementary preliminaries on Ethereum and its applied gas mechanism in Appendix A.
3.1 Non-custodial mixers
Coin mixing is a prevalent technique to enhance the transaction privacy of cryptocurrency users. Coin mixers may be custodial or non-custodial. In case of custodial mixing, a user wishing to enhance her privacy sends her “tainted” coins to a trusted party, who in return sends back “clean” coins after some timeout. This solution is not satisfactory, as the user does not retain ownership of her coins during the course of mixing. Hence, the trusted mixing party might just steal funds, as has already happened with custodial mixers.
Motivated by the drawbacks of custodial mixers, several non-custodial mixers have recently been proposed in the literature [31, 60, 49, 51]. The recurring theme of non-custodial mixers is to replace the trusted mixing party with a publicly verifiable transparent smart contract or with secure multi-party computation (MPC). Non-custodial mixing is a two-step procedure. First, users wishing to mix coins deposit equal amounts of ether or other tokens into a mixer contract from a deposit address, see Figure 1. After some user-defined time interval, they can withdraw their deposited coins with a withdraw transaction to a fresh withdraw address. In the withdraw transaction, users can prove to the mixer contract that they deposited without revealing which deposit transaction was issued by them, by using one of several available cryptographic techniques, including ring signatures, verifiable shuffles, threshold signatures, and zkSNARKs.
3.2 Ethereum Name Service
Ethereum Name Service (ENS) is a distributed, open, and extensible naming system based on the Ethereum blockchain. In spirit it is similar to the well-known Domain Name Service (DNS). However, in ENS the registry is implemented in Ethereum smart contracts (see https://docs.ens.domains), hence it is resistant to DoS attacks and data tampering. Like DNS, ENS operates on a system of dot-separated hierarchical names called domains, with the owner of a domain having full control over subdomains. ENS maps human-readable names like alice.eth to machine-readable identifiers such as Ethereum addresses. Therefore, ENS provides a more user-friendly way of transferring assets on Ethereum, where users can use ENS names (alice.eth) as recipient addresses instead of the error-prone hexadecimal Ethereum addresses.
4 Data collection
We collected addresses presumably belonging to regular users and not automatic (trader or exchange) bots from the following publicly available data sources. Twitter: By using the Twitter API, we were able to collect 890 ENS names included in Twitter profiles and discover the connected Ethereum addresses, see Figure 2. Humanity DAO (see https://www.humanitydao.org/humans): A human registry of Ethereum users, which can include a Twitter handle in addition to the Ethereum address. Tornado Cash mixer contracts: We collected all Ethereum addresses that issued or received transactions from Tornado Cash mixers. Table 1 shows the total number of addresses collected from each data source as well as the addresses with at least five sent transactions. We note that there are overlaps between the three address groups.
|Source||Total||At least five sent txs||Used as ground truth pairs|
By using the Etherscan blockchain explorer API, we collected 1,155,188 transactions sent or received by the addresses in our collection. The final transaction graph contains 159,339 addresses. The transactions span from 2015-07-30 till 2020-04-04. The distribution of the number of transactions sent by each Ethereum address follows a power-law distribution. Figure 3 shows the average number of transactions sent and received in the three data sources. Addresses collected from Twitter and Humanity DAO have similar behavior, while Tornado accounts have fewer transactions since Tornado Cash has only recently been launched.
Finally, using the Etherscan Label Word Cloud, we manually collected service category labels (e.g. exchange, gambling, stablecoins) related to the most popular addresses in our data set. We summarize the fraction of ENS names in our collection that interacted with the given services in Figure 4. We observed that the publicly revealed ENS names already expose sensitive activities such as gambling and adult services. Therefore, users should avoid sensitive activities on addresses easily linkable to their public identities, such as ENS name or their Twitter handle.
5 Evaluation measures
To establish an appropriate measure for evaluating our methods, we face the diversity and complexity of estimates of the adversary’s success to breach privacy. In the literature, the adversary’s output takes the form of a posterior probability distribution, see the survey.
The simplest metrics consider the success rate of a deanonymizing adversary. Metrics such as accuracy, coverage, fraction of correctly identified nodes [3, 37, 35] are applicable only when the attack has the potential to exactly identify a significant part of the network.
Exact identification is an overly ambitious goal in our experiments, which aim to use very limited public information to rank candidate pairs and quantify the leaked information as risk for a potential systematic deanonymization attack. For this reason, we quantify non-exact matches, since even though our deanonymizing tools might not exactly find a mixing address, they can radically reduce the anonymity set, which is still harmful to privacy. We want to quantify the information leaked from network structure, time-of-day activity, and gas price usage to assess the implications for the future privacy  of the account owners.
In our first two deanonymization experiments, our algorithms will return a ranked list of candidate pairs for each account in our testing set. Based on the ranked list, we propose a simple metric, the average rank of the target in the output.
Recent results consider deanonymization as a classification task and use AUC for evaluation. In our experiments, we will compute AUC by the following claim:
Consider a set of accounts $a$, each with a set of candidate pairs $c(a)$ such that exactly one in $c(a)$ is the correct pair of $a$. Let an algorithm return a ranked list of all sets $c(a)$. The AUC of this algorithm is equal to the average of $(|c(a)| - r(a)) / (|c(a)| - 1)$ over all $a$, where $r(a)$ is the rank of the correct pair of $a$ in the output.
Proof. The claim follows since AUC is the probability that a randomly selected correct record pair is ranked higher than another incorrect one. ∎
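The claim above can be turned into a few lines of code. The following is a minimal sketch, assuming that the per-account term is the fraction of incorrect candidates ranked below the correct one, i.e. (n - r) / (n - 1) for a candidate set of size n and a 1-based rank r. The function name and the toy ranks are hypothetical and not taken from our released source code.

```python
def auc_from_ranks(ranks, candidate_sizes):
    """Average of (n - r) / (n - 1) over all target accounts.

    ranks[i] is the 1-based rank of the correct pair for account i,
    candidate_sizes[i] is the number of candidates ranked for it.
    """
    if len(ranks) != len(candidate_sizes) or not ranks:
        raise ValueError("need one rank and one candidate-set size per account")
    total = 0.0
    for r, n in zip(ranks, candidate_sizes):
        if n < 2 or not 1 <= r <= n:
            raise ValueError("need at least 2 candidates and 1 <= rank <= size")
        # Fraction of incorrect candidates ranked below the correct one.
        total += (n - r) / (n - 1)
    return total / len(ranks)


# Correct pair ranked first, in the middle, and last among 5 candidates.
example_auc = auc_from_ranks([1, 3, 5], [5, 5, 5])
print(example_auc)
```

A correct pair that always comes first yields an AUC of 1, while a correct pair that always comes last yields 0.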
Finally, we consider evaluation by variants of entropy, which quantify privacy loss by the number of bits of additional information needed to identify a node. Defining entropy is difficult in our case for two reasons. First, our algorithms provide a ranked list and not a probability distribution. Second, for Tornado mixer deanonymization, the anonymity set size is dynamic, as users can freely deposit anytime they wish, hence increasing the anonymity set size.
In the literature, entropy based evaluation considers the a priori knowledge without a deanonymization method and the a posteriori knowledge after applying one. Several papers compute the entropy of the a posteriori knowledge [50, 15, 36], however they assume that the deanonymizer outputs a probability distribution of the candidate records.
The information the attacker has learned with the attack can be expressed as the difference of the a priori and a posteriori entropy. We call this difference the entropy gain, denoted as $\mathrm{gain}(n, p)$, where $n$ and $p$ are the anonymity set size and probability distribution, respectively. The a priori entropy of the target record is typically the base-2 logarithm of the a priori anonymity set size. The problem with varying a priori anonymity set size is that while correctly selecting ten candidate users from a pool of a million is a great achievement, the same entropy of $\log_2 10$ is achieved without deanonymization if the initial pool size, for example in a low-utilization mixer, is only 10. We note that in prior work, the authors also divide the entropy gain to normalize the value.
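To make the pool-size example concrete, the sketch below computes the entropy gain as the base-2 logarithm of the a priori anonymity set size minus the Shannon entropy of the a posteriori distribution. The function names are hypothetical, and the uniform posterior over ten candidates is an assumption made only for illustration.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution (zero entries are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_gain(prior_size, posterior):
    """A priori entropy log2(prior_size) minus the a posteriori entropy."""
    if abs(sum(posterior) - 1.0) > 1e-9:
        raise ValueError("posterior must sum to 1")
    return math.log2(prior_size) - shannon_entropy(posterior)


# Narrowing a pool of one million down to ten equally likely candidates.
large_pool = entropy_gain(1_000_000, [0.1] * 10)
# A low-utilization mixer with ten deposits and no deanonymization at all.
small_pool = entropy_gain(10, [0.1] * 10)
print(round(large_pool, 2), abs(round(small_pool, 2)))
```

Both cases end with the same a posteriori entropy, but only the first one represents information gained by the attacker, which is why the gain rather than the raw entropy is compared.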
Next, we describe a new method to infer the a posteriori distribution given varying a priori knowledge and appropriately normalize with respect to the a priori entropy. More precisely, first we give a heuristic argument that the a priori anonymity set size has little effect on the entropy gain, and hence we can compare and average across different measurements. In the formula below, given an a priori anonymity set size vs. , we compare the entropy gain of the same distribution , gaingain. In the formula below, denotes the probability .
Since , we may group the terms to obtain the difference in the entropy gain as the sum for of
which can be bounded from above by using as
If the probability distribution is smooth with little density changes in a neighborhood, the above value is very small. For example, the value is small if is monotonic in , which at least approximately holds in our experiments.
Based on the above argument, we may infer an empirical probability distribution of the candidates ranked by an algorithm. For each a priori size and rank for the ground truth pair of a target record, we define the distribution to be uniform in , and 0 elsewhere, in accordance with formula (5). The empirical probability distribution of an algorithm will be the average of over all the output of the algorithm. In the discussion, we will use the entropy gain of the above empirical probability distribution to quantify the deanonymization power of our algorithms.
6 Finding Ethereum accounts of the same user
In this section, we introduce our approach to identify pairs of Ethereum accounts that belong to the same user. In our measurements, we investigated three quasi-identifiers of the account owner: the active time of the day, the gas price selection, and the location in the Ethereum transaction graph.
We evaluate our methods by using the set of address pairs in our collection that belong to the same name in the Ethereum Name Service (ENS), see Figure 2. We consider 129 ENS names with exactly two Ethereum addresses to avoid the possible validation bias caused by ENS names with more than two addresses. We also note that Ethereum addresses connected to multiple ENS names were excluded from our experiments.
6.1 Time-of-day transaction activity
Ethereum blockchain transaction timestamps reveal the daily activity patterns of the account owner, see Figure 6. In the top row of Figure 8, we show time-of-day profiles for two ENS names that are active in different time zones.
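A time-of-day profile of this kind can be sketched as a normalized histogram of the hours at which an address sends transactions. The code below is only an illustration: the function name, the use of UTC hours, and the default of six four-hour bins are assumptions, and the timestamps are made up.

```python
from datetime import datetime, timezone


def time_of_day_profile(timestamps, num_bins=6):
    """Normalized histogram of transaction hours (UTC) with equal-width bins."""
    if 24 % num_bins != 0:
        raise ValueError("num_bins must divide 24")
    hours_per_bin = 24 // num_bins
    counts = [0] * num_bins
    for ts in timestamps:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        counts[hour // hours_per_bin] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * num_bins


# Three hypothetical transactions at 01:00, 02:30 and 13:00 UTC on 2020-01-01.
base = 1577836800  # 2020-01-01 00:00:00 UTC
profile = time_of_day_profile([base + 3600, base + 9000, base + 46800])
print(profile)
```

Two addresses of the same owner are expected to have similar profiles, whereas owners in distant time zones produce profiles that peak in different bins.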
6.2 Gas price distribution
Ethereum transactions also contain the gas price, which is usually set automatically by wallet software. Users rarely change this setting manually. Most wallet user interfaces offer three levels of gas prices (slow, average, and fast), where the fast gas price guarantees almost immediate inclusion in the blockchain.
The changes in daily Ethereum traffic volume sometimes cause temporary network congestion, which affects user gas prices. Hence we normalized the gas price by the daily network average. In Figure 7, the two peaks of the normalized gas price correspond to the slow and average gas price options. On the other hand, users only occasionally pay more than three times the daily average gas price. The combination of these gas price levels forms the so-called gas price profile for each Ethereum user.
Given the normalized gas prices of the transactions sent, an account is represented by the vector including the mean, median and standard deviation, as well as the histogram divided into bins.
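The gas price representation can be sketched as follows. This is an illustration under stated assumptions, not the code used in our measurements: the number of histogram bins (five), the cap at three times the daily average, the use of the population standard deviation, and all function names are hypothetical choices.

```python
import statistics
from collections import defaultdict


def daily_average_gas_price(transactions):
    """Mean gas price per day over all (day, gas_price) records."""
    per_day = defaultdict(list)
    for day, gas_price in transactions:
        per_day[day].append(gas_price)
    return {day: statistics.fmean(prices) for day, prices in per_day.items()}


def gas_price_profile(account_txs, daily_avg, num_bins=5, max_ratio=3.0):
    """Mean, median, std and a normalized histogram of normalized gas prices."""
    if not account_txs:
        raise ValueError("the account needs at least one sent transaction")
    ratios = [price / daily_avg[day] for day, price in account_txs]
    width = max_ratio / num_bins
    hist = [0] * num_bins
    for r in ratios:
        # Ratios above max_ratio fall into the last bin.
        hist[min(int(r / width), num_bins - 1)] += 1
    hist = [h / len(ratios) for h in hist]
    return [statistics.fmean(ratios), statistics.median(ratios),
            statistics.pstdev(ratios)] + hist


# Hypothetical network-wide records and one account's sent transactions.
network = [("d1", 10), ("d1", 30), ("d2", 20)]
averages = daily_average_gas_price(network)
vector = gas_price_profile([("d1", 10), ("d2", 20)], averages)
print(vector)
```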
6.3 Transaction graph analysis
The set of addresses used in interactions characterizes a user. Users with multiple accounts might interact with the same addresses or services from most of them. Furthermore, as users move funds between their personal addresses, they may unintentionally reveal their address clusters.
To exploit the transaction network for deanonymization, we constructed a transaction graph with nodes as Ethereum addresses and edges as transactions in our data collection. To find similar node pairs in this network, we use node embedding methods that map graph vertices into a Euclidean space in a way that nodes with similar neighbourhood are close in the embedded space. To the best of our knowledge, we are the first to apply node embedding for Ethereum user profiling. In our measurements, we used the graph embedding library (https://github.com/benedekrozemberczki/karateclub) of Rozemberczki et al., which includes ten graph neighbourhood preserving embedding methods [28, 63, 46, 44, 43, 40, 11, 42, 52, 4] as well as two structural ones [16, 46].
We applied the embedding methods after the following preprocessing steps. First, we considered transactions as undirected edges and removed loops and multi-edges. We also removed nodes with degree one as well as vertices that are not present in the largest connected component. The resulting graph has 16,704 nodes and 132,231 edges. We generated -dimensional representations for the accounts. In order to combine with timestamp and gas price representations, we assign the overall average of the network embedding vectors to the removed nodes.
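The preprocessing steps above can be sketched in plain Python as follows. The sketch assumes that degree-one nodes are removed in a single pass before the largest connected component is selected; the function name and the toy edge list are hypothetical.

```python
from collections import defaultdict


def preprocess(edges):
    """Return adjacency sets of the cleaned, undirected transaction graph."""
    adj = defaultdict(set)
    for src, dst in edges:
        if src != dst:          # drop self-loops
            adj[src].add(dst)   # sets collapse multi-edges
            adj[dst].add(src)   # ignore edge direction
    # Drop nodes that have a single neighbour (one pass, not iterated).
    low = {n for n, nbrs in adj.items() if len(nbrs) == 1}
    adj = {n: nbrs - low for n, nbrs in adj.items() if n not in low}
    # Keep only the largest connected component.
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            for nbr in adj[stack.pop()]:
                if nbr not in component:
                    component.add(nbr)
                    stack.append(nbr)
        seen |= component
        if len(component) > len(best):
            best = component
    return {n: adj[n] & best for n in best}


# Toy transaction list: a triangle, a pendant node, a self-loop, a stray edge.
graph = preprocess([("a", "b"), ("b", "a"), ("b", "c"), ("c", "a"),
                    ("c", "d"), ("e", "e"), ("x", "y")])
print(sorted(graph))
```

The resulting adjacency structure is what an embedding method would receive as input; the embedding step itself is not shown here.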
Based on timestamp and gas price distributions as well as network embedding, we generate Euclidean feature vectors for Ethereum addresses with each having at least five transactions sent, see Table 1. Given a target address, we order all other addresses in our set by their Euclidean distance from the target. We consider multiple representations by concatenating the vectors of the timestamp, gas price and network embedding representations.
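The ranking step can be sketched as follows: representations are concatenated per address, and all other addresses are sorted by their Euclidean distance from the target. The function names and the toy vectors are hypothetical, and no feature scaling is applied, which is an assumption of this sketch.

```python
import math


def concatenate(*representations):
    """Join per-address vectors from several representations into one vector."""
    addresses = set.intersection(*(set(r) for r in representations))
    return {a: [x for r in representations for x in r[a]] for a in addresses}


def rank_candidates(target, features):
    """Order all other addresses by Euclidean distance from the target."""
    others = [a for a in features if a != target]
    return sorted(others, key=lambda a: math.dist(features[target], features[a]))


# Hypothetical time-of-day and gas price vectors of three addresses.
time_repr = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
gas_repr = {"A": [1.0], "B": [1.1], "C": [2.0]}
features = concatenate(time_repr, gas_repr)
ranking = rank_candidates("A", features)
print(ranking)
```

The 1-based position of the ground truth pair in such a ranking is the rank used by the evaluation measures of Section 5.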
In the evaluation, we use 129 address pairs for testing that belong to the same ENS name. The accuracy metrics of Section 5 for identifying accounts of the same user by using time-of-day activity and normalized gas price are given in Figures 9–11. While the time-of-day representation works best with four to six hour long bins, the normalized gas price representation performs weaker, and the histogram gives only a very small improvement over the mean, median and standard deviation.
The performance of different node embedding algorithms is shown in Figures 12–14 based on independent experiments. Diff2Vec, a neighbourhood preserving embedding technique, performed best, followed by Role2Vec, which captures the structural node properties in the graph. Reciprocal rank combination of Diff2Vec and Role2Vec gives the best performance.
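One common way to combine two rankings by reciprocal rank is to sum 1/rank for every candidate over the individual lists and re-sort by the combined score. The sketch below shows this variant; whether it matches the exact combination rule used in our experiments is an assumption, and the function name and candidate lists are hypothetical.

```python
def reciprocal_rank_combination(rankings):
    """Merge ranked lists by summing 1/rank of every candidate across lists."""
    scores = {}
    for ranking in rankings:
        for position, candidate in enumerate(ranking, start=1):
            scores[candidate] = scores.get(candidate, 0.0) + 1.0 / position
    # Higher combined score first; ties are broken by name for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical candidate rankings produced by two embedding methods.
diff2vec_ranking = ["B", "C", "D"]
role2vec_ranking = ["D", "B", "C"]
combined = reciprocal_rank_combination([diff2vec_ranking, role2vec_ranking])
print(combined)
```

A candidate that is ranked near the top by both methods is rewarded more than one that is ranked first by a single method only.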
In Figure 15, we show the fraction of pairs where the rank of the ground truth pair is not more than a given value. Surprisingly, Diff2Vec and Role2Vec find the corresponding ENS address pairs within closest representations by almost more likely than time-of-day activity and gas price statistics. The combination of Diff2Vec and Role2Vec further improves the performance.
7 Deanonymizing trustless mixing services on Ethereum
As the Ethereum community realises the consequences of the lack of privacy on Ethereum, more and more emphasis is put on increasing transaction privacy [31, 49, 51]. Hence, privacy-enhancing tools became crucially important gadgets in the Ethereum ecosystem. Without doubt, the most popular is Tornado Cash (TC), a non-custodial zkSNARK-based mixer. It allows its users to enhance their anonymity by hiding their identity among a set of participating users. In this section, we provide techniques and heuristics allowing one to decrease the anonymity achieved in a TC mixer.
The Tornado Cash (TC) Mixers are sets of trustless Ethereum smart contracts allowing Ethereum users to enhance their anonymity.
A TC mixer contract holds equal amounts of funds (ether or other ERC-20 tokens) from a set of depositors. One mixer contract typically holds one type of asset. In case of the TC mixer, anonymity is achieved by applying zkSNARKs. Each depositor inserts a hash value in a Merkle-tree. Later, at withdraw time, each legitimate withdrawer can prove unlinkably with a zero-knowledge proof that they know the pre-image of a previously inserted hash leaf in the Merkle-tree. Subsequently, users can withdraw their asset from the mixer whenever they consider that the size of the anonymity set is satisfactory.
Cryptocurrency mixers typically provide $k$-anonymity (also known as plausible deniability) to their users. Generally speaking, a $k$-anonymized dataset has the property that each record is indistinguishable from at least $k-1$ others. Specifically, if a mixer contract holds $n$ deposits out of which $m$ have already been withdrawn, then the next withdrawer will be indistinguishable among at least those $n-m$ users who have not withdrawn from the mixer yet. Hence each withdrawer can enhance their transaction privacy and make their identity indistinguishable among at least $n-m$ addresses. We call the set containing the indistinguishable addresses the anonymity set of the user.
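The anonymity set size seen by each withdrawer can be tracked by replaying the mixer events in block order, counting the deposits that have not yet been withdrawn. The following is a minimal sketch with a hypothetical function name and a made-up event sequence; it ignores any deposits already linked by deanonymization heuristics.

```python
def anonymity_set_sizes(events):
    """Anonymity set size seen by each withdrawer, from time-ordered events.

    events is a list of "deposit" / "withdraw" strings in block order.
    """
    deposits = withdrawn = 0
    sizes = []
    for kind in events:
        if kind == "deposit":
            deposits += 1
        elif kind == "withdraw":
            # Deposits not yet withdrawn at the moment of this withdraw.
            sizes.append(deposits - withdrawn)
            withdrawn += 1
        else:
            raise ValueError(f"unknown event type: {kind}")
    return sizes


sizes = anonymity_set_sizes(
    ["deposit", "deposit", "deposit", "withdraw", "deposit", "withdraw"]
)
print(sizes)
```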
In Figure 16, we show the changes in the anonymity set size over time for the four TC mixer contracts (0.1 ETH, 1 ETH, 10 ETH, 100 ETH). Since TC was launched in December 2019, hundreds of deposits have been placed in the mixers as more and more users interacted with this service. In general, we observe orders of magnitude lower activity for the 100 ETH mixer, thus it does not provide as much anonymity as the mixers with lower values (0.1 ETH, 1 ETH, 10 ETH).
7.1 Heuristics for linking mixer deposits and withdraws
Unfortunately, careless usage easily reveals links between deposits and withdraws and also impacts the anonymity of other users, since a deposit that can be linked to a withdraw no longer belongs to the anonymity set. Next, we list three usage patterns that can be used to link deposits and withdraws. The simplest careless usage is applying the same address for both the deposit and the withdraw transaction:
Heuristic 1. If there is an address from where a deposit and also a withdraw has been made, then we consider these deposits and withdraws linked.
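Heuristic 1 amounts to a join of deposits and withdraws on the address. The sketch below assumes that deposits and withdraws are available as (transaction hash, address) tuples; this data layout, the function name, and the sample values are hypothetical.

```python
def heuristic_1(deposits, withdraws):
    """Link deposits and withdraws that involve the same address.

    deposits and withdraws are lists of (tx_hash, address) tuples.
    """
    by_address = {}
    for tx, addr in deposits:
        by_address.setdefault(addr, []).append(tx)
    return [(d_tx, w_tx)
            for w_tx, addr in withdraws
            for d_tx in by_address.get(addr, [])]


links_h1 = heuristic_1(
    deposits=[("d1", "0xA"), ("d2", "0xB")],
    withdraws=[("w1", "0xA"), ("w2", "0xC")],
)
print(links_h1)
```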
The next heuristic is based on salient gas price settings. Most wallet software, e.g. Metamask or My Ether Wallet, automatically sets gas prices as multiples of Gwei ($10^9$ wei, i.e. giga wei). However, one can observe gas prices whose last 9 digits are not all zero, hence those gas prices are likely set by the transaction issuer manually. These custom-set gas prices can be used to link deposit and withdraw transactions. For instance, one might observe a deposit transaction with a custom-set gas price and, later on, a withdraw transaction in the Ethereum blockchain with exactly the same custom-set gas price. This deposit and withdraw pair can be linked.
Heuristic 2. If there is a deposit-withdraw pair with unique and manually set gas prices, then we consider them as linked.
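Heuristic 2 can be sketched as follows. The sketch treats a gas price as manually set if it is not a multiple of $10^9$ wei, and as unique if it appears in exactly one deposit and exactly one withdraw of the mixer. Both readings, as well as the requirement that the deposit precedes the withdraw, the tuple layout, and the sample values, are assumptions made for illustration.

```python
from collections import Counter

GWEI = 10**9


def heuristic_2(deposits, withdraws):
    """Link deposit-withdraw pairs sharing a unique, manually set gas price.

    deposits / withdraws: lists of (tx_hash, block_height, gas_price_in_wei).
    """
    custom_deposits = [t for t in deposits if t[2] % GWEI != 0]
    custom_withdraws = [t for t in withdraws if t[2] % GWEI != 0]
    dep_counts = Counter(t[2] for t in custom_deposits)
    wdr_counts = Counter(t[2] for t in custom_withdraws)
    dep_by_price = {t[2]: t for t in custom_deposits}
    links = []
    for w_tx, w_block, price in custom_withdraws:
        if dep_counts[price] == 1 and wdr_counts[price] == 1:
            d_tx, d_block, _ = dep_by_price[price]
            if d_block < w_block:  # the deposit must precede the withdraw
                links.append((d_tx, w_tx))
    return links


links_h2 = heuristic_2(
    deposits=[("d1", 100, 1_100_000_123), ("d2", 101, 2 * GWEI)],
    withdraws=[("w1", 150, 1_100_000_123), ("w2", 160, 2 * GWEI)],
)
print(links_h2)
```

The pair sharing the round price of 2 Gwei is not linked, since round prices are likely wallet defaults shared by many users.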
Frequently, users reveal links between their deposit and withdraw addresses if they send transactions from one of their addresses to another address belonging to them. We conjecture that users falsely expect that withdraw addresses are clean, and therefore that they can send transactions from any address to their clean withdraw addresses. However, if the withdraw address can be linked to one of their deposit addresses, then they effectively lose all the privacy guarantees accomplished by the fresh withdraw address. Expressed differently, if users run out of clean funds at their fresh addresses, they might feel tempted to move “dirty” assets to their “clean” addresses. Again, such a transaction links “clean” and “dirty” addresses, which is captured by the following heuristic.
Heuristic 3. Let $d$ be a deposit and $w$ a withdraw address in a TC mixer. If there is a transaction between $d$ and $w$ (or vice versa), we consider the addresses linked.
One could easily generalize Heuristic 3 by requiring transactions to be sent from not only a depositor address $d$, but rather from any address in the cluster of addresses containing $d$. However, we leave the implementation of this generalization for future work.
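Heuristic 3 can be sketched as a scan over the transaction list for direct transfers between a deposit address and a withdraw address of the same mixer. The function name, the (sender, receiver) tuple layout, and the sample addresses are hypothetical; the cluster-based generalization is not implemented here.

```python
def heuristic_3(deposit_addresses, withdraw_addresses, transactions):
    """Link deposit and withdraw addresses that transacted with each other.

    transactions is an iterable of (sender, receiver) address pairs.
    Returns a set of (deposit_address, withdraw_address) pairs.
    """
    deposit_addresses = set(deposit_addresses)
    withdraw_addresses = set(withdraw_addresses)
    links = set()
    for sender, receiver in transactions:
        if sender == receiver:
            continue  # same-address reuse is the subject of Heuristic 1
        if sender in deposit_addresses and receiver in withdraw_addresses:
            links.add((sender, receiver))
        if receiver in deposit_addresses and sender in withdraw_addresses:
            links.add((receiver, sender))
    return links


links_h3 = heuristic_3(
    deposit_addresses={"0xD"},
    withdraw_addresses={"0xW"},
    transactions=[("0xD", "0xW"), ("0xX", "0xW"), ("0xW", "0xD")],
)
print(links_h3)
```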
|Mixer||Heuristic 1||Heuristic 2||Heuristic 3||Total||Withdraws|
Applying Heuristics 1–3, we linked withdraws in each of the four mixer contracts (0.1 ETH, 1 ETH, 10 ETH, 100 ETH), see Table 2. We note that withdraws identified by Heuristic 2 can also overlap with withdraws identified by Heuristic 1 or 3. Hence the total number of linked withdraws is less than the sum of the withdraws individually identified by each heuristic.
7.2 Elapsed time between deposit and withdraw
In Figure 17, we observe that most users of the linked deposit-withdraw pairs leave their deposit for less than a day in the mixer contract. This user behavior can be exploited for deanonymization by assuming that the vast majority of the deposits are always withdrawn after one or two days.
Even worse, in Figure 19 we observe several addresses receiving more than one withdraw from the same ETH mixer contract. For instance, some addresses received two withdraws and others received three. Withdraw clusters pose a privacy risk not just for the owner but for all other mixer participants as well. Note that proper usage requires always withdrawing to fresh addresses.
7.3 Deanonymization performance
Next we measure how well the techniques of Section 6 identify the linked withdraw-deposit address pairs. We build ground truth by using Heuristics 2–3 of Section 7.1. We define three different ground truth sets, one when the deposit is within the past day of the withdraw, another when within the past week, and the unfiltered full set, see Fig. 18.
Note that our ground truth sets are compiled by using Heuristics 2–3, and hence are correct to the best of our knowledge of the data. Since in Heuristic 2 we used gas prices and in Heuristic 3 an edge between the two addresses, in this section we show gas price only as a reference, and omit the edges used by the heuristic for the network analysis algorithms. As we will see, the gas price distribution performs weakly at finding the account pairs identified by the heuristics, even though Heuristic 2 is based on gas price. Adding the edges between accounts identified by Heuristic 3 would yield overly strong deanonymization results, since the same information would be used for deanonymization and testing.
Figure 20 shows that an address with withdraw within a day or week has significantly smaller anonymity set size, on average, since we only search for the corresponding deposit in a smaller set. For example, for the ETH mixer the original average anonymity set size of could be reduced to almost by assuming that the deposit occurred within one day of the withdraw.
We note that in Figure 20 and all other measurements over the filtered ground truth sets, we do not discount for the withdraw addresses that are not included in the filtered set. For example, as seen in Figure 17, for 80% of the 0.1-Ether withdraw transactions we list candidate deposits, but for the remaining 20% we make no deanonymization attempt. To normalize the results by considering these withdraws, we have to assume that the corresponding deposit is not in the 80-element candidate set but among the remaining 320 deposits, thus giving an average rank contribution of 160 for 20% of the data. Hence the average rank for 0.1-Ether withdraws with a deposit within a week has an additional correction of $0.2 \cdot 160 = 32$; by similar calculations, the correction for transactions within a day is 63.
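The correction for the within-a-week case can be reproduced with a short calculation. The sketch below reads the figures 80 and 20 as percentages of the withdraws, takes 400 as the total number of deposits (the 80-element candidate set plus the remaining 320), and assumes the true deposit is uniformly placed among the remaining deposits. These readings and the function name are assumptions.

```python
def rank_correction(total_deposits, candidate_set_size, uncovered_fraction):
    """Average-rank correction for withdraws left out of a filtered set."""
    remaining = total_deposits - candidate_set_size
    # Assume the true deposit is uniformly placed among the remaining ones.
    average_rank = remaining / 2
    return uncovered_fraction * average_rank


# 0.1-Ether mixer, deposits within a week: 20% of withdraws are not covered.
weekly_correction = rank_correction(400, 80, 0.2)
print(weekly_correction)
```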
Daily activity and Diff2Vec have similar performance while their concatenated feature vectors proved to be the best address representation; for the smaller ground truth sets, they identify related deposit addresses within the 20 and 5 closest representations on average. Withdraw linking performance is further improved by concatenating the two models. Entropy gain is shown in Figure 22 and the number of withdraws linked to deposits within a given rank of the output for the best methods are in Figure 21.
In Figure 23, we show the withdraw linking performance over time. As the number of active deposits increases, it becomes harder to link withdraws to any of the past deposits. However withdraws that follow the deposit after a few days are still much easier to deanonymize.
7.4 Maintaining privacy
We believe that TC mixers could achieve higher degrees of anonymity if users applied the technology in a sound way, or if privacy-focused wallet software helped them by abstracting away potential privacy leaks.
7.4.1 Randomized mixing intervals
7.4.2 Fresh withdraw addresses
Currently, many users apply the same withdraw addresses across several withdraws, see Figure 19. This behavior greatly decreases the complexity of linking deposits and withdraws. Therefore users must use fresh withdraw addresses for each of their withdraws. This issue could have been easily fixed on the user interface level.
7.4.3 Mixer usage and user behaviors
Mixers mainly attempt to break the link between sets of transaction graphs associated with Ethereum accounts. As such, users need to ensure that their on-chain behaviors are unlinkable between uses of the TC mixers. Therefore, to ensure maximal privacy, users should use the TC mixers after every transaction. However, this decreases the user experience and ability to use applications on Ethereum.
8 Danaan-gift attack in Ethereum
The Danaan-gift attack, also known as malicious value fingerprinting, was originally introduced in the context of Zcash. In a value fingerprinting attack, an adversary sends a cryptocurrency transaction with a crafted amount to add a fingerprint to the receiver’s account balance. We notice that these attacks are applicable to Ethereum as well. Most wallet software denominates gas prices in multiples of gwei ($10^9$ wei), hence transaction fees overwhelmingly do not change the last nine digits of an account balance. However, users might set transaction fees manually, potentially changing their own fingerprint. The last digits of an account balance have no economic significance but could be used as a fingerprint by an adversary.
First, we measure the fraction of ether transfer transactions that modify the account fingerprint. For the sake of robustness of the measurements, we chose fingerprints consisting of the last eight digits. As seen in Figure 24, account balances are mostly integer values. However, the rest of the fingerprint values modulo $10^8$ are moderately uniformly distributed. The entropy of the account balance fingerprints and the corresponding entropy gain suggest that account balances might be easily fingerprinted. In the sequel, we estimate the average fingerprint survival probability.
Let $S$ denote the event that a fingerprint of an address remains unchanged. To approximate the event probability $\Pr(S)$, let $p$ denote the probability that a transaction modifies the fingerprint and let $n$ denote the number of transactions sent or received by the given address in our dataset. By assuming that each transaction is independent from all others, the fingerprint survival probability of this address is $(1-p)^n$.
We observe that the number of transactions sent and received by an address follows a power-law distribution with exponent $\alpha$. The average survival probability of all addresses can hence be approximated by the following integral, where we group by $n$, the number of transactions of an address:
$$\Pr(S) \approx \int f(n)\,(1-p)^{n}\,dn, \qquad f(n) \propto n^{-\alpha},$$
which can be computed in a closed formula. The numerical values are summarized in Table 3.
As the number of transactions sent follows a power-law distribution, the average value is skewed by the tail of the distribution. Therefore it makes sense to calculate the average survival probability for several cutoffs of the tail, see Table 3. Namely, for each cutoff we only consider addresses in our data set that sent fewer transactions than the cutoff value. One can observe how fingerprint survival probability increases among users with a small number of transactions. For example, an adversary could successfully fingerprint of the addresses that send not more than transactions. This result is comparable to the fingerprint survival probability observed in Zcash [5].
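The effect of the cutoff can be illustrated numerically. The sketch below assumes that each transaction independently modifies the fingerprint with some probability, and replaces the integral by a discrete sum weighted by a power law over the transaction count. The modification probability and the power-law exponent are hypothetical placeholders, not the values measured in our dataset.

```python
def survival_probability(p_modify: float, n_tx: int) -> float:
    """Probability that a fingerprint survives n_tx independent transactions."""
    return (1.0 - p_modify) ** n_tx


def average_survival(p_modify: float, exponent: float, cutoff: int) -> float:
    """Average survival over addresses with 1 <= n_tx < cutoff transactions.

    Each transaction count n_tx is weighted by the power-law mass n_tx ** (-exponent),
    normalized over the counts below the cutoff.
    """
    counts = range(1, cutoff)
    weights = [n ** (-exponent) for n in counts]
    weighted = sum(w * survival_probability(p_modify, n) for n, w in zip(counts, weights))
    return weighted / sum(weights)


P_MODIFY = 0.02  # hypothetical probability that one transaction changes the fingerprint
EXPONENT = 2.0   # hypothetical power-law exponent

averages = {c: average_survival(P_MODIFY, EXPONENT, c) for c in (5, 10, 100, 1000)}
for cutoff, value in averages.items():
    print(cutoff, round(value, 4))
```

Lower cutoffs exclude the heavy tail of very active addresses, so the average survival probability is higher for them, in line with the trend described above.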
8.1 Danaan-gift attack for confidential transaction overlays
We foresee that a prominent application of Danaan-gift attacks in Ethereum might be linking confidential transactions in privacy-enhancing overlays like the AZTEC protocol [61].
In a confidential transaction overlay, users can convert public amounts into confidential notes. Subsequently, they can send confidential notes to intended recipients by splitting and/or joining their confidential notes. The amount held in a confidential note is hidden, yet publicly verifiable by applying range proofs. Users can also convert their confidential tokens back to public amounts.
In this scenario, an adversary can fingerprint unsuspecting users inside a confidential transaction overlay, see Figure 25. Whenever a user deposits a public amount to the confidential asset pool, an adversary could fingerprint her account. Subsequently, the user might issue several confidential transactions in this privacy-enhanced overlay. If the victim’s balance fingerprint survives the issued confidential transactions, the adversary can observe when the user withdraws funds from the confidential asset pool by inspecting the fingerprint on the withdrawn amount. Thus the fingerprinting adversary can assess how much money the unsuspecting user spent inside the confidential asset pool.
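The scenario can be sketched with a toy model. The class below is not the AZTEC protocol or its API: it only tracks hidden per-owner totals instead of notes and range proofs, and exposes deposits and withdrawals in a public log. The names, amounts, and the eight-digit fingerprint are assumptions made for illustration.

```python
ETHER = 10**18
FP_MOD = 10**8  # assumed fingerprint: last eight decimal digits of an amount in wei


def fingerprint(amount_wei: int) -> int:
    return amount_wei % FP_MOD


class ToyConfidentialPool:
    """Toy overlay: internal transfers are hidden, deposits and withdrawals are public."""

    def __init__(self):
        self._hidden = {}     # owner -> confidential total (invisible to observers)
        self.public_log = []  # (kind, amount) pairs visible on-chain

    def deposit(self, owner, amount):
        self._hidden[owner] = self._hidden.get(owner, 0) + amount
        self.public_log.append(("deposit", amount))

    def transfer(self, sender, recipient, amount):
        if self._hidden.get(sender, 0) < amount:
            raise ValueError("insufficient confidential balance")
        self._hidden[sender] -= amount
        self._hidden[recipient] = self._hidden.get(recipient, 0) + amount
        # Nothing is logged publicly: the amount stays confidential.

    def withdraw(self, owner):
        amount = self._hidden.pop(owner)
        self.public_log.append(("withdraw", amount))
        return amount


def link_by_fingerprint(public_log, tag):
    """Pair every public deposit and withdrawal whose amount carries the tag."""
    deposits = [a for kind, a in public_log if kind == "deposit" and fingerprint(a) == tag]
    withdrawals = [a for kind, a in public_log if kind == "withdraw" and fingerprint(a) == tag]
    return [(d, w) for d in deposits for w in withdrawals]


TAG = 31_415_926  # hypothetical fingerprint planted by the adversary before the deposit
pool = ToyConfidentialPool()
pool.deposit("victim", 5 * ETHER + TAG)
pool.deposit("other", 3 * ETHER)
pool.transfer("victim", "merchant", 2 * ETHER)  # round amounts keep the tag intact
pool.transfer("victim", "other", ETHER // 2)
pool.withdraw("other")
pool.withdraw("victim")

links = link_by_fingerprint(pool.public_log, TAG)
spent = [deposit - withdrawal for deposit, withdrawal in links]
print(links, spent)
```

In this run the adversary links the victim's deposit to her withdrawal and learns that 2.5 ether were spent inside the pool, although no individual confidential transfer was ever visible.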
9 Future directions
We expect that more potent deanonymization tools and techniques will emerge in the near future. In this work, we used only on-chain data for deanonymizing Ethereum users. Subsequent tools will likely combine on-chain and off-chain data. We therefore deem the following directions extremely valuable for future work in the broader cryptocurrency research community.
9.1 Further quasi-identifiers
In this work we identified several quasi-identifiers of Ethereum accounts, such as time-of-day activity, gas price profile and position in the Ethereum transaction graph. However, we forecast that many more quasi-identifiers can be used for further profiling and deanonymizing Ethereum users. One such potential quasi-identifier is the wallet fingerprint. One could establish which wallet a certain user employs by assessing how transaction gas prices are calculated, since different wallet software uses different methods to compute suggested gas prices [59].
9.2 Network-level privacy
Ethereum’s privacy provisions can only be fully assessed if one considers the full life cycle of a transaction. Specifically, one also needs to understand how much privacy is lost when users interact with full nodes or wallet providers.
As the history of Bitcoin and other cryptocurrencies showed, full nodes and wallet providers can deanonymize regular users and light clients already on the network layer [6, 7, 18, 19, 54, 33]. An attacker could establish many well-connected nodes in the peer-to-peer layer to log the timing information of transactions. Due to the symmetry of broadcast, the adversary could infer the origin of the transaction [19, 6]. So far, only measurement studies exist on Ethereum’s P2P network structure [26, 20]. Therefore, it would be worthwhile to study Ethereum’s P2P network from a privacy point of view. Fortunately, several proposals have been made to enhance network-level privacy for cryptocurrencies [8, 17].
Additionally, in Ethereum, special nodes called relayers are gaining popularity. Relayers allow senders to issue feeless transactions, i.e. users can send transactions from addresses that do not hold ether yet. Such relayer nodes can also easily deanonymize their users. This is especially problematic in the case of non-custodial mixers, like Tornado Cash.
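A simple baseline for such timing-based inference, often called a first-spy estimator, attributes a transaction to the peer from which the adversary's nodes heard it first. The sketch below assumes the adversary has already collected `(timestamp, relaying_peer)` observations for a single transaction; the peer labels and timestamps are hypothetical.

```python
def first_spy_guess(observations):
    """Guess the origin of a transaction as the peer that relayed it earliest.

    `observations` is a non-empty iterable of (timestamp_seconds, relaying_peer)
    pairs logged by the adversary's listening nodes for one transaction.
    """
    _timestamp, peer = min(observations, key=lambda obs: obs[0])
    return peer


# Hypothetical log entries for a single transaction hash.
observations = [
    (1000.250, "peer-C"),
    (1000.120, "peer-A"),
    (1000.310, "peer-B"),
]
guess = first_spy_guess(observations)
print(guess)
```

The guess is only as good as the adversary's connectivity: the earliest relaying peer may simply be a well-connected neighbor of the true origin.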
9.3 Wallet and Browser Privacy
It has been shown how online trackers and cookies can aid the deanonymization of cryptocurrency users even when their coins were mixed through a mixer [21]. Many users of the Ethereum blockchain use MetaMask, a browser extension available in most desktop browsers. For future research, it would therefore be valuable to analyze how the use of this extension affects the privacy of Ethereum users, even when they use mixers. It may be possible to use the techniques presented in  to deanonymize users. Furthermore, as many Ethereum users also use mobile wallets, it may be useful to investigate how mobile phones affect cryptocurrency users’ privacy and to assess the privacy guarantees of these mobile wallet providers .
9.4 Privacy of UTXO-based cryptocurrencies
We note that the deanonymizing power of quasi-identifiers (e.g. temporal activity, wallet fingerprints etc.) also applies to UTXO-based cryptocurrencies, even though deanonymization is slightly more involved in that case, as one needs to apply our techniques not to individual addresses but rather to clusters of UTXOs. We foresee that more potent agencies can and will engage in such deanonymization campaigns. We believe that in practice, due to the aforementioned quasi-identifiers, Bitcoin non-custodial mixers also provide drastically less privacy and fungibility than the community currently expects from those privacy-enhancing technologies.
We thank Daniel A. Nagy, David Hai Gootvilig, Domokos M. Kelen and Kobi Gurkan for conversations and useful suggestions. This work was supported by Project 2018-1.2.1-NKP-00008: Exploring the Mathematical Foundations of Artificial Intelligence and by the “Big Data - Momentum” grant of the Hungarian Academy of Sciences.
-  (2018) Learning role-based graph embeddings. In StarAI workshop, IJCAI 2018, pp. 1–8. Cited by: §6.4.
-  (2013) Evaluating user privacy in bitcoin. In International Conference on Financial Cryptography and Data Security, pp. 34–51. Cited by: §1.
-  (2007) Wherefore art thou r3579x? anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th international conference on World Wide Web, pp. 181–190. Cited by: §5.
-  (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), pp. 585–591. Cited by: §6.3.
-  (2019) Privacy aspects and subliminal channels in zcash. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 1813–1830. Cited by: 3rd item, §1, §8, §8, §9.3.
-  (2018) Deanonymization of hidden transactions in zcash. Cited by: §1, §9.2.
-  (2019) Security and privacy of mobile wallet users in bitcoin, dash, monero, and zcash. Pervasive and Mobile Computing 59, pp. 101030. Cited by: §1, §9.2.
-  (2017) Dandelion: redesigning the bitcoin network for anonymity. Proceedings of the ACM on Measurement and Analysis of Computing Systems 1 (1), pp. 1–34. Cited by: §9.2.
-  (2014) Mixcoin: anonymity for bitcoin with accountable mixes. In International Conference on Financial Cryptography and Data Security, pp. 486–504. Cited by: §1.
-  (2019) Zether: towards privacy in a smart contract world.. IACR Cryptology ePrint Archive 2019, pp. 191. Cited by: §1.
-  (2015) GraRep: learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, New York, NY, USA, pp. 891–900. Cited by: §6.3.
-  (2019) PGC: pretty good decentralized confidential payment system with auditability. Note: Cryptology ePrint Archive, Report 2019/319, https://eprint.iacr.org/2019/319. Cited by: §1.
-  (2019) FloodXMR: low-cost transaction flooding attack with monero’s bulletproof protocol.. IACR Cryptology ePrint Archive 2019, pp. 455. Cited by: §1.
-  (2013) Traveling the silk road: a measurement analysis of a large anonymous online marketplace. In Proceedings of the 22nd international conference on World Wide Web, pp. 213–224. Cited by: §1.
-  (2002) Towards measuring anonymity. In International Workshop on Privacy Enhancing Technologies, pp. 54–68. Cited by: §5, §5.
-  (2018) Learning structural node embeddings via diffusion wavelets. In International ACM Conference on Knowledge Discovery and Data Mining (KDD), Vol. 24. Cited by: §6.3.
-  (2018) Dandelion++ lightweight cryptocurrency networking with formal anonymity guarantees. Proceedings of the ACM on Measurement and Analysis of Computing Systems 2 (2), pp. 1–35. Cited by: §9.2.
-  (2017) Anonymity properties of the bitcoin p2p network. arXiv preprint arXiv:1703.08761. Cited by: §9.2.
-  (2017) Deanonymization in the bitcoin p2p network. In Advances in Neural Information Processing Systems, pp. 1364–1373. Cited by: §9.2.
-  (2018) Decentralization in bitcoin and ethereum networks. In International Conference on Financial Cryptography and Data Security, pp. 439–457. Cited by: §9.2.
-  (2017) When the cookie meets the blockchain: privacy risks of web payments via cryptocurrencies. Proceedings on Privacy Enhancing Technologies 2018, pp. 179 – 199. Cited by: §9.3.
-  (2016) On the size of pairing-based non-interactive arguments. In Annual international conference on the theory and applications of cryptographic techniques, pp. 305–326. Cited by: §7.
-  (1982) The meaning and use of the area under a receiver operating characteristic (roc) curve.. Radiology 143 (1), pp. 29–36. Cited by: §5.
-  (2018) An empirical analysis of anonymity in zcash. In 27th USENIX Security Symposium (USENIX Security 18), pp. 463–477. Cited by: §1.
-  (2018) Analyzing ethereum’s contract topology. In Proceedings of the Internet Measurement Conference 2018, pp. 494–499. Cited by: §2.
-  (2018) Measuring ethereum network peers. In Proceedings of the Internet Measurement Conference 2018, pp. 91–104. Cited by: §9.2.
-  (2018) Deanonymisation in ethereum using existing methods for bitcoin. Cited by: §2.
-  (2018) Multi-level network embedding with boosted low-rank matrix approximation. CoRR abs/1808.08627. Cited by: §6.3.
-  (2019-10) Exploring ethereum’s blockchain anonymity using smart contract code attribution. Cited by: §2.
-  (2017) . IEEE Access 6, pp. 10139–10150. Cited by: §5.
-  (2018) Möbius: trustless tumbling for transaction privacy. Proceedings on Privacy Enhancing Technologies 2018 (2), pp. 105–121. Cited by: §1, §3.1, §7.
-  (2013) A fistful of bitcoins: characterizing payments among men with no names. In Proceedings of the 2013 conference on Internet measurement conference, pp. 127–140. Cited by: §1, §1, §3.1.
-  (2018) An empirical analysis of traceability in the monero blockchain. Proceedings on Privacy Enhancing Technologies 2018 (3), pp. 143–163. Cited by: §1, §9.2.
-  (2019) Bitcoin: a peer-to-peer electronic cash system. Technical report Manubot. Cited by: §1, §1.
-  (2011) Link prediction by de-anonymization: how we won the kaggle social network challenge. In The 2011 International Joint Conference on Neural Networks, pp. 1825–1834. Cited by: §5.
-  (2008) Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (sp 2008), pp. 111–125. Cited by: §5, §5.
-  (2009) De-anonymizing social networks. In 2009 30th IEEE symposium on security and privacy, pp. 173–187. Cited by: §5.
-  Inside chainalysis’ multimillion-dollar relationship with the us government. Cited by: §1.
-  (2017) Automated labeling of unknown contracts in ethereum. In 2017 26th International Conference on Computer Communication and Networks (ICCCN), pp. 1–6. Cited by: §2.
-  (2016) Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 1105–1114. Cited by: §6.3.
-  (2017) Characterizing the ethereum address space. Dec. Cited by: §2.
-  (2014) DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, pp. 701–710. Cited by: §6.3.
-  (2016) Walklets: multiscale graph embeddings for interpretable network classification. CoRR abs/1605.02115. Cited by: §6.3.
-  (2017) Network embedding as matrix factorization: unifying deepwalk, line, pte, and node2vec. CoRR abs/1710.02971. Cited by: §6.3.
-  (2020) . Cited by: §6.3.
-  (2018) Fast sequence based embedding with diffusion graphs. In International Conference on Complex Networks, pp. 99–107. Cited by: §6.3, §6.4.
-  (2014) Coinshuffle: practical decentralized coin mixing for bitcoin. In European Symposium on Research in Computer Security, pp. 345–364. Cited by: §1.
-  (1998) Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Cited by: §7.
-  (2019) Mixeth: efficient, trustless coin mixing service for ethereum. In International Conference on Blockchain Economics, Security and Protocols (Tokenomics 2019), Cited by: §1, §3.1, §7.
-  (2002) Towards an information theoretic metric for anonymity. In International Workshop on Privacy Enhancing Technologies, pp. 41–53. Cited by: §5.
-  (2019) ShareLock: mixing for cryptocurrencies from multiparty ecdsa. Cryptol. ePrint Arch., Tech. Rep 563, pp. 2019. Cited by: §1, §3.1, §7.
-  (2014) Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6201–6205. Cited by: §6.3.
-  (2019) PING and reject: the impact of side-channels on zcash privacy. Cited by: §1.
-  (2020) Remote side-channel attacks on anonymous transactions. Cited by: §9.2.
-  (2015) Blindcoin: blinded, accountable mixes for bitcoin. In International Conference on Financial Cryptography and Data Security, pp. 112–126. Cited by: §1.
-  (2019) Measuring ethereum-based erc20 token networks. In International Conference on Financial Cryptography and Data Security, pp. 113–129. Cited by: §2.
-  Address clustering heuristics for ethereum. Cited by: §2.
-  (2018) Technical privacy metrics: a systematic survey. ACM Computing Surveys (CSUR) 51 (3), pp. 1–38. Cited by: §5.
-  (2020) Step on the gas? a better approach for recommending the ethereum gas price. arXiv preprint arXiv:2003.03479. Cited by: §9.1.
-  (2018) Miximus: zksnark-based trustless mixing for ethereum. Note: github, https://github.com/barryWhiteHat/miximus. Cited by: §3.1.
-  (2018) The aztec protocol. URL: https://github.com/AztecProtocol/AZTEC. Cited by: §1, §8.1.
-  (2014) Ethereum: a secure decentralised generalised transaction ledger. Ethereum project yellow paper 151 (2014), pp. 1–32. Cited by: Appendix A.
-  (2019) NodeSketch: highly-efficient graph embeddings via recursive sketching. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’19, New York, NY, USA, pp. 1162–1172. Cited by: §6.3.
-  (2015) Coinparty: secure multi-party mixing of bitcoins. In Proceedings of the 5th ACM Conference on Data and Application Security and Privacy, pp. 75–86. Cited by: §1.
Appendix A Ethereum basics
Ethereum is a cryptocurrency built on top of a blockchain [62]. There are two types of accounts in Ethereum: externally owned accounts (EOAs) and contract accounts, also known as smart contracts. The global state of the system consists of the state of all different accounts. EOAs are controlled by an asymmetric cryptographic key pair, while smart contracts are controlled by their code stored in persistent, immutable storage. EOAs can issue transactions, which might alter the global state. Transactions can either create a new contract account or call existing accounts. Account balances are held in ether, the native currency of Ethereum, and are denominated in wei, where 1 ether $= 10^{18}$ wei.
Calls to EOAs can transfer ether to the callee, while contract calls execute the code associated with the smart contract. The contract execution might alter the storage of the account and can also call other accounts; such calls are called internal transactions. Contract code is executed in the Ethereum Virtual Machine (EVM).
A.1 Gas mechanism
A crucial aspect of the EVM is the gas mechanism. Every EVM opcode is assigned a gas amount, which is meant to price the computational complexity of that opcode. For instance, adding two elements on top of the stack consumes only 3 gas, but storing a non-zero stack element in the persistent storage burns 20,000 gas. The base gas fee for every transaction is 21,000 gas, which is not paid for internal transactions. Therefore, whenever one executes smart contract code in the EVM, the execution consumes a certain amount of gas. For each transaction, the sender needs to define the maximum amount of gas, called the gas limit, that they allow their transaction to consume. Usually, due to the dynamic nature of the state, one does not know statically how much gas a transaction will burn. If a transaction does not consume all the gas assigned to it, then the surplus gas is refunded to the caller; however, if a transaction runs out of gas, then all state changes are reverted and the assigned gas is taken from the caller.
As of now, gas can only be purchased with Ethereum’s native currency, ether, at a dynamically changing price, called the gas price. Miners are naturally incentivised to include transactions with higher gas prices in their blocks to increase their collected transaction fees.
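The refund and out-of-gas rules can be summarized in a small sketch. It is a simplification under stated assumptions: the function name is hypothetical, the gas actually needed by the execution is passed in as a known number, and other protocol details (for example storage-clearing refunds) are ignored.

```python
GWEI = 10**9  # 1 gwei = 10^9 wei


def settle_fee(gas_limit: int, gas_needed: int, gas_price_wei: int):
    """Return (succeeded, fee_wei) for a transaction under the simplified rules.

    - If the execution fits within the gas limit, only the consumed gas is
      charged and the surplus is refunded to the caller.
    - If the execution runs out of gas, state changes are reverted and the
      whole assigned gas is charged.
    """
    if gas_needed > gas_limit:
        return False, gas_limit * gas_price_wei
    return True, gas_needed * gas_price_wei


# Hypothetical contract call that needs 60,000 gas at a gas price of 20 gwei.
ok_result = settle_fee(gas_limit=100_000, gas_needed=60_000, gas_price_wei=20 * GWEI)
out_of_gas_result = settle_fee(gas_limit=50_000, gas_needed=60_000, gas_price_wei=20 * GWEI)
print(ok_result, out_of_gas_result)
```

With a sufficient limit the caller pays for 60,000 gas only; with the lower limit the call fails and the full 50,000 gas is still paid.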