The Bitcoin transaction graph is a public data structure organized as transactions between addresses, each associated with a logical entity. In this work, we introduce a complete probabilistic model of the Bitcoin Blockchain. We first formulate a set of conditional dependencies induced by the Bitcoin protocol at the block level and derive a corresponding fully observed graphical model of a Bitcoin block. We then extend the model to include hidden entity attributes such as the functional category of the associated logical agent and derive asymptotic bounds on the privacy properties implied by this model. At the network level, we show evidence of complex transaction-to-transaction behavior and present a relevant discriminative model of the agent categories. Performance of both the block-based graphical model and the network-level discriminative model is evaluated on a subset of the public Bitcoin Blockchain.
Analysis of the Bitcoin Blockchain is an area of intense activity [20, 1], and one which has witnessed an explosion of interest as the value of the Bitcoin cryptocurrency has skyrocketed. Research areas include explorations of address clustering techniques to identify logical agents [11, 21, 7] and de-anonymization using side-channel attacks [8, 13].
An understanding of the properties of Bitcoin transactions is paramount to the legitimation of the cryptocurrency economy; it constitutes a building block to the conception of effective and adequate regulations, and to the design of novel and integrated services benefiting society as a whole.
As of , with more than million address nodes, the Bitcoin graph is comparable in size to a large social network. Yet while probabilistic models of social networks have received considerable attention, from community detection to diffusion models and influence maximization, to probabilistic graph modeling, probabilistic models of the Bitcoin Blockchain network have not.
Bitcoin transactions are tantamount to a partially observed social network, within which participants can have multiple seemingly independent aliases. This distinguishes our work from classical studies on partially observed social networks, typically focused on partial observations of interactions due to sampling, and makes it closer to the vast body of work on entity resolution [31, 4].
A second challenge associated with modeling the Bitcoin Blockchain transaction network consists of capturing the complexity of the hidden structure associated with entity transactions, together with the fine-grained block-level specificities implied by the Bitcoin protocol. In particular, Bitcoin is based on an unspent transaction output (UTXO) model, which distinguishes suitable Bitcoin Blockchain models from prior studies on credit card transactions [6, 18], since the proper generative structure needs to account for the underlying UTXO creation and deletion process.
In this work, we propose a first attempt at a comprehensive model of the Bitcoin transaction graph using a hybrid generative-discriminative model attempting to draw strengths from both approaches. We first define pragmatic conditional independence assumptions underlying the Bitcoin protocol, and formulate a generative model of the Bitcoin Blockchain block. In this context, we analyze the revealed entity behavior, both theoretically and from a data perspective. We then turn to network-level modeling, present a discriminative model of transaction-to-transaction behavior, and analyze the associated medium-term categorical agent behavior.
A Bitcoin transaction consists of a set of input addresses transferring BTC to a set of output addresses. More specifically, in the context of a transaction, each input address contributes a possibly fractional subset of its UTXOs to the creation of the set of UTXOs associated with output addresses, for the same total amount (minus a fee). Each UTXO is associated with an address, and each address is associated with a logical agent, who may hold an arbitrary number of addresses, see Figure 1.
We embed the Bitcoin Blockchain transaction graph in a directed bipartite graph structure, with the following vertex and edge features:
address vertex: number of UTXOs and out-degree,
transaction vertex: transaction value and fee,
directed address-transaction edge: outgoing value from the address via the transaction,
directed transaction-address edge: incoming value to the address via the transaction.
Since the Bitcoin protocol specifies that transactions should be validated in blocks and the proof-of-work consensus protocol incentivizes validators to agree on a single block-chain, we ignore transient disagreements and assume a discrete-time simple path structure of blocks.
We propose a stationary graphical model of a Bitcoin Blockchain block. First, we develop a fully observable block-transaction, address (BT-A) model, illustrated in Figure 2, which we then augment with entity attributes into a block-transaction, entity-address (BT-EA) model with more complex structure.
A block is composed of the set of transactions validated by the peer node that solved the cryptographic challenge the fastest. With the approximation of stationary inter-block time, and assuming independence between the ability to solve the cryptographic challenge and the selection of transactions, we model the number of transactions per block as a Poisson random variable. Similarly, assuming stationary and independent address usage, we model the number of input addresses and the number of output addresses per transaction as Poisson random variables.
where is the normalized Poisson distribution on .
On the receiving end of a transaction, it is possible to generate a new address. In the Bitcoin pseudonymous context, this reduces the traceability of the full set of transactions associated with an entity. Considering the set of output addresses as a whole, we model the conditional distribution of the number of new addresses given the number of output addresses as a Binomial random variable.
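The count variables described so far can be sampled with a few lines of standard-library Python. This is a minimal sketch rather than the authors' implementation: the function names and parameter values are hypothetical, and restricting the Poisson draws to strictly positive integers is one possible reading of the "normalized" Poisson distribution, not something the text confirms.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's method (suitable for rates well below ~700)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_positive_poisson(rate, rng):
    """Poisson restricted to {1, 2, ...} by rejection (assumed support)."""
    while True:
        value = sample_poisson(rate, rng)
        if value >= 1:
            return value


def sample_binomial(trials, prob, rng):
    """Number of successes among `trials` independent Bernoulli(prob) draws."""
    return sum(rng.random() < prob for _ in range(trials))


def sample_block_shape(tx_rate, in_rate, out_rate, new_addr_prob, rng):
    """Return one (n_inputs, n_outputs, n_new_outputs) triple per transaction."""
    shape = []
    for _ in range(sample_positive_poisson(tx_rate, rng)):
        n_in = sample_positive_poisson(in_rate, rng)
        n_out = sample_positive_poisson(out_rate, rng)
        # New addresses are a Binomial draw conditioned on the number of outputs.
        n_new = sample_binomial(n_out, new_addr_prob, rng)
        shape.append((n_in, n_out, n_new))
    return shape


rng = random.Random(0)
block = sample_block_shape(tx_rate=5.0, in_rate=2.0, out_rate=2.0,
                           new_addr_prob=0.5, rng=rng)
```

Each triple only fixes the shape of a transaction; which addresses and UTXOs fill those slots is decided by the attachment and UTXO models.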
In the interest of a tractable inference procedure, and in the absence of an informative prior, we focus in this work on maximum-likelihood estimation and assume a uniform prior.
We now proceed to describe the generative model of the input and output addresses. A natural choice for the generative hierarchical model is the LDA or Dirichlet-multinomial model used in topic modeling [5, 24]. Here, given the full observability of the model variables and decomposability of the likelihood, motivated by topological social network analysis, we use the Albert-Barabasi preferential attachment model, which can be seen as the posterior probability of an LDA model in the appropriate feature space.
Specifically, we consider that the probability that a given address is selected as an input address is proportional to the number of available UTXOs of that address. The model reads as follows.
where is the set of available addresses, is the number of unspent outputs of the address.
The output address model is similar, except that the attachment model is now considered a function of the out-degree of the address, i.e. while the inclination of the address to be part of the inputs (i.e. to spend) is considered to be a function of the number of UTXOs it has still available, the inclination of the address to be part of the outputs (i.e. to accumulate) is considered to be a function of the number of distinct UTXOs it has already spent.
where denotes a new address.
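The two attachment rules can be illustrated with weighted sampling. The sketch below is illustrative only: `sample_addresses` and the example dictionaries are hypothetical, and drawing distinct addresses without replacement is an added assumption. New output addresses are not handled here, since the text models them separately.

```python
import random


def sample_addresses(weights, count, rng):
    """Draw up to `count` distinct addresses with probability proportional to weight."""
    remaining = {addr: w for addr, w in weights.items() if w > 0}
    chosen = []
    for _ in range(min(count, len(remaining))):
        addrs = list(remaining)
        pick = rng.choices(addrs, weights=[remaining[a] for a in addrs], k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # assumption: an address appears at most once per side
    return chosen


rng = random.Random(1)
utxo_count = {"a1": 5, "a2": 1, "a3": 0}   # weights for the input side
out_degree = {"a1": 2, "a2": 7, "a3": 1}   # weights for the output side

inputs = sample_addresses(utxo_count, 2, rng)    # "a3" has no UTXO, so it cannot spend
outputs = sample_addresses(out_degree, 1, rng)
```

The same helper serves both sides of the transaction because only the weight changes: available UTXOs for inputs, out-degree for outputs.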
For each input address, since empirically we observe that the distribution is concentrated around , we model the conditional distribution of the number of UTXOs used given the number of UTXOs available as a geometric random variable with uniform prior. We then draw the UTXOs uniformly from the available set.
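As an illustration of this step, the sketch below draws the number of spent UTXOs from a geometric distribution on {1, 2, ...} and then samples that many UTXOs uniformly without replacement. The function names are hypothetical, and capping the draw at the number of available UTXOs is an assumed way of conditioning on availability.

```python
import math
import random


def sample_geometric(prob, rng):
    """Geometric on {1, 2, ...}: trials up to and including the first success."""
    if prob >= 1.0:
        return 1
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return max(1, math.ceil(math.log(u) / math.log(1.0 - prob)))


def select_spent_utxos(available, prob, rng):
    """Choose how many UTXOs to spend (capped at availability), then pick them uniformly."""
    if not available:
        return []
    n_spent = min(sample_geometric(prob, rng), len(available))
    return rng.sample(available, n_spent)


rng = random.Random(2)
utxos = [("tx%d" % i, 0) for i in range(4)]   # hypothetical (txid, output index) pairs
spent = select_spent_utxos(utxos, prob=0.7, rng=rng)
```

A success probability close to 1 concentrates the draw on a single UTXO per address.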
We obtain the total transaction value as the sum of the input UTXOs.
A fee is paid to the miners to reward their validation work, and higher fees may nudge their selection of transactions when creating blocks. We thus model the fee associated with a Bitcoin transaction as a normalized Gaussian distribution. The number of output UTXOs and their values are modeled similarly to the input UTXOs.
where denotes the Gaussian distribution normalized over the interval , and where denotes the normalized uniform distribution (the are also normalized in order to sum to ).
We now turn to a more complex variant of the proposed model meant to capture categorical behavior of the unobserved entities transacting on the Blockchain.
An entity is associated with a Bitcoin user and fully characterized by a set of addresses. In this section we extend the BT-A model to take into account categorical entity behavior. We assume that entities belong to different categories, with potentially different behaviors.
We first model the fact that the hyper-parameters associated with the number of input and output addresses depend on the category of the associated entity. Similarly, the parameter associated with the number of new addresses in the output, and the parameters associated with the number of UTXOs in the input and output, are category-dependent.
This dependency structure intending to capture the behavior of distinct categories of entities is illustrated in Figure 3.
We assume a known dependency structure and estimate the model parameters. Since the prior is decomposable over nodes, and since all variables are observed in the BT-A model, the MLE inference amounts to local computation over each node and its parents.
Regarding the BT-EA model, while the hidden entity variables make the inference more complex in general, here we assume that a separate heuristic such as the multi-input heuristic allows associating each address with an entity, hence the inference process over the labeled set reduces to the scalable process used for the BT-A model.
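Because every variable is observed, each local maximum-likelihood estimate has a closed form. The following sketch shows the standard estimators for the three count distributions used in the model, plus a per-category fit in the spirit of the BT-EA model once addresses are labeled. Function names are hypothetical, and the Poisson estimator shown is for the untruncated distribution, which is a simplifying assumption.

```python
from collections import defaultdict


def poisson_rate_mle(counts):
    """MLE of an (untruncated) Poisson rate: the sample mean."""
    return sum(counts) / len(counts)


def binomial_prob_mle(successes, trials):
    """MLE of a Binomial success probability: total successes over total trials."""
    return sum(successes) / sum(trials)


def geometric_prob_mle(counts):
    """MLE of a geometric parameter on {1, 2, ...}: inverse of the sample mean."""
    return len(counts) / sum(counts)


def fit_by_category(records):
    """records: iterable of (category, n_inputs) pairs; returns one rate per category."""
    grouped = defaultdict(list)
    for category, n_inputs in records:
        grouped[category].append(n_inputs)
    return {category: poisson_rate_mle(values) for category, values in grouped.items()}


# Hypothetical toy observations, for illustration only.
rates = fit_by_category([("Exchange", 1), ("Exchange", 3), ("Mining Pool", 4)])
```

Each estimator touches only the counts of one node and its parents, which is what makes the procedure scale to the full graph.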
In this section we present an analysis of address re-use behavior in the context of the probabilistic model introduced in the previous section, as well as implications of these results for Bitcoin transaction anonymity.
We model an attacker, attempting to identify the full set of addresses associated with an entity . We assume that the attacker uses the standard multi-input heuristic , which associates the full set of address inputs for each transaction to a single entity and applies transitive closure. From the perspective of the external attacker, the true set of addresses of an entity is partitioned into aliases, a-priori seen as distinct entities;
where denotes the address set associated with alias of entity . In this setting, when participating in a transaction on the input side, we consider that the targeted entity selects addresses from its available set following a generic multinomial distribution with parameters , which includes the special case for which the alias distribution is a linear function of .
This models the typical Bitcoin user who, while concerned about privacy, is not particularly careful about address selection: the user maintains multiple distinct aliases with distinct address sets, but sometimes mixes addresses from different aliases in the same transaction input, leading to a privacy collapse.
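The multi-input heuristic with transitive closure is commonly implemented with a union-find structure. The sketch below is one such implementation, with hypothetical names and toy address labels; it is not taken from the authors' code.

```python
def cluster_addresses(transactions):
    """Group addresses that co-occur as inputs, closing the relation transitively.

    transactions: iterable of lists of input addresses.
    Returns a list of address sets, one per inferred entity (alias).
    """
    parent = {}

    def find(addr):
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving
            addr = parent[addr]
        return addr

    for inputs in transactions:
        if not inputs:
            continue
        root = find(inputs[0])
        for addr in inputs[1:]:
            parent[find(addr)] = root  # merge the two clusters

    clusters = {}
    for addr in list(parent):
        clusters.setdefault(find(addr), set()).add(addr)
    return list(clusters.values())


# "a" and "c" never share a transaction, yet end up linked through "b".
clusters = cluster_addresses([["a", "b"], ["b", "c"], ["d"], ["e", "f"]])
```

A single transaction mixing addresses from two aliases is enough to merge their clusters, which is the privacy collapse described above.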
Given the multi-input heuristic, it is indeed sufficient for an attacker to observe two addresses from distinct aliases and to associate these two aliases to the same entity using the multi-input heuristic. Formally, upon observing the input addresses from a transaction associated with input entity , the attacker is able to associate the following address set with entity :
In the following for simplicity we consider a one-step iteration and assume that the attacker is only aware of the set of addresses associated with alias . In this sense the control parameter plays the role of 1- from the BT-EA model. We analyze the number of addresses from entity that the attacker is able to discover after seeing the addresses involved in transaction, expressed as:
We can express the number of discovered addresses as a function of the alias addresses selection probabilities .
By definition of , we have
Let be the second factor in the summation term, by marginalizing over
and using the chain rule, we can write:
Letting denote the first factor in the summation term above, we have:
where the last equality is obtained by definition of the multinomial distribution. Similarly, since the number of input addresses follows a binomial distribution, we have:
and combining this expression with the expression of , we can simplify the expression of to finally obtain equation (9), which concludes the proof. ∎
With as the control parameter, the expression states that the attacker information gain is an exponential function of the probability of using addresses already identified (i.e. address re-use). The asymptotic behavior of a privacy-conscious user is described next.
If we have:
This result shows that the one-step information gain from the attacker is a linear function of the probability of using already-used addresses, and also linear in the number of addresses typically used as input. This result at the transaction level can be readily extended to a chain-length estimate by accounting for the probability of an entity to transact, as provided explicitly in equation (8) of the BT-EA model. We also highlight that while a low models a privacy-conscious user, the user strategy is non-adaptive, in the sense that the user does not try to adjust his strategy based on the attacker strategy.
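The qualitative behavior of the one-step information gain can be checked numerically on a simplified version of the attacker model. The sketch below is not a reconstruction of equation (9). It makes several assumptions: the number of input addresses is fixed, each input address picks its alias independently according to the multinomial alias probabilities, and the attacker links the whole address set of every alias appearing alongside the known alias. All names are hypothetical.

```python
import random


def expected_discovered(alias_probs, alias_sizes, n_inputs, known=0):
    """Closed-form expected number of newly linked addresses after one transaction."""
    p_known = alias_probs[known]
    total = 0.0
    for j, (p_j, size_j) in enumerate(zip(alias_probs, alias_sizes)):
        if j == known:
            continue
        # Inclusion-exclusion: P(known alias and alias j both appear among the inputs).
        p_neither = max(0.0, 1.0 - p_known - p_j) ** n_inputs
        p_both = 1.0 - (1.0 - p_known) ** n_inputs - (1.0 - p_j) ** n_inputs + p_neither
        total += size_j * p_both
    return total


def simulate_discovered(alias_probs, alias_sizes, n_inputs, n_runs, rng, known=0):
    """Monte Carlo estimate of the same quantity."""
    aliases = list(range(len(alias_probs)))
    total = 0
    for _ in range(n_runs):
        used = set(rng.choices(aliases, weights=alias_probs, k=n_inputs))
        if known in used:
            total += sum(alias_sizes[j] for j in used if j != known)
    return total / n_runs


probs, sizes = [0.1, 0.5, 0.4], [3, 10, 20]   # alias 0 is the one known to the attacker
exact = expected_discovered(probs, sizes, n_inputs=3)
approx = simulate_discovered(probs, sizes, n_inputs=3, n_runs=50000,
                             rng=random.Random(3))
```

Under these assumptions, when the probability of re-using the known alias is small, the closed form is approximately proportional to that probability and grows with the number of inputs, in line with the asymptotic statement above.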
We now consider the behavior of entities across transactions, and assume that entity categories exhibit different behaviors. Given the lack of a-priori underlying modeling structure to this behavior, and given the combinatorial nature of such behavior, we propose a discriminative framework in which model selection can be carried out more efficiently based on a possibly large set of relevant features. We rely on the classical multi-input heuristic for defining entities, and formulate a decision-tree based classification problem in the following feature space.
We consider the following five feature classes: address features, entity features, temporal features, graph centrality metric features, and motif features. For continuous features, we explicitly consider the feature mean and standard deviation.
Address-specific features include attributes such as the total BTC received, the total BTC balance, the number of input/output transactions, etc. Analogous features are defined at the entity level as well as the number and proportion of Coinbase transactions (indicative of BTC creations).
Temporal features include the number of weeks, months, and years of activity, the number of entities traded with per week, month, and year, the number of receiving, sending, and receiving-and-sending days, the activity period duration, and the active day ratio.
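A few of these temporal features can be computed directly from transaction timestamps. The sketch below is illustrative: the function name, the dictionary keys, and the exact definitions (for instance, counting the activity period in calendar days inclusive of both endpoints, and using UTC days) are assumptions, since the text does not specify them.

```python
from datetime import datetime, timezone


def temporal_features(timestamps):
    """Compute activity features from a non-empty list of Unix timestamps (seconds)."""
    days = sorted({datetime.fromtimestamp(t, tz=timezone.utc).date() for t in timestamps})
    period_days = (days[-1] - days[0]).days + 1   # inclusive of first and last day
    return {
        "active_days": len(days),
        "activity_period_days": period_days,
        "active_day_ratio": len(days) / period_days,
        "active_weeks": len({tuple(d.isocalendar())[:2] for d in days}),
        "active_months": len({(d.year, d.month) for d in days}),
        "active_years": len({d.year for d in days}),
    }


# Two transactions on 2018-01-01, one on 2018-01-03, one on 2018-02-01 (UTC).
features = temporal_features([1514764800, 1514808000, 1514937600, 1517443200])
```

Per-entity versions follow by pooling the timestamps of all addresses attributed to the entity before calling the function.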
In this section we present numerical results of our probabilistic Bitcoin Blockchain model. We first describe the training procedure for the generative block model and discuss obtained model parameters. We then turn to the transaction-to-transaction discriminative model results and analyze the properties revealed by the joint analysis.
We consider the set of blocks of height inferior or equal to , corresponding to blocks created before March 24th 2018, 15:19:02, which contains about addresses. Address labels, revealing entity identifiers, are obtained from WalletExplorer https://www.walletexplorer.com/. The set of address entity label pairs used has been made available at https://github.com/Maru92/EntityAddressBitcoin.
We interact with the Blockchain via the BlockSci toolbox v.0.4.5 released on March 16th 2018 , on a 64 GB machine. The final labeled dataset used in numerical experiments consists of addresses, associated with entities representing entity categories in the following proportions:
Exchange (E): 108 entities, 7,892,587 addresses,
Service (S): 68 entities, 17,606,608 addresses,
Gambling (G): 65 entities, 2,775,810 addresses,
Mining Pool (M): 19 entities, 78,488 addresses.
When training the probabilistic model, we restrict ourselves to the period from January 1st 2016 to March 16th 2018, where overall patterns are relatively stationary. Since the proposed model is static, we do not attempt to study its ability to model transient regimes. Summary statistics are given in Table 1 and the distributions are shown in Figure 5; both show wide variability across multiple scales.
Since we consider a subset of the transaction graph, we need to model transactions originating from our subset and directed outside it, or vice-versa. We follow the proposed model structure and model the number of external output addresses as a Poisson distribution . Transactions from unknown addresses towards known input addresses are modeled with no known input and a number of transactions per block following a Poisson distribution . Coinbase transactions are created in a similar manner: no inputs, number of addresses in the outputs drawn following a Poisson distribution of parameter , with new addresses, , and several UTXOs created per addresses, .
We train the model using data from the period January 1st 2016 to March 16th 2018 consisting of about million addresses. We first verify the main independence assumption, between the number of input addresses and the number of output addresses. Since , we consider the marginal independence hypothesis validated.
The inference produces a value for both models. In Table 2 we present the model parameter results from the model training for the BT-A and BT-EA models.
The results reflect the idiosyncratic properties of Bitcoin Blockchain transactions, with for instance the need to gather UTXOs from various addresses, which is illustrated by the fact that . It is also clear from the UTXO parameters that the input parameters are more discriminative than the output parameters, which reflect transfers from other parties from the perspective of the entity concerned.
Lastly we observe significant address generation distinctions across entity categories, with Gambling and Mining Pools seemingly more privacy-conscious given their higher probability of generating new addresses. They also transact less frequently, using more input addresses. Detailed impact of entity behavior on privacy properties is analyzed subsequently.
In order to assess the model performance, we now evaluate out-of-sample model accuracy. Starting from scratch, we train the model on blocks corresponding to the period from January 1st 2017 to January 31st, 2017, and evaluate the model on blocks associated with the period from February 1st, 2017, to February 14th, 2017.
The results from Table 3 illustrate that given the multi-scale nature of the underlying distributions, the model estimates are relatively close on average, i.e. well within an order of magnitude. Furthermore, the BT-EA model significantly reduces the bias (MSE) as well as the variance (RMSE) for most categories. The Exchange category is the only one for which both bias and variance increase, suggesting a fundamental modeling limitation.
The error terms are relatively large in absolute terms for both models, which is largely explained by the inherent variance in the data, both at the population level and at the class level. Indeed, the bias is low and most of the data variance is explained, with a N-RMSE ranging between and .
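For reference, the error metrics discussed here can be computed as follows. This is a generic sketch with hypothetical names. The text does not state how the N-RMSE is normalized; dividing the RMSE by the standard deviation of the observations is an assumption, chosen because it relates directly to the share of variance left unexplained.

```python
import math


def error_metrics(observed, predicted):
    """Mean error, MSE, RMSE and RMSE normalized by the std of the observations."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_obs = sum(observed) / n
    std_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in observed) / n)
    return {
        "mean_error": sum(errors) / n,
        "mse": mse,
        "rmse": rmse,
        "nrmse": rmse / std_obs,   # assumed normalization
    }


metrics = error_metrics(observed=[1, 2, 3, 4], predicted=[1, 2, 3, 6])
```

With this normalization, a value below 1 means the model does better than predicting the observed mean everywhere.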
Given the calibrated model parameters, we now validate experimentally the theoretical privacy properties of Bitcoin Blockchain transactions expressed by equation (9). We leverage the generative model and attacker model described above to simulate transaction traces and evaluate the proportion of the addresses that are re-identified for distinct categories, as a function of the number of transactions.
Figure 6 shows agreement between the analytical results and the simulation of the block model. The figure also illustrates that Exchanges and Services are typically less privacy-conscious (lower probability of generating new addresses, frequent transactions), and hence, for an equivalent number of Blockchain transactions, typically reveal a greater proportion of their address set.
Transaction anonymity however depends also on the transaction-to-transaction behavior. Indeed, it is conceivable that certain entities, while not following best block-level practices on address re-use, hence easily identifiable as entities, could be transacting in a way that little information is gathered from their network level transaction structure. In order to assess the latter, we now turn to the numerical results of our proposed network transaction model.
We use the Python LightGBM implementation of the gradient boosted decision tree model with a training/test partition of our dataset. A Gaussian Process (GP)-based optimization procedure for hyper-parameter optimization is implemented using the Python skopt library https://scikit-optimize.github.io/ with initial parameter values obtained from a coarse random search. The learning rate hyper-parameter is optimized over the interval with early stopping after having done a random search over ; the resulting value is . The GP procedure is used with 50 iterations.
We make use in total of 10 address features, 8 entity features, 16 temporal features, 42 centrality features, 44 1-motif features, 81 2-motif features, and 114 3-motif features. We present in Table 4 the F1, Accuracy and Precision results over the entire dataset and for each category.
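The reported metrics follow the usual definitions. As a reminder of how they are obtained from predicted and true category labels, here is a small dependency-free sketch; the function name and the toy labels are hypothetical and unrelated to the values in Table 4.

```python
def classification_report(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, report


# Toy labels using the category codes E, S, G, M.
accuracy, report = classification_report(
    y_true=["E", "E", "S", "G", "M", "M"],
    y_pred=["E", "S", "S", "G", "M", "E"],
)
```

Per-class scores matter here because the categories are heavily imbalanced, so overall accuracy alone can hide poor performance on small classes such as Mining Pools.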
The results illustrate that the model captures the behavior of most entity categories very well. Furthermore, the network-level privacy analysis confirms the prior block-level analysis, with Mining Pools being the most privacy-conscious. Indeed, considering the most relevant features of the LightGBM model in a one-vs-all setting, it appears that for every category except the Mining Pool, motif features are the most informative, indicating that the LightGBM model is not able to leverage the transaction sub-graph for identification of the Mining Pool category.
Analyses of the Bitcoin protocol in the context of attacks have been proposed, for instance inference of peer-to-peer communication structure, statistical analysis of Bloom filters, and analysis of Bitcoin minting patterns with application to de-anonymization. Flow-based address-transaction graph studies can be found in [23, 12, 29]. The obfuscation of Bitcoin transaction traceability has also been considered.
Several studies have applied discriminative models to the problem of de-anonymizing Bitcoin transactions, with for instance the use of transaction-specific features in , able to achieve accuracy for classifying entities into several types. In , the authors introduce transaction paths with application to the detection of Bitcoin exchanges, and achieve greater than accuracy. Similar transaction path features are used in for a 5-class classification problem with above accuracy results.
In this work, we proposed a probabilistic model of the Bitcoin Blockchain which accounts for the complex Bitcoin protocol features. The model consists of a hierarchical structure from unspent transaction output (UTXO), to address, transaction, and block. We take into account entity modeling, including features relevant for robustness to de-anonymization attacks, namely address re-use patterns. We also propose a discriminative model of transaction-to-transaction behavior and show its effectiveness in practice.
We analyzed the accuracy of the generative model using a large Bitcoin dataset of more than million address vertices, discussed the significant block-level heterogeneity of the model parameters across entity categories, and provide a complementary analysis of transaction-to-transaction behavior using the discriminative model. We consider in particular the de-anonymization properties of certain behaviors, which is one of the main focus areas of Bitcoin studies.
Extensions of this work may include the design of more complex graphical models including latent variables for modeling transaction intent, and shared side-information across entities, inducing multivariate preferential attachment. A significant challenge for such models with more complex dependency structure and hidden variables is the design of a tractable training and inference procedure given the large-scale nature of such public cryptocurrency transaction graphs.
Journal of Machine Learning Research, 3:993–1022, 2003.
International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2015.
Danny Yuxing Huang, Maxwell Matthaios Aliapoulios, Vector Guo Li, Luca Invernizzi, Elie Bursztein, Kylie McRoberts, Jonathan Levin, Kirill Levchenko, Alex C Snoeren, and Damon McCoy. Tracking ransomware end-to-end. In 2018 IEEE Symposium on Security and Privacy (SP), pages 618–631. IEEE, 2018.
2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 537–546, Oct 2016.