Deciphering Bitcoin Blockchain Data by Cohort Analysis

02/27/2021 ∙ by Yulin Liu, et al.

Bitcoin is a peer-to-peer electronic payment system that has rapidly gained popularity in recent years. Acquiring variables with economic meaning usually requires querying the complete history of Bitcoin blockchain data, which has become increasingly difficult now that the blockchain holds over 1.6 billion historical transactions. It is thus important to query Bitcoin transaction data in a way that is more efficient and provides economic insights. We apply cohort analysis, which interprets Bitcoin blockchain data using methods developed for population data in social science. Specifically, we query and process the Bitcoin transaction input and output data within each daily cohort, which enables us to create datasets and visualizations for some key indicators of Bitcoin transactions, including the daily lifespan distributions of spent transaction output (STXO) and the daily age distributions of the accumulated unspent transaction output (UTXO). We provide a computationally feasible approach to characterizing Bitcoin transactions, which paves the way for future economic studies of Bitcoin.

Background and Summary

Bitcoin (BTC) is a peer-to-peer electronic payment system that has rapidly gained popularity in recent years [nakamoto_2008_bitcoin, bhme_2015_bitcoin]. Bitcoin relies on recording the Unspent Transaction Outputs (UTXO) to efficiently verify newly generated transactions, thus eliminating the need for intermediaries like banks and reducing transaction costs [delgadosegura_2019_analysis, zahnentferner_2018_chimeric, urquhart_2016_the, chakravarty_2020_the, prezsol_2019_another, konrad_2015_bitcoin]. A UTXO is generated either as a block reward or as the output of a transaction, and its timestamp is recorded at that moment. A UTXO is spent and converted into a Spent Transaction Output (STXO) when it is used as the input of a transaction, and the timestamp is recorded again at that moment. Each UTXO can be spent only once. This feature allows us to calculate the age of each UTXO and the lifespan of each STXO, just as we would for population data.

Noticing the unique structure of the Bitcoin blockchain data, we analyze it with cohort analysis [glenn_2005_cohort, mason_1973_some, breslow_1983_multiplicative, jiang_2016_cohort, omidvartehrani_2018_cohort], a method developed for population data. By analogy with population data, we say a UTXO is born when it is generated as a block reward or as the output of a transaction, and we say a UTXO is dead when it is spent as the input of another transaction. In this way, all UTXOs generated on the same day form a daily birth cohort, while all UTXOs spent on the same day form a daily death cohort. We define the age of a UTXO as the difference between “now” (the date on which we are working) and the time when it was born, and the lifespan of an STXO as the difference between the time when it died and the time when it was born. Thus, all UTXOs within an age range form an age cohort, and all STXOs within a lifespan range form a lifespan cohort. With this framework, we naturally replicate in Bitcoin blockchain data the trinity of birth, death, and age cohorts from population cohort analysis.
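The definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the `TxOutput` class and the function names are hypothetical, and the sample values are toy data.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class TxOutput:
    """Hypothetical record for one transaction output."""
    value_btc: float                  # amount held by the output, in BTC
    created: datetime                 # "birth": when the output was generated
    spent: Optional[datetime] = None  # "death": None while the output is unspent


def birth_cohort(o: TxOutput) -> date:
    """Daily birth cohort: the calendar date on which the output was created."""
    return o.created.date()


def death_cohort(o: TxOutput) -> Optional[date]:
    """Daily death cohort: the date on which the output was spent, if any."""
    return o.spent.date() if o.spent is not None else None


def age(o: TxOutput, now: datetime) -> Optional[timedelta]:
    """Age at `now`; None if the output is not alive (unborn or spent) at `now`."""
    if o.created > now or (o.spent is not None and o.spent <= now):
        return None
    return now - o.created


def lifespan(o: TxOutput) -> Optional[timedelta]:
    """Lifespan of a spent output: time of death minus time of birth."""
    return o.spent - o.created if o.spent is not None else None


# Toy data: one output that is still unspent and one that was spent two days later.
utxo = TxOutput(1.5, datetime(2020, 1, 1, 12))
stxo = TxOutput(0.5, datetime(2020, 1, 1, 8), datetime(2020, 1, 3, 8))
```

Both toy outputs fall in the same birth cohort (2020-01-01), but only `stxo` belongs to a death cohort and has a lifespan.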

Acquiring variables with economic meaning usually requires querying the complete history of Bitcoin blockchain data. With over 1.6 billion historical transactions on the Bitcoin blockchain, this has become increasingly difficult and computationally intensive. It is thus important to query Bitcoin transaction data in a way that is more efficient and provides economic insights [liu_2020_cryptocurrency]. Cohort analysis provides a new perspective: we can analyze the data within each cohort separately and then integrate the results into a time series with economic meaning.

We query and process Bitcoin transaction input and output data within each daily cohort. With this, we create datasets and visualizations for some key indicators of bitcoin transactions, including the daily lifespan distributions of STXO in percentage (Figure 1) and the daily age distributions of the accumulated UTXO in BTC (Figure 2). The visualizations can be used to study the functions of bitcoin as a currency. The three functions of a currency are store of value, unit of account, and medium of exchange. For example, Figure 2 shows the number of bitcoins in UTXOs (i.e. bitcoins that have not been spent) by age. By the end of 2020, around 2 million bitcoins had not been transacted for more than 10 years. Another 2 million, 4.5 million, and 3 million bitcoins had remained inactive for 5-10 years, 2-5 years, and 1-2 years, respectively. In total, around 11.5 million bitcoins had not been transacted for more than 1 year. These bitcoins resemble a time deposit and play the role of a store of value. Moreover, around 5 million bitcoins have ages between 1 month and 1 year. These bitcoins are similar to a demand deposit. The frequently transacted bitcoins are those with ages between 1 day and 1 month (2 million) and less than 1 day (0.2 million). These bitcoins play the role of a medium of exchange.

Our final datasets include one dataset that characterizes STXOs and one that characterizes UTXOs, both of which are smaller than 1MB. Moreover, cohort analysis keeps data querying and processing to a minimum for future updates and enables the updates to be automated. We thus provide a computationally feasible approach to characterizing bitcoin transactions, which paves the way for future economic studies of Bitcoin. Our methods can be applied more generally to other cryptocurrencies that adopt UTXO protocols, including Litecoin, Dash, Zcash, Dogecoin, and Bitcoin Cash.

Figure 1: Lifespan Distribution of BTC STXOs
Figure 2: Number of BTC UTXOs by Age

Methods

While the Bitcoin transaction output data is publicly available on its blockchain, we find the size of the raw data (about TB) overwhelming to process even with cloud computing platforms. To improve the efficiency of computation, we first retrieve the part of data relevant to the study to create a more manageable data table of GB. By partitioning this data table into daily birth and death cohorts, we analyze the STXOs and UTXOs in each cohort separately to summarize the daily characteristics of transaction outputs and create visualizations based on the cohort summary. Our method can adapt to the creation of future blocks - we only need to process the transaction output data from the latest cohort and append the summary to the current one.

Creating Partitioned Tables

Our primary workplaces are Google Colaboratory (Colab), a hosted Jupyter Notebook environment from Google, and BigQuery, a data warehouse from Google Cloud Platform. We first query the columns of interest from the public dataset crypto-bitcoin on BigQuery [kaggle], which includes the input and output data of Bitcoin. We then join the data queried from the input and output data to create a data table that includes the value of the UTXO (value), the timestamp when the UTXO was created (block_timestamp), and the timestamp when the UTXO was spent as an input of another transaction (spent_block_timestamp); this last column is left as null if the transaction output is unspent. As the UTXO in a transaction is counted in satoshi (1 satoshi = $10^{-8}$ BTC), the actual number of UTXO in BTC can be computed by $\text{value} \times 10^{-8}$, where value represents the number of UTXO in satoshi. We rely on this derived data table ( billion rows, GB) to conduct further analysis.
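The join and the unit conversion described above can be illustrated on toy data with plain Python. This is a sketch under stated assumptions: the field names `tx_hash`, `index`, `spent_tx_hash`, and `spent_index` are hypothetical stand-ins for whatever keys link an input to the output it spends, and they are not taken from the BigQuery schema. Only `value`, `block_timestamp`, and `spent_block_timestamp` come from the text.

```python
from datetime import datetime

SATOSHI_PER_BTC = 10**8  # 1 BTC = 100,000,000 satoshi


def build_derived_table(outputs, inputs):
    """Left-join every output to the input that spends it, if there is one.

    outputs: dicts with tx_hash, index, value (in satoshi), block_timestamp
    inputs:  dicts with spent_tx_hash, spent_index, block_timestamp
    (the key names are illustrative assumptions, not a real schema)
    """
    spent_at = {
        (i["spent_tx_hash"], i["spent_index"]): i["block_timestamp"]
        for i in inputs
    }
    return [
        {
            "value": o["value"],
            "value_btc": o["value"] / SATOSHI_PER_BTC,
            "block_timestamp": o["block_timestamp"],
            # None plays the role of NULL for outputs that are still unspent.
            "spent_block_timestamp": spent_at.get((o["tx_hash"], o["index"])),
        }
        for o in outputs
    ]


# Toy data: two outputs of one transaction, the second of which is later spent.
outputs = [
    {"tx_hash": "a", "index": 0, "value": 5_000_000_000,
     "block_timestamp": datetime(2009, 1, 3)},
    {"tx_hash": "a", "index": 1, "value": 150_000_000,
     "block_timestamp": datetime(2009, 1, 3)},
]
inputs = [
    {"spent_tx_hash": "a", "spent_index": 1,
     "block_timestamp": datetime(2009, 1, 12)},
]
rows = build_derived_table(outputs, inputs)
```

A left join is used so that unspent outputs are kept, with a null spend timestamp, instead of being dropped.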

To reduce query cost, we create two partitioned tables from the derived data table, one partitioned by the date in block_timestamp and one by the date in spent_block_timestamp. This means that the data entries are partitioned either by the date when the UTXOs were created or by the date when the UTXOs were spent. As a result, a query for a specific date range scans only the partitions in that range rather than the whole table, which improves query performance and reduces query cost [a2021_introduction].
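The effect of date partitioning can be mimicked in memory with a dictionary keyed by calendar date. The sketch below is only an analogy for how a partitioned table limits the data a query touches; the function names are hypothetical, and rows with a null timestamp are simply skipped here, which is a simplification.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def partition_by_date(rows, column):
    """Group rows by the calendar date of `column`; rows with None are skipped."""
    partitions = defaultdict(list)
    for row in rows:
        ts = row[column]
        if ts is not None:
            partitions[ts.date()].append(row)
    return dict(partitions)


def query_range(partitions, start, end):
    """Collect rows dated start..end (inclusive) without scanning other dates."""
    selected = []
    day = start
    while day <= end:
        selected.extend(partitions.get(day, []))
        day += timedelta(days=1)
    return selected


# Toy rows carrying only the two timestamp columns.
rows = [
    {"block_timestamp": datetime(2009, 1, 3, 18), "spent_block_timestamp": None},
    {"block_timestamp": datetime(2009, 1, 9, 3),
     "spent_block_timestamp": datetime(2009, 1, 12, 4)},
    {"block_timestamp": datetime(2009, 1, 9, 5), "spent_block_timestamp": None},
]
births = partition_by_date(rows, "block_timestamp")        # birth cohorts
deaths = partition_by_date(rows, "spent_block_timestamp")  # death cohorts
recent = query_range(births, date(2009, 1, 4), date(2009, 1, 9))
```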

Querying and Processing Cohort Data

The structure of the partitioned tables matches our need to process cohort data. The table partitioned by the date in block_timestamp naturally divides the derived data into birth cohorts, each containing the transaction outputs created on the same date, and the table partitioned by the date in spent_block_timestamp divides the derived data into death cohorts, each containing the transaction outputs spent on the same date.

We query and process each birth cohort and each death cohort in a loop. For each date since 2009-01-03, when the first Bitcoin block was created, the birth cohort data and the death cohort data of that date are queried and imported into Colab from BigQuery. The total number of UTXOs in BTC created and spent on that date can be computed by summing the number of UTXOs in BTC in the birth cohort data and the death cohort data, respectively. The weighted average lifespan (WAL) on the date, as defined in Table 1, can be computed from the death cohort data by the formula

$$\mathrm{WAL} = \frac{\sum_i v_i \, \ell_i}{\sum_i v_i},$$

where $v_i$ is the number of UTXOs in BTC contained in the $i$-th transaction output spent on the date and $\ell_i = \text{spent\_block\_timestamp}_i - \text{block\_timestamp}_i$ is its lifespan. The distribution of lifespan can be computed from the death cohort data on that date by first categorizing UTXOs by life length and then summing the number of UTXOs in BTC in each category.
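As a concrete illustration of the death-cohort summary, the sketch below computes the total BTC spent, the weighted average lifespan, and the lifespan distribution for one toy cohort. Several choices are assumptions made for this sketch and are not confirmed by the source: lifespans are expressed in days, a month is taken as 30 days and a year as 365 days, and the category labels and function names are hypothetical.

```python
from datetime import datetime, timedelta

DAY = timedelta(days=1)
# Upper bounds of the lifespan categories (assumed: month = 30 d, year = 365 d).
# Anything at or above the last bound falls into ">10y".
BOUNDS = [
    ("<1d", DAY), ("1d-1m", 30 * DAY), ("1m-3m", 90 * DAY),
    ("3m-6m", 180 * DAY), ("6m-1y", 365 * DAY), ("1y-2y", 730 * DAY),
    ("2y-3y", 1095 * DAY), ("3y-4y", 1460 * DAY), ("4y-5y", 1825 * DAY),
    ("5y-10y", 3650 * DAY),
]


def bucket(span):
    """Return the label of the first category whose upper bound exceeds `span`."""
    for label, upper in BOUNDS:
        if span < upper:
            return label
    return ">10y"


def summarize_death_cohort(cohort):
    """cohort: (value_btc, created, spent) tuples, all spent on the same date."""
    total = sum(value for value, _, _ in cohort)
    # Lifespan in days (timedelta / timedelta yields a float), weighted by BTC.
    weighted = sum(value * ((spent - created) / DAY)
                   for value, created, spent in cohort)
    distribution = {}
    for value, created, spent in cohort:
        label = bucket(spent - created)
        distribution[label] = distribution.get(label, 0.0) + value
    wal = weighted / total if total else 0.0
    return {"dead": total, "WAL": wal, "distribution": distribution}


# Toy death cohort for 2020-01-11: 1 BTC that lived 10 days, 3 BTC that lived 12 hours.
cohort = [
    (1.0, datetime(2020, 1, 1), datetime(2020, 1, 11)),
    (3.0, datetime(2020, 1, 10, 12), datetime(2020, 1, 11)),
]
summary = summarize_death_cohort(cohort)
```

For this toy cohort the WAL is (1 × 10 + 3 × 0.5) / 4 = 2.875 days: the large, short-lived output pulls the average well below the 10-day lifespan of the smaller one.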

We apply a more complicated partitioning trick to compute the age distribution for each specific date. The age of a UTXO is defined as $\text{age} = \text{working date} - \text{block\_timestamp}$, where working date means the date of interest for the data cohort being studied. A UTXO that remained alive on a specific date must satisfy both of the following conditions: a) its block_timestamp must be smaller than the end of the working date, which means the UTXO was created on or before that date; b) its spent_block_timestamp must either be null, which means the UTXO had not been spent as of 2021-02-12, or be larger than the end of the working date, which means the UTXO was spent after the working date but before 2021-02-12. Thus, the data needed cannot be read off from a single birth or death cohort. Instead, we first query the part of the data needed to compute the age distribution for a twelve-month or six-month period, depending on the size of the data in each year, and then split the queried data into daily cohorts in Colab. We can compute the age distribution of each daily cohort by categorizing the age of each UTXO and summing the number of UTXOs in BTC in each category.
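Conditions (a) and (b) translate directly into a filter. The following sketch applies them to toy rows and bins the survivors by age. It rests on assumptions that the source does not state: age is measured at the end of the working date, a month is 30 days and a year is 365 days, and the labels and function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta

DAY = timedelta(days=1)
# Upper bounds of the age categories (assumed: month = 30 d, year = 365 d).
AGE_BOUNDS = [
    ("<1d", DAY), ("1d-1m", 30 * DAY), ("1m-3m", 90 * DAY),
    ("3m-6m", 180 * DAY), ("6m-1y", 365 * DAY), ("1y-2y", 730 * DAY),
    ("2y-3y", 1095 * DAY), ("3y-4y", 1460 * DAY), ("4y-5y", 1825 * DAY),
    ("5y-10y", 3650 * DAY),
]


def age_bucket(age):
    """Return the label of the first category whose upper bound exceeds `age`."""
    for label, upper in AGE_BOUNDS:
        if age < upper:
            return label
    return ">10y"


def age_distribution(rows, working_date):
    """rows: (value_btc, created, spent_or_None); returns BTC per age category."""
    end_of_day = datetime.combine(working_date, time.min) + DAY
    distribution = {}
    for value_btc, created, spent in rows:
        born = created < end_of_day                    # condition (a)
        alive = spent is None or spent >= end_of_day   # condition (b)
        if born and alive:
            label = age_bucket(end_of_day - created)
            distribution[label] = distribution.get(label, 0.0) + value_btc
    return distribution


# Toy rows evaluated for the working date 2020-01-01.
rows = [
    (2.0, datetime(2020, 1, 1, 6), None),                      # alive, 18 hours old
    (1.0, datetime(2019, 1, 1), datetime(2020, 1, 1, 12)),     # spent during the day
    (4.0, datetime(2019, 6, 1), datetime(2020, 3, 1)),         # spent later, so alive
    (5.0, datetime(2020, 1, 2, 1), None),                      # not yet born
]
dist = age_distribution(rows, date(2020, 1, 1))
```

Only the first and third rows count as alive on the working date; the third is included even though it is an STXO today, which is why neither the birth nor the death partition alone is enough.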

Visualizing the Time Series

The result of our analysis is condensed into a time series that includes the number of UTXOs in BTC created and spent, the weighted average lifespan, the life length distribution, and the age distribution for each date from 2009-01-03 to 2021-02-12. Many visualizations can be generated from this time series. For example, as visualized in Figure 3, we can compute the circulating supply of bitcoins on date $t$ as the cumulative net new UTXOs with the formula

$$\text{supply}_t = \sum_{s \le t} \left( \text{newborn}_s - \text{dead}_s \right),$$

where $\text{newborn}_s$ and $\text{dead}_s$ are the numbers of UTXOs in BTC created and spent on date $s$.

Bitcoin token velocity, which we define as the amount of bitcoins spent in the last 30 days divided by the circulating supply, can be computed by

$$\text{velocity}_t = \frac{\sum_{s = t-29}^{t} \text{dead}_s}{\text{supply}_t}.$$
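Both quantities follow mechanically from the daily `newborn` and `dead` columns. The sketch below is one possible implementation on toy numbers. The function names are hypothetical, and for the first days of the series the trailing window is simply truncated, which is an assumption of this sketch.

```python
from itertools import accumulate


def circulating_supply(newborn, dead):
    """Cumulative net new UTXO value: supply_t = sum over s <= t of (newborn_s - dead_s)."""
    return list(accumulate(n - d for n, d in zip(newborn, dead)))


def velocity(dead, supply, window=30):
    """BTC spent over the trailing `window` days divided by the supply on day t."""
    result = []
    for t, s in enumerate(supply):
        spent = sum(dead[max(0, t - window + 1): t + 1])
        result.append(spent / s if s else 0.0)
    return result


# Toy series of three days (values in BTC).
newborn = [50, 50, 70]
dead = [0, 0, 20]
supply = circulating_supply(newborn, dead)
vel = velocity(dead, supply, window=2)
```

On the third toy day, 70 BTC of outputs are created and 20 BTC are spent, so the supply grows by the net 50 BTC, matching a single block reward.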

Our method can adapt to the creation of future blocks. The time series data for the past dates are not subject to changes as new blocks are created. As time goes on, we only need to query and process the latest data cohorts to extend the time series. We will update the visualizations according to the latest development of Bitcoin, and researchers may easily repeat our work in part or in whole based on their needs.
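The append-only update can be sketched as a small loop that summarizes only the dates missing from the existing series. The `summarize` callable stands in for the per-cohort query and processing step; it and the function name `extend_series` are hypothetical.

```python
from datetime import date, timedelta


def extend_series(series, summarize, today):
    """Append one summary per date after the last entry of `series`, up to `today`.

    series:    non-empty list of dicts with a "date" key, sorted ascending
    summarize: callable(date) -> dict of cohort statistics (hypothetical)
    Earlier entries are never recomputed.
    """
    day = series[-1]["date"] + timedelta(days=1)
    while day <= today:
        series.append({"date": day, **summarize(day)})
        day += timedelta(days=1)
    return series


# Toy usage: the series ends on 2021-02-12 and is extended by two days.
processed = []


def fake_summarize(day):
    """Stand-in for the real cohort query; records which dates were processed."""
    processed.append(day)
    return {"newborn": 0.0, "dead": 0.0}


series = [{"date": date(2021, 2, 12), "newborn": 0.0, "dead": 0.0}]
extend_series(series, fake_summarize, date(2021, 2, 14))
```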

Data Records

The final data records are stored and published on Harvard Dataverse [liu_2021_replication]. The records consist of the UTXO and the STXO datasets in csv format. Tables 1 and 2 represent the metadata information of the two datasets.

| Name | Description |
|---|---|
| date | Date on which cohort data were queried, in the format “%Y/%m/%d” |
| newborn | Number of UTXOs in BTC created in bitcoin transactions on the date |
| dead | Number of UTXOs in BTC spent in bitcoin transactions as inputs on the date |
| WAL | Weighted Average Lifespan of the UTXOs spent on the date, defined as the average lifespan (the difference between the time when the output was spent and the time when the output was created) weighted by the number of UTXOs in BTC contained in the transaction outputs |
| -9 | Number of UTXOs in BTC spent on the date that were created less than one day (< 1d) before |
| -7 | Number of UTXOs in BTC spent on the date that were created more than one day but less than one month (1d to 1m) before |
| -5 | Number of UTXOs in BTC spent on the date that were created more than one month but less than three months (1m to 1q) before |
| -3 | Number of UTXOs in BTC spent on the date that were created more than three months but less than six months (1q to 6m) before |
| -1 | Number of UTXOs in BTC spent on the date that were created more than six months but less than one year (6m to 1y) before |
| 1 | Number of UTXOs in BTC spent on the date that were created more than one year but less than two years (1y to 2y) before |
| 3 | Number of UTXOs in BTC spent on the date that were created more than two years but less than three years (2y to 3y) before |
| 5 | Number of UTXOs in BTC spent on the date that were created more than three years but less than four years (3y to 4y) before |
| 7 | Number of UTXOs in BTC spent on the date that were created more than four years but less than five years (4y to 5y) before |
| 9 | Number of UTXOs in BTC spent on the date that were created more than five years but less than ten years (5y to 10y) before |
| 11 | Number of UTXOs in BTC spent on the date that were created more than ten years (> 10y) before |

Table 1: Metadata for the STXO Dataset

| Name | Description |
|---|---|
| date | Date on which cohort data were queried, in the format “%Y/%m/%d” |
| -9 | Number of UTXOs in BTC still alive by the end of the date that were created less than one day (< 1d) before |
| -7 | Number of UTXOs in BTC still alive by the end of the date that were created more than one day but less than one month (1d to 1m) before |
| -5 | Number of UTXOs in BTC still alive by the end of the date that were created more than one month but less than three months (1m to 1q) before |
| -3 | Number of UTXOs in BTC still alive by the end of the date that were created more than three months but less than six months (1q to 6m) before |
| -1 | Number of UTXOs in BTC still alive by the end of the date that were created more than six months but less than one year (6m to 1y) before |
| 1 | Number of UTXOs in BTC still alive by the end of the date that were created more than one year but less than two years (1y to 2y) before |
| 3 | Number of UTXOs in BTC still alive by the end of the date that were created more than two years but less than three years (2y to 3y) before |
| 5 | Number of UTXOs in BTC still alive by the end of the date that were created more than three years but less than four years (3y to 4y) before |
| 7 | Number of UTXOs in BTC still alive by the end of the date that were created more than four years but less than five years (4y to 5y) before |
| 9 | Number of UTXOs in BTC still alive by the end of the date that were created more than five years but less than ten years (5y to 10y) before |
| 11 | Number of UTXOs in BTC still alive by the end of the date that were created more than ten years (> 10y) before |

Table 2: Metadata for the UTXO Dataset
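Since the category columns are named by odd integers, a small lookup makes the records easier to work with. The sketch below parses a CSV with the columns listed in Table 2 using only the standard library. It assumes the file has a header row with exactly these names. The human-readable labels, the function name, and the sample row are illustrative and are not taken from the published files.

```python
import csv
import io
from datetime import datetime

# Column name -> category label, following the order of Tables 1 and 2.
BUCKET_LABELS = {
    "-9": "<1d", "-7": "1d-1m", "-5": "1m-3m", "-3": "3m-6m", "-1": "6m-1y",
    "1": "1y-2y", "3": "2y-3y", "5": "3y-4y", "7": "4y-5y", "9": "5y-10y",
    "11": ">10y",
}


def read_records(fileobj):
    """Parse a dataset CSV into dicts with a parsed date and labelled BTC amounts."""
    records = []
    for row in csv.DictReader(fileobj):
        record = {"date": datetime.strptime(row["date"], "%Y/%m/%d").date()}
        for column, label in BUCKET_LABELS.items():
            record[label] = float(row[column])
        records.append(record)
    return records


# Toy file with one made-up row, only to show the expected layout.
header = "date," + ",".join(BUCKET_LABELS)
sample = header + "\n2021/02/12," + ",".join(str(i) for i in range(11)) + "\n"
records = read_records(io.StringIO(sample))
```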

Technical Validation

To further verify the validity of our methods, we calculate other variables using our data and check whether they are consistent with the descriptions in the Bitcoin white paper [nakamoto_2008_bitcoin]. Figure 3 visualizes the block rewards, the circulating bitcoin supply, and the total spent bitcoins. Block rewards are the bitcoins rewarded to the miner who wins the right to record a block of transactions by proof-of-work. The supply of bitcoin originates from the rewards for mining blocks, so the accumulated sum of block rewards is the total number of UTXOs in BTC, i.e. the circulating supply of bitcoins. The Bitcoin block reward was initially set at 50 BTC per block in 2009, which means roughly 7,200 newly minted BTC every 24 hours. The block reward halves every 210,000 blocks, roughly every four years, until the total bitcoin supply reaches 21 million [schr_2020_understanding]. To date, the daily block reward amounts to around 900 BTC, and the circulating BTC supply has reached 18.6 million. In addition, the number of STXOs in BTC represents the total number of bitcoins that were transacted in the past. While the number of bitcoins in the form of STXOs increases with every transaction, the number of bitcoins in the form of UTXOs is the same before and after a transaction, apart from those generated from block rewards. Therefore, because bitcoins are frequently transacted, the number of bitcoins in the form of STXOs dwarfs the number in the form of UTXOs (i.e. the circulating bitcoin supply). All these patterns are accurately illustrated in Figure 3.
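The figures quoted above can be cross-checked against the halving schedule with a few lines of arithmetic. In the sketch below, 144 blocks per day (one block every ten minutes) is implied by 7,200 BTC per day at 50 BTC per block. The block height of 670,000 used in the example is an assumed round figure for early 2021, and rounding to whole satoshi is ignored.

```python
INITIAL_REWARD = 50.0       # BTC per block in 2009
HALVING_INTERVAL = 210_000  # blocks between halvings
BLOCKS_PER_DAY = 144        # assumes one block every 10 minutes on average


def block_reward(height):
    """Reward in BTC for the block at `height` (satoshi rounding ignored)."""
    return INITIAL_REWARD / 2 ** (height // HALVING_INTERVAL)


def daily_reward(height):
    """Approximate BTC minted per day around `height`."""
    return BLOCKS_PER_DAY * block_reward(height)


def total_minted(height):
    """BTC minted by the first `height` blocks (heights 0 .. height - 1)."""
    full_eras, remainder = divmod(height, HALVING_INTERVAL)
    minted = sum(HALVING_INTERVAL * INITIAL_REWARD / 2 ** k
                 for k in range(full_eras))
    return minted + remainder * block_reward(height)


# Assumed height for early 2021; three halvings have occurred by then.
early_2021 = 670_000
reward_now = daily_reward(early_2021)
supply_now = total_minted(early_2021)
```

With these assumptions the daily reward comes out at 900 BTC and the cumulative issuance at 18,625,000 BTC, in line with the roughly 900 BTC per day and 18.6 million BTC reported in the text.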

Figure 3: Block reward, circulating BTC supply, and total spent BTCs

Usage Notes

Our datasets can be combined with other bitcoin transaction datasets for further research. For example, Figure 4 superimposes the BTC price data from CoinMetrics [coinmetrics] on our weighted average lifespan (WAL) data. We notice that the WAL of BTC in STXOs peaks when the bitcoin price changes abruptly. For example, the 2014 peak of WAL closely followed the surge of the BTC price from to and the subsequent price collapse. This implies that older bitcoins become more active during market turmoil.

Figure 5 presents the price data from CoinMetrics and the BTC token velocity calculated using our datasets. BTC velocity is defined as the number of bitcoins spent in the last 30 days divided by the circulating supply. It shows how frequently bitcoins are transacted during that period. The maximum value occurred at the end of 2013, which implies that every bitcoin was spent on average more than 35 times in that month.

Figure 4: Daily Weighted Average Lifespan of BTC STXOs and BTC Price
Figure 5: BTC Token Velocity and BTC Price

Acknowledgements

We have benefited from the comments by discussants at the SciEcon Research Accelerator Seminar.

Author contributions statement

Each author contributed equally to this research. Yulin Liu led the discussion on the blockchain mechanism and the economic meanings of the UTXO datasets. Luyao Zhang designed the cohort analysis. Yinhong Zhao queried and processed the data. Each author contributed significantly to the draft of the manuscript.

Competing interests

There are no competing interests.

References