BlockTag: Design and applications of a tagging system for blockchain analysis

09/17/2018 ∙ by Yazan Boshmaf, et al. ∙ 0

Annotating blockchains with auxiliary data is useful for many applications. For example, e-crime investigations of illegal Tor hidden services, such as Silk Road, often involve linking Bitcoin addresses, from which money is sent or received, to user accounts and related online activities. We present BlockTag, an open-source tagging system for blockchains that facilitates such tasks. We describe BlockTag's design and present three analyses that illustrate its capabilities in the context of privacy research and law enforcement.

1. Introduction

Public blockchains contain valuable data describing financial transactions. For example, Bitcoin’s raw blockchain data alone is 160 GB as of March 2018, and is growing rapidly. This data holds the key to understanding different aspects of cryptocurrencies, including their privacy and market dynamics. Blockchain analysis systems, such as BlockSci (Kalodner et al., 2017), have enabled blockchain science by addressing three pain points, namely poor performance, limited capabilities, and a cumbersome programming interface.

Overview

We present BlockTag, a tagging system for blockchains. BlockTag uses vertical crawlers to automatically annotate blockchain data with tags, mappings between block, transaction, or address identifiers and auxiliary data describing the tagged identifiers. For example, the system can tag a Bitcoin address with the Twitter user account of its owner. BlockTag also provides a novel blockchain query interface with advanced capabilities, such as clustering, linking, and searching, which are important for privacy research and law enforcement. For example, BlockTag provides best-effort answers to high-level queries in Bitcoin e-crime investigations, such as “which Twitter user accounts paid to Silk Road in 2014.”

Design

We start with the observation that most blockchain analysis systems transform raw blockchain data into a stripped-down, simple data structure that can fit in or map to memory. Therefore, information auxiliary to core transaction data, such as scripts, hashes, or annotations in general, cannot be part of this data structure and must have its own mappings. This naturally leads to a layered system architecture, where a tagging layer sits on top of an analysis layer, with a well-defined and extendable interface between them, as shown in Figure 1. In our implementation, we chose BlockSci as the blockchain analysis system because it is hundreds of times faster than its contenders. Moreover, BlockSci exposes a programming interface in C++ to extend its core analysis library, along with a Python wrapper for defining high-level analytical tasks.

Figure 1. Layered blockchain system architecture.

BlockTag is shipped with batteries included. First, it implements four vertical crawlers that are configured to annotate Bitcoin addresses with three types of tags: user tags representing BitcoinTalk (https://bitcointalk.org) and Twitter user accounts, service tags representing service providers that are indexed by the Ahmia (https://ahmia.fi) search engine, and text tags representing user-generated textual labels submitted to Blockchain.info (https://blockchain.info). Second, BlockTag is not limited to Bitcoin. The vertical crawlers can be configured to scrape auxiliary data of other cryptocurrencies, including Litecoin, Namecoin, and Zcash. Smart contract platforms, such as Ethereum and EOS, are outside our scope. Third, BlockTag extends BlockSci’s analysis library and implements a programming interface that enables analysts to query transactions by their properties, including tags. Fourth, BlockTag allows analysts to manually annotate blockchains with custom tags at the block, transaction, and address level.

Analysts start blockchain investigations using a Jupyter notebook that imports BlockTag’s Python package. The package exposes a chain object representing the blockchain. Each block, transaction, and address has a tags object mapping it to some JSON-serializable auxiliary data. Selecting, grouping, and aggregating transactions is straightforward and is provided through a simple query interface.

Deployment

We deployed BlockTag on a single, private, server-grade machine in January 2018 for about three months. As of March 2018, the crawlers have ingested about 5B tweets, 2.2M BitcoinTalk user profiles, 1.5K Tor onion pages, and 30K Blockchain.info labels. This has resulted in 45K user, 88 service, and 29K text tags.

Applications

We demonstrate BlockTag’s novel capabilities with three applications, focusing on Bitcoin and Tor hidden services.

1) Linking: We show it is relatively easy to link users of social networks to Tor hidden services through Bitcoin payments. We were able to link 125 user accounts to 20 service providers, which include illegal and controversial ones, such as Silk Road and The Pirate Bay. While one may expect a better level of privacy when using Bitcoin, we recall that it is pseudo-anonymous by design and lacks retroactive operational security, as described by Satoshi (Nakamoto, 2008). From a law enforcement perspective, BlockTag offers a valuable capability that is useful in e-crime investigations. In particular, showing a provable link between a user account on a website and illegal activities on the Dark Web can be used to secure a subpoena to collect more information about the user from the website’s operator (Theymos, 2014).

2) Market economics: We analyze the market of Tor hidden services by calculating their “balance sheets.” We show that WikiLeaks is the highest receiver of payments in terms of volume, with 26.4K transactions. In terms of total value of incoming payments, however, Silk Road tops the list with more than 29.6K BTC received on its address. We also observe that the money flowing in and out of service addresses is nearly the same. This suggests that service providers do not keep their Bitcoins on the same address they use for receiving payments, but rather distribute them to other addresses. Finally, from the last transaction dates of these addresses, we found that all but three of the top-10 revenue-making service providers were active in 2018. This, however, does not mean the three services have stopped making Bitcoin transactions, as the service providers might have used different addresses that the crawlers have not found.

3) Forensics: We link 24.2K users and 202 labels to MMM, which is one of the world’s largest Ponzi schemes. All of these users are BitcoinTalk users who are mostly male, 20–40 years old, and located worldwide in more than 80 countries. We found that only 313 users have made one or more activities and engaged with the forum once a day, on average. After further analysis, it turned out these user accounts were created as part of the “MMM Extra” scheme, which promises “up to 100% return per month for performing simple daily tasks that take 5–15 min.” We also used BlockTag to retrieve and model MMM transactions as a graph. This graph consisted of 14.3K addresses and 32.K transactions. We found that two of the top-10 ranked addresses, in terms of their PageRank, have been flagged on BitcoinTalk as scammer addresses. As of March 21, these addresses have received more than 2M BTC combined.

2. Background and related work

Research in blockchain and cryptocurrencies has gained significant momentum over the years (Bonneau et al., 2015). In what follows, we present the background and related work and contrast it with ours.

Analysis systems

Blockchain analysis systems parse and analyze raw transaction data for many applications. Recently, Kalodner et al. proposed BlockSci (Kalodner et al., 2017), an open-source, scalable blockchain analysis system that supports various blockchains and analysis tasks. BlockSci incorporates an in-memory, analytical database, which makes it several hundred times faster than its contenders. While there is minimal support for tagging in its programming interface, BlockSci is designed for analysis of core blockchain data. At the cost of performance, annotation and tagging can be integrated into the analysis pipeline through a centralized, transactional database. For example, Spagnuolo et al. proposed BitIodine (Spagnuolo et al., 2014), an open-source blockchain analysis system that supports tagging through address labels. However, BitIodine relies on Neo4j (Miller, 2013), a general-purpose graph database that is not designed for blockchain data and its append-only nature, which makes it inefficient for common blockchain analysis tasks, such as address linking. In contrast, BlockTag is the first open-source tagging system that fills this role.

Linking

The impact of Bitcoin address linking on user anonymity and privacy has been known for a while now (Reid and Harrigan, 2013; Jordan et al., 2013; DuPont and Squicciarini, 2015; Fleder et al., 2015). Reid and Harrigan (Reid and Harrigan, 2013) showed that passive analysis of public Bitcoin information can lead to serious information leakage. They constructed two graphs representing transactions and users from Bitcoin’s blockchain data and annotated the graphs with auxiliary data, such as user accounts from BitcoinTalk and Twitter. The authors used visual content discovery and flow analysis techniques to investigate Bitcoin theft. Alternatively, Fleder et al. (Fleder et al., 2015) explored the level of anonymity in the Bitcoin network. The authors annotated addresses in the transaction graph with user accounts collected from BitcoinTalk in order to show that users can be linked to transactions through their public Bitcoin addresses. These studies show the value of using public data sources for Bitcoin privacy research and law enforcement, which is our goal behind designing BlockTag.

Tor hidden services and black markets

Tor hidden services have become a breeding ground for black markets, such as Silk Road and Agora, which offer illicit merchandise and services (Biryukov et al., 2014b; Moore and Rid, 2016). Moore and Rid (Moore and Rid, 2016) studied how hidden services are used in practice, and noted that Bitcoin was the dominant choice for accepting payments. Although multiple studies (Fleder et al., 2015; Meiklejohn et al., 2013) showed that Bitcoin transactions are not as anonymous as previously thought, Bitcoin remains the most popular digital currency on the Dark Web (Castillo, 2016), and many users choose to use it despite its false sense of anonymity. Recent research explored the intersection between Bitcoin and Tor privacy (Biryukov et al., 2014a; Biryukov and Pustogarov, 2015), and found that legitimate hidden service users and providers are one class of Bitcoin users whose anonymity is particularly important. Moreover, Biryukov et al. (Biryukov et al., 2014b) found that hidden services devoted to anonymity, security, human rights, and freedom of speech are as popular as illegal services. While BlockTag makes it possible to link users to such services, we designed it to help analysts understand the privacy threats, identify malicious actors, and enforce the law.

Forensics

Previous research showed that cryptocurrencies, Bitcoin in particular, have a thriving market for fraudulent services, such as fake mining, wallets, exchanges, and Ponzi schemes (Vasek and Moore, 2015; Bohr and Bashir, 2014). Recently, Bartoletti et al. (Bartoletti et al., 2018) proposed a data mining approach to detect Bitcoin addresses that are involved in Ponzi schemes. The authors manually collected and labeled Bitcoin addresses from public data sources, defined a set of features, and trained multiple classifiers using supervised machine learning. The best classifier correctly labeled 31 addresses out of 32 with 1% false positives. Interestingly, MMM was excluded because it had a complex scheme. In concept, BlockTag complements such techniques by providing an efficient and easy way to collect and explore data that is relevant to the investigation. This data can then be analyzed using different techniques with the help of existing tools (Vasek and Moore, 2018).

3. Design and architecture

BlockTag is designed for a layered system architecture. As depicted in Figure 1, each layer in the blockchain stack is responsible for a separate set of tasks and can interact with other layers through programmable interfaces. We present a high-level view of BlockTag’s design, and leave the details to the technical report.

3.1. Tags

In BlockTag, a tag is a mapping between a block, a transaction, or an address identifier and a list of JSON-serializable objects. Each object specifies the type, the source, and other information representing auxiliary data describing the tagged identifier. As raw blockchain data is stored in a format that is efficient for validating transactions and ensuring immutability, the data must be parsed and transformed into a simple data structure that is efficient for analysis. For example, BlockSci uses a memory-mapped data structure to represent core transaction data as a graph. All other transaction data, such as hashes and scripts, are stored separately as mappings that are loaded when needed. BlockTag follows this design choice, and uses a persistent key-value database, such as RocksDB (Facebook, 2014), with an in-memory cache in order to store and manage blockchain tags, as they can grow arbitrarily large in size.
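The storage pattern described above, a persistent key-value store fronted by an in-memory cache, can be illustrated with a minimal sketch. This is not BlockTag's implementation: the class name `CachedTagDB`, the `address:<id>` key layout, and the use of a plain dict as a stand-in for RocksDB are all assumptions made for illustration.

```python
import json
from collections import OrderedDict


class CachedTagDB:
    """Hypothetical read-through LRU cache over a dict-like key-value backend."""

    def __init__(self, backend, capacity=1024):
        self.backend = backend        # stand-in for a persistent store such as RocksDB
        self.capacity = capacity
        self.cache = OrderedDict()    # key -> decoded tag list, most recently used last
        self.backend_reads = 0        # counts cache misses, for illustration only

    def _evict_if_needed(self):
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # drop the least recently used entry

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as recently used
            return self.cache[key]
        self.backend_reads += 1
        raw = self.backend.get(key)
        tags = json.loads(raw) if raw is not None else []
        self.cache[key] = tags
        self._evict_if_needed()
        return tags

    def put(self, key, tags):
        self.backend[key] = json.dumps(tags)  # tags are JSON-serializable by definition
        self.cache[key] = tags
        self.cache.move_to_end(key)
        self._evict_if_needed()


db = CachedTagDB(backend={}, capacity=2)
db.put('address:a', [{'type': 'user', 'source': 'twitter'}])
db.put('address:b', [{'type': 'text', 'source': 'blockchain.info'}])
db.put('address:c', [{'type': 'service', 'source': 'tor'}])  # evicts 'address:a'
first = db.get('address:a')   # cache miss: served from the backend
second = db.get('address:a')  # cache hit: no backend read
```

Only the hot subset of tags stays in memory, which matches the stated motivation that tags can grow arbitrarily large.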

BlockTag defines four types of tags, namely user, service, text, and custom tags. A user tag represents a user account on an online social network, such as BitcoinTalk and Twitter. A service tag represents an online service provider, such as Tor hidden services like Silk Road and The Pirate Bay. A text tag represents a user-generated textual label, such as address labels submitted to Blockchain.info. A custom tag can hold arbitrary data, including other tags, and is usually used when creating tags manually by analysts.

In BlockTag, tags are created, updated, and removed at the block, transaction, or the address level. Listing 1 shows how to create a user tag mapping Bitcoin’s genesis address to Satoshi’s BitcoinTalk user account. The append flag indicates whether the value defined in this tag should be appended to the existing list, as the address can have other tag values defined already.

import blocktag
chain = blocktag.Blockchain('/path/to/blockchain/data/')
chain.tag(
    level='address',
    key='1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa',
    value=[{
        'type': 'user',
        'source': 'bitcointalk',
        'info': {
            'id': 3,
            'account': 'satoshi',
            'num_posts': 575,
            'num_activities': 364,
            'position': 'founder',
            'date_registered': '2009-11-19 19:12:39',
            'last_seen': '2010-12-13 16:45:41'
        }
    }],
    append=False
)
Listing 1: Creating a tag.

Direct, read-only access to tags is possible at any level through the tags object of a block, a transaction, or an address. By default, BlockTag returns the tag of an identifier at a given level along with the tags of identifiers from lower levels. This means it is sufficient to tag only addresses in order to annotate the whole blockchain.
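The append semantics and the lookup across levels can be sketched with a toy in-memory store. The `TagStore` class, its method signatures, and the explicit `children` argument are hypothetical and only mirror the behavior described in the text; BlockTag derives the lower-level identifiers from the blockchain itself.

```python
from collections import defaultdict


class TagStore:
    """Toy in-memory stand-in for a tag database (illustrative only)."""

    def __init__(self):
        # (level, key) -> list of JSON-serializable tag objects
        self._tags = defaultdict(list)

    def tag(self, level, key, value, append=True):
        """Create or update the tag list of one identifier."""
        if append:
            self._tags[(level, key)].extend(value)   # keep existing values
        else:
            self._tags[(level, key)] = list(value)   # overwrite existing values

    def tags(self, level, key, children=()):
        """Return the identifier's own tags plus those of lower-level identifiers."""
        result = list(self._tags.get((level, key), []))
        for child_level, child_key in children:
            result.extend(self._tags.get((child_level, child_key), []))
        return result


store = TagStore()
store.tag('address', 'addr_a', [{'type': 'user', 'source': 'twitter'}])
store.tag('address', 'addr_a', [{'type': 'text', 'source': 'blockchain.info'}], append=True)
store.tag('address', 'addr_b', [{'type': 'service', 'source': 'tor'}])

# A transaction without tags of its own still exposes the tags of its addresses.
tx_tags = store.tags('transaction', 'tx_1',
                     children=[('address', 'addr_a'), ('address', 'addr_b')])

# With append=False, the previous values of addr_b are replaced.
store.tag('address', 'addr_b', [{'type': 'custom', 'source': 'analyst'}], append=False)
addr_b_tags = store.tags('address', 'addr_b')
```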

3.2. Vertical crawlers

In BlockTag, a vertical crawler is used to scrape a data source, typically an HTML website or a RESTful API, in order to automatically create block, transaction, or address tags of a particular type using a website-specific parser. A crawler can be configured to run according to a crontab-like schedule, and to bootstrap on the first run with previously crawled raw HTML/JSON data, which can also be used to initialize blockchain tags. Listing 2 shows how to run a BitcoinTalk user crawler at the address level every day at midnight.

chain.crawl(
    level='address',
    config={
        'type': 'user',
        'source': 'bitcointalk',
        'schedule': '0 0 * * *',
        'data': '/path/to/bitcointalk/data/'
    }
)
Listing 2: Scheduling a crawler.

BitcoinTalk is one of the most popular Bitcoin forums with more than 2.2M users. In fact, as of July 2018, the forum contained about 42.2M posts, which makes it a good data source to collect public Bitcoin addresses and their associated user accounts. Behind the scenes, chain.crawl() in Listing 2 spawns a crawler that downloads user account pages through a URL that is unique for each user account. The HTML pages are then parsed to find Bitcoin addresses using regular expressions. As a Bitcoin address is a base58-encoded identifier of 26–35 alphanumeric characters, beginning with the number 1 or 3, the crawler uses the regex [13][a-km-zA-HJ-NP-Z1-9]{25,34} and eventually creates or updates a user tag for the matched address.
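The address-matching step can be sketched with Python's standard `re` module. The function name `extract_addresses`, the sample HTML, and the added word boundaries are assumptions of this sketch rather than part of BlockTag's crawler.

```python
import re

# Base58 excludes 0, O, I, and l; the addresses considered here start with 1 or 3.
# The \b word boundaries are an addition of this sketch, so that a match
# cannot start or end in the middle of a longer alphanumeric token.
BITCOIN_ADDRESS = re.compile(r'\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b')


def extract_addresses(html):
    """Return the unique candidate addresses in page text, in order of appearance."""
    seen = []
    for match in BITCOIN_ADDRESS.findall(html):
        if match not in seen:
            seen.append(match)
    return seen


page = ('<td>BTC: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa</td>'
        '<td>not an address: 1abc</td>')
addresses = extract_addresses(page)
```

A regex match only identifies candidates: it does not verify the Base58Check checksum, so a stricter pipeline would validate each candidate before creating a tag.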

In addition to a BitcoinTalk user crawler, BlockTag implements a Twitter user crawler that consumes Twitter’s API, a Tor hidden service crawler that scrapes onion landing pages of Ahmia-indexed service providers, and a Blockchain.info text crawler that scrapes textual labels that are self-signed by address owners or submitted by arbitrary users. By default, the vertical crawlers create Bitcoin address tags, but can be configured to scrape auxiliary data of other cryptocurrencies, including Litecoin, Namecoin, and Zcash.

3.3. Query engine

BlockTag’s query engine is inspired by NoSQL document databases, such as MongoDB (Chodorow, 2013), where queries are specified using a JSON-like structure. Selecting, grouping, and aggregating transactions is provided through a simple query interface.

To write a query, the analyst starts by specifying the block, transaction, or address properties that the results should match using the where parameter. BlockTag treats each property as having an implicit boolean AND. It also supports boolean OR queries natively through a special $or operator. In addition to exact matches, BlockTag has operators for string matching, numerical comparisons, etc. The analyst can also specify the properties by which the results are grouped using the group_by parameter. Finally, the analyst can specify which properties to return per result with the select parameter. While this query interface is suitable for many tasks, BlockTag’s Python package also exposes lower-level functionality to analysts who have tasks with more sophisticated requirements. Listing 3 shows how to find Twitter user accounts that paid to Silk Road in 2014.

accounts = chain.query(
    level='transaction',
    select='input.address.tag.info.account',
    where={
        'input': {
            'address': {
                'tag': {
                    'type': 'user',
                    'source': 'twitter'
                }
            }
        },
        'output': {
            'address': {
                'tag': {
                    'type': 'service',
                    'source': 'tor',
                    'info': {
                        'provider': { '$like': 'silkroad' }
                    }
                }
            }
        },
        'time': '2014'
    },
    group_by='input.address.tag.info.id',
    having='sum(input.value) >= (10.0 * 10**7)',
    clustering={ 'source': 'inputs', 'method': 'original' }
)
Listing 3: Querying a blockchain.
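The matching rules of the where parameter (implicit AND across properties, `$or` for alternatives, `$like` for string matching) can be illustrated with a small recursive matcher. This is a sketch under stated assumptions, not BlockTag's query engine: the `matches` function is hypothetical, `$like` is assumed to be a case-insensitive substring test, `$or` is assumed to take a list of alternative conditions, and each address carries a single tag object instead of a list.

```python
def matches(record, where):
    """Return True if a nested record satisfies a nested `where` condition."""
    for key, expected in where.items():
        if key == '$or':
            # At least one alternative must match the current record.
            if not any(matches(record, alt) for alt in expected):
                return False
        elif key == '$like':
            # Assumed semantics: case-insensitive substring match on a string.
            if not isinstance(record, str) or expected.lower() not in record.lower():
                return False
        elif isinstance(expected, dict):
            # Descend into the nested property; a missing property never matches.
            child = record.get(key) if isinstance(record, dict) else None
            if child is None or not matches(child, expected):
                return False
        else:
            # Exact match on a leaf property.
            if not isinstance(record, dict) or record.get(key) != expected:
                return False
    return True  # every property matched (implicit AND)


tx = {
    'input': {'address': {'tag': {'type': 'user', 'source': 'twitter'}}},
    'output': {'address': {'tag': {'type': 'service', 'source': 'tor',
                                   'info': {'provider': 'SilkRoad Market'}}}},
}
where = {
    'input': {'address': {'tag': {
        'type': 'user',
        '$or': [{'source': 'twitter'}, {'source': 'bitcointalk'}],
    }}},
    'output': {'address': {'tag': {'info': {'provider': {'$like': 'silkroad'}}}}},
}
is_match = matches(tx, where)
```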

One important capability of BlockTag’s query engine is address clustering (Meiklejohn et al., 2013), which can be configured to operate on a particular source, namely inputs, outputs, or both, using one of the supported clustering methods, all through the clustering parameter. Address clustering expands the set of Bitcoin addresses that are mapped to a unique user, service, or text tag through a technique called closure analysis. As a result, this allows the analyst to identify more links between different tags by considering a larger number of transactions in the blockchain.

BlockTag supports multiple address clustering methods. The first method is the original closure heuristic proposed by Meiklejohn et al. (Meiklejohn et al., 2013), which works as follows: If a transaction has addresses A and B as inputs, then A and B belong to the same cluster. The rationale behind this heuristic is that such addresses are highly likely to be controlled by the same entity, as they are signed by the private keys whose owner performed the transaction. While efficient, this method can result in large clusters that include addresses which belong to different entities, due to mixing services and CoinJoin transactions. In order to tackle this issue, BlockTag implements a novel minimal clustering method that prematurely terminates the original clustering method before the clusters grow to their maximum size. Minimal clustering includes a final trimming phase that finds clusters sharing at least one address and merges them, after which they are removed. Doing so ensures that the clusters are mutually exclusive and likely to belong to separate entities, but also means the clusters are smaller than usual, reducing the chance of linking different tags as a result.
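The original multi-input heuristic amounts to computing connected components over co-spent addresses, which a union-find structure handles well. The sketch below covers only that heuristic, not BlockTag's minimal clustering, and represents each transaction simply as a list of input addresses; the function name and input format are assumptions.

```python
def cluster_by_inputs(transactions):
    """Group addresses that co-occur as inputs of the same transaction."""
    parent = {}

    def find(addr):
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving keeps trees shallow
            addr = parent[addr]
        return addr

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in transactions:
        if not inputs:
            continue
        first = inputs[0]
        find(first)  # register addresses of single-input transactions too
        for other in inputs[1:]:
            union(first, other)

    clusters = {}
    for addr in parent:
        clusters.setdefault(find(addr), set()).add(addr)
    # Sort for a deterministic result.
    return sorted(clusters.values(), key=lambda cluster: sorted(cluster))


# A and B are co-spent, then B and C, so all three end up in one cluster.
transactions = [['A', 'B'], ['B', 'C'], ['D'], ['E', 'F']]
clusters = cluster_by_inputs(transactions)
```

The transitive merge of A, B, and C also shows why clusters can grow large: a single CoinJoin or mixing transaction would link otherwise unrelated entities.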

4. Real-world deployment

We now describe our experience in deploying BlockTag.

4.1. Ethical considerations

BlockTag’s functionality depends on tags that map blocks, transactions, and addresses to user accounts, service providers, text labels, and other types of tags. This allows BlockTag to link tags to each other by finding blockchain transactions involving tag identifiers. For example, it is sufficient to show a transaction from Alice’s address to Bob’s address to link them together. Tag values represent auxiliary data that is collected from public sources, which include social networks, Tor hidden services, and blockchains. As such, we are faced with two privacy-related ethical concerns, namely linking and data collection. In what follows, we discuss the actions we took to address them, as we worked with our institution’s IRB to approve BlockTag’s deployment.

First of all, the information gathered from anonymous cryptocurrency payments without linking is often limited and non-actionable for privacy research and law enforcement. BlockTag is designed to address this limitation, building on top of previous studies that showed the feasibility, utility, and value of linking users through Bitcoin transactions and public data sources (Reid and Harrigan, 2011; Meiklejohn et al., 2013). BlockTag does not put users at any additional risk, but rather exposes existing ones and corrects common misconceptions, such as Bitcoin being a private or anonymous online payment system. When needed, we reached out to legitimate users whose privacy is at risk, and informed them about how their Bitcoin transactions link to their online activities and what they can do about it. We also posted a notice on the BitcoinTalk forum concerning deanonymizing Tor hidden service users (https://bitcointalk.org/index.php?topic=2602885).

Concerning data collection, our deployment uses crawlers which target solely public data sources. The crawlers are polite, passive, and respect robots.txt instructions. This means we do not collect data from sources that require authentication, payment, or email exchange. Also, all collected data is secured and stored on our private infrastructure whose access is restricted to authorized researchers.

Finally, we have shared our deployment plans with a few stakeholders in order to get early feedback. In response, we engaged with the U.S. Federal Trade Commission, a national financial regulatory authority, two law firms, and an international news agency, which were interested in BlockTag and its potential for protecting users, enforcing the law, and uncovering cyber criminals, respectively. This also indicates that evidence acquired through BlockTag is admissible in a court of law.

4.2. Setup

We deployed BlockTag on a single machine from Jan 1 to March 21, 2018. The machine was running Ubuntu v16.04.4 LTS, Bitcoin Core v0.16.0, and BlockSci v0.5.0 on two 2GHz quad-core CPUs, 128GB of system memory, and 2TB of network-attached storage.

We used BlockTag to tag Bitcoin’s blockchain at the address level. As of March 2018, the crawlers have ingested nearly 5B tweets, 2.2M BitcoinTalk profiles, 1.5K Tor onion pages, and 30K Blockchain.info labels, resulting in 45K user, 88 service, and 29K text tags. We used a previously collected dataset consisting of 4.8B tweets, which were posted in 2014, to bootstrap Twitter user tags. Moreover, for the first application where we link users to services, we configured address clustering for inputs from user tags using the minimal clustering method. We summarize the created tags in Table 1.

| Source | Type | # addresses (original) | # addresses (clustering) |
|---|---|---|---|
| BitcoinTalk | User | 40,970 | 19,213,141 |
| Twitter | User | 4,183 | 623,189 |
| Tor Network | Service | 88 | |
| Blockchain.info | Text | 29,643 | |

Table 1. Summary of created tags.

5. Applications

We demonstrate the capabilities of BlockTag in the following.

5.1. Linking users to services

In e-crime investigations of illegal Tor hidden services, such as Silk Road, analysts often try to link cryptocurrency transactions to user accounts and activities. This can start with a known transaction that is part of a crime, such as a Bitcoin payment to buy drugs on Silk Road. Alternatively, wider search criteria can be used to understand the landscape of activities of illegal services, such as finding service providers that receive the most payments. Either way, the analysts need to link users to services. In BlockTag, this can be achieved in a single query, as shown in Listing 4.

user_service_txes = chain.query(
    level='transaction',
    select=['input.address.tag.info.account', 'output.address.tag.info.provider', 'self.txes'],
    where={
        'input': {
            'address': {
                'tag': { 'type': 'user' }
            }
        },
        'output': {
            'address': {
                'tag': {
                    'type': 'service',
                    'source': 'tor'
                }
            }
        }
    },
    group_by=['input.address.tag.info.id', 'output.address.tag.info.id'],
    clustering={ 'source': 'inputs', 'method': 'minimal' }
)
Listing 4: Linking different tags via transactions.

This resulted in linking 28 Twitter user accounts to 14 service providers via 167 transactions and 97 BitcoinTalk user accounts to 20 service providers via 115 transactions. Some of these users were linked to multiple service providers. In total, 125 users were linked to 20 services. The results suggest that although Twitter users are smaller in number compared to BitcoinTalk users, they are more active and have a larger number of transactions with services. In fact, some of these users are “returning customers,” as they have performed multiple transactions with the same service provider.

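| Name | # linked users (Twitter) | # linked users (BitcoinTalk) | # linked users (total) |
|---|---|---|---|
| WikiLeaks | 11 | 35 | 46 |
| Silk Road | 4 | 18 | 22 |
| Internet Archives | 3 | 13 | 16 |
| Snowden Defense Fund | 3 | 8 | 11 |
| The Pirate Bay | 3 | 7 | 10 |
| DarkWallet | 9 | 1 | 10 |
| ProtonMail | 1 | 7 | 8 |
| Darknet Mixer | 1 | 2 | 3 |
| Liberty Hackers | 0 | 2 | 2 |
| CryptoLocker Ransomware | 1 | 0 | 1 |

Table 2. Top-10 linked service providers.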

Another way to present these results is from the standpoint of services. Table 2 lists the top-10 service providers sorted by how many users were linked to them. The list is topped by WikiLeaks, which is a service that publishes secret information provided by anonymous sources, with 46 linked users. This is followed by Silk Road, the famous black market, with transactions from 22 users whose input coins have been seized by the FBI. Although the payment address of Silk Road was seized, it still appeared in transactions until recently. However, based on further analysis, we found that a number of transactions were performed prior to the seizure. Ranked fifth, The Pirate Bay, which is known for infringing IP and copyright laws by facilitating the distribution of protected digital content, was linked to 10 users. As the linked users have accounts with various personally identifiable information (PII), they are vulnerable to the threat of deanonymizing their true identities. We next focus on two case studies that illustrate this threat in more detail.

Actionable links

Purchasing products and services of black markets is generally considered illegal and calls for legal action. Some of the 22 users who are linked to Silk Road through transactions with seized coins shared enough PII to completely deanonymize their identity. For example, one user is a 16-year-old male from Crossville, Tennessee, U.S. The user has been a registered BitcoinTalk member since 2013, and has a transaction with Silk Road in October 2013, the takedown year, when he was around 13 years old. The corresponding user account points to his personal website, which contains links to his user profiles on Facebook, Twitter, and YouTube. Even if users do not share PII or use fake identities on their accounts, simply having an account on social networks is enough to track them online, or even secure a subpoena to collect identifiable information, such as login IP addresses. For example, three out of the 18 BitcoinTalk users recently logged in to the website.

A matter of jurisdiction

One of the users who are linked to The Pirate Bay is a 36-year-old male from Sweden. The Pirate Bay was founded by a Swedish organization called Piratbyrån. Furthermore, the original founders of the website were found guilty in a Swedish court of copyright infringement. Since then, the website has been changing its domain constantly, and eventually operated as a Tor hidden service. Consequently, having such a link to The Pirate Bay through recent transactions in Sweden can lead to legal investigation, at least, and potentially be incriminating.

5.2. Market economics

Keeping track of market statistics describing Tor hidden services is useful for identifying thriving services, measuring the impact of law enforcement, and prioritizing e-crime investigations. As such, an analyst may start with calculating a financial “balance sheet” for service providers, which typically includes the number of transactions with which a service is involved (i.e., volume), the amount of coins a service has received or sent (i.e., money flow), and the difference between the timestamps of the last and first transactions (i.e., operation lifetime). These statistics can be calculated in BlockTag using two queries, as shown in Listing 5.

balance_sheet = chain.query(
    level='transaction',
    select=['output.address.tag.info.provider as @name',
        'count(self.txes) as volume',
        'sum(input.value) as incoming',
        'min(time) as first_tx',
        'max(time) as last_tx',
        'date_diff(max(time), min(time)) as num_days'],
    where={
        'output': {
            'address': {
                'tag': {'type': 'service', 'source': 'tor'}
            }
        }
    },
    group_by='output.address.tag.info.id'
)
balance_sheet.join(
    results=chain.query(
        level='transaction',
        select=['input.address.tag.info.provider as @name',
            'sum(input.value) as outgoing'],
        where={
            'input': {
                'address': {
                    'tag': {'type': 'service', 'source': 'tor'}
                }
            }
        },
        group_by='input.address.tag.info.id'
    ),
    on='@name'
)
Listing 5: Calculating a balance sheet for service tags.

BlockTag supports joining queries via the join() method of a query’s results object. The join method operates on properties that can be aliased and referenced across queries using the @alias operator. In Listing 5, the two queries are joined in order to calculate the money flow, as an address of a service tag can be an input or an output of a transaction, depending on whether it is an incoming or outgoing payment. Table 3 shows the market statistics for the top-10 service providers ranked by incoming coins.

Name | Volume (# txs) | Incoming (BTC) | Outgoing (BTC) | First tx (dd/mm/yyyy) | Last tx (dd/mm/yyyy) | Lifetime (# days)
Silk Road | 1,242 | 29,676.99 | 29,658.80 | 02/10/2013 | 19/03/2018 | 1,628
WikiLeaks | 26,399 | 4,043.00 | 4,040.74 | 15/06/2011 | 21/03/2018 | 2,470
VEscudero Escrow Service | 192 | 842.42 | 842.42 | 27/05/2012 | 20/08/2017 | 1,910
Internet Archives | 2,957 | 775.86 | 746.89 | 06/09/2013 | 21/03/2018 | 1,656
Freenet Project | 280 | 691.87 | 687.62 | 23/02/2011 | 16/03/2018 | 2,577
Snowden Defense Fund | 1,722 | 218.95 | 218.95 | 11/08/2013 | 18/03/2018 | 1,680
ProtonMail | 3,096 | 208.40 | 208.36 | 17/06/2014 | 18/03/2018 | 1,369
Ahmia Search Engine | 1,423 | 176.51 | 176.50 | 27/03/2013 | 06/03/2018 | 1,652
DarkWallet | 983 | 114.62 | 97.40 | 16/04/2014 | 02/11/2016 | 931
The Pirate Bay | 1,214 | 76.80 | 76.80 | 29/05/2013 | 21/08/2017 | 1,544
Table 3. Balance sheet of top-10 service providers ranked by incoming coins.

Volume

While the number of created service tags is small, the corresponding service providers have been involved in a relatively large number of transactions. For example, WikiLeaks tops the list with 26.4K transactions. The Darknet Mixer, which did not make it to the top-10 list in Table 3, has a volume of 22.1K transactions, which is greater than that of the remaining services combined. One explanation for this popularity is that users are aware of the possibility of linking, and use mixing services to make tracing more difficult and improve their anonymity.

Money flow

One interesting observation is that service providers have a nearly zero balance, which means almost the same amount of money comes in and goes out of their addresses. This indicates that the money is likely distributed to other addresses and is not kept on payment-receiving addresses. One explanation for this behavior is that by distributing funds among multiple addresses, a service provider can reduce coin traceability. Moreover, service providers still need to distribute their revenues among owners, sellers, and other stakeholders. Among all service providers listed in Table 3, Silk Road stands out with an income of about 29.7K BTC.
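The near-zero balance can be checked directly from the figures in Table 3. The sketch below copies the incoming and outgoing amounts for five of the listed providers and computes the share of received coins that remains on the tagged addresses; the function name retained_share is ours.

```python
# Incoming and outgoing BTC copied from Table 3 (subset of rows).
TABLE_3 = {
    "Silk Road": (29676.99, 29658.80),
    "WikiLeaks": (4043.00, 4040.74),
    "Internet Archives": (775.86, 746.89),
    "DarkWallet": (114.62, 97.40),
    "The Pirate Bay": (76.80, 76.80),
}


def retained_share(incoming, outgoing):
    """Fraction of received coins still held on the tagged addresses."""
    return (incoming - outgoing) / incoming


shares = {name: retained_share(i, o) for name, (i, o) in TABLE_3.items()}

# Print providers from the smallest to the largest retained share.
for name, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{name:20s} {share:7.2%}")
```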

Lifetime

The services vary in their lifetime, ranging from two to seven years of operation. The first transaction date does not imply that the service provider began its operation on that date. It merely indicates the date on which the service provider started receiving payments through the tagged addresses. Looking at last transaction dates, all but three services are still active in 2018. For example, Silk Road has been receiving money since October 2013, even after the address was seized by the FBI and its coins were auctioned for sale by the U.S. Justice Department in June 2014. However, a large number of post-seizure transactions appear to be novelty tips.

5.3. Forensics

Organizations responsible for consumer protection, such as trade commission agencies and financial regulatory authorities, have a mandate to research and identify fraud cases involving cryptocurrencies, including unlawful initial coin offerings and Ponzi schemes. Given the popularity of Ponzi schemes in Bitcoin (Vasek and Moore, 2015, 2018), we focus on this type of fraud and show how BlockTag can help analysts flag users who are likely victims or operators of such schemes.

A Ponzi scheme, also known as a high yield investment program, is a fraudulent financial activity promising unusually high returns on investment, and is named after Charles Ponzi, a famous fraudster from the 1920s. The scheme is designed so that only early investors benefit; once its sustainability is at risk, the majority of shareholders lose the money they invested (Artzrouni, 2009). Among various Ponzi schemes in Bitcoin, MMM is considered one of the largest, and it is hard to detect solely based on blockchain transaction analysis (Bartoletti et al., 2018), highlighting the need for a systematic integration of auxiliary data into blockchain analysis. As such, an analyst can start the investigation with BlockTag using a full-text search query of keywords associated with the MMM scheme, such as its name, without requiring prior knowledge of who is involved in the scheme or how it works, as shown in Listing 6.

mmm_tags = chain.query(
    level='address',
    select=['self.address', 'tag.id'],
    where={
        'tag': {
            'type': {'$in': ['user', 'text']},
            'info': {'$like': 'mmm'}
        }
    }
)
Listing 6: Searching for tags using keywords.

This resulted in 24.2K user accounts, all of which are BitcoinTalk users, and 202 Blockchain.info text labels. For BitcoinTalk user accounts, the full-text search matched the website property of an account, which contained a URL pointing to the user’s profile on the MMM website. As for Blockchain.info text tags, the search matched the self-signed label property, which contained the substring “mmm”, as summarized in Table 4. We next analyze the user accounts looking for clues related to the MMM operation.
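The matching behavior described above can be approximated by a case-insensitive substring test over a tag's info properties. The sketch below is an assumption about what '$in' and '$like' do, written in plain Python over hypothetical tag records; the field names and URLs are illustrative and are not BlockTag's actual schema.

```python
# Hypothetical tag records; field names are illustrative only.
TAGS = [
    {"id": 1, "type": "user", "info": {"website": "http://mmm.example/profile/42"}},
    {"id": 2, "type": "text", "info": {"label": "MMM Indonesia"}},
    {"id": 3, "type": "service", "info": {"provider": "mmm mirror"}},
    {"id": 4, "type": "user", "info": {"website": "http://example.org"}},
]


def like(info, keyword):
    """Case-insensitive substring match over all string values of a tag's info."""
    keyword = keyword.lower()
    return any(isinstance(v, str) and keyword in v.lower() for v in info.values())


def search(tags, types, keyword):
    """Return ids of tags whose type is in `types` and whose info matches `keyword`."""
    return [t["id"] for t in tags if t["type"] in types and like(t["info"], keyword)]


matches = search(TAGS, {"user", "text"}, "mmm")
```

Tag 3 matches the keyword but is excluded by the type filter, which mirrors how Listing 6 restricts the search to user and text tags.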

User demographics

Out of 24.2K users, 52.86%, 18.31%, and 12.48% shared their gender, age, and geo-location information, respectively. Based on this data, we found that the users are mostly male (75.44%), between 20–40 years old (average=32), and are located worldwide in more than 80 different countries. However, 70.69% of the users were located in only five countries, namely Indonesia, China, India, South Africa, and Thailand. Interestingly, most of these countries have a corresponding MMM label, as listed partially in Table 4.

Forum activity

We used three properties of a user account that relate to activity on the forum, namely, date_registered, last_seen, and num_activities. We found that 99.44% of the users registered on the forum between August 2015 and March 2016. Moreover, 98.21% of the users made their last activity on the forum during the same period. This suggests that users have short-lived accounts. In fact, we found that 94.25% of the users were active for 30 days or less, and that 78.45% of users were dormant, meaning they were active for less than a day after registration. This also suggests that most of the users are not engaged with the forum. Indeed, only 313 users made at least one activity, and even these users engaged with the forum no more than once a day, on average. After manually inspecting the accounts on the website, we found that most of them were created as part of its “MMM Extra” scheme, which promises “up to 100% return per month for performing simple daily tasks that take 5–15 min,” such as promoting MMM on social networks. This was evident from the accounts’ signatures, which the crawler did not parse, that included messages such as “MMM Extra is the right step towards the goal” and “MMM participants get up to 100% per month.”
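The activity statistics above follow from simple date arithmetic on the three account properties. The sketch below shows one way to compute them on hypothetical accounts; it assumes day-granularity dates, so "active for less than a day" is approximated as registering and being last seen on the same date.

```python
from datetime import date

# Hypothetical accounts using the three properties named in the text.
ACCOUNTS = [
    {"date_registered": date(2015, 9, 1), "last_seen": date(2015, 9, 1), "num_activities": 0},
    {"date_registered": date(2015, 10, 5), "last_seen": date(2015, 10, 20), "num_activities": 0},
    {"date_registered": date(2015, 8, 15), "last_seen": date(2016, 2, 1), "num_activities": 12},
    {"date_registered": date(2016, 1, 10), "last_seen": date(2016, 1, 10), "num_activities": 0},
]


def active_days(account):
    """Days between registration and the last time the account was seen."""
    return (account["last_seen"] - account["date_registered"]).days


def share(accounts, predicate):
    """Fraction of accounts satisfying the predicate."""
    return sum(1 for a in accounts if predicate(a)) / len(accounts)


dormant = share(ACCOUNTS, lambda a: active_days(a) < 1)
short_lived = share(ACCOUNTS, lambda a: active_days(a) <= 30)
engaged = share(ACCOUNTS, lambda a: a["num_activities"] >= 1)
```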

Financial operation

As tags are linked through transactions in BlockTag, we can explore how the MMM scheme operates financially through transaction graph analysis (Ron and Shamir, 2013). In this analysis, Bitcoin transactions are modeled as a weighted, directed graph where nodes represent addresses, edges represent transactions, and weights represent information about transactions, such as input/output values and dates. Analyzing the topological properties of this graph can provide insights into which addresses are important and how the money flows. For example, having a few “influential” nodes and a small clustering coefficient suggests that most of the money funnels through these nodes and does not flow back to others, which is indicative of a Ponzi operation (Vasek and Moore, 2015, 2018; Bartoletti et al., 2018). In BlockTag, an analyst can easily model case-specific transaction graphs by linking tags based on some search criteria, as shown in Listing 7.

Label | Frequency
mmm universe.help | 46
mmm global | 13
bonus from mmm universe.help | 9
mmm indonesia | 6
mmm nusantara | 4
mmm china | 2
mmm india | 2
mmm indonesia | 2
mmm philippines | 2
mmm russia | 2
Table 4. Top-10 frequent MMM labels.

mmm_txes = chain.query(
    level='transaction',
    select=['input.address.tag.id', 'output.address.tag.id', 'self.txes'],
    where={
        'input': {
            'address': {
                'tag': {
                    'type': {'$in': ['user', 'text']},
                    'info': {'$like': 'mmm'}
                }
            }
        },
        'output': {
            'address': {
                'tag': {
                    'type': {'$in': ['user', 'text']},
                    'info': {'$like': 'mmm'}
                }
            }
        }
    },
    group_by=['input.address.tag.type', 'output.address.tag.type']
)
Listing 7: Linking tags based on full-text search.
Figure 2. MMM transaction graph.

We used the query in Listing 7 to model and analyze five transaction graphs, one for every combination of tag types, as summarized in Table 5. The MMM transaction graph includes addresses of any type, and consists of 14.3K addresses (i.e., order) and 32.5K transactions (i.e., size). This graph is sparsely connected, as suggested by the small largest strongly connected component (LSCC), low clustering, and long distance measures. Moreover, it consists of two subgraphs: the user→user subgraph, which is also sparsely connected, and the label→label subgraph, which is dense and small. Even though the two subgraphs are loosely connected through only 170 edges, an order of magnitude more money has flowed from users to labels than in the reverse direction, as shown in Figure 2.

Input type | Output type | Order | Size | LWCC #nodes | LWCC #edges | LSCC #nodes | LSCC #edges | Average clustering | #triangles | %closed triangles | Diameter (LSCC) | Radius (LSCC)
User | User | 14,227 | 31,819 | 13,914 | 31,631 | 5,850 | 17,498 | 0.11 | 6,566 | 0.08 | 17 | 7
User | Label | 129 | 125 | 96 | 103 | 1 | 0 | 0.00 | 0 | 0.00 | 0 | 0
Label | User | 64 | 45 | 10 | 9 | 1 | 0 | 0.00 | 0 | 0.00 | 0 | 0
Label | Label | 61 | 508 | 54 | 498 | 20 | 246 | 0.64 | 943 | 61.04 | 3 | 2
Any | Any | 14,319 | 32,497 | 14,002 | 32,307 | 5,934 | 18,128 | 0.11 | 7,576 | 0.09 | 17 | 7
(LWCC and LSCC denote the largest weakly and strongly connected components, respectively.)
Table 5. Properties of MMM transaction (sub)graphs.
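For readers unfamiliar with the component columns in Table 5, the following pure-Python sketch computes the order, size, and largest weakly and strongly connected components of a small directed graph. The edge list is hypothetical, and the reachability-intersection approach is chosen for brevity rather than speed; a graph library would be the practical choice at the scale of the MMM graph.

```python
from collections import defaultdict

# Hypothetical directed edges (paying address -> receiving address).
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "f")]


def adjacency(edges, reverse=False, undirected=False):
    """Build an adjacency map that contains every node, including sinks."""
    adj = defaultdict(set)
    for u, v in edges:
        if reverse:
            u, v = v, u
        adj[u].add(v)
        adj.setdefault(v, set())
        if undirected:
            adj[v].add(u)
    return adj


def reachable(adj, start):
    """Set of nodes reachable from `start` by depth-first search."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def largest_component(edges, strong):
    """Largest strongly (strong=True) or weakly connected component."""
    if strong:
        fwd, bwd = adjacency(edges), adjacency(edges, reverse=True)
    else:
        fwd = bwd = adjacency(edges, undirected=True)
    best, assigned = set(), set()
    for node in list(fwd):
        if node in assigned:
            continue
        # A node's component is what it can reach and what can reach it.
        component = reachable(fwd, node) & reachable(bwd, node)
        assigned |= component
        if len(component) > len(best):
            best = component
    return best


order = len(adjacency(EDGES))  # number of addresses
size = len(EDGES)              # number of edges
lwcc = largest_component(EDGES, strong=False)
lscc = largest_component(EDGES, strong=True)
```

In this toy graph the cycle a→b→c→a forms the LSCC, while node d belongs only to the weakly connected component because no money flows back from it.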

To find influential nodes in the graph, we computed their PageRank, where weights represented input address values of transactions. All of the top-10 ranked nodes were located in the user→user subgraph and mapped to unique BitcoinTalk users. After manually inspecting the corresponding accounts, we found that the first and the third users have been reported as scammers on BitcoinTalk for operating fraudulent services, namely Dr.BTC and OreMine.Org. While the first user has received a total of 426.7K BTC on her address, the third has received a staggering total of 1.8M BTC on his address, which is associated with a wallet address of Huobi, an exchange service, suggesting that the user has exchanged the received coins.
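As an illustration of the ranking step, the sketch below implements weighted PageRank by power iteration in plain Python. It is not the implementation used in the analysis; the damping factor, iteration count, node names, and edge weights are assumptions, and edge weights are assumed to be strictly positive.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank; `edges` is a list of (source, target, weight > 0)."""
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    out_weight = {n: 0.0 for n in nodes}
    for u, _, w in edges:
        out_weight[u] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each node passes its rank along its edges in proportion to their weight.
        for u, v, w in edges:
            new_rank[v] += damping * rank[u] * w / out_weight[u]
        rank = new_rank
    return rank


# Hypothetical payments (payer, payee, value): three users pay one operator.
PAYMENTS = [
    ("user1", "operator", 5.0),
    ("user2", "operator", 3.0),
    ("user3", "operator", 1.0),
    ("user1", "user2", 0.5),
]

ranks = pagerank(PAYMENTS)
top_node = max(ranks, key=ranks.get)
```

Because most of the value funnels into a single address, that address receives the highest rank, which is the pattern the analysis looks for.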

6. Discussion

In what follows, we discuss the limitations of our work and outline our plan for current and future work.

Limitations

BlockTag’s main limitation is the validity of its tags, since they are created automatically by crawlers from open, public data sources. This limitation is part of a larger problem that is common with Internet content providers, such as Google and Facebook, especially when this content is generated mostly by users (Yin et al., 2008; Li et al., 2016). In general, the validity issue is especially important for user identities, as attackers and fraudsters can always create fake accounts in order to hide their real identity (Ferrara et al., 2016). While doing so improves their anonymity, law enforcement agencies can use the links found through BlockTag to secure a subpoena in order to collect more information about suspects from website operators (Theymos, 2014).

Work in progress

We are designing BlockSearch, an open-source Google-like search layer that sits on top of BlockTag. BlockSearch allows analysts to search the blockchain for useful information in plain English and in real time, without having to go through the hassle of performing low-level queries using BlockTag. The system also provides a dashboard for analysts that displays real-time results of important queries, such as the ones we presented in the paper. Based on feedback from trade commission agencies and financial regulatory authorities, such capabilities are extremely helpful to protect customers, comply with know your customer (KYC) and anti-money laundering (AML) laws, and draft new, investor-friendly cryptocurrency regulations.

Future work

In order to address the main limitation of BlockTag, we plan to define confidence scores for tag sources. The scores can be computed using various “truth discovery” algorithms (Dong et al., 2009), which are generally based on the intuition that the more sources confirm a tag the more confidence is assigned to it.
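The intuition behind such scores can be sketched with a simple voting baseline: the confidence of a tag is the share of distinct sources that assert it for a given address. This is only an illustration of the stated intuition, not the algorithm of Dong et al. (2009), which additionally accounts for dependence between sources; the claims and source names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical claims: (address, claimed owner, source that asserts the claim).
CLAIMS = [
    ("addr1", "alice", "bitcointalk"),
    ("addr1", "alice", "twitter"),
    ("addr1", "bob", "blockchain.info"),
    ("addr2", "carol", "twitter"),
]


def confidence(claims):
    """Score each (address, owner) pair by its share of supporting sources."""
    sources = defaultdict(set)
    for address, owner, source in claims:
        sources[(address, owner)].add(source)

    # Total number of source votes cast for each address.
    totals = defaultdict(int)
    for (address, _), supporting in sources.items():
        totals[address] += len(supporting)

    return {key: len(supporting) / totals[key[0]]
            for key, supporting in sources.items()}


scores = confidence(CLAIMS)
```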

BlockTag is modular by design. This means we can easily enhance or add new capabilities. As such, we plan to implement more vertical crawlers for services such as WalletExplorer (https://www.walletexplorer.com), ChainAlysis (https://www.chainalysis.co), BitcoinWhosWho (https://bitcoinwhoswho.com), and Reddit (https://www.reddit.com). We also plan to support more clustering methods and develop a systematic way to automatically tag clusters, in addition to blocks, transactions, and addresses, based on label propagation algorithms (Gregory, 2010).

7. Conclusion

Blockchain analysis has become a hot topic among researchers and law enforcement agencies for applications that demand more effective tools. While state-of-the-art analysis systems, such as BlockSci, are efficient, they are not designed to annotate and analyze auxiliary blockchain data. We presented BlockTag, an open-source tagging system for blockchains. We used BlockTag to uncover privacy issues with using Bitcoin in Tor hidden services, and flag Bitcoin addresses that are likely to be part of a large Ponzi scheme.

References

  • Artzrouni (2009) Marc Artzrouni. 2009. The mathematics of Ponzi schemes. Mathematical Social Sciences 58, 2 (2009), 190–201.
  • Bartoletti et al. (2018) Massimo Bartoletti, Barbara Pes, and Sergio Serusi. 2018. Data mining for detecting Bitcoin Ponzi schemes. arXiv preprint arXiv:1803.00646 (2018).
  • Biryukov et al. (2014a) Alex Biryukov, Dmitry Khovratovich, and Ivan Pustogarov. 2014a. Deanonymisation of clients in Bitcoin P2P network. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 15–29.
  • Biryukov and Pustogarov (2015) Alex Biryukov and Ivan Pustogarov. 2015. Bitcoin over Tor isn’t a good idea. In Security and Privacy (SP), 2015 IEEE Symposium on. IEEE, 122–134.
  • Biryukov et al. (2014b) Alex Biryukov, Ivan Pustogarov, Fabrice Thill, and Ralf-Philipp Weinmann. 2014b. Content and popularity analysis of Tor hidden services. In Distributed Computing Systems Workshops (ICDCSW), 2014 IEEE 34th International Conference on. IEEE, 188–193.
  • Bohr and Bashir (2014) Jeremiah Bohr and Masooda Bashir. 2014. Who uses bitcoin? an exploration of the bitcoin community. In 2014 Twelfth Annual Conference on Privacy, Security and Trust (PST). IEEE, 94–101.
  • Bonneau et al. (2015) Joseph Bonneau, Andrew Miller, Jeremy Clark, Arvind Narayanan, Joshua A Kroll, and Edward W Felten. 2015. Sok: Research perspectives and challenges for bitcoin and cryptocurrencies. In Security and Privacy (SP), 2015 IEEE Symposium on. IEEE, 104–121.
  • Castillo (2016) Michael del Castillo. 2016. Bitcoin Remains Most Popular Digital Currency on Dark Web. https://www.coindesk.com/bitcoin-remains-most-popular-digital-currency-on-dark-web/. (2016). [online; accessed 01-July-2018].
  • Chodorow (2013) Kristina Chodorow. 2013. MongoDB: The Definitive Guide: Powerful and Scalable Data Storage. O’Reilly Media, Inc.
  • Dong et al. (2009) Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. 2009. Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment 2, 1 (2009), 550–561.
  • DuPont and Squicciarini (2015) Jules DuPont and Anna Cinzia Squicciarini. 2015. Toward de-anonymizing bitcoin by mapping users location. In Proceedings of the 5th ACM Conference on Data and Application Security and Privacy. ACM, 139–141.
  • Facebook (2014) Facebook. 2014. RocksDB: An embeddable persistent key-value store for fast storage. https://rocksdb.org. (2014). [online; accessed 01-July-2018].
  • Ferrara et al. (2016) Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. 2016. The rise of social bots. Commun. ACM 59, 7 (2016), 96–104.
  • Fleder et al. (2015) Michael Fleder, Michael S Kester, and Sudeep Pillai. 2015. Bitcoin transaction graph analysis. arXiv preprint arXiv:1502.01657 (2015).
  • Gregory (2010) Steve Gregory. 2010. Finding overlapping communities in networks by label propagation. New Journal of Physics 12, 10 (2010), 103018.
  • Jordan et al. (2013) Sarah Meiklejohn, Marjori Pomarole, Grant Jordan, Kirill Levchenko, Damon McCoy, Geoffrey M Voelker, and Stefan Savage. 2013. A Fistful of Bitcoins: Characterizing Payments Among Men with No Names. (2013).
  • Kalodner et al. (2017) Harry Kalodner, Steven Goldfeder, Alishah Chator, Malte Möser, and Arvind Narayanan. 2017. BlockSci: Design and applications of a blockchain analysis platform. arXiv preprint arXiv:1709.02489 (2017).
  • Li et al. (2016) Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han. 2016. A survey on truth discovery. ACM Sigkdd Explorations Newsletter 17, 2 (2016), 1–16.
  • Meiklejohn et al. (2013) Sarah Meiklejohn, Marjori Pomarole, Grant Jordan, Kirill Levchenko, Damon McCoy, Geoffrey M Voelker, and Stefan Savage. 2013. A fistful of bitcoins: characterizing payments among men with no names. In Proceedings of the 2013 conference on Internet measurement conference. ACM, 127–140.
  • Miller (2013) Justin J Miller. 2013. Graph database applications and concepts with Neo4j. In Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA, Vol. 2324. 36.
  • Moore and Rid (2016) Daniel Moore and Thomas Rid. 2016. Cryptopolitik and the Darknet. Survival 58, 1 (2016), 7–38.
  • Nakamoto (2008) Satoshi Nakamoto. 2008. Bitcoin: A peer-to-peer electronic cash system. (2008).
  • Reid and Harrigan (2011) Fergal Reid and Martin Harrigan. 2011. An analysis of anonymity in the bitcoin system. In Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third Inernational Conference on Social Computing (SocialCom), 2011 IEEE Third International Conference on. IEEE, 1318–1326.
  • Reid and Harrigan (2013) Fergal Reid and Martin Harrigan. 2013. An analysis of anonymity in the bitcoin system. In Security and privacy in social networks. Springer, 197–223.
  • Ron and Shamir (2013) Dorit Ron and Adi Shamir. 2013. Quantitative analysis of the full bitcoin transaction graph. In International Conference on Financial Cryptography and Data Security. Springer, 6–24.
  • Spagnuolo et al. (2014) Michele Spagnuolo, Federico Maggi, and Stefano Zanero. 2014. Bitiodine: Extracting intelligence from the bitcoin network. In International Conference on Financial Cryptography and Data Security. Springer, 457–468.
  • Theymos (2014) Theymos. 2014. DPR subpoena. https://bitcointalk.org/index.php?topic=881488.0. (2014). [online; accessed 01-July-2018].
  • Vasek and Moore (2015) Marie Vasek and Tyler Moore. 2015. There’s no free lunch, even using Bitcoin: Tracking the popularity and profits of virtual currency scams. In International conference on financial cryptography and data security. Springer, 44–61.
  • Vasek and Moore (2018) Marie Vasek and Tyler Moore. 2018. Analyzing the Bitcoin Ponzi scheme ecosystem. In Financial Cryptography.
  • Yin et al. (2008) Xiaoxin Yin, Jiawei Han, and S Yu Philip. 2008. Truth discovery with multiple conflicting information providers on the web. IEEE Transactions on Knowledge and Data Engineering 20, 6 (2008), 796–808.