Data markets connect buyers and sellers of datasets with one another. Such markets may prove a fundamental new primitive for the next stage of the internet, especially as machine learning and AI systems continue to embed themselves at the heart of the modern technology ecosystem. Learning methods are often data hungry, and require access to large datasets in order to make accurate predictions. Unfortunately, such datasets are nontrivial to gather, and existing data markets lack liquidity. Only the largest and most connected organizations have the resources to secure access to the data they require. The construction of liquid data markets would fundamentally shift this distribution of power and facilitate the broad adoption of machine learning methods.
How can such a data market be constructed? One option is to identify a trusted entity to act as a centralized data broker. Such a broker could enable transactions between buyers and sellers of data by storing datasets on-site and transferring them upon payment. Unfortunately, such a model creates a heavy burden of trust; how can buyers and sellers know that the broker is behaving fairly? Centralized cryptocurrency exchanges already have a checkered history of fraud and theft. It seems all too likely a centralized data exchange could fall prey to similar problems. For these reasons, the construction of a decentralized data exchange could prove an enabling technology for liquid data markets. Such an exchange would facilitate transactions of data between buyers and sellers without the need for a trusted third-party broker. Furthermore, tokenization of data offers a powerful new primitive for solving the cold-start problems that generally make bootstrapping a marketplace difficult. While many might agree that pooling data creates non-zero-sum value for all participants, most hesitate to be the first to contribute without some contractual guarantee of value. With decentralized data markets, the earliest contributors see financial incentive because they can receive tangible cryptoeconomic assets (tokens) even before buyers enter the market.
The construction of a decentralized data exchange is not straightforward. How can participants ensure that their datasets are stored and transferred correctly? How can cheaters be caught and removed from the system? These are deep questions which delve into the heart of multiparty protocols. Luckily, the advent of blockchain-based systems with associated smart contract platforms [buterin2013ethereum] has triggered significant research into multi-agent systems that perform nontrivial work. For example, prediction markets [peterson2015augur], decentralized token exchanges [warren20170x], curation markets [delarouviere2017curationmarkets], token curated registries [goldin2017tcr], storage markets [wilkinson2014storj], and computational markets [teutsch2017scalable] are all systems designed to perform useful work by coordinating selfish actors. Primitives introduced by such protocols can be repurposed to serve as a foundation for decentralized data markets.
The token curated registry (TCR) [goldin2017tcr] in particular provides a powerful abstraction for how a collection of participants can work together to build a curated list. For example, such a list could contain the names of colleges which enable students to rapidly pay back student debt after graduation. Basic implementations of TCRs in Solidity already exist [goldinTcrImpl]. However, these implementations have a number of limitations. For example, storage is typically on-chain for simplicity; this basic design wouldn’t permit the construction of a list of images, since images are too large to be stored on existing smart contract platforms. In addition, the contents of the registry are publicly visible, so sensitive information can’t be assembled.
To overcome these issues, it proves useful to specialize the basic design of token curated registries to fit within a structured framework which explicitly allows for off-chain storage and private data. In addition, we introduce the new notion of recursively nesting TCRs to allow for the construction of more complex data structures. We call this modified mathematical class of structures tokenized data structures. Tokenized data structures allow for a number of improvements over simple on-chain TCR implementations:
Off-chain storage: At present, simple token curated registries cannot hold large datasets since the registry contents are stored on-chain. Tokenized data structures, on the other hand, allow for the storage of data elements which may be too large to fit on-chain. Such data elements could be stored on IPFS [benet2014ipfs] or similar storage networks. Enabling off-chain storage significantly extends the types of data structures that can be constructed. A decentralized data exchange could store all its datasets off-chain in this fashion. Alternatively, a tokenized map could be constructed to provide an alternative to Google Maps. Note that such a data exchange or tokenized map would require the storage of terabytes and perhaps petabytes of data. Coordination mechanisms that enable a tokenized data structure to effectively access distributed off-chain state will prove fundamental for these applications. We discuss such mechanisms later in this work.
Private Data: Decentralized data exchanges will require that only the rightful owners of datasets be able to access data. For this reason, data indexed in the tokenized data structure must be kept private. Similarly, the tokenized map introduced above could have regions of the map restricted from the general public (say for military bases), or the map could cover private property; a token curated Disneyland map may require a payment to Disney in order to access it. Tokenized data structures need to allow private data to be maintained as part of their structure. Agents who wish to access such data must purchase membership. In a decentralized data exchange, buyers of data must purchase membership in a dataset in order to access it.
Recursive Nesting: Some tokenized data structures could require significant capital expenditure to construct. For the case of a map dataset, it’s possible that mapping a new city might require a mapper to expend capital gathering the mapping information needed to add a new entry to a tokenized map with an existing map token. Let’s suppose that our mapper lacks the needed funds, but has an entrepreneurial mindset. For this purpose, the mapper can construct a new city token which she can use to fund her data gathering efforts. This token is tied to the broader map token so that our mapper doesn’t need to exit the existing mapping ecosystem. The mapper can sell a fraction of her founder tokens to obtain the funds necessary to gather the first maps for the new city. In order to attract investors to the new city token, there must be mechanisms by which token holders can obtain rights to future monetary returns from the new city map. We introduce mathematical structures, namely a membership model, that provide these returns.
We start by reviewing the literature for related ideas, then proceed to provide a number of practical examples of tokenized data structures, culminating with the construction of a decentralized data market via a tokenized data structure. We then use these examples to motivate a mathematical framework for analyzing tokenized data structures, and prove some basic economic theorems governing their behavior. We discuss how decentralized data markets may enable the advancement of machine learning and AI, and conclude by highlighting a few open problems relating to tokenized data structures.
2 Related Work
Bitcoin [nakamoto2008bitcoin] introduced the first broadly adopted token incentivized scheme. Its proof of work mining algorithm provided an incentive for miners to run large computations in return for token rewards. Despite its impact, Bitcoin does not provide an easy way for developers to build applications on top of the core protocol. Ethereum [buterin2013ethereum] extends the Bitcoin design with a (quasi) Turing complete virtual machine on top [wood2014ethereum] capable of executing smart contracts. A number of smart contract systems have been devised which implement powerful incentive systems such as prediction markets [peterson2015augur], decentralized exchanges [warren20170x], and computational markets [teutsch2017scalable].
Both Bitcoin and Ethereum were originally designed to use proof-of-work (PoW) mining. In such systems, teams of miners compete for the right to propose the next cleared set of transactions by solving computational challenge problems (typically hash inversion). PoW has proven a robust and powerful security mechanism, but at the cost of tremendous electricity and resource consumption. For this reason, a parallel line of work has investigated proof-of-stake (PoS) mining algorithms [king2012ppcoin, kiayias2017ouroboros, buterin2017casper]. Such algorithms require that miners hold “stake” in the form of coins held in the economic system. Miners are selected to propose the next cleared “block” of transactions according to their stake. To keep miners honest, a number of “slashing conditions” [buterin2017casper] have been proposed which punish dishonest miners. Although proof-of-stake was originally envisioned as a scheme for securing blockchains, it has become clear that computational stake serves as a powerful scheme to coordinate agents to perform useful work. Many protocols [peterson2015augur, teutsch2017scalable, goldin2017tcr, goldin2018tcr11] rely upon staking mechanisms to coordinate actors to perform useful work and upon slashing conditions to punish dishonest behavior.
Token curated registries [goldin2017tcr] (TCRs) in particular allow for the construction of lists that are maintained by a set of curators. These curators must be bonded into the TCR by placing tokens at stake. The bonding of curators creates natural incentive structures that help the listing take natural form. The original TCR design was subsequently modified to add “slashing” conditions that punish token holders who don’t participate in votes regularly [goldin2018tcr11]. A number of designs related to TCRs, such as curation markets for coordinating agents around shared goals [delarouviere2017curation, delarouviere2017curationmarkets], have also been proposed. Refinements such as bonding curves [delarouviere2017tokens2] have been proposed which allow for additional flexibility in the choice of how participants are rewarded with tokens for their efforts.
It’s important to note, however, that unlike PoW algorithms, PoS methods have not yet been tested in large real-world deployments. A line of recent work has demonstrated that long-range attacks [gazi2018stake], where miners wait until they can remove stake from the system to launch attacks, may seriously compromise the security of such systems. Nevertheless, the flexibility and energy friendliness of PoS systems means that research into the design of such systems continues full steam.
A different line of work has investigated distributed hash tables [stoica2003chord, kaashoek2003koorde, maymounkov2002kademlia], data structures which enable decentralized networks of participants to maintain useful information. Such decentralized data structures form the foundations of modern internet architecture and also feature prominently in the design of many tokenized protocols [wood2014ethereum, wilkinson2014storj]. One way of contextualizing tokenized data structures would be to view them as the blending of ideas from PoS incentive schemes with distributed hash table style decentralized storage. Protocols such as Storj [wilkinson2014storj] and Filecoin [filecoin2017] have explored this design space. Storj proposes a peer-to-peer storage network where availability of data is guaranteed by a challenge response scheme and where storage nodes are rewarded with tokens. The locations of shards of data are stored on an underlying Kademlia distributed hash table [maymounkov2002kademlia].
Unlike systems such as Storj, tokenized data structures introduce the notion of recursive sub-tokens, enabling different agents to construct parts of the tokenized data structure. These sub-tokens draw from past work on non-fungible tokens [eip721], which create custom tokens tied to particular physical or virtual entities. For example, Cryptokitties [cryptokitties2018] associates separate non-fungible tokens to instances of collectible virtual cats (the aforementioned “Cryptokitties”).
Tokenized data structures also draw some inspiration from past work on decentralized cryptocurrency exchanges [warren20170x]. However, the need for a decentralized data exchange to secure large off-chain datasets means that it’s not feasible to directly adopt decentralized exchange protocols for data transactions.
3 Examples of tokenized data structures
Before introducing a formal mathematical definition of tokenized data structures, it will be useful to discuss a number of different types of tokenized data structures to build intuition. We present a series of tokenized data structures of increasing complexity, culminating in the construction of a decentralized data market. An important design theme that will emerge in this discussion is the recursive nature of tokenized data structures, which means that such structures can be fruitfully combined to build more complicated systems.
3.1 Distributed Hash Table
A tokenized data structure with no associated token but with off-chain storage forms a distributed hash table. Assuming that the tokenized data structure is implemented on a smart-contract platform, the lookup table mapping keys to data locations can be implemented as a smart contract data structure stored on-chain as illustrated in Figure 1.
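As a rough illustration of this lookup table, the following Python sketch maps keys to off-chain locations and records a content hash for each entry. It is a toy stand-in, not a smart contract: the names `Entry` and `LookupTable`, the use of SHA-256, and the location string format are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Entry:
    location: str      # where the payload lives off-chain (hypothetical URI)
    content_hash: str  # hex SHA-256 of the payload, kept in the table


@dataclass
class LookupTable:
    """Toy stand-in for an on-chain table mapping keys to off-chain data."""
    entries: Dict[str, Entry] = field(default_factory=dict)

    def put(self, key: str, location: str, payload: bytes) -> None:
        """Record where a payload is stored together with its hash."""
        digest = hashlib.sha256(payload).hexdigest()
        self.entries[key] = Entry(location, digest)

    def verify(self, key: str, payload: bytes) -> bool:
        """Check that a payload fetched off-chain matches the recorded hash."""
        entry = self.entries.get(key)
        if entry is None:
            return False
        return hashlib.sha256(payload).hexdigest() == entry.content_hash
```

Storing only the location and hash keeps the table small while still letting any participant check that retrieved data is the data that was registered.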
3.2 Token Curated Registry
A simple token curated registry is a special case of a tokenized data structure with no off-chain storage and no private data (Figure 2). Note that the concept of a token curated registry is often discussed quite generally, so it would be equally fair to argue that all tokenized data structures are themselves special cases of token curated registries.
3.3 Tokenized Dataset
A tokenized dataset is a distributed hash table that has an associated token. Alternatively, the tokenized dataset can be viewed as a token curated registry with the addition of off-chain storage. Figure 3 illustrates a token curated image dataset with public data visibility while Figure 4 illustrates a token curated image dataset with private data visibility.
It’s illustrative to imagine how a large image dataset like ImageNet [deng2009imagenet] could have been gathered with a private tokenized dataset rather than through Amazon’s Mechanical Turk. Workers who contributed images would be rewarded with dataset tokens that could be redeemed at a future date for currency.
In the longer run, tokenized datasets may prove to be a far more powerful tool for incentivizing the construction of large datasets than Mechanical Turk. Unlike Mechanical Turk, tokenized datasets have support for recursive sub-tokens which allow workers to be rewarded with a share of future financial rewards from the dataset. This expectation of future rewards is a powerful economic driver. The modern startup functions because founding employees accept severe risks in expectation of future rewards from their fractional ownership of the company. Similarly, tokenized datasets may enable “data startups” which work collaboratively to construct datasets of significantly greater scale and utility than ImageNet.
3.4 Tokenized Map
A tokenized map (Figure 6) is a two-dimensional grid with off-chain storage for local information at grid points. A tokenized map could be used to incentivize the construction of a version of Google Maps. Local businesses could pay for transactions to add their business information to the tokenized map.
3.5 Tokenized Tree
Let’s suppose that adding elements to a tokenized data structure would take significant capital outlay. For example, the tokenized data structure we wish to construct might be a vast phylogenetic tree that holds all the world’s genomic data (Figure 7). In this case, adding one new individual to the phylogenetic tree could take a sizable sum of money to pay for the needed genetic sequencing. Since there may not exist interested individuals who are willing to directly pay for the construction of this tree, the tokenized tree suffers from a severe cold start problem.
More generally, if there exists a substructure in a tokenized data structure that is difficult to construct, a token can be constructed that is tied to this substructure. For example, let’s suppose that the South Asian branch of the phylogenetic tree is sparse. An interested network participant can contribute genetic material in return for SouthAsianBranchTokens (SABTs). If a future agent pays to access the private data on the South Asian branch, payments will be made to SABT holders. The anticipation of these future payments serves as an incentive to encourage contribution of data elements to the South Asian branch. Note that portions of the South Asian branch must be kept private, or else there will be no incentive to pay for data access. Conceptually, the SABT holders have a form of ownership in the South Asian branch of the tokenized phylogenetic tree.
Similarly, PolynesianBranchTokens (PBTs) may incentivize gathering of genomic data for the Polynesian branch of the tokenized phylogenetic tree. But it’s important to note that entirely different organizations may be involved with this branch of the tree! That is, SABTs and PBTs may be used by different organizations, with their efforts coordinated by the decentralized tokenized phylogenetic tree. This potential for decentralized coordination of disparate organizations could enable complex datasets to be assembled.
3.6 Decentralized Data Markets
A decentralized data market would provide data liquidity by enabling data transactions between buyers and sellers of data. How can such a market be instantiated as a tokenized data structure? Luckily, we’ve already discussed many of the component pieces of such a market structure. Individual datasets can be stored on the market as (private) tokenized datasets. The collection of such tokenized datasets can itself be organized as a simple token curated registry. Put another way, a tokenized data market is defined as a simple token curated registry of (private) tokenized datasets. Figure 8 illustrates a tokenized data market.
A tokenized data market could be used to construct a decentralized data exchange where participants can access various useful types of data by accessing constituent tokenized datasets. Agents would be incentivized to construct new datasets in anticipation of future rewards for their holdings of dataset tokens.
4 Mathematical Definitions
In this section, we provide formal mathematical definitions of tokenized data structures and analyze a number of their mathematical properties.
4.1 Formal Definitions
A tokenized data structure is a collection of elements with an optional token type , associated ledger that maps agents to their token holdings , and metadata that annotates elements with additional information. Formally, we can write as a tuple (or simply in the token-free, metadata-free case).
Recursively, each element in this may itself be a tokenized data structure . Alternatively, may be a terminal leaf node containing associated data . Note that may be stored on-chain or off-chain, depending on the capacities of the system on which the tokenized data structure is implemented.
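The recursive definition can be made concrete with a small Python sketch. This is an illustrative data model only, under the assumption that a structure holds a list of elements, an optional token type, a ledger, and metadata; the class and field names (`Leaf`, `TokenizedDataStructure`, `data_ref`, and so on) are hypothetical and do not come from any reference implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Leaf:
    """Terminal node; `data_ref` may be inline data or an off-chain pointer."""
    data_ref: str
    private: bool = False


@dataclass
class TokenizedDataStructure:
    # Each element is either a nested structure or a terminal leaf.
    elements: List[Union["TokenizedDataStructure", Leaf]] = field(default_factory=list)
    token: Optional[str] = None                             # optional token type
    ledger: Dict[str, int] = field(default_factory=dict)    # agent -> token holdings
    metadata: Dict[int, str] = field(default_factory=dict)  # element index -> annotation

    def leaves(self) -> List[Leaf]:
        """Collect all leaf nodes by recursing through nested structures."""
        found: List[Leaf] = []
        for element in self.elements:
            if isinstance(element, Leaf):
                found.append(element)
            else:
                found.extend(element.leaves())
        return found
```

Under this model, the tokenized data market of Section 3.6 would be an outer structure whose elements are themselves tokenized datasets, each carrying its own token and ledger.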
For many proofs, it will be useful to talk about the economic value of a particular token. However, such discussions depend on the choice of base currency. Following conventions from the literature [gorbunov2015democoin], we adopt the notation to denote units of monetary value.
Definition 1 (Economy Size)
Let denote a token type. Let denote the number of such tokens tracked in ledger . Let denote the monetary value of one such token in the base currency and let denote the number of tokens of type available. Then the size of the token economy is defined to be units.
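For a numeric feel, the sketch below computes an economy size under the assumption that it is the product of the number of tokens recorded in the ledger and the value of one token in the base currency. The function name `economy_size` and its arguments are illustrative choices, not notation from the definition.

```python
from typing import Dict


def economy_size(ledger: Dict[str, int], token_value: float) -> float:
    """Total tokens recorded in the ledger times the value of one token.

    Assumption for this sketch: size = (tokens in ledger) * (value per token).
    """
    total_tokens = sum(ledger.values())
    return total_tokens * token_value
```

With holdings of 600 and 400 tokens and a per-token value of 2.5 units, this sketch yields an economy of 2500 units.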
4.2 Operations and Parameters
This section introduces the operations that can be performed on a tokenized data structure and the associated parameters that control the specific behavior of a tokenized data structure under these operations.
There are four classes of operations supported by a tokenized data structure: candidacy, challenges, forks, and queries. Candidacy is the process by which new elements are proposed for addition to the tokenized data structure. Challenges allow for the legitimacy of elements of the structure to be formally challenged. Forks split the structure into two parts. Queries allow for private elements in the tokenized data structure to be viewed. In the remainder of this section, we expand on these brief definitions and discuss the parameters that govern each operation. Table 1 summarizes all operations and parameters.
|Operation||Parameter||Example Value|
|Candidacy||Voting Period||3 days|
|Candidacy||Reward Stake Period||-|
|Challenge||Voting Period||5 days|
|Fork||Voting Period||30 days|
Candidacy is the process by which new elements are proposed for addition to a tokenized data structure.
The candidate deposit controls the amount that a token holder must stake in order to propose the addition of an element to the tokenized dataset.
Note that for many practical applications, may be set to 0 in order to lower barriers for potential candidates to participate in the construction of the tokenized dataset. The danger of setting is of course that spamming the market becomes much easier.
Candidacy Voting Period
The candidate voting period controls the amount of time that token holders have to vote on a new data candidate.
The reward issued to a candidate for an accepted addition to the tokenized dataset.
Candidacy Vote Quorum
The percentage of token holders who must vote to authorize the addition of a new candidate to the .
Challenges are the mechanism by which token holders can dispute the suitability of a given element for membership in the tokenized data structure. The challenge mechanism allows token holders to remove data structure elements which no longer add value to the global structure.
The challenge deposit is the amount that a token holder must stake in order to issue a challenge to a particular element in the tokenized dataset.
Challenge Voting Period
The challenge voting period controls the amount of time that a challenge for a particular deposit is open for token holders to vote upon.
issued to successful challengers. It is probably appropriate to setequal to since token holders should be incentivized to remove bad entries. However, it might also be reasonable to set in which case seized reward associated with the element in question is awarded to the challenger.
Challenge Vote Quorum
The percentage of token holders who must vote to authorize the removal of an element from .
Forking is the operation by which one tokenized data structure can be split into two tokenized data structures. All of the token holders in the forked structure must pick one of the two structures as legitimate.
The fork deposit controls the amount of stake that must be placed to request a fork of .
Fork Voting Period
The fork voting period is the amount of time token holders can vote on a proposed fork.
The fork threshold is the number of votes that must be placed in favor of a forking operation to trigger a fork.
If the boolean value is true, then the tokenized data structure has leaf nodes which store information off-chain. Tokenized datasets and tokenized data registries rely fundamentally on off-chain storage for example.
Querying is the operation by which stake holders in a tokenized data structure can request to query private data held in leaf nodes of the structure. It’s possible to think of a querying operation as a sort of limited challenge operation.
The query stake is the amount of stake required to be able to query private data points stored in leaf nodes of the tokenized data structure.
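The parameters described in this section can be gathered into a single configuration object, as in the following sketch. Only the three voting periods are taken from Table 1; every other default, and every field name, is a placeholder assumed for illustration rather than a recommended setting.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Parameters:
    # Voting periods follow the example values in Table 1.
    candidacy_voting_period: timedelta = timedelta(days=3)
    challenge_voting_period: timedelta = timedelta(days=5)
    fork_voting_period: timedelta = timedelta(days=30)
    # Everything below is a placeholder chosen for illustration only.
    candidate_deposit: int = 0
    candidate_reward: int = 10
    candidacy_quorum: float = 0.5
    challenge_deposit: int = 10
    challenge_reward: int = 10
    challenge_quorum: float = 0.5
    fork_deposit: int = 100
    fork_threshold: float = 0.5
    off_chain: bool = True
    query_stake: int = 1

    def validate(self) -> None:
        """Reject fractions outside [0, 1] and negative stake amounts."""
        for name in ("candidacy_quorum", "challenge_quorum", "fork_threshold"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be a fraction in [0, 1]")
        for name in ("candidate_deposit", "challenge_deposit",
                     "fork_deposit", "query_stake"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")
```

Freezing the dataclass reflects the idea that parameters are fixed when a structure is created; a design that lets token holders vote to change parameters would need a mutable variant.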
4.3 Token Issuance Schedule
The creators of a tokenized data structure have broad flexibility to control token ownership, supply, and issuance. For example, the token economy for could have a fixed supply, be inflationary, or even deflationary depending on the needs of the particular application at hand. In this section, we briefly discuss some potential token allocation strategies.
4.3.1 Predetermined Allocation
The creators of could elect to split all tokens amongst themselves in some agreed upon fashion proportional to their expected work contribution. In this case, is fixed and does not change over time.
Tokens can be issued in an on-going fashion to contributors of new elements to . (The act of contributing a quasi-finalized candidate to is deemed mining.) To enable mining rewards, the creators of need to set . In this case, grows with time.
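The difference between the two issuance strategies can be seen in a few lines of Python. In this sketch, which assumes a ledger represented as a plain dictionary and a hypothetical helper named `mint_reward`, a reward of zero leaves the supply fixed while a positive reward makes the supply grow with each accepted contribution.

```python
from typing import Dict


def total_supply(ledger: Dict[str, int]) -> int:
    """Number of tokens currently recorded in the ledger."""
    return sum(ledger.values())


def mint_reward(ledger: Dict[str, int], maker: str, reward: int) -> None:
    """Issue `reward` new tokens to a maker whose element was accepted."""
    if reward < 0:
        raise ValueError("reward must be non-negative")
    ledger[maker] = ledger.get(maker, 0) + reward
```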
4.4 Network Participants
In this section, we introduce various agents who participate in the construction of a tokenized data structure and the operations they can perform. Table 2 lists the three classes of agents: token holders, makers, and queriers. Figure 9 provides a diagrammatic representation of how a tokenized data structure is constructed by agents in the network. Note that the same entity can play multiple roles.
|Agent||Description|
|Token Holder||Agent with token holdings|
|Maker||Agent with sufficient stake who can submit a candidate datapoint|
|Querier||Agent with sufficient stake who can query private data|
We now formally define each participant in the economy.
Definition 2 (Token Holders)
A token holder is an agent which holds a nonzero number of tokens . The token holder’s belief that token holds value (say units of value in a base currency) is the primary economic driver responsible for its participation in the construction of .
Definition 3 (Maker)
A token holder who possesses token holdings in excess of a set minimum stake can propose that a new candidate element should be added to . Such a modification to must be approved by a quorum of token holders during a voting period . For example would set a quorum at of the token economy size.
Definition 4 (Querier)
The querier is any party who is interested in accessing the information stored in the tokenized data structure. The querier must be prepared to pay to remunerate the token holders who have put forward the effort needed to curate it.
With these definitions in place, we can formalize the actions these agents may take.
Definition 5 (Candidacy)
Let be a tokenized data structure. In a candidacy, agent proposes the addition of element to . Agent must place tokens at stake for the duration of the vote . If a quorum of the token holders recorded in authorize the modification, element is added to the and agent is rewarded with reward . The final structure is
In the definition above, we use the terminology to mean that ledger is modified so that , and terminology to denote that is modified to hold metadata denoting as a candidate.
For makers proposing the addition of a leaf node holding value , they will be responsible for storing the data off-chain since other nodes have the option of issuing a challenge that can remove the element from the dataset. If agent can’t produce upon query, token holders will be incentivized to vote for removal of from .
An important special case is when the candidacy stake is set to 0. This setting will be crucial for cases where constructing the tokenized data structure will be challenging and barriers for candidacy need to be low. For example, a tokenized dataset will likely require the minimum deposit for candidacy to be zero since otherwise it will be challenging to incentivize workers to participate in the project.
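The candidacy process can be sketched as a single function that checks the proposer's stake, tallies the vote, and issues the reward. Several choices here are assumptions rather than part of the definition: votes are weighted by token holdings, the quorum is a fraction of total supply, and the handling of the deposit after a failed vote is left out. The name `resolve_candidacy` is hypothetical.

```python
from typing import Dict, Set


def resolve_candidacy(
    ledger: Dict[str, int],
    proposer: str,
    deposit: int,
    reward: int,
    yes_voters: Set[str],
    quorum: float,
) -> bool:
    """Return True and mint the reward if token-weighted approval meets quorum."""
    # The proposer must hold at least the candidate deposit to stake it.
    if ledger.get(proposer, 0) < deposit:
        raise ValueError("proposer cannot cover the candidate deposit")
    supply = sum(ledger.values())
    approving = sum(ledger.get(voter, 0) for voter in yes_voters)
    accepted = supply > 0 and approving / supply >= quorum
    if accepted:
        # Assumption: the reward is newly minted to the proposer.
        ledger[proposer] = ledger.get(proposer, 0) + reward
    return accepted
```

Note that with a deposit of zero, an agent with no holdings at all can still propose an element, which is exactly the low-barrier case discussed above.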
All elements that pass candidacy are said to be quasi-finalized.
Definition 6 (Quasi-Finality)
An element is quasi-finalized when it has passed candidacy and its addition to the tokenized dataset has been approved by a quorum of token holders in . Note that quasi-finalized elements may still be challenged by token holders. Upon being quasi-finalized, the metadata associated with is updated.
Initially, the proposing agent still has its candidate reward bonded to the . This stake can be seized via challenge. If the challenge succeeds, reward will be seized from agent and the element will be flagged in the metadata as successfully challenged.
Definition 7 (Challenge)
Let be a tokenized data structure. In a challenge, agent proposes the modification of metadata associated with element from to denote that this datapoint has been challenged. Agent must place tokens at stake for the duration of the vote. If a quorum of the token holders recorded in authorize the modification, element is marked as challenged in . Let be the agent who originally added . Then ’s reward is seized and the final structure is
When the reward for proposing agent is no longer bonded to the network, challenges can no longer seize the reward, and the ledger is not amended upon a successful challenge, but the associated metadata still is.
Definition 8 (Statute Of Limitations)
An element has exceeded its statute of limitations when the reward issued to its proposing agent is no longer bonded to the . Challenges may still be issued against but it will no longer be possible to seize the candidate reward .
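The interaction between challenges and the statute of limitations can be sketched as follows. The record type `ElementRecord`, the function `apply_successful_challenge`, and the `award_to_challenger` switch are hypothetical; in particular, whether a seized reward is destroyed or passed to the challenger is treated here as a configurable assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ElementRecord:
    proposer: str
    bonded_reward: int  # reward still bonded; 0 once the statute has passed
    status: str = "quasi-finalized"


def apply_successful_challenge(
    ledger: Dict[str, int],
    record: ElementRecord,
    challenger: str,
    award_to_challenger: bool = False,
) -> int:
    """Mark the element as challenged and seize any reward still bonded.

    Returns the number of tokens seized from the proposer.
    """
    seized = min(record.bonded_reward, ledger.get(record.proposer, 0))
    if seized > 0:
        ledger[record.proposer] -= seized
        record.bonded_reward -= seized
        if award_to_challenger:
            ledger[challenger] = ledger.get(challenger, 0) + seized
    # Metadata is updated whether or not any reward could be seized.
    record.status = "challenged"
    return seized
```

Once `bonded_reward` reaches zero the ledger is left untouched, but the element is still flagged, matching the behavior described for elements past their statute of limitations.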
4.5 Liveness of Data
Since the data in leaf nodes may be stored off-chain, ensuring liveness of data is critical. Note that when a leaf node with data is accepted into a TD substructure, the leaf node owner is rewarded with minted substructure tokens.
Token holders for the substructure are incentivized to challenge leaf nodes to prove they can produce their data. If the challenge passes muster, the leaf node tokens are burned, implicitly raising the value of existing token holders’ holdings. Recall that token holders are entitled to a fraction of future membership payments in proportion to their fractional ownership of the token supply. Hence, a token holder is incentivized to increase their fractional ownership (and gain rights to a larger share of future returns) by pruning dead leaf nodes.
Theorem 4.1 (Proof of Liveness)
Let be a tokenized dataset. Suppose that the current saleable value of is units in a base currency and that potential for future sales exists. Assume that a liveness check costs units. Then, token holders of a dataset are incentivized to check any element a total of times for data liveness.
Let’s suppose that . On average, each element has saleable value units. Given the potential for future sales, the future value of each data element is units. It follows that a token holder is economically incentivized up to checks for data liveness.
It’s useful to substitute actual numbers to gain some intuition for this result. Let’s suppose that has saleable value units at present and that potential for future sales exists. Let’s suppose that a liveness check costs units and that the dataset has . Then a token holder is incentivized to issue liveness checks for each datapoint in .
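The arithmetic behind this argument can be reproduced with a short function. The sketch assumes that the potential for future sales is expressed as a count of anticipated sales that multiplies the per-element value, and that the number of incentivized checks is the future per-element value divided by the cost of one check, rounded down. All parameter names are illustrative.

```python
def incentivized_liveness_checks(
    saleable_value: float,
    future_sales: int,
    check_cost: float,
    num_elements: int,
) -> int:
    """Upper bound on liveness checks per element that remain worthwhile."""
    if check_cost <= 0 or num_elements <= 0:
        raise ValueError("check_cost and num_elements must be positive")
    # Average saleable value attributable to a single element.
    value_per_element = saleable_value / num_elements
    # Assumption: future value scales linearly with anticipated sales.
    future_value_per_element = future_sales * value_per_element
    return int(future_value_per_element // check_cost)
```

For instance, a dataset worth 1,000,000 units with 10,000 elements, 10 anticipated future sales, and a check cost of 1 unit gives a per-element future value of 1,000 units, so up to 1,000 checks per element are economically justified under these assumptions.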
4.6 Accessing private data
A tokenized data structure explicitly allows for the addition of private off-chain data. The introduction of this primitive raises new questions: How can an interested party honestly gain access to this off-chain data in a way that fairly rewards the token holders who have curated this resource? And more importantly, what are the new attack vectors that arise as a result of this new resource?
In this section, we will consider two potential access modes by which interested parties can access private, off-chain data, namely membership and transactions.
4.6.1 Membership Model
In the membership model, any interested agent who wishes to access private data must become a token holder who holds stake in . Then requesting to query a private datapoint stored in element requires placing stake within the system. The process of acquiring stake in is referred to as the process of acquiring membership in .
Definition 9 (Membership)
Agent acquires membership in tokenized data structure by acquiring tokens in its token economy.
How does agent acquire tokens? Let’s assume that tokens holds units of value. The payment of tokens is then split out pro-rata (according to ownership share) among all present token holders.
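The pro-rata split of a membership payment can be illustrated with a short sketch. The function name `split_pro_rata` and the holder names are hypothetical, and the sketch covers only the distribution of the payment among present token holders. It does not model how the new member's tokens are sourced, which the text does not specify.

```python
from fractions import Fraction


def split_pro_rata(payment, holdings):
    """Split a payment among present token holders by ownership share.

    holdings maps holder -> integer token balance (hypothetical layout).
    Exact fractions avoid rounding drift in the payouts.
    """
    total = sum(holdings.values())
    if total <= 0:
        raise ValueError("total token supply must be positive")
    return {
        holder: Fraction(payment) * Fraction(amount, total)
        for holder, amount in holdings.items()
    }


# Illustrative balances: alice holds 60% of supply, bob 30%, carol 10%.
holdings = {"alice": 60, "bob": 30, "carol": 10}
payouts = split_pro_rata(50, holdings)
for holder, amount in payouts.items():
    print(holder, amount)
```

Because each payout is proportional to the holder's share, the payouts always sum to the original payment.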
When analyzing behavior, it will be useful to assume that the economy is in steady state, so that tokens are no longer being issued.
Definition 10 (Steady State)
A tokenized data structure is in steady state if can no longer change.
With the definitions of membership and steady state laid down, it becomes possible to analyze the expected rewards for honest and dishonest behavior. As before, let denote the number of tokens of type available. We start with a useful definition.
Definition 11 (Leakage Resistance)
Let be a token holder in tokenized data structure that has present market value units. Let’s suppose that leaks private data from and that the post-leakage market value of is units. Then we say that has leakage resistance .
Definition 12 (Counterfeit Worth)
Let be a token holder in tokenized data structure that has present market value units. Let’s suppose that leaks private data from and that the market value of the leaked information from is units. Then we say that has counterfeit worth .
Leakage resistance ranges from to . Different tokenized data structures will have different leakage resistance factors depending on the type of data they hold. Data for which establishing provenance is critical (perhaps for regulatory reasons as in health care) may have resistance factors close to . Data for which provenance doesn’t matter (perhaps quantitative trading datasets) may have resistance factors close to .
Theorem 4.2 (Rewards for Honest and Dishonest Behavior)
Let be an agent who buys tokens from tokenized data structure in steady state for units of value. Let be the fractional ownership of required for querying data. Let be the leakage resistance of and let be the counterfeit worth. Suppose that potential for future membership sales and counterfeit sales exists and that a nonzero probability exists for dishonest behavior to be detected by token holders in . Then the expected value for honest behavior and the expected value for dishonest behavior if .
Let’s suppose that agent pays units to obtain stake in . Then let’s assume that there are sales that will happen in the future. Then the total future return that can expect is units of value. Put another way, the expected reward for honest behavior units of value.
If is dishonest and leaks information, there are two possible outcomes. The first is that the dishonesty is caught and a challenge is issued to causing loss of stake. This outcome has value units. Let’s assume alternatively that ’s dishonest behavior is not caught. In this case, the value of data visibility will drop to units since the leakage resistance of is . The data will also have counterfeit value units. In this case, the expected return is units of value. Then the expected return for dishonest behavior is . This quantity is negative if and only if
This result is a little curious. It indicates that while increasing the required fractional ownership for querying private data increases the rewards for honest behavior, it can also create positive rewards for dishonest behavior if is leakage resistant and has low counterfeit worth. In these cases, malicious parties can freely leak while still enjoying positive returns. These results suggest that designing resilient economies for tokenized data structures may take significant research to do correctly.
4.6.2 Transaction Model
In this model, agents who wish to gain access to data would pay a direct fee units to all token holders in but would not gain ownership in the tokenized data structure. The weakness of this model is that the expectation of future returns from the dataset no longer constrains the behavior of purchasing agents. Without this positive reward for honest behavior, leaks will become more likely and destroy the value of .
4.7 Future Returns
Constructing a new tokenized data structure can take a significant amount of effort. What motivates a potential contributor to put forward this effort? Simply put, the contributor will make this effort if the expected monetary reward for the effort is positive.
Theorem 4.3 (Expected Future Returns)
Let us suppose that units of capital must be expended for agent to obtain and store element . Let be a tokenized data structure in steady state. Let’s suppose that in the future, a total of agents will be interested in obtaining membership in for units value each. Then the expected return for contributing to tokenized data structure with candidate reward is where is the fractional ownership of in .
The fractional ownership that receives in is . Then the expected future return that will receive for its work is leading to expected return .
This theorem provides a corollary that guides how high must be priced for contributions to be encouraged.
Corollary 1 (Data Pricing)
Let us suppose that units of capital must be expended for any agent to obtain an element suitable for tokenized data structure . Then price for the dataset must be set greater than for to have positive expected return on its contribution.
Note that pricing depends on whether is a static or dynamic quantity. For inflationary token economics, where grows larger with time, the required price for will grow larger with time as well.
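Theorem 4.3 and Corollary 1 can be sketched numerically. Since the original symbols are missing, the parameter names below are hypothetical, and the sketch assumes the expected return is the contributor's fractional ownership times the total future membership revenue, minus the capital expended. The break-even price follows by setting that return to zero.

```python
def expected_return(fractional_ownership, future_members, membership_price, cost):
    """Expected return for a contributor (assumed form, see lead-in)."""
    future_revenue = future_members * membership_price
    return fractional_ownership * future_revenue - cost


def break_even_price(fractional_ownership, future_members, cost):
    """Membership price above which the contribution has positive expected return."""
    if fractional_ownership <= 0 or future_members <= 0:
        raise ValueError("ownership and future_members must be positive")
    return cost / (fractional_ownership * future_members)


# Illustrative numbers: 1% ownership, 200 future members, contribution cost 50.
price_floor = break_even_price(0.01, 200, 50.0)
profit_at_30 = expected_return(0.01, 200, 30.0, 50.0)
print(price_floor, profit_at_30)
```

In this sketch a smaller fractional ownership raises the break-even price, which matches the observation that inflationary token economics push the required price upward over time.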
4.8 Forking a
Definition 13 (Forks)
A fork is an operation which proposes splitting a given into two separate and structures. The elements must be divided (without overlap) between the two child structures. This means in particular that the two child ledgers cannot intersect.
An agent who wishes to trigger a fork must place at stake. This deposit triggers a forking period. All token holders on a registry have time to adopt one of the two forked registries. Adoption of one registry means token holdings on the other registry are destroyed.
The recursive nature of tokenized data structures introduces an interesting complication, though. Let’s suppose that a given contains element which is itself a tokenized data structure . Let’s say that a fork is triggered for which splits the data structure into and . By convention, let’s agree that is the direct offshoot and is the forked variant. Note then that is not yet an element of ! The token holders in will need to apply for candidacy in . This extra candidacy step places an additional hurdle to discourage frivolous forks.
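The fork mechanics described above can be sketched as two small functions: one that partitions elements into disjoint children, and one that settles token holdings after the forking period. All names here are hypothetical. The sketch also assumes that holders who make no choice remain on the direct offshoot, which the text does not specify.

```python
def fork_elements(elements, goes_to_fork):
    """Partition elements into two disjoint child structures."""
    forked = {e for e in elements if goes_to_fork(e)}
    direct = set(elements) - forked
    return direct, forked


def settle_adoption(balances, choices):
    """Assign each holder's balance to exactly one child registry.

    choices maps holder -> "direct" or "fork". Holdings on the registry
    that was not adopted are destroyed. Assumption: holders who make no
    choice stay on the direct offshoot.
    """
    direct_balances, forked_balances = {}, {}
    for holder, amount in balances.items():
        if choices.get(holder, "direct") == "fork":
            forked_balances[holder] = amount
        else:
            direct_balances[holder] = amount
    return direct_balances, forked_balances


elements = {"e1", "e2", "e3", "e4"}
direct, forked = fork_elements(elements, lambda e: e in {"e3", "e4"})
balances = {"alice": 60, "bob": 30, "carol": 10}
direct_balances, forked_balances = settle_adoption(balances, {"bob": "fork"})
print(sorted(direct), sorted(forked), direct_balances, forked_balances)
```

In this illustration only one holder follows the fork, so the forked registry ends up with a much smaller token economy than the direct offshoot.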
4.9 Slashing Conditions
Slashing conditions for tokenized data structures are implicitly implemented via the challenge mechanism. This implicit scheme can be significantly more robust than an automated slashing condition since there need not be a simple algorithmic rule for slashing. Human (or intelligent agent) token holders can issue challenges for arbitrary reasons including suspicion of fraud that is hard to prove with a rigid algorithmic condition.
4.10 Token Valuations
A tokenized data structure depends critically on its associated token . What is the economic value of such a token in equilibrium conditions? We have discussed the economic rewards that accrue to token holders at depth already. In this model, the value of the token is directly proportional to the future economic rewards that will accrue to a token holder. Let’s suppose that a total of units of discounted future economic rewards will accrue to . Then the unit value of a token should be
. This simple heuristic provides justification for why a data structure token has value, but more refined analysis is left to future work.
In this section, we consider a number of possible attacks upon tokenized data structures. We discuss the severity of each form of attack and consider potential mitigation strategies.
5.1 Dilution of Token Economies with Depth
The recursive definition of a tokenized data structure means that nesting can go arbitrarily deep. As a result, tokens generated for nested substructures of can have very small economies. For this reason, such substructures may be especially prone to other attacks. This is related to the minimum economy size problems identified in the TCR paper [goldin2017tcr], which suggests that TCRs may not be well suited for small economy problems such as generating grocery lists. In the case of a decentralized data exchange, smaller exchanges that are deeply nested within the broader tokenized data market may not have sufficient economic protections to discourage attacks.
5.2 Trolling Attacks
The TCR paper [goldin2017tcr] identifies trolling attacks as a class of vulnerabilities. In this attack, trolls are actors who are willing to attack a system individually to poison gathered data. Such trolls seek to maximize chaos, but are usually not willing to suffer large personal losses to do so. The requirement for a stake to propose TCR candidates means that trolling attacks should be relatively ineffective since the loss of stake resulting from challenges could make such attacks expensive.
For tokenized data structures, it is possible that trolling attacks could prove more dangerous. As we have discussed above, it may make sense to set to in order to lower barriers for potential contributors to . In such a case, the economic barriers against trolling attacks no longer apply. However, if token holders can algorithmically detect troll-submitted candidates with low effort, then the severity of such attacks may be mitigated.
5.3 Madman Attacks
The TCR paper [goldin2017tcr] introduces madman attacks, where motivated adversaries are willing to undergo economic losses to poison a registry. For example, a corporation or nation-state may seek to thwart the construction of a particular data structure which could hurt its interests. Such adversaries may be willing to pay large sums to thwart the construction of such structures.
Defenses against madman attacks are limited by the size of the economy. For tokenized data structures with large economies, such attacks will be prohibitively expensive, but for smaller economies, these attacks will likely prove damaging. These attacks could prove challenging for decentralized data exchanges, where existing data brokers could be motivated to attack decentralized datasets that challenge their market position. Future work needs to consider how to mitigate such attacks.
5.4 Sybil Attacks
Token holders can use multiple coordinated accounts to game a tokenized data structure. In this section, we discuss a few possible such attacks and their effects.
5.4.1 Data duplication
A token holder can propose the addition of data element to . If the data element is valid, this will merit a reward issued to the proposer. Once the reward bonding period is complete, the token holder can use a second account to challenge the liveness of and purposefully fail the liveness challenge. At this point, the metadata for would be modified to note that it has been challenged. The proposer can use a new account to propose re-adding to . Performed iteratively, this scheme could repeatedly gain rewards for the same datapoint.
To defend against the attack, the creator of can choose to make the reward bonding period large. Alternatively, parameters could be set so refreshing a previously challenged element might earn a much smaller reward. This modification could close off the duplication attack, but might lower incentives for agents to refresh dead data.
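The effect of reducing the refresh reward can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that each re-addition of a previously challenged element earns the previous reward multiplied by a fixed `refresh_factor`. This geometric schedule and all names are hypothetical and are not taken from the text.

```python
def cumulative_rewards(base_reward, refresh_factor, cycles):
    """Total reward collected by adding the same element `cycles` times.

    Assumption: each refresh earns the previous reward times refresh_factor.
    A factor of 1.0 models no defense against duplication.
    """
    total = 0.0
    reward = base_reward
    for _ in range(cycles):
        total += reward
        reward *= refresh_factor
    return total


# Five add/challenge/re-add cycles with a base reward of 10 units.
undefended = cumulative_rewards(10.0, 1.0, 5)
defended = cumulative_rewards(10.0, 0.1, 5)
print(undefended, defended)
```

With no discount the attacker's take grows linearly with the number of cycles, while a steep discount caps it near the base reward. The same discount also shrinks the payoff for honestly refreshing dead data, which is the trade-off noted above.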
It’s also likely that many tokenized data structure creators will require their makers to provide proof of their identity. Known identities will create another layer of accountability that will make duplication attacks more challenging.
5.5 Data Leakage
The data stored off-chain on will likely leak over time as the number of agents who have accessed the dataset increases. We have argued in the membership model that purposeful leakage is economically disincentivized, but it’s likely that residual leakage will happen over time. It remains an open problem to construct a membership model that will minimize leakage over long time periods.
5.6 Forking Attacks
A malicious agent could seek to trigger adversarial forks of in order to gain additional control. However, frivolous forks will likely not gain broad backing from the token holders of so even if a fork is triggered, the offshoot branch will have a much smaller economy. This will limit the potential economic gain for the malicious agent.
Decentralized data markets might prove very useful for the development of intelligent agents. For example, a deep reinforcement learning [mnih2015human] or evolutionary agent with a budget could algorithmically construct a tokenized dataset to solicit the construction of a dataset needed to further train the existing model. Significant research progress in deep reinforcement learning has resulted in the design of agents that can learn multiple sets of skills [mankowitz2018unicorn], so it seems feasible for an agent to learn how to budget dataset gathering requests. More prosaically, data scientists and researchers can access the decentralized data exchanges to find datasets that may prove useful for their work. Access to data liquidity could prove a powerful tool for democratization of machine learning and AI models.
In tokenized schemes, it’s common to ask whether the token is a necessary part of the design. Couldn’t the same design be constructed using an existing token such as ETH or BTC? It does in fact seem likely that a tokenized data structure can be meaningfully constructed with all stakes placed in ETH for example. However, it’s not possible to issue recursive sub-tokens if we insist on ETH stakes; requiring contributors to a to front significant capital makes it unlikely that they will participate. For this reason, we suspect that tokenized data structures without custom tokens will face major challenges constructing nontrivial data structures.
In the present work, we have limited our analysis to a mathematical presentation of the properties of tokenized data structures. We leave for future work the nontrivial challenge of implementing tokenized data structures on an existing smart contract platform such as Ethereum [buterin2013ethereum].
In this work, we demonstrate how to construct a decentralized data exchange. This construction is built upon the primitive of tokenized data structures. Such tokenized data structures combine the strengths of past work on token curated registries [goldin2017tcr] and distributed hash tables [stoica2003chord] to provide a framework for constructing incentivized data structures capable of holding off-chain, private data. In addition, tokenized data structures introduce the notion of recursive sub-tokens to incentivize contributors. We provide a mathematical framework for analyzing tokenized data structures and prove theorems that show that participants in a tokenized data structure are incentivized to construct the data structure for positive expected rewards. We discuss how these incentives allow for the construction of robust decentralized data markets. We conclude by discussing how such decentralized data markets could prove useful for the future development of machine learning and AI.
It’s worth noting that our theorems don’t prove Byzantine Fault Tolerance of tokenized data structures against adversaries. Rather they provide much weaker guarantees that honest participants will benefit from participating. We provide qualitative arguments why tokenized data structures are robust against some classes of adversarial attacks, but a more rigorous formal treatment is left to future work.