Data is the new oil [dataoil-economist]. Like oil, it generates enormous value for the individuals and organizations that know how to tap into and refine it. Like oil, there are only a select few who know how to exploit it. In this paper, we present our vision for data market platforms: collections of protocols and systems that together enable participants to exploit the value of data.
Data only generates value for a few organizations with expertise and resources to solve the problems of data sharing, discovery, and integration. Only after these problems are solved can they apply their advanced analytics and machine learning algorithms to inform decisions. None of these problems is easy to solve. Despite the many contributions of the database community (and others) to theory, algorithms, and systems for sharing, discovering, and integrating data, these problems still demand a huge investment of time and resources. This explains why the large majority of organizations only partially benefit from the data they own.
The central argument of this paper is that sharing, discovering, and integrating data is hard because data owners lack information and incentives to make their data available in a way that increases consumers’ utility. For example, companies do not share data with each other because data is a competitive asset and because sharing may bring legal trouble. Inside organizations, teams do not share data because it is time-consuming, may leak confidential information, and has an unclear return on investment. Even if everybody made their data available, consumers would still need to discover data that satisfies their needs, which is hard given the large volume of datasets they may need to check. And last, even if relevant datasets are identified, they are often in the format chosen by the owner, which is different from the format the consumer needs. This means the datasets need to be integrated (prepared, wrangled), which is hard and resource-intensive.
Data market platforms establish rules to share data in a way that is easy to discover and integrate into the format consumers need. In a data market platform, data owners are encouraged to share their data because they may receive profit if a consumer is willing to pay for it. Consumers are encouraged to share their data needs because the market will solve the discovery and integration problems for them in exchange for some money. In short, by spreading information among interested parties and incentivizing them, data market platforms bring data value to all participants.
In recent years we have seen the appearance of many data markets such as Dawex [dawex], Xignite [mkt2], WorldQuant [mkt2] and others. The interest in trading data is not new. Economists have been considering these problems for decades [varian1, valuedata1] and the database community has made progress in issues such as pricing queries under different scenarios [qbdp15, chenMLQ19, revMax19]. We believe the time is ripe to design and implement data market platforms that tackle the sharing, discovery, and integration problems, and we think the database community is in an advantageous position to apply decades of data management knowledge to the challenges these new data platforms introduce. In this paper, we outline challenges and a research agenda around the construction of data market platforms.
Using Data Markets to Solve Data Problems
Consider a market with sellers, buyers, and an arbiter. Sellers and buyers can be individuals, teams, divisions, or whole organizations. Consider the following example:
Buyer 1 wants to build a machine learning classifier to identify good locations to set up a new business. Buyer 1 needs a specific set of features and an accuracy of at least 80% for the responsible engineer to trust the classifier.
Seller 1 owns a dataset that they want to share with the arbiter.
Seller 2 owns a dataset that they won’t share with the arbiter unless the dataset is guaranteed to not leak any business information.
Details of the example. In the example is a function of , such as a transformation from Celsius to Fahrenheit. The function can also be non-invertible, such as a mapping of employees to IDs. Note that neither nor owns attribute , which wants: we discuss this attribute in Section 7.1. Last, is an attribute similar to .
Challenge-1. Using alone is not enough to satisfy ’s needs. The arbiter must incentivize Seller 2 to share . The first challenge is on setting a price for the dataset so Seller 2 wants to share it with the market and so that Buyer 1 wants to pay for it.
Challenge-2. Even if becomes available to the arbiter, neither nor alone fulfill ’s need. may not want to pay for two incomplete datasets that still need to go through a slow and expensive integration process. Without a transaction, sellers won’t obtain a profit and buyers won’t satisfy their data need. The second challenge requires combining the datasets supplied by sellers to satisfy buyers’ needs. That, in this case, involves finding a way of going from to , which is the attribute the buyer wants.
Challenge-3. Without the right pricing mechanism, the buyer may offer a very small price and Seller 2 may be compelled to share the dataset anyway if that’s the best they can obtain. Eventually, incentives may be so small that neither sellers nor buyers are willing to participate in the market anymore. The third challenge is to design pricing mechanisms that incentivize sellers and buyers to participate in the market and that prevent them from gaming the market.
Challenge-4. The degree of trust between sellers, buyers, and arbiter may vary. In internal markets deployed within an organization, sellers and buyers are employees and it is conceivable they trust each other. In external markets, we cannot assume participants trust each other. The fourth challenge is to help all participants trust each other.
Requirements of Data Market Platforms
Driven by the example and challenges above, we enumerate the requirements that data market platforms must implement to solve data problems.
Value of Data. The market must price datasets so that it incentivizes sellers to share their data and create supply and incentivizes buyers to tell the arbiter their data needs, thus generating demand.
Market Design. Without the right rules to govern the participation of sellers, buyers, and arbiter, the market can be gamed and collapse. The key intellectual questions in this area are on how to design the rules of the market when the asset is as unique as data, which is freely replicable and can be combined in many different ways [agarwal19]. A requirement of a data market platform is to be resilient to strategic participants.
Plug n’ Play Market Mechanisms. Markets can be of many types: i) internal to an organization to bring down silos of data, in which case employee compensation may be bonus points; ii) external across companies where money is an appropriate incentive; iii) across organizations but using the shared data as the incentive. An example of the last type is a market between regional hospitals to facilitate data sharing where a hospital’s incentive to share data is to obtain other hospitals’ data. The goals of the market may also be varied, from optimizing the number of transactions to social welfare, data utility, and others. A requirement of data market platforms is to flexibly support markets of different types and with different goals.
Arbiter Platform. Because the supplied data will have a different format than the demanded data, a key requirement to enable transactions between sellers and buyers is an arbiter platform that can combine datasets into what we call data mashups to match supply and demand.
In the example above the arbiter can combine and ’s datasets, obtaining a dataset that is much closer to ’s need. For that, it needs to understand how to join both datasets and needs to find an inverse mapping function that would transform into if such a function exists, or otherwise find a mapping table that links values of to values of . In addition to these relational and non-relational operations the arbiter must also support data fusion operators to contrast different sources of data of the same topic. Briefly, the data fusion operators we envision produce relations that break the first normal form, that is, each cell value may be multi-valued, with each value coming from a differing source. Data fusion operators are appropriate when buyers want to contrast different sources of information that contribute the same data, i.e., weather forecast signals coming from a city dataset, a sensor, and a phone. As an illustration, note Seller 2 owns attribute which is almost identical to , but has some non-overlapping, conflicting information. A buyer may be interested in looking at both signals, or at their difference, or at their similarities, etc. A data mashup is a combination of datasets using relational, non-relational, and fusion operations.
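The fusion behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical `fuse` operator, hypothetical source names, and toy weather readings; none of these come from the paper, and a real arbiter would operate on full relations rather than lists of dictionaries. The point is only that each fused cell keeps one value per source, so the result deliberately breaks first normal form.

```python
from collections import defaultdict


def fuse(sources, key):
    """Fuse rows from several sources that share the attribute `key`.

    Each output cell maps a source name to the value that source reported,
    so conflicting values are kept side by side instead of being resolved.
    """
    fused = defaultdict(lambda: defaultdict(dict))
    for source_name, rows in sources.items():
        for row in rows:
            for attribute, value in row.items():
                if attribute != key:
                    fused[row[key]][attribute][source_name] = value
    # Convert the nested defaultdicts to plain dicts for readability.
    return {k: {a: dict(v) for a, v in attrs.items()} for k, attrs in fused.items()}


def disagreements(fused):
    """Return the (key, attribute) pairs whose sources report different values."""
    return sorted(
        (k, a)
        for k, attrs in fused.items()
        for a, values in attrs.items()
        if len(set(values.values())) > 1
    )


# Hypothetical sources contributing the same signal (temperature per day).
sources = {
    "city_dataset": [{"day": "mon", "temp_c": 21}, {"day": "tue", "temp_c": 19}],
    "sensor": [{"day": "mon", "temp_c": 21}, {"day": "tue", "temp_c": 17}],
    "phone": [{"day": "tue", "temp_c": 18}],
}
fused = fuse(sources, key="day")
conflicts = disagreements(fused)
```

Here `fused["tue"]["temp_c"]` holds three values, one per source, and `conflicts` lists the cells a buyer may want to inspect. Whether to show all values, their difference, or only the agreeing ones is left to the buyer, as in the text.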
Data Market Management System (DMMS). In addition to the arbiter platform, a data market platform calls for platforms to support sellers and buyers. Sellers need anonymization capabilities so they feel confident when sharing their data. For example, without such a feature, Seller 2 won’t share data for fear of leaking business information. Buyers need to have the ability to describe with fine granularity their data needs and the money they are willing to pay for a certain degree of satisfaction achieved on a given task. In the example above, the buyer should have the ability to define that they are only willing to pay money for a classifier that achieves at least 80% accuracy. A requirement of a data market management system is to offer support for sellers, buyers, and the arbiter.
Building Trust. The degree of trust among sellers, buyers, and arbiter can be different depending on the scenario, i.e., whether in an internal market or across the economy. A requirement of a DMMS is to implement mechanisms to help participants trust each other, such as using decentralized architectures [blockchain2] and supporting computation over encrypted data [processingencrypteddata].
Market Simulator. The mathematics used to make sound market designs do not account for evil, ignorant, and adversarial behavior, which exists in practice. For that reason, after producing a market design and before deploying it on a DMMS, it is necessary to simulate it in adversarial scenarios to check its robustness. Therefore, a data market platform calls for a market simulator.
Our vision is to produce market designs for different scenarios (points (1) and (2) in Fig. 1) using a market design toolbox. The market design toolbox uses techniques from game theory and mechanism design [mechanismdesign] to deal with the modeling and engineering of rules in strategic settings, such as data markets. Every market design is tested using a data market simulator (point (3)), before being finally deployed in a DMMS (point (4)). While the output of the market design toolbox is a collection of equations, the output of the DMMS is software. There is an explicit interplay between market design and DMMS that constrains and informs the capabilities of the other. Exploring such an interplay is a critical aspect of our proposal.
In the remainder of this paper, we delve into the details of our vision for data markets. We present many research questions that fall directly within the territory of the database community and beyond.
The rest of the paper is organized as follows. In Section 2 we present one approach to pricing datasets. We follow with our vision of the data market toolbox in Section 3 and our vision of a DMMS in Section 4. Section 5 focuses on our proposal for a Mashup Builder that matches supply and demand. We present our evaluation plan in Section 6. We continue with a discussion of the impact of data markets as we envision in this paper in Section 7, followed by related work in Section 8, before presenting a final discussion in Section 9.
2 The Value of Data
What’s the value of data? This question has kept academics and practitioners in economics, law, business, computer science, and other disciplines busy [varian1, valuedata1]. To participate in data markets, sellers and buyers need to answer this question so they can decide whether selling or buying a dataset is profitable for them, but this is difficult.
The crux of the problem is that the value of a dataset may be different for a seller and a prospective buyer. Sellers may choose to price their datasets based on the effort they spent in acquiring and preparing the data, for example. Buyers may be willing to pay for a dataset based on the expectation of profit the dataset may bring them: e.g., how much they will improve a process and how valuable that is. Neither of these strategies is guaranteed to converge to a price (and hence a transaction agreement) between participants.
And yet, this is how prices are set in current markets of datasets such as Dawex [dawex], Snowflake’s Data Exchange [snowflake-exchange], and many others. Sellers choose a fixed price for datasets without knowing what the demand is and buyers who are willing to pay that price obtain the datasets, without knowing how useful the dataset will be to solve their problem. This leaves both sellers and buyers unsatisfied. Buyers may pay a high price for datasets that do not yield the expected results. Similarly, sellers may undervalue datasets that could yield more profits because they lack information about what buyers want.
A tempting option to the database community is to value datasets based on their intrinsic properties, such as quality, freshness, whether they include provenance information or not, etc. Unfortunately, valuing datasets based on the intrinsic properties of data alone does not work either. We illustrate why with the following scenarios:
Example 1: Two datasets A and B contain the same number of records and cover the same domain. The only difference is that while A is missing 1% of its values, B is missing 30% of them. Which dataset is more valuable? Intuition says the first dataset because of the fewer missing values it contains.
However, if the desired answer involves a set of values that appear in B, but not in A, then B may be more valuable for this particular task, even though A has fewer missing values.
Example 2: Consider two datasets that contain crime data of the city of Chicago. The first one is a snapshot of 12/31/2010. The second one is a current up-to-date snapshot. Which one is more valuable in this case?
Freshness is usually a positive trait of a dataset. However, if I’m trying to understand what the crime in Chicago was in 2010 – for example, before a particular political event – then the snapshot from 2010 will be more valuable than the current one.
Example 3: We want to predict a variable of interest, X. We have a dataset with 5 features and a second dataset with 20 features that subsumes the first. Would you pay more money for the second dataset than for the first?
Even though more features may seem better, if the dataset with 5 features is good enough to predict the variable of interest, adding more features may not help at all – in which case paying more for such a dataset would be pointless. In fact, adding more features may lead to overfitting the model, hence achieving worse quality. This would then require doing feature engineering, therefore consuming more of the buyer’s resources. In this last case, one would pay more for the first dataset, since it’s the one that helps solve the task at hand better.
So, if intrinsic properties of data are not a good proxy to value it, what is?
Intrinsic vs Extrinsic Value of Data
In the markets we envision, the value (and price) of a dataset is decided by the arbiter based on the economic principles of supply and demand [supplyanddemand]. A scarce dataset that lots of buyers want will be priced higher than a common dataset that is hardly ever requested, regardless of the intrinsic properties of such datasets. In other words, the value of a dataset is primarily extrinsic.
The role of intrinsic properties. When intrinsic properties of datasets are important to buyers, they can explicitly let the arbiter know. If, as a consequence of buyers’ requests, a demand is created for a particular dataset with, say, few missing values, sellers who provide those datasets will profit more.
In conclusion, we argue that the value of a dataset should be established by the market as a function of supply and demand, and that when intrinsic properties of a dataset are valuable, buyers will declare so as part of their request.
Having discussed how to price data we now dive into the market design component of our vision.
3 Designing the Market Rules
In this section, we show how without the right rules a market leads to undesirable outcomes. We then explain the challenges of engineering the rules of the market. To illustrate these challenges we consider a scenario such as the one depicted in Fig. 2.
The figure (top) shows two different data sellers, and , who want to share datasets with the arbiter. The arbiter is shown in the middle of the figure, acting as an intermediary. There are two buyers, and who want a dataset . Both and own (although in practice it is common that they will have overlapping datasets, it is also possible the have the exact same dataset, such as in this case). How can we maximize everybody’s happiness in this scenario? Owners want to sell their data and obtain a profit. Buyers want to obtain the dataset to solve the task at hand. Both benefit if the transaction happens, and for that, sellers and buyers must agree to pay a price, for .
If we set based on prices posted by the sellers, then sellers would be incentivized to set higher and higher prices trying to obtain the highest benefit, and no buyer would ever buy , leading to zero overall satisfaction. Conversely, if we set based on prices posted by buyers, then buyers would be incentivized to make prices lower and lower, and owners would stop sharing their data because they would not obtain any profit. A key question a market design must answer is what rule should be used to set .
A simple pricing strategy. One possible mechanism to set a price in situations such as these is a double auction [doubleauctions]. Here, and communicate to the arbiter their posting prices, , and (5 and 7 in the figure). Each sellers’ posting price is private. Similarly, and privately post their bids, and (2 and 3) to the arbiter. The arbiter then sets the price of the dataset as: and communicates the price to the winning participants who are the seller that offered the lowest price and the buyer that offered the higher one (shaded circles in the figure). Because this mechanism creates competition between the sellers, they are encouraged to set a reasonable price, otherwise, they will lose the transaction and receive no profit. Similarly, it encourages buyers to set a reasonable price as well because if they set prices too low, they won’t obtain the dataset they want.
Note: This mechanism is good to illustrate the kind of incentives that well-designed markets deliver to participants, but it is not sufficient to build a functioning data market platform. We discuss why in the next section.
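To make the mechanics of a sealed-bid double auction concrete, the following minimal sketch clears a single dataset between several sellers and buyers. The clearing rule is an assumption of this sketch, not the rule of the paper: it trades only when the highest bid is at least the lowest posting price, and it sets the price at the midpoint of the two. The function name, participant names, and prices are hypothetical and do not correspond to the values in Fig. 2.

```python
def double_auction(asks, bids):
    """Clear a single-item, sealed-bid double auction.

    asks: dict mapping seller name -> private posting price
    bids: dict mapping buyer name -> private bid
    Returns (seller, buyer, price), or None when no trade clears.
    """
    seller = min(asks, key=asks.get)  # seller with the lowest posting price
    buyer = max(bids, key=bids.get)   # buyer with the highest bid
    if bids[buyer] < asks[seller]:
        return None  # ASSUMPTION: no trade when the best bid is below the best ask
    price = (asks[seller] + bids[buyer]) / 2  # ASSUMPTION: midpoint pricing rule
    return seller, buyer, price


# Illustrative values only.
outcome = double_auction(asks={"s1": 5, "s2": 7}, bids={"b1": 6, "b2": 8})
no_trade = double_auction(asks={"s1": 5, "s2": 7}, bids={"b1": 2, "b2": 3})
```

In the first call the lowest-priced seller and the highest bidder win, which is the competitive pressure the text describes: a seller who posts too high a price, or a buyer who bids too low, loses the transaction.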
Revenue allocation strategies. In addition to rules for pricing datasets, a well-designed market needs rules for allocating revenue. We illustrate how using the bottom diagram of Fig. 2:
Suppose that the dataset that and want to acquire is instead of , with being a combination of (owned by ) and . In this case, neither ’s dataset nor ’s satisfies the need. In order to facilitate the transaction, the arbiter must combine and in a mashup, , which is then sold to the buyers (shown in the center of the figure). Once the transaction takes place, the arbiter collects the money paid by buyers, , and uses it to compensate the sellers. How to allocate the revenue to the sellers is challenging. Each seller contributed to the transaction ( and ), so both must be compensated. Furthermore, the transaction would not have been made possible without the arbiter’s help, so the arbiter should also capture some of the revenue, .
3.1 Research Agenda: Market Design
We focus our discussion on 3 key challenges that our example above illustrates and form the basis of our vision for the design of data market rules. The first challenge is related to the design of mechanisms (such as the double auction above) that work when the asset is data. The second is related to support for buyers to indicate their needs and the price they are willing to pay for data. The third is related to the challenge of allocating revenue in a way that incentivizes participants. We discuss these 3 challenges next:
3.1.1 The Unique Characteristics of Data as an Asset
The double auction mechanism we used in the example above is useful to illustrate the market design problem, but it is not sufficient to design well-functioning markets. This is due to the unique characteristics of data:
1. Data is freely replicable.
2. Data is easy to combine arbitrarily.
Why double auctions are not sufficient. Consider what would happen in the double auction mechanism after the first transaction takes place. The buyer who wins the data walks away happily. The losing buyer, however, can simply bid for the same dataset again, offering the price they want to pay. Sellers, even though they’ve sold the dataset once, can sell it again because data is freely replicable. This means that buyers can just wait with their fixed price until a seller is willing to reduce the price to obtain some profit. Because some profit is better than no profit, the incentives are such that prices will plummet, making this mechanism impractical. The success of double auctions with material goods, where the ownership of goods is transferred once, does not easily carry over to selling data.
The approach. The market design toolbox uses techniques from mechanism design [mechanismdesign], which is a discipline concerned with designing rules for a game (i.e., the market) to yield the desired results. Here, the desired results are decided by the designer based on the market goal, its type, the incentives that make sense (e.g., money vs bonus points) and possibly other constraints. We discuss different market types and goals in the next section. Mechanism design has produced many interesting results, such as in auction theory [optimalauctiontheory], which is used, among others, to implement the real-time ad bidding that powers today’s Internet economy [adwordsauction].
In particular, we are designing a mechanism that works across infinite rounds of transactions, as well as multiple sellers and buyers coming to the market and leaving at different times. A key insight of our proposal (which we do not discuss in detail here) is the artificial creation of scarcity to incentivize seller participation in the market.
3.1.2 How much should I Pay?
Buyers need to indicate: i) their data needs, which can vary from query-by-example type interfaces to asking for data complementary to existing datasets, or more abstract declarations; ii) a metric to measure how well a specific dataset fulfills their data needs (the degree of satisfaction), which will be task-specific because metrics that are useful to measure the quality of an ML model are different from metrics that determine how complete the results of an aggregate query are; iii) a price they are willing to pay for the dataset, which is a function of the metric.
To model these needs we introduce willing-to-pay functions (WTP-function), which consist of 4 components:
A package that includes the data task that buyers want to solve. For example, this package could contain the code to train an ML classifier. This package is sent to the arbiter, so the arbiter can evaluate different datasets on the data task and measure the degree of satisfaction.
A function that assigns a willing-to-pay price for each degree of satisfaction. For example, this function may indicate that the buyer won’t pay any money for datasets that don’t help them achieve at least 80% classification accuracy. But after 80% accuracy, they will pay a fixed amount of money.
Packaged data that buyers may already own and don’t want to pay money for. For example, when buyers own multiple features relevant to train the ML model but want other datasets that augment their data (add features or training samples), they can send their code and their data to the arbiter.
Intrinsic dataset metrics that buyers desire. Some examples are expiry date to indicate for how long data is valuable to them; freshness to indicate that more recent datasets are more valuable; authorship to indicate preferences in who created the dataset; provenance to indicate that the buyer needs to know how data was generated; and many others, such as semantic metadata, documentation, frequency of change, and quality. Each of these properties has potentially many dimensions. For example, the buyer may indicate that they want data that is not more than 2 months old, fearing that concept drift [conceptshift] may affect their classification task otherwise.
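The four components above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `WTPFunction`, its field names, the feature names, and the fixed price of 100 are all hypothetical, and the "task" is a stand-in that scores a dataset by the fraction of required features it contains rather than by training a real classifier. Datasets are represented as dictionaries from column name to values.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Dataset = Dict[str, List[Any]]  # column name -> column values

# Hypothetical features the buyer's task requires.
REQUIRED = ("rent", "foot_traffic", "competitors", "income")


def toy_task(dataset: Dataset) -> float:
    """Stand-in for the packaged data task: returns a degree of satisfaction in [0, 1]."""
    return sum(name in dataset for name in REQUIRED) / len(REQUIRED)


def step_price(satisfaction: float) -> float:
    """Pay nothing below 80% satisfaction and a fixed amount at or above it."""
    return 100.0 if satisfaction >= 0.8 else 0.0


@dataclass
class WTPFunction:
    task: Callable[[Dataset], float]          # component 1: packaged data task
    price_curve: Callable[[float], float]     # component 2: satisfaction -> price
    owned_data: Dataset = field(default_factory=dict)               # component 3
    intrinsic_metrics: Dict[str, Any] = field(default_factory=dict)  # component 4 (not enforced here)

    def offer(self, candidate: Dataset) -> float:
        """Price the buyer would pay for `candidate` combined with data they already own."""
        combined = {**self.owned_data, **candidate}
        return self.price_curve(self.task(combined))


wtp = WTPFunction(
    task=toy_task,
    price_curve=step_price,
    owned_data={"rent": [1200, 900], "income": [55, 48]},
    intrinsic_metrics={"max_age_days": 60},
)
partial = wtp.offer({"foot_traffic": [300, 120]})
complete = wtp.offer({"foot_traffic": [300, 120], "competitors": [4, 1]})
```

With these toy values, a candidate that leaves one required feature missing yields a satisfaction of 0.75 and an offer of zero, while a candidate that completes the feature set yields the fixed price. An arbiter could evaluate `offer` over many candidate mashups to find one the buyer would pay for.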
We are working on new interfaces for users to easily indicate data needs, for example, through a schema description [dod]. These new interfaces require new data models to express not only relational operations but also fusion operations that would permit merging/contrasting different signals/opinions and transformation needs, such as pivoting, aggregates, confidence intervals, etc. The WTP-functions produced need to be interpreted by the mashup builder, a component we introduce in the next section as part of our DMMS architecture that is in charge of matching supply and demand.
3.1.3 Allocating Revenue to Query Plans
Consider the example above (bottom of Fig. 2), where the mashup that leads to is some nontrivial combination of ’s and ’s data that involves joining datasets and applying some transformation functions. If the price paid by the winning buyer for was , how do we distribute among the two sellers that contribute datasets as well as the arbiter who solved the discovery and integration problem to produce the mashup?
We are investigating information-theory and information-flow control techniques to understand how much revenue each node of the query plan deserves. In the simplest relational scenario, tracking provenance could be sufficient. With non-relational functions, such as the mapping function in the example of the introduction or a data fusion operation that gathers several conflicting values together, it becomes less clear what a good strategy is. A valid approach to this problem must answer precisely what data and which (relational and non-relational) operations led to a value in the output.
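For the simplest relational scenario, where provenance tracking could be sufficient, a naive allocation can be sketched as follows. Everything here is an assumption made for illustration: the function name, the fixed arbiter share, and the choice to split the remainder in proportion to the number of output cells each seller contributed. The paper does not propose this rule, and it ignores the harder non-relational cases discussed above.

```python
from collections import Counter


def allocate_revenue(price, cell_provenance, arbiter_share=0.25):
    """Split `price` among the arbiter and the contributing sellers.

    cell_provenance: one seller name per mashup cell delivered to the buyer.
    arbiter_share: fraction of the price kept by the arbiter (ASSUMPTION).
    """
    if not cell_provenance:
        raise ValueError("mashup has no cells with recorded provenance")
    arbiter_cut = price * arbiter_share
    pool = price - arbiter_cut
    counts = Counter(cell_provenance)
    total = sum(counts.values())
    # ASSUMPTION: each seller is paid in proportion to the cells they contributed.
    payouts = {seller: pool * n / total for seller, n in counts.items()}
    payouts["arbiter"] = arbiter_cut
    return payouts


# Hypothetical mashup with 8 cells: 6 traced to seller s1 and 2 to seller s2.
payouts = allocate_revenue(100.0, ["s1"] * 6 + ["s2"] * 2)
```

Counting cells treats every value as equally important, which is rarely true: a single join key may be what makes the whole mashup possible. That gap is one reason the text points to information-theoretic and information-flow techniques.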
3.2 Data Market Platform Design Space
A solution to the market design problem that involves addressing the 3 challenges above will be informed by the type of market we are designing. We consider different markets that cater to different scenarios:
External markets. In external markets, money is a good incentive to get companies who own valuable information to share it with others that may benefit from its use. These markets can be designed to optimize for social welfare such as in the example above, or to maximize seller or arbiter revenue, the number of buyers that satisfy their data needs, etc.
Internal markets. Internal data market platforms promise to bring down data silos by incentivizing data owners (e.g., specific teams or individuals) to publish their data in a way that is easy for others to consume, in exchange for bonus points or other employee compensation mechanisms.
Barter and gift markets. These are markets where the participant’s incentive to share their data is to receive data from somebody else. Consider the coalition of hospitals we mentioned in the introduction. A hospital may want to exchange its data for data other hospitals own in order to pool more patient data that may help it devise better patient treatment strategies, for example.
3.3 FAQ: Frequently Asked Questions
Why would people use the market to share data? A well-designed market incentivizes sellers to share data to obtain some profit, which may be monetary or some other form. It also incentivizes buyers to share their data needs in exchange for having their discovery and integration problems solved by the arbiter.
What if I’m not sure if my dataset is leaking personal information? Sharing data is predicated on the assumption that it is legal. Certain personally identifiable information (PII), for example, cannot be shared across entities without users’ permission. The DMMS that we present in the next section offers tools for anonymizing data and reducing the risk of leaking it.
In addition, once a dataset has been assigned a price, it is possible to envision a data insurance market, where a different entity than the seller (i.e., the arbiter) takes liability for any legal problems caused by that data. In this case, the arbiter is incentivized to avoid those problems, stimulating more research in secure and responsible sharing of data.
Wouldn’t markets concentrate data around a few organizations even more? Today, data is mostly concentrated around a handful of companies with the expertise and resources to generate, process, and use it. Ideally, we want to design markets that bring the value of data to a broader audience. It is certainly possible that a market would only worsen this concentration by allocating data to the richest and most powerful players. Fortunately, it is possible to design markets that disincentivize this outcome: achieving that is a goal of our research.
Is there going to be enough demand for a given, single dataset? We expect certain datasets will naturally have less demand than others, as with any asset today. However, with a powerful enough arbiter, individual datasets are combined and add value to lots of different mashups that may be, in turn, designed to satisfy a varied set of buyers’ needs.
Furthermore, studying the market dynamics will be important to determine, for example, if domain-specific markets (markets for finance, for health, for agriculture) would be more efficient than more general ones in concentrating and uncovering highly valuable datasets.
Why would a seller or buyer trust the arbiter? We don’t assume they would, and we discuss in the next section how this is a key design goal of a DMMS.
How would a seller know what price to assign to the dataset it is trying to share or sell? One option is for sellers to assign a price based on the effort it took them to obtain the dataset or on their perceived value. In practice, after initially setting a price, the seller may need to adjust it in order to sell the dataset based on, for example, feedback from the arbiter.
Alternatively, in markets where the goal is to maximize seller revenue, sellers can share their datasets without setting a price, knowing the arbiter will do its best to use their datasets in transactions.
The arbiter could prevent data duplication by assessing what datasets to accept, hence addressing one of the challenges of selling data. Regardless of the merits of that mechanism to enforce the right outcomes in the market, this design would not allow participants to trade freely. Furthermore, since datasets can be arbitrarily similar to each other, it is unclear what threshold the arbiter should use to make a decision, or how to compute that threshold in the first place.
Why would a buyer give out their code (as part of the WTP-function) when it may be an industrial secret? It is conceivable that buyers won’t trust the arbiter. We allow buyers to evaluate the code locally and report back their price post-usage. We are designing truthful mechanisms that incentivize buyers to tell the real value instead of reporting a low value to pay less.
How do WTP-functions work for EDA-like analysis? WTP-functions capture the price buyers are willing to pay for achieving a particular degree of satisfaction. When buyers do not know how to measure their satisfaction, such as when engaging in exploratory data analysis kind of tasks, this won’t work. In these cases, we may need to rely on truthful mechanisms such as those mentioned in the previous question.
How do sellers know in what mashups their datasets participated? The arbiter must keep track of every transaction that takes place. This involves recording the mashup building process (which is necessary to allocate revenue to sellers) as well as how each dataset was used to derive the mashup. This information should be made available to sellers when they do not trust the arbiter. A key challenge of the vision is to implement the necessary tooling to guarantee that all operations are recorded properly in a tamper-proof fashion.
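One way to picture such a record is a hash-chained, append-only log, in which every entry commits to the one before it. The sketch below is a minimal illustration under our own assumptions, not the DMMS design: the record fields (`mashup`, `datasets`, `op`) and the helper names are hypothetical, and a hash chain only makes tampering detectable (tamper-evident) rather than impossible.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash, payload):
    # Hash the previous record's hash together with a canonical JSON payload.
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(log, payload):
    # Each record commits to its predecessor, forming a chain.
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev": prev_hash,
                "hash": _digest(prev_hash, payload)})


def verify(log):
    # Recompute every hash; any edit to an earlier record breaks the chain.
    prev_hash = GENESIS
    for rec in log:
        if rec["prev"] != prev_hash or rec["hash"] != _digest(prev_hash, rec["payload"]):
            return False
        prev_hash = rec["hash"]
    return True


def mashups_using(log, dataset_id):
    # What a seller would ask: in which mashups did my dataset participate?
    return [r["payload"]["mashup"] for r in log
            if dataset_id in r["payload"]["datasets"]]


log = []
append_record(log, {"mashup": "m1", "datasets": ["s1/weather", "s2/sales"], "op": "join"})
append_record(log, {"mashup": "m2", "datasets": ["s2/sales"], "op": "project"})
```

A seller who does not trust the arbiter could call `verify` on the log before reading the answer of `mashups_using`.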
There are many types of markets and goals. Each market definition (step (1) in Fig. 1) is fed to a market design toolbox (step (2)) which produces a set of market rules using techniques from mechanism design, among others. These rules are designed in order to incentivize players such that their actions produce the outcome the designer wants.
The rules alone do not solve the problems of sharing, discovering and integrating data. We need a DMMS to implement them in practice (step (4)): this is the topic of the next section.
4 Data Market Management System
Data market management systems must be designed to support different market designs (i.e., rules) and they must offer support to sellers, buyers, and the arbiter. The DMMS we propose achieves that using seller, buyer, and arbiter management platforms, which are shown in Fig. 3.
4.1 Overview of Arbiter Management Platform
The arbiter management platform (AMP) is the most complex of all the DMMS’s components: not only does it build mashups to match supply and demand, but it also implements the market design rules. We use the architecture in Fig. 3 to drive the description of how the AMP works.
The AMP receives a collection of WTP-functions from buyers specifying the data needs they have. Sellers share their datasets with the arbiter, expecting to profit from transactions that include their datasets. The AMP uses the Mashup Builder (top of the figure) to identify combinations of datasets (we call these mashups) that satisfy buyers’ needs. These are depicted in the figure.
The next step is to evaluate the degree of satisfaction that each mashup achieves for each buyer’s WTP-function. This task is conducted by the WTP-Evaluator. The WTP-Evaluator first runs the WTP-function code on each mashup and measures the degree of satisfaction achieved. For example, on an ML task, it measures the accuracy. With the degree of satisfaction, it then computes the amount of money (or other incentives) the buyer is willing to pay. The output of the WTP-Evaluator is a collection of (mashup, price) pairs indicating the amount of money that a buyer is willing to pay for each mashup that fits the needs indicated by their WTP-function.
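The evaluation step can be sketched as follows. This is a minimal illustration under stated assumptions: the `WTPFunction` class, the `evaluate` helper, the row-list representation of a mashup, and the buyer's completeness-based satisfaction measure are all hypothetical stand-ins (the text's own example measures ML accuracy instead).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Mashup = List[dict]  # assumption: a mashup is modelled as a list of rows


@dataclass
class WTPFunction:
    buyer: str
    satisfaction: Callable[[Mashup], float]  # buyer's code: degree of satisfaction
    price_for: Callable[[float], float]      # satisfaction -> amount buyer would pay


def evaluate(wtp: WTPFunction, mashups: Dict[str, Mashup]) -> List[Tuple[str, float]]:
    # Run the buyer's code on every candidate mashup and keep those with a positive bid.
    pairs = []
    for name, rows in mashups.items():
        degree = wtp.satisfaction(rows)
        bid = wtp.price_for(degree)
        if bid > 0:
            pairs.append((name, bid))
    return pairs


# Hypothetical buyer: satisfaction is the fraction of rows with a known "temp".
def completeness(rows: Mashup) -> float:
    if not rows:
        return 0.0
    return sum(r.get("temp") is not None for r in rows) / len(rows)


# Hypothetical step-wise price schedule chosen by the buyer.
def step_price(degree: float) -> float:
    if degree >= 0.8:
        return 100.0
    if degree >= 0.5:
        return 40.0
    return 0.0


mashups = {
    "m1": [{"temp": 20}, {"temp": 21}, {"temp": None}, {"temp": 19}, {"temp": 18}],
    "m2": [{"temp": 20}, {"temp": None}],
    "m3": [{"temp": None}],
}
pairs = evaluate(WTPFunction("b1", completeness, step_price), mashups)
```

Here `m3` is dropped because it yields no satisfaction, so the buyer bids nothing for it.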
The next step is to use the Pricing Engine to set a price for each mashup and choose a winner (the market design may specify more than one winner, but we use one here to simplify the presentation). The Transaction Support component delivers the winning mashup to the winning buyer and collects the payment. Finally, the Revenue Allocation Engine allocates that payment among the sellers that contributed datasets used to build the mashup and the arbiter. At this point the transaction is completed.
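The last two steps can be sketched in a few lines. The rules used here are placeholders chosen only for illustration, since the actual rules come from the market design: the winner is the highest bid and pays that bid, the arbiter keeps a fixed commission (`arbiter_share`, a hypothetical parameter), and the remainder is split equally among contributing sellers.

```python
def choose_winner(bids):
    # bids: list of (buyer, mashup, amount); assumption: highest bid wins and pays its bid.
    return max(bids, key=lambda bid: bid[2])


def allocate(payment, contributors, arbiter_share=0.1):
    # Assumption: fixed arbiter commission, equal split among contributing sellers.
    arbiter_cut = payment * arbiter_share
    per_seller = (payment - arbiter_cut) / len(contributors)
    payout = {seller: per_seller for seller in contributors}
    payout["arbiter"] = arbiter_cut
    return payout


bids = [("b1", "m1", 100.0), ("b2", "m1", 80.0), ("b1", "m2", 40.0)]
winner = choose_winner(bids)
payout = allocate(winner[2], ["s1", "s2"])
```

A different market design would swap in other functions, for example a revenue split based on each seller's contribution rather than an equal split.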
Arbiter Services. Because the arbiter knows the supply and demand for datasets, it can use this information to offer additional services for buyers and sellers, perhaps for a fee. For example, the arbiter could recommend datasets to buyers based on what similar buyers have purchased before [collaborativefiltering]. This kind of service leaks information that was previously private to other buyers. Therefore, this should be reflected in the market design.
Negotiation Rounds. If the AMP cannot find mashups that fulfill the buyer’s needs, it can describe the information it lacks and communicate it to the sellers, who are incentivized to add that information to receive a profit. For example, the AMP may ask the seller to explain how to transform an attribute so it joins with another one, or it may request information about how a dataset was obtained/measured, semantic annotations, mapping tables, etc. Sellers will be incentivized to help if that raises their prospect of profiting from the transaction. Similarly, buyers can ask the arbiter for data context (provenance, how data was measured/sampled, how fresh it is, etc.) when they need it to effectively use the data.
4.2 Seller Management Platform
The seller management platform (SMP) communicates with the AMP to share datasets and receive profit, to coordinate anonymization procedures (as we see next), as well as to agree on changes to the dataset that may improve the seller’s chances of participating in a profitable transaction. Next, we explain the key services we envision the SMP offering sellers:
Anonymization. Even if incentivized to sell data for money, sellers face a deterrent when their data may leak information that should not be public, e.g., personally identifiable information (PII). To assist sellers, the SMP must incorporate some support for dataset anonymization. And because anonymized datasets may leak information when combined with other datasets [deanonymizenetflix], which is precisely what the arbiter will do as part of the mashup building process, the anonymization process must be coordinated between the SMP and the AMP.
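To see why this coordination matters, consider k-anonymity as one possible (assumed) privacy criterion; the paper does not prescribe a specific one. In the sketch below, a seller's dataset is 2-anonymous on its own quasi-identifiers, but the mashup produced by a join adds an attribute that singles out individual records. The data, the record key `rid`, and the helper names are hypothetical.

```python
from collections import Counter


def k_anonymity(rows, quasi_ids):
    # Size of the smallest group of rows sharing the same quasi-identifier values.
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return min(groups.values()) if groups else 0


def join(left, right, key):
    # Simple inner join on a shared key, as the arbiter would do when building a mashup.
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


seller_a = [
    {"rid": 1, "zip": "60637", "age": "30-39"},
    {"rid": 2, "zip": "60637", "age": "30-39"},
    {"rid": 3, "zip": "60615", "age": "40-49"},
    {"rid": 4, "zip": "60615", "age": "40-49"},
]
seller_b = [
    {"rid": 1, "gender": "F"},
    {"rid": 2, "gender": "M"},
    {"rid": 3, "gender": "F"},
    {"rid": 4, "gender": "F"},
]

k_before = k_anonymity(seller_a, ["zip", "age"])
mashup = join(seller_a, seller_b, "rid")
k_after = k_anonymity(mashup, ["zip", "age", "gender"])
```

A check of this kind would have to run on the arbiter's side, because only the arbiter sees the combined mashup.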
Accountability. The SMP must allow sellers to track how their datasets are being sold in the market, e.g., as part of what mashups.
Data Packaging. The SMP assists with transforming datasets provided by sellers into a format interpretable by the arbiter. In addition, this feature must allow sellers to share datasets with coarse granularity (by pointing to a data lake, cloud storage full of files, or a data warehouse), which is useful in the case of internal markets of data, e.g., when a seller wants to remove a data silo.
4.3 Buyer Management Platform
Data buyers must provide the arbiter with a willing-to-pay function (WTP-function) that indicates the price a buyer is willing to pay given the satisfaction achieved by a given dataset.
Buyer management platforms (BMP) have the following requirements:
Because manually describing a WTP function may be difficult, a BMP must help buyers define it. One way of achieving that is through learning schemes that capture buyers’ data declaration and expectation to sketch a WTP function transparently to the buyer.
Secure sharing of the WTP-function with the arbiter, so that the arbiter can compute the level of satisfaction achieved by different mashups and obtain the price the buyer bids for each of them.
Finally, a communication channel that lets the buyer and the arbiter exchange mashups and WTP-functions, and that allows the arbiter to recommend alternative datasets to the buyer, e.g., when the arbiter knows of similar buyers who have acquired such datasets.
4.4 Trust, Licensing, Transparency
Now we zoom out to the general architecture comprising the AMP, BMP, and SMP and consider how differing degrees of trust, the existence of data licenses, and the need for transparency introduce additional challenges for the design and implementation of a DMMS.
Trust. We have assumed so far that sellers and buyers trust the arbiter. Sellers trust that the arbiter won’t share the data without sellers’ consent, that it will implement the rules established by the market design faithfully, and that it will allocate revenue following those rules too. Buyers trust the arbiter with their code (that ships as part of the WTP-function), and similar to sellers, they trust the arbiter will enforce the agreed market rules. Although we think it’s reasonable to assume trust in a third party, similar to how individuals and organizations trust the stock market, it is conceivable to imagine scenarios where trust is not granted. In this case, we need to consider techniques for privacy-preserving data management [privacypreservingdata], processing over encrypted data [processingencrypteddata], as well as disaggregated, peer-to-peer markets and blockchain platforms [blockchain1, blockchain2]. Similarly, if we want to spare buyers from sharing their code and instead rely on them self-reporting how much value they extracted from a dataset after using it, we must make sure the market design incentivizes them to tell the truth.
Data licensing. Sellers can assign different licenses to the datasets they share that would confer different rights to the beneficiary. Similarly, buyers may be interested in obtaining datasets subject to licensing constraints. For example, a hedge fund may want to acquire a dataset with exclusive access, preventing competitors from accessing the same data. The artificial scarcity generated by this license should cost more to buyers, who could be forced to pay a ’tax’ so long as they maintain exclusive access. Other types of licenses are those that transfer ownership completely, so buyers can resell the datasets as soon as they have bought them (creating a market for arbitrageurs as we discuss in the next section), or licenses that prevent the beneficiaries from selling a previously acquired dataset. Supporting these licensing options affects both the market design and the DMMS. Furthermore, it raises questions of legality and ethics that go beyond computer science and economics.
Transparency. Transparency may be required at many points of the market process. Sellers may need to know in what mashups their data is being sold and what aspects of their data (rows, columns, specific values) are most valuable. Similarly, buyers may request transparent access to the mashup building process to understand the original datasets that contribute to the mashup and decide whether to trust them or not. We do not discuss the implications of these requirements; we only highlight that they have an impact on the engineering of a DMMS.
4.5 Markets of Many Data Types
We have presented the AMP, SMP, and BMP without focusing on a specific type of data to be exchanged. We envision markets to trade data of many types:
Multimedia Data. A variety of multimedia data such as text, web (i.e., a search engine market that does not depend on ads?), as well as video are likely targets for a data market platform. How to build DMMS platforms to reason about how to combine and prepare this data for buyers is a challenge.
Markets for Personal Data. Ultimately, we’d like to be able to price a person’s own information. If I knew how much the information I’m giving an online service is worth, I could make a better decision on whether the exchange is really worth it or not. Because an individual’s own data is often not worth much in itself, but quickly gains value when aggregated with the data of other users, it is conceivable that coalitions of users would form who collectively would choose to relinquish/sell certain personal information to benefit together from their services.
Embeddings and ML Models. Embeddings and vector data are growing fast because they are the input and output format of many ML pipelines. As data-driven companies keep building on their ML capabilities, we expect this data will only grow. Obtaining some of these embeddings incurs a high cost in compute resources, carbon footprint, and time. For example, the BERT pre-trained models produced by Google [bert] take many compute hours to build. For this reason, we expect companies will rely on the exchange of pre-trained embeddings more and more, hence our interest in supporting this format in our data market platforms.
Out of the many possible data formats, we focus initially on tabular data such as relations and spreadsheets because this data is sufficient to cover most business reporting, analytical, as well as many machine learning tasks. In the next section, we introduce a Mashup Builder specific to this type of data.
5 MB: Matching Supply and Demand
The goal of the mashup builder is to generate a collection of mashups that satisfy a WTP-function. That requires identifying relevant datasets among the many available datasets and integrating them into a mashup. The architecture of the system we are building is depicted in Fig. 4 and is designed to address the following problems:
Data Discovery. The arbiter receives many datasets coming from organizations (a single organization may own thousands of datasets). The goal of data discovery is to identify a few datasets that are relevant to a WTP-function among thousands of diverse heterogeneous datasets.
Data Integration and Blending. The goal of data integration and blending is to identify strategies to combine the datasets identified by the discovery component into mashups that satisfy the WTP-function. Those strategies consist of identifying mapping and transformation functions to join attributes as well as other preparation tasks such as value interpolation to join on different time granularities.
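As a concrete instance of value interpolation, the sketch below joins a fine-grained dataset with a coarse-grained one by linearly interpolating the coarse values at the fine timestamps. It is an illustration under our own assumptions: linear interpolation is only one possible strategy, and the attribute names (`hour`, `sales`, `temp`) are hypothetical.

```python
from bisect import bisect_left


def interpolate(points, t):
    # points: non-empty list of (time, value) sorted by time.
    # Returns the linearly interpolated value at t, or None outside the covered range.
    times = [p[0] for p in points]
    if t < times[0] or t > times[-1]:
        return None
    i = bisect_left(times, t)
    if times[i] == t:
        return points[i][1]
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Coarse dataset: temperature measured every 12 hours.
temperature = [(0, 10.0), (12, 22.0)]

# Fine dataset: sales recorded every 6 hours.
sales = [
    {"hour": 0, "sales": 5},
    {"hour": 6, "sales": 7},
    {"hour": 12, "sales": 9},
    {"hour": 18, "sales": 4},
]

# Mashup: every sales row is extended with an interpolated temperature.
mashup = [{**row, "temp": interpolate(temperature, row["hour"])} for row in sales]
```

Rows outside the coarse dataset's time range get `None` rather than an extrapolated value, leaving that decision to a later preparation step.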
Because multiple similar datasets may contribute to the same or a small group of similar mashups, data fusion operations permit combining and contrasting the different combinations, keeping track of the origin of each data item, so consumers understand how data was assembled.
We bootstrap the implementation of the mashup builder with Aurum [aurum], a data discovery system that not only allows users to find relevant datasets but also combines them using join operations. To do that, it extracts metadata from the input datasets, organizes that metadata in an index, and uses the index to identify datasets based on the criteria indicated in the WTP-function. The architecture of the Mashup Builder is shown in Fig. 3(right). We describe the components next:
5.1 Metadata Engine
The metadata engine’s goal is to read and maintain the lifecycle of each input dataset. Datasets can be automatically read from a source in bulk (e.g., a relational database, a data lake, a repository of CSV files in the cloud) or they can be registered manually by a user who wants to share specific datasets. This is performed by the ingestion module through its batch and sharing interfaces as shown in the figure. Each dataset is divided conceptually into data items, which are the granularity of analysis of the engine. For example, a column data item can be used to extract the value distribution of that attribute. A row data item can be used to compute co-occurrences among values. A partial row data item can be used to compute correlations, among others. For each dataset, the metadata engine maintains a time-ordered list of context snapshots. A context snapshot captures the different properties of each dataset’s data items, for example, signatures of their contents, a collection of human or machine owners (i.e., what code is using what data), as well as the security credentials. This is performed by the Processor component of the system.
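The notions of column data items and time-ordered context snapshots can be sketched as plain data structures. Everything named here (`ContextSnapshot`, `DatasetRecord`, `column_signature`, `changed_columns`) is a hypothetical illustration rather than the engine's actual schema, and the signature is a simple hash of distinct values rather than whatever sketches the real system computes.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ContextSnapshot:
    version: int         # position in the time-ordered list
    signatures: dict     # column name -> hash of its distinct values
    distributions: dict  # column name -> Counter with the value distribution
    owners: tuple        # humans or programs known to use the dataset


@dataclass
class DatasetRecord:
    name: str
    snapshots: list = field(default_factory=list)  # oldest first


def column_signature(values):
    # Order-independent content signature of a column data item.
    digest = hashlib.sha256()
    for value in sorted({str(v) for v in values}):
        digest.update(value.encode("utf-8") + b"\x00")
    return digest.hexdigest()


def take_snapshot(record, rows, owners=()):
    # Split rows into column data items and summarize each of them.
    columns = {}
    for row in rows:
        for col, val in row.items():
            columns.setdefault(col, []).append(val)
    snap = ContextSnapshot(
        version=len(record.snapshots),
        signatures={c: column_signature(v) for c, v in columns.items()},
        distributions={c: Counter(v) for c, v in columns.items()},
        owners=tuple(owners),
    )
    record.snapshots.append(snap)
    return snap


def changed_columns(record):
    # Compare the two most recent snapshots to find columns whose content changed.
    if len(record.snapshots) < 2:
        return set()
    old, new = record.snapshots[-2].signatures, record.snapshots[-1].signatures
    return {c for c in old.keys() | new.keys() if old.get(c) != new.get(c)}


record = DatasetRecord("stores")
take_snapshot(record, [{"id": 1, "city": "Chicago"}, {"id": 2, "city": "Boston"}], ["etl_job"])
take_snapshot(record, [{"id": 1, "city": "Chicago"}, {"id": 2, "city": "Austin"}], ["etl_job"])
```

Comparing signatures across snapshots is what lets downstream components react only to the columns that actually changed.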
Because data item information is not given directly at ingestion time, the engine must harness that information. Data market platforms aim to incentivize users to provide that information directly, but in certain scenarios this is not possible: e.g., a data steward pointing to a collection of databases in an internal organization. The output of the metadata engine is conceptually represented in a relational schema, as performed by the Sink component.
The metadata engine is a fully-incremental, always-on system that is in charge of keeping the output schema as updated as possible while controlling the overhead incurred in the multiple source systems and the precision of the output information.
5.2 Index Builder
The index builder processes the output schema produced by the metadata engine and shapes data so it can be consumed by the dataset-on-demand engine (DoD), which is the component in charge of integration and blending of mashups. Among other tasks, the index builder materializes join paths between files, and it identifies candidate functions to map attributes to each other; i.e., it facilitates the DoD’s job. The index builder keeps indexes up-to-date as the output schema changes. This calls for efficient methods to leverage the signatures computed during the first stage.
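Materializing join paths can be pictured as building a graph whose nodes are tables and whose edges link columns with overlapping values, then searching that graph. The sketch below is an assumption-laden illustration: the containment measure, the `threshold` value, and the function names are ours, and a real index would rely on precomputed signatures instead of comparing raw value sets.

```python
from collections import deque


def containment(a, b):
    # Fraction of the distinct values of a that also appear in b.
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


def build_join_graph(columns, threshold=0.8):
    # columns: {(table, column): values}. Link columns of different tables
    # when one is (almost) contained in the other.
    graph = {}
    keys = list(columns)
    for i, x in enumerate(keys):
        for y in keys[i + 1:]:
            if x[0] == y[0]:
                continue
            score = max(containment(columns[x], columns[y]),
                        containment(columns[y], columns[x]))
            if score >= threshold:
                graph.setdefault(x[0], []).append((y[0], x[1], y[1]))
                graph.setdefault(y[0], []).append((x[0], y[1], x[1]))
    return graph


def join_path(graph, src, dst):
    # Breadth-first search: shortest chain of joins from src to dst, or None.
    queue, seen = deque([(src, [])]), {src}
    while queue:
        table, path = queue.popleft()
        if table == dst:
            return path
        for nxt, left_col, right_col in graph.get(table, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(table, left_col, nxt, right_col)]))
    return None


columns = {
    ("sales", "store_id"): [1, 2, 3],
    ("sales", "amount"): [10, 20, 30],
    ("stores", "id"): [1, 2, 3, 4],
    ("stores", "zip"): ["60637", "60615"],
    ("weather", "zip"): ["60637", "60615", "60601"],
}
graph = build_join_graph(columns)
path = join_path(graph, "sales", "weather")
```

In this example the DoD engine would learn that `sales` can reach `weather` through `stores`, even though the two tables share no column directly.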
5.3 DoD Engine
The DoD engine takes WTP-functions as input and produces mashups that fulfill the WTP-function requests as output. It uses the indexes built by the index builder, the output schema generated by the metadata engine, as well as the raw data.
The DoD relies on query reverse engineering and query-by-example techniques [reversequery1], as well as program-synthesis [programsynthesis], among others, to produce the desired mashups.
Data Fusion. When there are many datasets available, DoD may find multiple alternatives to produce mashups. In certain cases, a buyer wants to see a contrast of mashups (this will be specified in the WTP-function). For example, consider a buyer who wants to access weather data and there are multiple sources that provide this information. A data fusion operator can align the differing values into a mashup that the buyer can explore manually. A specific fusion operator may select one value based on majority voting, for example, while other fusion operators will implement other strategies. Buyers may want to have access to all available signals to make up their own minds. As a consequence, buyers may want to use DoD’s fusion operators to help combine the different sources into mashups.
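A fusion operator of the majority-voting kind can be sketched as follows. The function name `fuse`, the row layout, and the choice to keep every source's value under a `signals` field are assumptions made for illustration; the sketch keeps all signals next to the fused value so that a buyer can still inspect the disagreement.

```python
from collections import Counter


def fuse(sources, key, attr):
    # sources: {source name: list of rows}. Align rows on `key` and
    # resolve conflicting `attr` values by majority voting.
    by_key = {}
    for name, rows in sources.items():
        for row in rows:
            by_key.setdefault(row[key], {})[name] = row[attr]
    fused = []
    for k in sorted(by_key):
        signals = by_key[k]  # every source's opinion, kept for transparency
        winner = Counter(signals.values()).most_common(1)[0][0]
        fused.append({key: k, attr: winner, "signals": signals})
    return fused


sources = {
    "provider_a": [{"day": "2020-01-01", "temp": 20}, {"day": "2020-01-02", "temp": 18}],
    "provider_b": [{"day": "2020-01-01", "temp": 20}],
    "provider_c": [{"day": "2020-01-01", "temp": 22}, {"day": "2020-01-02", "temp": 18}],
}
mashup = fuse(sources, "day", "temp")
```

Other fusion operators would replace only the line that picks `winner`, for example by preferring the freshest source.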
5.4 Machines and People
Automatically assembling a mashup from individual datasets when only given a description of how the mashup should look is an ambitious goal. Our experience working on this problem for the last few years has taught us that in certain cases this may not be possible at all, such as when ambiguity makes it impossible to understand the right strategy to combine two datasets. We devise two strategies to tackle this problem.
The first strategy is to have the AMP interact with sellers to request additional information about the datasets they have shared that may help with the integration and blending process, e.g., a semantic annotation, a function to obtain an alternative representation, etc. Sellers willing to include the additional information can be incentivized to do so by obtaining a higher profit, for example.
An alternative strategy is for the mashup builder itself to incorporate humans-in-the-loop as part of its normal operation. This has been done before to answer relational queries [crowddb, crowdenumeration], and there may be opportunities to extend those techniques to help with integration and blending operations as well. Because all this takes place in the context of a market, it becomes possible to compensate those humans according to the value they are creating.
6 Evaluation Plan
In this section, we explain how we plan to evaluate market designs, as well as the DMMS implementation.
6.1 Simulation of Market Designs
A market design that is sound on paper may suffer unexpected setbacks in practice. This may happen because rationality assumptions made at design time may break in the wild. In mechanism design and game theory, rationality means that players play the best strategy available to them. Unfortunately, that does not account for risk-loving or ignorant players. Furthermore, some players may be adversarial in practice, forming coalitions with other players to game the market. Or, less dramatically, a faulty piece of software may cause erratic behavior. Below, we explain how we plan to evaluate the effectiveness and efficiency of market designs in practice.
Effectiveness. The mismatch between theory and practice calls for a framework to evaluate how resilient a market design is under adversarial, evil, and faulty processes. We plan to design a simulation platform where it is possible to implement different rules and change the behavior of players, and where it is possible to model adversarial, coalition-building, as well as risky and ignorant players (this is shown in (3) of Fig. 1). The goal of the simulation platform is to understand the robustness of different mechanisms before their deployment.
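A toy version of such a simulation is sketched below to make the idea concrete. It is not the planned platform: the two rules (first-price and second-price sealed-bid auctions for a single mashup), the two player behaviors (`truthful` and `shade`), and the uniform value distribution are textbook placeholders we assume for illustration.

```python
import random


def truthful(value):
    return value


def shade(value):
    # A player who under-reports to try to pay less.
    return 0.5 * value


def first_price(bids):
    winner = max(bids, key=bids.get)
    return winner, bids[winner]


def second_price(bids):
    ranked = sorted(bids, key=bids.get, reverse=True)
    price = bids[ranked[1]] if len(ranked) > 1 else 0.0
    return ranked[0], price


def simulate(rule, strategies, rounds=1000, seed=0):
    # strategies: {player: function from private value to bid}.
    rng = random.Random(seed)  # fixed seed: same private values across runs
    utility = {p: 0.0 for p in strategies}
    revenue = 0.0
    for _ in range(rounds):
        values = {p: rng.uniform(0, 100) for p in strategies}
        bids = {p: strategies[p](values[p]) for p in strategies}
        winner, price = rule(bids)
        utility[winner] += values[winner] - price
        revenue += price
    return revenue, utility


honest = {"b1": truthful, "b2": truthful, "b3": truthful}
deviant = {"b1": shade, "b2": truthful, "b3": truthful}

_, u_first_honest = simulate(first_price, honest)
_, u_first_deviant = simulate(first_price, deviant)
_, u_second_honest = simulate(second_price, honest)
_, u_second_deviant = simulate(second_price, deviant)
```

Comparing `b1`'s utility across the four runs shows whether a rule rewards under-reporting: under the first-price rule shading pays off, while under the second-price rule it cannot beat truthful bidding.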
Large-scale simulations introduce database challenges such as: i) supporting quick communication among many players (transaction processing); ii) modeling workloads to simulate different strategy distributions of players. Such a simulation framework will be of independent interest.
Efficiency. At their core, market mechanisms are implemented as algorithms. The fields of mechanism design and algorithmic game theory have contributed efficient approximation algorithms [mechanismdesign]. In databases, algorithms with high complexity are often used in practice for small problems, and conversely, algorithms with low complexity cannot be used practically because of the size of the data. We want to contribute empirical evaluations of these designs when implemented in a software platform such as the DMMS we describe.
6.2 Evaluating a DMMS
We plan to deploy a prototype of our DMMS in an internal market first, within the context of collaborating organizations. This will help us hone the interface with humans, better understand the deployment context and its constraints, and conduct quantitative evaluations. Although the metrics to evaluate a DMMS are many, we explain a few we deem important below.
Mashup building. We care about quantitative metrics such as throughput, latency, scalability, robustness as well as qualitative properties, such as the degree of automation achieved by the system. Qualitative properties are harder to evaluate because of the lack of standard integration benchmarks. We are designing benchmarks that capture the data market scenario—which is general to other point integration efforts as well. We think these benchmarks will be of independent interest to the database community.
SMP, BMP, AMP. These platforms can be evaluated on their performance and scalability, but also on how successful they are at assisting sellers with anonymizing datasets and at helping buyers specify their WTP-functions. In addition, we think there are interesting systems-research opportunities to speed up the execution of market rules by using caching, memoization, and other techniques.
7 Societal Impact of Data Markets
The side effects of data markets span beyond computer science and economics. We plan to engage with the broader community of scholars at The University of Chicago and elsewhere to discuss and outline the challenges of data markets in a broader societal context. We outline some interesting aspects below.
7.1 Economic Opportunities
A well-functioning market generates economic opportunities for other players besides sellers and buyers:
Arbitrageurs. They play seller and buyer at the same time. Arbitrageurs buy certain datasets, transform them, perhaps combining them with certain information they possess, and sell them again to the market. The transaction generates a profit for them whenever the sold dataset is priced higher than the dataset they bought. Since we want to design mechanisms that price datasets based on supply and demand, it is conceivable that the participation of arbitrageurs in the market will raise data’s value, because they will be incentivized to transform datasets into a shape that is desired by buyers.
Opportunistic data seller. Opportunistic data sellers may not own data, but they have time that they are willing to invest in collecting high-demand datasets. They obtain information about highly demanded datasets from the arbiter. For example, consider one more time the example of the introduction with the two sellers and the buyer. Consider a third seller, Seller 3, who does not own any dataset, but has time, and is willing to use that time to acquire/find data for profit. Because the arbiter knows that the buyer would benefit from an attribute that neither of the two sellers' datasets contains, the arbiter can ask Seller 3 to obtain a dataset with that attribute for money. Because the arbiter knows supply and demand, not only does it help sellers and buyers, but it creates an ecosystem of economic opportunities for other entities.
Offloading tasks. As discussed above, when the arbiter does not know how to automatically assemble a mashup, it can schedule humans to help with the task and compensate them appropriately for their labor.
Data Insurance. Once data has a value and a price, it is possible to build an insurance market around it. Such an insurance market would be useful to reason about data breaches, for example. How liable is a company that suffers a data breach that results in leaking private customer information? Or, if a seller shares a dataset that is later de-anonymized by a third party, despite the best efforts from the arbiter to anonymize it, who is liable? Can/Should insurance cover these cases?
7.2 Legal and Ethical Dimension
Who owns a dataset? Throughout this paper, we have assumed that sellers own the data they share with the arbiter. Consider a seller who has collected a dataset through their manual effort and skill. In this case, does the seller own such a dataset? What if the records in the dataset correspond to users interacting with a service the seller has created? Do those users own part of the data too? A recent article from the New York Times [nyt-privacy] has illustrated in glaring detail how it is possible to determine with high precision the location of individuals and their daily activities from smartphone data traces. The data that permits that is routinely collected and sold by companies that profit from it. This leads to questions around what data is legal to possess, what it means to own data, and when it should be possible to trade data.
Market Failures. Markets sometimes fail and cause social havoc. Other times, markets work only for a few, causing or accentuating existing inequality. All markets are susceptible to these kinds of problems, including the ones we envision in this paper. The difference is that we haven’t implemented our market yet, so we have a chance to study beforehand what the consequences of malfunctioning markets on society are and decide whether the tradeoffs are worth it. Forecasting the implications of different market designs is a key aspect of our vision; hence the simulation framework introduced in the previous section.
|Feature-based Valuation [bigdatfirm18]||✓|
|Sparseness-based Valuation [predwbig13]||✓|
|Shapley on Data Points [ghorbani2019data]||✓||✓|
Shapley for KNN[jiaNN19]
|Model-based Query Pricing [chenMLQ19]||✓||✓|
|Max-Revenue Query Pricing [revMax19]||✓||✓||✓|
|Plain Query Pricing [qbdp15]||✓|
|Privacy-based Query Pricing [ppd14]||✓||✓|
8 Related Work
We propose the first comprehensive vision of end-to-end data market platforms that considers all players involved and makes an explicit separation between design and implementation (DMMS). We structure this section to explain how other work relates to our vision. We start with a discussion of data markets (Section 8.1) and then focus on work related to the DMMS and the Mashup Builder in Section 8.2.
8.1 Market Design Taxonomy
To ease the discussion of the related work, we recall four properties of data markets that we have argued in this paper are necessary. We then divide the related work into blocks and discuss them with respect to these properties. We have summarized this discussion in Table 1. The four properties we consider here are:
P1. Data-Enhancing: The arbiter is an active party that matches supply and demand by creating mashups adjusted to buyers’ needs.
P2. Plug-n-Play Market Rules: The market rules can be adjusted to different goals and constraints.
P3. Incentive-Compatible: The market is designed so that buyers and sellers are incentivized to not game the system.
P4. Reward-Compatible: The market rewards high-value data, including data that is carefully curated and documented.
Existing Marketplaces of data. Today’s marketplaces of data do not fulfill any of properties P1, P2, P3 or P4. We use Dawex [dawex] as a representative of these markets, which include OnAudience.com [ads1], BIG.Exchange [ads2], BuySellAds [ads3] for ad data, as well as Qlik Datamarket [mkt1], Xignite [mkt2], WorldQuant [mkt3], DataBroker DAO [mkt4], Snowflake’s Data Exchange [snowflake-exchange], among others. In Dawex, sellers offer datasets that are sold as-is. Dawex facilitates in this way the sharing of datasets. Buyers, however, still need to perform a discovery stage and an integration stage, where buyers must adapt the datasets to the format they need. The Dawex platform acts as a broker. It enables transactions but does not help with the discovery and integration problems, unlike the markets we envision. In particular, the arbiter does not combine datasets to fulfill the buyers’ needs. Buyers have an interface to explore a sample of the dataset they want to purchase, but they need to commit and pay the price for the dataset before truly knowing how valuable the dataset is for them. This is characteristic of today’s online marketplaces that have been built with a focus on sharing and not discovering and integrating.
Academic Market Designs. In the marketplace for data proposal [agarwal19], buyers with an ML prediction task request a dataset from the market. Given a combination of training data from multiple sellers, the work uses the Shapley value [shapleyval] to allocate revenue to sellers. The model considers one single buyer at one point in time. Like in our model, buyers only pay for datasets that are guaranteed to achieve certain quality on an ML task. This work assumes (P1) is solved without giving a solution (that’s not its focus); it considers one fixed market goal (P2) and designs mechanisms for that scenario. This proposal focuses on the market modeling and does not explain how to implement the ideas in software, but it showcases many of the challenges we have outlined here related to the design of incentive-compatible mechanisms that govern how participants engage with the market.
Incentive-Compatible. A number of papers have focused on a specific problem: how to allocate revenue to multiple sellers that have contributed to a dataset (typically a training dataset) given a price for that dataset. They have used the Shapley value [shapleyval] to determine the ’value’ of each datum, and hence the total contribution of each seller [agarwal19, ghorbani2019data, jiaNN19]. The contributions of these works center around how to compute the Shapley value efficiently. While the algorithmic marketplace deals with the problem of data replicability, the other works do not. These lines of work are concerned with (P3), but with none of (P1, P2, P4).
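For a handful of sellers, the Shapley value can be computed exactly by averaging each seller's marginal contribution over all orderings, which also makes clear why the cited works focus on efficient approximations: the number of orderings grows factorially. The sketch below uses a made-up value function (`mashup_value`) and seller names; it illustrates the textbook definition, not the algorithms of the cited papers.

```python
from itertools import permutations
from math import factorial


def shapley(players, value):
    # value: function from a frozenset of players to the worth of that coalition.
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            # Marginal contribution of p when joining the players before it.
            phi[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in phi.items()}


# Hypothetical setting: s1 and s3 sell interchangeable datasets of kind A,
# s2 sells kind B. A mashup with one kind sells for 40, with both kinds for 100.
def mashup_value(coalition):
    has_a = bool(coalition & {"s1", "s3"})
    has_b = "s2" in coalition
    if has_a and has_b:
        return 100.0
    if has_a or has_b:
        return 40.0
    return 0.0


shares = shapley(["s1", "s2", "s3"], mashup_value)
```

The shares add up to the full price of 100, the interchangeable sellers `s1` and `s3` receive equal shares, and `s2`, whose data cannot be substituted, receives the largest one.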
Query Pricing. There is a long and principled line of work coming from the database community around the problem of how to price queries [chenMLQ19, revMax19, qbdp15, ppd14]. In this setting, a dataset has a set price. The problem is how to price relational queries on that dataset in such a way that arbitrage opportunities (obtaining the same data through a different and cheaper combination of queries) are not possible. This line of work is concerned with (P3). Recent work in this line [revMax19] also considers how to maximize revenue for the broker under the same pricing model as above. If all datasets of a market are thought of as views over a single relation, then the setting of this work resembles ours. However, many data integration tasks require arbitrary data transformations, and many buyers want to buy fused datasets that contain diverging opinions, for example. This line of work is complementary to our vision and we plan to include these ideas as part of our design.
Value of Data. Some work has focused on the explicit question of how to value data. For example, in [bigdatfirm18, predwbig13] the authors consider the impact of intrinsic properties of data (e.g., sparsity, number of features) on a given fixed prediction task. This line of work is complementary to ours and is interesting as a way of understanding the impact of intrinsic properties. It cannot replace our extrinsic way of pricing data, because the same dataset could be used for many tasks, including tasks other than prediction. It can, however, inform how buyers may perceive different intrinsic properties and help communicate those needs to sellers.
Privacy-Value Connection. This line of work makes a connection between data value and privacy [chenMLQ19, revMax19, ppd14]. The buyer can specify the privacy level associated with a query: the weaker the privacy guarantee, the less the dataset is perturbed, meaning the dataset will be of higher quality. Therefore, the weaker the privacy guarantee, the higher the price of the dataset.
A key defining feature of our vision is that we make the explicit link between market design and software platform (DMMS), hence providing an end-to-end market environment. End-to-end means we must consider rules that anticipate the behavior of all players, as opposed to rules that apply to only narrow situations—how to perform revenue allocation once the price has been set, how to price features when the task is known to be an ML classification task, etc.
8.2 DMMS Related Work
The DMMS platform presents many new challenges. One of the more challenging and ambitious components is the Mashup Builder. This module directly builds upon the rich work in the theory, algorithms, and systems for data sharing, discovery, and integration. We explain the relationships of some of the relevant work in each category here.
Data Sharing Platforms. The datahub system [datahub] introduced a data version control system implemented on a software platform that allows members of a team to collaborate. OrpheusDB [orpheus] similarly offers teams the ability to collaborate over a relational system and capture how data evolves. In addition to these systems in the database community, many approaches in the library and information science community also deal with issues of data sharing. Examples include systems such as TIND [tind] and KOHA [koha], as well as online repositories such as the Harvard Dataverse [dataverse] or the ICPSR [icpsr] at the University of Michigan, which are geared towards sharing data across the social sciences.
Data Discovery. Data discovery systems such as Infogather [infogather], Google Goods [goods], and Dataset search [datasetsearch] define a specific task and focus on how to build indexes to solve that task. There is also a line of work on data catalogs, with Amundsen [amundsen], WhereHows [wherehows], and Databook [databook] as open-source examples and Alation [alation], Azure's data catalog [azuredatacatalog], and Informatica's data catalog [informaticadatacatalogue] as some commercial examples. A more general approach to data discovery is Aurum [aurum], which provides most of the functionality required to implement the systems above.
Data Integration. Relevant work in data integration for the mashup builder is query reverse engineering [reversequery1, reversequery2], as well as query-by-example and spreadsheet-style interfaces to data integration such as S4 [s4]. Modern data integration systems such as Civilizer [civilizer] and BigGorilla [biggorilla] assume the existence and participation of a human expert that needs to build DAGs of integration operators during the integration activity. We borrowed the term data mashup from the Yahoo Pipes system [yahoopipes]. Related to creating mashups given many different datasets, some work [lessismore12] has studied the diminishing returns of integrating datasets.
Data Fusion and Truth Discovery. Data fusion refers to the ability to combine multiple sources of information to improve the quality of the end result. In the context of our vision, we consider data fusion operators that permit combining multiple (possibly diverging) datasets and offer the result to users. This can be useful, among other things, for truth discovery [truthdiscovery]: the process of identifying the real value for a specific variable. The database community has contributed results to these areas [srivbdi13, truthfinding1, truthdiscovery2, truthdiscovery3]. We are building on top of this existing work to inform the design of fusion operators that can be incorporated into the architecture we explained in this paper.
None of the work above has the goal of incentivizing users and buyers to solve the information-incentive problems that the markets we propose in this paper tackle. At the same time, all the work above is relevant to build the mashup builder, which is one piece of the larger class of DMMS systems we envision.
In this paper, we presented a vision for data market platforms that focus on the problems of data sharing, discovery, and integration. These problems are the main hurdle many organizations face today when trying to extract value from data; our vision therefore has the potential to democratize data.
While data and artificial intelligence are driving many changes to our economic, social, political, financial, and legal systems, we know surprisingly little about their foundations and governing dynamics. Furthermore, to an extent unseen in previous economic upheavals, the rapid pace of technological and social innovation is straining the ability of policy and economic practice to keep up. Moreover, while the recombination and integration of diverse data creates vast new value, we currently have neither theory for how data can be combined nor industrial policy for how to protect against the personal exposures and abuses that grow in proportion. We remain stuck with old models for understanding these new phenomena and antiquated heuristics for making decisions in the face of change. We think that the data markets we propose are a vehicle to initiate the study of theory, policy, and mechanism design to address this challenge.
We expect that the insights, algorithms, and systems that will be produced as a consequence of this research will inform the design of future data market platforms. We expect that the different systems, simulators, and approaches proposed here will pose interesting new lines of research for the database community.