Whenever data and models are shared, transformation ensues. Breaking down data silos unleashes value that makes companies more competitive. Pooling knowledge, such as when hospitals form coalitions, accelerates discovery. Entire disciplines change when researchers share benchmarks and models (medicalimaging; nlu). However, three barriers prevent effective sharing: the difficulty of accessing sensitive data, data discovery and integration, and data governance and compliance. Each barrier has both technical and human components.
Much prior work has tackled each barrier individually. However, individual solutions are often in conflict. For example, it is harder to discover relevant datasets when access is restricted, and to govern data when underlying datasets are not well integrated. We need a comprehensive solution that addresses all three barriers together.
Discovery and Integration. Data lakes (lake1; commons) ease data access by collecting unrestricted datasets in a central repository where they may be accessed and downloaded by analysts. However, large volumes of data mean analysts spend more time in finding (discovery) and combining (integration) datasets than in their analysis (eighty).
Access to Sensitive Data. Organizations are wary of sharing data because they fear information leakage (foster2018research). Simple anonymization techniques do not suffice (latanya; netflix). These disincentives block data sharing and stymie innovation.
Data Governance and Compliance. Analysts routinely download datasets from databases to produce machine learning (ML) models, reports, and other derived data products. The consequence is a governance nightmare for those who want to control access to sensitive information, need to comply with regulations such as GDPR (gdpr) and CCPA (ccpa), or want to ensure ethical use of data.
To tackle these challenges, a radically new data architecture is needed, one that addresses both the technical and the human problems. Such an architecture must change how people access and use data.
Enter the Data Station. In the Data Station architecture, both data and derived data products (such as ML models, query results, and reports) are sealed and cannot be directly seen, accessed, or downloaded by anyone. The key idea is that instead of delivering data to users, users bring questions to data. For example, instead of downloading a dataset to train an ML model, a user may tell the Data Station what model they need, and the Station identifies a suitable data + model combination, trains the model on the data, and makes the trained model available for inference. This inversion of compute and data mitigates many security risks of sharing sensitive data.
Centralizing data and computation permits fine-grained yet scalable data access: users see results of their tasks only after they have been given permission. In this model, data lifecycles and provenance are known, which permits straightforward implementation of data governance policies. For example, it is possible to prohibit the use of non-interpretable ML models; to control the attributes included in training data to avoid propagating biased and unfair models; and to limit the data used for deriving data products to avoid leaking sensitive data. In general, it is possible to control what and how derived data products are produced and used.
Centralizing data and computation has another benefit: the Station sees all datasets, all models, and all compute requests. This information lays the foundation for the design of data markets (dmms). Data markets incentivize humans to share data and concentrate their effort where it matters most: assisting with discovery and integration tasks. Market forces can be used to recruit humans to clean datasets, to indicate how to join datasets, or to annotate datasets with tags and other documentation.
Data Stations differ from Data Federation Architectures. Data federation architectures allow access to disparate sources through common schemas (global-as-view (discoveryambiguity) or others (garlic1; garlic2)). Each organization controls its own data locally, and must arbitrate query execution and release of results. Modern federated systems use statistical database privacy techniques to control the release of results (jenniesmc; federated1; federated2). Data Stations explore a different point of the design space. By inverting data and compute, they escape the need for a common schema, avoiding the agreement problem of data integration and opening up possibilities for computation beyond relational queries. By sealing all data and derived data products, they maintain the same level of security, but facilitate the enforcement of fine-grained access and governance policies, and enable the implementation of market mechanisms. Data Stations mitigate risks, but they also introduce four new challenges:
C1. Data-Compute Inversion. Data Stations users must submit computational requests (e.g., queries, model training/inference, data preparation) without seeing the data that their requests will engage. Methods are needed to allow users to determine if a dataset is suitable for their needs or if they can trust derived data products in the absence of crucial metadata (e.g., creator and provenance).
C2. Data Discovery and Integration. Upon receiving a computational request, the Station must determine what datasets are needed to perform the task (discovery) and how to prepare and combine those datasets to enable the computation (integration).
C3. Unbounded computation. Unlike traditional data processing platforms, in which computation is specified over concrete datasets, in the Data Station model one does not know the appropriate dataset a priori. Thus, the Data Station architecture introduces new resource management problems that modern schedulers are not designed to solve.
C4. Data governance and access. Centralization creates opportunities for precise data governance; exploiting those opportunities requires efficient and secure fine-grained data access management. Interfaces are needed for declaring access and governance protocols, and an engine is needed to control and audit enforcement.
To address these challenges Stations introduce the concept of a data-unaware task capsule to help users declare a computation without seeing its data; a discovery and integration platform that finds and combines datasets to satisfy a capsule’s request for computation; a new scheduler and compute substrate to map computations to compute resources; and support for the definition and enforcement of governance and access policies—all powered by a data asset catalog. Solving the four challenges automatically remains largely impossible in the general case without human guidance. To engage humans where they are most needed, we design and manage incentives by using market forces, making the Data Station an implementation of a data market management system (dmms).
2. Next Platform Requirements
We illustrate data problems within and across organizations and define the requirements that we use to motivate the architecture of Data Stations in the next section.
2.1. Data Problems within Organizations
In 2019, 20% of managers from top companies claimed that they planned to deploy ML technology, and 27% that they had already implemented such technologies. These numbers dropped to 4% and 18% in 2020 (pwcsurvey). There are two main reasons for this trend. First, finding ML experts to make use of available data is hard. Second, data in these companies are spread across heterogeneous repositories, such as databases, warehouses, spreadsheets, and lakes, and are managed by different teams, departments, and divisions. Managers, analysts, and stewards struggle to identify and prepare the data required by downstream ML tasks when those data are stored in silos. REQ1. Management of data lifecycles is necessary to discover relevant data. REQ2. Easy integration of data assets is needed by data consumers to save time.
Accessing data requires engaging with IT department administrators who enforce access controls and are in charge of ensuring that approved requests comply with regulations. Organizations may also want to implement other policies: for example, preventing analysts from using data columns that represent a protected class as input to ML models, or from using ML models that are not interpretable (under some well-specified definition of interpretability) and that would produce data whose origin cannot be easily explained. These governance needs cannot easily be met without a platform that REQ3. Allows users to declare and enforce governance policies.
2.2. Sharing Data Across Organizations
While data sharing among organizations who own complementary data has the promise of producing combinatorial value, it is often prevented by the fear of leaking sensitive or confidential information. We illustrate this opportunity with two use cases.
Data-Driven Physiology. Consider the task of linking ECG waveform patterns to sudden cardiac death, which kills 300,000 Americans every year. Health researchers have identified a number of clinical risk factors (heart failure, family history, etc.)—yet the vast majority of deaths occur in those without any of these conditions. Machine learning could be used to identify waveform signatures indicating elevated risk that could then be used to target preventive interventions. Similar exercises could yield insights into many other conditions. The few studies that have used small, proprietary datasets have given reason for optimism. Unfortunately, health data is stuck in organizations that are wary of sharing it for fear of leaking sensitive information. Researchers are forced to build personal relationships with data providers, agree on formats and integration strategy, and negotiate one-off data-sharing agreements, hence slowing down innovation.
Accelerating Materials Design and Discovery. The global advanced materials market is forecast to reach $2T by 2024 (materialsmarket). An important component for innovation is data. Materials science databases contain large volumes of data that introduce challenges related to discovery, integration, and sharing of potentially company-sensitive information. Examples of such datasets may include curated materials properties extracted from literature, corpora of experimental or simulated materials properties, and results from multi-fidelity simulations from the atomic to macroscopic (e.g., density functional theory, molecular dynamics, finite element method). For example, the Materials Genome Initiative (white2012materials; blaiszik2019data) has fueled innovations on microelectronics, aerospace, automotive, defense, energy, and health sectors. These data may be difficult for any one team or company to collect, and may require large expenditures of effort in experiment, simulation and curation.
REQ4. Pooling data across organizations securely is crucial for researchers in medicine, materials science, and others, to bootstrap their data-driven discovery and modeling efforts, reduce experimental and computational costs, and spur new innovations. But having a technical solution to sharing data is not sufficient. Data participants must be incentivized to share data in a way that eases its utility to others. Data Stations REQ5. Implement market mechanisms to manage incentives, so they concentrate resources and time where it matters most.
3. Centralizing Data and Models
We present the Data Station architecture in Section 3.1 and describe in Sections 3.2–3.5 its major components. We conclude in Section 3.6 with a summary of how the architecture addresses the challenges (C1-C4) and requirements (REQ1-REQ5).
3.1. The Data Station Architecture
We differentiate between data contributors, who deliver data to the Data Station, and data users, who use these data to solve problems. We talk about original data or datasets to refer to content that contributors deliver to the platform, and derived data product to refer to datasets, models, visualizations, reports, or any other result obtained by processing an original dataset. Data contributors use an interface to deliver data securely to the Station, much as they interact with data lakes today. Once data enters the Station, they are sealed and nobody can access them, or any data product derived from those data, directly. We next explain how the Station is used from the perspectives of first a data user and then a data contributor.
3.1.1. Data User Perspective
A Data Station does not deliver data to users; instead, users bring their computations to the data. They do so by creating data-unaware task capsules. A capsule encapsulates a declaration of some computation to be performed, as well as criteria to verify that the result is valid without looking at the data first. A capsule is said to be data-unaware because users have no access to any data when they create a capsule. As illustrated in the top left of Figure 1, a task capsule definition has three components:
Task specification. A task specification consists of a task type that selects a computational task from an extensible finite set, e.g., classification, and a task payload that includes type-dependent information. The example task capsule indicates there are two classes and specifies a path to test data.
Degree of satisfaction (DOS). This metric depends on the task type and is used to determine which results are valid for users, e.g., requiring an ML model's accuracy to reach a stated threshold, as in the example.
Trust constraints. To trust the results, users want to know what datasets contributed to the result and when, by whom, and how each dataset was created. Lacking access to data, users cannot verify these criteria directly. Instead, they include these requirements (see the example) in the form of constraints that are checked by the Station before delivering results.
Task Capsule Types. Data Stations can support other task types besides classification, such as Query-by-Example (qbe) interfaces for analytical queries, ML tasks such as regression and anomaly detection via autoML (automl), and search.
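The three capsule components described above can be sketched as plain data structures. This is a minimal illustration rather than the Station's actual interface: the class names, the set of task types, the 0.9 accuracy threshold, the test-data path, and the constraint keys are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical set of task types; the text only requires that the set be
# finite and extensible.
SUPPORTED_TASK_TYPES = {"classification", "regression", "anomaly_detection", "qbe", "search"}


@dataclass(frozen=True)
class TaskSpec:
    task_type: str            # selects one of the supported task types
    payload: Dict[str, Any]   # type-dependent information

    def __post_init__(self):
        if self.task_type not in SUPPORTED_TASK_TYPES:
            raise ValueError(f"unsupported task type: {self.task_type}")


@dataclass(frozen=True)
class DegreeOfSatisfaction:
    metric: str        # depends on the task type, e.g. "accuracy"
    threshold: float   # illustrative value chosen by the user

    def is_satisfied(self, observed: float) -> bool:
        return observed >= self.threshold


@dataclass(frozen=True)
class TaskCapsule:
    spec: TaskSpec
    dos: DegreeOfSatisfaction
    # Requirements the Station checks on the user's behalf before delivery.
    trust_constraints: Dict[str, Any] = field(default_factory=dict)


# A capsule is written without access to any data in the Station.
capsule = TaskCapsule(
    spec=TaskSpec("classification", {"num_classes": 2, "test_data": "path/to/test.csv"}),
    dos=DegreeOfSatisfaction("accuracy", 0.9),
    trust_constraints={"created_after": "2019-01-01", "creators": ["hospital_a"]},
)
```

Keeping the capsule immutable reflects that it is a declaration handed to the Station, which then decides how to satisfy it.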
3.1.2. Data Contributor Perspective
When a data contributor uploads a dataset to the Station they include a signature—based on public key cryptography—that identifies them as owning and being responsible for the dataset. By default, only a dataset’s owner(s) is granted access; the dataset remains otherwise invisible to all other users. Any further access to the dataset, or to any dataset derived from the dataset, must be mediated. We explain the protocol later in the section and focus now on the policy.
To make a dataset accessible to others, owners declare an access policy (see example in Fig. 1, top-right) that includes a minimum of three properties: discoverability, access, and derivation. Discoverability indicates whether the discovery module can include the dataset in responses to searches. Access indicates whether the dataset is closed to everyone (the default), open to everyone, or brokered; the latter case indicates that explicit permission must be given before the dataset can be accessed. Finally, derivation (not shown in the example) indicates whether the dataset can be combined with others or offered as-is. More fine-grained controls are also supported. For example, an access policy may give access to a relational dataset only by tasks of type analytical (queries with joins, group by, aggregations), only through a differential privacy (differentialprivacy) filter, and with the number of accesses constrained to control the privacy budget, ε. (The parameter ε controls the privacy in differential privacy.) Such access policies permit contributors to bound the access and usage of datasets without engaging in complex data sharing agreements.
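As a rough sketch of how such a policy might be represented and checked, the following code models the three base properties plus the two fine-grained controls from the example. The class and field names are hypothetical, and the budget accounting simply adds up the privacy parameter (conventionally written epsilon) spent by each request, which is an assumption made here for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Access(Enum):
    CLOSED = "closed"      # the default: nobody but the owner
    OPEN = "open"
    BROKERED = "brokered"  # explicit permission required


@dataclass
class AccessPolicy:
    discoverable: bool = False
    access: Access = Access.CLOSED
    derivable: bool = False                          # may be combined with other datasets
    allowed_task_types: Optional[frozenset] = None   # None means any task type
    epsilon_budget: Optional[float] = None           # None means unmetered
    epsilon_spent: float = 0.0

    def authorize(self, task_type: str, epsilon: float = 0.0,
                  has_permission: bool = False) -> bool:
        """Decide one request; charge the privacy budget only if it is granted."""
        if self.access is Access.CLOSED:
            return False
        if self.access is Access.BROKERED and not has_permission:
            return False
        if self.allowed_task_types is not None and task_type not in self.allowed_task_types:
            return False
        if self.epsilon_budget is not None:
            if self.epsilon_spent + epsilon > self.epsilon_budget:
                return False
            self.epsilon_spent += epsilon
        return True


policy = AccessPolicy(
    discoverable=True,
    access=Access.OPEN,
    allowed_task_types=frozenset({"analytical"}),
    epsilon_budget=1.0,
)
```

With this policy, two analytical requests that each spend 0.5 are granted and a third is refused because the budget is exhausted.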
Bulk uploads. Data contributors can upload entire data systems to the Station at once—such as when unlocking silos—and include a default access policy that applies to every dataset. The Station provides tools and APIs to update datasets previously submitted.
Encrypted datasets. Data contributors can upload encrypted datasets to comply with certain regulations, as long as those datasets are accompanied by non-encrypted metadata—that is, the metadata that would normally be extracted by the metadata engine and from humans via incentives.
3.2. Station Workflow
The Data Station component diagram is shown at the bottom of Fig. 1. Upon receiving a task capsule, the Data Station uses (Step 1) a discovery platform to identify datasets that are potentially relevant to the task, and then (Step 2) an integration platform to combine datasets so they are valid inputs to the capsule. It then allocates compute (Step 3) to evaluate the task on those datasets, checking whether any of them satisfy the DOS metric, e.g., the accuracy for an ML model. When a solution is found, the Station (in Step 4) interacts with the data user to mediate access to the results. Because solving this problem automatically is not always possible, Stations incentivize humans to participate when and where they are most needed. To achieve this goal, Stations use market mechanisms, which we explain in Section 4.
Step 1: Data Discovery. The goal of data discovery is to identify datasets that are relevant to a task capsule among thousands of diverse heterogeneous datasets. A dataset is relevant if it helps solve the task and it satisfies the capsule’s constraints. The discovery module is based on Aurum (aurum), and it uses a data asset catalog and discovery indexes. The catalog maintains the lifecycle of each dataset in the Station in the form of profiles, which are descriptions of the data, such as statistical distribution of values, sketches, but also temporal information and others. Profiles are automatically computed by the Metadata Engine when datasets are submitted to the Station, or elicited from humans using incentives. Discovery indexes are built from the catalog to ease dataset search and include similarity indexes to find complementary data, full-text search indexes to match keywords, linkage graphs to identify potential join paths—in the case of relational data, and many others. Any information in capsules useful to search data is used to query the discovery indexes and catalog. For example, the trust constraints are verified against the catalog to quickly prune potential results. Then, depending on the task type, test data is used to find training data (similar data) for building models, or, when the task type is QBE, attributes and data samples are used to steer the search.
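The two-stage pattern in this step (prune with trust constraints, then search by similarity) can be illustrated with a toy catalog. This sketch is not Aurum: the profile fields, the constraint keys, and the use of Jaccard similarity over column names are assumptions chosen to keep the example small.

```python
from datetime import date


def satisfies_trust(profile, constraints):
    """Check one catalog profile against a capsule's trust constraints."""
    allowed = constraints.get("creators")
    if allowed is not None and profile["creator"] not in allowed:
        return False
    cutoff = constraints.get("created_after")
    if cutoff is not None and profile["created"] <= cutoff:
        return False
    return True


def jaccard(a, b):
    """Set overlap between two collections of column names."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def discover(catalog, constraints, query_columns, k=3):
    """Prune by trust constraints first, then rank the survivors by similarity."""
    candidates = [p for p in catalog if satisfies_trust(p, constraints)]
    ranked = sorted(candidates,
                    key=lambda p: jaccard(p["columns"], query_columns),
                    reverse=True)
    return [p["name"] for p in ranked[:k]]


# Hypothetical catalog profiles.
catalog = [
    {"name": "ecg_2020", "creator": "hospital_a", "created": date(2020, 5, 1),
     "columns": ["patient_id", "lead_ii", "label"]},
    {"name": "ecg_2015", "creator": "hospital_a", "created": date(2015, 1, 1),
     "columns": ["patient_id", "lead_ii", "label"]},
    {"name": "billing", "creator": "hospital_b", "created": date(2021, 1, 1),
     "columns": ["patient_id", "amount"]},
]
constraints = {"creators": {"hospital_a", "hospital_b"}, "created_after": date(2019, 1, 1)}
results = discover(catalog, constraints, ["lead_ii", "label"])
```

Here `ecg_2015` is pruned by the date constraint before any similarity is computed, and `ecg_2020` ranks above `billing` because it shares columns with the test data.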
Step 2: Data Integration and Blending. The goal is to transform a list of input datasets (the output of discovery) into a desired output dataset (the input of a task capsule). Blending uses techniques from program synthesis (programsynthesis), ML, and others, to identify what preparation and integration steps are necessary to derive the output from the input. Bounding the number of task types supported is needed to design blending engines tailored to analytical queries, ML tasks that require training data, etc. Some of the techniques used by this module include identifying mapping and transformation functions to join attributes (in the case of relational data) as well as normalization and standardization tasks, such as value interpolation to join on different time and space granularities.
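To make the last technique concrete, the sketch below joins a fine-grained time series with a coarser one by linearly interpolating the coarse values. The function names and the choice of linear interpolation are illustrative assumptions; a blending engine could use other interpolation schemes.

```python
from bisect import bisect_left


def interpolate(series, t):
    """Linearly interpolate a time-sorted list of (time, value) pairs at time t."""
    times = [ts for ts, _ in series]
    i = bisect_left(times, t)
    if i < len(times) and times[i] == t:
        return series[i][1]            # exact match, no interpolation needed
    if i == 0 or i == len(times):
        return None                    # outside the observed range: do not extrapolate
    (t0, v0), (t1, v1) = series[i - 1], series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def join_on_time(fine, coarse):
    """Attach an interpolated coarse value to every fine-grained row."""
    return [(t, v, interpolate(coarse, t)) for t, v in fine]


# Times are in minutes: one series is hourly, the other half-hourly.
coarse = [(0, 10.0), (60, 20.0)]
fine = [(0, 1.0), (30, 2.0), (60, 3.0), (90, 4.0)]
joined = join_on_time(fine, coarse)
```

Rows that fall outside the coarse series are kept with a missing value rather than extrapolated, which leaves the decision about how to handle them to a later step.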
In the next steps, Stations search for a pair of (task capsule, blending output dataset) that satisfies the capsule’s DOS (Step 3). Access to task results and derived data products is mediated in Step 4. We explain both steps in Sections 3.4 and 3.5. We start by describing important components of the Station.
3.3. Data Asset Catalog
The data asset catalog maintains the lifecycle and metadata (how data came to be, how it changed, how it has been used) of each dataset and derived dataset hosted in the Station. The catalog is used by the discovery and integration engine, by data users to describe a capsule’s trust constraints, and by contributors and stewards to implement access and governance policies.
To be interpretable by data users, data contributors, and the Station itself, the catalog implements a common mental model consisting of profiles and relationships. Profiles include: what-profile to describe an actual dataset, how-profile to indicate what program produced the current dataset version, who-profile to indicate who produced and who uses the dataset, where-profile to indicate how the dataset can be accessed, why-profile to explain the purpose of the dataset, and when-profile to explain when the dataset was modified and when it is valid. Relationships are built out of profiles: for example, provenance is built from who- and how-profiles, and syntactic relationships such as join and similarity graphs are built from what-profiles. Both profiles and relationships are used by the discovery and integration modules.
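One possible encoding of this mental model keeps the six profiles as fields of a catalog entry and derives provenance by walking how- and who-profiles back to the original datasets. The field layout, the `program`/`inputs`/`producer` keys, and the dataset names are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CatalogEntry:
    what: Dict                    # description of the dataset itself
    how: Optional[Dict] = None    # program and inputs that produced this version
    who: Optional[Dict] = None    # who produced and who uses the dataset
    where: Optional[str] = None   # how the dataset can be accessed
    why: Optional[str] = None     # purpose of the dataset
    when: Optional[Dict] = None   # modification and validity times


def provenance(catalog, name):
    """Build a provenance relationship from how- and who-profiles.

    Returns (dataset, program, producer) steps, starting at `name` and
    walking upstream until original datasets are reached.
    """
    steps, stack, seen = [], [name], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        entry = catalog[current]
        if entry.how is None:     # original dataset: nothing upstream
            continue
        producer = (entry.who or {}).get("producer")
        steps.append((current, entry.how["program"], producer))
        stack.extend(entry.how["inputs"])
    return steps


catalog = {
    "raw_ecg": CatalogEntry(what={"format": "csv"}, who={"producer": "hospital_a"}),
    "clean_ecg": CatalogEntry(what={"format": "csv"},
                              how={"program": "dedupe.py", "inputs": ["raw_ecg"]},
                              who={"producer": "station"}),
    "risk_model": CatalogEntry(what={"format": "model"},
                               how={"program": "train.py", "inputs": ["clean_ecg"]},
                               who={"producer": "station"}),
}
lineage = provenance(catalog, "risk_model")
```

Because derived data products are created inside the Station, every entry with a how-profile has inputs that are themselves catalog entries, which is what makes the walk terminate at original datasets.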
The catalog’s logical schema design strives for a balance between structuredness, which facilitates querying, and flexibility, which facilitates including new data. At its core, it reflects the mental model introduced above, which allows different parties to understand and query it effectively. To increase flexibility, it supports semi-structured data, such as JSON, to reflect the idiosyncrasies of different data formats: e.g., describing an image is different from describing a relation or an ML model.
Populating the Catalog. The Data Station triggers the execution of a metadata engine whenever a new dataset is received, an existing dataset is updated, or the Station produces a derived data product from existing ones. The metadata engine analyzes the dataset and extracts as many profiles as it can automatically. This is done via the orchestration of analyzers that specialize in different profiles, but also by eliciting this information from data users and contributors directly when it cannot be obtained otherwise (see Section 4). As a consequence, the full lifecycle of resident datasets and derived data products is known, because all operations on those datasets happen within the realm of the Station.
Catalog Service. A catalog service facilitates loading and querying, and stores and enforces access and governance policies. The service maintains schemas of the semi-structured data as well as indexes that permit the discovery and blending engines to find the information they need, and data users to specify trust constraints to guarantee the origin and nature of the results they request.
3.4. Scalable Governance and Access
We consider two types of data policies: data governance policies control what derived data products are produced in the Station while data access policies indicate who can access what data.
Data governance policies. These specifications limit and control the use of datasets and data tasks. For example, one may want to prohibit production of derived data from datasets that contain personally identifiable information (PII). If PII is defined specifically enough, for example, by providing a table with 116 attributes that correspond to PII data, then the metadata engine can tag datasets that contain such information by matching the definition with the existing data—represented in what-profiles.
All governance policies are registered with the Catalog service, which is responsible for their enforcement. Only datasets and derived data products that pass the constraints are returned as a result. Data governance policies not only apply to data. They also govern what task capsule implementations are permitted, for example what kind of ML models are used. This permits prohibiting the use of certain models that are not sufficiently interpretable, or that are susceptible to biases otherwise.
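A governance check of this kind can be sketched as two small functions: one that tags PII by matching a what-profile against a definition table, and one that decides whether a derivation is permitted. The four-attribute table, the name normalization, and the banned implementation name are stand-ins invented for this example.

```python
# Hypothetical stand-in for a table that defines PII attributes.
PII_ATTRIBUTES = {"name", "ssn", "email", "date_of_birth"}


def normalize(attribute):
    """Make attribute names comparable across datasets."""
    return attribute.strip().lower().replace(" ", "_")


def tag_pii(what_profile):
    """Return the PII attributes present in a what-profile's column list."""
    return {normalize(c) for c in what_profile["columns"]} & PII_ATTRIBUTES


def may_derive(what_profiles, task_implementation,
               banned_implementations=frozenset({"black_box_dnn"})):
    """Apply governance to both the input data and the task implementation."""
    if task_implementation in banned_implementations:
        return False
    return not any(tag_pii(p) for p in what_profiles)


safe = {"columns": ["lead_ii", "label"]}
sensitive = {"columns": ["Name", "Lead II", "label"]}
```

Matching on attribute names is only as good as the definition table; tagging based on column contents would need additional analyzers.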
Finally, because all data lifecycles are known to the Station via the catalog, it is possible to specify and enforce governance policies that apply actively to existing data, e.g., removing datasets and derived data products subject to the ‘right to be forgotten’ (rtbf). This capability facilitates complying with regulations such as GDPR and CCPA within the realm of the Station.
Data access protocols. Sealing all data and derived data products mitigates many of the problems associated with sharing sensitive information, but it requires implementation of a solution that will allow authorized users to access the results of their computations. The Data Station architecture is amenable to capability-based mechanisms (capabilitybased1; capabilitybased2) that give access to results as long as the computation includes an adequate access token. Access tokens can be requested from the platform or directly from the data contributors. The choice of mechanism can be left up to the preference of the data contributor and based on different access policies.
Access tokens can encapsulate richer information than merely a boolean value that indicates if access is granted. For example, they can grant one-time access or alternatively provide an expiry date, which can be infinite when access is granted permanently. Once in possession of a token, users can seamlessly use the Station to work on those data; they need only request a new access token when accessing new and protected datasets. The burden of creating and managing tokens is on the Station, which understands what datasets have contributed to the results being requested and hence can orchestrate the actions needed to grant, manage, and revoke access as required.
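The token behaviors mentioned here (one-time use, expiry, permanent access, revocation) can be sketched with an in-memory registry. The class and method names are hypothetical, and a real deployment would need signed tokens and durable storage, which this sketch leaves out.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessToken:
    value: str
    result_id: str
    expires_at: Optional[float]   # None means access never expires
    one_time: bool = False
    revoked: bool = False


class TokenRegistry:
    """The Station, not the user, creates and manages tokens."""

    def __init__(self):
        self._tokens = {}

    def grant(self, result_id, ttl_seconds=None, one_time=False, now=None):
        now = time.time() if now is None else now
        expires = None if ttl_seconds is None else now + ttl_seconds
        token = AccessToken(secrets.token_urlsafe(16), result_id, expires, one_time)
        self._tokens[token.value] = token
        return token.value

    def redeem(self, value, result_id, now=None):
        now = time.time() if now is None else now
        token = self._tokens.get(value)
        if token is None or token.revoked or token.result_id != result_id:
            return False
        if token.expires_at is not None and now >= token.expires_at:
            return False
        if token.one_time:
            token.revoked = True   # a one-time token is consumed on first use
        return True

    def revoke(self, value):
        if value in self._tokens:
            self._tokens[value].revoked = True
```

Passing `now` explicitly makes the expiry logic easy to exercise without waiting on the clock; callers would normally omit it.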
3.5. Dealing with Unbounded Computation
In a traditional data processing architecture, such as a database, a task consists of a query that expresses the computation to perform and the data to be read. Given these two pieces of information, it is often possible to estimate the computational resources needed to obtain the results. This is not true in Data Stations, where the goal is to identify datasets for which the defined task achieves the desired DOS metric. Thus, the amount of computation that may be required to perform a user request may be virtually unbounded, as it may be necessary to check all dataset combinations.
Data Stations rely on two main mechanisms to tackle this challenge. First, the discovery and integration platform enables rapid pruning of the space of compatible datasets. Second, the Data Station is implemented on a modern execution platform (e.g., cloud) with the scalability and performance required to solve many tasks. A result cache maintains the relationship between executed tasks and datasets accessed, with the goal of informing and guiding the matching of tasks with data in the future, which is done, in turn, via a scheduler in a speculative manner.
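The search problem can be illustrated with a bounded loop over dataset combinations that consults a result cache before spending compute. This is a simplified sketch, not the Station's scheduler: the evaluation budget, the size-ordered enumeration, and the `evaluate` callable are assumptions, and speculative scheduling is not modeled.

```python
from itertools import combinations


def search(candidates, evaluate, threshold, max_evaluations, max_size=2, cache=None):
    """Try dataset combinations until one meets the DOS threshold.

    `evaluate` is a hypothetical callable that maps a tuple of dataset names
    to a score. Returns (combination or None, number of new evaluations).
    """
    cache = {} if cache is None else cache
    evaluations = 0
    for size in range(1, max_size + 1):
        for combo in combinations(candidates, size):
            if combo not in cache:
                if evaluations >= max_evaluations:
                    return None, evaluations   # budget exhausted
                cache[combo] = evaluate(combo)
                evaluations += 1
            if cache[combo] >= threshold:
                return combo, evaluations
    return None, evaluations


# Made-up scores standing in for the accuracy obtained on each combination.
scores = {("a",): 0.6, ("b",): 0.7, ("c",): 0.5, ("a", "b"): 0.92}
evaluate = lambda combo: scores.get(combo, 0.0)

cache = {}
first = search(["a", "b", "c"], evaluate, 0.9, max_evaluations=10, cache=cache)
second = search(["a", "b", "c"], evaluate, 0.9, max_evaluations=10, cache=cache)
```

The second call finds the same combination without any new evaluations, which is the effect a result cache is meant to have; the candidate list itself is assumed to have been pruned by discovery beforehand.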
Finally, Data Stations are only logically centralized. Physically, computation may take place across multiple machines that may be dedicated to specific tasks, such as serving ML models that are otherwise not accessible to users. Each compute node is stateless and accesses data from a disaggregated data store; disaggregation of compute and storage facilitates scalability.
3.6. How Stations Address Requirements
The challenges (C1-C4) are addressed by the task capsules, the discovery and blending module, and the unbounded computation engine, as well as the catalog and its ability to manage and enforce access and governance policies.
By capturing and maintaining the lifecycle of each dataset and derived data product, the data asset catalog along with the discovery and blending engines satisfy REQ1, REQ2, and REQ3. The Station architecture, by sealing data and mediating access through fine-grained policies, helps with REQ4.
Nevertheless, Stations cannot solve all problems automatically. Semantic ambiguity in discovery and integration tasks (discoveryambiguity), for example, requires humans in the loop. A metadata engine keeps the catalog up to date, but certain profiles are impossible to create automatically and need human input, e.g., a why-profile that describes the reason for the existence of a dataset. Finally, even if data in the Station are technically secured, humans may have other disincentives to share the data, such as fear of leaking proprietary information. This section has dealt with the technical problem. In the next section, we discuss how Data Stations host data markets that help manage incentives and so tackle the human factor.
4. Market Forces and Incentives
During the course of processing a task capsule, the Station may run into situations where it requires human input to make progress: for example, when it needs to join two tables on an attribute address, but lacks the information to choose between two alternatives, work address and home address. Stations may also block because they cannot determine if an action is safe. For example, a capsule wants to train an ML model and the Station has identified a candidate training dataset with sample data, but it cannot tell how the sample was generated and the corresponding why-profile is incomplete.
Data Stations introduce incentive mechanisms to motivate data contributors and data users to treat data as a valuable asset and help solve data problems when the technical solution is insufficient. Stations coordinate human effort to curate, document, and prepare data when ambiguity in task capsules, catalog, schemas, or other data descriptions prevents progress: e.g., by incentivizing the creator of the ML training dataset in the example above to explain how the data sample was collected. This coordination of human and machine is achieved via data market designs and relies on two mechanisms: task generation and incentives.
Task generation. When Stations block on a task, they create human-readable task descriptions. These task descriptions must incorporate sufficient context so humans can effectively solve the problem.
Incentives. When assigning a task to humans, Stations must indicate the incentives humans will receive in exchange for solving the problem. Incentives can be currencies of different kinds, such as money, time tokens, or others. For example, the Station may create a task that requests filling in the why-profile of the example above in exchange for 30 minutes of leisure time.
Balancing incentives and utility. Each participant seeks to maximize its utility model. For example, the participant of the example above will take on the task of filling the why-profile if they perceive the gain, 30 minutes, to be more valuable to them than the effort needed to complete the profile. It is safe to assume that data users maximize their utility when Stations answer their task capsules fast. Stations must account for the utility gained by data buyers, the one gained by data contributors, and strike an equilibrium that maximizes the utility of the market as a whole.
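The acceptance decision in this example reduces to a simple comparison, sketched below. This is a toy utility model with made-up numbers, not a market design: it assumes the participant's valuation of leisure time and their effort cost are known, which is exactly what strategic participants may not reveal truthfully.

```python
def accepts(incentive_minutes, value_per_minute, effort_cost):
    """A participant takes a task only if the perceived gain exceeds the effort."""
    return incentive_minutes * value_per_minute > effort_cost


def cheapest_accepted_offer(offers, value_per_minute, effort_cost):
    """Smallest incentive, among the candidate offers, that the participant accepts."""
    for offer in sorted(offers):
        if accepts(offer, value_per_minute, effort_cost):
            return offer
    return None   # no offer is attractive enough


# Hypothetical participant: values each minute of leisure at 1 unit and
# estimates that completing the why-profile costs 20 units of effort.
chosen = cheapest_accepted_offer([10, 30, 45], value_per_minute=1.0, effort_cost=20.0)
```

Under these assumptions the 30-minute offer from the example is the cheapest one accepted.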
On designing data markets. Internal data markets such as that described above differ in their characteristics from those needed for a Station that serves a consortium of entities, such as a group of hospitals. An individual’s motivations and the Station’s goals are different, as are the levels of trust among entities. These differing qualities, in turn, call for different market designs. In order to design markets for scenarios where participants behave strategically to maximize their utility, we design truthful mechanisms (mechanismdesign) to align participants’ incentives with Station goals. Incentivizing those more familiar with the data allows Data Stations to solve task capsules while keeping all data sealed by default.
5. Discussion and Related Work
As analysts ask more varied questions, the schema-first approach of warehouses becomes a limiting factor. Data lakes (lake1; lake2; kudu; hudi; delta) are a partial answer to this problem. Lakes store the data first and push the burden of interpreting schemas to end users. By doing so, data lakes worsen the discovery and integration problem. Data Stations depart from traditional data architectures in several ways. They make the data asset catalog a crucial component of their architecture and make sure it is up to date at all times. The catalog, in turn, powers the discovery and integration engine, which solves these problems without the need for agreeing on a schema a priori, as is required in federated integration systems (garlic1; garlic2). Although automatically solving discovery and integration problems is difficult, Stations take advantage of their logical centralization of data and compute to implement market structures that incentivize humans to get in the loop and solve the hardest problems at their root, similar to what Anylog (anylog) does for distributed IoT scenarios.
An opportunity for data management. Data management problems remain as hard and relevant as ever, with human and technical factors that are uniquely shaped by the challenges of our time: ever increasing volumes of data that are hoarded by a few, and that are difficult to share for technical and legal reasons alike. New applications such as ML and statistical methods demand new query interfaces, and introduce new ethical problems due to their rapidly increasing impact in our lives. Increasing awareness of the role of data in society is leading to new regulations and laws. We believe it is time to rethink data architectures to tackle these modern challenges. Data Stations are a step towards this goal.