
Cloudy with high chance of DBMS: A 10-year prediction for Enterprise-Grade ML

by   Ashvin Agrawal, et al.

Machine learning (ML) has proven itself in high-value web applications such as search ranking and is emerging as a powerful tool in a much broader range of enterprise scenarios, including voice recognition and conversational understanding for customer support, autotuning for videoconferencing, intelligent feedback loops in large-scale sysops, manufacturing and autonomous vehicle management, and complex financial predictions, just to name a few. Meanwhile, as the value of data is increasingly recognized and monetized, concerns about securing valuable data and risks to individual privacy have been growing. Consequently, rigorous data management has emerged as a key requirement in enterprise settings. How will these trends (ML's growing popularity and stricter data governance) intersect? What are the unmet requirements for applying ML in enterprise settings? What are the technical challenges for the DB community to solve? In this paper, we present our vision of how ML and database systems are likely to come together, and the early steps we take towards making this vision a reality.



1 Introduction

Machine learning (ML) has proven itself in high-value consumer applications such as search ranking, recommender systems and spam detection [39, 19]. These applications are built and operated by large teams of experts, and run on massive dedicated infrastructures. (Footnote 1: ML.NET [17] alone took dozens of engineers over a decade.) The (exorbitant) human and hardware costs are well justified by multi-billion-dollar paydays. This approach is entirely impractical when it comes to the (ongoing) mainstream adoption of ML.

Enterprises in every industry are developing strategies for digitally transforming their business at every level. The core idea is to continuously monitor all aspects of the business, actively interpret the observations using advanced data analysis (including ML), and integrate the learnings into appropriate actions that improve business outcomes. We predict that in the next 10 years, hundreds of thousands of small teams will build millions of ML-infused applications, most just moderately remunerative but with huge collective value. (Footnote 2: From our analysis of >4M Python notebooks from public Github repositories, we conservatively estimate that 10% of the world’s developers will use ML in the next 10 years, totaling 20M engineering years.)

When it comes to leveraging ML in enterprise applications, especially in regulated environments, the level of scrutiny for data handling, model fairness, user privacy, and debuggability will be substantially higher than in the first wave of ML applications. Consider the healthcare domain: ML models may be trained on sensitive medical data, and make predictions that determine patient treatments—copying CSV files on a laptop and maximizing average model accuracy just doesn’t cut it! We refer to this new class of applications as Enterprise Grade Machine Learning (EGML).

In this paper, we speculate on how ML and database systems will evolve to support EGML over the next several years. Database management systems (DBMSs) are the repositories for high-value data that demands security, fine-grained access control, auditing, high availability, etc. Over the last 30 years, whenever a new data-related technology has gained sufficient adoption, DBMS vendors have inevitably sought to absorb the technology into their mainstream products. Examples include Object-Oriented [20], XML [29], and Big Data [15] technologies. Indeed, ML is no exception if we consider SQL Server Analysis Services [2], SQL Server R and Python integration [4], and BigQuery support for ML [5]. Is the future then that ML will be assimilated by the DBMS?

Figure 1: Flock reference architecture for a canonical data science lifecycle.

We believe that this is too simplistic, and understanding the path forward requires a more careful look at the various aspects of EGML, which we divide into three main categories: model development/training, model scoring and model management/governance.

Train in the Cloud. First, we are witnessing an ongoing revolution in frameworks for training an increasingly broad range of ML model classes. Their very foundations are still undergoing rapid development [24]. Often, these developments happen in conjunction with innovations in hardware. The rapidly expanding community of data scientists who train models is developing sophisticated environments for managing and supporting the iterative process of data exploration, feature engineering, model training, model selection, model deployment, etc., e.g., [37, 10]. These large, complex, evolving infrastructures are a good fit for managed cloud service infrastructure. Moreover, model training requires centralized data, is characterized by spiky resource usage, and benefits from access to the latest hardware. This leads us to believe that model training and development will happen in either private or public clouds.

Score in the DBMS. Second, while the models may be centrally trained, the resulting inference pipelines will be deployed everywhere: in the cloud, on-prem, and on edge devices, to make inferences (“scoring”) where the data is. This raises the question of whether inference on data stored in a DBMS can be done as an extension of the query runtime, without the need to exfiltrate the data. We strongly believe this can and must be supported. It appears likely that the most widely studied or promising families of models can be uniformly represented [37, 41], that given a particular model we can express how to score it on a given input using an appropriate algebra, and that we can compile these algebraic structures into highly optimized code for different execution environments and hardware [23]. Taken together, these observations suggest that we need to consider how to incorporate ML scoring as a foundational extension of relational algebra, and an integral part of SQL query optimizers and runtimes. We present a concrete proposal in § 4.1.

Governance everywhere. Third, we believe that all data, including deployed models—models are, in fact, best thought of as derived data—and the inferences made using them, will need to be robustly governed. The deployment of ML models and their use in decision making via inference leads to many significant challenges in governance. For example, regulations such as GDPR and concerns such as model bias and explainability motivate tracking provenance all the way from data used for training through to decisions based on scoring of trained models. In turn, this requires efficient support for versioning data. While the ML community is focused on improvements in algorithms and training infrastructures, we see massive need for the DB community to step up in the areas of secure data access, version management, and provenance tracking and governance—we discuss initial work in this area in § 4.2.

In summary, the future is likely cloudy with a high chance of DBMS, and governance throughout. We describe how our vision is shaped by customer conversations, data and market analysis, and our direct experience as well as present several open problems. We conclude by highlighting promising initial results from a few of the solutions that we are working on.

[With the blessing of the PC Chairs, we would like to experiment in making this a “live paper”, where the community’s opinions on our vision are captured and incorporated into our final manuscript after the conference presentation. We plan to prepare the community to have a productive discussion by disseminating a survey, and announcing after paper acceptance that the discussion at CIDR will become part of the paper. If this is not allowed, we might do something similar by means of a blog/arXiv submission, but we believe CIDR is the perfect venue to experiment with this format.]

2 The Flock vision

In this section, we present our vision for Flock, a reference architecture to support the canonical data science lifecycle for EGML applications. Flock is our vehicle to explore assumptions (§ 3), discover open problems and validate initial solutions (§ 4).

We start from a key observation: Machine Learning models are software artifacts derived from data. The resulting dual nature of software artifact and derived data provides us with a useful lens to understand the role of the DB community in the EGML revolution.

The lifecycle shown in Figure 1 begins with a (typically) offline phase, where a data scientist (and more and more frequently any software engineer) gathers data from multiple data sources, transforms them and models reality using learning algorithms. Today, this phase is very manual and sadly closer to a black art than an engineering discipline. Looking at ML as software, we expect the ML and Software Engineering communities to provide us with automation [32], tooling, and engineering best practices—ML will become an integral part of the DevOps lifecycle. Looking at ML models as derived data, the DB community must address data discovery, access control and data sharing, curation, validation, versioning and provenance (§ 4.2). Moreover, today’s prevalent abstraction for data science is imperative Python code orchestrating data-intensive processing steps, each performed within a native library. This suggests that end-to-end ML pipelines can be approached as inherently optimizable dataflows (§ 4.1).

The second stage in the lifecycle is entered when a model is selected and is ready to be deployed. Using the models-as-software lens, deployment consists of packaging the entire inference pipeline (model + all data preprocessing steps) in a way that preserves the exact behavior crafted by the data scientist in the training environment, and finding a suitable hosting infrastructure for scoring the model. Today’s best practice is to package models in costly containers and hope that enough of the environment is preserved to ensure correctness. (Footnote 3: This is optimistic, e.g., is floating point precision guaranteed when running a container across Linux/Windows, x64/ARM?) Recall that in EGML settings individual decisions could be very consequential (e.g., loan acceptance, or choice of medical treatment), so “average model accuracy” is not a sufficient validation metric. Switching to our models-are-data lens, we observe that they must be subject to GDPR-style scrutiny, and their storage and querying/scoring must be secured and auditably tracked. Also, privacy and fairness implications must be handled carefully. Moreover, as the underlying data evolves, models need to be updated. To retain consistency for complex applications, multiple models might have to be updated transactionally. DBMSs have long provided these types of enterprise features for operational data, and we propose to extend them to support model scoring. While this was our primary motivation, our early experiments suggest that in-DB model scoring actually allows us to deliver speedups over standalone state-of-the-art solutions!

Model predictions usually come in the form of single numbers or vectors of numbers (e.g., the probability of each class in a classification problem). To act on a prediction, it must be transformed to domain terms (e.g., the name of the winning class). But actions are typically more nuanced and involve policies that encode business constraints and might actually override a model’s prediction under certain circumstances. Systematizing this policy space is important, as we have discussed in 


Throughout the entire lifecycle, management and governance for data and models are vital. Access to a deployed model must be controlled, similar to how access to data or a view is controlled in a DBMS. Provenance here plays a key role and has two distinct applications: looking at models as software artifacts, we must be able to verify them or debug them, even as they evolve due to re-training; from the model-as-data viewpoint, we must be able to determine how a model was derived and from which snapshot of (training) data, in order to interpret the predictions and answer questions such as whether they were biased. This leads to the need for pervasive and automated tracking of provenance from training through deployment to scoring (§ 4.2).

Given this context, we argue that: 1) the ML development/training will happen in the Cloud; 2) Models must be stored and scored in managed environments such as a DBMS; and 3) Provenance needs to be collected across all phases.

3 The vantage point

Our perspective on what Enterprise-grade ML (EGML) will look like in 10 years is shaped by multiple inputs.

First-hand experience. Collectively, the authors of this paper have extensive experience in using ML technologies in production settings, e.g., content recommenders [19], spam filters [39], big data learning optimizers [44, 47, 33, 46], ML-based performance debuggers, Azure cloud optimizations based on customer load predictions, self-tuning streaming systems, and auto-tuning infrastructures for SQL Server internals. Many of us have also been working on systems for ML technologies, including big data infrastructure [27, 9], ML toolkits [17, 13, 14], and the systems that orchestrate it all in the cloud [10].

This experience has led to one key insight: “An ML model is software derived from data”. This means that ML presents characteristics that are typical of software (e.g., it requires rich and new CI/CD pipelines), and of data (e.g., the need to track lineage)—hence, the database community is well positioned to play a key role in EGML, but much work is needed. We discuss some of the problems we tackle in § 4.

A second—and painfully clear—observation is that the actual model development represents less than 10% of most data science projects. The remainder is about getting to the data, and then operationalizing the best model.

Conversations with enterprises. We have engaged with many large, sophisticated enterprises, including: (i) a financial institution seeking to streamline its loan approval process, (ii) a marketing firm identifying which customers to target for promotions, (iii) a sports company predicting athletes’ performance, (iv) a health insurance agency aiming to predict patient recidivism, and (v) a large automotive company modeling recalls, customer satisfaction, and marketing.

A key learning from these conversations is that compared to “unicorn” ML applications like web search, these enterprise applications are characterized by smaller teams with domain expertise rather than deep algorithmic or systems expertise. On the other hand, their platform requirements are much more stringent around auditing, security, privacy, fairness, and bias. (Footnote 4: This is not intended to suggest that unicorn applications do not share these requirements; rather, enterprise teams want off-the-shelf platforms that have a much higher level of support built-in, whereas unicorn teams have typically built everything from scratch.) This is particularly true for regulated industries. Existing ML technologies are not ready to support these applications in a safe, cost-effective manner.

Github analysis. To get a feel for trends in the broader data science community, we downloaded and analyzed nearly million public Python notebooks from Github, plus hundreds of thousands of data science pipelines from within Microsoft. Moreover, we analyzed hundreds of versions for popular Python packages. The details of this analysis are beyond the scope of this paper, but we make a few key observations. Figure 2 shows the fraction of notebooks that would be completely supported, if we only covered the K most popular packages (for varying values of K). The shift between 2017 and 2019 suggests that the field is still expanding quickly (many more packages) but also that we are seeing an initial convergence (a few packages are becoming dominant). For example, numpy, pandas and sklearn are solidifying their position. We also observe very limited adoption of solutions for testing/CI-CD/model tracking (MLFlow [37] is still not very popular despite its relevance to EGML).

Overall this suggests that systems aiming to support EGML must provide broad coverage, but can focus on optimizing a core set of ML packages.

Figure 2: Notebook coverage (%) for top-K packages.
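The coverage metric behind Figure 2 can be illustrated with a minimal Python sketch. This is not the analysis pipeline used for the paper; the function name `coverage_at_k`, the definition of popularity (number of notebooks importing a package), and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def coverage_at_k(notebook_imports, k):
    """Fraction of notebooks whose imports all fall within the k most popular packages."""
    # Assumed popularity measure: number of notebooks importing the package at least once.
    popularity = Counter(pkg for pkgs in notebook_imports for pkg in set(pkgs))
    top_k = {pkg for pkg, _ in popularity.most_common(k)}
    # A notebook is "completely supported" only if every package it imports is in the top k.
    covered = sum(1 for pkgs in notebook_imports if set(pkgs) <= top_k)
    return covered / len(notebook_imports)


# Hypothetical toy corpus: each entry is the set of packages one notebook imports.
notebooks = [
    {"numpy", "pandas"},
    {"numpy", "sklearn"},
    {"numpy", "pandas", "sklearn"},
    {"numpy", "torch"},
    {"pandas"},
]
coverage = {k: coverage_at_k(notebooks, k) for k in (1, 2, 3, 4)}
```

On this toy corpus the curve rises from 0% at K=1 to 100% at K=4, which mirrors the shape of the argument: a long tail of packages is needed for full coverage, while a few dominant packages account for most notebooks.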

Competitive landscape. Many of the companies that built the first “unicorn” ML applications also developed systems to support the data science lifecycle. Some of those tech stacks make it out as open source, some lead to services on the public cloud. In Figure 3, we compare some of the most mature systems in this area (with accessible information), based on the level to which they support different features. (Footnote 5: This is ostensibly a subjective judgement based on a few weeks of analysis of marketing material, code skimming, and light experimentation.) Note that the area is dynamic and that the table reflects our understanding of these systems at the time of writing. We consider Bing, Uber [6] and LinkedIn’s ProML [7] as examples of proprietary infrastructures powering “unicorn” applications. Also, all major public cloud providers have services to support enterprise machine learning [10, 11, 8].

Analyzing those systems, we identified key feature areas: Training, Deployment and Data Management. A detailed discussion is beyond the scope of this 6-page submission. (Footnote 6: It can be added in camera ready if desirable.) But we observe two major trends: 1) mature proprietary solutions have stronger support for data management (this is consistent with our own direct experience), and 2) providing complete and usable third-party solutions in this space is non-trivial (otherwise the cloud vendors who already had internal versions of this would have already done so). We speculate that this relates to the extra challenges introduced by EGML, and believe this area is primed for disruptive research, as we discuss in more detail in § 4.

Figure 3: ML Systems in the public cloud and major companies.

Research in the ML community. Research in the area of machine learning systems is plentiful, and too large an area to do justice to here. Instead, we focus on major trends that we observed over the last 15 years that influenced our thinking on the intersection of machine learning and data management systems.

After an initial focus on algorithms, the ML community has concentrated (in rough order of appearance) on: 1) systems for training, 2) systems for scoring, 3) AutoML solutions, and 4) responsible AI. The systems-for-training area was initially dominated by big-data extensions [3, 38] and HPC-based solutions [43], and later by parameter servers with bounded staleness [26]. More recent attempts such as ML.NET [17] and TFX [22] have borrowed more profoundly from the dataflow/database literature to build ML-native solutions. Systems for scoring were largely ignored until systems such as Clipper [25], Pretzel [36] and TensorFlow Serving [40] came to fruition. They draw heavily from streaming systems and optimizing compilers in their design. As ML adoption began to broaden, AutoML solutions began to appear [32]. Lately, interest in bias, fairness and responsible use of machine learning is exploding, though only limited solutions exist. This aligns with feedback from enterprise customers (i.e., “automate it, and don’t get me sued”).

We conclude that data platforms play a key role to achieve fast and reliable training and scoring, and that explicit metadata management and provenance tracking are foundational for responsible AI and AutoML solutions.

4 Open Problems & Advances

The vision for EGML we presented is an exciting one and presents many challenging problems. We summarize some key challenges below and present some of our ongoing work. We focus on two categories that require attention from the DB community and are not well understood: 1) the systems support required to go from a trained model to decisions, and 2) data management for ML.

4.1 From Model to Decision: Inference

Much attention has been given to learning algorithms and efficient model training, but models only have value insofar as they are used for inference, to create insights and make decisions. This typically involves a complicated setup of containers for deploying the trained model (as executable code), with applications invoking them via HTTP/REST calls. Further, the containerized code often extends model inference with the implementation of complex application-level policies.

While this containerized approach offers a desirable decomposition of the problem between models and the applications using them, it has significant drawbacks: (1) Many applications use more than one model, with each model applied to the outcome of some (potentially different) data processing step. These assemblies of models and preprocessing steps should be updated atomically. (2) It seems unlikely that this solution will fit the scenarios emerging from the millions of applications we expect in this space (e.g., latency-sensitive decisions and large batch predictions are poorly served). (3) Mixing application-level policies and inference logic makes it hard to separate and measure the impact of the two.

We believe that models should be represented as first-class data types in a DBMS. This will naturally address (1) by allowing database transactions to be used for updating multiple deployed models. To address (2), we believe inference/scoring should be viewed as an extension of relational query processing, and argue for moving model inference close to the data and performing it in-DBMS, without external calls for common types of models. Naturally, this calls for a separation of inference from application-level logic; we address a clean framework for (3) after we briefly summarize our early results on in-DBMS inference. (A more in-depth discussion appears in a concurrent CIDR submission titled “Extending Relational Query Processing with ML Inference”.)
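As a rough illustration of points (1) and (2), the sketch below stores two toy linear models as rows in a database, scores them from inside a SQL query through a user-defined function, and updates both models in one transaction. It uses Python's built-in `sqlite3` purely as a stand-in; it is not the SQL Server integration discussed in this paper, and the schema, the `predict` UDF, and the weights are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, version INTEGER, weights TEXT)")
conn.execute("CREATE TABLE loans (id INTEGER PRIMARY KEY, income REAL, debt REAL)")
conn.executemany("INSERT INTO loans VALUES (?, ?, ?)", [(1, 5.0, 1.0), (2, 2.0, 3.0)])
# Hypothetical linear models, serialized as [w_income, w_debt, bias].
conn.execute("INSERT INTO models VALUES ('risk', 1, ?)", (json.dumps([1.0, -1.0, 0.0]),))
conn.execute("INSERT INTO models VALUES ('limit', 1, ?)", (json.dumps([2.0, 0.0, 1.0]),))
conn.commit()


def predict(weights_json, income, debt):
    """Score one row with a linear model; runs inside the query as a scalar UDF."""
    w_income, w_debt, bias = json.loads(weights_json)
    return w_income * income + w_debt * debt + bias


conn.create_function("predict", 3, predict)


def score_all(model_name):
    """Inference expressed as a query: the data never leaves the database."""
    rows = conn.execute(
        "SELECT l.id, predict(m.weights, l.income, l.debt) "
        "FROM loans AS l JOIN models AS m ON m.name = ? ORDER BY l.id",
        (model_name,),
    )
    return rows.fetchall()


def update_models(new_weights):
    """Replace several models in one transaction; any failure rolls back all of them."""
    with conn:  # commits on success, rolls back if an exception escapes
        for name, weights in new_weights.items():
            if len(weights) != 3:
                raise ValueError(f"bad weights for {name}")
            conn.execute(
                "UPDATE models SET version = version + 1, weights = ? WHERE name = ?",
                (json.dumps(weights), name),
            )


before = score_all("risk")
try:
    # The second model is malformed, so the first update must not become visible.
    update_models({"risk": [0.5, -0.5, 0.0], "limit": [1.0]})
except ValueError:
    pass
after_failed_update = score_all("risk")  # same as `before`
```

The design point is that the assembly of models is updated with the same atomicity guarantees as any other data, and that scoring composes with joins and filters like any other expression.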

In-DBMS inference. While in-DBMS inference appears desirable, a key question arises: Can in-DBMS model inference perform as well as standalone dedicated solutions?

To this end, several recent works [30, 35, 31, 18] in the database community explore how linear and relational algebra can be co-optimized. To carry this investigation further, we integrated the ONNX Runtime [14] within SQL Server and developed an in-database cross-optimizer between SQL and ML, i.e., optimizations across hybrid relational and ML expressions. Further, we observe that practical end-to-end prediction pipelines are composed of a larger variety of operators (e.g., featurizers such as text encoding and models such as decision trees) often assembled in Python. We leverage static analysis to derive an intermediate representation (IR) amenable to optimization. The list of optimizations we have been exploring is therefore more comprehensive than prior work and includes classical relational optimizations, linear algebra to relational transformations, as well as:


  • predicate push-up/down between SQL queries to ML models;

  • automatic pruning (projection) of unused input feature-columns exploiting model-sparsity;

  • model compression exploiting input data statistics;

  • physical operator selection based on statistics, available runtime (SQL/ONNX/Python UDFs [45]) and HW (CPU, GPU).
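One of the optimizations above, pruning unused feature columns by exploiting model sparsity, can be sketched in a few lines of Python. This is an illustrative toy for a sparse linear model, not the cross-optimizer itself; the feature names, weights, and the `patients` table are hypothetical.

```python
def prune_unused_features(weights, feature_names):
    """Keep only features with a non-zero weight; the others cannot affect the score."""
    kept = [(name, w) for name, w in zip(feature_names, weights) if w != 0.0]
    return [name for name, _ in kept], [w for _, w in kept]


def score(weights, bias, row, feature_names):
    """Linear model score for one row (a dict from column name to value)."""
    return bias + sum(w * row[name] for name, w in zip(feature_names, weights))


features = ["age", "income", "zip_code", "num_visits"]
weights = [0.5, 0.0, 0.0, 2.0]  # hypothetical sparse linear model
bias = 1.0

needed_columns, pruned_weights = prune_unused_features(weights, features)
# The relational side of the plan now only needs to project the columns the model reads.
projection_sql = f"SELECT {', '.join(needed_columns)} FROM patients"

row = {"age": 40.0, "income": 1e5, "zip_code": 98052.0, "num_visits": 3.0}
full_score = score(weights, bias, row, features)
pruned_score = score(
    pruned_weights, bias, {c: row[c] for c in needed_columns}, needed_columns
)
```

Both scores are identical, while the pruned plan reads half as many columns. The same idea extends to other model classes, for example tree ensembles that never split on a given feature.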

In Figure 4 we present two key results: 1) the performance of the ONNX Runtime within SQL Server (SONNX), and 2) a cross-optimization leveraging UDF inlining [42], predicate push-up and model pruning (SONNX-ext). The results show that SQL Server integration provides a speedup over standalone ONNX (due to automatic parallelization of the inference task in SQL Server) and a speedup from our combined optimizations. Early results indicate that in-DBMS inference is very promising.

Figure 4: In-database inference (left) impact of optimizations (right).

Bridging the model-application divide. ML applications need to transform the model predictions into actionable decisions in the application domain. However, the mathematical output of the model is rarely the only parameter considered before a decision is made. In real deployment scenarios, business rules and constraints are important factors that need to be taken into account before any action is taken. As a concrete example, we have built models to automate the selection of parallelism for large big data jobs to avoid resource wastage (in the context of Cosmos clusters [27]). While models are generally accurate, they occasionally predict resource requirements in excess of the amounts allowed by user-specified caps. Business rules expressed as policies then override the model.

Obviously, business rules and requirements can vary between different applications and environments. To that end, we employ a generic and extensible module [28] that takes as input user-defined policies which introduce various business constraints on top of EGML workloads. The module continuously monitors the output of the ML models and applies the specified policies before taking any further action in the application domain. It also maintains the system state and actions taken over time allowing to easily debug and explain the system’s actions. Finally, it makes sure that the actions happen in a transactional way, rolling back in case of failures when needed. Overall this closes the loop between model and application, providing us visibility necessary for both debugging and end-to-end accountability.
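The behavior described above can be approximated with a small Python sketch. It is not the module from [28]: the `PolicyModule` class, the policy functions and the job context are hypothetical, and the transactional rollback of actions is deliberately left out. It only shows how user-defined policies can override a model's prediction while keeping an audit trail.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A policy maps (current value, context) to a possibly adjusted value.
Policy = Callable[[float, dict], float]


@dataclass
class PolicyModule:
    policies: List[Tuple[str, Policy]]
    audit_log: List[dict] = field(default_factory=list)

    def decide(self, prediction: float, context: dict) -> float:
        """Apply each policy in order and record which ones changed the value."""
        value = prediction
        applied = []
        for name, policy in self.policies:
            adjusted = policy(value, context)
            if adjusted != value:
                applied.append(name)
            value = adjusted
        # Keep the raw prediction next to the final decision for later debugging.
        self.audit_log.append(
            {
                "prediction": prediction,
                "decision": value,
                "overridden_by": applied,
                "context": dict(context),
            }
        )
        return value


def user_cap(value, context):
    """Never request more parallelism than the user-specified cap."""
    return min(value, context["user_cap"])


def at_least_one(value, context):
    """A job always needs at least one unit of parallelism."""
    return max(value, 1)


module = PolicyModule(policies=[("user_cap", user_cap), ("at_least_one", at_least_one)])
decision = module.decide(prediction=250, context={"job": "nightly_etl", "user_cap": 100})
```

Here the model asks for 250 units of parallelism, the cap policy reduces the decision to 100, and the audit log records both the prediction and the policy that overrode it. Keeping the two separate is what makes it possible to measure the impact of the model independently from that of the policies.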

Next, we discuss the requirements for managing data for ML.

4.2 Data Management for ML

Data Discovery, Access and Versioning. One of the main challenges ML practitioners face today revolves around data access and discovery. Training data commonly contains tabular data, but also images, video or other sensor data. This gives rise to a predominantly file-based workflow. Only a small fraction of the notebooks we analyzed makes use of a database access library. This is surprising, as the vast majority of the pipelines ultimately use Pandas [16], a structured DataFrame, to interact with this data. This state of the art is deeply unsatisfying: Data Discovery support is virtually non-existent. This is especially troubling as data augmentation is one of the best strategies to improve a model.

Worse, data versioning is largely unsolved in this paradigm: A model is the result of both its training code and the training data. Both need to be versioned. And file versioning technologies fail to address key needs of data versions: They often can only represent a deletion via a history rewrite. More fundamentally, files are not the atomic unit of training data: an individual data point may be stored in a file, but equally likely, many files represent one data point; or one file contains many data points.

Hence, we believe that there is an open need for data abstractions backed by query, lineage-tracking and storage technology that can cover heterogeneous, versioned, and durable data.

Model Management. We have argued that ML models are software artifacts created from data, and must be secured, tracked and managed on par with other high-value data. DBMSs provide a convenient starting point thanks to their support for enterprise-grade features such as security, auditability, versioning, high availability, etc. To be clear, we are not suggesting that all data management needs to be inside a relational DBMS; indeed, we see a trend towards comprehensive data management suites that span all of a user’s data across one or more repositories. Our point is that managing models should be treated on par with how high-value data is managed, whether in a DBMS (the most widely available option currently) or in emerging cross-repository managed environments.

Model Tracking and Provenance. Models are software of consequence. Their genesis needs to be tracked. To achieve that, the full provenance of a model must be known for debugging/auditing.

We need to capture not only the code that trained the model, but also the (training) data that went into it, together with its full, tamper-proof lineage. There are multiple industry efforts to capture the inner training loop of this lineage [37, 12]. This must be expanded to the full lineage, and also automated to achieve the scale we expect. In the context of EGML, the importance of provenance is amplified by the number of applications it enables (e.g., compliance, model debugging, retraining). Yet, this is challenging:

C1. Provenance data model. Data elements in EGML workloads are polymorphic (e.g., tables, columns, rows, ML models, and hyperparameters) with inherent temporal dimensions (e.g., a model may have multiple versions, one for each re-run of a training pipeline). As such, and in contrast to traditional data models of provenance over DBMSs, EGML workloads dictate polymorphic and temporal provenance data models. Such data models are hard to design, capture, maintain, and query.

C2. Provenance capture. EGML workloads typically span multiple systems and runtimes (e.g., a Python script may fetch data from multiple databases to train a model). These systems might have different architecture and programming constructs (e.g., declarative vs. imperative interfaces). Extracting a meaningful provenance data model in this setting requires different capture techniques tailored specifically for each system/runtime.

C3. Provenance across disparate systems. Even if we capture provenance on top of each system and runtime in isolation, we still need to combine this information across systems (e.g., if we change a column in a database, models trained in Python that depend on this column may need to be invalidated and retrained). Hence, EGML workloads require protocols for consolidating and communicating the provenance information across systems.

Our initial solution. Our solution consists of three major modules: the SQL Provenance module, the Python Provenance module and the Catalog. The Catalog (we use Apache Atlas [21]) stores all the provenance information and acts as the bridge between the SQL and the Python Provenance modules. It allows us to capture end-to-end provenance across different systems and hence provides a principled way to address C3.

Provenance in SQL. Our SQL provenance module currently focuses on capturing coarse-grained provenance under two modes, traditionally referred to as eager and lazy. Under eager provenance capture, given a query, the module parses it to extract coarse-grained provenance information (i.e., input tables and columns that affected the output, with connections modelled as a graph). Under lazy provenance capture, the module gets as input the query log of the database and constructs the provenance data model, only this time by accounting for the whole query history. Under both modes, the module populates the Catalog accordingly. To scale across databases, the parsing module utilizes Apache Calcite [1], which provides universal parsers and adapters across databases and hence gives us a way towards addressing C2. For cases where Apache Calcite cannot parse queries, we fall back to the parser of the corresponding engine. Furthermore, note that all data stored in the Catalog is versioned (e.g., an INSERT to a table results in a new version of the table in the provenance data model); hence, we address the temporal aspect of C1. The table below shows the provenance capture performance (latency and provenance graph size) for queries generated out of all query templates in TPC-H and TPC-C:

Dataset #Queries Latency Size(nodes+edges)
TPC-H 2,208 110s 22,330
TPC-C 2,200 124s 34,785

These early findings indicate that a) the per query capture latency can be significant and b) the provenance data model can become substantially large in size (e.g., a table having as many versions as the insertions that have happened to it). For these reasons, we develop optimized capture techniques, through compression and summarization, which are essential towards addressing C1.
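To make the eager, coarse-grained capture mode and the versioning behavior more concrete, here is a deliberately simplified Python sketch. It uses a regular expression as a stand-in for a real SQL parser (the module described above relies on Apache Calcite), tracks tables only rather than columns, and the `ProvenanceCatalog` class and the example statements are hypothetical.

```python
import re
from collections import defaultdict


class ProvenanceCatalog:
    """Toy stand-in for a catalog: versioned table nodes plus derivation edges."""

    def __init__(self):
        self.versions = defaultdict(int)  # table name -> latest version number
        self.edges = set()                # (source node, target node)

    def node(self, table):
        return f"{table}@v{self.versions[table]}"

    def capture(self, sql):
        """Eagerly record table-level provenance for one statement; return its inputs."""
        # Regex-based extraction is an assumption of this sketch and misses many SQL forms.
        sources = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)", sql, flags=re.IGNORECASE)
        target = re.search(r"\bINSERT\s+INTO\s+([A-Za-z_]\w*)", sql, flags=re.IGNORECASE)
        # Resolve input versions before the write bumps the target's version.
        source_nodes = {self.node(t) for t in sources}
        if target is not None:
            name = target.group(1)
            previous = self.node(name)
            self.versions[name] += 1      # every write yields a new table version
            current = self.node(name)
            self.edges.add((previous, current))
            for src in source_nodes:
                self.edges.add((src, current))
        return sorted(source_nodes)


catalog = ProvenanceCatalog()
catalog.capture(
    "INSERT INTO features SELECT o.id, c.region "
    "FROM orders o JOIN customers c ON o.cid = c.id"
)
catalog.capture("INSERT INTO features SELECT id, region FROM returns")
```

After two inserts the `features` table is at version 2 and the graph already holds five edges, which hints at why the graph sizes reported above grow quickly and why compression and summarization matter.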

Provenance in Python. The Python provenance module parses scripts and automatically identifies the lines of code that correspond to feature extraction and model training using a combination of standard static analysis techniques and a knowledge base of ML APIs that we maintain. Through this process, we are able to identify which Python variables correspond to models, hyperparameters, model features and metrics. We can also track the transformations performed on these variables and eventually connect them with the datasets used to generate training data. The Python provenance module accesses the Catalog to collect the output of the SQL provenance module and eventually connect the datasets used in the Python scripts to the columns of one or more DBMS tables.

Dataset #Scripts %Models Covered %Training Datasets Covered
Kaggle 49
Microsoft 37

The above table shows the coverage currently achieved by the provenance module on the Kaggle dataset [34] and a Microsoft internal dataset of scripts deployed in production. In this experiment, we evaluate how often the module correctly identifies ML models and training datasets in the Python scripts.
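The combination of static analysis and a knowledge base of ML APIs can be illustrated with Python's standard `ast` module. The sketch below is a heavily reduced approximation, not the provenance module itself: the knowledge base is a hypothetical set of three class names, only direct `model = Class(...)` assignments and `model.fit(...)` calls are recognized, and the analyzed script is parsed but never executed.

```python
import ast

# Hypothetical, tiny knowledge base of ML APIs: constructor names treated as models.
KNOWN_MODEL_CLASSES = {"LogisticRegression", "RandomForestClassifier", "LinearRegression"}


def call_name(call):
    """Return the simple name of the called function, e.g. 'fit' for model.fit(...)."""
    func = call.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return None


def literal_or_source(node):
    """Evaluate literal hyperparameter values; keep source text for anything else."""
    try:
        return ast.literal_eval(node)
    except ValueError:
        return ast.unparse(node)


def extract_provenance(source):
    models, hyperparameters, fit_args = {}, {}, {}
    for node in ast.walk(ast.parse(source)):
        # Pattern 1: model = KnownClass(hyperparam=value, ...)
        if (isinstance(node, ast.Assign) and isinstance(node.value, ast.Call)
                and call_name(node.value) in KNOWN_MODEL_CLASSES
                and isinstance(node.targets[0], ast.Name)):
            var = node.targets[0].id
            models[var] = call_name(node.value)
            hyperparameters[var] = {
                kw.arg: literal_or_source(kw.value) for kw in node.value.keywords if kw.arg
            }
        # Pattern 2: variable.fit(X, y), recording the variables passed as training data
        elif (isinstance(node, ast.Call) and call_name(node) == "fit"
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)):
            fit_args[node.func.value.id] = [a.id for a in node.args if isinstance(a, ast.Name)]
    # Only keep fit() calls made on variables identified as models.
    training_data = {var: args for var, args in fit_args.items() if var in models}
    return models, hyperparameters, training_data


script = """
import pandas as pd
from sklearn.linear_model import LogisticRegression
df = pd.read_csv("loans.csv")
X_train, y_train = df[["income", "debt"]], df["approved"]
clf = LogisticRegression(C=0.5, max_iter=200)
clf.fit(X_train, y_train)
"""
models, hyperparameters, training_data = extract_provenance(script)
```

For this script the sketch identifies `clf` as a `LogisticRegression` model with its two hyperparameters, and `X_train` and `y_train` as its training inputs. Connecting those variables back to `loans.csv`, and from there to DBMS tables through the Catalog, requires the data-flow tracking described above and is beyond what this example attempts.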

5 Conclusion and Call to Action

We live in interesting times. Database architectures are undergoing major transformation to leverage the elasticity of clouds, and a combination of increased regulatory pressures and data sprawl is forcing us to rethink data governance more broadly. Against this backdrop, the rapid adoption of ML in enterprises raises foundational questions at the intersection of model training, inference and governance, and we believe the DB community needs to play a significant role in shaping the future.