Vamsa: Tracking Provenance in Data Science Scripts

01/07/2020 · Mohammad Hossein Namaki, et al. · Washington State University, Microsoft, Case Western Reserve University

Machine learning (ML), which was initially adopted for search ranking and recommendation systems, has firmly moved into the realm of core enterprise operations like sales optimization and preventative healthcare. For such ML applications, often deployed in regulated environments, the standards for user privacy, security, and data governance are substantially higher. This imposes the need for tracking provenance end-to-end, from the data sources used for training ML models to the predictions of the deployed models. In this work, we take a first step towards this direction by introducing the ML provenance tracking problem in the context of data science scripts. The fundamental idea is to automatically identify the relationships between data and ML models and, in particular, to track which columns in a dataset have been used to derive the features of a ML model. We discuss the challenges in capturing such provenance information in the context of Python, the most common language used by data scientists. We then present Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the user's code. Using up to 450K real-world data science scripts from Kaggle and publicly available Python notebooks, we verify the effectiveness of Vamsa in terms of coverage and performance. We also evaluate Vamsa's accuracy on a smaller subset of manually labeled data. Our analysis shows that Vamsa's precision and recall are at least 87.5% and that its latency is typically in the order of milliseconds for scripts of average size.


1 Introduction

Machine learning (ML) has proven itself in multiple consumer applications such as web ranking and recommendation systems. In the context of enterprise scenarios, ML is emerging as a compelling tool in a broad range of applications such as marketing/sales optimization, process automation, preventative healthcare, and automotive predictive maintenance, among others.

For such enterprise-grade ML applications [13], often deployed in regulated environments, the standards for user privacy, security, and explainability are substantially higher and now have to be extended to ML models as well. Consider the following scenarios:

Compliance. The protection of personal data is crucial for organizations due to relatively recent compliance regulations such as HIPAA [5] and GDPR [4]. As more emerging applications rely on ML, it is critical to ensure effective ongoing compliance across the various pipelines deployed in an organization. Thus, developing techniques that automatically verify whether the developer’s data science code is compliant (e.g., tools that determine if the features used to build a machine learning model are derived from sensitive data such as personally identifiable information (PII) [57]) is an immediate priority in the enterprise context.

Reacting to data changes. Avoiding staleness in the ML models deployed in production is a crucial concern for many applications. To this end, it becomes critical to detect which models are affected when data becomes unreliable or data patterns change, by tracking the dependencies between data and models. For example, it is possible that the code used to populate the data had a bug that was later discovered by an engineer. In this case, one would like to know which ML models were built based on this data and take appropriate action. Similarly, one might want to investigate whether the feature set of a ML model should be updated once new dimensions have been added to the data.

Model debugging. Diagnosis and debugging of ML models deployed in production remain an open challenge. An important aspect of model debugging is to understand whether the decreased model quality can be attributed to the original data sources. For example, a data scientist, while debugging her code, might eventually find that the ML model is affected by a subset of the data that contains problematic values for a particular feature. In such scenarios, one needs to automatically track the original data sources used to produce this model and evaluate whether they also contain such values.

The aforementioned scenarios motivate the need for tracking provenance end-to-end, from the data sources used for training ML models to the predictions of the deployed ML models. In this paper, we take a first step towards this direction by introducing the ML provenance tracking problem. The core idea is to automatically identify the relationships between data and ML models in a data science script and, in particular, to track which columns in a dataset have been used to derive the features (and optionally labels) used to train a ML model. To address this problem, we design Vamsa (a Sanskrit word that means lineage), a system that automatically tracks coarse-grained provenance from scripts written in Python (the most common language used by data scientists [7]) using a variety of static analysis techniques.

Consider the Python script presented in Figure 1 that was created in the context of the Kaggle Heart Disease competition [6]. The script trains a ML model using a patient dataset from a U.S. hospital. The model takes as input a set of features such as Age, Blood pressure, and Cholesterol, and predicts whether a patient might develop heart disease in the future. After performing static analysis on the script, Vamsa not only detects that this script trains a ML model but also that the columns Target and SSN from the heart_disease.csv dataset are not used to derive the model’s features.
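For concreteness, the following is a minimal sketch of a script in the spirit of Figure 1 (the figure itself is not reproduced here). The exact column names, the use of train_test_split, and the line positions referenced later (lines 4, 5, 9, and 10) are assumptions based on the examples in Sections 4-6, not the actual figure.

```python
# Minimal sketch of a Figure-1-like script (column names and line layout are assumed).
import pandas as pd
from sklearn.model_selection import train_test_split
import catboost as cb

train_df = pd.read_csv('heart_disease.csv')                 # data source (approx. line 4)
train_df = train_df[train_df.columns[0:]]                   # select a range of columns (approx. line 5)
train_x = train_df.drop(['Target', 'SSN'], axis=1)          # exclude Target and SSN from the features
train_y = train_df['Target']                                # labels derived from the Target column
train_x2, val_x, train_y2, val_y = train_test_split(
    train_x, train_y, test_size=0.2)
clf = cb.CatBoostClassifier()                               # model constructor (approx. line 9)
clf.fit(train_x2, train_y2, eval_set=(val_x, val_y))        # training (approx. line 10)
```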

Building a system that captures such provenance information is challenging: (1) As opposed to data provenance in SQL, scripting languages are not declarative and thus may not specify the logical operations that were applied to the data [56]. This is exacerbated in dynamically typed languages such as Python. (2) Data science is still an emerging field, as exemplified by popular libraries like scikit-learn [46] still evolving their APIs and by the growth of newly available frameworks like PyTorch [12]. (3) Scripts encode various phases of the data science lifecycle including exploratory analysis [62], visualizations, data preprocessing, training, and inference. Hence, it is nontrivial to identify the relevant fraction of a script that contributes to the answer of a specific provenance query.

Vamsa is specifically designed to address the aforementioned challenges without requiring any modifications to the user's code, by solely relying on a modular architecture and a knowledge base of APIs of various ML libraries. Vamsa does not make any assumption about the ML libraries/frameworks used to train the models and is able to operate on all kinds of Python libraries as long as the appropriate APIs are included in the knowledge base. Additionally, Vamsa’s design allows users to improve coverage by simply adding more ML APIs to the knowledge base without any further code changes.

This paper makes the following contributions:


  1. Motivated by the requirements of enterprise-grade ML applications, we formally introduce the problem of ML provenance tracking in data science scripts that train ML models. To the best of our knowledge, this is the first work that addresses this problem.

  2. We present Vamsa, a modular system that tackles the ML provenance tracking problem in data science scripts written in Python without requiring any modifications to the users’ code. We thoroughly discuss the static analysis techniques used by Vamsa to identify variable dependencies in a script, perform semantic annotation, and finally extract the provenance information.

  3. Using real-world data science scripts from Kaggle [8] and publicly available Python notebooks [54], we perform experiments using up to 450K scripts and verify the effectiveness of Vamsa in terms of coverage and performance. We also evaluate Vamsa’s accuracy on a smaller subset of manually labeled data. Our analysis shows that Vamsa achieves high precision and recall (at least 87.5%) and its latency is typically in the order of milliseconds for scripts of average size.

The rest of the paper is organized as follows: in Section 2 we formally define the problem of ML provenance tracking and in Section 3 we give an overview of Vamsa’s architecture. Sections 4, 5, and 6 provide a detailed description of Vamsa’s major components and their corresponding algorithms. Section 7 presents our experimental evaluation and Section 8 discusses related work. We conclude the paper and discuss directions for future work in Section 9.

Figure 1: A data science script written in Python

2 Problem Statement

We start by defining the concepts used by Vamsa, followed by the problem of ML provenance tracking in data science scripts that Vamsa targets.

A Data Source can be a database table/view, a spreadsheet, or any other external file that is typically used in Python scripts to access the input data, e.g., hdf5, npy [45].

A common ML pipeline accesses a data source D and learns a ML model M in two steps. First, feature engineering is conducted to extract a set of training samples from D to be used to train the model M. The training samples consist of features and labels that are both derived from selected columns in D, e.g., via transformation functions. The training process then derives the model M by optimizing a learning objective function determined by the training samples and specific predictive or descriptive needs.

A Data Science Script reads from a set of data sources and trains a set of machine learning models. (Note that there are also data science scripts that do not perform any model training but provide other functionality, e.g., visualization or optimization; we focus on scripts that include statements that train ML models, as our goal is to capture the relationships between data sources and generated ML models.) We further focus on scripts written in Python, as this is the major language currently used by data scientists [7, 11].

We now formally define the problem of automated ML provenance tracking which Vamsa targets. The essence is to identify which columns in a dataset have been used to derive the features (and optionally labels, in the context of supervised learning) of a particular ML model in a data science script, thus automatically capturing the relationships between data sources and models at a coarse-grained level and at static analysis time.

ML Provenance Tracking. Given a data science script, find all triples ⟨M, D, C⟩ where each M is a machine learning model trained in the script using data source D. In particular, the model M is trained using features (and optionally labels) derived from a subset of the columns of data source D, denoted as C. The goal is to identify each trained model M in the script, its data source D, and the set of columns C that were used to train M.

Example 1: The input of Vamsa is a data science script such as the one in Figure 1. The script reads from heart_disease.csv as a data source and trains an ensemble of decision trees using the catboost library [49]. In this script, only a single model is trained. Note that not all the columns of the data source have been used to derive the model’s features and labels. To select the features, the script explicitly extracts a range of columns from the dataset and then drops the columns Target and SSN. Similarly, only the Target column was used to derive the labels. Thus, the desired output is a triple ⟨M, D, C⟩ where M is the variable that contains the trained model, D is the training dataset heart_disease.csv, and C is the set of columns used to derive the features and labels. Vamsa automatically parses the script and produces this output.

3 Vamsa Architecture

Vamsa takes as input a script and produces the provenance information that captures the relationship between data sources accessed by the script and the ML models trained in the script. It follows a modular architecture (illustrated in Figure 2) that addresses all the above challenges without requiring manual modifications to the users’ code.

At a high-level, Vamsa performs static analysis on the Python script to determine the relationships between all the variables in the script, followed by an annotation phase that assigns semantic information to the variables in the script. It then uses a generic provenance tracking algorithm that extracts the feature set for all the ML models trained in the script and stores this information in a central catalog that can be accessed by various provenance applications.

More specifically, Vamsa processes data science scripts with the following three major modules: the Derivation Extractor, the ML Analyzer, and the Provenance Tracker that we discuss in detail in the following sections:


(1) Derivation Extractor generates a workflow intermediate representation (WIR) of the script. It extracts the major workflow elements including imported libraries, variables, and functions, as well as their dependencies (Section 4).


(2) ML Analyzer annotates variables in WIR based on their roles in the script (e.g., features, labels, and models). To this end, it uses our proposed annotation algorithm and a knowledge base that contains information about the various APIs of different ML libraries (Section 5). Through the knowledge base, we are able to declaratively introduce semantic information to Python functions which in turn allows us to track provenance in data science scripts.


(3) Provenance Tracker infers a set of columns that were explicitly included in or excluded from the features/labels by using the annotated WIR and consulting the knowledge base. We remark that acquiring labeled data is non-trivial or even infeasible [59] in real-world settings. The Provenance Tracker is able to operate in both supervised and unsupervised learning settings (in the latter by tracking provenance only at the features level).

Figure 2: Vamsa Architecture.

Vamsa does not make any assumption about the ML libraries/frameworks used to train the models. By utilizing a modular architecture combined with a knowledge base of APIs for various ML libraries, Vamsa is able to operate on all kinds of Python libraries as long as the appropriate APIs are included in the knowledge base. Additionally, this design allows users to improve coverage by simply adding more ML APIs in the knowledge base, without having to modify their code or Vamsa’s other components.

To evaluate Vamsa, we have populated our knowledge base with APIs from four well-established data science libraries: scikit-learn [46], XGBoost [2], LightGBM [3], and Pandas [37]. Nevertheless, Vamsa can operate on top of any other library such as CatBoost [49], StatsModels [58], and Graphlab [1], among others.

Figure 3: An example WIR

4 Derivation Extractor

In the first phase, Vamsa parses the Python script and by performing static analysis, builds a workflow model which captures the dependencies among the elements of the script including imported libraries, input arguments, operations that change the state of the program, and the derived output variables. This model is captured in a workflow intermediate representation (WIR) (Section 4.1). Section 4.2 describes how Vamsa automatically generates the model instances.

4.1 Workflow Model

To formally define the WIR model, we introduce the notions of variables, operations, and provenance relationships (PRs). We then discuss how the Derivation Extractor component generates the WIR for a given script.

Variable. In programming languages, variables are containers for storing data values. We denote the set of all variables in the data science script as V. For instance, catboost, cb, and train_df are a few examples of variables in the script of Figure 1.

Operation. An operation p operates on an ordered set of input variables I to change the state of the program and/or to derive an ordered set of output variables O. An operation may be called by a variable, denoted as the caller c. While an operation may have multiple inputs/outputs, it has at most one caller.

Example 2: In Figure 1, the import statements, read_csv() in line 4, attribute accesses in line 5, CatBoostClassifier() in line 9, and fit() in line 10 are examples of operations. Consider the fit() operation: it is invoked by the clf variable and takes three arguments, namely features, labels, and an evaluation set. While fit() does not explicitly produce an output variable, it changes the state of the variable clf from model to trained model.

Provenance relationship. An invocation of an operation p (by an optional caller c) depicts a provenance relationship (PR). A PR is represented as a quadruple ⟨I, c, p, O⟩, where I is an ordered set of input variables, the (optional) variable c refers to the caller object, p is the operation, and O is an ordered set of output variables derived from this process. A PR can be represented as a labeled directed graph, which includes (1) a set of input edges (labeled as ‘input_edge’), one for each variable in I, (2) a caller edge (labeled as ‘caller_edge’) if p is called by c, and (3) a set of output edges (labeled as ‘output_edge’), one for each variable in O. For consistency, we create a temporary output variable for the operations that do not explicitly generate one.

Example 3: Consider line 4 of the script in Figure 1 where the CSV file is read. The corresponding PR is depicted in Figure 3 (dashed rectangle) and corresponds to the quadruple ⟨I, c, p, O⟩ where I contains the file name heart_disease.csv, c is the imported pandas module, and p is the read_csv operation. We create a temporary variable and set it as the output O, to be used as the input of another PR.
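For illustration, a PR can be modeled as a simple quadruple. The sketch below (field names are ours, not Vamsa's internal representation) shows the two PRs that would be generated for line 4 of Figure 1, including the temporary output variable that feeds the subsequent assignment.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PR:
    """A provenance relationship: <inputs, caller, operation, outputs>."""
    inputs: List[str]        # ordered input variables
    caller: Optional[str]    # optional caller variable
    operation: str           # the invoked operation
    outputs: List[str]       # ordered output variables (possibly temporaries)

# PRs for: train_df = pd.read_csv('heart_disease.csv')
pr_read_csv = PR(inputs=["'heart_disease.csv'"], caller="pd",
                 operation="read_csv", outputs=["tmp_read_csv"])
pr_assign = PR(inputs=["tmp_read_csv"], caller=None,
               operation="Assign", outputs=["train_df"])
```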

Workflow Intermediate Representation. PRs are composed together to form a WIR, which is a directed graph that represents the sequence of, and the dependencies among, the extracted PRs. The WIR is useful to answer queries such as: “Which variables were derived from other variables?”, “What type of libraries and modules were used?”, and “What operations were applied to each variable?”.

More formally, a WIR is a directed bipartite graph whose two sets of vertices correspond to variables and operations, respectively. Each edge has an associated type from the following set: {input_edge, output_edge, caller_edge}.

Example 4: Figure 3 illustrates a fraction of a WIR that was generated from the script of Figure 1. The variables and operations are represented by rectangles and ovals, respectively. The caller, input, and output edges are marked in blue, red, and black, respectively. Consider the fit operation; one can tell the following from the WIR: 1) it is called by variable clf; 2) it has two ordered input variables train_x2 and train_y2; and 3) a temporary variable, denoted as tmp_fit, was created as its output.

Figure 4: A fraction of an abstract syntax tree (AST)

4.2 WIR Generation

Vamsa generates workflows with the following three-step process. First, its Derivation Extractor component parses the script (using Python's ast module, https://github.com/python/cpython/blob/master/Lib/ast.py) to obtain a corresponding abstract syntax tree (AST) [9, 10] representation. It then identifies the relationships between the nodes of the AST to generate the PRs. Finally, it composes the generated PRs into a directed graph.

Figure 4 shows a fraction of an AST that was generated from line 4 of the script in Figure 1. The AST is a collection of nodes that are linked together based on the grammar of the Python language. Informally, by traversing the AST from left-to-right and top-to-bottom, we can visit the Python statements in the order presented in the script.

Due to the recursive nature of AST node definitions, the WIR generation algorithm is naturally recursive. The algorithm, denoted as GenWIR and illustrated in Figure 5, takes as input the root of the AST and traverses its children from left to right. For each visited AST node, it invokes a recursive procedure (Figure 5) to generate the PRs. Each invocation of this procedure in line 3 of GenWIR may create multiple PRs. All the PRs are accumulated (line 4) and a graph is constructed by connecting the inputs/callers/outputs of the PRs.

The procedure is illustrated in Figure 5 and takes as input an AST node and the set of already generated PRs. It returns a set of WIR variables and the updated set of PRs. The returned WIR variables may be used as inputs/callers/outputs of other PRs. To this end, the procedure initially obtains the operation from the attributes of the AST node (line 1). If the AST node is a literal or constant [9, 10], it returns the corresponding variable without creating new PRs (line 3). Otherwise, to obtain the input variables, the potential caller, and the potentially derived variables, it recursively calls itself (lines 4-6). Once all the required variables for a PR are found, a new PR is constructed and added to the set of PRs generated so far (line 7). It finally returns the output of the last generated PR as well as the updated set of PRs (line 8).

During this process, the procedure extracts the input and output sets and a potential caller variable for each PR (see the definition of a PR in Section 4.1). To this end, it investigates the AST node attributes to instantiate these variables by invoking the extract_from_node procedure, which we summarize next. The procedure takes as input an AST node and a literal parameter denoting the information requested (input, output, caller, operation), and consults the abstract grammar of AST nodes [9] to return the requested information for the given node. For example, when processing the Assign node of the AST in Figure 4, the procedure identifies Assign.value as input, Assign as operation, and Assign.targets as output. It also sets the caller to NULL, as the procedure does not return a caller for the AST node type Assign.

Algorithm GenWIR: given the AST root node, visit each top-level statement from left to right, recursively generate PRs for each node (extracting inputs, caller, operation, and outputs via extract_from_node), accumulate the generated PRs, and connect their inputs/callers/outputs to form the WIR, which is returned as output.
Figure 5: WIR generation algorithm
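To make the three-step process concrete, the following is a highly simplified sketch of PR extraction over an AST using Python's ast module. It handles only a handful of node types (Call, Attribute, Assign, Name, Constant) and is an illustration of the idea, not Vamsa's GenWIR implementation.

```python
import ast

def gen_prs(node, prs):
    """Return the output variable name(s) for `node`, appending new PRs to `prs`."""
    if isinstance(node, ast.Constant):
        return [repr(node.value)]
    if isinstance(node, ast.Name):
        return [node.id]
    if isinstance(node, ast.Attribute):                # e.g., pd.read_csv
        base = gen_prs(node.value, prs)
        return [f"{base[0]}.{node.attr}"]
    if isinstance(node, ast.Call):
        func, caller, op = node.func, None, None
        if isinstance(func, ast.Attribute):            # method call: caller.op(...)
            caller = gen_prs(func.value, prs)[0]
            op = func.attr
        elif isinstance(func, ast.Name):               # plain call: op(...)
            op = func.id
        inputs = [gen_prs(a, prs)[0] for a in node.args]
        tmp = f"tmp_{op}_{len(prs)}"                   # temporary output variable
        prs.append((inputs, caller, op, [tmp]))
        return [tmp]
    if isinstance(node, ast.Assign):
        value = gen_prs(node.value, prs)
        targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        prs.append((value, None, "Assign", targets))
        return targets
    return []

def gen_wir(source):
    prs = []
    for stmt in ast.parse(source).body:                # left-to-right, top-to-bottom
        gen_prs(stmt, prs)
    return prs                                         # the PRs compose the WIR

print(gen_wir("train_df = pd.read_csv('heart_disease.csv')"))
```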

Complexity. Each AST edge is visited at most once during WIR generation. Thus, for a Python script whose corresponding AST contains |E| edges, GenWIR has O(|E|) complexity. More specifically, the recursive procedure requires constant time per visited AST node since it only traverses a bounded number of neighbors (see the Python grammar [9]). In addition, the number of nodes/edges in a WIR is also bounded by the number of nodes/edges in its corresponding AST since, for each node/edge in the AST, we may generate a corresponding node/edge in the WIR.

Library  | Module          | Caller | API_Name           | Inputs                                       | Outputs
---------|-----------------|--------|--------------------|----------------------------------------------|------------------------------
catboost | NULL            | NULL   | CatBoostClassifier | eval_metrics: hyperparameter                 | model
catboost | NULL            | model  | fit                | features; labels; eval_set: validation sets  | trained model
sklearn  | model_selection | NULL   | train_test_split   | features; labels; test_size: testing ratio   | features; validation features

Table 1: Example of facts in Vamsa knowledge base

5 Machine Learning Analyzer

The generated WIRs capture the dependencies among the variables and operations in a script. Nevertheless, WIRs alone do not provide semantic information such as the role of a variable in the script (e.g., ML model, features) or the type of each object (e.g., CSV file, DataFrame). To support provenance queries, semantic information about variables should be associated with the WIRs. Such information, in turn, identifies critical variables such as hyperparameters, models, and metrics for ML applications.

Finding the role of each variable in a WIR is a challenging task for multiple reasons: (1) One cannot accurately deduce the role/type of the inputs and outputs of each operation by only looking at the name of the operation, as different ML libraries may use the same name for different tasks; (2) Even in the same library, an operation may accept a different number of inputs or provide different outputs. For example, in the sklearn library [46], the function fit accepts a single input when creating a clustering model but two inputs when generating a classification/regression model; (3) The type of the caller object might also affect the behavior of the operation. For instance, in sklearn, invocation of the fit function by a RandomForestClassifier creates a model but calling it via LabelEncoder does not; (4) The APIs of many libraries are not yet stable and change as the libraries evolve; (5) Some variables are even harder to semantically annotate because of the lack of concrete APIs associated with them. For example, identifying when a variable represents features is challenging since typically there is no specific API to load the training dataset. Instead, the common practice is to use generic functions such as read_csv to load training data, similarly to other data sources.

To be usable across various data science scripts, a semantic annotation framework must be: (1) compatible with the various ML libraries [46, 37, 2, 49, 3, 58, 1] and their different versions, and (2) extensible to accommodate new ML libraries. To this end, we propose an annotation algorithm that relies on a knowledge base (KB) of ML APIs that contains information on the APIs of various ML libraries, their modules, and their signatures (Section 5.1). The KB can be used to answer questions such as “What is the role of the input/output variables of a particular operation belonging to a given ML library?”. Our annotation algorithm annotates the WIR by querying the KB to obtain semantic information about the various variables and operations.

5.1 Knowledge Base of ML APIs

The KB contains fine-grained information about ML libraries stored in the form of relational tables. For each library, the KB stores its name (e.g., sklearn, xgboost), version, and modules (e.g., ensemble, svm). For each unique API in a library, the KB captures the corresponding library, module, caller type, and operation name (e.g., train_test_split from the model_selection module of the sklearn library or read_csv from the Pandas library). For each potential input of an operation, the KB stores its role (features, labels, hyperparameter, or metric) and its data type (DataFrame, array, CSV file). Similarly, the KB contains semantic information about the outputs of the various operations.

Example 5: Table 1 shows three tuples in our KB. These are a subset of the tuples that are utilized by the annotation algorithm to identify the variables that correspond to models and features in the script of Figure 1. The second tuple shows that when the fit operation is called via a model constructed by the catboost library, its first and second inputs are features and labels, respectively. It also accepts validation sets as input. The output of the operation is a trained model.

To facilitate the annotation of WIR variables, the KB supports two types of queries. The first one takes as input the name of a library, module, caller type, and operation name, and returns a set of user-defined annotations that describe the role and type of each input/output of the operation. The second one obtains the annotations of the input variables of an operation given the annotations of its output variables.
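As an illustration, the KB facts of Table 1 and the first query type could be represented as follows; the dictionary schema and the annotate_io name are our own, not Vamsa's API.

```python
# Each KB fact mirrors a row of Table 1.
KB = [
    {"library": "catboost", "module": None, "caller": None,
     "api": "CatBoostClassifier",
     "inputs": {"eval_metrics": "hyperparameter"}, "outputs": ["model"]},
    {"library": "catboost", "module": None, "caller": "model", "api": "fit",
     "inputs": {0: "features", 1: "labels", "eval_set": "validation sets"},
     "outputs": ["trained model"]},
    {"library": "sklearn", "module": "model_selection", "caller": None,
     "api": "train_test_split",
     "inputs": {0: "features", 1: "labels", "test_size": "testing ratio"},
     "outputs": ["features", "validation features"]},
]

def annotate_io(library, module, caller, api):
    """First query type: given an operation, return its input/output annotations."""
    for fact in KB:
        if (fact["library"], fact["module"], fact["caller"], fact["api"]) == \
           (library, module, caller, api):
            return fact["inputs"], fact["outputs"]
    return None

# e.g., annotating the fit() call of Figure 1:
print(annotate_io("catboost", None, "model", "fit"))
```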

Note that in our current prototype, and similar to other efforts for KB population [34], the construction of Vamsa’s KB is manual. As such, the construction and maintenance costs may seem non-negligible over time. As we show in our experiments, however, our manually constructed (yet minimal) KB results in large coverage over big collections of data science scripts. This is primarily because many data science scripts rely on similar coding patterns. Finally, we note that an orthogonal and interesting direction for future work is how to populate such KBs automatically.

5.2 Annotation Algorithm

The annotation algorithm traverses the WIR and annotates its variables by querying the KB when needed. After each annotation, new semantic information about a WIR node is obtained that can be used to enrich the information associated with other WIR variables, as is typical in the analysis of data flow problems [61]. The propagation of semantic information is achieved through a combination of forward and backward traversals of the WIR.

The algorithm, illustrated in Figure 6, annotates the WIR variables by using the KB. It takes as input a WIR extracted from a script and the knowledge base, and computes an annotated WIR enriched with semantic information.

The algorithm starts by finding the set of PRs with import operations, which serves as a seed set for the upcoming DFS traversals (line 1). These PRs contain the information about the imported libraries and modules in the Python script. For each such PR, the algorithm extracts the library name and the potentially utilized module (line 3). It then initiates a DFS traversal that, starting from the PR, traverses the WIR in a forward manner, i.e., by going through the outgoing edges (line 4). For each seen PR, it obtains the annotation information for both its inputs and outputs by querying the knowledge base (lines 5-6) as described in the previous section.

If a new annotation was found for an input variable, the algorithm initiates a backward DFS traversal. As the input variable can be the output of another PR, any new information discovered for it can be propagated to the PRs of which it is an output. In particular, starting from the input variable, the algorithm traverses the WIR in a backward manner, i.e., by going through the incoming edges (line 8). During the backward traversal, the KB is used to obtain information about the inputs of an operation given its already annotated output. In each initiated DFS traversal, each edge is visited only once. The algorithm terminates when no more information can be obtained by initiating further forward/backward traversals.

Annotation algorithm: given the WIR and the knowledge base, find the import PRs as the seed set; for each seed, extract the library and module and run a forward DFS over the WIR, querying the KB to annotate the inputs and outputs of each visited PR; for each newly annotated input variable, run a backward DFS that uses the KB to annotate the inputs of the PRs that produced it; finally, return the annotated WIR.
Figure 6: Annotation algorithm
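The following toy sketch illustrates the forward/backward propagation idea on a hand-built WIR fragment resembling Figure 3. The data structures, the KB facts, and the backward rules are illustrative assumptions, not Vamsa's actual representation.

```python
# Each PR is a tuple (inputs, caller, operation, outputs).
prs = [
    (["['Target','SSN']"], "train_df", "drop", ["train_x"]),
    (["train_x", "train_y"], None, "train_test_split",
     ["train_x2", "val_x", "train_y2", "val_y"]),
    ([], None, "CatBoostClassifier", ["clf"]),
    (["train_x2", "train_y2"], "clf", "fit", ["tmp_fit"]),
]

# Forward KB facts: roles of the positional inputs and outputs of an operation.
KB = {
    "CatBoostClassifier": {"inputs": [], "outputs": ["model"]},
    "fit": {"inputs": ["features", "labels"], "outputs": ["trained model"]},
}
# Backward KB facts: which input (by position, or the caller) an output derives from.
KB_BACKWARD = {
    "train_test_split": {0: 0, 1: 0, 2: 1, 3: 1},
    "drop": {0: "caller"},
}

ann = {}

def backward(var, role):
    """Propagate a newly discovered role to the PRs that produced `var`."""
    for inputs, caller, op, outputs in prs:
        if var in outputs and op in KB_BACKWARD:
            src = KB_BACKWARD[op].get(outputs.index(var))
            if src is None:
                continue
            target = caller if src == "caller" else inputs[src]
            if target and target not in ann:
                ann[target] = role
                backward(target, role)

for inputs, caller, op, outputs in prs:              # forward pass (program order)
    if op in KB:
        for v, role in zip(inputs, KB[op]["inputs"]):
            if v not in ann:
                ann[v] = role
                backward(v, role)                    # backward pass from new input info
        for v, role in zip(outputs, KB[op]["outputs"]):
            ann[v] = role

print(ann)  # clf: model; train_x2/train_x/train_df: features; train_y2/train_y: labels
```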

Example 6: Operating on the WIR of Figure 3, the annotation algorithm initializes the seed set with the single import operation and extracts the catboost library. Once it visits the CatBoostClassifier operation, it queries the KB to obtain the annotation of its output. Given the catboost library and the CatBoostClassifier operation, the algorithm annotates clf as a model. Since there exists no input edge for the CatBoostClassifier node in this WIR, no backward traversal is initiated. The algorithm moves forward and visits the fit operation. It queries the KB with the same library but an updated caller and operation. The algorithm annotates the output of fit as a trained model and then stops the forward propagation since the node has no more outgoing edges. However, at this point it has successfully annotated train_x2 and train_y2 as the features and labels, respectively. Thus, two backward traversals are started to propagate this information as far as possible to the preceding nodes in the WIR. Let us follow the DFS that was started from train_x2. By visiting the train_test_split node, the algorithm annotates train_x as features. Similarly, it back-propagates the new annotation to train_df, the caller of the drop operation. The algorithm continues until no more annotation information can be obtained.

Complexity. In a WIR, let S be the set of nodes corresponding to import operations. The annotation algorithm executes |S| rounds of forward DFS traversals. Furthermore, each forward DFS may initiate a backward DFS traversal for a newly visited input variable; the backward traversal is executed only if the operation is included in the KB. Consider the set of those operations and let d be the maximum in-degree of their corresponding nodes: the number of backward traversals initiated from a visited PR is bounded by d. Since each DFS visits an edge at most once, it takes up to O(|E|) time, where E is the set of WIR edges. Thus, in the worst case, the running time is proportional to the number of initiated traversals times |E|. Our analysis with real-world scripts [54] shows that |S| and d are typically small.

6 Provenance Tracker

Figure 7: WIR with Subscript operation

We next introduce the provenance tracker component of Vamsa. The provenance tracker is responsible for automatically detecting the subset of columns in a data source that was used to train a ML model. We’d like to point out that this is only one of the various provenance/tracking applications that can be built on top of the ML Analyzer.

To identify the columns, we need to investigate the operations in the annotated WIR that are connected to variables that contain features and labels in their annotation set. There are various operations that take features (or labels) as their caller/input and may apply transformations, drop a set of columns, select a subset of rows upon satisfaction of a condition, copy them into another variable, and/or use them for visualization, among others. All these operations and their dependencies should be captured in the KB.

Following this intuition, we enrich our KB with a new table to guide the provenance tracker algorithm. The new table contains two types of operations: 1) operations from various Python libraries that exclude columns (e.g., drop and delete in the Pandas library) or explicitly select a subset of columns (e.g., iloc and ix), and 2) a few native Python operations such as Subscript, ExtSlice, Slice, Index, and Delete [9, 10]. For each entry in this table, we set a flag column_exclusion = True if the corresponding operation can be used for column exclusion (e.g., drop and delete). We remark that some operations captured in the KB can be used to remove both columns and rows depending on the values of one or more input parameters. As an example, the function drop in the Pandas library removes rows when the parameter axis is set to 0 and removes columns when it is set to 1. The parameters of the operations are also captured in the WIR, and thus we can easily verify their values. The condition that needs to be checked to verify whether a particular invocation of an operation is used to remove columns is also added to the KB along with the operation.
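As a minimal standalone Pandas illustration of the axis check described above (not part of Figure 1):

```python
import pandas as pd

df = pd.DataFrame({"Age": [63, 41], "SSN": ["xxx", "yyy"], "Target": [1, 0]})
df.drop(0, axis=0)                    # returns a copy with a row removed (not column exclusion)
df.drop(["Target", "SSN"], axis=1)    # returns a copy with columns removed -> column exclusion applies
```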

We query this table by providing the name of the operation. The query returns NULL if there is no matching entry in the KB. However, if the operation matches one of the entries in the table, the query returns the following output: (1) the condition associated with the operation, as mentioned above (if any); (2) the column_exclusion flag, i.e., whether the operation can be used for column exclusion; and (3) a traversal rule, i.e., a description of how to start a backward traversal from the node’s input edges in order to identify a set/range of indices/column names.

Example 7: Figure 7 shows another fraction of the WIR generated from line 5 of the script in Figure 1, which includes a Subscript operation. The statement in line 5 keeps all the rows but only includes a range of columns up to the last index of the dataset. One can find the set of included columns by traversing the nodes backward, following the input edge of the Subscript operation, until reaching the constant values connected with the Slice operations. The traversal rule associated with the Subscript operation states that the input edge of this node must be followed in a backward manner to eventually reach the selected columns. Note that this is the case for all WIRs that contain this operation.

Similarly, consider the drop operation in Figure 3. This operation is related to feature selection since its caller (train_df) was annotated as features and it operates at the level of columns (the condition is satisfied by this invocation of the operation). To find the columns that were dropped, we again need to follow the input edge of drop backwards until we reach the constants ‘Target’ and ‘SSN’.

Our provenance tracking algorithm is illustrated in Figure 8. The algorithm takes as input the annotated WIR and the KB, and returns two column sets: (1) the columns from which features/labels were explicitly derived (the inclusion set) and (2) the columns that were explicitly excluded from the features/labels (the exclusion set). The algorithm scans each PR to find the ones with a variable that has been annotated as features (or labels) and an operation that can potentially be used for feature (or label) selection based on the information stored in the KB (lines 2-3). A core component of the algorithm is the traversal operator (shown in Figure 8) that starts a guided traversal of the WIR based on the information in the KB.

For each of the selected PRs, the operator queries the KB and obtains the corresponding condition, column_exclusion flag, and traversal rule (line 1). If a condition exists but is not matched by the particular invocation, we can deduce that the operation was not used for feature (or label) selection and return without further action (line 2). Otherwise, the operator checks if this PR contains constant values in its input set (line 3). If so, it incorporates the discovered constant values/range of column indices into the inclusion/exclusion sets based on the column_exclusion flag. In case the PR does not directly contain the columns, the operator follows the traversal rule to obtain a new PR in the WIR (line 8) that needs to be evaluated, and then calls itself again on this PR (line 9).

Algorithm PTracker: given the annotated WIR and the knowledge base, initialize the inclusion and exclusion sets; for each PR that has a variable annotated as features or labels and an operation registered in the KB table, invoke the traversal operator; the operator fetches the condition, the column_exclusion flag, and the traversal rule from the KB, returns if the condition is not satisfied, adds constant inputs to the inclusion or exclusion set depending on the flag, and otherwise follows the traversal rule to a new PR and recurses; finally, return the inclusion and exclusion sets.
Figure 8: Provenance tracking algorithm
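The following condensed sketch mimics the tracker's behavior on the drop example of Figure 3. The PR structure, the annotation map, and the KB_TRACKER table are illustrative assumptions, not Vamsa's implementation.

```python
import ast

# PRs as (inputs, caller, operation, outputs); annotations come from the ML Analyzer.
prs = [
    (["['Target', 'SSN']"], "train_df", "drop", ["train_x"]),
]
ann = {"train_df": "features"}
KB_TRACKER = {
    # operation -> (condition, column_exclusion flag, traversal rule)
    "drop": ("axis == 1", True, "follow the input edge backwards"),
}

included, excluded = set(), set()
for inputs, caller, op, outputs in prs:
    roles = {ann.get(v) for v in inputs + [caller] + outputs}
    if op in KB_TRACKER and roles & {"features", "labels"}:
        condition, is_exclusion, rule = KB_TRACKER[op]
        # We assume the axis == 1 condition holds for this invocation; the traversal
        # rule then leads to the constant column names found among the inputs.
        cols = set(ast.literal_eval(inputs[0]))
        (excluded if is_exclusion else included).update(cols)

print(included, excluded)   # set() {'Target', 'SSN'}
```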

Example 8: Continuing Example 7, the provenance tracking algorithm finds the drop operation with a caller that was annotated as features (Figure 3) and thus invokes the traversal operator. Since the corresponding path query is satisfied, we know that the operation is used for feature selection and, in particular, feature exclusion (based on the information in the KB). Thus, the algorithm follows the traversal rule to perform a backward traversal from the operation’s input edge until it finds the constants ‘Target’ and ‘SSN’. These two columns are then added to the exclusion set.

When the feature tracking algorithm finds the Subscript operation (Figure 7) in the annotated WIR, it invokes the traversal operator again. Note that the Subscript operation does not have an associated path query in the KB. Thus, the operator only obtains the corresponding traversal rule from the KB and initiates a backward traversal starting from the input edge of the Subscript operation. A similar process is performed when the operator visits ExtSlice and Slice nodes. Using the traversal rule associated with the Slice operation in the KB, the algorithm looks for a range of columns whose lower and upper bounds can be found by traversing the appropriate input edges of the Slice node (see Figure 7).

Complexity Analysis. Let F be the set of variables that were annotated as features or labels, and let k be the maximum number of operations that are directly connected to a variable in F. The provenance tracker algorithm scans all the PRs to find this set and evaluates, in constant time, whether the corresponding operations are related to feature/label selection. If an operation is indeed related to feature selection, the algorithm follows the traversal rule, which in the worst case visits all the edges E of the WIR. The algorithm thus has O(|F| · k · |E|) complexity. Note that in practice |F| and k are small.

7 Experimental Evaluation

Table 2: Output of the pre-processing pipeline, showing for each dataset the number of scripts that are error-free and Python 3 compatible, and the number of scripts that import the selected ML libraries.

In this section, we evaluate Vamsa on a large set of Python scripts and provide an analysis of our experimental results. Our experiments are designed to answer the following questions: (1) What is the accuracy of Vamsa in identifying the features used to train ML models?; (2) How often is Vamsa able to extract provenance information (coverage) from a data science script?; (3) What is the latency of Vamsa?

Table 3: Accuracy of Vamsa on the labeled datasets, reporting precision, recall, and Jaccard coefficient for column exclusion and column inclusion, as well as the annotation precision for models and training datasets.

7.1 Experimental Settings

Datasets. To evaluate Vamsa on a variety of data science scripts, we downloaded a large set of publicly available Python scripts from two different data sources: (1) a large-scale corpus of Python notebooks published in 2017 and crawled from public repositories [54] (the notebooks dataset); from this corpus, we excluded the notebooks that do not include any import statement; and (2) a set of Python scripts that we downloaded via the public Kaggle API (https://github.com/Kaggle/kaggle-api) (the Kaggle dataset).

Dataset pre-processing pipeline. Real-world scripts may have syntax errors or may not be compatible with Python 3 (which is the version of Python that Vamsa’s implementation currently targets). Moreover, not all of them train machine learning models. For these reasons, we created a data pre-processing pipeline that applies various filters to the scripts in order to capture only those that are relevant to the ML provenance tracking problem. The pipeline is invoked on both the notebooks and the Kaggle datasets.

We now show how the pipeline works using the notebooks dataset as an example. The pipeline takes as input the Python scripts and prunes those for which we cannot generate the corresponding abstract syntax tree due to syntax errors, exceptions triggered by Python’s AST generation module, or incompatibility with Python 3. The pipeline then prunes the scripts that do not import any of the following ML frameworks: scikit-learn, XGBoost [2], and LightGBM, as well as the scripts that do not invoke any training-related operations from these frameworks (e.g., fit, create, and train). Note that most of our experiments are performed on the resulting dataset, as we have populated our KB with APIs from the selected ML libraries discussed above. Note also that it is easy to extend to other libraries by simply populating the KB (no code changes are required).

Table 2 shows more information about the notebooks and Kaggle datasets after the pipeline has been applied to them.

Experimental methodology. A challenge when evaluating Vamsa with such large-scale corpora is determining the correctness of the output. Unfortunately, due to the novel nature of ML provenance tracking, there is no public benchmark available. The brute-force approach would be to manually go over the corpus and determine the relationships between ML models and data sources so that we can evaluate Vamsa’s output. Since this is not feasible at the scale we are operating, we decided to perform two classes of experiments. First, we select a small subset of scripts for which we manually extract the provenance information (ground truth) and evaluate the accuracy of Vamsa on those. (Note that we plan to open-source this dataset so that it can be used as a benchmark for future ML provenance tracking efforts.) The second class of experiments is performed on the large corpus. The goal is to evaluate the coverage of the system, defined as how often Vamsa extracts the provenance information. We also evaluate both component-level and end-to-end system performance.

Hardware and software configuration. We conducted our experiments on a Linux machine powered by an Intel CPU. For all the experiments we used Python 3. We manually populated our knowledge base with the APIs from scikit-learn, XGBoost, LightGBM, and Pandas [37].

Table 4: Vamsa coverage in the large-scale evaluation, reporting the ML Analyzer coverage (for models and training datasets) and the Provenance Tracker coverage for each dataset.

7.2 Experiments with Labeled Datasets

These experiments evaluate the accuracy of Vamsa on a set of Python scripts for which we have manually extracted the relationship between data sources and ML models. From each of the notebooks and Kaggle datasets, we randomly selected 150 scripts, ensuring that Vamsa can produce output for all the selected scripts. We evaluate the accuracy of Vamsa on both column exclusion and column inclusion using three metrics: precision, recall, and Jaccard coefficient. The precision shows the proportion of discovered included/excluded columns that were truly included/excluded in the feature selection process. The recall shows the proportion of the truly included/excluded columns that were discovered by Vamsa. The Jaccard coefficient evaluates the similarity of the two sets. The higher the values of these metrics, the better the accuracy of Vamsa. Given a script, the ground truth consists of two sets, namely the included columns I_g and the excluded columns E_g. Let E denote the exclusion set produced by Vamsa. The metrics for column exclusion are defined as follows:

Precision = |E ∩ E_g| / |E|   (1)
Recall = |E ∩ E_g| / |E_g|   (2)
Jaccard = |E ∩ E_g| / |E ∪ E_g|   (3)

The metrics for column inclusion are defined analogously, using the column inclusion set that Vamsa produces and the included columns I_g in the ground truth.
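For reference, the following is a direct transcription of Eqs. (1)-(3) for a single script; the convention used for empty sets is our own.

```python
def exclusion_metrics(E, E_g):
    """E: columns Vamsa reports as excluded; E_g: ground-truth excluded columns."""
    inter = E & E_g
    precision = len(inter) / len(E) if E else 1.0        # convention: empty set -> 1.0
    recall = len(inter) / len(E_g) if E_g else 1.0
    jaccard = len(inter) / len(E | E_g) if E | E_g else 1.0
    return precision, recall, jaccard

print(exclusion_metrics({"Target", "SSN"}, {"Target", "SSN"}))   # (1.0, 1.0, 1.0)
```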

Additionally, we investigate in more detail how often Vamsa correctly identifies which variables correspond to ML models and which to training datasets, as this is a prerequisite for correctly identifying the features/labels. To this end, we also report results that show the precision of the annotation phase (for both models and training datasets). The annotation precision shows the proportion of discovered models/training datasets that were true models/training datasets according to the manual labeling we did on the two datasets.

Table 3 shows the results on the two datasets. For each metric, we report the average values obtained over the scripts of the dataset. As shown in the table, Vamsa achieves high precision and recall values for all the tasks evaluated. Overall, we can make the following observations:


  1. When Vamsa identifies a model, its training dataset, and the corresponding features, the output is highly reliable.

  2. Vamsa reported models accurately and made a few mistakes in detecting their training datasets. We further investigated these scripts and found that the data scientists appended the testing data to the training data in order to perform global value transformations. The merged test data then got separated via a slicing operation immediately before training. Vamsa’s annotation algorithm was not able to follow this operation, i.e., a merge followed by a split, and mistakenly identified the testing dataset as the training dataset.

  3. Vamsa detects column exclusion sets slightly better than column inclusion ones. This is because, for column exclusion, data scientists typically use a set of specific APIs such as drop and pop from Pandas, or the del keyword, which can be tracked more easily.

Figure 9: Latency breakdown

7.3 Large-scale Experiments

In these experiments, we use a large corpus of Python scripts from both the notebooks and the Kaggle datasets produced by the pre-processing pipeline. The goal is to evaluate the coverage of the various components of Vamsa (Derivation Extractor, ML Analyzer, Provenance Tracker) as well as the performance/efficiency of the system. We also present a detailed analysis of the cases where Vamsa was not able to produce an answer.

Derivation Extractor. First, we evaluate the coverage of Vamsa on generating the workflow intermediate representation for our scripts. Note that the Derivation Extractor is a standalone component that does not rely on the KB to produce the WIR. For this reason, we perform this experiment using the two large datasets derived from the notebooks and Kaggle corpora, which import a wide variety of ML libraries.

Our results show that Vamsa successfully generates the WIR for the vast majority of the scripts in both datasets. The few cases where Vamsa is not able to produce a WIR are mainly due to Vamsa’s current implementation. In particular, we have not yet covered certain constructs in the Python grammar such as DictComp, SetComp, and JoinedStr. However, we’d like to note that incorporating these constructs in Vamsa is solely a matter of extending the implementation and does not require any change in Vamsa’s design or architecture.

ML Analyzer. The goal of this experiment is to investigate the coverage of the ML Analyzer and, in particular, how often the annotation algorithm identifies ML models and training datasets. In this evaluation, we use the notebooks and Kaggle datasets produced by the pre-processing pipeline.

Table 4 shows the percentage of cases where the ML Analyzer can annotate at least one variable as a model and at least one other variable as a training dataset. As shown in the table, Vamsa can report models and training datasets for the large majority of the scripts, with the coverage being slightly lower for one of the two datasets.

To better understand the cases where Vamsa was not able to perform the annotation, we first investigated the cases where the ML Analyzer could not find a model. We identified the following reasons for the failure: 1) Some scripts called APIs commonly used for training models, such as fit, to perform other operations such as feature extraction. In these cases, the ML Analyzer correctly did not report any model. 2) In a few scripts, the statements used to train a model were commented out. This was not detected by our pre-processing pipeline and thus these scripts were falsely included in the final dataset. 3) Some scripts imported modules using the * notation. In these cases, Vamsa could not relate the import statement to the API calls. 4) In a few other scripts, the data scientist imported a module with an alias name and used the alias when invoking the APIs. Vamsa’s implementation does not currently cover such cases. We are continuously investigating these issues and improving Vamsa’s implementation.

We have also explored the cases where the ML Analyzer could not find a training dataset. We found that: 1) in some scripts, hard-coded data, e.g., a large numpy array, was used as the training data, and 2) some APIs are not present in our KB and thus the annotation algorithm is not able to perform back-propagation. However, we’d like to point out that these cases would be covered by extending our KB to include more APIs. We note that allowing users to increase coverage by enhancing the KB as needed was one of the major requirements behind Vamsa’s design.

Provenance Tracker. We evaluate the Provenance Tracker component on the notebooks and Kaggle datasets. Table 4 shows the percentage of cases where the Provenance Tracker can identify at least one set of features. Note that the Provenance Tracker is invoked only if the ML Analyzer can identify a model and its corresponding training dataset. We thus expect the coverage of this component to be bounded by the coverage of the ML Analyzer.

As shown in Table 4, Vamsa reports a non-empty column set for the majority of the scripts in both datasets. We have also analyzed the cases where Vamsa could find both a ML model and a training dataset but did not discover the column set. The main reasons for this behavior are the following: 1) in some scripts, the columns have not been selected explicitly but based on a condition on their values (e.g., a column is in the feature set if it contains a sufficient number of non-zero values), 2) similar to the ML Analyzer, some scripts required new rules to be added to the KB for the Provenance Tracker to operate correctly, and 3) some scripts did not include any feature selection operations and thus Vamsa did not produce any output.

Figure 10: Latency while varying the script size (panels (a) and (b) correspond to the two datasets)

7.4 Performance Experiments

In this experiment, we evaluate the efficiency of each component of Vamsa as well as the end-to-end latency. To this end, we use the subsets of the two datasets on which Vamsa operates end-to-end successfully.

Breaking down the latency. We now evaluate the performance of the Derivation Extractor, ML Analyzer, and Provenance Tracker, as well as the end-to-end latency. Figure 9 shows the results. We observe that the time spent by each component is negligible on both datasets and on average is in the order of milliseconds. Furthermore, most of the time is spent on derivation extraction in comparison to the other components. Breaking down the Derivation Extractor tasks into AST generation, PR generation, and WIR composition, we observed that most of the time is spent in AST generation and WIR composition on both datasets, with each step taking on the order of a few milliseconds on average.

Performance of Vamsa varying the lines of code. We further evaluate the performance of each component and the end-to-end performance as the number of lines of code in the script varies. Figures 10(a) and 10(b) show the average latency of the components as the script size varies for both of our datasets. We see that increasing the number of lines of code in a Python script naturally increases the latency of all Vamsa components. However, even for the largest scripts in our datasets, Vamsa produces output and collects the provenance information within a few seconds (see Figures 10(a) and 10(b)).

Size of intermediate representation. To gain more insight about the datasets, we evaluate the size of the generated WIRs. Figure 11 shows the average number of WIR nodes and edges along with the average number of lines of code in the scripts. We see that, on average, each line of code results in a small number of nodes and edges in the WIR.

Figure 11: Size of WIR

8 Related Work

We describe relevant related work from three areas: model management systems, provenance in databases, and workflow management systems.

Model management systems. There has been emerging interest in developing machine learning life-cycle management systems [55, 26, 64, 41, 56, 40, 63]. ModelDB [64] was one of the first open-sourced model management systems. It focused on storing trained models and their results to enable visual exploration and querying of metadata and artifacts (e.g., hyperparameters and accuracy metrics). ModelDB requires users to change their scripts to comply with its API for logging (e.g., adding “sync” to function calls), and it works for a specific set of libraries. ModelHub [41] aimed to store model weights across different versions with a focus on deep learning. It is a more fine-grained versioning system for ML artifacts than general-purpose systems such as Git. ModelHub enables querying on hyperparameters, accuracy, and information loss during training phases. Amazon’s ML experiments system [56] focuses on tracking metadata and provenance of ML experimentation data. This system automates provenance extraction for SparkML [39] and scikit-learn [46] pipelines whenever a logical abstraction of operations (e.g., estimators/transformers) is available. ProvDB [40] proposed a graph data model and two graph query operators to store and query the provenance in data science projects. It works by ingesting the provenance via shell commands, similarly to versioning systems such as Git. The major focus of ProvDB is to efficiently store and query ML provenance data. Finally, to enable model diagnosis, Mistique [63] was developed to store model intermediates that are produced in different stages of traditional ML pipelines or hidden representations in deep learning (i.e., neuron activations produced by different layers in a neural network).

In contrast to this line of model management systems, (1) Vamsa does not require developers to modify their code, (2) Vamsa can operate on top of any library as long as it is included in the KB, (3) by focusing on the process of provenance extraction rather than efficiently storing the captured data, Vamsa is complementary to systems such as ProvDB [40] and Mistique [63], and (4) none of the previous systems aimed to track the features that are used to train ML models.

Provenance in databases. Capturing lineage or provenance has been studied extensively for databases (e.g., see the surveys [60, 31, 20, 17]). Data provenance typically describes where data came from, why an output record was produced [18], and how it was derived [27]. Provenance can be captured at different granularities and in various levels of detail [40]. It can be as coarse-grained as datasets, files, and their dependencies [14] or fine-grained, including dependencies between input, intermediate, and output records [20, 35, 51]. The value of provenance is best exemplified by the applications that it can support including, but not limited to, explanations [44, 65, 24]; interactive visualizations [51, 50, 52, 30]; verification and recomputation when data sources are outdated or unreliable [32]; debugging [33, 36, 21]; data integration [22]; auditing and compliance [4]; and security [19, 35]. Finally, central to how provenance is utilized across domains is the task of provenance querying, a task complementary to provenance capture. More specifically, provenance querying includes several sub-problems involving the construction of provenance query languages [35], provenance browsers [52, 30], and efficient indexing schemes for captured provenance to streamline provenance queries [35, 51, 53].

In contrast to this line of work, Vamsa is specifically designed to automatically capture provenance in data science scripts written in Python. The provenance captured by Vamsa can be used to answer queries such as “Which datasets were used to train/test a model?”, “Which libraries have been used in this script?”, and “What type of ML model was trained?”. As such, the granularity of the provenance is at the level of variables and the operations that derive output variables from them. These operations can be API calls (e.g., read_csv() from the Pandas library), accesses to object properties (e.g., .values), and user-defined functions, including their implementations. In other words, the way data science logic is specified in Python bears little to no resemblance to how queries are structured in databases, and ML provenance tracking therefore requires capture techniques tailored to the intrinsic semantics of how data science logic is implemented in Python. Nevertheless, the provenance information identified by Vamsa can be stored and consumed by upstream applications, through provenance querying systems, much like provenance extracted from database workloads.
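To make this granularity concrete, consider a short hypothetical script of the kind Vamsa targets; the file name, column names, and model choice below are illustrative placeholders rather than examples from our corpus:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Which dataset was used to train the model?  -> "patients.csv"
df = pd.read_csv("patients.csv")

# Which columns were used to derive the features?  -> age, cholesterol
X = df[["age", "cholesterol"]].values   # property access (.values)
y = df["disease"].values

# What type of ML model was trained?  -> LogisticRegression (scikit-learn)
model = LogisticRegression()
model.fit(X, y)
```

Provenance at the level of variables and operations connects model back to X and y, and through them to the columns of df and ultimately to patients.csv, which is exactly the information needed to answer the queries above.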

Workflow management systems. Workflows are widely used in scientific applications [47, 23, 28, 42, 29, 14], where scientists glue various tasks together and each task may take input data from previous tasks [25]. Workflow management systems aid in collecting, managing, and analyzing provenance information to enable sharing of experiments and ensure their reproducibility [25]. Closest to our work are the StarFlow [16, 15], noWorkflow [43, 48], and YesWorkflow [38] systems. StarFlow [16, 15] statically analyzes a Python script to build provenance traces at the level of functions; however, it does not extract the dependencies inside the functions. noWorkflow transparently captures control-flow information and library dependencies in the scripts, and extracts provenance at three levels: definition, deployment, and execution. To monitor the evolution of scripts, it also uses abstract syntax trees (ASTs) to discover user-defined functions and their arguments [43, 48]. While StarFlow [16] and YesWorkflow [38] require modifications to the users’ scripts, noWorkflow handles unmodified programs.
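As a rough illustration of this kind of AST-based discovery (a minimal sketch using Python’s standard ast module, not noWorkflow’s actual implementation), user-defined functions and their arguments can be enumerated statically as follows:

```python
import ast

script = """
def normalize(df, cols):
    return (df[cols] - df[cols].mean()) / df[cols].std()

def train(X, y):
    return None
"""

# Walk the AST and report every user-defined function with its arguments.
for node in ast.walk(ast.parse(script)):
    if isinstance(node, ast.FunctionDef):
        args = [a.arg for a in node.args.args]
        print(f"function {node.name!r} with arguments {args}")
# Output:
#   function 'normalize' with arguments ['df', 'cols']
#   function 'train' with arguments ['X', 'y']
```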

Our work differs from these systems as follows: (1) by focusing on data science scripts, we are able to capture provenance information that is more relevant to our needs, such as trained models, hyperparameters, datasets, and features; (2) noWorkflow also generates a dependency graph among variables, but this graph is not produced statically: the program needs to be executed, which is not always possible due to external dependencies on both the datasets and the various imported libraries (indeed, noWorkflow uses the AST not to generate a workflow graph but only to detect user-defined functions and their arguments); (3) as opposed to YesWorkflow and StarFlow, Vamsa does not require users to modify their code, e.g., by adding tags and decorators to function and variable definitions. To this end, it uses the knowledge base (Section 5.1) to automatically extract the relevant provenance information from the script.
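For intuition on point (2), the following minimal sketch derives coarse variable-level dependencies purely statically from assignment statements; it illustrates the general idea rather than Vamsa’s actual derivation, which also resolves library semantics through the knowledge base:

```python
import ast
from collections import defaultdict

script = """
import pandas as pd
df = pd.read_csv("train.csv")
X = df.drop("label", axis=1)
y = df["label"]
"""

deps = defaultdict(set)  # variable -> names it was derived from

for node in ast.walk(ast.parse(script)):
    if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
        target = node.targets[0].id
        # Every name read on the right-hand side is a (coarse) dependency.
        deps[target] = {n.id for n in ast.walk(node.value)
                        if isinstance(n, ast.Name)}

print(dict(deps))   # {'df': {'pd'}, 'X': {'df'}, 'y': {'df'}}
```

No execution, and hence no access to train.csv or to the imported libraries, is required to derive these edges.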

9 Conclusions and Future Work

ML has become a ubiquitous and integral technology across the stacks of enterprise-grade applications. Unfortunately, the management of machine learning logic is still in its infancy. In this direction, we introduced in this paper the problem of ML provenance tracking, with the goal of identifying the connections between data sources and the features of ML models: a fundamental type of provenance information that enables multiple upstream management applications, including, but not limited to, compliance, security, model maintenance, and model debugging. Our proposed system, Vamsa, and our experimental evaluation show that it is indeed possible to recover this type of provenance with high precision and recall across large corpora of ML scripts authored imperatively in Python, even in the hard setting of static analysis, where no runtime information is available. Finally, we believe that the techniques and components of Vamsa (e.g., the knowledge base and the framework for forward and backward traversals) are broadly applicable beyond the design of Vamsa and the scope of our work.

There are many areas for future work, both in the space of ML provenance tracking and in the broader space of automated management of ML pipelines. First, incorporating runtime information, when available, can help us better identify connections between data sources and ML models (e.g., access to the data sources lets us determine the exact set of excluded features). Second, and in line with the first direction, identifying finer-grained provenance between data sources and ML models (e.g., which partitions of a data source were used to train a model) can better assist upstream applications (e.g., model debugging and compliance). Such information can be obtained either statically (e.g., by identifying filters in Python scripts) or dynamically (e.g., by tracing the inputs to ML models through inspection of program stacks and data flows at runtime). Finally, automatically populating the knowledge base, or decreasing the manual effort required to populate it (e.g., functions that have not changed across library releases can share the same annotations), is a technically challenging problem, yet it is integral to the management of imperatively specified ML pipelines.
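As a sketch of what the static option might look like (a hypothetical illustration on Python 3.9+, not an implemented Vamsa component), Pandas-style boolean-mask filters can be spotted by looking for subscript expressions whose index is a comparison:

```python
import ast

script = """
import pandas as pd
df = pd.read_csv("visits.csv")
recent = df[df["year"] >= 2018]       # keep only recent visits
adults = recent[recent["age"] > 17]   # keep only adult patients
"""

# Report subscripts whose index is a comparison, i.e. the typical
# Pandas boolean-mask filter pattern df[df[col] <op> value].
for node in ast.walk(ast.parse(script)):
    if isinstance(node, ast.Subscript) and isinstance(node.slice, ast.Compare):
        print("row filter:", ast.unparse(node))
# Output:
#   row filter: df[df['year'] >= 2018]
#   row filter: recent[recent['age'] > 17]
```

Recording such filters alongside column-level provenance would help narrow down which partitions of a data source actually reached a model.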

References

  • [1] Graphlab. https://turi.com/l, 2013.
  • [2] Xgboost. https://xgboost.readthedocs.io/en/latest/index.html, 2014.
  • [3] lightgbm. https://lightgbm.readthedocs.io/en/latest/, 2017.
  • [4] EU GDPR Regulations. https://ec.europa.eu/commission/priorities/justice-and-fundamental-rights/data-protection/2018-reform-eu-data-protection-rules/eu-data-protection-rules_en, 2018.
  • [5] HIPAA Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/index.html, 2018.
  • [6] Kaggle Heart Disease. https://www.kaggle.com/ronitf/heart-disease-uci, 2018.
  • [7] Kaggle survey. https://www.kaggle.com/kaggle/kaggle-survey-2018, 2018.
  • [8] Official Kaggle API. https://github.com/Kaggle/kaggle-api, 2018.
  • [9] Abstract syntax trees. https://docs.python.org/3/library/ast.html, 2019.
  • [10] Python AST docs. https://greentreesnakes.readthedocs.io/en/latest/, 2019.
  • [11] Python language. https://towardsdatascience.com/programming-languages-for-data-scientists-afde2eaf5cc5, 2019.
  • [12] PyTorch. https://pytorch.org/, 2019.
  • [13] A. Agrawal, R. Chatterjee, C. Curino, A. Floratou, N. Gowdal, M. Interlandi, A. Jindal, K. Karanasos, S. Krishnan, B. Kroth, J. Leeka, K. Park, H. Patel, O. Poppe, F. Psallidas, R. Ramakrishnan, A. Roy, K. Saur, R. Sen, M. Weimer, T. Wright, and Y. Zhu. Cloudy with high chance of DBMS: A 10-year prediction for Enterprise-Grade ML, 2019.
  • [14] I. Altintas, O. Barney, and E. Jaeger-Frank. Provenance collection support in the kepler scientific workflow system. In IPAW, pages 118–132, 2006.
  • [15] E. Angelino, U. Braun, D. A. Holland, and D. W. Margo. Provenance integration requires reconciliation. In TaPP, 2011.
  • [16] E. Angelino, D. Yamins, and M. Seltzer. Starflow: A script-centric data analysis environment. In IPAW, 2010.
  • [17] R. Bose and J. Frew. Lineage retrieval for scientific data processing: a survey. CSUR, pages 1–28, 2005.
  • [18] P. Buneman, S. Khanna, and T. Wang-Chiew. Why and where: A characterization of data provenance. In ICDT, 2001.
  • [19] A. Chen, Y. Wu, A. Haeberlen, B. T. Loo, and W. Zhou. Data provenance at internet scale: architecture, experiences, and the road ahead. In CIDR, 2017.
  • [20] J. Cheney, L. Chiticariu, W.-C. Tan, et al. Provenance in databases: Why, how, and where. TRDB, pages 379–474, 2009.
  • [21] L. Chiticariu, W. C. Tan, and G. Vijayvargiya. Dbnotes: A post-it system for relational databases based on provenance. In SIGMOD, pages 942–944, 2005.
  • [22] Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. TODS, 25(2):179–227, 2000.
  • [23] S. B. Davidson and J. Freire. Provenance and scientific workflows: challenges and opportunities. In SIGMOD, 2008.
  • [24] D. Deutch, N. Frost, and A. Gilad. Provenance for natural language queries. PVLDB, 10(5):577–588, 2017.
  • [25] J. Freire and M. Anand. Provenance in scientific workflow systems. IEEE Data Engineering Bulletin, 2007.
  • [26] R. Garcia, V. Sreekanti, N. Yadwadkar, D. Crankshaw, J. E. Gonzalez, and J. M. Hellerstein. Context: The missing piece in the machine learning lifecycle. In KDD CMI Workshop, 2018.
  • [27] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In SIGMOD-SIGACT-SIGART, pages 31–40, 2007.
  • [28] T. Guedes, V. Silva, M. Mattoso, M. V. Bedo, and D. de Oliveira. A practical roadmap for provenance capture and data analysis in spark-based scientific workflows. In WORKS, 2018.
  • [29] T. Heinis and G. Alonso. Efficient lineage tracking for scientific workflows. In SIGMOD, 2008.
  • [30] M. Herschel and M. Hlawatsch. Provenance: On and behind the screens. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, 2016.
  • [31] R. Ikeda and J. Widom. Data lineage: A survey. Technical report, Stanford InfoLab, 2009.
  • [32] R. Ikeda and J. Widom. Panda: A system for provenance and data. IEEE Data Eng. Bull., 2010.
  • [33] M. Interlandi, K. Shah, S. D. Tetali, M. A. Gulzar, S. Yoo, M. Kim, T. Millstein, and T. Condie. Titian: Data provenance support in spark. PVLDB, 9(3):216–227, 2015.
  • [34] Z. Ives, Y. Zhang, S. Han, and N. Zheng. Dataset relationship management. In CIDR, 2019.
  • [35] G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In SIGMOD, 2010.
  • [36] D. Logothetis, S. De, and K. Yocum. Scalable lineage capture for debugging disc analytics. In SoCC, pages 17:1–17:15, 2013.
  • [37] W. McKinney. pandas: a foundational Python library for data analysis and statistics. PyHPC, 2011.
  • [38] T. McPhillips, T. Song, T. Kolisnik, S. Aulenbach, J. Freire, et al. Yesworkflow: A user-oriented, language-independent tool for recovering workflow information from scripts. IJDC, pages 298–313, 2015.
  • [39] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al. Mllib: Machine learning in apache spark. JMLR, pages 1235–1241, 2016.
  • [40] H. Miao and A. Deshpande. Provdb: Provenance-enabled lifecycle management of collaborative data analysis workflows. IEEE Data Eng. Bull., pages 26–38, 2018.
  • [41] H. Miao, A. Li, L. S. Davis, and A. Deshpande. Towards unified data and lifecycle management for deep learning. In ICDE, 2017.
  • [42] P. Missier, N. W. Paton, and K. Belhajjame. Fine-grained and efficient lineage querying of collection-based workflow provenance. In EDBT, 2010.
  • [43] L. Murta, V. Braganholo, F. Chirigati, D. Koop, and J. Freire. noworkflow: capturing and analyzing provenance of scripts. In IPAW, pages 71–83, 2014.
  • [44] M. H. Namaki, Q. Song, Y. Wu, and S. Yang. Answering why-questions by exemplars in attributed graphs. In SIGMOD, pages 1481–1498, 2019.
  • [45] F. Nelli. Python data analytics: with pandas, numpy, and matplotlib. 2018.
  • [46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in python. JMLR, pages 2825–2830, 2011.
  • [47] J. F. Pimentel, J. Freire, L. Murta, and V. Braganholo. A survey on collecting, managing, and analyzing provenance from scripts. CSUR, page 47, 2019.
  • [48] J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire. noworkflow: a tool for collecting, analyzing, and managing provenance from python scripts. VLDB, 2017.
  • [49] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin. Catboost: unbiased boosting with categorical features. In NIPS, 2018.
  • [50] F. Psallidas and E. Wu. Provenance for interactive visualizations. 2018.
  • [51] F. Psallidas and E. Wu. Smoke: Fine-grained lineage at interactive speed. VLDB, pages 719–732, 2018.
  • [52] E. D. Ragan, A. Endert, J. Sanyal, and J. Chen. Characterizing provenance in visualization and data analysis: an organizational framework of provenance types and purposes. IEEE Transactions on Visualization and Computer Graphics, 22(1):31–40, 2016.
  • [53] P. Ruan, G. Chen, T. T. A. Dinh, Q. Lin, B. C. Ooi, and M. Zhang. Fine-grained, secure and efficient data provenance on blockchain systems. Proc. VLDB Endow., 12(9):975–988, May 2019.
  • [54] A. Rule, A. Tabard, and J. D. Hollan. Exploration and explanation in computational notebooks. In CHI, page 32, 2018.
  • [55] S. Schelter, F. Biessmann, T. Januschowski, D. Salinas, S. Seufert, G. Szarvas, M. Vartak, S. Madden, H. Miao, A. Deshpande, et al. On challenges in machine learning model management. IEEE Data Eng. Bull., pages 5–15, 2018.
  • [56] S. Schelter, J.-H. Böse, J. Kirschnick, T. Klein, and S. Seufert. Automatically tracking metadata and provenance of machine learning experiments. In Machine Learning Systems workshop at NIPS, 2017.
  • [57] P. M. Schwartz and D. J. Solove. The pii problem: Privacy and a new concept of personally identifiable information. NYUL, page 1814, 2011.
  • [58] S. Seabold and J. Perktold. Statsmodels: Econometric and statistical modeling with python. In Scipy, 2010.
  • [59] L. Shao, Y. Zhu, A. Eswaran, K. Lieber, J. Mahajan, M. Thigpen, S. Darbha, S. Liu, S. Krishnan, S. Srinivasan, C. Curino, and K. Karanasos. Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-based Platforms, 2019.
  • [60] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. Sigmod Record, pages 31–36, 2005.
  • [61] L. Torczon and K. Cooper. Engineering A Compiler. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2nd edition, 2011.
  • [62] J. W. Tukey. We need both exploratory and confirmatory. The American Statistician, pages 23–25, 1980.
  • [63] M. Vartak, J. M. F da Trindade, S. Madden, and M. Zaharia. Mistique: A system to store and query model intermediates for model diagnosis. In SIGMOD, 2018.
  • [64] M. Vartak, H. Subramanyam, W.-E. Lee, S. Viswanathan, S. Husnoo, S. Madden, and M. Zaharia. Modeldb: a system for machine learning model management. In HILDA, 2016.
  • [65] E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553–564, 2013.