I Introduction
Provenance capture and analysis is increasingly seen as a crucial enabler for prosperous data science activities [1, 2, 3, 4, 5]. Capturing provenance allows practitioners to introspect the data analytics trajectories, monitor ongoing modeling activities, increase auditability, aid reproducibility, and communicate the practice with others [2]. Specific systems have been developed to help diagnose distributed dataflow programs [6, 7], ingest provenance in the lifecycle [5, 2], and manage pipelines for high-level modeling paradigms [8, 9].
Compared with well-established data provenance systems for databases [10] and scientific workflow systems for e-science [11], building provenance systems for data science faces an unstable data science lifecycle that is often ad hoc, typically featuring highly unstructured datasets, an amalgamation of different tools and techniques, significant back-and-forth among team members, and trial-and-error to identify the right analysis tools, models, and parameters. Schema-later approaches and graph data models are often used to capture the lifecycle, versioned artifacts and associated rich information [1, 2, 4], which also echoes the modern provenance data model standardization that resulted from a long period of community consolidation for scientific workflows [12] and the Web [13].
Although data science lifecycle provenance has enormous potential value, e.g., for reproducing results or accelerating the modeling process, the evolving and verbose nature of the captured provenance graphs makes them difficult to store and manipulate. Depending on the granularity, storing the graphs could take dozens of GBs within several minutes [14]. More importantly, the verbosity and diversity of the provenance graphs make it difficult to write general queries to explore and utilize them; there are often no predefined workflows, i.e., the pipelines change as the project evolves, and instead we have arbitrary steps (e.g., trial and error) in the modeling process. In addition, though storing the provenance graph in a graph database seems like a natural choice, most of the provenance query types of interest involve paths [15], and require returning paths instead of answering yes/no queries like reachability [16]. Writing queries to utilize the lifecycle provenance is beyond the capabilities of the basic pattern matching (BPM) and regular path query (RPQ) support in popular graph databases [17, 18, 19]. For example, answering ‘how is today’s result file generated from today’s data file’ requires a segment of the provenance graph that includes not only the mentioned files but also other files that are not on the path and that the user may not know about at all (e.g., ‘a configuration file’); answering ‘how do the team members typically generate the result file from the data file?’ requires summarizing several results of the above query while keeping the summary meaningful from a provenance perspective.
Lack of proper query facilities in modern graph databases not only limits the value of lifecycle provenance systems for data science, but also of other provenance systems. The specialized query types of interest in the provenance domain [15, 16] have often led provenance systems to implement specialized storage systems [20, 21] and query interfaces [22, 23] on their own [11]. Recent works in the provenance community propose various graph transformations for different tasks, which are essentially different template queries from the graph querying perspective; these include grouping vertices together to handle publishing policies [24], aggregating vertices in a verbose graph to understand commonalities and outliers [25], and segmenting a provenance graph via a declarative language for feature extraction in cybersecurity [26]. Our goal with this work is to initiate a more systematic study of abstract graph operators that modern graph databases need to support to be a viable option for storing provenance graphs.
Toward that end, in this paper, we propose two graph operators for common provenance queries to let the user explore the evolving provenance graph without fully understanding the underlying provenance graph structure. The operators not only serve our purpose in the context of data science but also help other provenance applications which have no clear workflow skeletons and are verbose in nature (e.g., [14, 25, 26]).
First, we introduce a flexible graph segmentation operator for the scenario when users are not familiar with the evolving provenance graph structure but still need to query retrospective provenance. It queries the retrospective provenance between a set of source vertices (e.g., today’s data file) and a set of destination vertices (e.g., today’s result file) under certain boundary criteria (e.g., hops, timestamps, authorship). The operator is able to induce vertices that contribute to the destination vertices in a similar way w.r.t. the given source vertices within the specified boundary. Parts of the segmentation query require a context-free language (CFL) to express their semantics. We study how to support such CFL queries in provenance graphs, exploit novel grammar rewriting schemes, and propose evaluation techniques that run orders of magnitude faster than the state of the art for our graph operator.
Second, we propose a graph summarization operator for aggregating the results of segmentation operations, in order to analyze the prospective provenance of the underlying project (e.g., the typical pipeline from the data file to the result file). It allows the user to tune the summary graph by ignoring vertex details and characterizing local structures during aggregation, and ensures the summary is meaningful from the provenance perspective through path constraints. We show that the optimal summary problem is PSPACE-complete and develop effective approximation algorithms that obey the path constraints.
We illustrate the operators on provenance data model standards (W3C PROV); the formulations and evaluation techniques are general to many provenance-aware applications. We show how to build such operators on top of modern property graph backends (Neo4j). We present extensive experiments that show the effectiveness and efficiency of our proposed techniques on synthetic provenance graph datasets mimicking real-world data science projects.
Our key contributions are as follows:

We introduce the segmentation operation for the scenario when users are not familiar with the evolving provenance graph structure but still need to query retrospective provenance. We are the first to use a context-free language as a provenance query primitive and develop efficient evaluation algorithms for provenance graphs.

We propose the summarization operation that combines similar segments and preserves provenance meaning for querying prospective provenance in lifecycles without predefined workflow skeletons. We analyze its complexity and develop efficient algorithms.

We present extensive experiments that show the effectiveness and efficiency of our proposed techniques on synthetic provenance graph datasets mimicking real-world data science projects.

The provenance queries, their formulations and evaluation techniques are general to many provenance-aware applications. We show how to build a system on top of modern graph databases.
Outline: We begin with an overview of our system and introduce the provenance model and the unique challenges of the query types of interest (Sec. II). We then describe the semantics and evaluation algorithms of the segmentation (Sec. III) and summarization (Sec. IV) operators. The implementation of the operators is discussed in Sec. LABEL:sec:system, followed by an extensive experimental evaluation (Sec. V). We summarize related work from different communities in Sec. VI and conclude with future directions in Sec. VII.
II Overview
We mainly focus on querying provenance for data science lifecycles, i.e., collaborative pipelines consisting of versioned artifacts and derivation and transformation steps, often parameterized [1, 2, 4]. We show our high-level system architecture and introduce the background with a motivating example. We then define the standard provenance model, summarize typical provenance query types of interest, and analyze them from the perspective of graph querying.
II-A System Design & Motivating Example
Data science projects can range from well-defined prediction tasks (e.g., predict labels given images) to building and monitoring a large collection of modeling or analysis pipelines, often over a long period of time [3, 27, 28, 8]. Using a lifecycle provenance management system (Fig. 1) [1, 2], details of the project progress, versions of the artifacts and associated provenance are captured and managed. In Example II-A, we use a classification task using neural networks to illustrate the system background, provenance model and query types.
In Fig. 2(a), Alice and Bob work together on a classification task to predict face ids given an image. Alice starts the project and creates a neural network by modifying a popular model. She downloads the dataset and edits the model definitions and solver hyperparameters, then invokes the training program with specific command options. After training the first model, she examines the accuracy in the log file, annotates the weight files, then commits a version using git. As the accuracy of the first model is not ideal, she changes the neural network by editing the model definition, trains it again and derives new log files and weight parameters. However, the accuracy drops, and she turns to Bob for help. Bob examines what she did, trains a new model following some best practices by editing the solver configuration in a new version, and commits a better model.
Behind the scenes, a lifecycle management system (Fig. 1) tracks user activities, manages project artifacts (e.g., datasets, models, solvers) and ingests provenance. In the Fig. 2(a) tables, we show the ingested information in detail: a) the history of user activities (e.g., the first train command uses a model and a solver and generates logs and weights), b) versions and changes of entities (e.g., the successive versions of the weights) and derivations among those entities (e.g., a later model version is derived from an earlier one), and c) provenance records as associated properties of activities and entities, ingested via provenance system ingestors (e.g., the dataset is copied from some url, Alice changes a pool layer type to AVG in a model version, and the accuracy is recorded in the logs).
Provenance Model: The ingested provenance of the project lifecycle naturally forms a provenance graph, which is a directed acyclic graph (in our system, we use versioning to avoid cyclic self-derivations of the same entity and overwritten entity generations by some activity) and encodes information with multiple aspects, such as a version graph representing the artifact changes, a workflow graph reflecting the derivations of those artifact versions, and a conceptual model graph showing the involvement of problem solving methods in the project [1, 2]. To represent the provenance graph and keep our discussion general to other provenance systems, we choose the W3C PROV data model [29], which is a standard interchange model for different provenance systems.
The full PROV data model is complex in order to satisfy application needs for different domains [29]. Due to space constraints and for simplicity, we use its core subset, which is shown in Fig. 2(b). There are three types of vertices in the provenance graph:

Entities are the project artifacts (e.g., files, datasets, scripts) which the users work on and talk about in a project, and whose provenance is managed by the underlying lifecycle management system;

Activities are the system or user actions (e.g., train, git commit, cron jobs) which act upon or with entities over a period of time;

Agents are the parties who are responsible for some activity (e.g., a team member, a system component).
Among vertices, there are five types of directed edges (the full model defines 13 types of relationships among Entity, Activity and Agent; the proposed techniques in the paper can be extended naturally to support more relation types in other provenance systems):

An activity, once started, often uses some entities (‘used’);

Some entities would then be generated by the same activity at a later time (‘wasGeneratedBy’);

An activity is associated with some agent during its period of execution (‘wasAssociatedWith’). For instance, in Fig. 2(a), the activity train was associated with Alice, used a set of artifacts (model, solver, and dataset) and generated other artifacts (logs, weights). In addition:

Some entity’s presence can be attributed to some agent (‘wasAttributedTo’), e.g., the dataset was added from external sources and attributed to Alice;

An entity was derived from another entity (‘wasDerivedFrom’), such as versions of the same artifact (e.g., the different model versions in Fig. 2(a)).
In the provenance graph, both vertices and edges have a label that encodes their vertex type (entity, activity, or agent) or edge type (one of the five relationship types above). Other provenance records are modeled as properties, ingested by the system during the period of activity executions and represented as key-value pairs.
The PROV standard defines various serializations of the conceptual model, such as RDF, XML, and JSON [13]. In our system, we use a physical property graph data model to store it, as it is more natural for the users to think of the artifacts as nodes when writing queries using Cypher or Gremlin. It is also more compact than an RDF graph for the large number of provenance records, which RDF treats as literal nodes. We discuss implementation details in Sec. LABEL:sec:system. As a summary, we formally define the provenance graph used in the rest of the paper.
[Provenance Graph] Provenance in a data science project is represented as a directed acyclic graph (DAG) whose vertices have three types (entities, activities, and agents) and whose edges have five types (used, wasGeneratedBy, wasAssociatedWith, wasAttributedTo, and wasDerivedFrom). The two label functions, one for vertices and one for edges, are total functions associating each vertex and each edge with its type. Given the set of property types used in a project and the set of their values, the vertex properties and the edge properties are partial functions from a vertex or edge and a property type to some value.
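To make the definition concrete, the following is a minimal in-memory sketch of such a typed property graph. The class name `ProvGraph`, its field names, and the vertex identifiers are hypothetical and chosen only for illustration; edge directions follow the usual PROV convention (an activity points to the entity it used, and a generated entity points to its activity), which is an assumption here. The sketch does not check acyclicity and is not the storage layer of the system.

```python
from dataclasses import dataclass, field

VERTEX_TYPES = {"entity", "activity", "agent"}
EDGE_TYPES = {"used", "wasGeneratedBy", "wasAssociatedWith",
              "wasAttributedTo", "wasDerivedFrom"}


@dataclass
class ProvGraph:
    """Hypothetical in-memory provenance graph (illustration only)."""
    vertex_type: dict = field(default_factory=dict)   # vertex id -> type (total)
    edges: list = field(default_factory=list)         # (src, dst, edge type)
    vertex_props: dict = field(default_factory=dict)  # (vertex id, key) -> value (partial)
    edge_props: dict = field(default_factory=dict)    # (edge index, key) -> value (partial)

    def add_vertex(self, vid, vtype, **props):
        if vtype not in VERTEX_TYPES:
            raise ValueError(f"unknown vertex type: {vtype}")
        self.vertex_type[vid] = vtype
        for key, value in props.items():
            self.vertex_props[(vid, key)] = value

    def add_edge(self, src, dst, etype, **props):
        if etype not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {etype}")
        if src not in self.vertex_type or dst not in self.vertex_type:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, dst, etype))
        index = len(self.edges) - 1
        for key, value in props.items():
            self.edge_props[(index, key)] = value
        return index


# A fragment of the motivating example: Alice trains the first model.
g = ProvGraph()
g.add_vertex("model-v1", "entity", name="model")
g.add_vertex("weights-v1", "entity", name="weights")
g.add_vertex("train-v1", "activity", command="train")
g.add_vertex("Alice", "agent")
g.add_edge("train-v1", "model-v1", "used")
g.add_edge("weights-v1", "train-v1", "wasGeneratedBy")
g.add_edge("train-v1", "Alice", "wasAssociatedWith")
```

Labels are stored for every vertex and edge (total functions), whereas properties are stored only for the keys that were actually ingested (partial functions), mirroring the distinction made in the definition.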
Using the PROV data model, in Fig. 2(c), we show the corresponding provenance graph of the project lifecycle listed in Fig. 2(a). Vertex shapes follow their type in Fig. 2(b). Names of the vertices (e.g., ‘model-v1’, ‘train-v3’, ‘Alice’) are made from their representative properties (i.e., project artifact names for entities, operation names for activities, and first names for agents) and suffixed with the version ids to distinguish different snapshots. Activity vertices are ordered from left to right w.r.t. the temporal order of their executions. We label the edges using their types and show a subset of the edges in Fig. 2(a) to illustrate the usage of the five relationship types. Note that there are many snapshots of the same artifact in different versions, and between the versions, we maintain derivation edges (‘wasDerivedFrom’) for efficient versioning storage. The figure shows the provenance of those entities in all three versions. The property records are shown as white boxes but not treated as vertices in the property graph.
Characteristics: The provenance graph has the following characteristics, which we need to consider when designing query facilities:

Versioned Artifact: Each entity is a point-in-time snapshot of some artifact in the project. For instance, the query ‘accuracy of this version of the model’ discusses a particular snapshot of the model artifact, while ‘what are the common updates for solver before train’ refers to the artifact but not an individual snapshot. R1: The query facilities need to support both aspects in the graph.

Evolving Workflows: The data science lifecycle is exploratory and collaborative in nature, so there is no static workflow skeleton, and there are no clear boundaries for individual runs, in contrast with workflow systems [11]. For instance, the modeling methods may change (e.g., from SVM to neural networks), the data processing steps may vary (e.g., split, transform or merge data files), and the user-committed versions may be mixed with code changes and error fixes, and thus may not serve as boundaries of provenance queries for entities. R2: The query facility for snapshots should not assume a workflow skeleton and should allow flexible boundary conditions.

Partial Knowledge in Collaboration: Each team member may work on and be familiar with a subset of artifacts and activities, and may use different tools or approaches, e.g., in Example II-A, Alice and Bob use different approaches to improve accuracy. When querying retrospective provenance of the snapshots attributed to other members or understanding the activity process over team behaviors, the user may only have partial knowledge at query time, and thus may find it difficult to compose the right graph query. R3: The query facility should support queries with partial information reflecting users’ understanding and induce correct results.

Verboseness for Usage: In practice, the provenance graph would be too verbose for humans to use and too large in volume for the system to manage easily. R4: The query facility should be scalable to large graphs and process queries efficiently.
II-B Provenance Query Types of Interest
The provenance standards (e.g., PROV, OPM) do not describe query models, as different systems have their own application-level meaning of those nodes [13, 12]. General queries (e.g., SQL, XPATH, SPARQL) provided by a backend DBMS to express provenance retrieval tend to be very complex [11]. To improve usability, a few systems provide novel query facilities [30, 22], and some of them propose special query languages [16, 23]. Recent provenance systems which adopt the W3C PROV data model naturally use graph stores as backends [14, 31]; since the standard graph query languages often cannot satisfy the needs [16, 17], a set of graph manipulation techniques is often proposed to utilize the provenance [24, 25, 26]. By observing the characteristics of the provenance graph in the analytics lifecycle and identifying the requirements for the query facilities (Sec. LABEL:subsec:datamodel), we propose two graph operators (i.e., segmentation and summarization) for general provenance graphs in the PROV data model, which we illustrate next with examples and discuss in more depth in the next section.
Segmentation: A very important provenance query type of interest is querying ancestors and descendants of entities [11, 16]. In our context, the users introspect the lifecycle and identify issues by analyzing dependencies among snapshots. The lack of a workflow skeleton and clear boundaries makes queries over the provenance graph more difficult. Moreover, the user may not be able to specify all entities of interest in a query due to partial knowledge. We propose a segmentation operator that takes sets of source and destination entities and induces other important unknown entities satisfying a set of specified boundary criteria.
In Fig. 2(d), we show two examples of provenance graph segmentation queries. In Query 1, Bob was interested in what Alice did in her version. He did not know the details of the activities and the entities Alice touched; instead he set {dataset} and {weight} as querying entities to see how the weight in Alice’s version was connected to the dataset. He filtered out edge types of no interest and excluded actions in earlier commits by setting the boundaries as two activities away from those querying entities. The system found connections among the querying entities, and included vertices within the boundaries. After interpreting the result, Bob knew that Alice had updated the model definitions in the model. On the other hand, Alice would ask a query to understand how Bob improved the accuracy and learn from him. In Query 2, instead of the learned weight, the accuracy property associated with the log entity is used as the querying entity along with the dataset. The result showed that Bob only updated the solver configuration and did not use her newly committed model.
Summarization: In workflow systems, querying the workflow skeleton (aka prospective provenance) is an important use case (e.g., business process modeling [32]) and is included in the provenance challenge [15]. In our context, even though a static workflow skeleton is not present, summarizing the skeleton of similar processing pipelines, showing commonalities and identifying abnormal behaviors are very useful query capabilities. However, general graph summarization techniques [33] are not applicable to provenance graphs due to the subtle provenance meanings and constraints of the data model [34, 24, 25]. We propose a summarization operator with multi-resolution capabilities for provenance graphs to support querying different aspects of the provenance. It operates over the query results of segmentation, allows tuning the summary by ignoring vertex details and characterizing local structures, and ensures provenance meaning through path constraints.
In Fig. 2(e), an outsider to the team (e.g., some auditor, new team member, or project manager) wanted to understand the activity overview in the project. Segmentation queries (e.g., Query 1 and Query 2 in Fig. 2(d)) only show individual trails of the analytics process at the snapshot level. The outsider issued a summarization query, Query 3, by specifying the aggregation over three types of vertices (viewing Alice and Bob as an abstract team member, ignoring details of files and activities), and defining the provenance meanings as a 1-hop neighborhood. The system merged the two segments into a summary graph according to the query. In the figure, the vertex names are suffixed with provenance types to show alternative generation processes, while edges are labeled with their frequency of appearance among segments. The query issuer could change the query conditions to derive various summaries at different resolutions.
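The aggregation step of this example can be illustrated with a deliberately naive sketch: every vertex is mapped to an abstract name, and each resulting summary edge is annotated with the number of segments in which it appears. The function names `summarize` and `abstract`, the vertex identifiers, and the edges of the two segments are hypothetical. The sketch ignores the provenance-type suffixes and the path constraints that the operator uses to keep the summary meaningful, so it should be read as intuition for the frequency labels only.

```python
from collections import Counter


def summarize(segments, key):
    """Count, for each abstracted edge, the number of segments containing it.

    segments: list of edge lists [(src, dst, edge_type), ...]
    key: function mapping a vertex id to its summary vertex name
    """
    frequency = Counter()
    for edges in segments:
        # Each summary edge is counted at most once per segment.
        frequency.update({(key(src), etype, key(dst)) for src, dst, etype in edges})
    return frequency


def abstract(vid):
    """Hypothetical aggregation: one abstract team member, version ids dropped."""
    if vid in {"Alice", "Bob"}:
        return "member"
    return vid.rsplit("-v", 1)[0]


segment_alice = [
    ("train-v2", "dataset-v1", "used"),
    ("train-v2", "model-v2", "used"),
    ("weights-v2", "train-v2", "wasGeneratedBy"),
    ("train-v2", "Alice", "wasAssociatedWith"),
]
segment_bob = [
    ("train-v3", "dataset-v1", "used"),
    ("train-v3", "solver-v3", "used"),
    ("weights-v3", "train-v3", "wasGeneratedBy"),
    ("train-v3", "Bob", "wasAssociatedWith"),
]
summary = summarize([segment_alice, segment_bob], abstract)
```

In this toy input the edge from `train` to `dataset` appears in both segments, while the edges to `model` and `solver` each appear in one, which is the kind of commonality and difference the frequency labels are meant to expose.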
III Segmentation Operation
Among the snapshots, the collected provenance graph describes important ancestry relationships which form ‘the heart of provenance data’ [16]. Often, lineages w.r.t. a query or run graph traces w.r.t. a workflow are used to formulate ancestry queries in relational databases or scientific workflows [23]. However, in our context, there are no clear boundaries of logical runs or query scopes to cleanly define the input and the output. Though a provenance graph could be collected, the key obstacle is the lack of formalisms to analyze the verbose information. In a similar situation for querying script provenance [35], Prolog was used to traverse the graph imperatively, which is overkill and entails an additional skillset for team members. We design the segmentation operator to let users who may only have partial knowledge query retrospective provenance. Its semantics induce a connected subgraph to show the ancestry relationships (e.g., lineage) among the entities of interest and include other causal and participating vertices within a specified boundary that is adjustable by the users. Next we first define the elements of the operator and the query semantics, followed by query evaluation techniques.
III-A Semantics of Segmentation
The operator is a 3-tuple query on a provenance graph, asking how a set of source entities is involved in generating a set of destination entities. It induces vertices (entities, activities and agents) that show the detailed generation process and satisfy certain boundary criteria. It returns a connected subgraph of the provenance graph that contains the query vertices and the induced vertices.
When discussing the elements of the operator below, we use the following notation for paths in the provenance graph. A path connecting two vertices is a vertex-edge alternating sequence that starts at one vertex and ends at the other, in which each edge connects the two vertices adjacent to it in the sequence. Given a path, we define its path segment by simply ignoring the first and last vertices of its path sequence. A path label function maps a path or path segment to a word by concatenating the labels of its elements in sequence order. Unless specifically mentioned, the label of each element (vertex or edge) is derived via the vertex and edge label functions. For ease of describing path patterns, for each ancestry edge (used, wasGeneratedBy), we introduce a virtual inverse edge with the corresponding inverse label. An inverse path is defined by reversing the sequence.
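The path notation can be made concrete with a small sketch. The encoding of a path as a list of tagged items, the helper names, the vertex identifiers, and the `^-1` spelling of inverse labels are all assumptions made for illustration; words are represented as tuples of labels rather than concatenated strings so that each symbol stays visible.

```python
# Hypothetical spelling of inverse labels for the two ancestry edge types.
INVERSE = {
    "used": "used^-1",
    "used^-1": "used",
    "wasGeneratedBy": "wasGeneratedBy^-1",
    "wasGeneratedBy^-1": "wasGeneratedBy",
}


def path_label(sequence, vertex_type):
    """Map a path (or path segment) to a word: the tuple of its element labels."""
    return tuple(vertex_type[x] if kind == "v" else x for kind, x in sequence)


def path_segment(path):
    """Ignore the first and the last vertex of the path sequence."""
    return path[1:-1]


def inverse_path(path):
    """Reverse the sequence and replace ancestry edge labels by their inverses."""
    return [(kind, INVERSE.get(x, x)) if kind == "e" else (kind, x)
            for kind, x in reversed(path)]


vertex_type = {"weights-v1": "entity", "train-v1": "activity", "dataset-v1": "entity"}
# A path from the weights to the dataset: vertices are tagged "v", edges "e".
path = [("v", "weights-v1"), ("e", "wasGeneratedBy"), ("v", "train-v1"),
        ("e", "used"), ("v", "dataset-v1")]

label = path_label(path, vertex_type)
segment_label = path_label(path_segment(path), vertex_type)
inverse_label = path_label(inverse_path(path), vertex_type)
```

Here `label` is the five-symbol word for the whole path, `segment_label` drops the two endpoint symbols, and `inverse_label` reads the same path from the dataset towards the weights using the inverse edge labels.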
Next we discuss semantics and our rationale in detail.
III-A1 Source & Destination Entities
Provenance is about the entities. In a project, the user knows the committed snapshots (e.g., data files, scripts, and their metadata) better than the detailed processes generating them. When writing a query, we assume the user believes that the source entities may be ancestry entities of the destination entities. The operator then reasons about the connectivity between the source and destination entities, and shows other vertices and the generation process, which the user may not know and thus may not be able to write a query for. Note that users may not know the existence order of entities either, so we allow the source and destination sets to overlap, and even to be identical. In the latter case, the user could be a program [26] that is not familiar with the generation process at all.
III-A2 Induced Vertices
Given the source and destination entities, the induced vertices are intuitively the vertices (entities, activities and agents) contributing to the generation process. Which vertices should be induced is the core question to ask. The induced set should reflect the generation process precisely and, preferably, concisely, in order to help the user introspect part of the generation process and make decisions.
Prior work on inducing subgraphs from a set of vertices does not fit our needs. First, a lineage query would generate all ancestors of the destination entities, which is not concise or even precise: siblings of the destination entities and siblings of entities along the paths may be excluded, as they do not have a path from the source entities or to the destination entities (e.g., the log in Query 1). Second, at the other extreme, a provenance subgraph induced from some paths [23] or all paths [24] among the query vertices will only include vertices on the paths, and thus exclude other contributing ancestors of the destination entities (e.g., the model and solver in Query 1). Moreover, quantitative techniques used in domains other than provenance cannot be applied directly either. An example is keyword search over graph data [36], which also does not assume that users have full knowledge of the graph, and lets users use keywords to match vertices and then induces a connected subgraph among the keyword vertices. However, these techniques often use tree structures (e.g., least common ancestor, Steiner tree) to connect the query vertices and are not aware of provenance domain knowledge, and thus cannot reflect the ancestry relationships precisely.
Instead of defining the induced vertices quantitatively, we define them qualitatively by a set of domain rules: a) to be precise, the induced set includes other participating vertices that are not in the lineage and not on the paths among the query vertices; b) to be concise, it utilizes the path shapes between the source and destination entities given by the users as a heuristic to filter the ancestry lineage subgraph. We define and categorize the rules that generate subsets of the induced vertices as follows, and illustrate them in Fig. 3:
(a) Vertices on Direct Path: Activities and entities along any direct path between a source entity and a destination entity are the most important ancestry information. They help the users answer classic provenance questions, such as reachability, i.e., whether there exists a path, and workflow steps, i.e., if there is a path, what activities occurred. We refer to the entities and activities on such direct paths as the direct-path vertices. Note that not only the shortest path is of interest; all such paths in the DAG should be derived.
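Rule (a) can be sketched with two reachability passes over the ancestry edges: in a DAG, a vertex lies on some direct path exactly when it is an ancestor of a destination and a descendant of a source (the endpoints are included in this sketch). The function names and vertex identifiers are hypothetical, and the edge direction (from an artifact towards what it depends on) is the PROV convention assumed throughout these sketches; this is an illustration, not the evaluation technique of the system.

```python
from collections import defaultdict

ANCESTRY = {"used", "wasGeneratedBy"}


def reach(adjacency, starts):
    """Return all vertices reachable from `starts`, including the starts."""
    seen, stack = set(starts), list(starts)
    while stack:
        vertex = stack.pop()
        for nxt in adjacency[vertex]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def direct_path_vertices(edges, sources, dests):
    """Vertices on any ancestry path from a destination back to a source."""
    towards_ancestors = defaultdict(list)
    towards_descendants = defaultdict(list)
    for src, dst, etype in edges:
        if etype in ANCESTRY:
            towards_ancestors[src].append(dst)
            towards_descendants[dst].append(src)
    ancestors_of_dests = reach(towards_ancestors, dests)
    descendants_of_sources = reach(towards_descendants, sources)
    return ancestors_of_dests & descendants_of_sources


edges = [
    ("weights-v1", "train-v1", "wasGeneratedBy"),
    ("log-v1", "train-v1", "wasGeneratedBy"),
    ("train-v1", "dataset-v1", "used"),
    ("train-v1", "model-v1", "used"),
]
on_path = direct_path_vertices(edges, {"dataset-v1"}, {"weights-v1"})
```

In this toy graph the result contains the weights, the train activity and the dataset, but neither the model (used together with the dataset) nor the log (generated together with the weights), which is why direct paths alone are not sufficient.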
(b) Vertices on Similar Path: Though the direct-path vertices are important, due to the partial knowledge of the user, just considering the direct paths may miss important ancestry information, including: a) the entities generated together with the destination entities, b) the entities used together with the source entities, and c) more importantly, other entities and activities which are not on the direct path but contribute to the derivations. The contributing vertices are particularly relevant to the query in our context, because a data science project consists of many back-and-forth, repetitive and similar steps, such as dataset splits in cross-validation, similar experiments with different hyperparameters and model adjustments (Fig. 3), or configurations that prepare data in alternative ways, adjust model templates, and evaluate experiments.
To define the induction scope, on the one hand, all ancestors of the destination entities in the lineage subgraph could be returned; however, this is very verbose and not concise to interpret. On the other hand, it is also difficult to let the user specify all the details of what should or should not be returned. Here we use a heuristic: induce ancestors which are not on the direct path but contribute to the destination entities in a similar way, i.e., the path labels from the destination entities to those ancestors are the same as those of some direct path between the source and destination entities. In other words, one can think of it as similar to a radius concept [26] for slicing the ancestry subgraph w.r.t. the destination entities, where the radius is not measured by how many hops a vertex is away from the destination entities but by the path patterns between the source and destination entities that are specified by the user query. Next we first formulate the path pattern as a context-free language [37]; the similar-path vertices can then be defined via a constrained reachability query from the source entities through the destination entities over the provenance graph, only accepting path labels in the language.
A context-free grammar (CFG) over a provenance graph and a query is a 6-tuple whose components include an alphabet consisting of the vertex labels, the edge labels and the vertex identifiers (e.g., id in Neo4j) of the graph, a set of nonterminals, a set of production rules, and a start symbol. Each production rule defines an acceptable way of concatenating the right-hand-side (RHS) words for the left-hand-side (LHS) nonterminal. Given a CFG and a nonterminal as the start symbol, a context-free language (CFL) is defined as the set of all finite words over the alphabet that can be derived by applying its production rules.
The CFG for the similar-path heuristic defines a language that describes the path segment pattern for the induced vertex set. Its production rules expand from some destination entity both ways to reach a source entity and an induced vertex, such that the concatenated path has the destination in the middle.
Now we can use the language to define the similar-path vertices accordingly: for any such vertex, there should be at least one path that starts from a source entity, goes through a destination entity, and then reaches some vertex, s.t. the path segment label is a word in the language.
Using a CFG allows us to express the heuristic properly. Note that the language cannot be described by regular expressions over the path (segment) label, as it can be viewed as a palindrome language [37]. Moreover, it allows us to extend the query easily by using other label functions; e.g., instead of the vertex and edge label functions, whose values are types, using property values of vertices or edges in the grammar allows us to describe interesting constraints, e.g., the induced path should use the same commands as the path between the source and destination entities, or the matched entities on both sides of the path should be attributed to the same agent. The former case, for example, only requires modifying the second production rule in the CFG.
This is a powerful generalization that can describe repetitiveness and similar ancestry paths at a very fine granularity.
Note that regular path queries (RPQ) with path variables are not well supported in modern graph query languages and graph databases [38, 39, 19], so the CFL query cannot be evaluated directly via the query facilities provided by the graph database backend. We develop an efficient evaluation technique that is suitable for our needs on provenance graphs in Sec. III-B.
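For intuition only, the similar-path heuristic can be approximated by brute force: enumerate every ancestry path that starts at a destination, record the segment words of the paths that end at a source, and keep the vertices of all paths whose word matches one of them. This mirrors the palindrome shape of the language (the word towards the induced vertex repeats the word towards the source), but it is a sketch under stated assumptions rather than the CFL evaluation technique of the paper: it enumerates all paths (exponential in the worst case), the function names and vertex identifiers are hypothetical, and treating every vertex on a matching path as induced is one reading of the definition.

```python
ANCESTRY = ("used", "wasGeneratedBy")


def ancestry_paths(adjacency, start):
    """Yield every non-empty ancestry path from `start` as (edge label, vertex) hops."""
    stack = [(start, [])]
    while stack:
        vertex, hops = stack.pop()
        if hops:
            yield hops
        for edge_label, nxt in adjacency.get(vertex, []):
            stack.append((nxt, hops + [(edge_label, nxt)]))


def segment_word(hops, vlabel):
    """Label of the path segment: edge and inner vertex labels, end vertex excluded."""
    word = []
    for edge_label, vertex in hops:
        word += [edge_label, vlabel(vertex)]
    return tuple(word[:-1])


def similar_path_vertices(edges, vlabel, sources, dests):
    adjacency = {}
    for src, dst, etype in edges:
        if etype in ANCESTRY:
            adjacency.setdefault(src, []).append((etype, dst))
    induced = set()
    for dest in dests:
        paths = list(ancestry_paths(adjacency, dest))
        # Words of the direct paths, i.e., those that end at a source entity.
        accepted = {segment_word(p, vlabel) for p in paths if p[-1][1] in sources}
        for p in paths:
            if segment_word(p, vlabel) in accepted:
                induced.update(vertex for _, vertex in p)
    return induced


types = {"weights-v1": "entity", "log-v1": "entity", "train-v1": "activity",
         "dataset-v1": "entity", "model-v1": "entity", "download-v1": "activity"}
edges = [
    ("weights-v1", "train-v1", "wasGeneratedBy"),
    ("log-v1", "train-v1", "wasGeneratedBy"),
    ("train-v1", "dataset-v1", "used"),
    ("train-v1", "model-v1", "used"),
    ("dataset-v1", "download-v1", "wasGeneratedBy"),
]
induced = similar_path_vertices(edges, types.get, {"dataset-v1"}, {"weights-v1"})
```

With type labels, the model is induced because it is reached from the weights by the same word as the dataset, while the download activity behind the dataset is not. Passing a different `vlabel` (for example, one that returns the command of an activity) gives the finer-grained constraints mentioned above.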
(c) Entities Generated By Activities on Path (): As mentioned earlier, the sibling entities generated together with may not be induced from directed paths. The same applies to the siblings of entities induced in and , if the siblings do not have paths to . We refer to those entities as and define it as:
(d) Involved Agents (): Finally, the agents in the provenance graph may be important in some situations, e.g., from the derivation, identify who makes a mistake, like git blame in version control settings. On a provenance graph , agents can be derived easily: , where is the union of all query vertices and other induced vertices.
III-A3 Boundary Criteria
On the induced subgraph, besides path shapes, the segmentation operator should be able to express users’ logical boundaries when asking the ancestry queries. It is particularly useful in an interactive setting once the user examines the returned induced subgraph and wants to make adjustments. We categorize the boundary criteria support as a) exclusion constraints and b) expansion specifications.
First, boundaries would be constraints to exclude some parts of the graph, such as limiting ownership (authorship) (who), time intervals (when), project steps (particular version, file path patterns) (where), understanding capabilities (neighborhood size) (what), etc. Most of the boundaries can be defined as boolean functions mapping from a vertex or edge to true or false, i.e., , , which can be incorporated easily into the CFG framework for subgraph induction. We define the exclusion boundary criteria as two sets of boolean functions ( for vertices and for edges), which could be provided by the system or defined by the user. Then the labeling function used for defining would be adjusted by applying the boundary criteria as follows:
In other words, a vertex or an edge that satisfies all exclusion boundary conditions is mapped to its original label. Otherwise the empty word () is used as its label, so that paths having that vertex will not satisfy .
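The adjusted labeling function can be sketched in a few lines of Python. This is a minimal illustration, assuming vertices and edges are plain dictionaries and boundary criteria are ordinary predicates; the names `bounded_label`, `EMPTY_WORD`, and the `owner` property are hypothetical and only serve the example.

```python
EMPTY_WORD = ""  # stands in for the empty word used as a label


def bounded_label(label_fn, boundaries):
    """Wrap `label_fn` so that an item failing any boundary predicate
    is mapped to the empty word instead of its original label."""
    def wrapped(item):
        if all(pred(item) for pred in boundaries):
            return label_fn(item)
        return EMPTY_WORD
    return wrapped


if __name__ == "__main__":
    # Hypothetical vertex encoding: a dict with a type and an owner.
    vertex_label = bounded_label(
        lambda v: v["type"],
        [lambda v: v["owner"] == "alice"],  # a "who" boundary
    )
    print(vertex_label({"type": "E", "owner": "alice"}))
    print(repr(vertex_label({"type": "E", "owner": "bob"})))
```

Because the wrapper only changes the label, the same induction procedure can be reused unchanged: any path through an excluded vertex or edge simply fails to spell a word of the language.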
Second, instead of exclusion constraints, the user may wish to expand the induced subgraph. We allow the users to specify expansion criteria, , denoting including paths from entities which are activities away from entities in . For example, in Fig. 2(d), excludes , edges via and expands by , so updatev2, modelv1 are included.
III-B Query Evaluation
III-B1 Overview: Two-Step Approach
Given a (,,) query, we separate the query evaluation into two steps: 1) induce: induce and construct the induced graph using and in memory, 2) adjust: apply interactively to filter induced vertices or retrieve more vertices from the property graph store backend. The rationale of the two-step approach is that the operator is designed for users with partial knowledge who are willing to understand a local neighborhood in the provenance graph. Any induction heuristic applied would be unlikely to match the user’s implicit interests and would require back-and-forth explorations.
In the rest of the discussion, we assume a) the provenance graph is stored in a backend property graph store, with constant time complexity to access an arbitrary vertex or an arbitrary edge by its primary identifier; b) given a vertex, both its incoming and outgoing edges can be accessed equally, with linear time complexity w.r.t. the in- or out-degree. In our implementation (Sec. LABEL:sec:system), we use Neo4j as our storage backend, which satisfies the conditions; both nodes and edges are accessed via their id.
III-B2 Induce Step
Given and , induces which consists of four categories. We mainly focus our discussion on inducing vertices on direct and similar paths, as the other two types, i.e., sibling entities and related agents, can be derived in a straightforward manner by scanning 1-hop neighborhoods of the first two sets of results.
Cypher: The definition of vertices on similar paths requires a context-free language, and cannot be expressed by a regular language. When developing the system, we realized it can be decomposed into two regular language path segments, and expressed the query using path variables [38, 39]. We handcraft a Cypher query shown in Query 1. The query uses (b) and (e1) to return all directed paths via path variables (p1), and uses the Cypher with clause to hold the results. The second match finds the other half side of the via path variable p2, which then joins with p1 to compare the node-by-node and edge-by-edge conditions to induce . If we do not need to check properties, then we can use length(p1) = length(p2) instead of the two extract clauses. However, as shown later in the evaluation (Sec. V), Neo4j takes more than 12 hours to return results for even very small graphs with about a hundred vertices. Because regular pattern queries (RPQ) with path variables are not supported well in modern graph query languages and graph databases [38, 19], we develop our own algorithm for provenance graphs.
CFL-reachability: Given a vertex and a CFL , the problem of finding all vertices such that there is a path with label is often referred to as the single-source CFL-reachability (CFLR) problem or the single-source L-Transitive Closure problem [40, 41]. The all-pairs version of the problem, which aims to find all such pairs of vertices connected by a path , has the same complexity. As would be all vertices, we do not distinguish between the two in the rest of the discussion. Though the problem was first studied in our community [40], there is little follow-up and support in the context of modern graph databases (Sec. VI). CFLR finds its main application in programming languages and is recognized as a general formulation for many program analysis tasks [41]. On graph representations of programs, program analysis tasks such as program slicing and pointer analysis can be described in a CFL to specify path patterns (Sec. VI).
The state-of-the-art CFLR algorithm [42] solves the problem in time and space w.r.t. the number of vertices in the graph. It is based on a classic cubic time dynamic programming scheme [41, 43] which derives production facts non-repetitively via graph traversal, and uses the method of four Russians [44] during the traversal. In the rest of the paper, we refer to it as . We analyze it on provenance graphs for , then present improvement techniques. The details of and proofs are included in the Appendix.
We first describe the algorithm briefly and then present improvement techniques for on provenance graphs.
Given a CFG, works on its normal form [37], where each production has at most two RHS symbols, i.e., or .
The normal form of is listed in Fig. 7.
At a high level, the algorithm traverses the graph and uses grammar as a guide to find new production facts , where is a LHS nonterminal, are graph vertices, and the found fact denotes that there is a path from to whose path label satisfies .
To elaborate, similar to BFS, it uses a worklist (queue) to track newly found facts and a fast set data structure with time complexity for set diff/union and for insert to memorize found facts.
In the beginning, all facts from all single RHS symbol rules are enqueued. In case ( in Fig. 7), each is added to as . From , the algorithm processes one fact at a time until is empty. When processing a dequeued fact , if appears in any rule in the following cases:
,
the new LHS fact is derived by set diff or by in .
Then the new facts of are added to to avoid repetition and to explore it later. Once is empty, the start symbol facts in include all vertex pairs which have a path with label that satisfies . If the path is needed, a parent table would be used, similar to BFS.
In (Fig. 7), the start symbol is Re, , facts include all , s.t. between them there is .
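The worklist scheme described above can be sketched in Python as follows. This is a plain dictionary-and-set version of the classic dynamic programming approach, without the fast set or the method of four Russians, and it is not the authors' implementation. The function name `cfl_reachability`, the rule encoding (`unary` for single-symbol rules, `binary` for two-symbol rules), and the bracket grammar used in the usage example are assumptions made for illustration.

```python
from collections import defaultdict, deque


def cfl_reachability(edges, unary, binary, start):
    """All vertex pairs (u, v) joined by a path whose label derives from `start`.

    edges:  iterable of (u, label, v) graph edges
    unary:  dict X -> list of nonterminals A with a rule A -> X
    binary: list of (A, B, C) triples for rules A -> B C
    """
    facts = set()                # found facts (symbol, u, v)
    worklist = deque()           # facts waiting to be processed
    out_by = defaultdict(set)    # (symbol, u) -> {v}
    in_by = defaultdict(set)     # (symbol, v) -> {u}

    def add(sym, u, v):
        # Record a fact once, index it, and schedule it for processing.
        if (sym, u, v) not in facts:
            facts.add((sym, u, v))
            out_by[(sym, u)].add(v)
            in_by[(sym, v)].add(u)
            worklist.append((sym, u, v))

    for u, label, v in edges:
        add(label, u, v)

    as_left = defaultdict(list)   # B -> [(A, C)] for rules A -> B C
    as_right = defaultdict(list)  # C -> [(A, B)] for rules A -> B C
    for a, b, c in binary:
        as_left[b].append((a, c))
        as_right[c].append((a, b))

    while worklist:
        sym, u, v = worklist.popleft()
        for a in unary.get(sym, ()):
            add(a, u, v)
        for a, c in as_left[sym]:             # sym(u, v) followed by C(v, w)
            for w in list(out_by[(c, v)]):
                add(a, u, w)
        for a, b in as_right[sym]:            # B(w, u) followed by sym(u, v)
            for w in list(in_by[(b, u)]):
                add(a, w, v)

    return {(u, v) for (s, u, v) in facts if s == start}


if __name__ == "__main__":
    # Illustrative grammar for nested brackets: S -> O C | O X, X -> S C.
    unary = {"(": ["O"], ")": ["C"]}
    binary = [("S", "O", "C"), ("S", "O", "X"), ("X", "S", "C")]
    edges = [(0, "(", 1), (1, "(", 2), (2, ")", 3), (3, ")", 4)]
    print(sorted(cfl_reachability(edges, unary, binary, "S")))
```

Each fact enters the worklist at most once, which is the source of the non-repetitive derivation; recovering the actual paths would additionally require recording, for each new fact, the facts it was derived from.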
reachability on : Next we study the performance of for on a graph, and show the fast set method is not suitable for graph. Then we further explore grammar and graph properties: instead of using the normal form, we rewrite the grammar to allow several pruning strategies and propose a linear-time algorithm if can be viewed as a constant. In our context, the provenance graph is often sparse, and both the numbers of entities that an activity uses and generates can be viewed as a small constant; however, the domain sizes of activities and entities and are potentially large. The following lemma shows that the fast set method is not suitable for graph.
solves reachability on a graph in time if using fast set. Otherwise, it solves it in time. On normal form (Fig. 7), for , derives LHS facts by a LHS fact dequeued from (Note it also derives from ). For , uses edges in the graph during the derivation, e.g., from LHS Re to . As can only be in the worklist once, we can see that each 3-tuple is formed only once on the RHS and there are at most of such 3-tuples. To make sure is not found before, is checked. If not using fast set but a time procedure for each instance , then it takes to produce the LHS; on the other hand, if using a fast set on domain for each , for each , time is required, thus it takes in total. Applying similar analysis on and using to derive new facts, we can see it takes with fast set and without fast set. Finally and can be viewed as following a vertex self-loop edge and do not affect the complexity result.
The lemma also reveals a quadratic time scheme for reachability if we can view the average in/out degree as a constant. Note that the quadratic time complexity is not surprising, as is a linear CFG, i.e., there is at most one nonterminal on the RHS of each production rule. The CFLR time complexity for any linear grammar on general graphs has been shown in theory as by a transformation to general transitive closures [40].
Rewriting : Most CFLR algorithms require the normal form mentioned earlier. However, the normal form a) introduces more worklist entries, and b) misses important grammar properties. Instead, we rewrite as shown in Fig. 5, and propose and by adjusting . Compared with standard normal forms, and have more than two RHS symbols. utilizes the rewritten grammar and graph properties to improve . Moreover, solves reachability on a graph in linear time and sublinear space if viewing as constant. The properties of the rewritten grammar and how and to utilize them are described below, which can be used in other CFLR problems:
a) Reduction for Worklist tuples:
Note that in Fig. 5,
, combines rules in the normal form, i.e., and .
combines rules in the normal form, i.e., and , is derived by:
Instead of enqueue and then , adds to directly.
In the previous normal form, there may be other cases that can also derive , i.e., in presence of and . In the worst case, enqueued number of Lg in which later find the same fact . It’s worth mentioning that in because now would be derived by many in , before adding it to , we need to check if it is already in . In . We use two pairs of bitmaps for Ee and Aa for and respectively, the space cost is
.
Compressed bitmaps would be used to improve space usage at the cost of nonconstant time random read/write.
b) Symmetric property: In the rewritten grammar, both nonterminals Ee and Aa are symmetric, i.e., Ee implies Ee, Aa implies Aa, which is not held in normal forms. Intuitively Ee means some path label from to is the same with some path label from to . Using symmetric property, in , we can use a straightforward pruning strategy: only process in both and if idid, and if idid; and an early stopping rule: for any Aa that both ’s and ’s order of being is before all query entities, we do not need to proceed further. Note the early stopping rule is and graph specific, while solving general CFLR, even in the singlesource version, cannot take source information and we need to evaluate until the end. Though both strategies used by do not improve the worstcase time complexity, they are very useful in realistic graphs (Sec. V).
c) Transitive property: By definition does not have transitivity, i.e., given Ee and Ee, it does not imply Ee. This is because a query allows multiple ; Ee and Ee may be found due to different . However, if we evaluate separately, then Ee and Aa have transitivity, which leads to a linear algorithm for each : instead of maintaining Ee or Aa tuples in and , we can use a set or to represent an equivalence class at iteration or where any pair in the set is a fact of Ee or Aa respectively. If at iteration , the current holds a set , then AaEe is used to infer the next (a set ); otherwise, must hold a set , then EeAa is similarly used to infer the next equivalence class as the next . In the first case, as there are at most possible tuples, the step takes time; in the latter case, similarly the step takes time. The algorithm returns vertices in any equivalence class . Overall, because there are multiple vertices, the algorithm runs in time and space. The early stopping rule can be applied as well: instead of a pair of activities in , in all activities in an equivalence class are compared with entities in in terms of the order of being; the pruning strategy is not necessary, as all pairs are represented compactly in an equivalence class.
solves reachability on a graph in time and space, if can be viewed as a constant.
III-B3 Adjust Step
Once the induced graph is derived, the adjustment step applies boundary criteria to filter existing vertices and retrieve more vertices. Compared with the induction step, applying boundary criteria is rather straightforward. For exclusion constraints and , we apply them on vertices and edges in linearly if present. For , we traverse the backend store with specified entities for hops through and edges to their ancestry activities and entities. To support back-and-forth interaction, we cache the induced graph instead of inducing it multiple times. We expect to be a small constant in our context as the generated graph is for humans to interpret; otherwise, a reachability index is needed. For other purposes where the two-step approach is not ideal, the exclusion constraints and , and expansion criteria can be evaluated together using , and with small modifications on the grammar. In the label function of can be applied at on or , while of can be applied at the rest of the rules involving and . For and , and can be applied together at .
III-B4 Discussion
We mainly focus on developing ad hoc query evaluation schemes. As of now, the granularity of provenance in our context is at the level of command executions, and the number of activities is constrained by project members’ work rate. When the graph becomes extremely large, indexing techniques and incremental algorithms are more practical. We leave them as future steps.
IV Summarization Operation
In a collaborative analytics project, the collected provenance graph of the repetitive modeling trials reflects different roles and work routines of the team members and records steps of various pipelines, some of which have subtle differences. Using , the users can navigate to their segments of interest, which may be about similar pipeline steps. For example, the query result of a single ,,, e.g., ‘yesterday’s input data and prediction result’, shows a pipeline subgraph about how (prediction result) was derived from (input data) together with other induced (e.g., modeling steps). As there is no skeleton for the pipeline, given a different tuple to get another segment , e.g., ‘today’s input data and model result’, the pipeline subgraph may or may not be the same as . Given a set of segments, our design goal of is to produce a precise and concise provenance summary graph, , which will not only allow the users to see commonality among those segments of interest (e.g., yesterday’s and today’s pipelines are almost the same), but also let them understand alternative routines (e.g., an old step excluded in today’s pipeline). Though no workflow skeleton is defined, with that ability, would enable the users to reason about prospective provenance in evolving workflows of analytics projects.
IV-A Semantics of Summarization ()
Although there are many graph summarization schemes proposed over the years [33] (Sec. VI), they are neither aware of provenance domain desiderata [25] nor the meaning of segments. Given a set of segments , each of which is a result, a query is designed to take a 3tuple as input which describes the level of details of vertices and constrains the rigidness of the provenance; then it outputs a minimum provenance summary graph ().
IV-A1 Property Aggregations & Provenance Types of Vertices
To combine vertices and edges across the segments in , we first introduce two concepts: a) property aggregation () and b) provenance type (k), which takes as input and which allow the user to obfuscate vertex details and retain structural constraints.
Property Aggregation (): Similar to an attribute aggregation summarization query on a general graph [45], depending on the granularity level of interest, not all the details of a vertex are interesting to the user and some properties should be omitted, so that vertices can be combined together. For example, in Example IIB, the user may neither care who performs an activity, nor an activity’s detailed configuration; in the former case, all agent type vertices regardless of their property values (e.g., name) should be indistinguishable and in the latter case, the same activity type vertices even with different property values (e.g., training parameters) in various segments should be viewed as if the same thing (e.g., a training activity) has happened.
Formally, property aggregation is a 3-tuple (, , ), where each tuple element is a subset of the graph property types, i.e., ,, (Definition IIA). When used in a query, it discards other irrelevant properties for each vertex type, e.g., properties of entity type in are ignored. For example, in Fig. 2(e), , , , so properties such as version of the entity, details of an activity, names of the agents are ignored.
Provenance Type (k): In contrast with general-purpose graphs, in a provenance graph, the vertices with identical labels and property values may be very different [25]. For example, two identical activities that use different numbers of inputs or generate different entities should be considered different (e.g., updatev2 and updatev3 in Fig. 2(d)). In [25], Moreau proposes to concatenate edge labels recursively over a vertex’s hop neighborhood to assign a vertex type that preserves provenance meaning for later aggregation. However, the definition ignores in/out-degrees of vertices, and the recursive definition is exponential w.r.t. . It is worth mentioning that the former issue occurs in bisimulation-based methods as well [68].
We extend the idea of preserving provenance meaning of a vertex and use the hop local neighborhood of a vertex to capture its provenance type: given a segment , and a constant , , provenance type k is a function that maps a vertex to its hop neighborhood subgraph in its segment , k. For example, in Fig. 2(e), , thus the provenance type of vertices is the 1-hop neighborhood, and vertices with label ‘update’, ‘model’ and ‘solver’ all have two different provenance types (marked as ‘t1’, ‘t2’).
Note one can generalize the definition of k as a subgraph within hop neighborhood of satisfying a pattern matching query, which has been proposed in [46] with application to entity matching where, similar to provenance graphs, just using the vertex properties is not enough to represent the conceptual identity of a vertex.
Vertex Equivalence Relation (): Given , denoting the union of vertices as , with the user specified property aggregation and provenance type k, we define a binary relation over , such that for each vertex pair :

a) vertex labels are the same, i.e., ;
b) all property values in are equal, i.e., ;
c) k and k are graph isomorphic w.r.t. the vertex label and properties in , i.e., there is a bijection between k and k, s.t., if 1) , 2) , 3) .
It is easy to see that is an equivalence relation on by inspection. Using , we can derive a partition of , s.t., each set in the partition is an equivalence class by , denoted by , s.t., and . For each , we can define its canonical label, e.g., the smallest vertex id, for comparing vertices.
In other words, vertices in each equivalence class by describe the homogeneous candidates which can be merged by . Its definition not only allows the users to specify property aggregations to obfuscate unnecessary details in different resolutions, but also allows the users to set provenance types k in order to preserve local structures and ensure the meaning of provenance of a merged vertex.
IV-A2 Provenance Summary Graph ():
Next, we define the output of , the provenance summary graph, .
Desiderata: Due to the nature of provenance, the produced should be precise, i.e., we should preserve paths that exist in one or more segments, at the same time, we should not introduce any path that does not exist in any segment. On the other hand, should be concise; the more vertices we can merge, the better summarization result it is considered to be. In addition, as a summary, to show the commonality and the rareness of a path among the segments, we annotate each edge with its appearance frequency in the segments.
Minimum : combines segment vertices in their equivalence classes and ensures the paths in the output summary graph satisfy the above conditions. Next we define a valid summary graph.
Given a set of segments , and a query, a provenance summary graph, , is a directed acyclic graph, where

a) each represents a subset of an equivalence class w.r.t. over , and one segment vertex can only be in one vertex , i.e., , ; the vertex label function maps a vertex to its equivalence class;
b) an edge exists if there is a corresponding edge in some segment, i.e., ; the edge label function annotates the edge’s frequencies over segments, i.e., ;
c) there is a path mn from to iff , there is a path st from to in , and their path labels are the same . Note that in , we use equivalence classes’ canonical label (e.g., smallest vertex id) as the vertex label in .
It is easy to see , the union of all segments in , is a valid . However, we are interested in a concise summary: the fewer vertices, the better. The best one can get is the optimal solution of the following problem.
[Minimum ] Given a set of segments and a k query, find the provenance summary graph with minimum .
IV-B Query Evaluation
Given , after applying and k, is a labeled graph and contains all paths in segments; to find a smaller , we have to merge vertices in while keeping the invariant, i.e., not introducing new paths. To describe merging conditions, we introduce trace equivalence relations in a . a) in-trace equivalence (): If for every path au ending at , there is a path bv ending at with the same label, i.e., , we say is in-trace dominated by , denoted as . and are in-trace equivalent, written , iff . b) out-trace equivalence (): Similarly, if for every path starting at , there is a path starting at with the same label, then we say is out-trace dominated by , written . and are out-trace equivalent, i.e., , iff .
Merging to does not introduce new paths, if and only if 1) , or 2) , or 3) .
The lemma defines a partial order over the vertices in . By applying the above lemma, we can merge vertices in a greedily until no such pair exists; we then derive a minimal . However, the problem of checking in/out-trace equivalence is PSPACE-complete [47], which implies that the decision and optimization versions of the minimum problem are PSPACE-complete.
Minimum is PSPACE-complete.
Instead of checking trace equivalence, we use simulation relations as its approximation [48, 49], which is less strict than bisimulation and can be computed efficiently in O() in [48]. A vertex is in-simulate dominated by a vertex , written , if a) their label in is the same, i.e., and b) for each parent of , there is a parent of , s.t., . We say , in-simulate each other, , iff . Similarly, is out-simulate dominated () by , if and for each child of , there is a child of , s.t., ; and , out-simulate each other iff they out-simulate dominate each other. Note that a binary relation approximates , if implies [49]. In other words, if in/out-simulates each other, then is in/out-trace equivalent. By using simulation instead of trace equivalence in Lemma IVB as the merge condition, we can ensure the invariant.
If 1) , or 2) , or 3) , merging to does not introduce new paths.
We develop the algorithm by using the partial order derived from the Lemma IVB merge condition in a (initialized as ) to merge the vertices. To compute and , we apply the similarity checking algorithm in [48] twice in O() time. From Lemma IVB, we can ensure there is no new path introduced, and the merging operation does not remove paths, so the algorithm finds a valid . Note that, unlike Lemma IVB, the reverse of Lemma IVB does not hold, so we may not be able to find the minimum , as there may be pairs that are trace equivalent but do not simulate each other.
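The in-simulation relation used as the merge condition can be computed by a simple fixpoint refinement, sketched below in Python. This is a naive version for illustration only and is not the efficient algorithm of [48]; it also assumes edges carry no labels. The names `in_simulation` and `mutually_similar_pairs`, and the dictionary encoding of labels and parents, are hypothetical.

```python
def in_simulation(labels, parents):
    """Return the set of pairs (u, v) such that u is in-simulate dominated by v.

    labels:  dict vertex -> label
    parents: dict vertex -> set of parent vertices (missing key means no parents)
    """
    # Start from all same-label pairs and repeatedly drop violating pairs.
    sim = {(u, v) for u in labels for v in labels if labels[u] == labels[v]}
    changed = True
    while changed:
        changed = False
        for (u, v) in list(sim):
            for p in parents.get(u, ()):
                # Every parent of u needs a matching parent of v.
                if not any((p, q) in sim for q in parents.get(v, ())):
                    sim.discard((u, v))
                    changed = True
                    break
    return sim


def mutually_similar_pairs(sim):
    """Pairs that dominate each other, i.e., candidates for merging."""
    return {(u, v) for (u, v) in sim if u < v and (v, u) in sim}


if __name__ == "__main__":
    labels = {"a1": "A", "a2": "A", "b": "B", "m1": "M", "m2": "M", "m3": "M"}
    parents = {"m1": {"a1"}, "m2": {"a2"}, "m3": {"b"}}
    print(sorted(mutually_similar_pairs(in_simulation(labels, parents))))
```

The out-simulation relation follows from the same function by passing children in place of parents. Since simulation only approximates trace equivalence, the pairs found this way are safe to merge but may miss some trace-equivalent pairs.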
V Experimental Evaluation
In this section, we study the proposed operators and techniques comprehensively. All experiments are conducted on an Ubuntu Linux 16.04 machine with an 8-core 3.0GHz AMD FX380 processor and 16GB memory. For the backend property graph store, we use Neo4j 3.2.5 community edition in embedded mode and access it via its Java APIs. Proposed query operators are implemented in Java in order to work with Neo4j APIs. To limit the performance impact from Neo4j, we always use the node id to seek the nodes, which can be done in constant time in Neo4j’s physical storage. Unless specifically mentioned, the page cache for Neo4j is set to 2GB, the JVM version is 1.8.0_25 and Xmx is set to 4GB.
Dataset Description: Unless lifecycle management systems (e.g., Ground [1], [2]) are used by the practitioners for a long period of time, it is difficult to get real-world provenance graphs from data science teams. Though using VCS (e.g., git) is common practice, VCS repositories only consist of versions of artifacts, but not the activities that occurred between commits. Publicly available real-world provenance graph datasets in various application domains [25] are very small (KBs). We instead develop several synthetic graph generators to examine different aspects of the proposed operators. The datasets and the generators are available online (Datasets: http://www.cs.umd.edu/~hui/code/provdbquery).
(a) Provenance Graphs & Queries: To study the efficiency of , we generate a provenance graphs dataset () for collaborative analytics projects by mimicking a group of project members performing a sequence of activities in a lifecycle management system. Each project artifact has many versions and each version is an entity in the graph. An activity uses one or more input entities and produces one or more output entities.

[Figure panel: Varying Selection Skew]
To elaborate, given , the number of vertices in the output graph, we introduce agents. To determine who performs the next activity, we use a Zipf distribution with skew to model their work rate. Each activity is associated with an agent and uses input entities and generates output entities. and are generated from two Poisson distributions with mean and to model different input and output sizes. In total, the generator produces activities, so that at the end of generation, the sum of entities , activities and agents is close to . The input entities are picked from existing entities; the probability of an entity being selected is modeled as the pmf of a Zipf distribution with skew at its rank in the reverse order of being. If is large, then the activity tends to pick the latest generated entity, while if is small, an earlier entity has a better chance of being selected. The output entities are always new entities, which would be the first version of an artifact, or a new version of an existing artifact. For the latter, we add a derivation edge to an ancestor entity chosen uniformly.
We use the following values as default for the parameters: , , , and . We refer to as the graph with about vertices. In graphs, we pick pairs (, ) as queries to evaluate. Unless specifically mentioned, given a dataset, are the first two entities, and are the last two entities, as they are always connected by some path and the query is the most challenging instance. In one evaluation, we vary to show the effectiveness of the proposed pruning strategy.
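The rank-based input selection can be sketched as follows. This is a minimal Python illustration rather than the actual generator: the function names `zipf_weights` and `pick_inputs`, the choice to draw distinct inputs, and the skew values in the usage example are assumptions, and a moderate skew is assumed so that no weight underflows to zero.

```python
import random


def zipf_weights(n, skew):
    """Normalized Zipf pmf over ranks 1..n: weight(r) is proportional to r**(-skew)."""
    raw = [float(r) ** -skew for r in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def pick_inputs(entities, k, skew, rng):
    """Pick up to k distinct input entities.

    `entities` is ordered from oldest to newest; rank 1 is the newest entity,
    so a larger skew favors recently generated entities.
    """
    pool = entities[::-1]                      # newest first
    weights = zipf_weights(len(pool), skew)
    chosen = []
    for _ in range(min(k, len(pool))):
        i = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(i))             # remove so inputs stay distinct
        weights.pop(i)
    return chosen


if __name__ == "__main__":
    rng = random.Random(7)
    entities = [f"e{i}" for i in range(10)]
    print(pick_inputs(entities, 2, 1.5, rng))  # tends toward recent entities
    print(pick_inputs(entities, 2, 0.1, rng))  # closer to uniform
```

With a skew near zero the weights flatten toward a uniform choice over all existing entities, which mimics activities that keep returning to early artifacts such as the original dataset.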
(b) Similar Segments & Queries: To study the effectiveness of , we design a synthetic generator () with the ability to vary shapes of conceptually similar provenance graph segments. In brief, the intuition is that the stability of the underlying pipelines tends to differ at different stages of the project, which could affect the effectiveness of the summary operator; e.g., at the beginning of a project, many activities (e.g., clean, plot, train data) would happen one after another in no particular order, while at later stages, there are more likely to be stable pipelines, i.e., an activity type (e.g., preprocessing) is always followed by another activity type (e.g., train). For , the former case is more challenging than the latter one.
In detail, we model a segment as a Markov chain with states and a transition matrix among states. Each row of the transition matrix is generated from a Dirichlet prior with the concentration parameter , i.e., the th row is a categorical distribution for state ; each represents the probability of moving to state , i.e., pick an activity of type . We set a single for the vector ; for higher , the transition tends to be a uniform distribution, while for lower , the probability is more concentrated, i.e., fewer types of activities would be picked from. Given a transition matrix, we can generate a set of segments , each of which consists of activities labeled with types, derived step by step using the transition matrix. For the input/output entities and edges of each activity, we use , , and in the same way as in , and all introduced entities have the same equivalence class label.
We vary , , and to study the effectiveness on different sets of segments. A query is applied on each , and produces a . The effects of property aggregation and provenance types are reflected in the above label assignment process.
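The Markov chain model of a segment can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not the generator used in the experiments: the function names, the uniform choice of the initial state, and the concrete parameter values are hypothetical, and the Dirichlet rows are sampled by normalizing Gamma draws.

```python
import random


def dirichlet_row(k, alpha, rng):
    """Sample one row from a symmetric Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def transition_matrix(k, alpha, rng):
    """k x k matrix whose i-th row is the categorical distribution for state i."""
    return [dirichlet_row(k, alpha, rng) for _ in range(k)]


def generate_segment(matrix, n_activities, rng):
    """Sequence of activity types derived step by step from the transition matrix."""
    k = len(matrix)
    state = rng.randrange(k)          # assumption: uniform initial state
    sequence = [state]
    for _ in range(n_activities - 1):
        state = rng.choices(range(k), weights=matrix[state], k=1)[0]
        sequence.append(state)
    return sequence


if __name__ == "__main__":
    rng = random.Random(42)
    stable = transition_matrix(5, 0.1, rng)     # concentrated rows
    unstable = transition_matrix(5, 10.0, rng)  # close to uniform rows
    print(generate_segment(stable, 8, rng))
    print(generate_segment(unstable, 8, rng))
```

A small concentration parameter puts most of each row's mass on a few successor types, giving the stable pipelines of later project stages, whereas a large one approaches the unordered activity mix of early stages.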
Segmentation Operator: We compare our algorithms and with the stateoftheart general CFLRcontextfree language reachability algorithm, [42], and the Cypher query in Sec. III in Neo4j. It uses bitbased set operations to improve the Reps’ [43] dynamic programming algorithm. We implement the fast set using a) Java BitSet in order to have constant random access time, b) RoaringBitMap which is the stateoftheart compressed bitmap (Cbm) with slower random access but better memory usages [50, 51]. We also compare with the Cypher query in Sec. III in Neo4j.
(a) Varying Graph Size : In Fig. 6(a), we study the scalability of all algorithms. axis denotes of the graph, while axis shows the runtime in seconds to finish the query. Note the figure is logscale for both axes. As we see, and run at least one order of magnitude faster than on all datasets, due to the utilization of the properties of the grammar and efficient pruning strategies. Note runs out of memory on due to much faster growth of the worklist than , as the normal forms introduce an extra level; without Cbm runs out of memory on due to O() space complexity and 32bit integer interface in BitSet. With Cbm, except , still runs out of memory on due to the worklist growth, Both algorithms reduce memory usages however become several times slower; In particular, uses 64bit RoaringBitMapspace usage drops to O() and is scalable to larger graphs. For very large graphs, diskbased batch processing is needed and Datalog evaluation can be used (Sec. VI).
runs slightly faster than for small instances but becomes much slower for large instances, e.g., , it is 3x slower than for the query. The reason it is slightly faster on small instances is that runs times on the graph, and each run's performance gain is not large enough. When the size of the graph instance increases, the advantage of using the transitivity property becomes significant.
On the other hand, the Cypher query returns the correct result only for the very small graph and takes orders of magnitude longer. Surprisingly, even for the small graph , it runs for over 12 hours and we have to terminate it. Using its query profiling tool, we find that Neo4j uses a path variable to hold all paths and joins them later, which is exponential w.r.t. the path length and average out-degree. Due to the expressiveness of the path query language, the grammar properties cannot be used by the query planner.
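The gap between enumerating paths and computing reachability can be seen on a toy graph. The sketch below builds a hypothetical layered DAG (not one of the evaluation datasets) in which every vertex of one layer points to every vertex of the next, then compares the number of source-to-sink paths with the number of reachable vertices. With 6 layers of width 4 there are 4^6 = 4096 distinct paths but only 26 reachable vertices.

```python
def layered_dag(num_layers, width):
    """Source -> num_layers fully connected layers of the given width -> sink."""
    source, sink = "s", "t"
    layers = [[source]]
    layers += [[(i, j) for j in range(width)] for i in range(num_layers)]
    layers.append([sink])
    adj = {v: [] for layer in layers for v in layer}
    for cur, nxt in zip(layers, layers[1:]):
        for u in cur:
            adj[u].extend(nxt)
    return adj, source, sink

def count_paths(adj, source, sink):
    """Count distinct source-to-sink paths with memoized DFS (graph is acyclic)."""
    memo = {}
    def visit(u):
        if u == sink:
            return 1
        if u not in memo:
            memo[u] = sum(visit(v) for v in adj[u])
        return memo[u]
    return visit(source)

def count_reachable(adj, source):
    """Count vertices reachable from source (including source itself)."""
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen)

adj, s, t = layered_dag(num_layers=6, width=4)
print(count_paths(adj, s, t), count_reachable(adj, s))
```

A plan that materializes every path therefore does work proportional to the first number, while a reachability-style evaluation touches each vertex and edge a bounded number of times.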
(b) Varying Input Selection Skew : Next, in Fig. 6(b), we study the effect of different input entity selection behaviors on . The axis is and the axis is the runtime in seconds in log-scale. In practice, some analytics activities tend to try many model alternatives to get the best performance for an analytics task, e.g., through a grid search over hyperparameters, or changing a neural network architecture, while in others there are long chains of data transformation pipelines, e.g., feature engineering efforts. Some types of projects tend to always take an early entity as input (e.g., dataset, label), while others tend to take new entities (i.e., the output of the previous pipeline step) as inputs. Tuning in opposite directions can mimic those project behaviors, as it tunes the probability of earlier entities being selected as inputs. In Fig. 6(b), we vary from to , and the result is quite stable for , and , which implies the query formulation and algorithms can be applied to different project types with similar performance.
(c) Varying Activity Input Mean : We study the effect of varying density of the graph in Fig. 6(c) on . The axis varies the mean of the number of input entities. The axis shows the runtime in seconds. With a larger , the number of edges increases linearly, and thus the algorithms' runtime increases linearly as well. In Fig. 6(c), we see grows much more slowly than . Due to the pruning strategies, the growth of the worklist is reduced. utilization is avoided in . has the best performance due to its use of transitivity.
(d) Effectiveness of Early Stopping: The above evaluations all use the most challenging query on start and end entities. In practice, we expect the users will ask queries whose result they can understand by simple visualization. and general CFL do not have early stopping properties. Our algorithms use the temporal constraints of the provenance graph to stop growing the result early. In Fig. 6(d), we vary the of a query and study the performance on . The axis is the starting position among all the entities, e.g., means is selected at the end of 20% percentile w.r.t. the ranking of the order of being. The axis is the runtime in seconds. As we can see, the shorter the temporal gap between and , the shorter the runtime of our algorithms. Using the property of graphs, we get better performance empirically even though the worst-case complexity does not change.
Summarization Operator: Given a , generates a precise summary graph by definition. Here we study its effectiveness in terms of conciseness. We use the compaction ratio defined as . As there are few graph query result summarization techniques available, the closest applicable algorithm we can find is [52], which is designed for summarizing a set of graphs from keyword search graph queries. works on undirected graphs, preserves paths among keyword pairs, and was shown to be more effective than summarization techniques on one large graph, e.g., SNAP [45]. To make work on segments, we introduce a conceptual (start, end) vertex pair as the keyword vertices, and let the start vertex connect to all vertices in having in-degree, and similarly let the end vertex connect to all vertices having out-degree. In the rest of the experiments, by default, , , and , and the y-axis denotes the compaction ratio in all figures.
(a) Varying Transition Concentration : In Fig. 6(e), we change the concentration parameter to mimic segment sets at various stages of a project with different stability. axis denotes the value of in log-scale. Increasing , the transition probability tends to be uniform; in other words, the pipeline is less stable and paths are more likely to be different, so the vertex pairs which would be merged become infrequent. As we can see, algorithm always performs better than , and the generated is about half the result produced by , as cannot combine some and pairs, which are important for workflow graphs. The same finding is consistent in other experiments.
(b) Varying Activity Types : Next, in Fig. 6(f), we vary the possible transition states, which reflects the complexity of the underlying pipeline. It can also be viewed as the effect of using property aggregations on activities (e.g., distinguishing commands with the same name but different options). Increasing leads to more different path labels, as shown in Fig. 6(f), and it makes the summarization less effective. Note that when varying , the number of activities in a segment is set to be , so the effect of on compaction ratio tends to disappear when increases.
(c) Varying Segment Size : We vary the size of each segment when fixing and to study the performance of . Intuitively, the larger the segment is, the more intermediate vertices there are. The intermediate vertices are less likely to satisfy the merging conditions due to path constraints. In Fig. 6(g), the compaction ratio increases as the input instances are more difficult.
(d) Varying Number of Segments : With all the shape parameters set (), we increase the number of similar segments. As the segments are generated from the same transition matrix, they tend to have similar paths. In Fig. 6(h), the compaction ratio becomes better when more segments are given as input.
VI Related Work
Provenance Systems: Provenance studies can be roughly categorized into two types: data provenance (a.k.a. fine-granularity) and workflow provenance (a.k.a. coarse-granularity). Data provenance is discussed in data-centric systems having dataflow query facilities, such as RDBMS, Pig Latin, and Spark [10, 6, 7], while workflow provenance studies address complex interactions among high-level conceptual components in various computational tasks, such as scientific workflows, business processes, and cybersecurity [11, 32, 14]. In scientific workflow provenance systems with retrospective query facilities [11], processes are predefined in workflow skeletons, and multiple executions generate different instance-level provenance run graphs that have clear boundaries. Taking advantage of the skeleton, there are lines of research on advanced ancestry query processing, such as defining user views over such skeletons to aid queries on verbose run graphs [22], executing reachability queries on the run graphs efficiently [53], storing run graphs generated by the skeletons compactly [21], and using visualization as examples to ease query construction [30].
The most relevant work is querying evolving script provenance [35, 54]. Because script executions form clear run graph boundaries, query facilities to visualize and difference execution run graphs have been proposed. In our context, as there are no clear boundaries of run graphs, it is crucial to design query facilities allowing the user to express the logical run graph segments and specify the boundary conditions first. Our method can also be applied to script provenance by segmenting within and summarizing across evolving run graphs.
Data Science Lifecycle Management: Recently, there is emerging interest in developing systems for managing different aspects in the modeling lifecycle, such as building modeling lifecycle platforms [55], accelerating iterative modeling process [56], managing developed models [5, 8], organizing lifecycle provenance and metadata [1, 2, 4], autoselecting models [57], hosting pipelines and discovering reference models [58, 59], and assisting collaboration [60]. Issues of querying evolving and verbose provenance effectively are not considered in that work.
Context Free Language & Graph Query: Parsing CFL on graphs and using it as a query primitive has been studied in early theory work [61, 40], and later used widely in program analysis [41] and other domains such as bioinformatics [62], which require a highly expressive language to constrain paths. Recently it has been discussed as a graph query language [63] and a SPARQL extension [64] in graph databases. In particular, CFLR is a general formulation of many program analysis tasks on graph representations of programs. Most of the CFLs used in program analysis are Dyck languages for matching parentheses [41]. On provenance graphs, our work is the first to use CFL to constrain the path patterns, to the best of our knowledge. CFL allows us to capture path similarities and constrain lineages in evolving provenance graphs. We envision that many interesting provenance queries could be expressed in CFLR and support human-in-the-loop introspection of the underlying workflow.
Answering CFLR on graphs in databases has been studied in [40], and shown equivalent to evaluating Datalog chain programs. Reps et al. [41, 43] describe a cubic time algorithm to answer CFLR, which is widely used in program analysis. Later it was improved in [42] to subcubic time. Because general CFLR is a generalization of CFL parsing, it is difficult to improve [41]. On specific data models and tasks, faster algorithms for Dyck language reachability are discussed in the PL community [65, 66]. Our work can be viewed as utilizing provenance graph properties and rewriting CFG to improve CFLR evaluation.
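For readers unfamiliar with CFLR, the following is a minimal sketch of a generic worklist-style evaluation over a grammar whose productions have at most two symbols on the right-hand side. It is a textbook-style illustration under that assumption, not the optimized algorithms of [41, 42, 43] nor the algorithms evaluated in this paper. The example grammar is a small Dyck language of balanced parentheses, and the function and variable names are hypothetical.

```python
from collections import defaultdict, deque

def cfl_reachability(edges, unary, binary):
    """edges: (u, label, v) triples; unary: (A, B) for A -> B;
    binary: (A, B, C) for A -> B C. Returns all derived (u, symbol, v) facts."""
    facts = set()
    out_edges = defaultdict(set)  # (u, X) -> {v : (u, X, v) is a fact}
    in_edges = defaultdict(set)   # (v, X) -> {u : (u, X, v) is a fact}
    worklist = deque()

    def add(u, x, v):
        if (u, x, v) not in facts:
            facts.add((u, x, v))
            out_edges[(u, x)].add(v)
            in_edges[(v, x)].add(u)
            worklist.append((u, x, v))

    for u, x, v in edges:
        add(u, x, v)

    while worklist:
        u, b, v = worklist.popleft()
        for a, rhs in unary:
            if rhs == b:
                add(u, a, v)
        for a, left, right in binary:
            if left == b:   # (u, left, v) followed by (v, right, w)
                for w in list(out_edges[(v, right)]):
                    add(u, a, w)
            if right == b:  # (w, left, u) followed by (u, right, v)
                for w in list(in_edges[(u, left)]):
                    add(w, a, v)
    return facts

# Dyck grammar: S -> ( ) | ( T | S S,  T -> S )
binary = [("S", "(", ")"), ("S", "(", "T"), ("T", "S", ")"), ("S", "S", "S")]
# Path graph spelling "(())" on vertices 0..4.
edges = [(0, "(", 1), (1, "(", 2), (2, ")", 3), (3, ")", 4)]
facts = cfl_reachability(edges, unary=[], binary=binary)
s_pairs = {(u, v) for (u, x, v) in facts if x == "S"}
print(sorted(s_pairs))  # [(0, 4), (1, 3)]
```

Each derived fact is placed on the worklist at most once, and each combination step joins it with adjacent facts, which is the source of the cubic bound for a fixed grammar.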
Query Results & Graph Summarization: Most work on graph summarization [33] focuses on finding smaller representations for a very large graph by methods such as compression [67], attribute aggregation [45] and bisimulation [68], while a few works aim at combining a set of query-returned trees [69] or graphs [52] to form a compact representation. Our work falls into the latter category, and is tailored for segments which consist of similar or alternative steps among a set of entities of interest. Unlike other summarization techniques, our operator is designed for provenance graphs which include multiple types of vertices rather than a single vertex type [70]; it works on query results rather than the entire graph structure [45, 67, 25]; and the summarization requirements are specific to provenance graphs rather than returned trees [69] or keyword search results [52]. We also consider property aggregations and provenance types in our query constructs to allow tuning provenance meanings, which has not been studied before to the best of our knowledge.
VII Conclusion
We described the key challenges in querying provenance graphs generated in evolving workflows without predefined skeletons and clear boundaries, such as the ones collected by lifecycle management systems in collaborative analytics projects. At query time, the users only have partial knowledge about the ingested provenance, due to the schema-later nature of the properties, multiple versions of the same files, unfamiliar artifacts introduced by team members, and enormous provenance records collected continuously. Just using a standard graph query model is highly ineffective in utilizing the valuable information. We presented two high-level graph query operators to address the verboseness and evolving nature of such provenance graphs. First, we introduced the segmentation operator, which allows the users to provide only the vertices they are familiar with and then induces a subgraph representing the retrospective provenance of the vertices of interest. We formulated the semantics of such a query in a context-free language, and developed efficient algorithms on top of a property graph backend. Second, we described the summarization operator, which combines the results of multiple segmentation queries and preserves provenance meanings to help users understand similar and abnormal behavior in those conceptually similar segments, and allows tuning the provenance meanings with multi-resolution capabilities. Extensive experiments on synthetic provenance graphs with different project characteristics show the operators and evaluation techniques are effective and efficient. The operators are also applicable for querying provenance graphs generated in other scenarios where there are no workflow skeletons, e.g., cybersecurity and system diagnosis.
References
 [1] J. M. Hellerstein, V. Sreekanti, J. E. Gonzalez, J. Dalton, A. Dey, S. Nag, K. Ramachandran, S. Arora, A. Bhattacharyya, S. Das, M. Donsky, G. Fierro, C. She, C. Steinbach, V. Subramanian, and E. Sun, “Ground: A data context service,” in CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 811, 2017, Online Proceedings, 2017.
 [2] H. Miao, A. Chavan, and A. Deshpande, “Provdb: Lifecycle management of collaborative analysis workflows,” in Proceedings of the 2nd Workshop on HumanIntheLoop Data Analytics, HILDA@SIGMOD 2017, Chicago, IL, USA, May 14, 2017, 2017, pp. 7:1–7:6.

 [3] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich, “Data management challenges in production machine learning,” in Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14–19, 2017, 2017, pp. 1723–1726.
 [4] S. Schelter, J. Boese, J. Kirschnick, T. Klein, and S. Seufert, “Automatically tracking metadata and provenance of machine learning experiments,” in NIPS Workshop on ML Systems (LearningSys), 2017.
 [5] M. Vartak, H. Subramanyam, W.E. Lee, S. Viswanathan, S. Husnoo, S. Madden, and M. Zaharia, “ModelDB: a system for machine learning model management,” in Proceedings of the Workshop on HumanIntheLoop Data Analytics. ACM, 2016, p. 14.
 [6] Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen, “Putting lipstick on pig: Enabling databasestyle workflow provenance,” PVLDB, vol. 5, no. 4, pp. 346–357, 2011.
 [7] M. Interlandi, K. Shah, S. D. Tetali, M. A. Gulzar, S. Yoo, M. Kim, T. D. Millstein, and T. Condie, “Titian: Data provenance support in spark,” PVLDB, vol. 9, no. 3, pp. 216–227, 2015.

 [8] H. Miao, A. Li, L. S. Davis, and A. Deshpande, “Towards unified data and lifecycle management for deep learning,” in 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19–22, 2017, 2017, pp. 571–582.
 [9] Z. Zhang, E. R. Sparks, and M. J. Franklin, “Diagnosing machine learning pipelines with fine-grained lineage,” in Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2017, Washington, DC, USA, June 26–30, 2017, 2017, pp. 143–153.
 [10] J. Cheney, L. Chiticariu, and W. C. Tan, “Provenance in databases: Why, how, and where,” Foundations and Trends in Databases, vol. 1, no. 4, pp. 379–474, 2009.
 [11] J. Freire, D. Koop, E. Santos, and C. T. Silva, “Provenance for computational tasks: A survey,” Computing in Science and Engineering, vol. 10, no. 3, pp. 11–21, 2008.
 [12] L. Moreau, B. Clifford, J. Freire, J. Futrelle, Y. Gil, P. Groth, N. Kwasnikowska, S. Miles, P. Missier, J. Myers, B. Plale, Y. Simmhan, E. Stephan, and J. V. den Bussche, “The open provenance model core specification (v1.1),” Future Generation Computer Systems, vol. 27, no. 6, pp. 743 – 756, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X10001275
 [13] L. Moreau and P. Groth, “PROVoverview,” W3C, W3C Note, 2013, http://www.w3.org/TR/2013/NOTEprovoverview20130430/.
 [14] A. M. Bates, D. Tian, K. R. B. Butler, and T. Moyer, “Trustworthy wholesystem provenance for the linux kernel,” in 24th USENIX Security Symposium, USENIX Security 15, Washington, D.C., USA, August 1214, 2015., 2015, pp. 319–334.
 [15] “Provenance Challenge,” http://twiki.ipaw.info, accessed: 201707.
 [16] D. A. Holland, U. J. Braun, D. Maclean, K.K. MuniswamyReddy, and M. I. Seltzer, “Choosing a data model and query language for provenance,” in Provenance and Annotation of Data and Processes  Second International Provenance and Annotation Workshop, IPAW 2008, Salt Lake City, UT, USA, June 1718, 2008, 2010, pp. 206–215.
 [17] P. B. Baeza, “Querying graph databases,” in Proceedings of the 32nd ACM SIGMODSIGACTSIGART Symposium on Principles of Database Systems, PODS 2013, New York, NY, USA  June 22  27, 2013, 2013, pp. 175–188.
 [18] R. Angles, M. Arenas, P. Barceló, A. Hogan, J. L. Reutter, and D. Vrgoc, “Foundations of modern query languages for graph databases,” ACM Computing Surveys, vol. 50, no. 5, pp. 68:1–68:40, 2017.
 [19] O. van Rest, S. Hong, J. Kim, X. Meng, and H. Chafi, “PGQL: a property graph query language,” in Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems, Redwood Shores, CA, USA, June 24, 2016, 2016, p. 7.
 [20] K. MuniswamyReddy, D. A. Holland, U. Braun, and M. I. Seltzer, “Provenanceaware storage systems,” in Proceedings of the 2006 USENIX Annual Technical Conference, Boston, MA, USA, May 30  June 3, 2006, 2006, pp. 43–56.
 [21] M. K. Anand, S. Bowers, T. M. McPhillips, and B. Ludäscher, “Efficient provenance storage over nested data collections,” in EDBT 2009, 12th International Conference on Extending Database Technology, Saint Petersburg, Russia, March 2426, 2009, Proceedings, 2009, pp. 958–969.
 [22] O. Biton, S. C. Boulakia, S. B. Davidson, and C. S. Hara, “Querying and managing provenance through user views in scientific workflows,” in Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 712, 2008, Cancún, México, 2008, pp. 1072–1081.
 [23] M. K. Anand, S. Bowers, and B. Ludäscher, “Techniques for efficiently querying scientific workflow provenance graphs,” in EDBT 2010, 13th International Conference on Extending Database Technology, Lausanne, Switzerland, March 2226, 2010, Proceedings, 2010, pp. 287–298.
 [24] P. Missier, J. Bryans, C. Gamble, V. Curcin, and R. Dánger, “Provabs: Model, policy, and tooling for abstracting PROV graphs,” in Provenance and Annotation of Data and Processes  5th International Provenance and Annotation Workshop, IPAW 2014, Cologne, Germany, June 913, 2014, 2014, pp. 3–15.
 [25] L. Moreau, “Aggregation by provenance types: A technique for summarising provenance graphs,” in Proceedings Graphs as Models, GaM@ETAPS 2015, London, UK, 1112 April 2015., 2015, pp. 129–144.
 [26] R. Abreu, D. Archer, E. Chapman, J. Cheney, H. Eldardiry, and A. Gascón, “Provenance segmentation,” in 8th Workshop on the Theory and Practice of Provenance, TaPP’16, Washington, D.C., USA, June 89, 2016, 2016.
 [27] V. Chaoji, R. Rastogi, and G. Roy, “Machine learning in the real world,” PVLDB, vol. 9, no. 13, pp. 1597–1600, 2016.
 [28] E. R. Sparks, S. Venkataraman, T. Kaftan, M. J. Franklin, and B. Recht, “Keystoneml: Optimizing pipelines for largescale advanced analytics,” in 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 1922, 2017, 2017, pp. 535–546.
 [29] P. Missier and L. Moreau, “PROVdm: The PROV data model,” W3C, W3C Recommendation, 2013, http://www.w3.org/TR/2013/RECprovdm20130430/.
 [30] L. Bavoil, S. P. Callahan, C. E. Scheidegger, H. T. Vo, P. Crossno, C. T. Silva, and J. Freire, “Vistrails: Enabling interactive multipleview visualizations,” in 16th IEEE Visualization Conference, VIS 2005, Minneapolis, MN, USA, October 2328, 2005, 2005, pp. 135–142.
 [31] A. M. Bates, W. U. Hassan, K. R. B. Butler, A. Dobra, B. Reaves, P. T. C. II, T. Moyer, and N. Schear, “Transparent web service auditing via network provenance functions,” in Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 37, 2017, 2017, pp. 887–895.
 [32] C. Beeri, A. Eyal, S. Kamenkovich, and T. Milo, “Querying business processes,” in Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 1215, 2006, 2006, pp. 343–354.
 [33] A. Khan, S. S. Bhowmick, and F. Bonchi, “Summarizing static and dynamic big graphs,” PVLDB, vol. 10, no. 12, pp. 1981–1984, 2017.
 [34] L. Moreau, J. Cheney, and P. Missier, “Constraints of the PROV data model,” W3C, W3C Recommendation, 2013, http://www.w3.org/TR/2013/RECprovconstraints20130430/.
 [35] L. Murta, V. Braganholo, F. Chirigati, D. Koop, and J. Freire, “noworkflow: Capturing and analyzing provenance of scripts,” in Provenance and Annotation of Data and Processes  5th International Provenance and Annotation Workshop, IPAW 2014, Cologne, Germany, June 913, 2014., 2014, pp. 71–83.
 [36] H. Wang and C. C. Aggarwal, “A survey of algorithms for keyword search on graph data,” in Managing and Mining Graph Data. Springer, 2010, pp. 249–273.
 [37] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, 3rd ed. Pearson Addison Wesley, 2007.
 [38] P. T. Wood, “Query languages for graph databases,” SIGMOD Record, vol. 41, no. 1, pp. 50–60, 2012.
 [39] P. Barceló, L. Libkin, A. W. Lin, and P. T. Wood, “Expressive languages for path queries over graphstructured data,” ACM Transactions on Database Systems (TODS), vol. 37, no. 4, pp. 31:1–31:46, 2012.
 [40] M. Yannakakis, “Graphtheoretic methods in database theory,” in Proceedings of the Ninth ACM SIGACTSIGMODSIGART Symposium on Principles of Database Systems, April 24, 1990, Nashville, Tennessee, USA, 1990, pp. 230–242.
 [41] T. W. Reps, “Program analysis via graph reachability,” Information & Software Technology, vol. 40, no. 1112, pp. 701–726, 1998.
 [42] S. Chaudhuri, “Subcubic algorithms for recursive state machines,” in Proceedings of the 35th ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, POPL 2008, San Francisco, California, USA, January 712, 2008, 2008, pp. 159–169.
 [43] D. Melski and T. W. Reps, “Interconvertibility of a class of set constraints and contextfreelanguage reachability,” Theor. Comput. Sci., vol. 248, no. 12, pp. 29–98, 2000.
 [44] V. L. Arlazarov, E. A. Dinic, M. A. Kronrod, and I. A. Faradzev, “On economical construction of the transitive closure of a directed graph,” Doklady Akademii Nauk SSSR, vol. 194, no. 3, p. 487, 1970.
 [45] Y. Tian, R. A. Hankins, and J. M. Patel, “Efficient aggregation for graph summarization,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 1012, 2008, 2008, pp. 567–580.
 [46] W. Fan, Z. Fan, C. Tian, and X. L. Dong, “Keys for graphs,” PVLDB, vol. 8, no. 12, pp. 1590–1601, 2015.

 [47] L. J. Stockmeyer and A. R. Meyer, “Word problems requiring exponential time: Preliminary report,” in Proceedings of the 5th Annual ACM Symposium on Theory of Computing, April 30 – May 2, 1973, Austin, Texas, USA, 1973, pp. 1–9.
 [48] M. R. Henzinger, T. A. Henzinger, and P. W. Kopke, “Computing simulations on finite and infinite graphs,” in 36th Annual Symposium on Foundations of Computer Science, Milwaukee, Wisconsin, 23–25 October 1995, 1995, pp. 453–462.
 [49] T. Milo and D. Suciu, “Index structures for path expressions,” in Database Theory  ICDT ’99, 7th International Conference, Jerusalem, Israel, January 1012, 1999, Proceedings., 1999, pp. 277–295.
 [50]