I. Introduction
Many applications are encoded as a workflow which executes a sequence of data manipulation operations on raw input data. Provenance is an important requirement for workflow management systems as it enables various use cases, e.g., data quality, compliance, problem diagnosis, etc. For example, if the value of a data item is erroneous, we can examine its lineage to investigate which transformation introduced the error and then fix this transformation. In this paper, we present efficient Spark algorithms for processing large-scale workflow provenance data and answering lineage queries.
For a representative example, consider the table Person1. Numbers in brackets represent an id assigned to each attribute value. Next consider a transformation R1 which filters out persons with age less than 25 and populates the table Person2. Values for attributes Name, City and Age in tuples T5, T6 and T7 are hence derived from values for attributes Name, City and Age in tuples T1, T2 and T3 respectively. Further consider a transformation R2 which works on table Person2 and computes the average age of persons in each city. The resulting output is shown in Table III. The value for attribute City in tuple T8 is derived from the values of attribute City in tuples T5 and T6. Similarly, the value for attribute Age in tuple T8 is derived from the values of attribute Age in tuples T5 and T6. The values for attributes City and Age in tuple T9 are derived from the values of attributes City and Age in tuple T7 respectively. The workflow provenance data captures these lineages among input and output attribute values across each transformation, as they are executed.
Provenance Data Model: We assume that the provenance data is specified as a set of triples (src, dst, op), where src and dst represent the ids of the parent and child data items and op represents the transformation applied along with any metadata (e.g., runtime parameters, timestamp etc). Table V shows the provenance data associated with the representative example. We also visualize the provenance data as a directed acyclic graph wherein the data items (i.e., src and dst) in provenance triples form the vertices and the provenance triples form the edges (Table V).
Person1:

      Name        City     Age
T1    Steve (1)   NY (2)   30 (3)
T2    Mark (4)    NY (5)   40 (6)
T3    Shane (7)   LA (8)   40 (9)
T4    Mary (10)   NY (11)  20 (12)
Person2:

      Name         City     Age
T5    Steve (13)   NY (14)  30 (15)
T6    Mark (16)    NY (17)  40 (18)
T7    Shane (19)   LA (20)  40 (21)
Table III: Average age per city

      City     Age
T8    NY (22)  35 (23)
T9    LA (24)  40 (25)
Table V: Provenance data for the representative example

src   dst   op   ccid
1     13    R1   1
4     16    R1   2
7     19    R1   3
2     14    R1   4
5     17    R1   4
14    22    R2   4
17    22    R2   4
8     20    R1   5
20    24    R2   5
3     15    R1   6
6     18    R1   6
15    23    R2   6
18    23    R2   6
9     21    R1   7
21    25    R2   7
Provenance Query: Given a query data item, we want to track its lineage, i.e., all its ancestors and the details of all transformations involved. For example, the lineage of data item 23 (i.e., the value of attribute Age of tuple T8 in Table III) will return that data item 23 is derived from data items 15 and 18 via transformation R2, and data items 15 and 18 are derived from data items 3 and 6 respectively via transformation R1.
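To make the traversal concrete, the following is a minimal pure-Python sketch of such a lineage query over the example triples. The in-memory list and the function name `lineage` are illustrative only; the system described in this paper operates on Spark RDDs.

```python
from collections import defaultdict

# Each triple is (src, dst, op): parent id, child id, transformation.
triples = [
    (15, 23, "R2"), (18, 23, "R2"),  # Age of T8 from Ages of T5, T6
    (3, 15, "R1"), (6, 18, "R1"),    # Ages of T5, T6 from Ages of T1, T2
]

def lineage(query_id, triples):
    """Return all (parent, child, op) triples in the ancestry of query_id."""
    parents = defaultdict(list)          # child id -> [(parent id, op), ...]
    for src, dst, op in triples:
        parents[dst].append((src, op))
    result, frontier, seen = [], [query_id], {query_id}
    while frontier:                      # walk upwards, level by level
        nxt = []
        for d in frontier:
            for src, op in parents[d]:
                result.append((src, d, op))
                if src not in seen:
                    seen.add(src)
                    nxt.append(src)
        frontier = nxt
    return result

print(sorted(lineage(23, triples)))
```

Running this on the example returns exactly the four triples described above.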
Contributions: A naive approach to answer a provenance query is to recursively process the provenance data. We start with the queried data item, find those provenance triples which describe its immediate lineage and obtain its parents. We then find the parents of these parents and follow this process until we can no longer trace the lineage further. This approach is adopted by many systems, e.g., Trio [1], GridDB [2], Titian [3] etc. It is slow as we need to issue many queries. Secondly, as Spark does not support indexing, Spark needs to scan the data to find the parents of a data item. This approach hence does not scale for large volumes of data. A second approach is to precompute and materialize the transitive closure of the lineage dependencies (i.e., the provenance of each data item). This allows retrieval of a data item's lineage using a single query. However, this results in a huge increase in storage cost as the information regarding common ancestors gets replicated multiple times. This approach hence also does not scale.
In this paper, we propose a novel approach wherein we first quickly determine a small volume of data which contains the entire provenance output of the queried data item. We then extract and recursively query this small volume of data. As the recursive querying happens on a small volume of data, we do not incur a large data processing cost. The contributions of this paper are as follows.

- We propose a novel provenance framework wherein we first compute weakly connected components in the provenance graph and further partition the large components as a collection of weakly connected sets (section III). We then effectively navigate the weakly connected components and sets, thus computed, to determine a minimal volume of data containing the entire provenance output of the queried data item (sections III and IV).

- We propose a novel provenance graph partitioning approach wherein we exploit the workflow dependency graph to recursively partition the large components in the workflow provenance graph (section III).

- Our experiments on provenance graphs obtained from a real-life text curation workflow and containing up to 500M nodes and edges show that the proposed approaches significantly beat the naive approaches (section V). The performance is real-time if all data can be cached in RAM.

- The space overheads are (1) storing two set ids with each provenance triple and (2) storing the set dependencies, i.e., how the sets derive each other. The number of set dependencies is upper-bounded by the number of provenance triples and, in practice, is only a small fraction of it. The proposed framework hence has a minimal space overhead.
II. Background
Apache Spark: Spark uses the resilient distributed dataset (RDD) as its basic data type. An RDD partitions the data across the cluster nodes. In this paper, we are mainly concerned with the Spark filter and lookup operations. The filter operation scans each row of an RDD and checks whether the filter conditions are satisfied. A lookup is a specific kind of filter where one or more columns are checked for equality. To accelerate lookup operations, we can hash-partition an RDD on one or more columns; this moves all rows with the same key to one partition. With hash-partitioning enabled, a lookup needs to scan only one partition. Hash-partitioning also accelerates filter performance if the filter conditions involve checking column equality on the hashed columns. RDDs can also be cached, which avoids recomputing an RDD each time it is accessed.
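The effect of hash-partitioning on lookups can be sketched in plain Python. This is a toy stand-in for Spark's partitioner, not Spark code; `NUM_PARTITIONS` and the helper names are made up for illustration.

```python
# Rows with the same key land in the same partition, so a lookup scans one
# partition instead of all of them.
NUM_PARTITIONS = 4

def hash_partition(rows, key_index, num_partitions=NUM_PARTITIONS):
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % num_partitions].append(row)
    return parts

def lookup(parts, key_index, key):
    part = parts[hash(key) % len(parts)]   # only one partition is scanned
    return [row for row in part if row[key_index] == key]

rows = [(1, 13, "R1"), (2, 14, "R1"), (14, 22, "R2"), (17, 22, "R2")]
parts = hash_partition(rows, key_index=1)       # partition on the dst column
print(lookup(parts, key_index=1, key=22))       # triples deriving data item 22
```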
Weakly Connected Sets and Components: A semi-path joining vertices u and v in a directed graph G = (V, E) is a sequence of vertices u = v1, v2, ..., vk = v s.t. for each i, 1 ≤ i < k, either there exists an edge (vi, vi+1) in E or there exists an edge (vi+1, vi) in E. A set of vertices S ⊆ V is called weakly connected if there exists a semi-path between each pair of vertices in S. A maximal weakly connected set of vertices is a weakly connected component in G.
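A standard way to compute weakly connected components is union-find over the undirected view of the edges. The following is a small illustrative sketch, not the Spark implementation used later in the paper.

```python
# Union-find sketch for weakly connected components: treat each directed edge
# as undirected and merge the endpoints' sets.
def connected_components(edges):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for u, v in edges:
        union(u, v)
    return {x: find(x) for x in parent}

# Edges from the example: {1, 13} and {2, 5, 14, 17, 22} are separate components.
comp = connected_components([(1, 13), (2, 14), (5, 17), (14, 22), (17, 22)])
print(comp[2] == comp[22], comp[1] == comp[22])
```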
Notation: In the rest of the paper, we use the terms “connected component” and “connected set” as shorthand for “weakly connected component” and “weakly connected set”, even though connectivity and weak connectivity are different abstractions in graph theory.
III. The Provenance Framework
III-A. Recursive Querying on Spark (RQ)
We first discuss the challenges in executing recursive querying (RQ) on Spark. Let us denote the provenance data RDD as provRDD. As discussed, RQ involves executing many queries to trace the entire lineage of a data item d. The number of queries is equal to the length of the longest provenance path in the lineage of d. Each such query involves finding the parents of a set P of one or more data items. As discussed above, if we hash-partition provRDD on the field dst, all provenance triples with the same dst value move to one partition and we can hence find the parents of a data item by scanning one partition of provRDD. To find the parents of all data items in P, we need to scan at most |P| partitions. This is because some data items in P may be in the same partition, and the parents of these data items can hence be obtained by scanning this partition only once. If the lineage size (i.e., the number of ancestors) of the queried data item d is n, we hence require scanning a maximum of n partitions. The overall RQ cost hence depends upon the number of queries executed, the set of lookups made as part of each query, and the distribution of the field dst across provRDD.
III-B. Connected Components and Provenance (CCProv)
We observe that the workflow provenance graph, formed by attribute values, is a large collection of weakly connected components. This is because many attribute values do not share any common ancestors. This is best evidenced by looking at Table V which shows the provenance graph for the representative example. This graph contains 10 weakly connected components. We notice that a data item and all its ancestors, as well as its descendants, share the same weakly connected component. This property can be used to speed up the processing of provenance queries. Given a queried data item d, we first find its weakly connected component id and then retrieve all provenance triples in this component. We then process the triples in this component recursively to figure out the provenance of d. As the size of a component is much smaller than the whole provenance graph, the recursive querying executes faster. We hence compute the weakly connected components on the provenance graph and append the connected component id with each provenance triple, as shown in Table V. This computation is part of preprocessing and needs to be done only once.
Algorithm 1 outlines the algorithm for computing the lineage of a data item d. It takes the provenance data RDD provRDD, hash-partitioned on the column dst, as input. We first find the id of the connected component the data item d lies in; let it be c. This can be found by scanning a single partition of provRDD. We then find all provenance triples in component c; let this RDD be provRDD_c. This is done via a Spark filter operation on provRDD, which preserves the hash-partitioning logic. We then recursively process provRDD_c to find the lineage of d.
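A pure-Python sketch of the CCProv logic, assuming triples annotated with their component id; this is illustrative only (in the Spark version, step 1 is a partition lookup and step 2 a filter on the RDD).

```python
# `triples` rows are (src, dst, op, ccid).
def cc_prov(query_id, triples):
    # Step 1: find the component id of the queried data item (one lookup).
    ccid = next(cc for s, d, op, cc in triples if d == query_id)
    # Step 2: keep only this component's triples (a filter in Spark).
    comp = [(s, d, op) for s, d, op, cc in triples if cc == ccid]
    # Step 3: recursive querying, now over a much smaller set of triples.
    out, frontier = [], {query_id}
    while frontier:
        step = [(s, d, op) for s, d, op in comp if d in frontier]
        out.extend(step)
        frontier = {s for s, _, _ in step}
    return out

triples = [(3, 15, "R1", 6), (6, 18, "R1", 6), (15, 23, "R2", 6),
           (18, 23, "R2", 6), (9, 21, "R1", 7), (21, 25, "R2", 7)]
print(sorted(cc_prov(23, triples)))
```

On the example, the query for data item 23 never touches the triples of component 7.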
III-C. Connected Sets and Provenance (CSProv)
Though CCProv provides better performance vis-à-vis RQ, it may not be good enough when the component size is large, as CCProv then processes a large volume of data (i.e., all provenance triples of the component). We next discuss CSProv which improves on this aspect. The idea is to preprocess and partition the large components into a collection of weakly connected sets. At query time, we exploit the information regarding how these sets derive each other to quickly find a minimal volume of data containing the entire lineage of the queried data item. We explain the intuition via a representative example.
Table VIII: Provenance triples of component C with set ids, and the set dependencies

src   dst   op   src_csid   dst_csid
1     2     -    S1         S1
1     3     -    S1         S1
2     4     -    S1         S2
3     4     -    S1         S2
4     5     -    S2         S2
4     6     -    S2         S2
5     7     -    S2         S3
7     8     -    S3         S3
7     9     -    S3         S3
6     10    -    S2         S4
10    11    -    S4         S4
10    12    -    S4         S4

src_csid   dst_csid
S1         S2
S2         S3
S2         S4
Consider a weakly connected component C as shown in Table VIII. Suppose we partition the component C into 4 weakly connected sets: S1, S2, S3 and S4. These sets are formed by data items {1, 2, 3}, {4, 5, 6}, {7, 8, 9} and {10, 11, 12} respectively. We also maintain the set dependencies, i.e., how these sets contribute to the derivation of other sets. The set S1 contributes to the derivation of set S2 as data items 2 and 3 in set S1 derive data item 4 in set S2. Set S2 derives set S3 as data item 5 in set S2 derives data item 7 in set S3. Similarly, set S2 derives set S4. Note that sets S3 and S4 do not contribute to the derivation of any set (Table VIII).
Consider that we query the provenance of data item 8, which belongs to the set S3. From the set dependencies, we find that set S2 derives set S3 and set S1 derives set S2. Hence sets S1 and S2 are relevant to the derivation of set S3. These three sets together contain all ancestors of data item 8. We only process those triples whose derived (dst) data item is in sets S1, S2 or S3. We do not need to process the set S4 triples as the set dependencies tell us that set S4 neither contributes to the derivation of set S3 nor to the derivation of any ancestor set of set S3. We hence end up processing a smaller volume of data, in this example 3 fewer provenance triples.
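The set-lineage step sketched on this example in pure Python; the paper executes the same traversal as RQ over the set-dependency RDD, so the list below and the function name are illustrative only.

```python
# From the set dependencies alone, find every set that (transitively)
# contributes to the queried item's set.
set_deps = [("S1", "S2"), ("S2", "S3"), ("S2", "S4")]  # (parent set, child set)

def set_lineage(csid, set_deps):
    relevant, frontier = {csid}, {csid}
    while frontier:
        frontier = {p for p, c in set_deps if c in frontier} - relevant
        relevant |= frontier
    return relevant

# Data item 8 lies in S3; S4 is never touched.
print(sorted(set_lineage("S3", set_deps)))
```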
CSProv requires the following updates to the provenance data model discussed in section I.

- Provenance Data: The data items src and dst in a provenance triple may lie in two different weakly connected sets and we hence maintain the set ids of both items. We add the columns src_csid and dst_csid to the schema and drop the field ccid from the provenance triple (Table VIII).

- Set Dependencies: We also maintain how the weakly connected sets are derived from each other (Table VIII). We say a set T is derived from a set S if there exist at least one data item u in S and at least one data item v in T s.t. there is a provenance triple whose src equals u and dst equals v. There are two columns in the schema, src_csid and dst_csid, which denote the set ids of the parent and child connected sets.
Algorithm 2 outlines the algorithm CSProv. It takes the provenance data provRDD and the set dependencies setDepRDD as input, both hash-partitioned on the field dst_csid. Given the queried data item d, we first find its connected set s. We then construct a set A which includes set s and its set-lineage, i.e., all sets which contribute to the derivation of set s, directly or indirectly. This is done by executing the RQ logic on setDepRDD. RQ on setDepRDD is lightweight for two reasons. First, the size of setDepRDD is likely to be much smaller vis-à-vis provRDD. Secondly, the size of the set-lineage of set s is likely to be much smaller than the size of the lineage of data item d, and hence a much smaller number of queries need to be executed.
For each set t in A, we find the provenance triples whose dst data item is in the connected set t. As provRDD is hash-partitioned on the field dst_csid, this requires scanning at most |A| partitions. As discussed, the size of set A is small and this operation is hence lightweight as well. The union of all these provenance triples, denoted provRDD_A, contains the entire lineage of data item d. We then recursively process provRDD_A to compute the lineage of d. Again, the size of provRDD_A is likely to be much smaller than the size of the component the queried data item lies in. Recursive querying on provRDD_A is hence lightweight as well.
Note that when the queried data item lies in a small component c, CSProv reduces to CCProv. Small components are not partitioned and each small component is managed as a single weakly connected set (i.e., itself). The set A hence only contains the set/component c.
IV. Partitioning Large Components
In section III-C, we identified the following criteria for the algorithm CSProv to work efficiently.

- C1: The number of set dependencies should be small.

- C2: The set-lineage of a set should be small.

- C3: The size of each connected set should be small.
Criteria C1 and C2 imply that CSProv can construct the set-lineage of a set cheaply. Criteria C2 and C3 imply that only a small number of triples need to be recursively processed. We next discuss how we partition the large components so that the resulting sets satisfy these criteria. We exploit the workflow dependency graph for this purpose. The dependency graph specifies dependencies among the tables and hence an order in which the various tables are generated, e.g., the dependency graph in Figure 1 specifies that the table MTRCS can be generated only after table F10WMTR is generated. We first develop the following notation.
Notation: Let D represent the workflow dependency graph. Let a split sp be a subset of the tables in dependency graph D s.t. these tables are weakly connected in D. Figure 1 shows a partitioning of the dependency graph across three splits sp1, sp2, sp3. Note that the tables in each split are weakly connected. Let V(sp, c) be the set of those vertices in the provenance graph which belong to component c and belong to a table in split sp. Let G(sp, c) be the subgraph induced by the vertices V(sp, c). We also call G(sp, c) the provenance subgraph induced by split sp and component c. Let W(sp, c) be the set of weakly connected components in the subgraph G(sp, c).
Algorithm 3 outlines the details. We first partition the dependency graph into a set of disjoint splits SP. The procedure PartitionLargeComponent takes a large component c and the dependency graph splits SP as input, and returns a set of weakly connected sets CS as output. For each split sp in SP, we first construct the subgraph G(sp, c) and then compute the weakly connected components W(sp, c) in it. The procedure then iterates over each component in W(sp, c). If the number of vertices in the component is less than a threshold τ, it is not processed further and is inserted into the output set CS. If not, we further partition the split sp into a set of disjoint and weakly connected sub-splits and recursively call the procedure PartitionLargeComponent with this component and sub-split set as input.
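A compact sketch of this recursive partitioning logic under assumed inputs: the `table_of` map, the split structure and the `THRESHOLD` value below are all made up for illustration, and the real algorithm runs on Spark rather than in memory.

```python
# A split is modelled as a set of table names paired with its finer sub-splits:
# (tables, sub_splits). Components at or below THRESHOLD are emitted as sets.
THRESHOLD = 3

def wcc(vertices, edges):
    """Weakly connected components of the subgraph induced by `vertices`."""
    parent = {v: v for v in vertices}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        if u in parent and v in parent:
            parent[find(u)] = find(v)
    comps = {}
    for v in vertices:
        comps.setdefault(find(v), set()).add(v)
    return list(comps.values())

def partition_large_component(component, edges, table_of, splits):
    out = []
    for tables, sub_splits in splits:
        vs = {v for v in component if table_of[v] in tables}
        for c in wcc(vs, edges):
            if len(c) <= THRESHOLD or not sub_splits:
                out.append(c)                 # small enough: emit as a set
            else:                             # too large: recurse on sub-splits
                out.extend(partition_large_component(c, edges, table_of, sub_splits))
    return out

# Toy component over tables A, B, C; split {B, C} is refined into {B} and {C}.
table_of = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "B", 7: "C", 8: "C"}
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8)]
splits = [({"A"}, []), ({"B", "C"}, [({"B"}, []), ({"C"}, [])])]
sets = partition_large_component(set(range(1, 9)), edges, table_of, splits)
print(sorted(sorted(s) for s in sets))
```

In this toy run, the 6-node component induced by split {B, C} exceeds the threshold and is re-partitioned using the sub-splits {B} and {C}.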
Computing Set Dependencies: After all large components are partitioned, the fields src_csid and dst_csid associated with each provenance triple are populated using the connected sets thus generated. We then find those provenance triples wherein the columns src_csid and dst_csid take different values. The set of distinct (src_csid, dst_csid) pairs in such triples forms the set dependencies.
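This step amounts to a distinct-pairs projection over the cross-set triples, e.g. (a pure-Python sketch; in the paper this runs over the provenance RDD):

```python
# Keep only triples whose endpoints lie in different sets, then deduplicate
# the (src_csid, dst_csid) pairs.
def compute_set_deps(triples):
    return sorted({(sc, dc) for _, _, _, sc, dc in triples if sc != dc})

triples = [(2, 4, "-", "S1", "S2"), (3, 4, "-", "S1", "S2"),
           (5, 7, "-", "S2", "S3"), (4, 5, "-", "S2", "S2")]
print(compute_set_deps(triples))
```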
Discussion: The constraint that all tables in each split are weakly connected is a key part of the algorithm. Note that for any given large component c and split sp, no two components computed on the provenance subgraph induced by sp and c contribute a set dependency, i.e., there is no set dependency (s1, s2) s.t. both s1 and s2 are components of this subgraph. This is because these components are, by definition, disconnected from each other. This ensures that the number of set dependencies is small (criterion C1). Secondly, this increases the likelihood that a data item's local lineage (i.e., its few immediate ancestors) can be found in the same weakly connected set that the data item lies in, and hence only a few sets returned by the procedure are relevant to the lineage of a queried data item (criterion C2). Finally, the condition that the size of each set has to be less than a threshold τ ensures that the size of each set is small (criterion C3).
Note that if we consider each table in the dependency graph as a separate split, CSProv reduces to RQ: each attribute value becomes a connected set and the provenance triples capture the set dependencies. If we consider all tables in the dependency graph as part of one split, CSProv reduces to CCProv.
V. Experimental Evaluation
Provenance Data Set: We used a provenance trace obtained from a real-life workflow deployed in our lab for creating financial domain knowledge bases [4]. The workflow parses SEC filing documents [5]. Each SEC document contains data pertaining to many thousands of financial metrics and the workflow curates this data. Figure 1 shows the workflow dependency graph comprising 25 entities (tables). For each entity, we show only its acronym so as to remove any confidential information. The workflow contains various transformations involving entity annotation, extraction and resolution. For each transformation, the lineage relationships among the child and parent attribute values are captured. The workflow contains many UDFs and the lineage service assumes that each attribute value in a UDF's output is dependent on each attribute value in the UDF's input. The entity FINDocs (marked *) forms the workflow input.
This workflow was executed on a set of 532 financial documents. The obtained provenance trace is 1.6GB in size and contains 6.4M triples with 4.6M attribute values. The provenance graph hence contains 4.6M nodes and 6.4M edges. These attribute values have widely different derivation patterns: 32 attribute values are directly derived from more than 100 parent values, with the maximum being 450; 3963 values are directly derived from more than 10 but fewer than 100 parents; the rest of the attribute values have fewer than 10 parents.
Spark Cluster: The cluster runs Spark v2.0.2 and has 8 nodes, each with 12 cores, a 2.4GHz processor and 120 GB RAM.
Weakly Connected Components: We computed the weakly connected components in the provenance graph using the Spark implementation provided at [6]; it took 6 mins to compute them. Three of these components are large, containing 1.2M, 0.9M and 0.7M nodes, and 2.7M, 1.4M and 1.2M edges (triples) respectively. We denote these three large components as LC1, LC2 and LC3 respectively. 132 components contain between 910 and 7453 nodes. The rest of the components have 100 or fewer nodes.
Table IX: For each component and split: (number of sets, # sets with ≥ 1000 nodes, # nodes in the largest set)

          Split sp1       Split sp2         Split sp3
LC1       20, 0, 490      29696, 4, 21734   219879, 11, 3291
LC3       10, 0, 313      15491, 1, 2578    128264, 0, 643
LC2       1, 0, 4         1, 0, 211         1, 1, 0.9M

          Split sp4       Split sp5
LC2_lc1   64737, 0, 30    132599, 2, 24733
Table X: Query times for class SCSL (RDDs cached in RAM)

          10M   100M   250M   500M
RQ        2.3   8.9    10.8   16.5
CCProv    0.3   0.4    0.6    0.9
CSProv    0.3   0.4    0.6    0.9
Table XI: Query times for class LCSL (RDDs cached in RAM)

          10M   100M   250M   500M
RQ        2.1   8.3    11.4   16.0
CCProv    2.3   5.0    6.2    7.9
CSProv    0.6   0.8    1.1    1.6
Table XII: Query times for class LCLL (RDDs cached in RAM)

          10M   100M   250M   500M
RQ        2.7   9.1    12.7   20.0
CCProv    2.5   5.5    7.0    9.1
CSProv    0.8   1.3    1.7    2.2
Table XIII: Query times for class LCLL (RDDs cached on disk)

          10M   100M   250M   500M
RQ        7     20     47     101
CCProv    5.5   9      16     31
CSProv    3     6      11     17
Weakly Connected Sets: We next partitioned the three large components using Algorithm 3. We partitioned the workflow dependency graph into three weakly connected splits sp1, sp2, sp3 as shown in Figure 1. We set the threshold τ to 25K nodes. Table IX presents statistics on the connected sets obtained. For each large component and for each split, we note (a) the number of sets computed, (b) the number of sets with at least 1000 nodes and (c) the number of nodes in the largest set (i.e., the set containing the maximum number of nodes).
The component LC1 got partitioned into a total of 249595 weakly connected sets, with splits sp1, sp2 and sp3 accounting for 20, 29696 and 219879 sets respectively. The largest sets in sp1, sp2 and sp3 turned out to contain 490, 21734 and 3291 nodes respectively and hence needed no further partitioning. The component LC3 got partitioned into 143765 sets, with the largest sets in sp1, sp2 and sp3 containing 313, 2578 and 643 nodes. No set of LC3 hence required further partitioning either.
However, for component LC2, the subgraph induced by split sp3 yielded only a single connected component of size 0.9M; let us denote it as LC2_lc1. This component hence needs to be partitioned further. Split sp3 is partitioned into two weakly connected sub-splits sp4 and sp5 as shown in Figure 1, and the procedure PartitionLargeComponent is called with component LC2_lc1 and the split set {sp4, sp5} as input. This time, LC2_lc1 got partitioned into 197336 sets, with sub-splits sp4 and sp5 accounting for 64737 and 132599 sets respectively. None of these sets contained more than 25K nodes and hence no further partitioning was needed. Overall, the three large components LC1, LC2 and LC3 got partitioned into 590698 sets and these sets involve 645303 set dependencies. The number of set dependencies is hence an order of magnitude smaller than the number of provenance triples, and their size on disk is 0.03GB.
Scaled Datasets: We replicated the provenance trace by factors of 9, 24 and 48, generating three scaled provenance graphs containing 100M, 250M and 500M nodes and edges respectively. The sizes on disk are 15, 35 and 71GB respectively. As the data is replicated, these scaled datasets contain 27, 72 and 144 large components respectively. These large components are partitioned and the statistics regarding the resulting sets mirror those given in Table IX. The numbers of set dependencies are hence 9, 24 and 48 times those of the base dataset and the sizes on disk are 0.25, 0.67 and 1.3GB respectively. The computation of the connected components and connected sets on these three scaled datasets took 16, 28 and 50 mins respectively.
Provenance Queries: We chose three classes of lineage queries to illustrate the effectiveness of the proposed approaches. For each class, we chose 10 data items and queried their lineage using RQ, CCProv and CSProv, on the base as well as the scaled datasets. The longest provenance path for all LCLL queries is 10, while it is 7 for all SCSL and LCSL queries.

- SCSL: We chose data items in a small component containing 7453 nodes and 8122 edges. The number of ancestors as well as transformations in the lineage of these data items is between 100 and 200. These queries hence track the lineage of data items with a small lineage size.

- LCSL: We chose data items in the large components LC1, LC2, LC3 s.t. both the number of ancestors and transformations in their lineage are between 100 and 200. These queries track the lineage of data items in large components, but with a small lineage size.

- LCLL: We chose data items in large components s.t. both the number of ancestors and transformations in their lineage are between 5000 and 10000. These queries track the lineage of data items in large components, with a considerably larger lineage size vis-à-vis class LCSL.
RDDs Cached in RAM: We first ran experiments with 80 GB executor memory. For all scaled datasets, all RDDs fit in memory with this configuration. The RDDs were hash-partitioned and cached in RAM. All RDDs were loaded with 96 partitions. We executed the lineage queries and measured the average of the time taken by the 10 queries for each class. Tables X, XI and XII present the results for classes SCSL, LCSL and LCLL respectively. We note that CSProv performance is real-time, degrades gracefully with data size and is significantly better than RQ and CCProv.
RDDs Cached on Disk: A cluster may not have enough RAM to cache all RDDs. We hence repeated the experiments with the hash-partitioned RDDs cached on disk. Table XIII presents the results. For lack of space, we show results only for class LCLL. At every step, RQ, CCProv and CSProv read the data from disk. As the data size increases, the gap between RQ and CSProv widens.
Discussion: We next explain the details of CSProv using one query from each class. One of the 10 data items queried for the LCSL class belongs to a connected set containing 79 nodes and 102 edges. 13 sets derive this set and these 13 sets are in turn found to be derived from one set. This set and the 14 sets in its set-lineage are hence the sets identified by Algorithm 2, and these 15 sets are found to contain a total of 1816 nodes and 4177 edges. For all datasets, CSProv hence needs to recursively query only 4177 provenance triples while CCProv needs to query 2.7M triples. This leads to the improved performance of CSProv.
A data item queried for class LCLL belongs to a connected set containing 3291 nodes and 4403 edges. 4 sets derive this set and these 4 sets are found to be derived from 20 sets. These 25 sets contain a total of 44196 nodes and 60169 edges. CSProv hence needs to recursively query only 60169 triples while CCProv needs to process 2.7M triples. For the SCSL class, as a small component is not partitioned, both CCProv and CSProv recursively process 8122 triples.
GraphX: Note that the GraphX library supports graph-parallel computation on top of Spark; however, as CCProv/CSProv do not involve any graph-parallel computation, we use core Spark RDDs rather than GraphX for our implementations.
VI. Related Work
Titian [3] is the only major prior work to have looked at provenance data management and querying on Spark. However, Titian focuses on efficiently capturing provenance data in a Spark workflow. Once the data is captured, it uses the recursive querying approach to trace the lineage of a record in an RDD. In comparison, our focus is on leveraging the Spark platform for efficiently processing provenance data obtained from a workflow management system, not on capturing provenance data in a Spark workflow. We propose a novel framework for optimizing workflow provenance queries on Spark which exploits the workflow dependency graph to manage the provenance graph as a collection of weakly connected sets. As discussed, this easily beats the recursive querying approach.
A few systems focus on capturing a minimal volume of lineage data and optimizing the storage using domain properties and detailed knowledge of the transformations applied, e.g., SubZero [7] and Anand et al. [8]. Our paper is domain-agnostic and targets the black-box lineage scenario wherein the lineage service does not know the internals of the transformations/UDFs being applied. A few systems, e.g., [9, 10, 11], start with a provenance data representation wherein the transitive closure of the provenance graph (i.e., for each data item, its full provenance) is materialized and then propose techniques to reduce the storage cost. Our paper focuses on the scenario wherein the provenance data comprises provenance triples capturing lineages across individual transformations.
VII. Conclusions
We proposed a provenance framework wherein we manage the workflow provenance graph as a collection of weakly connected sets by exploiting the workflow dependency graph. The proposed approach is effective and provides significant speedups vis-à-vis existing recursive querying based methods.
References
 [1] P. Agrawal et al., “Trio: A system for data, uncertainty, and lineage,” in VLDB, 2006, pp. 1151–1154.
 [2] D. Liu and M. J. Franklin, “GridDB: A data-centric overlay for scientific grids,” in VLDB, 2004, pp. 600–611.
 [3] M. Interlandi et al., “Titian: Data provenance support in Spark,” Proc. VLDB Endow., vol. 9, no. 3, pp. 216–227, 2015.
 [4] S. Bharadwaj et al., “Creation and interaction with large-scale domain-specific knowledge bases,” PVLDB, vol. 10, no. 12, 2017.
 [5] “SEC. https://www.sec.gov/.”
 [6] “https://github.com/kwartile/connectedcomponent.”
 [7] E. Wu, S. Madden, and M. Stonebraker, “SubZero: A fine-grained lineage system for scientific databases,” in ICDE, 2013, pp. 865–876.
 [8] M. K. Anand et al., “Efficient provenance storage over nested data collections,” in EDBT, 2009, pp. 958–969.
 [9] A. Chapman et al., “Efficient provenance storage,” in SIGMOD, 2008.
 [10] Y. Chen et al., “An efficient algorithm for answering graph reachability queries,” in ICDE, 2008, pp. 893–902.
 [11] H. V. Jagadish et al., “A compression technique to materialize transitive closure,” ACM Trans. on Database Systems, vol. 15, no. 4, 1990.