# Smoke: Fine-grained Lineage at Interactive Speed

Data lineage describes the relationship between individual input and output data items of a workflow, and has served as an integral ingredient for both traditional (e.g., debugging, auditing, data integration, and security) and emergent (e.g., interactive visualizations, iterative analytics, explanations, and cleaning) applications. The core, long-standing problem that lineage systems need to address---and the main focus of this paper---is to capture the relationships between input and output data items across a workflow with the goal to streamline queries over lineage. Unfortunately, current lineage systems either incur high lineage capture overheads, or lineage query processing costs, or both. As a result, applications, that in principle can express their logic declaratively in lineage terms, resort to hand-tuned implementations. To this end, we introduce Smoke, an in-memory database engine that neither lineage capture overhead nor lineage query processing needs to be compromised. To do so, Smoke introduces tight integration of the lineage capture logic into physical database operators; efficient, write-optimized lineage representations for storage; and optimizations when future lineage queries are known up-front. Our experiments on microbenchmarks and realistic workloads show that Smoke reduces the lineage capture overhead and streamlines lineage queries by multiple orders of magnitude compared to state-of-the-art alternatives. Our experiments on real-world applications highlight that Smoke can meet the latency requirements of interactive visualizations (e.g., <150ms) and outperform hand-written implementations of data profiling primitives.

## Authors

• 6 publications
• 18 publications
• ### GeoBlocks: A Query-Driven Storage Layout for Geospatial Data

City authorities need to analyze urban geospatial data to improve transp...
08/21/2019 ∙ by Christian Winter, et al. ∙ 0

• ### IDEBench: A Benchmark for Interactive Data Exploration

Existing benchmarks for analytical database systems such as TPC-DS and T...
04/07/2018 ∙ by Philipp Eichmann, et al. ∙ 0

• ### LevelHeaded: Making Worst-Case Optimal Joins Work in the Common Case

Pipelines combining SQL-style business intelligence (BI) queries and lin...
08/25/2017 ∙ by Christopher R. Aberger, et al. ∙ 0

• ### NeuroDB: A Neural Network Framework for Answering Range Aggregate Queries and Beyond

Range aggregate queries (RAQs) are an integral part of many real-world a...
07/10/2021 ∙ by Sepanta Zeighami, et al. ∙ 0

• ### DIEL: Transparent Scaling for Interactive Visualization

We live in an era of big data and rich data visualization. As data sets ...
06/28/2019 ∙ by Yifan Wu, et al. ∙ 0

• ### Heuristic and Cost-based Optimization for Diverse Provenance Tasks

A well-established technique for capturing database provenance as annota...
04/17/2018 ∙ by Xing Niu, et al. ∙ 0

• ### Sort-based grouping and aggregation

Database query processing requires algorithms for duplicate removal, gro...
10/01/2020 ∙ by Thanh Do, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Data lineage describes the relationship between individual input and output data items of a computation. For instance, given an erroneous result record of a workflow, it is helpful to retrieve the intermediate or base records to investigate for errors; similarly, identifying output records that were affected by corrupted input records can help prevent erroneous conclusions. These operations are expressed as lineage queries over the workflow: backward queries return the subset of input records that contributed to a given subset of output records; forward queries return the subset of output records that depend on a given subset of input records.

Virtually, any application that requires an understanding over the input-output derivation process can be expressed in lineage terms. As such, data lineage has been an integral ingredient for applications such as debugging [89, 50, 46, 58, 18], data integration [22], auditing [27], security [16, 50], explaining query results [87, 88, 78, 23], data cleaning [12, 37], iterative analytics [19], and interactive visualizations [90] that highlight the importance of lineage-enabled systems.

Lineage-enabled systems answer lineage queries by automatically capturing record-level relationships throughout a workflow. A naive approach materializes pointers between input and output records for each operator during workflow execution, and follows these pointers to answer lineage queries. Existing systems primarily differ based on when the relationships are materialized (e.g., eagerly during workflow execution or lazily reconstructed when executing a lineage query), and how they are represented (e.g., tuple annotations [2, 9, 32, 44] or explicit pointers [89, 58]). Each design trades off between the time and storage overhead to capture lineage and lineage query performance. For instance, a query execution engine may augment each operator to materialize a hash index that looks up input records for a given output record, thus speeding up backward lineage query execution. However, the overhead of constructing the index can dwarf the operator execution cost by or more [89]—particularly if the operator is heavily optimized for latency or throughput.

As data processing engines become faster, an important question—and the main focus of this paper—is whether it is possible to achieve the best of both worlds: negligible lineage capture overhead as well as fast lineage query execution.

Unfortunately, current lineage systems incur either high lineage capture overhead, or high lineage query processing costs, or both. As a result, applications that could be expressed in lineage terms resort to manual implementations:

###### Example 1

Figure 1 shows two views and generated from queries over a database. Linked brushing is an interaction technique where users select a set of marks (e.g., circles) in one view, and marks derived from the same records are highlighted in the other views. Although this functionality is typically implemented manually, it can be logically expressed as a backward lineage query from selected points in to input records followed by a forward query to highlight the corresponding bars in .

To avoid the shortcomings of current lineage systems and tackle the competing requirements of lineage applications, we employ a careful combination of four design principles:

P1. Tight integration. In high throughput query processing systems, per-tuple overheads incurred within a tight loop—even a single virtual function call to write lineage metadata to a separate lineage subsystem [89, 46, 58]—can slow down operator execution by more than an order of magnitude. In response, we introduce a physical algebra that tightly integrates lineage capture into query execution, and we design simple, write-efficient data structures for lineage capture to avoid the overhead of crossing system boundaries.

P2. Apriori knowledge. Lineage applications such as debugging need to capture lineage to answer ad-hoc lineage queries that can trace back to any base or intermediate table. In applications such as interactive visualizations or data profiling we typically know the possible set of lineage queries up front. Having apriori knowledge enables us to avoid materializing lineage that does not contribute to these lineage queries.

P3. Lineage consumption. Lineage applications rarely require all results of a lineage query (e.g., all records that contributed to an aggregation result), unless the results have low cardinality. Instead, the results are filtered, transformed, and aggregated by additional SQL queries. We term these queries lineage consuming queries. If such queries are known up-front, as is typically the case for applications based on templated analysis (e.g., Tableau or Power BI), physical design logic based on these templates can be pushed into the lineage capture phase. Such physical design logic may include materialization of aggregate statistics or prune/re-partition lineage indexes to speed up future lineage consuming queries.

P4. Reuse. Finally, we have found that significant lineage capture costs arise from generating and storing unnecessary amounts of lineage data such as expensive annotations and denormalized forms of lineage. Following the concept of reusing data structures [26], we identify cases where data structures constructed during normal operator execution can be augmented and reused, but this time with the goal set to capture lineage with low overhead.

This paper presents Smoke, a lineage-enabled system that embodies the above four principles to support lineage capture and querying with low latency. More specifically, Smoke is an in-memory query compilation database engine that tightly integrates the lineage capture logic within query execution and uses simple, write-efficient lineage indexes for low-overhead lineage capture (P1). In addition, Smoke enables workload-aware optimizations that prune captured lineage and push the logic of lineage consuming queries down into the lineage capture phase (P2,P3). Finally, Smoke identifies data structures, which are constructed during normal operator execution (i.e., hash tables), and (re)uses them, whenever possible, for low-overhead lineage capture (P4).

In the rest of the paper, we start by discussing necessary background and related work (Section 2). Then, we present our techniques and contributions as follows:

• [leftmargin=*]

• We introduce a physical algebra that tightly integrates the lineage capture logic within the processing of single and multi-operator plans. Each logical operator has a dual form to both execute its logic and generate lineage. In turn, physical operators are combined to implement this extended logic. Furthermore, we design write-efficient data structures to materialize lineage with low overhead.(Section 3)

• We design a suite of simple optimizations based on the availability of future lineage consuming queries to 1) prune lineage that will not be used and 2) materialize aggregates and prune or re-partition our lineage data structures to answer lineage consuming queries faster. (Section 4)

• We conduct experiments to 1) compare Smoke with state-of-the-art lineage systems and 2) show how it can enable real-world applications (i.e., interactive visualizations and data profiling). The former experiments show that Smoke reduces the lineage capture overhead and streamlines lineage queries by multiple orders of magnitude compared to state-of-the-art lineage systems. The latter suggest that Smoke lets applications express their logic using declarative lineage constructs and also speeds up these applications—to the extent that Smoke is on par with or outperforms hand-written implementations. (Sections 5 and 6)

We conclude with open questions for future work (Section 7).

## 2 Background

In this section, we provide background on the fine-grained lineage capture problem, the approach of Smoke for this problem, and applications of focus in this paper.

### 2.1 Fine-Grained Lineage Capture

Our lineage semantics adhere to the transformational provenance semantics of [19, 32, 43] over relational queries.

Base queries. Formally, let the base query be a relational query over a database of relations that generates an output relation . An application can initially execute multiple base queries . For instance, in Figure 1 consists of two queries that generate two output relations rendered as visualization views.

Lineage and lineage consuming queries. After a base query runs, the user may issue a backward lineage query that traces from a subset of an output relation to a base table , or a forward lineage query that traces from a subset of an input relation to the query’s output relation . A lineage query results in a relation that can be used is another query which we term a lineage consuming query; a lineage query is a special case of lineage consuming queries: . Finally, itself can be used as a base query, meaning that another lineage consuming query can use as a base query.111Smoke’s query model includes multi-backward and multi-forward queries as well as refresh and forward propagation [43]. We limit the discussion to , , and in that they form the basis to express general query constructs.

###### Example 2

Let and be the base queries in Figure 1. The linked brushing interaction is expressed as a backward query from the selected circles back to the input records in that generated them. The forward lineage query retrieves the linked bars in . A lineage consuming query can then be used to change the color of the bars to red, similarly to the ones in [90].

We can model interactive visualization applications as base queries (e.g., , above) that load the initial visualization, followed by lineage consuming queries that express user interactions [90]. Therefore, optimizing the visualization responsiveness corresponds to quickly executing the base queries followed by streamlining lineage consuming queries.

Lazy and Eager lineage query evaluation. How can we answer lineage queries quickly? Lazy approaches rewrite lineage queries as relational queries over the input relations—the base queries do not incur overhead at the cost of potentially slower lineage query execution costs [43, 22, 17]. In contrast, we might Eagerly materialize data structures during base query execution to speed up future lineage queries [17, 43]. We refer to this as lineage capture, and we seek to minimize capture overhead on the base query to speed up lineage queries.

Lineage capture overview. The eager approach incurs overhead to capture the base query’s lineage graph. Logically, each edge maps an operator ’s input record to ’s output record that is derived from . Backward lineage connects tuples in the query output with tuples in each input base relation by identifying all end-to-end edges for which a path exists between the two records. Forward lineage reverses these arrows. Materializing such end-to-end forward and backward lineage indexes can help speed up lineage consuming queries.

We will present techniques that can efficiently capture lineage indexes in a workload-agnostic setting by carefully instrumenting operator implementations, and in a workload-aware setting by tailoring the indexes for future lineage consuming queries if they are known up-front. In general, lineage capture techniques fall into two categories: logical and physical.

Logical lineage capture. This class of approaches stay within the relational model by rewriting the base query into , so that its output is annotated with additional attributes of input tuples. Some systems [2, 18] generate a normalized representation of the lineage graph such that a join query between and each base relation can create the lineage edges between and . The correct output relation can be retrieved by projecting away the annotation attributes from . Alternative approaches [32, 18] output a single denormalized representation that extends with attributes of the input relations. Recent work has shown that the latter rewrite rules (Perm [32]) and optimizations leveraging the database optimizer (GProm [66]) incurs lower capture overheads than the former normalized approach.

Although these approaches can run on any relational database and benefit from the database optimizer, they suffer from several performance drawbacks. The normalized representation requires expensive independent joins when running lineage queries. The denormalized representation can incur significant data duplication—an aggregation output computed over input records will be duplicated —and require further projections to derive from . Furthermore, indexes are needed to speed up lineage queries.

Physical lineage capture. This approach directly instruments physical operators write lineage edges to a lineage subsystem through an API; the subsystem stores and indexes the edges and answers lineage queries [58, 46, 89, 44, 45]. This can support black-box operators and decouples lineage capture from its physical representation. However, we find that virtual function calls alone (ignoring the cross-process overheads) can slow data-intensive operators by up to . Further, lineage capture cannot easily leverage and be co-optimized with base query execution.

### 2.2 Approach of Smoke

To this end, we introduce Smoke, an in-memory database engine that avoids the drawbacks of logical and physical approaches. Smoke improves upon logical approaches by physically representing the lineage edges as read- and write-efficient indexes instead of relationally-encoded annotations. We improve upon physical approaches by introducing a physical algebra that tightly integrates lineage capture and relational operator logic to avoid API calls and in a way amenable to co-optimization. Finally, Smoke can exploit knowledge of lineage consuming queries to (a) prune or partition lineage indexes or (b) materialize views during the base query execution that will benefit the lineage consuming queries.

The primary focus of this work is to explore mechanisms to instrument physical operator plans with lineage capture logic. To do so, we have implemented Smoke as a query compilation execution engine using the produce-consumer model [65] (Figure 2). It takes as input the base query and an optional workload of lineage consuming queries ; parses and optimizes to generate a physical query plan; instruments the plan to directly generate indexes to speed up backward and forward lineage queries; and compiles the instrumented plan into machine code that, when executed, generates as well as lineage indexes. Internally, Smoke

uses a single-threaded, row-oriented execution model, and leverages hash-based operator implementations that are widely used in fast query engines and are amenable to low-overhead lineage capture. Execution models that use advanced features such as compression or vectorization are interesting future work.

### 2.3 Lineage Applications

Many applications logically rely on lineage (and generally provenance), including but not limited to: debugging [89, 50, 46, 58, 18], diagnostics [83], data integration [22], security [16, 50], auditing [27] (the recent EU GDP regulation [27] mandates tracking lineage), data cleaning [12, 37], explaining query results [87, 88, 78, 23]

, debugging machine learning pipelines

[91, 54], and interactive visualizations [90].

Unfortunately, there is a disconnect between modeling applications in terms of lineage, and the performance of existing lineage capture mechanisms—the overhead is enough that applications resort to manual implementations instead. For this reason, we center the paper around interactive visualizations: it is a domain that can directly translate to lineage [40, 90], yet is dominated by hand-written implementations. Furthermore, it imposes strict latency requirements on lineage capture (to show the initial visualization) and lineage consuming queries (to respond to user interactions). Finally, our experiments seek to argue, using visualization and data profiling applications, that lineage is not only an elegant logical description of many use cases, but can be on a par with or even improve on performance compared to hand-tuned implementations.

## 3 Fast Lineage Capture

This section describes lineage capture without knowledge of the future workload. We present (a) lineage index representations to map output-to-input or input-to-output record ids (rids) that are read- and write-efficient, and (b) a physical algebra that tightly integrates the lineage capture logic with the base query execution. To do so, we will describe how to instrument individual as well as multiple operators to capture lineage with low overhead.

### 3.1 Lineage Index Representations

Smoke uses two main lineage index representations. Figure 3 illustrates the input and output relations and , respectively, and the two -based representations for 1-to-N and 1-to-1 operators. We index rids because the lineage indexes are cheap to write, and lookups—which simply index into the relation’s array—are fast. In contrast, indexing full tuples incurs high write costs, while indexing primary keys is not beneficial without building primary key indexes apriori or when the primary keys are wide. Furthermore, in-memory columnar engines [1, 28] already create rid lists as part of query processing that resemble our lineage indexes, and enable reuse opportunities to reduce lineage capture costs.

Rid Index. 1-to-N relationships are represented as inverted indexes. Consider the backward lineage of GROUPBY. The index’s entry corresponds to the output group, and points to an rid array containing rids of the input records that belong to the group. The Rid Index is used for 1-to-N forward lineage relationships as well, such as the JOIN operator. Following high performance libraries [29], the index and rid arrays are initialized to 10 elements and grow by a factor of on overflow. Our experiments show that array resizing dominates lineage capture costs, and statistics that allow Smoke to pre-allocate appropriate sized arrays can reduce lineage capture costs by up to . To avoid offline statistics computation, we show how useful statistics can be collected during query processing.

Rid Array. 1-to-1 relationships between output and input are represented as a single array. Each entry is an input record rid rather than a pointer to an rid array.

### 3.2 Single Operator Instrumentation

We now introduce instrumentation techniques to generate lineage indexes when executing single and multi-operator plans. Our designs are based on two paradigms: Defer defers portions of the lineage capture until after operator execution while Inject incurs the full cost during execution. Defer is preferable when the base query execution overhead must be minimized, or when it is possible to collect cardinality statistics during base query execution to allocate appropriately sized lineage indexes and avoid resizing costs. In contrast, Inject typically incurs lower overall overhead, but the client needs to wait longer to retrieve the results of the base query.

We now describe how both paradigms illustrate the application of the tight integration and reuse principles from the Introduction for core relational operators. Our focus is on the mechanisms and Section 7 discusses future work on automatically choosing between these paradigms. Code snippets of the compiled code and details for additional operators (including , , , , , ) are covered in our technical report [75]. Section 3.3 extends our support to multi-operator plans.

#### 3.2.1 Projection

Projection under bag semantics does not need lineage capture because the input and output order and cardinalities are identical—the rid of an output (input) record is its backward (forward) lineage. Projection with set semantics is implemented using grouping, and we use the same mechanism as that for group-by aggregation below.

#### 3.2.2 Selection

Selection is an if condition in a for loop over the input relation, and emits a record if the predicate evaluates to true [64]. Both forward and backward lineage use rid arrays; the forward rid array can be pre-allocated based on the cardinality of the input relation. Inject adds two counters, and , to track the rids of the current input and output records, respectively. If a record is emitted, we set the element of the forward rid array to , and append

to the backward rid array. Available selectivity estimates can be used to pre-allocate the backward rid array and avoid reallocation costs during the append operation. We don’t implement

Defer because it is strictly inferior to Inject.

#### 3.2.3 Group-By Aggregation

Query compilers decompose GROUPBY into two physical operators: builds the hash table that maps group-by values to the group’s intermediate aggregation state; scans the hash table, finalizes aggregation results for each group, and emits output records. Figure 4 shows the plans for both instrumentation paradigms; the lineage indexes consist of a forward rid array and a backward rid index.

Defer: Consider the Defer plan in Figure 4.a. for Defer extends to store an number to each group’s intermediate aggregation state. When scans the hash table to construct the output records, it uses a counter to track the output record’s rid and assign it to the group’s value (i.e., tracks the output rid of the group in the result). Smoke then pins the hash table in memory. At a later time, can scan each record in , reuse the hash table to probe and retrieve the associated group’s , and populate the backward rid index and forward rid array.

Although Defer must scan twice, the operator’s input and output cardinalities are used to avoid resizing costs during . Also, can be freely scheduled (e.g., immediately after or during user think time when system resources are free).

Inject: Consider the Inject plan in Figure 4.b. this time augments each group’s intermediate state with an rid array that contains the rids of the group’s input records (i.e., backward lineage). tracks the current output record id in order to set the pointer in the backward index to the bucket’s rid list and the values in the forward rid list. Since knows the input and output cardinalities, it can correctly allocate arrays for the backward and forward indexes. The primary overhead is due to reallocations of during the build phase in . We find that knowing group cardinalities can decrease the lineage capture overhead up to 60%.

#### 3.2.4 Join

Smoke instruments hash joins in a similar way as hash aggregation. A hash join is split into two physical operators: builds the hash table on the left relation , and uses each record of the right relation to probe the hash table. We now introduce Inject and Defer techniques for lineage capture that can be used for general M:N joins, and further optimizations for primary key-foreign key (pk-fk) joins. Smoke generates backward rid arrays and forward rid indexes (an input record can generate multiple join results).

Inject: Consider the Inject plan for joins in Figure 4.d. The build phase augments each hash table entry with an rid array that contains the input rids from for that entry’s join key. The probe phase tracks the for each output record, and populates the forward and backward indexes as expected. Note that output cardinalities are not yet known within the phase and we cannot pre-allocate our lineage indexes. As a result, although the backward rid array is relatively cheap to resize, the forward rid indexes can potentially trigger multiple reallocations if an input record has many matches and penalize the performance.

Defer: Our main observation is that exact cardinalities needed to pre-allocate the forward rid indexes are known after the probe phase and can be used by Defer. Note that deferring the whole construction after the probe phase is similar to the logical approaches, and incurs the cost of re-running the join, or annotating, indexing and projecting the join output. Defer instead augments Inject by partially deferring index construction for the left input relation (see Figure 4.c).

The build phase adds a second rid list to the hash table entry, in addition to from Inject. When is scanned during the probe phase, its output records are emitted contiguously, thus need only store the rid of the first output record for each match with a record. After the phase, the forward and backward indexes for the left relation can then be pre-allocated and populated in a final scan of the hash table ( in Figure 4.c). Deferring for is also possible, however the benefits are minimal because we need to partition the output records for each hash table entry by the records that it matches, which we found to be costly.

Further optimizations. If the hash table is constructed on a unique key, then the do not need to be arrays and can be replaced with a single integer. Also, if the join is a primary-key foreign-key join, the forward index of the foreign-key table is an rid array; since the join cardinality is the same as the foreign-key table cardinality, backward indexes are pre-allocated. Finally, join selectivity estimates can help pre-allocate the forward rid indexes.

### 3.3 Multi-Operator Instrumentation

The naïve way to support multi-operator plans is to individually instrument each operator to generate its lineage indexes; lineage queries can use the indexes to trace backward or forward through the plan. This approach is correct and can be used to support any DAG workflow composed of our physical operators. However, it unnecessarily materializes all intermediate lineage indexes even though only the lineage between output and input records are strictly needed.

We address this issue with a technique that 1) propagates lineage information throughout plan execution so that only a single set of lineage indexes connecting input and final output relations are emitted, and 2) reduces the number of lineage index materialization points in the query plan.

To propagate lineage throughout plan execution, consider a two-operator plan with input relation . When runs, it will use ’s backward lineage index to populate its own lineage index with rids that point to rather than the intermediate relation ; ’s lineage indexes can be garbage collected when not needed further.

To reduce lineage index materialization points, recall that database engines pipeline operators to reduce intermediate results by merging multiple operators into a single pipeline [64]. Operators such as building hash tables are pipeline breakers because the input needs to be fully read before the parent operator can run. Within a pipeline there is no need for lineage capture, but pipeline breakers need to generate lineage along with the intermediate result. In Section 3.2, we showed how pipeline breakers (e.g., hash table construction for the left-side of joins and group-by aggregations) can augment the hash tables with lineage. Parent pipelines that use the same hash-tables for query evaluation (e.g., cascading joins) can also use the lineage indexes embedded in the hash tables to implement the lineage propagation above.

Implementation Details Our engine supports naive instrumentation for arbitrary relational DAG workflows, and we focused our optimizations for SPJA query blocks composed of pk-fk joins. This was to simplify our engineering, and because fast capture for SPJA blocks can be extended to nested blocks by using the propagation technique above. We focus on pk-fk joins due to their prevalence in benchmarks and real-world applications, and because the Inject and Defer instrumentation for pk-fk joins are identical (Section 3.2). Thus, the main distinction between Inject and Defer for SPJA blocks is how the final aggregation operator in the block is instrumented—the joins are instrumented identically, while select and project are pipelined. Details are in our technical report [75].

Lineage applications such as interactive visualizations will often support a pre-defined set of interactions (e.g., filter, pan, tooltip, cross-filter [20]) that amount to a pre-declared lineage consuming query workload . This section describes simple but effective optimizations that exploit knowledge of to avoid capturing lineage that is not queried, and generate lineage representations that directly speed up queries in . To simplify the discussion, we will center each optimization around different classes of lineage consuming queries over the base query .

### 4.1 Instrumentation Pruning

Instrumentation pruning disables lineage capture if the lineage indexes will not be used by . We present two types of pruning that do not generate lineage for specific input relations, and for backward/forward lineage.

Pruning input relations. Suppose the visualization only supports a tooltip interaction that shows detailed lineitem information when the user hovers over a visualization mark. This is expressed as a backward lineage query to lineitem. In this case, we can avoid capturing lineage for the orders table. In general, Smoke does not capture lineage for any relation not referenced in .

Pruning lineage direction. Extending the previous example, it is clear that will only execute a backward lineage query to lineitem and not vice versa. Thus, Smoke can also avoid generating the forward lineage index from lineitem to the base query output. The lineage indexes that can be pruned is evident from the lineage consuming queries in .

### 4.2 Push-Down Optimizations

User facing applications rarely present a large set of query results to the users—instead they will reduce the result cardinality with further filter, transform, and/or aggregation operations. This is also the case for lineage consuming queries, and presents opportunities to push these reduction operations into the lineage capture logic. We present three simple push-down optimizations for fixed filter predicates, templated predicates, and aggregation operations, and then discuss the relationship between push-down optimizations and common provenance semantics.

Selection push-down. Visualizations often update metrics that summarize data that the user selects. For instance, the following query retrieves Christmas shipment order information for parts of the visualization that the user interacts with: . This optimization pushes the predicate shipdate=‘xmas’ into the instrumented base query; before Smoke populates the backward lineage indexes, it checks whether the input tuple satisfies the predicate. If the predicate is on a GROUPBY key, Smoke can also avoid any lineage capture overhead for all other groups. This technique reduces lineage space requirements, and typically reduces capture overhead. However, if the predicate is expensive to evaluate (e.g., slow UDF), it is possible to introduce more capture overhead.

Data skipping using lineage. Push down selections require a static predicate, however interactive visualizations also use parameterized predicates. For instance, the user may use a slider to dynamically specify the filtered shipping date: . This pattern is ubiquitous in interactive visualizations and applies to faceted search, cross-filtering, zooming, or panning. Smoke pushes the parameterized predicate into lineage capture by partitioning the rid arrays (standalone and part of rid indexes) by the predicate attribute. For instance, Smoke would partition the rid arrays in the backward index for orders by the shipdate attribute, so that only reads the rid partition matching the parameter :p1. This technique is applicable to categorical attributes and continuous attributes that can be discretized. This is almost always possible because user-facing output is ultimately discretized at pixel granularity [48].

Group-by push-down. Interactions such as cross-filtering let users select marks in one view, trace those marks to the input records that generated them, and recompute the aggregation queries in other views based on the selected subset of input records. This pattern is precisely an aggregation query over the backward lineage of the user’s selection. Smoke pushes the group-by aggregation into lineage capture by partitioning the rid arrays on the group-by attributes, and incrementally computing the intermediate aggregation state. This technique works if the main difference between the base and lineage consuming query is the addition of grouping attributes. In effect, lineage capture generates data cubes to answer the linage consuming aggregation query. In contrast to building data cubes offline, which requires separate scans of the database, this approach piggy-backs on top of the base query’s existing table scans. As with prior work [34, 57, 38], this technique supports algebraic and distributive functions (e.g., SUM, COUNT, and AVG), and we evaluate this optimization extensively in synthetic (Section 6.4) and real-world settings (Section 6.5.1).

Relationships with Provenance Semantics. We observe that popular provenance semantics (e.g., which[22, 81] and why[10] provenance) can be expressed as lineage consuming queries and pushed down using the above optimizations. In other words, Smoke can operate as a system with alternative provenance semantics depending on the given lineage consuming query. For space reasons, we include a brief discussion in our technical report [75].

Applying Optimizations. Choosing the appropriate optimizations, manually or automatically, each poses challenges that we leave to future work. “What language extensions (e.g., CREATE BACKWARD INDEX ON (SELECT …)) are needed to capture lineage manually?” and “What cost models are needed to model the trade-offs for each capture and optimization choice?” constitute interesting research questions.

## 5 Experimental Settings

Our experiments seek to show that Smoke (1) incurs significantly lower lineage capture overhead than logical and physical lineage capture approaches, (2) can execute lineage queries faster than lazy, logical, and physical lineage query approaches, and (3) can leverage lineage indexes and workload-aware optimizations to speed up real-world applications as compared to current manual implementations.

To this end, we compare Smoke to state-of-the-art logical and physical lineage capture and query approaches using microbenchmarks on single operator plans, as well as end-to-end evaluations over a subset of TPC-H queries. Using TPC-H, we further show that our workload-aware optimizations can provide further lineage query speedups on the “Overview first, zoom and filter, and details on demand” interaction paradigm and respond within interactive latencies of  [56, 11, 61]. Finally, we express two real-world applications (cross-filter [20] and data profiling [83]) in lineage terms and show that Smoke can match or outperform hand-optimized implementations of the same applications.

Data. The microbenchmarks use a synthetic dataset of tables zipf(id,z,v)

containing zipfian distributions of varying skew.

z is an integer that follows a zipfian distribution and v

is a double that follows a uniform distribution in

. controls the zipfian skew, is the table size, and specifies the number of distinct values (i.e., groups). Tuple sizes are small to emphasize worst-case lineage overheads. End-to-end and workload-aware experiments use the TPC-H data generator and vary the scale factor. Our experiments on real-world applications use the Ontime [68, 67] (123.5m tuples, 12GB) and Physician [74] (2.2m tuples, 0.6GB) datasets.

To ensure a fair comparison, we implement and optimize alternative, state-of-the-art techniques in our query engine. Our implementation reduces the capture overheads (by several orders of magnitude) as compared to their original implementations, and is detailed in our extended report [75].

First, we describe the compared lineage capture techniques (see also Table 1 for a minimal description):

Smoke techniques. Smoke-I and Smoke-D instrument the plan using Inject and Defer instrumentation (Section 3). Unless otherwise noted, Smoke-I and Smoke-D don’t use optimizations from Section 3. Baseline evaluates base queries on Smoke without capturing lineage.

Baseline logical techniques. State-of-the-art logical approaches (Perm [32], GProm [66]) use query rewrites to annotate the base query output with lineage. However, they are built on production databases that incur unneeded capture overheads from e.g., transaction and buffer managers, lack of hash-table reuse, no query compilation. For this reason, we used Perm’s rewrite rules (and whenever applicable, GProm’s optimizations) to generate physical plans that annotate the output with either rids (Logic-Rid) or full input tuples (Logic-Tup), and implemented the plans in Smoke. Note that the output relation needs to be indexed to support fast lineage lookups. To this end, Logic-Idx scans the annotated output relation to construct the same end-to-end lineage indexes as those created by Smoke. Ultimately, our implementations are two orders of magnitude faster than Perm and GProm but still incur capture overheads higher than Smoke.

Baseline physical techniques. To highlight the importance of tightly integrating lineage capture and operator logic, we use two baseline physical techniques. Phys-Mem instruments each operator to make virtual function calls to store input-output rid pairs in Smoke lineage indexes from Section 3, which highlights the overhead of making a virtual function call for each lineage edge. Phys-Bdb instead indexes lineage data in BerkeleyDB to showcase the drawbacks of using a separate storage subsystem [89].

Moreover, we compare lineage querying techniques based on data models and indexes induced during lineage capture:

Lineage consuming queries. Smoke-I, Smoke-D, Logic-Idx, and Phys-Mem all capture the same lineage indexes from Section 3.1, thus their lineage consuming query performance will be identical. We call this group Smoke-L. We compare with a baseline lazy approach, Lazy, which uses standard rules [22, 43] to rewrite lineage consuming queries into relational queries that scan the input relations. We also compare with the data model that Logic-Rid and Logic-Tup produce and the indexes that Phys-Bdb generate. Finally, we consider Lazy and Smoke without optimizations as baselines to our workload-aware optimizations.

Settings for the real-world applications are provided inline.

Measures. For lineage capture, we report the absolute base query latency and relative overhead compared to not capturing lineage. For lineage and lineage consuming queries, we report absolute latency and speedup over baselines. All numbers are averaged over 15 runs, after 3 warm-up runs.

Platforms. We ran experiments on a MacBook Pro (macOS Sierra 10.12.3, 8GiB 1600MHz DDR3, 2.9GHz Intel Core i7), and a server-class machine (Ubuntu 14.04, 64GiB 2133MHz DDR4, 3.1GHz Intel Xeon E5-1607 v4). Both architectures have caches sizes 32KiB L1d, 32KiB L1i, and 256KiB L2—the MacBook has 4MiB L3 and the server-class has 10MiB L3. Our overall findings for lineage capture are consistent across the two architectures. Since lineage capture is write-intensive, we report results using the lower memory bandwidth setting (MacBook). For the crossfilter application, we report the server-class results because the Ontime dataset doesn’t fit in the MacBook memory.

## 6 Experimental Results

In this section, we first compare lineage capture techniques on microbenchmarks (Section 6.1) and TPC-H queries (Section 6.2). Then, we compare techniques on lineage query evaluation (Section 6.3) and showcase the impact of our workload-aware optimizations (Section 6.4). We conclude with experiments on real-world applications (Section 6.5).

### 6.1 Single Operator Lineage Capture

We first evaluate lineage capture with a set of single operator microbenchmarks for group-by (Section 6.1.1), pk-fk joins (Section 6.1.2), and m:n joins (Section 6.1.3).

#### 6.1.1 Group-by Aggregation

We use the following base query, which groups by drawn from a zipfian distribution so that cardinalities are skewed (). Visualizations often compute multiple statistics to avoid redundant scans when users ask for new statistics [82]:

= SELECT    z, COUNT(*), SUM(v), SUM(v*v),
SUM(sqrt(v)), MIN(v), MAX(v)
FROM      zipf
GROUP BY  z -- #groups follow a zipfian

Figure 5 reports the lineage capture latency (base query cost + overhead) for each instrumentation technique, and varies the input size (columns) and the number of groups (rows).

Smoke. Smoke-I incurs the lowest overhead among techniques ( on average). Smoke-D is slightly slower ( on average) due to the cost of join to construct lineage indexes.

Comparison with Logical systems. Logic-Rid and Logic-Tup use Perm’s aggregation rewrite rule, which computes to derive the denormalized lineage graph as a single relation. The cost of computing and writing the denormalized lineage graph is costly and can slow the base query by multiple orders of magnitude. Since zipf is narrow, Logic-Tup performs similarly to Logic-Rid, however we expect the cost to increase for wider input relations. Logic-Idx has extra indexing costs over Logic-Rid and is not plotted.

Comparison with Physical systems. The primary overhead for Phys-Mem is the cost of a virtual function call for each written lineage edge. The cost of building index data structures is comparable to Smoke’s write costs, however Smoke can reuse the hash table built by and incur lower costs for building the backward lineage rid index. Phys-Bdb incurs by far the highest overhead (up to slowdown), due to the overhead of communicating with BerkeleyDB. The same trends hold for the other operators, and we have not found physical approaches to be competitive. As such, we do not report physical approaches in the rest of the experiments.

Varying dataset size, skew, and groups. In general, the lineage capture techniques all incur a constant per input tuple overhead, and differ on the constant value. This is why increasing the input relation size increases costs linearly for all techniques. Increasing the number of groups increases the costs of building and scanning the group-by hash table as well as the output cardinality, and affects all techniques including the baseline. We find that the overhead is independent of the zipfian skew because it does not change the number of lineage edges that need to be written; it does affect querying lineage as we will see in Section 6.3.

Complexity of group-by keys and aggregate functions. We find that the techniques differ in their sensitivity to the size of the group-by keys and the number of aggregation functions in the project clause of the query. Smoke-I simply generates rid index and rid arrays, and is not affected by these characteristics of the base query. In contrast, Smoke-D and both logical approaches are sensitive to the size of the group-by keys, since they are used to join the output and input relations. Finally, the logical approaches are also affected by the number of aggregation functions because they affect the cost of the final projection. In short, we believe our setup is favorable to alternative approaches, and find that Smoke still shows substantial lineage capture benefits.

Cardinality Statistics. If Smoke knows the cardinality statistics for each group, then it can allocate correctly sized arrays in the lineage indexes (Section 3). This further reduces the lineage capture overhead by 52% on average and leads to overhead reduction from to for Smoke-I (not plotted).

#### 6.1.2 Primary-Foreign Key (Pk-Fk) Joins

We use the following primary-foreign key join query: SELECT * FROM gids,zipf WHERE gids.id=zipf.z. zipf.z is a foreign key that references gids.id and drawn from a zipfian distribution so that some keys are more popular than others. We vary the number of join matches (i.e., groups) by varying the number of unique values for gids.id. In addition to Baseline and Smoke-I, we evaluate Smoke-I-TC, which assumes that we know the number of matches for each join attribute value—this is to highlight the costs of array resizing. Note that Smoke-D is equivalent to Smoke-I due to the pkfk optimization (Section 3.2.4). We compare against Logic-Idx because Logic-Rid and Logic-Tup do not support forward queries without additional indexes.

Comparison with logical techniques. Logic-Idx incurs 1.4 relative overhead on average due to the costs of computing and materializing the denormalized lineage graph in the form of the annotated output relation, and scanning the annotated table to build backward and forward lineage indexes for both input relations. In contrast, Smoke-I incurs on average overhead; knowing join cardinalities reduces the overhead to on average. Finally, note that Smoke-I already knows the cardinalities for the backward indexes and the forward index of the right table for pkfk joins (Section 3.2.4), thus Smoke-I-TC’s lower overhead can be attributed to lower reallocation costs to build the forward index for the left table.

#### 6.1.3 Many-to-Many Joins

For M:N joins we use the following base query that performs a join over two z attributes with zipfian distributions: SELECT * FROM zipf1,zipf2 WHERE zipf1.z=zipf2.z. The join is highly skewed to stress lineage capture. Both z integer attributes are drawn from a zipfian distribution, where zipf1.z is within or , while zipf2.z. This means that tuples with have a disproportionate number of matches, whereas larger will have few matches. Furthermore, we fix the size of the left table to 1000 records and vary the right from to .

Section 3.2.4 described the Inject approach, which populates the lineage indexes within the join’s probe phase (), and the Defer approach, which computes cardinality statistics during the probe phase to correctly allocate and populate the lineage indexes for the left table (a_fw,a_bw) after the probe phase and avoid array resizing costs. Finally, to break down the benefits of Defer, we also evaluate Smoke-D-DeferForw which still defers a_fw but populates a_bw within the phase. To simplify the presentation, we only report Smoke-based techniques since the relationship with alternatives is consistent with the previous results.

Comparison of Smoke techniques. In contrast to the other operators, the MN join over skewed inputs is similar to a cross-product, and generates billion results. Materializing the result set renders the capture overheads non-informative so we do not materialize the output. In this case, the MN execution is nearly 0, so Figure 7 primarily reports instrumentation overhead for the three techniques. The overhead is predominantly due to rid array resizing, which is why deferring the forward and backward lineage index construction for the left table reduces overhead by up to . Finally, increasing the number of groups for zipf1.z reduces the costs of all techniques because the output cardinality is smaller, but their relative overheads are the same.

Other Operators. Our technical report [75] describes additional results and operators. The main additional finding is that it is preferable to overestimate selection cardinality estimates to avoid array resizings for the selection operator.

### 6.2 Multi-Operator Lineage Capture

We used four queries from TPC-H—Q1, Q3, Q10, and Q12. Their physical query plans contain group by aggregation as the root operator, selections that vary in predicate complexity and selectivity, and up to three pk-fk joins. (Our hash-based execution precludes sort operations.)  Figure 8 summarizes the relative overhead of the best performing Smoke (i.e., Smoke-I) and logical (i.e., Logic-Idx) techniques for the four TPC-H queries. We use scale factor 1.

Overall Results. Smoke-I reduces the lineage capture overhead as compared to Logic-Idx by up to . In addition, Smoke-I incurs at most overhead across the four queries. To make the overhead results meaningful, we have tried to ensure that Smoke query engine has respectable performance—despite row-oriented execution it matches the performance of single-threaded MonetDB—non-instrumented Q1 runs in ms, while the slowest query Q12 runs in ms.222The purpose is not to compare Smoke with MonetDB, but to suggest that the reported overheads are with respect to reasonable baseline performance. Smoke-D (not shown) is slower than Smoke-I due to the join cost between the input and output relations, however it is still faster than the logical approaches. Finally, although Q1 is simple (e.g., it has no joins), its results are arguably the most informative because the plan is simple and has the highest selectivity, which most stresses lineage capture.

Impact of selections in lineage capture. We found that the selectivity of the query predicate has a large impact on the overhead of the logical approaches. Q1 shows a setting where the predicate has a high selectivity, thus the input to the subsequent aggregation operator has a high cardinality, each output group depends on a larger set of input records, and ultimately leads a large amount of data duplication to create the denormalized lineage graphs. In contrast, the other queries have low predicate selectivity, thus the cardinality of the subsequent aggregation operator is small and leads to a substantially smaller lineage graph. Smoke is less sensitive to this effect because its lineage indexes represent a normalized lineage graph and avoids this duplication.

Lineage Capture Takeaways (Sections 6.1 and 6.2): Smoke-based techniques outperform both logical and physical approaches by up to two orders of magnitude: Logical approaches that adhere to the relational model are affected by the denormalized lineage graph representation, extra indexing steps, and expensive joins. The physical approaches are affected by virtual function calls and write-inefficient lineage indexes. We find that array resizing contributes to a large portion of Smoke overheads, and accurate or overestimated cardinality estimates can reduce resizing costs.

### 6.3 Lineage Query Performance

We now evaluate the performance of different lineage query techniques. Recall that lineage queries are a special case of lineage consuming queries. We evaluate the query: SELECT * FROM ozipf, zipf, where is the query used in the group-by microbenchmark (Section 6.1.1). is an output record (a group). zipf contains 5000 groups, records, and we vary its skew . Varying highlights the query performance with respect to the cardinality of the backward lineage query.  Figure 9 reports lineage query latency for all possible assignments and different values.

Smoke-L evaluates the lineage query using a secondary index scan—it probes the backward index, and uses the input rids as array offsets into zipf. Recall that Smoke-L refers to any of Smoke-I, Smoke-D, Logic-Idx, or Phys-Mem (Section 5).

Smoke-L vs Lazy. In contrast to Smoke-L, Lazy performs a table scan of the input relation and evaluates an equality predicate on the integer group key. This is arguably the cheapest predicate to evaluate and constitutes a strong comparison baseline. We find that Smoke-L outperforms Lazy up to five orders of magnitude, particularly when the cardinality of the output group is small. We expect the performance differences to grow when the base query uses more complex group keys that increase the predicate evaluation cost [47, 14, 52], or when the input relation is wide, which increases scan costs. There is a cross over point when the input relation is highly skewed () and the backward lineage of some groups have high cardinality. This increases Smoke-L’s secondary index scan cost as compared to a linear table scan that can benefit from scan pre-fetching.

Smoke-L vs Logical Approaches. We also report the cost of scanning the annotated relations generated by Logic-Rid and Logic-Tup (highest two lines). Scanning these relations to answer lineage queries is worse than Lazy because the annotated relation is wider than the input relation, yet they have the same cardinality. Note that indexing the annotated relation is Logic-Idx, and represented by Smoke-L.

Smoke-L vs Physical Approaches. Phys-Mem is included as part of Smoke-L, so we report Phys-Bdb. Using an external lineage subsystem to perform a lineage query, we need to perform function calls to the external system to fetch the input rids for an output. As long as we have the input rids, we can perform a secondary index scan to evaluate the lineage query similarly to Smoke-L. In our experiments, we compare both fetching all input rids in a single function call, and with consecutive function calls to fetch the rids in a cursor-like fashion. The cursor-like approach outperformed the bulk approach since it avoids allocation costs for input rids. Smoke-L provides a lower bound for Phys-Bdb: both perform the same secondary index scan but Phys-Bdb pays the cost of function calls to the external lineage subsystem, and has worse lineage index read performance due to the B-Tree of BerkeleyDB.

Lineage Query Takeaways: Smoke outperforms logical and lazy lineage query evaluation strategies by multiple orders of magnitude, especially for low-selectivity lineage queries. We believe Smoke is a lower bound for physical approaches by avoiding functions calls and using read-efficient indexes.

We explore the effectiveness of the data skipping and group-by push-down optimizations by incrementally building up an example motivated by the “Overview first, zoom and filter, details on demand” [80] interaction paradigm. We focus only on zoom and filter because the base query generates the initial overview, while details on demand is the simple backward lineage query evaluated in Section 6.3. We report selection push-down and pruning in our technical report [75].

We use TPC-H Q1 as the initial “Overview” base query, and we render its output groups as a bar chart; there are four bars each generated from 48%, 24%, 24%, and 0.06% of the Lineitem input relation. Subsequent user interactions (zoom by drilling down, filter by adding predicates) will be expressed as lineage consuming queries that incrementally modify its preceding lineage consuming query.

No Optimization. We start off by evaluating the effectiveness of using lineage indexes (without optimizations) as compared to the lazy approach for lineage consuming queries (not plotted). Suppose the user will be interested in drilling into a particular bar to see the statistics broken down by month and year of the shipping date. This can be modeled as a lineage consuming query that extends Q1 in two ways: (1) replace the input relation with the backward lineage of the bar that the user is interested in (i.e., Lineitem Lineitem) and (2) add Month, Year to the GROUP BY.

We evaluate for every value of . Lazy runs Q1 as a table scan followed by filtering on Q1’s group by keys, grouping on year and month, and computing the same aggregates as Q1. Smoke-I is best when the group cardinality is low ( selectivity) and outperforms Lazy by 6.2. Higher cardinality groups incur random seek overheads. The performance converges for high cardinality because the performance is dominated by the query processing costs (i.e., aggregation in this case). To address the overhead of high cardinality lineage queries, we next evaluate workload-aware optimizations.

Data skipping. Suppose we know that the user wants to filter the result of (say, based on interactive filter widgets), then we can push this logic into lineage capture using the data skipping optimization. We evaluate Q1, which extends Q1 with two parameterized predicates: l_shipmode = :p1 AND l_shipinstruct = :p2. Q1 is the base query for Q1. To exercise push-down overheads, both are text attributes and thus more expensive to evaluate than for numeric attributes. The lineage capture overhead was 0.22 for Smoke-I, and 1.65 with the data skipping optimization due to the additional cost of partitioning the rid arrays on the text attributes, but still lower than logical approaches (Figure 8).

Figure 10 plots the lineage consuming query latency vs the selectivity of every possible combination of the predicate parameters. The Lazy baseline executes the lineage consuming query as a filter-groupby query over a table scan of Lineitem. Although lineage indexes substantially reduce query latency (No Data Skipping in Figure 10)—particularly for low predicate selectivities—it is bottlenecked by the secondary scan costs of backward lineage for high cardinality groups. In contrast, data skipping reduces even high selectivity queries by at least compared to Lazy, and is consistently below the interactive threshold [56]. This is because rid arrays are partitioned by l_shipmode, l_shipinstruct, so that the lineage query only performs an indexed scan over the rids needed to answer the query.

Aggregation push down. After users filter and identify interesting statistics from the filter interactions in Q1, they may want to drill down further. If we know this upfront, Smoke may pre-compute aggregates for new dimensions during lineage capture. To this end, we define Q1 by adding l_tax to the group by clause in Q1, and setting the input relation to Lineitem. We compare Lazy (rewrites Q1 as a table scan-based query) against Smoke-I with and without the group-by optimization. In this setup, the previous lineage consuming query Q1 is the base query for Q1.

Figure 11 compares the lineage query latency under Lazy (red dots) against Smoke-I without the optimization (blue triangles). The push-down optimization is not plotted because it takes ms (i.e., just fetches the materialized aggregates). For completeness we vary the parameters of the backward lineage statement for Q1 () as well as for the base query Q1 () of Q1 and report the lineage consuming query’s latency for all combinations. Although Lazy takes seconds per Q1 instance, Smoke-I’s index scan takes on average , and for low selectivity queries.

Pre-computing aggregation statistics is not free—Figure 12 plots the lineage capture latency for both Smoke variants as compared to the non-instrumented lazy approach. We report the result for all 4 parameters to the base query Q1’s backward lineage statement (). The overhead of Smoke-I is low compared to the cost of partitioning the rid arrays on l_tax and computing aggregates.

Push-down Takeaways: Our experiments highlight that lineage indexes are sufficient whenever the lineage cardinality is low for the complexity of future lineage consuming queries. For higher lineage cardinalities, our workload-aware optimizations provide a principled way to push-down computation into lineage capture and optimize future lineage consuming queries. They also highlight trade-offs that future optimizers would need consider (see also open research questions in Section 4.2).

### 6.5 Smoke-Enabled Applications

We now present evidence that lineage can be used to optimize two real-world applications—cross-filter visualizations (Section 6.5.1) and data profiling (Section 6.5.2)—enough to perform on par with or better than hand-tuned, application-specific implementations. We highlight the main results here, and defer details to the technical report [75].

#### 6.5.1 Crossfilter

Crossfilter is an important interaction technique to help explore correlated statistics across multiple visualization views [20]. In the common setup, multiple group-by queries along different attributes are each rendered as e.g., bar charts. When the user highlights a bar (or set of bars) in one view, the other views update to show the group-by results over only the subset that contributed to the highlighted bar(s). This is naturally expressed as backward lineage from the highlighted bar, followed by refreshing the other views by executing the group-by queries on the lineage subset.

Since the views are fundamentally aggregation queries, recent research have proposed variations of data cubes to accelerate the interactions [57, 55, 71], however it can take minutes or hours to construct the cubes. Such offline time is not available if the user has loaded a new dataset (e.g., into Tableau) and wants to explore using cross-filter as soon as possible. This has recently been referred to as the cold-start problem for interactive visualizations [7].

Setup. Following previous studies [57, 55, 71], we used the Ontime dataset and four group-by COUNT aggregations on <lat, lon> (65,536 bins), <date> (7,762 bins), <departure delay> (8 bins), and <carrier> (29 bins); only 8,100 bins have non-zero counts because <lat, lon> is sparse. Each group-by query corresponds to one output view. This setup favors cube construction because it involves only four views and coarse-grain binning on spatiotemporal dimensions (which decreases the number of cubes, and increases group cardinalities). We report the individual (Figure 13) and cumulative (Figure 14) latency to highlight each and every bar, respectively. The experiment was run on our server-class machine so that the dataset can fit in memory.

Techniques. We compare the following: Lazy uses lazy lineage capture and re-executes the group-by queries on the lineage subset. BT uses Smoke to capture backward lineage indices, but re-runs the group-by queries (which requires re-building group-by hashtables). BT+FT also captures forward lineage indices that map input records to the output bars that they contribute to, which can be used to incrementally update the visualization bars without re-running the aggregation query. Finally, we compare with Data Cube construction. We first ran imMens [57], NanoCubes [55], and hashedcubes [71] to construct the data cubes, however imMens and NanoCubes did not finish within 30 minutes, while hashedcubes required 4 minutes. For this reason, we implemented a custom partial cube construction based on our group-by aggregation push-down optimization that took minutes to construct. This construction resembles the low dimensional cube decomposition described by imMens, but using the sparse encoding recommended by NanoCubes.

Main Results. We make four main observations. First, BT outperforms Lazy by leveraging the backward index to avoid table scans, and is consistent with our TPC-H benchmarks. Despite the overhead of forward index capture, BT+FT outperforms BT because the forward index lets Smoke directly update the associated visualization bars without the need to re-run aggregation queries (and re-build group-by hash tables). Although the Data Cube response time is near-instantaneous, the offline construction cost is considerable and BT+FT is able to complete the benchmark before the cube is constructed (Figure 13). Second, BT+FT performs best (ms) when group-by queries output many groups (e.g., lat/lon, day) because they reduce group cardinality. This suggests that lineage can complement cases when data cubes are expensive (a cube dimension contains many bins) by computing the results online. Third, Figure 14 shows that BT+FT responds within ms (dotted line) for all but five bars, whose lineage depends on a large subset of the input tuples (>10% selectivity; >13M tuples). Fourth, the capture overhead for BT+FT and BT on base visualization queries are relatively low ( using Smoke-I). We expect that optimizations that leverage parallelization, sampling, or deferred capture scheduled during user “think time” can further reduce the capture costs.

#### 6.5.2 Data Profiling Applications

Data profiling studies the statistics and quality of datasets, including constraint checking; data type extraction; or key identification. Recent work, such as UGuide [83], proposes human-in-the-loop approaches towards mining and verifying functional dependencies, and present users with examples of potential constraint violations to double-check. This experiment compares UGuide with Smoke on a lineage-oriented specification of a data profiling problem.

Setup. We evaluate the following task: given a functional dependency (FD) over a table and an FD evaluation algorithm that outputs the distinct values that violate the FD, our goal is to construct a bipartite graph that connects the violations with the tuples | . Collectively, for a set of given FDs, this construction leads to a two-level bipartite graph connecting FDs and violations to tuples responsible for the violations. We compare Smoke-based approaches with UGuide’s implementation333UGuide proposed novel algorithms for mining and verifying functional dependencies, and implemented a fast version of the data-profiling task using Metanome [72]. Although latency was not their focus, the system was optimized for performance, so we believe it is a reasonable comparison baseline.. Based on correspondence with the authors, it turns out that UGuide internally creates data structures akin to the lineage indexes that Smoke captures. This makes sense because it mirrors a lineage-based description of the problem.

Techniques. FD violations for can be identified by transforming the FD into one or more SQL queries. We consider two rewrite approaches. The simple approach (CD) runs the query Q=SELECT A FROM T GROUP BY A HAVING COUNT(DISTINCT B) > 1; backward and forward lineage indexes correspond to the bipartite graph above.

UGuide implements an optimization which, although not modeled as lineage, effectively simulates lineage indexes. We thus describe the second approach (UG) in lineage terms. We first evaluate Q=SELECT DISTINCT attr FROM T for attr, and capture lineage. We backward trace each to the input , and forward trace each lineage record to . If there are more than one distinct values in the forward trace output, then the FD is violated. The lineage indexes also correspond to the desired bipartite graph. The UG implementation is typically faster than CD for FD mining because UG explicitly builds lineage indexes once per attribute and reuses them across FD checks. Our experiments report the cost of individual FD checks and the cost to construct the bipartite graph, however the relative findings are expected to grow wider for multi-FD checks. We compare Smoke using both approaches (Smoke-CD, Smoke-UG) with UGuide that implements the UG approach (Metanome-UG).

Main Results. Figure 15 evaluates the techniques using four functional dependencies over the Physician dataset used in the Holoclean [76] paper. Overall, Smoke-UG outperforms Metanome-UG by while the simpler Smoke-CD approach outperforms both approaches. Both Smoke capture overheads are consistent with our microbenchmarks ( overhead). There are several reasons why Smoke-UG outperforms Metanome-UG. Metanome-UG incurs virtual function call costs when constructing its version of lineage indexes ( overhead on Q that we implemented in UGuide), as well as general JVM overhead even after a warm-up phase to enable JIT optimization. Further, Metanome-UG models all attribute types as strings, which slows uniqueness checks for integer data types such as NPI. For fairness, the other three FDs are over string attributes (zip is a string).

Application Takeaways: Lineage can express many real-world tasks, such as those in visualization and data profiling, that are currently hand-implemented in ad-hoc ways. We have shown evidence that lineage capture can be fast enough to free developers from implementing lineage-tracing logic without sacrificing, and in many cases, improving performance.

## 7 Conclusions and Future Work

Smoke illustrates that it possible to both capture lineage with low overhead and enable fast lineage query performance. Smoke reduces the overhead of fine-grained lineage capture by avoiding shortcomings of logical [86, 9, 32, 66, 31, 21, 35] and physical [58, 46, 89, 44, 45, 43] approaches in a principal manner, and is competitive or outperforms hand-optimized visualization and data profiling applications. Smoke also contributes to the space of physical database design [15, 3, 25, 69, 62, 51, 42, 79, 59, 4, 73, 84] by being the first engine to consider lineage as a type of information for physical design decisions. Our capture techniques and workload-aware optimization make Smoke well-suited for online; adaptive; and offline physical database design settings. Finally, we believe the design principles used in the development of Smoke (P1-P4 in Introduction) are broadly applicable beyond our design.

There are many areas for future work to explore: 1) leverage modern features such as vectorized and compressed execution, columnar formats, and UDFs [77], 2) develop cost-based techniques to instrument plans in an application-aware manner (e.g., Defer is best-suited for speculation in-between interactions), 3) model database optimization policies (e.g., statistics computation, cube-construction, key indexes) as lineage queries, and 4) extend support to data cleaning, visualization, machine learning, and what-if [24, 6] applications.

## References

• [1] D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden, et al. The design and implementation of modern column-oriented database systems. Foundations and Trends® in Databases, 2013.
• [2] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, 2006.
• [3] S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated selection of materialized views and indexes in sql databases. In VLDB, 2000.
• [4] I. Alagiannis, S. Idreos, and A. Ailamaki. H2O: A hands-free adaptive store. In SIGMOD, 2014.
• [5] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark sql: Relational data processing in spark. In SIGMOD, 2015.
• [6] S. Assadi, S. Khanna, Y. Li, and V. Tannen. Algorithms for provisioning queries and analytics. arXiv preprint arXiv:1512.06143, 2015.
• [7] L. Battle, R. Chang, J. Heer, and M. Stonebraker. Position statement: The case for a visualization performance benchmark. In DSIA, 2015.
• [8] L. Battle, R. Chang, and M. Stonebraker. Dynamic prefetching of data tiles for interactive visualization. In SIGMOD, 2016.
• [9] D. Bhagwat, L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. An annotation management system for relational databases. In VLDB, 2004.
• [10] P. Buneman, S. Khanna, and W. C. Tan. Why and where: A characterization of data provenance. In ICDT, 2001.
• [11] S. K. Card, G. G. Robertson, and J. D. Mackinlay. The information visualizer, an information workspace. In ACM, 1991.
• [12] A. Chalamalla, I. F. Ilyas, M. Ouzzani, and P. Papotti. Descriptive and prescriptive data cleaning. In SIGMOD, 2014.
• [13] R. Chang, M. Ghoniem, R. Kosara, W. Ribarsky, J. Yang, E. Suma, C. Ziemkiewicz, D. Kern, and A. Sudjianto. WireVis: Visualization of categorical, time-varying data from financial transactions. In VAST, 2007.
• [14] S. Chaudhuri, P. Ganesan, and S. Sarawagi. Factorizing complex predicates in queries to exploit indexes. In SIGMOD, 2003.
• [15] S. Chaudhuri and V. Narasayya. Self-tuning database systems: A decade of progress. In VLDB, 2007.
• [16] A. Chen, Y. Wu, A. Haeberlen, B. T. Loo, and W. Zhou. Data provenance at internet scale: architecture, experiences, and the road ahead. In CIDR, 2017.
• [17] J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. In Foundations and Trends in Databases, 2009.
• [18] L. Chiticariu, W.-C. Tan, and G. Vijayvargiya. Dbnotes: A post-it system for relational databases based on provenance. In SIGMOD, 2005.
• [19] Z. Chothia, J. Liagouris, F. McSherry, and T. Roscoe. Explaining outputs in modern data analytics. VLDB, 2016.
• [20] Crossfilter.
• [21] Y. Cui. Lineage tracing in data warehouses. PhD thesis, Stanford University, 2001.
• [22] Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst., 2000.
• [23] D. Deutch, N. Frost, and A. Gilad. Provenance for natural language queries. In VLDB, 2017.
• [24] D. Deutch, Z. G. Ives, T. Milo, and V. Tannen. Caravan: Provisioning for what-if analysis. In CIDR, 2013.
• [25] K. Dias, M. Ramacher, U. Shaft, V. Venkataramani, and G. Wood. Automatic performance diagnosis and tuning in oracle. In CIDR, 2005.
• [26] K. Dursun, C. Binnig, U. Cetintemel, and T. Kraska. Revisiting reuse in main memory database systems. arXiv preprint arXiv:1608.05678, 2016.
• [27] EU GDPR.
• [28] F. Faerber, A. Kemper, P.-A. Larson, J. Levandoski, T. Neumann, and A. Pavlo. Main memory database systems. Foundations and Trends® in Databases, 2017.
• [30] K. E. Gebaly and J. Lin. Afterburner: The case for in-browser analytics. arXiv, 2016.
• [31] F. Geerts, A. Kementsietsidis, and D. Milano. Mondrian: Annotating and querying databases through colors and blocks. In ICDE, 2006.
• [32] B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE, 2009.
• [33] G. Graefe and W. J. McKenna. The volcano optimizer generator: Extensibility and efficient search. In ICDE, 1993.
• [34] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. In Data Mining and Knowledge Discovery, 1997.
• [35] T. J. Green, G. Karvounarakis, Z. G. Ives, and V. Tannen. Update exchange with mappings and provenance. In VLDB, 2007.
• [36] A. Gupta, D. Agarwal, D. Tan, J. Kulesza, R. Pathak, S. Stefani, and V. Srinivasan. Amazon redshift and the case for simpler data warehouses. In SIGMOD, 2015.
• [37] D. Haas, S. Krishnan, J. Wang, M. J. Franklin, and E. Wu. Wisteria: Nurturing scalable data cleaning infrastructure. In VLDB, 2015.
• [38] N. Hanusse, S. Maabout, and R. Tofan. Revisiting the partial data cube materialization. In ADBIS, 2011.
• [39] S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker. Oltp through the looking glass, and what we found there. In SIGMOD, 2008.
• [40] J. Heer, M. Agrawala, and W. Willett. Generalized selection via interactive query relaxation. In CHI, 2008.
• [41] J. M. Hellerstein, M. Stonebraker, J. Hamilton, et al. Architecture of a database system. Foundations and Trends® in Databases, 2007.
• [42] S. Idreos, M. L. Kersten, and S. Manegold. Database cracking. In CIDR, 2007.
• [43] R. Ikeda. Provenance In Data-Oriented Workflows. PhD thesis, Stanford University, 2012.
• [44] R. Ikeda, H. Park, and J. Widom. Provenance for generalized map and reduce workflows. In ICDE, 2013.
• [45] R. Ikeda and J. Widom. Panda: A system for provenance and data. In IEEE Data Eng. Bull., 2010.
• [46] M. Interlandi, K. Shah, S. D. Tetali, M. A. Gulzar, S. Yoo, M. Kim, T. Millstein, and T. Condie. Titian: Data provenance support in spark. In VLDB, 2015.
• [47] R. Johnson, V. Raman, R. Sidle, and G. Swart. Row-wise parallel predicate evaluation. In VLDB, 2008.
• [48] U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl. M4: A visualization-oriented time series data aggregation. In VLDB, 2014.
• [49] S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, and J. Heer. Profiler: Integrated statistical analysis and visualization for data quality assessment. In AVI, 2012.
• [50] G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In SIGMOD, 2010.
• [51] M. L. Kersten and S. Manegold. Cracking the database store. In CIDR, 2005.
• [52] M. S. Kester, M. Athanassoulis, and S. Idreos. Access path selection in main-memory optimized data systems: Should i scan or should i probe? In SIGMOD, 2017.
• [53] Y. Klonatos, T. Rompf, C. Koch, and H. Chafi. Legobase: Building efficient query engines in a high-level language, 2014.
• [54] A. Kumar, M. Boehm, and J. Yang. Data management in machine learning: Challenges, techniques, and systems. In SIGMOD, 2017.
• [55] L. Lins, J. T. Klosowski, and C. Scheidegger. Nanocubes for real-time exploration of spatiotemporal datasets. In EuroVis, 2013.
• [56] Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis. In Vis, 2014.
• [57] Z. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of big data. In Computer Graphics Forum, 2013.
• [58] D. Logothetis, S. De, and K. Yocum. Scalable lineage capture for debugging disc analytics. In SoCC, 2013.
• [59] Y. Lu, A. Shanbhag, A. Jindal, and S. Madden. Adaptdb: adaptive partitioning for distributed joins. In VLDB, 2017.
• [60] S. Manegold, M. L. Kersten, and P. Boncz. Database architecture evolution: Mammals flourished long before dinosaurs became extinct. PVLDB.
• [61] R. Miller. Response time in man-computer conversational transactions. In Fall joint computer conference, 1968.
• [62] B. Mozafari, E. Z. Y. Goh, and D. Y. Yoon. Cliffguard: A principled framework for finding robust database designs. In SIGMOD, 2015.
• [63] F. Naumann. Data profiling revisited. SIGMOD Rec., 2014.
• [64] T. Neumann. Efficiently compiling efficient query plans for modern hardware. In VLDB, 2011.
• [65] T. Neumann and V. Leis. Compiling database queries into machine code. IEEE DEB, 2014.
• [66] X. Niu, R. Kapoor, B. Glavic, D. Gawlick, Z. H. Liu, V. Krishnaswamy, and V. Radhakrishnan. Provenance-aware query optimization. In ICDE, 2017.
• [67] Airline On-Time Performance.
• [68] Airline On-Time Statistics and Delay Causes.
• [69] Oracle. Oracle database 10g: The self-managing database. Technical report, Oracle, 2003.
• [70] S. Padmanabhan, T. Malkemus, R. C. Agarwal, and A. Jhingran. Block oriented processing of relational database operations in modern computer architectures. In ICDE, 2001.
• [71] C. A. Pahins, S. A. Stephens, C. Scheidegger, and J. L. Comba. Hashedcubes: Simple, low memory, real-time visual exploration of big data. In TVCG, 2017.
• [72] T. Papenbrock, T. Bergmann, M. Finke, J. Zwiener, and F. Naumann. Data profiling with metanome. In VLDB, 2015.
• [73] E. Petraki, S. Idreos, and S. Manegold. Holistic indexing in main-memory column-stores. In SIGMOD, 2015.
• [74] Physician Compare National.
• [75] F. Psallidas and E. Wu. Smoke: Fine-grained lineage capture at interactive speed. Technical report, Department of Computer Science, Columbia University, 2017.
• [76] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. Holoclean: Holistic data repairs with probabilistic inference. In VLDB, 2017.
• [77] A. Rheinländer, U. Leser, and G. Graefe. Optimization of complex dataflows with user-defined functions. ACM Comput. Surv., 2017.
• [78] S. Roy, L. Orr, and D. Suciu. Explaining query answers with explanation-ready databases. In VLDB, 2015.
• [79] A. Shanbhag, A. Jindal, Y. Lu, and S. Madden. Amoeba: a shape changing storage system for big data. In VLDB, 2016.
• [80] B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Visual Languages. IEEE, 1996.
• [81] V. Tannen. The semiring franework for database provenance. In PODS, 2017.
• [82] P. Terlecki, F. Xu, M. Shaw, V. Kim, and R. Wesley. On improving user response times in tableau. In SIGMOD, 2015.
• [83] S. Thirumuruganathan, L. Berti-Equille, M. Ouzzani, J.-A. Quiane-Ruiz, and N. Tang. Uguide: User-guided discovery of fd-detectable errors. In SIGMOD, 2017.
• [84] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database management system tuning through large-scale machine learning. In SIGMOD, 2017.
• [85] Z. Wang, N. Ferreira, Y. Wei, A. S. Bhaskar, and C. Scheidegger. Gaussian cubes: Real-time modeling for visual exploration of large multidimensional datasets. TVCG, 2017.
• [86] J. Widom. Trio: a system for integrated management of data, accuracy, and lineage. In CIDR, 2005.
• [87] E. Wu and S. Madden.

Scorpion: Explaining away outliers in aggregate queries.

In PVLDB, 2013.
• [88] E. Wu, S. Madden, and M. Stonebraker. A demonstration of dbwipes: clean as you query. In VLDB, 2012.
• [89] E. Wu, S. Madden, and M. Stonebraker. Subzero: a fine-grained lineage system for scientific databases. In ICDE, 2013.
• [90] E. Wu, F. Psallidas, Z. Miao, H. Zhang, L. Rettig, Y. Wu, and T. Sellam.

Combining design and performance in a data visualization management system.

In CIDR, 2017.
• [91] Z. Zhang, E. R. Sparks, and M. J. Franklin. Diagnosing machine learning pipelines with fine-grained lineage. In HPDC, 2017.

## Appendix A Query Compilation

One of the main design principles that we realized in Smoke is the tight integration of the lineage capture logic with the query execution logic. In this section, we give a brief background on the query compilation and push-based execution model that Smoke leverages to realize this principle. (However, note that our techniques are not bound to this execution model. Based on this execution model we better describe tuning techniques of alternative logical and physical approaches for lineage capture in the next section. Readers familiar with query compilation concepts can skip to the next section.)

Query compilation combines query optimization and execution with low-level compiler-based optimizations (e.g., LLVM, LSM, GCC). It replaces query interpretation [41] with a compilation phase that transpiles queries into intermediate representations (IR) such as C, C++, or LLVM IR, that are further optimized by a standard compiler, and an execution phase than runs the query as a binary executable. Each operator in the physical plan emits its intermediate representation that implements its logic, and the engine emits glue code to combine operator inputs and outputs. Many modern database systems [30, 36, 5] have taken this approach, and continues to be an active research area [53] with positive results.

Query compilation systems typically implement operators using the producer-consumer code generation model [64] to derive a push-based execution model that is in contrast to the pull-based iterator [33], batch [70], or full-column [60] execution models of traditional query interpreters. Each operator exposes two functions: produce triggers child operators to produce tuples or data structures with the appropriate schema, and consume emits execution logic to process its inputs and hand the result to the parent consume methods. Borrowing the example from Neumann [64], the following generates pseudo code for the and scan operators:

$$\sigma$$.produce        $$\sigma$$.input.produce
$$\sigma$$.consume(a,s)   print ‘if ’+$$\sigma$$.condition;
$$\sigma$$.parent.consume(a, $$\sigma$$)
scan.produce     print ‘for t in relation’
scan.parent.consume(attrs, scan)

The compilation phase will call the root operator’s produce method to emit IR that generates result tuples. Consider compiling the physical plan . .produce calls scan.produce, which emits the for loop over and calls .consume to consume the tuples of the scan. Then, .consume inlines the selection over tuples of in the for loop. The final emitted pseudocode will be:
for t in relation
if p(t)
<$$\sigma$$’s parent logic>(t)


Operators such as hash aggregation and the building side of hash joins are called pipeline breakers because the entire pipeline (i.e., all operators up to the pipeline breaker) needs to materialize its end result before the next pipeline can start operating. Each pipeline defines a separate code block in the final emitted program code, where each code block is a separate for loop (similar to the example above).

Finally, it is important to note that interpretation-based execution models can also enable the tight integration principle by introducing new physical operators similar to the ones we presented in Section 3.2. Designing such a system is an interesting future direction.

## Appendix B Tuning

Having provided a basic background on the query compilation and execution model of Smoke, in this section we present low-level optimization details that we enabled for lineage capture techniques. We start with logical approaches, then discuss Smoke-based optimizations, and conclude with physical approaches.

Tuning logical techniques. In our preliminary experiments we used Perm and GProm as they are implemented over Postgres 8.3 and a commercial database system, respectively. Unfortunately, both systems exhibit increased lineage capture overhead for reasons intrinsic to the underlying DBMSs which are not related to the principals behind the logical rewrite rules of Perm and GProm.

In particular, we observed increased lineage capture overhead due to (a) the added complexities from the transactional processing layers of full-fledged database systems [39] and (b) no flexibility in reusing data structures in the same physical query plan; reusing data structures is important for lineage capture, as we also noted for our techniques in Section 3. For these reasons, we implemented the rewrite rules of Perm and GProm in Smoke because (a) it does not incur the overhead of the transactional processing layers of a full-fledged database and (b) reuses data structures within the same query plan—hence, enables a fair comparison of only the principals behind Smoke and logical lineage capture techniques having fixed the underlying execution engine. Finally, we note that our implementation of logical alternatives in Smoke outperforms the available versions of Perm and GProm by two orders of magnitude because it avoids these caveats. For instance, Logic-Rid and Logic-Tup for 1m tuples and 1000 groups in our group-by aggregation microbenchmark take while Perm and GProm take . (depending on optimization knobs.)

To better explain the optimizations that we implemented for logical approaches, consider the following query which has the same structure with the queries that we used for our multi-operator experiments: , where and ; denotes the schema (i.e., set of attributes) of relation . Furthermore, let the join between and to be a pk-fk natural join.

Perm’s re-rewrite rule for this query results in the following query: ; denotes the relational rename operation of the attribute to Figure 16 shows the corresponding non-optimized and optimized physical plans that Smoke produces for this query. Similar non-optimized physical plans, with the same lack of optimizations, are produced by Perm and GProm. Next, we describe the set of optimizations to derive the optimized plan.

First, Perm joins the output of the aggregation with the join on the same attributes that the hash table was built on for the group-by aggregation. This allows us to reuse the hash table to join the output of the join with the aggregation result. As such, the built of the hash table at pipeline can be eliminated as we can reuse directly the hash table that was built during pipeline . Second, consider the materialization of the output at pipeline and the materialization of the same output at . This is the exact output of the base query. is more costly than due to the projection on a, possibly, large relation. Hence, we can eliminate and provide the output result by executing . Finally, note that the two natural joins between the tables A and B are the same. Hence, if we materialize the join output there is no need to perform again the join. In our experiments we found that joins are costly to materialize. As an alternative we re-used the hash table built for the inner table A and probed only the outer table B which resulted in the best performance in our experiments with the TPC-H queries. In general, for all n-way joins we materialized the last hash table and probed only the rightmost table in the left-deep plans that Smoke produces.

Note that the set of optimizations that we presented above can be enabled by any DBMS. This means that there is no intrinsic problem with the re-write rules of Perm and GProm, beyond the ones that we described in Section 2 and experimentally showed in our experiments. Rather, DBMSs, on top of which logical systems are built, need to perform low-level optimizations of physical plans. This is why we did not compare against systems that do not enable these optimizations and directly optimized logical alternatives within Smoke.

Tuning Smoke techniques. In Section 3.3 we noted that Smoke specifically optimizes SPJA queries with pk-fk joins. Note that the set of optimizations presented above are also enabled by Smoke for Smoke-D. The difference is that Smoke-D does not need to materialize as in . Instead, it combines together and to build directly lineage indexes and always outperforms the logical approaches even with the optimizations because it avoids storing the denormalized representation that logical approaches perform. Furthermore, in our experiments we discussed that Smoke-I oupterforms Smoke-D in TPC-H queries that we experimented with. For these TPC-H queries, Smoke-I results in annotating intermediate hash-tables with rids as part of lineage capture for joins and the final aggregation operator uses these rids to perform the Inject method. Hence, Smoke-I avoids the expensive join of Smoke-D and outperforms it.

Tuning physical techniques. Similarly to logical techniques, we have also implemented Phys-Mem and Phys-Bdb within Smoke. More specifically, for a compiled query plan as generated by Lazy, we instrument it with virtual function calls to emit lineage (i.e., <output, input> rids). Then, assume that we want to capture backward lineage. For one-to-many relations between output and input, Phys-Mem probes a hash table on the output rid. Each entry in the hash table keeps a pointer to an rid index that we use to append the input rid. For one-to-one relations, we use an rid list where we append the input rid. In contrast, Phys-Bdb, adds the <output, input> rids in BerkeleyDB with the key being the output rid. (For forward lineage the process is the same but we probe on the input rids and we append output rids.) For our experiments, we used BerkeleyDB 12.1 that we instructed to (a) be in-memory with cache size 0.5GiB (which is sufficient to avoid spooling to disk) and (b) use B-Tree as its internal indexing structure. Essentially, both approaches are similar to Smoke-I but Phys-Mem implements the capture logic of Smoke-I in a virtual function instead of having it inline in the compiled plan while Phys-Bdb uses the B-tree indexes of BerkeleyDB instead of the lineage indexes of Smoke.

## Appendix C Variants of TPC-H Q1

In Section 6.4 we evaluated the optimizations of Smoke on three variants of TPC-H , namely, ; ; and , and compared them with lazy lineage query approaches. Here, we give a more detailed description of these queries and their lazy alternatives. First, we give a brief background on lazy alternatives [21, 43].

Lazy lineage capture. Consider a base group-by aggregation query , where are group-by attributes and is the set of aggregation functions. A backward lineage query for a given output is equivalent to . Furthermore, if the base query includes selections, these selections need to be added in the selection clause of . Essentially, instead of accessing lineage through indexes as in the eager approaches of Smoke, the lazy approaches access lineage with selection scans on base relations. Similar are the equivalence rules for backward (or forward) lineage queries on top of joins, selections, projections, or general workflows [21, 43].

Now, recall that TPC-H is specified as shown below (Smoke’s hash-based evaluation precludes sorting and ORDER BY clauses are omitted):

Q1 = SELECT
l_returnflag, l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount))
as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax))
as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
FROM lineitem
WHERE l_shipdate < ’1998-12-01’
GROUP BY
l_returnflag,
l_linestatus

The output of is four groups derived from combinations of l_returnflag and l_linestatus (i.e., ).

drills down from on the year and month of l_shipdate given an output group of . Using with the backward lineage query construct this operation can be specified as follows:

Q1 = SELECT ...,
extract(year from l_shipdate),
extract(month from l_shipdate)
FROM  ,lineitem)
GROUP BY
extract(year from l_shipdate),
extract(month from l_shipdate)

For the eager approach of Smoke, is evaluated using only the lineage indexes that we construct during the execution of . As a result, we do not need to add the selections specified in in because the backward lineage will retrieve tuples of lineitem that satisfy these selections.

Now, regarding the lazy approach to evaluate , recall that a backward lineage for a group-by aggregation query can be specified lazily as . By applying this rule on we derive its lazy alternative:

Q1 =
SELECT ...,
extract(year from l_shipdate),
extract(month from l_shipdate)
FROM lineitem
WHERE
l_shipdate < ’1998-12-01’ and
l_linestatus = ? and
l_returnflag = ?
GROUP BY
extract(year from l_shipdate),
extract(month from l_shipdate)

Given an output group of we can parameterize the selections above and evaluate using a selection scan on lineitem (as opposed to the indexed scan using our Smoke techniques). Finally, note that in our experiments we considered backward lineage queries to be specified for only one output group of . This avoids disjunctive selections on l_linestatus and l_shipdate (e.g., change the single selection l_linestatus = ? above to l_linestatus IN ()), since we only need to backward trace to tuples that contribute only to one group. However, in the general case, lazy approaches need to consider these expensive disjunctions. In contrast, eager approaches have access to tuples of each group through the lineage indexes and do not require to re-calculate groups with expensive selections.

Following the logic of above we similarly derived the eager and lazy alternatives of and . For completeness, we list them below without further discussion.

Q1
SELECT ... FROM ,lineitem)
WHERE l_shipinstruct = ? and
l_shipmode = ?
GROUP BY ...
Q1
SELECT ... FROM lineitem
WHERE l_shipdate < ’1998-12-01’
l_shipinstruct = ? and l_shipmode = ? and
l_linestatus = ? and l_returnflag = ?
GROUP BY ...
Q1
SELECT ... FROM ,lineitem)
GROUP BY ..., l_tax
Q1
SELECT ... FROM lineitem
WHERE l_shipdate < ’1998-12-01’ and
l_shipinstruct = ? and l_shipmode = ? and
l_linestatus = ? and l_returnflag = ? and
extract(year from l_shipdate) = ?
extract(month from l_shipdate) = ?
GROUP BY ..., l_tax

## Appendix D Crossfiltering using lineage

In Section 6.5.1, we introduced two techniques (i.e., BT and BT+FT) for crossfiltering using lineage that we compared with the Lazy approach. In this section, we give a more detailed technical discussion. Figure 17 shows the proposed physical plans (i.e., BT and BT+FT) for crossfiltering alongside the naive approach (i.e., Lazy) and drives our discussion.

Consider a set of queries Q = SELECT G, F(J) FROM T GROUP BY G for which we seek to support crossfiltering functionality,

Lazy. During the execution of each , Lazy simply executes the group-by aggregations without capturing lineage. Then, given a selection of a subset of outputs of (i.e., ), Lazy supports crossfiltering by updating each as follows:

Q=SELECT G, F(J)
FROM T
WHERE
GROUP BY G

Essentially, Lazy supports crossfiltering by performing lazy lineage capture to identify the partitions of the base relation that contributed to the selected outputs. Finally, to evaluate the set of all updated , Lazy does not execute each update separately. Rather it uses a shared selection scan of the input relation with the selection being to avoid multiple, expensive selection scans of the base relation.

During the execution of initial group-by aggregations, BT and BT+FT perform lineage capture in order to to use lineage indexes for crossfiltering.

BT. As we noted above, Lazy supports crossfiltering with lazy lineage capture. This means that above is equivalent to the following query.

Q=SELECT G,F(J)
FROM backward_trace(,T)
GROUP BY G

Since we have built a backward index bt, during the execution of , then can be evaluated using an indexed-scan on the base table using only the rid arrays of bt of bt that correspond to the selected output records . Essentially, with this approach we can avoid the selection scans of Lazy with indexed scans from lineage. Similarly to Lazy, BT uses a shared scan but this time a shared indexed-scan based on bt.

BT+FT. Finally, note that BT still needs to perform group-by aggregations which is a very costly operation due to the construction of hash tables. Instead, recall the notion of a forward lineage index for a group-by aggregation query: Each tuple in the input is associated with a group in the output. So, forward indexes provide a perfect hashing between tuples in the base table and group-by aggregation results that we have already calculated (i.e., the initial views). Hence, instead of constructing hash tables, BT+FT uses the forward indexes as perfect hash tables and performs crossfiltering as follows:

agg_update() in the listing above simply updates the aggregation (e.g., for COUNT(*), agg_update is simply [fw[][bw[i][j]]]++). Finally, note the remove_non_affected_groups function at the end of the listing above. This function loops over the groups of each updated group-by and removes the groups that were not affected by the group-by. In the case of COUNT(*) this is simply the groups that have a zero count. For other aggregates, like SUM, we need to track which groups were updated within the agg_update functions. However, in most interactive visualizations it is more important to maintain the groups even if the have a zero count (or a zero sum). In that case the remove_non_affected_groups can simply by ignored and the agg_update can perform the update without updating a state of what groups were updated. Our experiments in Section 6.5.1 report the latency of BT+FT including the time for this operation.

## Appendix E Lineage Semantics in Smoke

In Section 2, we noted that Smoke uses transformation provenance semantics that can allow us to encode several novel provenance semantics at the will of end-developers. Here, we give a brief discussion on how one can go about encoding new semantics in Smoke.

Consider the following example query and database: SELECT COUNT(*), A.cname, B.pname FROM A.cid = B.id GROUP BY A.cname, B.pname.

cid cname
1 Bob
2 Alice
oid cid pname date
1 1 iPhone 12/25
2 1 iPhone 12/25
3 2 XBox 12/25

The output of this query is the following:

COUNT(*) A.cname B.pname
1 Bob iPhone
2 Alice xBox

According to Smoke’s provenance semantics the backward index for with respect to table contains the tuple rid twice. This is important because in this way Smoke can encode multiple provenance semantics. The why-provenance of is and to answer why-provenance queries Smoke simply concatenates the backward index rids: rids at the same position in the backward indexes for A and B correspond to the why-provenance witnesses. Now, the which-provenance of is which Smoke can derive by performing a set union of the backward indexes. Finally, the how-provenance of is which Smoke can derive by set unioning and concatenating the backward rid indexes, similarly to the operations for which- and why-provenance.

Now, note that all these operations to derive different provenance semantics are lineage consuming queries whose logic we can push down as workload-aware optimizations similar to the ones in Section 4. In which case Smoke operates as a which-, why-, or how-provenance system for the case illustrated above. However, note that depending on the provenance semiring that how-provenance captures, the lineage consuming logic can be different and expressing semirings as lineage consuming queries is an open question.

These observations, along with the focus of Smoke to capture forward lineage, highlight that Smoke provides a general architecture for novel provenance semantics.

## Appendix F Instrumentation Algebra

In Section 3 we presented the physical algebra of Smoke for lineage capture of projection, selections, hash-based group-by aggregations, and hash-based equi-joins. In this section, we extend this algebra for bag and set union, intersection, and difference as well as nested-loop -joins and cross products.

### f.1 Set Union

Set union between two relations and (i.e., , where denotes set union and uattrs the attributes from and to union on) are implemented in a hash-based way with consecutive appends to a hash table: Initially, the operator builds a hash table using the relation with the key being the attributes of the union (i.e., uattrs). Then, probes the hash table constructed by on the union attributes using relation . If an entry does not already exist for the union attributes, appends a new entry in the hash table with the union attributes. Essentially, and are the same operator, that probe and append tuples in a hash table. The only difference is that takes as input an empty hash table while takes as input a pre-built hash table. Finally, scans the hash table and constructs the output. Next, we discuss Defer and Inject lineage capture approaches for set union; Figure 18 shows their corresponding physical plans.

Inject:  LABEL:dl:set_union_inject illustrates the Inject lineage capture of Smoke for set union. Similarly to group-by aggregation, Inject rewrites to append, besides the union attributes, two arrays a_rids and b_rids that track which tuples from and , respectively, contribute to the hash table entry. During we populate a_rids and during we populate b_rids. Finally, outputs the result and the lineage indexes.

Defer:  LABEL:dl:set_union_defer illustrates the Defer lineage capture of Smoke for set union. Similarly to group-by aggregation, Defer rewrites and to append an oid to each hash table entry, initially set to , besides the union attributes. Then, outputs the set union result and assigns the correct oid to each hash table entry.