DeepAI AI Chat
Log In Sign Up

Understanding Data Science Lifecycle Provenance via Graph Segmentation and Summarization

by   Hui Miao, et al.

Increasingly modern data science platforms today have non-intrusive and extensible provenance ingestion mechanisms to collect rich provenance and context information, handle modifications to the same file using distinguishable versions, and use graph data models (e.g., property graphs) and query languages (e.g., Cypher) to represent and manipulate the stored provenance/context information. Due to the schema-later nature of the metadata, multiple versions of the same files, and unfamiliar artifacts introduced by team members, the "provenance graph" is verbose and evolving, and hard to understand; using standard graph query model, it is difficult to compose queries and utilize this valuable information. In this paper, we propose two high-level graph query operators to address the verboseness and evolving nature of such provenance graphs. First, we introduce a graph segmentation operator, which queries the retrospective provenance between a set of source vertices and a set of destination vertices via flexible boundary criteria to help users get insight about the derivation relationships among those vertices. We show the semantics of such a query in terms of a context-free grammar, and develop efficient algorithms that run orders of magnitude faster than state-of-the-art. Second, we propose a graph summarization operator that combines similar segments together to query prospective provenance of the underlying project. The operator allows tuning the summary by ignoring vertex details and characterizing local structures, and ensures the provenance meaning using path constraints. We show the optimal summary problem is PSPACE-complete and develop effective approximation algorithms. The operators are implemented on top of a property graph backend. We evaluate our query methods extensively and show the effectiveness and efficiency of the proposed methods.


page 1

page 2

page 3

page 4


Personalized Graph Summarization: Formulation, Scalable Algorithms, and Applications

Are users of an online social network interested equally in all connecti...

Shortest Beer Path Queries based on Graph Decomposition

Given a directed edge-weighted graph G=(V, E) with beer vertices B⊆ V, a...

Efficient Semantic Summary Graphs for Querying Large Knowledge Graphs

Knowledge Graphs (KGs) integrate heterogeneous data, but one challenge i...

Context-Free Path Querying by Matrix Multiplication

Graph data models are widely used in many areas, for example, bioinforma...

An Algorithm for Context-Free Path Queries over Graph Databases

RDF (Resource Description Framework) is a standard language to represent...

Efficient Continuous Multi-Query Processing over Graph Streams

Graphs are ubiquitous and ever-present data structures that have a wide ...