
Efficiently Processing Workflow Provenance Queries on SPARK

by Rajmohan C, et al.

In this paper, we investigate how to leverage the Spark platform to efficiently process provenance queries over large volumes of workflow provenance data. We focus on provenance queries at the attribute-value level, which is the finest granularity available. We propose a novel framework based on weakly connected components, engineered to quickly determine a minimal volume of data containing the entire lineage of the queried attribute-value. This minimal volume of data is then processed to determine the provenance of the queried attribute-value. The framework computes weakly connected components on the workflow provenance graph and, by exploiting the workflow dependency graph, further partitions the large components into collections of weakly connected sets. We study the effectiveness of the framework through experiments on a provenance trace obtained from a real-life unstructured text curation workflow. On provenance graphs containing up to 500M nodes and edges, we show that the proposed framework answers provenance queries in real time and easily outperforms naive approaches.
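The core idea of restricting a lineage query to one weakly connected component can be illustrated with a small single-machine sketch. This is not the authors' implementation and it does not use Spark; it assumes the provenance graph is given as a list of hypothetical `(source, derived)` edges between attribute-values, and the function names `weakly_connected_components` and `lineage` are invented for illustration. It also omits the further partitioning of large components into weakly connected sets that the abstract describes.

```python
from collections import defaultdict, deque


def weakly_connected_components(nodes, edges):
    """Map each node to a component label, ignoring edge direction (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in edges:
        a, b = find(src), find(dst)
        if a != b:
            parent[b] = a
    return {n: find(n) for n in nodes}


def lineage(target, edges, component_of):
    """Return all ancestors of `target`, scanning only edges in its component."""
    comp = component_of[target]
    # Reverse adjacency restricted to the target's component: the reduced
    # volume of data that can contain the lineage.
    parents = defaultdict(list)
    for src, dst in edges:
        if component_of[src] == comp:
            parents[dst].append(src)

    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for p in parents[node]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen - {target}


# Hypothetical toy provenance graph: an edge (s, d) means d was derived from s.
nodes = ["a1", "a2", "b1", "c1", "x1", "y1"]
edges = [("a1", "b1"), ("a2", "b1"), ("b1", "c1"), ("x1", "y1")]
components = weakly_connected_components(nodes, edges)
ancestors_of_c1 = lineage("c1", edges, components)
```

In this toy graph, the query for `c1` never touches the edge between `x1` and `y1`, because those nodes carry a different component label. On a real trace the labels would be precomputed once, so that each query only filters by label before traversing.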


