Efficiently Processing Workflow Provenance Queries on SPARK

08/25/2018
by Rajmohan C, et al.

In this paper, we investigate how the Spark platform can be leveraged to efficiently process provenance queries on large volumes of workflow provenance data. We focus on processing provenance queries at the attribute-value level, which is the finest granularity available. We propose a novel framework based on weakly connected components, carefully engineered to quickly determine a minimal volume of data that contains the entire lineage of the queried attribute-value. This minimal volume of data is then processed to determine the provenance of the queried attribute-value. The framework computes weakly connected components on the workflow provenance graph and, exploiting the workflow dependency graph, further partitions the large components into a collection of weakly connected sets. We study the effectiveness of the proposed framework through experiments on a provenance trace obtained from a real-life unstructured text curation workflow. On provenance graphs containing up to 500M nodes and edges, we show that the proposed framework answers provenance queries in real time and easily outperforms the naive approaches.
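
The core idea in the abstract, restricting a lineage query to the weakly connected component that contains the queried attribute-value, can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation: the edge format `(source, derived)`, the function names, and the sample data are all assumptions made for illustration, and the further partitioning of large components into weakly connected sets is not shown.

```python
from collections import defaultdict, deque


def weakly_connected_components(edges):
    """Map each node to a component label, ignoring edge direction (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst
    return {node: find(node) for node in list(parent)}


def lineage(edges, target):
    """Return all ancestors of `target`, scanning only edges in its component."""
    component = weakly_connected_components(edges)
    if target not in component:
        return set()
    target_component = component[target]

    # Assumption: an edge (src, dst) means dst was derived from src.
    parents = defaultdict(set)
    for src, dst in edges:
        if component[src] == target_component:
            parents[dst].add(src)

    # Backward breadth-first traversal from the queried value.
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for p in parents[node]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen


# Hypothetical provenance edges: two independent derivation chains.
EDGES = [("a1", "b1"), ("b1", "c1"), ("x", "c1"), ("a2", "b2")]
COMPONENTS = weakly_connected_components(EDGES)
ANCESTORS_OF_C1 = lineage(EDGES, "c1")
```

In a real deployment the component labels would be computed once and stored with the provenance data, so that each query only has to load the edges carrying the matching label rather than recomputing components per query as this sketch does.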


research
05/29/2021

WfCommons: A Framework for Enabling Scientific Workflow Research and Development

Scientific workflows are a cornerstone of modern scientific computing. T...
research
12/31/2018

The role of visual saliency in the automation of seismic interpretation

In this paper, we propose a workflow based on SalSi for the detection an...
research
07/25/2018

Validation and Inference of Schema-Level Workflow Data-Dependency Annotations

An advantage of scientific workflow systems is their ability to collect ...
research
11/29/2017

Multimodal Attribute Extraction

The broad goal of information extraction is to derive structured informa...
research
12/15/2021

Learning Graph Partitions

Given a partition of a graph into connected components, the membership o...
research
12/29/2021

Fast Subspace Identification Method Based on Containerised Cloud Workflow Processing System

Subspace identification (SID) has been widely used in system identificat...
