Efficiently Processing Workflow Provenance Queries on SPARK

08/25/2018
by Rajmohan C, et al.
ibm

In this paper, we investigate how the Spark platform can be leveraged to efficiently process provenance queries on large volumes of workflow provenance data. We focus on processing provenance queries at the attribute-value level, which is the finest granularity available. We propose a novel framework based on weakly connected components, carefully engineered to quickly determine a minimal volume of data containing the entire lineage of the queried attribute-value. This minimal volume of data is then processed to determine the provenance of the queried attribute-value. The proposed framework computes weakly connected components on the workflow provenance graph and, exploiting the workflow dependency graph, further partitions the large components into a collection of weakly connected sets. We study the effectiveness of the proposed framework through experiments on a provenance trace obtained from a real-life unstructured text curation workflow. On provenance graphs containing up to 500M nodes and edges, we show that the proposed framework answers provenance queries in real time and easily outperforms naive approaches.
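
The core idea of narrowing a lineage query to a single weakly connected component can be illustrated with a small single-machine sketch. It is not the paper's Spark implementation and it omits the further partitioning into weakly connected sets; the function names, the edge convention (an edge `(a, b)` means `b` was derived from `a`), and the toy trace are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def weakly_connected_components(edges):
    """Label each node with a component id, ignoring edge direction (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst
    return {node: find(node) for node in list(parent)}


def lineage(edges, component_of, target):
    """Return all ancestors of `target`, using only edges in its component."""
    target_component = component_of[target]
    parents_of = defaultdict(list)
    for src, dst in edges:
        # Edges outside the target's component cannot contribute to its lineage.
        if component_of[dst] == target_component:
            parents_of[dst].append(src)
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents_of[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical toy trace: an edge (a, b) means value b was derived from value a.
edges = [("doc1", "tok1"), ("tok1", "ent1"), ("doc2", "tok2"), ("tok2", "ent2")]
component_of = weakly_connected_components(edges)
ancestors = lineage(edges, component_of, "ent1")
```

In this toy trace, the query for `ent1` only touches the component holding `doc1`, `tok1`, and `ent1`; the edges belonging to `doc2` are filtered out before the backward traversal starts, which is the kind of data reduction the abstract describes.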


05/29/2021

WfCommons: A Framework for Enabling Scientific Workflow Research and Development

Scientific workflows are a cornerstone of modern scientific computing. T...
12/31/2018

The role of visual saliency in the automation of seismic interpretation

In this paper, we propose a workflow based on SalSi for the detection an...
07/25/2018

Validation and Inference of Schema-Level Workflow Data-Dependency Annotations

An advantage of scientific workflow systems is their ability to collect ...
11/29/2017

Multimodal Attribute Extraction

The broad goal of information extraction is to derive structured informa...
12/15/2021

Learning Graph Partitions

Given a partition of a graph into connected components, the membership o...
12/29/2021

Fast Subspace Identification Method Based on Containerised Cloud Workflow Processing System

Subspace identification (SID) has been widely used in system identificat...