Dataset Lifecycle Framework and its applications in Bioinformatics

02/24/2021
by   Yiannis Gkoufas, et al.
0

Bioinformatics pipelines depend on shared POSIX filesystems for its input, output and intermediate data storage. Containerization makes it more difficult for the workloads to access the shared file systems. In our previous study, we were able to run both ML and non-ML pipelines on Kubeflow successfully. However, the storage solutions were complex and less optimal. This is because there are no established resource types to represent the concept of data source on Kubernetes. More and more applications are running on Kubernetes for batch processing. End users are burdened with configuring and optimising the data access, which is what we have experienced before. In this article, we are introducing a new concept of Dataset and its corresponding resource as a native Kubernetes object. We have leveraged the Dataset Lifecycle Framework which takes care of all the low-level details about data access in Kubernetes pods. Its pluggable architecture is designed for the development of caching, scheduling and governance plugins. Together, they manage the entire lifecycle of the custom resource Dataset. We use Dataset Lifecycle Framework to serve data from object stores to both ML and non-ML pipelines running on Kubeflow. With DLF, we make training data fed into ML models directly without being downloaded to the local disks, which makes the input scalable. We have enhanced the durability of training metadata by storing it into a dataset, which also simplifies the set up of the Tensorboard, separated from the notebook server. For the non-ML pipeline, we have simplified the 1000 Genome Project pipeline with datasets injected into the pipeline dynamically. In addition, our preliminary results indicate that the pluggable caching mechanism can improve the performance significantly.

READ FULL TEXT
research
11/07/2021

Plumber: Diagnosing and Removing Performance Bottlenecks in Machine Learning Data Pipelines

Input pipelines, which ingest and transform input data, are an essential...
research
06/03/2019

Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference

Machine learning (ML) has become increasingly important and performance-...
research
03/05/2022

ReGraph: Scaling Graph Processing on HBM-enabled FPGAs with Heterogeneous Pipelines

The use of FPGAs for efficient graph processing has attracted significan...
research
04/17/2023

eTOP: Early Termination of Pipelines for Faster Training of AutoML Systems

Recent advancements in software and hardware technologies have enabled t...
research
04/16/2020

Developing and Deploying Machine Learning Pipelines against Real-Time Image Streams from the PACS

Executing machine learning (ML) pipelines on radiology images is hard du...
research
06/29/2022

The Vera C. Rubin Observatory Data Butler and Pipeline Execution System

The Rubin Observatory's Data Butler is designed to allow data file locat...
research
04/17/2023

Memento: Facilitating Effortless, Efficient, and Reliable ML Experiments

Running complex sets of machine learning experiments is challenging and ...

Please sign up or login with your details

Forgot password? Click here to reset