Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers

11/04/2022
by Ubaid Ullah Hafeez, et al.

Data processing frameworks such as Apache Beam and Apache Spark are used for a wide range of applications, from log analysis to data preparation for DNN training. It is thus unsurprising that a large body of work has focused on optimizing these frameworks, including their storage management. The shift to cloud computing requires optimizing across all pipelines running concurrently on a cluster. In this paper, we look at one specific instance of this problem: the placement of I/O-intensive temporary intermediate data on SSD and HDD. Efficient data placement is challenging because I/O density is usually unknown at the time the data needs to be placed. Additionally, external factors such as load variability, job preemption, and job priorities can impact job completion times, which in turn affect the I/O density of the temporary files in the workload. We envision that machine learning can be used to solve this problem. We analyze production logs from Google's data centers for a range of data processing pipelines, and our analysis shows that I/O density may be predictable. This suggests that learning-based strategies, if crafted carefully, could extract predictive features for the I/O density of temporary files involved in various transformations, and that these features could be used to improve the efficiency of storage management in data processing pipelines.

