Towards Optimizing Storage Costs on the Cloud

05/24/2023
by Koyel Mukherjee, et al.

We study the problem of optimizing data storage and access costs on the cloud while ensuring that the desired performance or latency is unaffected. First, we propose an optimizer that chooses the cloud storage tier in which to place data and the compression scheme to apply, for given data partitions with temporal access predictions. Second, we propose a model that learns the compression performance of multiple algorithms across data partitions in different formats and generates compression performance predictions on the fly, as inputs to the optimizer. Third, we approach the data partitioning problem in a fundamentally different way from the current default in most data lakes, where partitions correspond to ingestion batches: we propose access-pattern-aware data partitioning and formulate an optimization problem that optimizes the size and reading costs of partitions subject to access patterns. We study these optimization problems both theoretically and empirically, and provide theoretical bounds as well as hardness results. We propose a unified cost-minimization pipeline, called SCOPe, that combines the different modules. We extensively compare our methods with related baselines from the literature on TPC-H data as well as enterprise datasets (ranging from GB to PB in volume) and show that SCOPe substantially improves over the baselines. We show significant cost savings compared to platform baselines, of the order of 50% on datasets that range from terabytes to petabytes in volume.
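
To make the first contribution concrete, here is a minimal sketch of a joint tier and compression choice for a single partition. It is not the optimizer from the paper: it enumerates every (tier, codec) pair by brute force. The `Tier` and `Codec` types, the `best_placement` function, the cost model, and all prices, ratios, and latencies are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Tier:
    name: str
    storage_cost_per_gb: float  # hypothetical monthly price per stored GB
    read_cost_per_gb: float     # hypothetical price per GB read
    latency_ms: float           # hypothetical time to first byte


@dataclass(frozen=True)
class Codec:
    name: str
    ratio: float                 # compressed size / original size
    decompress_ms_per_gb: float  # decompression time per original GB


# Illustrative numbers only; they are not taken from any cloud price list.
TIERS = [
    Tier("hot", 0.02, 0.0, 10.0),
    Tier("cool", 0.01, 0.01, 50.0),
    Tier("archive", 0.002, 0.05, 3_600_000.0),
]
CODECS = [
    Codec("none", 1.0, 0.0),
    Codec("fast", 0.5, 100.0),
    Codec("dense", 0.3, 800.0),
]


def best_placement(size_gb, expected_reads, latency_budget_ms,
                   tiers=TIERS, codecs=CODECS):
    """Return (cost, tier_name, codec_name) with the lowest monthly cost
    among pairs that meet the latency budget, or None if none does."""
    best = None
    for tier, codec in product(tiers, codecs):
        # Read latency = tier access latency + time to decompress the partition.
        latency = tier.latency_ms + codec.decompress_ms_per_gb * size_gb
        if latency > latency_budget_ms:
            continue
        stored_gb = size_gb * codec.ratio
        # Monthly cost = storage + predicted reads of the compressed bytes.
        cost = stored_gb * (tier.storage_cost_per_gb
                            + expected_reads * tier.read_cost_per_gb)
        if best is None or cost < best[0]:
            best = (cost, tier.name, codec.name)
    return best


if __name__ == "__main__":
    # A rarely read 10 GB partition versus a frequently read one.
    print(best_placement(10, expected_reads=0, latency_budget_ms=10_000))
    print(best_placement(10, expected_reads=100, latency_budget_ms=10_000))
```

Under these assumed numbers, a partition that is never read lands in the cheaper tier with the densest codec, while a frequently read partition stays in the hot tier because read charges dominate. Tightening the latency budget rules out the slower codecs. In the pipeline the abstract describes, the access counts and compression ratios would come from predictions rather than constants.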

