A Cost-based Storage Format Selector for Materialization in Big Data Frameworks

06/11/2018
by   Rana Faisal Munir, et al.

Modern big data frameworks (such as Hadoop and Spark) allow multiple users to run large-scale analyses simultaneously. Typically, users deploy Data-Intensive Workflows (DIWs) for their analytical tasks. The DIWs of different users share many common parts (i.e., 50-80%), which can be materialized for reuse in future executions. Materialization improves the overall processing time of DIWs and also saves computational resources. Current solutions for materialization store data on Distributed File Systems (DFS) using a fixed data format. However, a fixed choice might not be optimal in every situation. For example, it is well known that different data fragmentation strategies (i.e., horizontal, vertical, or hybrid) behave better or worse according to the access patterns of the subsequent operations. In this paper, we present a cost-based approach that helps decide the most appropriate storage format in every situation. A generic cost-based storage format selector framework considering the three fragmentation strategies is presented. Then, we use our framework to instantiate cost models for specific Hadoop data formats (namely SequenceFile, Avro and Parquet), and test it with realistic use cases. Our solution gives on average 33 SequenceFile, 11 provides up to 25
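The idea of choosing a layout from the access patterns of subsequent operations can be illustrated with a toy example. The sketch below is not the paper's cost model: the `Scan` type, the `COLUMN_OVERHEAD` constant, and the byte-counting formulas are hypothetical simplifications, and only horizontal and vertical layouts are covered. It estimates the bytes read by each layout for an expected workload and picks the cheaper one.

```python
from dataclasses import dataclass

# Hypothetical fixed cost (in bytes) of opening one column chunk.
# This value is an illustrative assumption, not a measured number.
COLUMN_OVERHEAD = 4096


@dataclass(frozen=True)
class Scan:
    """A subsequent operation: which columns it reads and how often it runs."""
    columns: frozenset
    runs: int = 1


def read_bytes(layout, col_widths, n_rows, scan):
    """Estimate the bytes read by one execution of `scan` under `layout`."""
    if layout == "horizontal":
        # Row-oriented: whole rows are read, whatever the projection is.
        return n_rows * sum(col_widths.values())
    if layout == "vertical":
        # Column-oriented: only referenced columns, plus a per-column overhead.
        data = n_rows * sum(col_widths[c] for c in scan.columns)
        return data + COLUMN_OVERHEAD * len(scan.columns)
    raise ValueError(f"unknown layout: {layout}")


def select_layout(col_widths, n_rows, workload, layouts=("horizontal", "vertical")):
    """Return the layout with the lowest estimated total cost, and all costs."""
    costs = {
        layout: sum(s.runs * read_bytes(layout, col_widths, n_rows, s) for s in workload)
        for layout in layouts
    }
    return min(costs, key=costs.get), costs


col_widths = {"id": 8, "name": 32, "price": 8, "comment": 200}

# A projection-heavy workload touches a single narrow column.
projection = [Scan(frozenset({"price"}), runs=5)]
# A full scan touches every column.
full_scan = [Scan(frozenset(col_widths), runs=1)]

best_projection, projection_costs = select_layout(col_widths, 1000, projection)
best_full, full_costs = select_layout(col_widths, 1000, full_scan)
print(best_projection, projection_costs)
print(best_full, full_costs)
```

Under these assumptions, the narrow projection favors the vertical layout, while the full scan favors the horizontal one because the per-column overhead outweighs any savings. A realistic selector would also need to account for write cost, compression, and selection predicates, which this sketch leaves out.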


