Metadata Caching in Presto: Towards Fast Data Processing

11/20/2022
by Beinan Wang, et al.

Presto is an open-source distributed SQL query engine for OLAP that aims for "SQL on everything". Since being open-sourced in 2013, Presto has steadily gained popularity in large-scale data analytics and has been adopted by a wide range of enterprises. While developing and operating Presto, we observed that Presto worker nodes spend a significant amount of CPU on parsing column-oriented data files. This blocks some companies, including Meta, from increasing their analytical data volumes. In this paper, we present a metadata caching layer, built on top of the Alluxio SDK cache and incorporated in each Presto worker node, that caches the intermediate results of file parsing. The metadata cache provides two caching methods: caching the decompressed metadata bytes from raw data files, and caching the deserialized metadata objects. Our evaluation on the TPC-DS benchmark demonstrates that, when the cache is warm, the first method can reduce a query's CPU consumption by at least 10%, whereas the second method can reduce CPU usage by at least 20%.
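The difference between the two caching methods can be illustrated with a minimal Python sketch. It is not the paper's implementation: Presto is written in Java and the real layer sits on the Alluxio SDK cache, whereas here a small in-memory LRU cache stands in for the backend, `zlib` stands in for the file format's compression codec, and `json` stands in for metadata deserialization. The names `LruCache`, `MetadataReader`, and `count_calls` are hypothetical. The sketch only counts how many decompression and deserialization steps each mode performs on repeated reads of the same file metadata.

```python
import json
import zlib
from collections import OrderedDict


class LruCache:
    """Tiny in-memory LRU cache (hypothetical stand-in for the cache backend)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used


class MetadataReader:
    """Reads file metadata with no cache, a bytes cache, or an object cache."""

    def __init__(self, mode, capacity=128):
        assert mode in ("none", "bytes", "objects")
        self.mode = mode
        self.cache = LruCache(capacity)
        self.decompress_calls = 0
        self.deserialize_calls = 0

    def _decompress(self, raw):
        self.decompress_calls += 1
        return zlib.decompress(raw)  # assumption: zlib as the codec

    def _deserialize(self, data):
        self.deserialize_calls += 1
        return json.loads(data)  # assumption: JSON as the metadata encoding

    def read_metadata(self, file_id, raw):
        if self.mode == "objects":
            # Method 2: cache the deserialized object; a hit skips both steps.
            obj = self.cache.get(file_id)
            if obj is None:
                obj = self._deserialize(self._decompress(raw))
                self.cache.put(file_id, obj)
            return obj
        if self.mode == "bytes":
            # Method 1: cache decompressed bytes; a hit skips decompression only.
            data = self.cache.get(file_id)
            if data is None:
                data = self._decompress(raw)
                self.cache.put(file_id, data)
            return self._deserialize(data)
        # No cache: pay for both steps on every read.
        return self._deserialize(self._decompress(raw))


def count_calls(mode, reads=3):
    """Read the same metadata `reads` times; return (decompress, deserialize) counts."""
    raw = zlib.compress(json.dumps({"columns": ["a", "b"], "rows": 1000}).encode())
    reader = MetadataReader(mode)
    for _ in range(reads):
        reader.read_metadata("file-1", raw)
    return reader.decompress_calls, reader.deserialize_calls
```

With three reads of the same metadata, `count_calls("none")` returns `(3, 3)`, `count_calls("bytes")` returns `(1, 3)`, and `count_calls("objects")` returns `(1, 1)`. This matches the ordering reported in the abstract: caching deserialized objects avoids more repeated work than caching decompressed bytes, at the cost of holding live objects rather than plain byte arrays in the cache.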


research
10/11/2019

Sub-query Fragmentation for Query Analysis and Data Caching in the Distributed Environment

When data stores and users are distributed geographically, it is essenti...
research
11/06/2020

Optimal Online Algorithms for File-Bundle Caching and Generalization to Distributed Caching

We consider a generalization of the standard cache problem called file-b...
research
05/29/2021

SMURF: Efficient and Scalable Metadata Access for Distributed Applications

In parallel with big data processing and analysis dominating the usage o...
research
11/03/2017

Toward real-time data query systems in HEP

Exploratory data analysis tools must respond quickly to a user's questio...
research
09/17/2020

Extensible Data Skipping

Data skipping reduces I/O for SQL queries by skipping over irrelevant da...
research
03/16/2018

Distributed Caching for Complex Querying of Raw Arrays

As applications continue to generate multi-dimensional data at exponenti...
research
04/07/2015

Garbage Collection Techniques for Flash-Resident Page-Mapping FTLs

Storage devices based on flash memory have replaced hard disk drives (HD...
