Distributed Caching for Complex Querying of Raw Arrays

03/16/2018
by Weijie Zhao, et al.

As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format, without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. In the first stage, it identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, it determines an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority of the proposed framework over existing techniques, by as much as two orders of magnitude, in terms of cache overhead and workload execution time.
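
To make the second stage more concrete, the following is a minimal sketch of one possible collocation heuristic. It is not the algorithm from the paper. It assumes each cached chunk starts on the node that read it from a raw file, and that each query needs a group of dependent chunks on a single node. The function name `place_groups` and the inputs `chunk_home`, `chunk_size`, and `groups` are hypothetical, introduced only for illustration. The sketch ignores cache capacity, eviction, and chunks shared between groups.

```python
from collections import defaultdict


def place_groups(chunk_home, chunk_size, groups):
    """Greedy collocation sketch (an illustrative assumption, not the paper's method).

    chunk_home: chunk id -> node that currently holds the chunk
    chunk_size: chunk id -> size in bytes
    groups:     group id -> list of chunk ids that must be processed together

    Returns (placement, transferred): the node chosen for each group and the
    total number of bytes that must be moved to realize that placement.
    """
    placement = {}
    transferred = 0
    for gid, chunks in groups.items():
        # Count how many bytes of this group each node already holds.
        local_bytes = defaultdict(int)
        for c in chunks:
            local_bytes[chunk_home[c]] += chunk_size[c]
        # Pick the node holding the most bytes; ties go to the smallest node name.
        best = max(sorted(local_bytes), key=lambda n: local_bytes[n])
        placement[gid] = best
        # Every chunk of the group that lives elsewhere has to be shipped to `best`.
        transferred += sum(chunk_size[c] for c in chunks if chunk_home[c] != best)
    return placement, transferred


if __name__ == "__main__":
    chunk_home = {"a": "n1", "b": "n1", "c": "n2", "d": "n2"}
    chunk_size = {"a": 10, "b": 20, "c": 50, "d": 5}
    groups = {"q1": ["a", "b", "c"], "q2": ["a", "d"]}
    print(place_groups(chunk_home, chunk_size, groups))
```

Choosing the node that already holds the most bytes minimizes transfer for each group in isolation. A cost-based plan of the kind described above would also have to weigh cache capacity and the historical query workload.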

