Cache-based Multi-query Optimization for Data-intensive Scalable Computing Frameworks

05/22/2018
by Pietro Michiardi, et al.

In modern large-scale distributed systems, analytics jobs submitted by different users often share similar work, for example scanning and processing the same subset of data. Optimizing each job independently can result in redundant, wasteful processing; multi-query optimization techniques can instead save a considerable amount of cluster resources. In this work, we introduce a novel method that combines in-memory cache primitives with multi-query optimization to improve the efficiency of data-intensive, scalable computing frameworks. By carefully selecting and exploiting common (sub)expressions while satisfying memory constraints, our method transforms a batch of queries into a new, more efficient batch that avoids unnecessary recomputation. To find feasible and efficient execution plans, our method uses a cost-based optimization formulation akin to the multiple-choice knapsack problem. Extensive experiments on a prototype implementation of our system show significant benefits of work sharing for both TPC-DS workloads and detailed micro-benchmarks.
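The abstract does not spell out the optimization formulation, so the following is only a minimal sketch of a multiple-choice knapsack selection, under stated assumptions: each group holds the alternative ways of caching one shared (sub)expression, each option carries a hypothetical integer memory footprint and an estimated saving, at most one option per group may be chosen, and total memory must stay within a budget. The function name, the data layout, and the numbers are illustrative and are not taken from the paper.

```python
from typing import List, Tuple

# (memory units needed, estimated cost saved); both values are illustrative assumptions.
Option = Tuple[int, float]


def select_cached_expressions(
    groups: List[List[Option]], budget: int
) -> Tuple[float, List[int]]:
    """Choose at most one option per group to maximise total saving within the budget.

    Returns (best total saving, chosen option index per group); -1 means that
    nothing from that group is cached.
    """
    # best[m] = (saving, choices) over the groups seen so far, using at most m memory units.
    best: List[Tuple[float, List[int]]] = [(0.0, []) for _ in range(budget + 1)]
    for options in groups:
        nxt: List[Tuple[float, List[int]]] = []
        for m in range(budget + 1):
            # Default choice: cache nothing from this group.
            saving, choices = best[m]
            candidate = (saving, choices + [-1])
            for idx, (mem, gain) in enumerate(options):
                if mem <= m:
                    prev_saving, prev_choices = best[m - mem]
                    if prev_saving + gain > candidate[0]:
                        candidate = (prev_saving + gain, prev_choices + [idx])
            nxt.append(candidate)
        best = nxt
    return best[budget]


# Hypothetical workload: three shared subexpressions, 6 units of cache memory.
example_groups = [
    [(4, 10.0), (2, 6.0)],   # two ways to cache subexpression A
    [(3, 7.0)],              # one way to cache subexpression B
    [(5, 12.0), (1, 2.0)],   # two ways to cache subexpression C
]
total_saving, chosen = select_cached_expressions(example_groups, budget=6)
print(total_saving, chosen)
```

This dynamic program runs in time proportional to the budget times the total number of options, which is practical only when memory is discretized coarsely; a real optimizer would likely also need cost estimates derived from the query plans themselves.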

research
07/08/2019

In-memory Distributed Spatial Query Processing and Optimization

Due to the ubiquity of spatial data applications and the large amounts o...
research
07/08/2019

LocationSpark: In-memory Distributed Spatial Query Processing and Optimization

Due to the ubiquity of spatial data applications and the large amounts o...
research
01/06/2019

Exact Selectivity Computation for Modern In-Memory Database Query Optimization

Selectivity estimation remains a critical task in query optimization eve...
research
04/28/2016

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

While cluster computing frameworks are continuously evolving to provide ...
research
08/27/2020

Cost-based Query Rewriting Techniques for Optimizing Aggregates Over Correlated Windows

Window aggregates are ubiquitous in stream processing. In Azure Stream A...
research
01/31/2018

Henge: Intent-driven Multi-Tenant Stream Processing

We present Henge, a system to support intent-based multi-tenancy in mode...
research
04/17/2019

Terra: Scalable Cross-Layer GDA Optimizations

Geo-distributed analytics (GDA) frameworks transfer large datasets over ...
