Characterizing BigBench queries, Hive, and Spark in multi-cloud environments

07/06/2020
by Nicolas Poggi, et al.

BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases (queries) that require a broad combination of data extraction techniques, including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning. However, there is currently no widespread knowledge of the resource requirements and expected performance of each query, as there is for more established benchmarks. At the same time, cloud providers offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Hive and Spark come ready to use, with a general-purpose configuration and upgrade management. This study characterizes both the BigBench queries and the out-of-the-box performance of Spark and Hive versions in the cloud. It also compares popular PaaS offerings from Azure HDInsight, Amazon Web Services EMR, and Google Cloud Dataproc in terms of reliability, data scalability (1GB to 10TB), versions, and settings. The query characterization highlights the similarities and differences between the Hive and Spark frameworks, and identifies which queries consume the most CPU, memory, and I/O. Scalability results show that configuration tuning is needed in most cloud providers as the data scale grows, especially for Spark's memory usage. These results can help practitioners test systems quickly by picking a subset of queries that stresses each of these resource categories. The results also show how Hive and Spark compare and what performance can be expected of each in PaaS.

research
06/15/2019

Query and Resource Optimizations: A Case for Breaking the Wall in Big Data Systems

Modern big data systems run on cloud environments where resources are sh...
research
04/12/2022

Forecasting SQL Query Cost at Twitter

With the advent of the Big Data era, it is usually computationally expen...
research
09/09/2020

CASH: A Credit Aware Scheduling for Public Cloud Platforms

The public cloud offers a myriad of services which allows its tenants to...
research
09/01/2018

Pay One, Get Hundreds for Free: Reducing Cloud Costs through Shared Query Execution

Cloud-based data analysis is nowadays common practice because of the low...
research
04/06/2022

Sigma Workbook: A Spreadsheet for Cloud Data Warehouses

Cloud data warehouses (CDWs) bring large-scale data and compute power cl...
research
07/08/2020

Cloud Based Big Data DNS Analytics at Turknet

Domain Name System (DNS) is a hierarchical distributed naming system for...
research
04/19/2023

Tutorial: The Ubiquitous Skiplist, its Variants, and Applications in Modern Big Data Systems

The Skiplist, or skip list, originally designed as an in-memory data str...
