Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications. We explore the system by detailing enhancements along four main axes: transactions, optimizer, runtime, and federation. We then provide experimental results to demonstrate the performance of the system for typical workloads and conclude with a look at the community roadmap.

