Large-scale Data Modelling in Hive and Distributed Query Processing using MapReduce and Tez

01/29/2023
by Abzetdin Adamov, et al.

Huge amounts of data are being generated continuously by digitally interconnected systems of humans, organizations and machines. This data comes in a variety of formats, including structured, unstructured and semi-structured, which makes it impossible to apply the same standard approaches, techniques and algorithms to manage and process all of it. Fortunately, an enterprise-level distributed platform, the Hadoop ecosystem, exists. This paper explores Apache Hive, the component that provides full-stack data management functionality in terms of data definition, data manipulation and data processing. Hive is a data warehouse system that works with structured data stored in tables. Because Hive works on top of Hadoop HDFS, it benefits from the features of HDFS, including fault tolerance, reliability, high availability and scalability. In addition, Hive can take advantage of the distributed computing power of the cluster by assigning jobs to the MapReduce, Tez and Spark engines to run complex queries. The paper focuses on the study of the Hive data model and on an analysis of the processing performance of MapReduce and Tez.


