A Survey on Geographically Distributed Big-Data Processing using MapReduce

07/06/2017
by Shlomi Dolev, et al.

Hadoop and Spark are widely used distributed frameworks for processing large-scale data efficiently and fault-tolerantly on private or public clouds. These big-data processing systems are used extensively by many companies, e.g., Google, Facebook, and Amazon, to solve a large class of problems, e.g., search, clustering, log analysis, different types of join operations, matrix multiplication, pattern matching, and social network analysis. However, all these popular systems share a major drawback: they are designed for locally distributed computations, which prevents them from implementing geographically distributed data processing. The increasing amount of massive, geographically distributed data is pushing industry and academia to rethink current big-data processing systems. Novel frameworks, going beyond the state-of-the-art architectures and technologies of current systems, are expected to process geographically distributed data at its locations without moving entire raw datasets to a single location. In this paper, we investigate and discuss the challenges and requirements in designing geographically distributed data processing frameworks and protocols. We classify and study geo-distributed frameworks, models, and algorithms for batch processing (MapReduce-based systems), stream processing (Spark-based systems), and SQL-style processing, along with their overhead issues.

