Comparisons of Algorithms in Big Data Processing

04/14/2020
by Amirali Daghighi, et al.

Parallel computing is the foundation of the MapReduce framework in Hadoop. Each data chunk is replicated on 3 servers to increase data availability and reduce the probability of data loss. The 3 servers that store the data chunk of a Map task on their disks can therefore process that task fastest; they are called local servers. All other servers in the same rack as a local server are called rack-local servers; they are slower than local servers because the data chunk associated with the Map task must be fetched through the top-of-rack switch. All remaining servers are called remote servers; they are the slowest because they must fetch the data from a local server in another rack, so the data travels through at least 2 top-of-rack switches and a core switch. The number of switches on the data-transfer path depends on the internal network structure of the data center. The First-In-First-Out (FIFO) and Hadoop Fair Scheduler (HFS) algorithms do not take the rack structure of data centers into account, and they are known to be neither heavy-traffic delay optimal nor even throughput optimal. Recent advances in data center scheduling that account for rack structure and server heterogeneity have produced the state-of-the-art Balanced-PANDAS algorithm, which outperforms the classic MaxWeight algorithm. Both Balanced-PANDAS and MaxWeight assume that the processing rates of local, rack-local, and remote servers are known. However, because traffic changes over time and processing rates are estimated with error, it is not realistic to treat the processing rates as known. In this work, we study the robustness of Balanced-PANDAS and MaxWeight to inaccurate estimates of processing rates. We observe that Balanced-PANDAS is less sensitive than MaxWeight to the accuracy of the processing-rate estimates, which makes it more appealing for use in data centers.
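
The three locality levels described above can be illustrated with a short sketch. It classifies a server relative to the replicas of a task's data chunk, then routes the task with a simplified rule: pick the server with the smallest queued workload divided by its estimated processing rate. This rule is an illustrative assumption, not the exact Balanced-PANDAS or MaxWeight algorithm, and every name and number here (`classify`, `pick_server`, the rack layout, the workloads, the rates) is hypothetical.

```python
from enum import Enum


class Locality(Enum):
    LOCAL = "local"
    RACK_LOCAL = "rack-local"
    REMOTE = "remote"


def classify(server, replica_servers, rack_of):
    """Return the locality of `server` for a task whose data chunk
    is stored on `replica_servers`."""
    if server in replica_servers:
        return Locality.LOCAL
    replica_racks = {rack_of[s] for s in replica_servers}
    if rack_of[server] in replica_racks:
        return Locality.RACK_LOCAL
    return Locality.REMOTE


def pick_server(replica_servers, rack_of, workload, est_rate):
    """Simplified routing rule (an assumption, not the paper's algorithm):
    choose the server minimizing workload / estimated processing rate."""
    def score(server):
        locality = classify(server, replica_servers, rack_of)
        return workload[server] / est_rate[locality]
    # Sorting the server names makes tie-breaking deterministic.
    return min(sorted(rack_of), key=score)


# Hypothetical topology: 5 servers in 3 racks, chunk replicated on 3 servers.
rack_of = {"s1": "r1", "s2": "r1", "s3": "r2", "s4": "r2", "s5": "r3"}
replicas = {"s1", "s3", "s4"}
workload = {"s1": 4.0, "s2": 1.0, "s3": 5.0, "s4": 6.0, "s5": 0.5}

# Made-up rates; the second set overestimates the remote rate.
true_rate = {Locality.LOCAL: 1.0, Locality.RACK_LOCAL: 0.5, Locality.REMOTE: 0.2}
bad_rate = {Locality.LOCAL: 1.0, Locality.RACK_LOCAL: 0.5, Locality.REMOTE: 0.4}

print(pick_server(replicas, rack_of, workload, true_rate))  # s2
print(pick_server(replicas, rack_of, workload, bad_rate))   # s5
```

With the second set of rates, overestimating the remote rate moves the task from a rack-local server to a remote one. This is the kind of sensitivity to estimation error that the robustness study is concerned with.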

research
03/31/2019

The Power of d Choices in Scheduling for Data Centers with Heterogeneous Servers

MapReduce framework is the de facto in big data and its applications whe...
research
05/09/2019

Load Balancing Guardrails: Keeping Your Heavy Traffic on the Road to Low Response Times

Load balancing systems, comprising a central dispatcher and a scheduling...
research
09/23/2017

GB-PANDAS: Throughput and heavy-traffic optimality analysis for affinity scheduling

Dynamic affinity scheduling has been an open problem for nearly three de...
research
03/25/2021

Accelerating Big-Data Sorting Through Programmable Switches

Sorting is a fundamental and well studied problem that has been studied ...
research
02/20/2020

Asymptotically Optimal Load Balancing in Large-scale Heterogeneous Systems with Multiple Dispatchers

We consider the load balancing problem in large-scale heterogeneous syst...
research
08/23/2017

Optimal Threshold Policies for Robust Data Center Control

With the simultaneous rise of energy costs and demand for cloud computin...
research
03/28/2020

Distributed function estimation: adaptation using minimal communication

We investigate whether in a distributed setting, adaptive estimation of ...
