Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

04/28/2016
by   Ahsan Javed Awan, et al.

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to stay at the forefront of big data analytics by being a unified framework for both batch and stream data processing. However, recent studies on the micro-architectural characterization of in-memory data analytics are limited to batch processing workloads. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual-socket server. In our evaluation experiments, we have found that batch processing and stream processing workloads have similar micro-architectural characteristics and are bounded by the latency of frequent data accesses to DRAM. For data accesses, we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance by 10% [...] execution time by up to 14%, and (iii) multiple small executors can provide up to 36% speedup over a single large executor.

