STEP: A Distributed Multi-threading Framework Towards Efficient Data Analytics

by Yijie Mei, et al.

Various general-purpose distributed systems have been proposed to cope with the highly diverse applications in the Big Data analytics pipeline. Most of them provide simple yet effective primitives to simplify distributed programming. While these rigid primitives offer great ease of use to savvy programmers, they can compromise performance efficiency and flexibility in data representation and programming specifications, which are critical properties in real systems. In this paper, we discuss the limitations of coarse-grained primitives and aim to provide an alternative that gives users flexible control over distributed programs and lets them operate on globally shared data more efficiently. We develop STEP, a novel distributed framework based on an in-memory key-value store. The key idea of STEP is to adapt single-machine multi-threading to a distributed environment. STEP enables users to take fine-grained control over distributed threads and apply task-specific optimizations in a flexible manner. The underlying key-value store serves as distributed shared memory that holds globally shared data. To ensure ease of use, STEP offers a rich set of interfaces for distributed shared data manipulation, cluster management, distributed thread management, and synchronization. We conduct extensive experimental studies to evaluate the performance of STEP using real data sets. The results show that STEP outperforms state-of-the-art general-purpose distributed systems as well as a specialized ML platform in many real applications.




