High Performance Dataframes from Parallel Processing Patterns

09/13/2022
by Niranda Perera, et al.

The data science community has embraced dataframes as the de facto standard for data representation and manipulation. Ease of use, broad operator coverage, and the popularity of the R and Python languages have heavily influenced this shift. However, the most widely used serial dataframes today (R, pandas) hit performance limitations even when working on moderately large data sets. We believe there is plenty of room for improvement, which can be found by investigating the generic distributed patterns of dataframe operators. In this paper, we propose a framework that lays the foundation for building high performance distributed-memory parallel dataframe systems based on these parallel processing patterns. We also present Cylon as a reference runtime implementation. We demonstrate how this framework has enabled Cylon to achieve scalable high performance. We also underline the flexibility of the proposed API and the extensibility of the framework to different hardware. To the best of our knowledge, Cylon is the first and only distributed-memory parallel dataframe system available today.


