In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes

07/03/2023
by Niranda Perera, et al.

Data science has grown enormously in both research and industry over the past decade, driven largely by the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) add further complexity to data engineering applications, which are now integrated into data processing pipelines that handle terabytes of data. A significant share of the time in these pipelines is typically spent on data preprocessing, so improving its efficiency directly improves overall pipeline performance. The community has recently embraced Dataframes as the de facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) run into performance limitations even on moderately large data sets. We believe there is considerable room for improvement in approaching this problem from a high-performance computing point of view. In a prior publication, we presented a set of parallel processing patterns for distributed dataframe operators and their reference runtime implementation, Cylon [1]. In this paper, we expand on that initial concept by introducing a cost model for evaluating these patterns. Furthermore, we evaluate the performance of Cylon on the ORNL Summit supercomputer.

research
09/13/2022

High Performance Dataframes from Parallel Processing Patterns

The data science community today has embraced the concept of Dataframes ...
research
01/19/2023

Supercharging Distributed Computing Environments For High Performance Data Engineering

The data engineering and data science community has embraced the idea of...
research
07/19/2020

High Performance Data Engineering Everywhere

The amazing advances being made in the fields of machine and deep learni...
research
08/13/2021

HPTMT Parallel Operators for High Performance Data Science Data Engineering

Data-intensive applications are becoming commonplace in all science disc...
research
10/27/2020

A Fast, Scalable, Universal Approach For Distributed Data Aggregations

In the current era of Big Data, data engineering has transformed into an...
research
11/26/2017

Obtaining the coefficients of a Vector Autoregression Model through minimization of parameter criteria

VAR models are a type of multi-equation model that have been widely appl...
research
12/03/2022

Applications of AI in Astronomy

We provide a brief, and inevitably incomplete overview of the use of Mac...
