Hybrid Cloud and HPC Approach to High-Performance Dataframes

12/28/2022
by Kaiying Shan, et al.

Data pre-processing is a fundamental component of any data-driven application. As data volumes and the complexity of data processing operations grow, Cylon, a distributed dataframe system, was developed to facilitate data processing both as a standalone application and as a library, especially for Python applications. While Cylon shows promising performance results, we experienced difficulties integrating it with frameworks that are incompatible with the traditional Message Passing Interface (MPI). MPI implementations provide scalable and efficient communication routines, and their process-launching mechanisms work well on mainstream HPC systems, but those mechanisms are incompatible with some environments that adopt their own resource management systems. In this work, we alleviate this issue by directly integrating the Unified Communication X (UCX) framework as our communication framework; UCX supports a variety of classic HPC and non-HPC process-bootstrapping mechanisms. Although we evaluated our methodology on Cylon, the same technique can be used to bring MPI communication to other applications that do not employ MPI's built-in process management approach.


research
10/20/2021

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems

Python has become a dominant programming language for emerging areas lik...
research
04/26/2019

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

As dataset sizes increase, data analysis tasks in high performance compu...
research
12/01/2021

A unified framework to improve the interoperability between HPC and Big Data languages and programming models

One of the most important issues in the path to the convergence of HPC a...
research
09/17/2021

Cross-layer Visualization and Profiling of Network and I/O Communication for HPC Clusters

Understanding and visualizing the full-stack performance trade-offs and ...
research
12/16/2017

An MPI-Based Python Framework for Distributed Training with Keras

We present a lightweight Python framework for distributed training of ne...
research
05/16/2018

Spark-MPI: Approaching the Fifth Paradigm of Cognitive Applications

Over the past decade, the fourth paradigm of data-intensive science rapi...
research
05/08/2019

Implementing Efficient Message Logging Protocols as MPI Application Extensions

Message logging protocols are enablers of local rollback, a more efficie...
