High Performance Data Engineering Everywhere

07/19/2020
by Chathura Widanage, et al.

The advances being made in machine learning and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond what a single node can provide. However, this is only a small part of the issues facing the overall data processing environment, which must also support a raft of data engineering for pre- and post-data processing, communication, and system integration. An important requirement of data analytics tools is that they integrate easily with existing frameworks in a multitude of languages, thereby increasing user productivity and efficiency. All this demands an efficient, highly distributed, and integrated approach to data processing, yet many of today's popular data analytics tools cannot satisfy all these requirements at the same time. In this paper we present Cylon, an open-source, high-performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. It is developed with a flexible C++ core on top of a compact data structure and exposes language bindings for C++, Java, and Python. We discuss Cylon's architecture in detail and show how it can be imported as a library into existing applications or operated as a standalone framework. Initial experiments show that Cylon enhances popular tools such as Apache Spark and Dask with major performance improvements for key operations and better component linkages. Finally, we show how its design enables Cylon to be used cross-platform with minimum overhead, including with popular AI tools such as PyTorch, TensorFlow, and Jupyter notebooks.


