Data Engineering for HPC with Python

10/13/2020
by Vibhatha Abeykoon, et al.

Data engineering is becoming an increasingly important part of scientific discovery with the adoption of deep learning and machine learning. It deals with a variety of data formats and storage systems, and with data extraction, transformation, and movement. One goal of data engineering is to transform raw data into the vector, matrix, or tensor formats accepted by deep learning and machine learning applications. Many structures, such as tables, graphs, and trees, are used to represent data during these data engineering phases. Among them, tables are a versatile and commonly used format for loading and processing data. In this paper, we present a distributed Python API based on a table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high-performance compute kernels written in C++, with an in-memory table representation exposed through Cython-based Python bindings. In the core system, we use MPI for distributed-memory computation, with a data-parallel approach for processing large datasets on HPC clusters.
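To make the data-parallel table idea concrete, the following is a minimal pure-Python sketch, not the paper's API. It assumes a hypothetical column-oriented table (a dict of lists) with integer keys, simulates MPI ranks as list entries inside a single process, and uses the illustrative helper names `hash_partition` and `to_matrix`. Each simulated rank receives the rows whose key maps to it and then converts its local partition into a row-major matrix.

```python
from typing import Dict, List

# Hypothetical table layout for illustration: column name -> column values.
Table = Dict[str, list]


def hash_partition(table: Table, key: str, world_size: int) -> List[Table]:
    """Split rows into one partition per simulated rank, based on the key column."""
    parts: List[Table] = [{name: [] for name in table} for _ in range(world_size)]
    for i, k in enumerate(table[key]):
        # Assumption: integer keys, so modulo stands in for a real hash function.
        rank = k % world_size
        for name, column in table.items():
            parts[rank][name].append(column[i])
    return parts


def to_matrix(table: Table, columns: List[str]) -> List[List[float]]:
    """Convert the selected numeric columns into a row-major matrix (list of rows)."""
    return [[float(v) for v in row] for row in zip(*(table[c] for c in columns))]


table = {
    "id": [0, 1, 2, 3, 4, 5],
    "x": [1, 2, 3, 4, 5, 6],
    "y": [10, 20, 30, 40, 50, 60],
}

# Two simulated ranks; in an MPI program each rank would hold only its own partition.
partitions = hash_partition(table, key="id", world_size=2)
matrices = [to_matrix(p, ["x", "y"]) for p in partitions]
```

In a real distributed setting the partitions would be exchanged between processes rather than held in one list, and the per-row Python loops would be replaced by compiled kernels; the sketch only shows the shape of the partition-then-convert step.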
