HPTMT: Operator-Based Architecture for Scalable High-Performance Data-Intensive Frameworks

07/27/2021
by Supun Kamburugamuve, et al.

Data-intensive applications impact many domains, and their steadily increasing size and complexity demand high-performance, highly usable environments. We integrate a set of ideas developed in various data science and data engineering frameworks. These frameworks employ a set of operators on specific data abstractions that include vectors, matrices, tensors, graphs, and tables. Our key concepts are inspired by systems like MPI, HPF (High-Performance Fortran), NumPy, Pandas, Spark, Modin, PyTorch, TensorFlow, RAPIDS (NVIDIA), and OneAPI (Intel). Further, it is crucial to support the different languages in everyday use in the Big Data arena, including Python, R, C++, and Java. We note the importance of Apache Arrow and Parquet for enabling language-agnostic high performance and interoperability. In this paper, we propose High-Performance Tensors, Matrices and Tables (HPTMT), an operator-based architecture for data-intensive applications, and identify the fundamental principles needed for performance and usability success. We illustrate these principles by discussing examples that use our software environments, Cylon and Twister2, which embody HPTMT.
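
To make the idea of "operators on data abstractions" concrete, the following is a minimal, single-process sketch of two operators (a selection and a hash join) applied to a simple columnar table. It is an illustration only and is not taken from the paper: the table representation (a dict of column lists) and the names `select` and `inner_join` are hypothetical, do not correspond to the Cylon or Twister2 APIs, and ignore the distributed execution that HPTMT targets. The join also assumes the two tables share no column names other than the key.

```python
# Hypothetical sketch: a table is a dict mapping column name -> list of values.
# None of these names come from Cylon or Twister2.

def select(table, predicate):
    """Operator: keep the rows for which predicate(row_dict) is true."""
    names = list(table)
    rows = zip(*(table[n] for n in names))
    kept = [r for r in rows if predicate(dict(zip(names, r)))]
    return {n: [r[i] for r in kept] for i, n in enumerate(names)}


def inner_join(left, right, key):
    """Operator: hash join of two tables on a shared key column."""
    # Build a hash index over the right table: key value -> row positions.
    index = {}
    for pos, k in enumerate(right[key]):
        index.setdefault(k, []).append(pos)
    right_cols = [n for n in right if n != key]
    out = {n: [] for n in list(left) + right_cols}
    # Probe the index with each left row and emit one output row per match.
    for lpos, k in enumerate(left[key]):
        for rpos in index.get(k, []):
            for n in left:
                out[n].append(left[n][lpos])
            for n in right_cols:
                out[n].append(right[n][rpos])
    return out


users = {"id": [1, 2, 3], "name": ["ann", "bob", "cy"]}
orders = {"id": [1, 1, 3], "total": [10.0, 5.5, 7.0]}

# Operators take tables and return tables, so they compose directly.
joined = inner_join(users, orders, "id")
large = select(joined, lambda row: row["total"] > 6.0)
```

Because each operator consumes and produces the same abstraction, a pipeline is just a composition of operator calls; this is the property an operator-based architecture relies on, independent of how any one operator is implemented or parallelized.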


research
08/13/2021

HPTMT Parallel Operators for High Performance Data Science Data Engineering

Data-intensive applications are becoming commonplace in all science disc...
research
01/19/2023

Supercharging Distributed Computing Environments For High Performance Data Engineering

The data engineering and data science community has embraced the idea of...
research
07/19/2020

High Performance Data Engineering Everywhere

The amazing advances being made in the fields of machine and deep learni...
research
02/13/2018

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

Deep learning models with convolutional and recurrent networks are now u...
research
05/15/2023

Dragon-Alpha cu32: A Java-based Tensor Computing Framework With its High-Performance CUDA Library

Java is very powerful, but in Deep Learning field, its capabilities prob...
research
07/09/2020

A Programming Model for Hybrid Workflows: combining Task-based Workflows and Dataflows all-in-one

This paper tries to reduce the effort of learning, deploying, and integr...
research
11/15/2017

PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

This paper describes PlinyCompute, a system for development of high-perf...
