Runtime Support for Performance Portability on Heterogeneous Distributed Platforms

by Polykarpos Thomadakis, et al.

Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. This work introduces a runtime framework that enables effortless programming for heterogeneous systems while efficiently utilizing hardware resources. The framework is integrated within a distributed and scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and the optimizations performed, which achieve up to 300% improvement and linear scalability on a node equipped with four GPUs. In a distributed memory environment, the framework offers portable abstractions that enable efficient inter-node communication among devices with varying capabilities. It outperforms MPI+CUDA by up to 20% for large messages while keeping the overheads for small messages within 10%. Furthermore, the results of our performance evaluation in a distributed Jacobi proxy application demonstrate that our software imposes minimal overhead and achieves a performance improvement of up to 40%, obtained through optimizations at the library level as well as by creating opportunities to leverage application-specific optimizations like over-decomposition.




