Accelerating HPC codes on Intel(R) Omni-Path Architecture networks: From particle physics to Machine Learning

11/13/2017
by Peter Boyle, et al.

We discuss practical methods to ensure near-wirespeed performance from clusters with one or two Intel(R) Omni-Path host fabric interfaces (HFI) per node and Intel(R) Xeon Phi(TM) 72xx (Knights Landing) processors, running the Linux operating system. The study evaluates the achievable performance improvements and the required programming approaches in two distinct example problems: firstly, Cartesian communicator halo exchange, appropriate to the structured-grid PDE solvers that arise in quantum chromodynamics simulations in particle physics; and secondly, gradient reduction, appropriate to synchronous stochastic gradient descent for machine learning. As an example, we accelerate a published Baidu Research reduction code and obtain a factor of ten speedup over the original using the techniques discussed in this paper. This demonstrates how a factor of ten speedup in strongly scaled distributed machine learning could be achieved when synchronous stochastic gradient descent is massively parallelised with a fixed mini-batch size. We find a significant improvement in performance robustness when memory is obtained using carefully allocated 2MB "huge" virtual memory pages, implying that non-standard allocation routines should be used for communication buffers; these can be provided via an LD_PRELOAD override in the manner suggested by libhugetlbfs. We make use of the Intel(R) MPI 2019 library "Technology Preview" and underlying software to enable thread concurrency throughout the communication software stack, via multiple PSM2 endpoints per process and multiple independent MPI communicators. When using a single MPI process per node, we find that this greatly accelerates delivered bandwidth on many-core Intel(R) Xeon Phi processors.
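The paper itself is not reproduced here, but the huge-page technique the abstract describes can be illustrated. The following is a minimal sketch, not the authors' code, of explicitly backing a communication buffer with 2MB huge pages on Linux via mmap(MAP_HUGETLB); the buffer size and function name are illustrative, and the system must have huge pages reserved in advance (e.g. via /proc/sys/vm/nr_hugepages).

/* Sketch: back an MPI communication buffer with 2 MB huge pages.
 * Requires pre-reserved pages, e.g.:  echo 512 > /proc/sys/vm/nr_hugepages  */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

static void *huge_alloc(size_t bytes)
{
    const size_t huge = 2UL * 1024 * 1024;                /* 2 MB page size */
    size_t rounded = (bytes + huge - 1) & ~(huge - 1);    /* round up to a page */
    void *p = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

int main(void)
{
    size_t bytes = 64UL * 1024 * 1024;                    /* 64 MB buffer */
    void *buf = huge_alloc(bytes);
    if (!buf) { perror("mmap(MAP_HUGETLB)"); return 1; }
    /* ... hand buf to MPI_Isend/MPI_Irecv as usual ... */
    munmap(buf, bytes);
    return 0;
}

Alternatively, as the abstract notes, an unmodified application can have its allocator transparently redirected to huge pages with libhugetlbfs, e.g. LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./app.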
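The multi-endpoint technique can be sketched similarly. Below is a minimal illustration, again not the authors' code, of driving one duplicated MPI communicator per OpenMP thread so that each thread's traffic can, under the Intel MPI 2019 multi-endpoint technology preview (enabled, per Intel's documentation, with settings such as I_MPI_THREAD_SPLIT=1), be serviced by its own PSM2 endpoint; the thread count, chunking, and buffer contents are illustrative assumptions.

/* Sketch: split one large allreduce across one communicator per thread,
 * so each communicator can map to a distinct PSM2 endpoint. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define NTHREADS 4
#define N (1 << 24)                      /* total floats to reduce */

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Comm comms[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);   /* one comm per thread */

    float *buf = malloc((size_t)N * sizeof(float));
    for (size_t i = 0; i < N; i++) buf[i] = 1.0f;  /* stand-in gradients */

    size_t chunk = N / NTHREADS;
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        /* Each thread reduces its own slice on its own communicator,
         * allowing concurrent progress through the communication stack. */
        MPI_Allreduce(MPI_IN_PLACE, buf + t * chunk, (int)chunk,
                      MPI_FLOAT, MPI_SUM, comms[t]);
    }

    for (int t = 0; t < NTHREADS; t++) MPI_Comm_free(&comms[t]);
    free(buf);
    MPI_Finalize();
    return 0;
}

The pattern is valid under any MPI_THREAD_MULTIPLE implementation; the speedup the paper reports depends on the runtime actually assigning independent endpoints to the independent communicators.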

