Efficient executions of Pipelined Conjugate Gradient Method on Heterogeneous Architectures

05/13/2021
by   Manasi Tiwari, et al.

The Preconditioned Conjugate Gradient (PCG) method is widely used for solving linear systems of equations with sparse matrices. A recent variant, Pipelined PCG, eliminates the dependencies between computations in the PCG algorithm so that independent computations can be overlapped with communication. In this paper, we propose three methods for efficient execution of the Pipelined PCG algorithm on GPU-accelerated heterogeneous architectures. The first two methods achieve task parallelism by executing different tasks asynchronously on the CPU cores and the GPU. The third method achieves data parallelism by decomposing the workload between the CPU and the GPU based on a performance model. The performance model measures the relative performance of the CPU cores and the GPU over a few initial executions and uses it to perform a 2D data decomposition. We also implement optimization strategies such as kernel fusion on the GPU and merging of vector operations on the CPU. Our methods give up to 8x speedup, and 3x on average, over the PCG CPU implementations in the Paralution and PETSc libraries. They also give up to 5x speedup, and 1.45x on average, over the PCG GPU implementations in the same libraries. The third method also provides an efficient solution for problems that do not fit into GPU memory, giving up to 2.5x speedup for such problems.
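For readers unfamiliar with the baseline, the following is a minimal sketch of the standard (non-pipelined) PCG iteration, not the paper's implementation. It assumes a small dense symmetric positive definite matrix stored as a list of rows and a Jacobi (diagonal) preconditioner; the function name `pcg` and its parameters are illustrative choices, not taken from the paper or from Paralution or PETSc.

```python
def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A (list of rows)
    with conjugate gradient and a Jacobi (diagonal) preconditioner.

    Returns (x, iterations). Illustrative sketch only.
    """
    n = len(b)

    def matvec(M, v):
        # Dense matrix-vector product; a sparse code would use SpMV here.
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        # In a distributed run, each dot product needs a global reduction.
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]   # Jacobi preconditioner M^-1
    x = [0.0] * n
    r = list(b)                                    # r = b - A x with x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]     # z = M^-1 r
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0               # avoid dividing by zero

    for k in range(max_iter):
        # Stop when the relative residual norm is small enough.
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                    # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz                         # direction update factor
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    x, iters = pcg(A, b)
    print(x, iters)
```

In this sketch each step depends on the result of the previous one: `alpha` needs the dot product with `Ap`, and `beta` needs the dot product computed after the residual update. Dependencies of this kind are what Pipelined PCG restructures so that reductions can proceed while other work is done.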


