A High-Throughput Solver for Marginalized Graph Kernels on GPU

10/14/2019
by Yu-Hang Tang, et al.

We present the design of a solver for the efficient, high-throughput computation of the marginalized graph kernel on general-purpose GPUs. The graph kernel is computed by using the conjugate gradient method to solve a linear system defined by a generalized Laplacian of the tensor product between a pair of graphs. To cope with the large gap between the instruction throughput and the memory bandwidth of GPUs, our solver forms the graph tensor product on the fly without storing it in memory. This is achieved by having the threads in a warp cooperatively stream the adjacency and edge label matrices of the individual graphs in small square tiles, which are then staged in registers and shared memory for later reuse. Warps within a thread block can further share tiles via shared memory to increase data reuse. The sparsity of the graphs is exploited hierarchically: only the non-empty tiles of each graph are stored, using a coordinate format, and only the non-zero elements within each tile are stored, using bitmaps. A new partition-based reordering algorithm is proposed for aggregating the non-zero elements of the graphs into fewer but denser tiles in order to improve the efficiency of the sparse format. We carried out extensive theoretical analyses of the graph tensor product primitives between tiles of various densities, and evaluated their performance on synthetic and real-world datasets. Our implementation delivers a speedup of three to four orders of magnitude over existing CPU-based solvers. This capability can enable kernel-based learning tasks at unprecedented scales.
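The idea of forming the tensor product on the fly can be illustrated with a small CPU-side sketch. This is not the paper's GPU implementation: it assumes plain dense adjacency matrices stored as nested lists, ignores edge labels, tiling, and bitmaps, and the function names `kron_matvec` and `kron_explicit` are hypothetical. It only shows that the matrix-vector product needed by each conjugate gradient iteration can be computed from the two small graph matrices directly, without ever storing their Kronecker product.

```python
def kron_matvec(A, B, x):
    """Compute y = (A kron B) x without materializing A kron B.

    A is n x n and B is m x m (lists of lists). The vector x has
    length n * m, with the entry for node pair (j, l) at x[j * m + l].
    Illustrative sketch only; names and layout are assumptions.
    """
    n, m = len(A), len(B)
    y = [0.0] * (n * m)
    for i in range(n):
        for j in range(n):
            a = A[i][j]
            if a == 0:
                continue  # skip absent edges of the first graph
            for k in range(m):
                # Row k of B applied to the j-th block of x.
                acc = 0.0
                for l in range(m):
                    acc += B[k][l] * x[j * m + l]
                y[i * m + k] += a * acc
    return y


def kron_explicit(A, B):
    """Build the full (n*m) x (n*m) Kronecker product, for comparison only."""
    n, m = len(A), len(B)
    size = n * m
    return [
        [A[r // m][c // m] * B[r % m][c % m] for c in range(size)]
        for r in range(size)
    ]


if __name__ == "__main__":
    A = [[0, 1], [1, 0]]                     # a single edge
    B = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # a path on three nodes
    x = [1, 2, 3, 4, 5, 6]
    print(kron_matvec(A, B, x))
```

The explicit product needs storage that grows with (n·m)², whereas the on-the-fly version only ever touches the n² and m² entries of the two input matrices, which is the memory-traffic saving the abstract refers to.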


