Efficient, Out-of-Memory Sparse MTTKRP on Massively Parallel Architectures

01/29/2022
by Andy Nguyen, et al.

Tensor decomposition (TD) is an important method for extracting latent information from high-dimensional (multi-modal) sparse data. This study presents a novel framework for accelerating fundamental TD operations on massively parallel GPU architectures. In contrast to prior work, the proposed Blocked Linearized Coordinate (BLCO) format enables efficient out-of-memory computation of tensor algorithms using a unified implementation that works on a single tensor copy. Our adaptive blocking and linearization strategies not only meet the resource constraints of GPU devices, but also accelerate data indexing, eliminate control-flow and memory-access irregularities, and reduce kernel-launch overhead. To address the substantial synchronization cost on GPUs, we introduce an opportunistic conflict-resolution algorithm in which threads collaborate, rather than contend, on memory access to discover and resolve their conflicting updates on the fly, without keeping any auxiliary information or storing non-zero elements in specific mode orientations. As a result, our framework delivers superior in-memory performance compared to the prior state of the art, and it is the only framework capable of processing out-of-memory tensors. On the latest Intel and NVIDIA GPUs, BLCO achieves a 2.12-2.6X geometric-mean speedup (up to 33.35X) over the state-of-the-art mixed-mode compressed sparse fiber (MM-CSF) format on a range of real-world sparse tensors.
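To make the abstract's two key ideas concrete (packing a sparse tensor's multi-mode coordinates into a single linearized copy, and the mode-wise MTTKRP whose scattered output updates create the synchronization problem on GPUs), the following minimal NumPy sketch illustrates them. It is not the authors' BLCO implementation: the helper names (linearize, delinearize, block, mttkrp) and the fixed-size blocking scheme are hypothetical simplifications for illustration.

```python
import numpy as np

# Sketch of a linearized-coordinate sparse tensor (in the spirit of ALTO/BLCO):
# all mode indices of each non-zero are packed into one 64-bit key, so a single
# tensor copy can drive the MTTKRP along every mode.

def bits_needed(dim):
    """Bits required to encode an index in [0, dim)."""
    return max(1, int(np.ceil(np.log2(dim))))

def linearize(idx, dims):
    """Pack per-mode indices (nnz x nmodes) into one uint64 key per non-zero."""
    key = np.zeros(idx.shape[0], dtype=np.uint64)
    shift = 0
    for m, d in enumerate(dims):
        key |= idx[:, m].astype(np.uint64) << np.uint64(shift)
        shift += bits_needed(d)
    return key

def delinearize(key, dims):
    """Recover per-mode indices from the packed keys."""
    idx = np.empty((key.size, len(dims)), dtype=np.int64)
    shift = 0
    for m, d in enumerate(dims):
        mask = np.uint64((1 << bits_needed(d)) - 1)
        idx[:, m] = ((key >> np.uint64(shift)) & mask).astype(np.int64)
        shift += bits_needed(d)
    return idx

def block(key, vals, nnz_per_block):
    """Split the linearized tensor into fixed-size chunks that can be streamed
    to the device one at a time (a crude stand-in for out-of-memory blocking)."""
    order = np.argsort(key)
    key, vals = key[order], vals[order]
    return [(key[s:s + nnz_per_block], vals[s:s + nnz_per_block])
            for s in range(0, key.size, nnz_per_block)]

def mttkrp(key, vals, dims, factors, mode):
    """Reference sparse MTTKRP along `mode` from the single linearized copy:
    out[i_mode, :] += value * (elementwise product of the other factor rows)."""
    idx = delinearize(key, dims)
    rank = factors[0].shape[1]
    out = np.zeros((dims[mode], rank))
    for n in range(vals.size):
        row = vals[n] * np.ones(rank)
        for m in range(len(dims)):
            if m != mode:
                row *= factors[m][idx[n, m]]
        out[idx[n, mode]] += row  # the scattered update GPU threads must synchronize
    return out

# Tiny usage example: a 3-mode tensor with three non-zeros and rank-8 factors.
dims = (4, 5, 6)
idx = np.array([[0, 1, 2], [3, 4, 5], [0, 1, 5]])
vals = np.array([1.0, 2.0, 3.0])
key = linearize(idx, dims)
factors = [np.random.rand(d, 8) for d in dims]
m0 = mttkrp(key, vals, dims, factors, mode=0)  # shape (4, 8)
```

The serial loop hides the hard part: on a GPU, many threads execute the `out[...] += row` update concurrently, and threads whose non-zeros share an output row conflict. The opportunistic conflict resolution described in the abstract has neighboring threads detect such shared rows on the fly and combine their partial results before issuing a write, instead of contending with per-element atomics. Note also that when the combined index bits of all modes exceed one machine word, a format like BLCO must handle the overflow across blocks; the sketch simply assumes everything fits in 64 bits.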

Related research

02/20/2021 · ALTO: Adaptive Linearized Storage of Sparse Tensors
The analysis of high-dimensional sparse data is becoming increasingly po...

10/14/2019 · A High-Throughput Solver for Marginalized Graph Kernels on GPU
We present the design of a solver for the efficient and high-throughput ...

04/06/2019 · Load-Balanced Sparse MTTKRP on GPUs
Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the...

12/13/2020 · Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures
Designing efficient and scalable sparse linear algebra kernels on modern...

03/13/2023 · ∇SD: Differentiable Programming for Sparse Tensors
Sparse tensors are prevalent in many data-intensive applications, yet ex...

09/18/2020 · GrateTile: Efficient Sparse Tensor Tiling for CNN Processing
We propose GrateTile, an efficient, hardware-friendly data storage scheme...

04/14/2022 · cu_FastTucker: A Faster and Stabler Stochastic Optimization for Parallel Sparse Tucker Decomposition on Multi-GPUs
High-Order, High-Dimension, and Sparse Tensor (HOHDST) data originates f...
