Efficient, Out-of-Memory Sparse MTTKRP on Massively Parallel Architectures

by Andy Nguyen, et al.

Tensor decomposition (TD) is an important method for extracting latent information from high-dimensional (multi-modal) sparse data. This study presents a novel framework for accelerating fundamental TD operations on massively parallel GPU architectures. In contrast to prior work, the proposed Blocked Linearized CoOrdinate (BLCO) format enables efficient out-of-memory computation of tensor algorithms using a unified implementation that works on a single tensor copy. Our adaptive blocking and linearization strategies not only meet the resource constraints of GPU devices, but also accelerate data indexing, eliminate control-flow and memory-access irregularities, and reduce kernel launching overhead. To address the substantial synchronization cost on GPUs, we introduce an opportunistic conflict resolution algorithm that discovers and resolves conflicting updates across threads on-the-fly, without keeping any auxiliary information or storing non-zero elements in specific mode orientations. As a result, our framework delivers superior in-memory performance compared to prior state-of-the-art, and is the only framework capable of processing out-of-memory tensors. On the latest Intel and NVIDIA GPUs, BLCO achieves 2.12-2.6X geometric-mean speedup (with up to 33.35X speedup) over the state-of-the-art mixed-mode compressed sparse fiber (MM-CSF) on a range of real-world sparse tensors.
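To make the two central ideas of the abstract concrete, the following is a minimal CPU-side sketch in Python/NumPy, not the authors' GPU implementation. It assumes a simplified linearization in which each mode's index occupies its own contiguous bit field of a single 64-bit integer, and it uses `np.add.at` as a stand-in for conflict-safe accumulation of updates that target the same output row. The function names (`mode_bits`, `linearize`, `delinearize`, `mttkrp`) are hypothetical, and the blocking step that BLCO applies to satisfy device resource limits is omitted.

```python
import numpy as np


def mode_bits(dims):
    """Number of bits needed to hold any index of each mode (at least 1)."""
    return [max(1, int(d - 1).bit_length()) for d in dims]


def linearize(coords, dims):
    """Pack an (nnz, N) coordinate array into one uint64 per non-zero.

    Assumption: mode m occupies its own contiguous bit field, with mode 0
    in the lowest bits. The real format may arrange bits differently.
    """
    bits = mode_bits(dims)
    assert sum(bits) <= 64, "indices do not fit in a single 64-bit word"
    lin = np.zeros(coords.shape[0], dtype=np.uint64)
    shift = 0
    for m, b in enumerate(bits):
        lin |= coords[:, m].astype(np.uint64) << np.uint64(shift)
        shift += b
    return lin


def delinearize(lin, dims, mode):
    """Recover the index of one mode with a single shift and mask."""
    bits = mode_bits(dims)
    shift = sum(bits[:mode])
    mask = (1 << bits[mode]) - 1
    return ((lin >> np.uint64(shift)) & np.uint64(mask)).astype(np.int64)


def mttkrp(lin, vals, factors, dims, mode):
    """Mode-`mode` MTTKRP over a linearized sparse tensor.

    For every non-zero, the rows of all other factor matrices are
    multiplied element-wise, scaled by the value, and accumulated into
    the output row selected by the target mode's index.
    """
    rank = factors[0].shape[1]
    rows = vals[:, None] * np.ones((1, rank))
    for m in range(len(dims)):
        if m != mode:
            rows = rows * factors[m][delinearize(lin, dims, m)]
    out = np.zeros((dims[mode], rank))
    # np.add.at accumulates correctly when several non-zeros hit the same
    # output row; on a GPU this is where conflicting updates must be resolved.
    np.add.at(out, delinearize(lin, dims, mode), rows)
    return out
```

Because every mode is recovered from the same linearized array, one copy of the tensor serves MTTKRP along any mode: calling `mttkrp(lin, vals, factors, dims, mode)` with `mode` set to 0, 1, or 2 reuses the same `lin` and `vals`.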




