H2OPUS-TLR: High Performance Tile Low Rank Symmetric Factorizations using Adaptive Randomized Approximation

08/26/2021
by Wajih Boukaram, et al.

Tile low rank (TLR) representations of dense matrices partition them into blocks of roughly uniform size, where each off-diagonal tile is compressed and stored as its own low rank factorization. They offer an attractive representation for many data-sparse dense operators that appear in practical applications, where substantial compression and a much smaller memory footprint can be achieved. TLR matrices are a compromise between the simplicity of a regular, perfectly-strided data structure and the optimal complexity of the unbalanced trees of hierarchically low rank matrices, and they provide a convenient performance-tuning parameter through their tile size, which can be proportioned to the cache size of the level of the memory hierarchy where the tiles reside. However, there are currently no high-performance algorithms that can generate the Cholesky and LDL^T factorizations of TLR matrices, particularly on GPUs. The difficulty in achieving high performance when factoring TLR matrices comes from the expensive compression operations that must be performed during the factorization and from the adaptive rank distribution of the tiles, which causes an irregular work pattern for the processing cores. In this work, we develop a dynamic batching operation and combine it with batched adaptive randomized approximations to achieve high performance on both GPUs and CPUs. Our implementation attains over 1.2 TFLOP/s in double precision on the V100 GPU and is limited by the performance of batched GEMM operations. A covariance matrix of size N = 131K arising in spatial statistics can be Cholesky-factored to an accuracy ϵ = 10^-2 in just a few seconds. We believe the GEMM-centric structure of the proposed algorithm allows it to be readily ported to newer hardware, such as tensor cores optimized for small GEMM operations.
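To make the compression step concrete, here is a minimal NumPy sketch (not the H2OPUS-TLR implementation itself) of the two ideas the abstract combines: partitioning a dense symmetric matrix into tiles, and compressing each off-diagonal tile with an adaptive randomized approximation whose sampled rank grows in small blocks until a target accuracy is reached. The function names, tile size b, tolerance eps, and block-sampling step are all illustrative choices, and the residual is formed explicitly for clarity where a production code would estimate it cheaply.

    # Illustrative sketch only; all names and parameters are hypothetical.
    import numpy as np

    def adaptive_randomized_lowrank(A, eps, block=8, max_rank=None):
        """Return (U, V) with A ~= U @ V.T. The rank grows in steps of
        `block` until the residual falls below eps * ||A||_F."""
        m, n = A.shape
        max_rank = max_rank or min(m, n)
        rng = np.random.default_rng(0)
        Q = np.empty((m, 0))
        target = eps * np.linalg.norm(A, 'fro')
        while Q.shape[1] < max_rank:
            # Sample a new block of the range of A and orthogonalize it
            # against the basis accumulated so far.
            Y = A @ rng.standard_normal((n, block))
            Y -= Q @ (Q.T @ Y)
            Qi, _ = np.linalg.qr(Y)
            Q = np.hstack([Q, Qi])
            # Stop once the part of A not captured by Q is small enough.
            # (A real implementation would estimate this residual cheaply
            # rather than forming it.)
            R = A - Q @ (Q.T @ A)
            if np.linalg.norm(R, 'fro') <= target:
                break
        return Q, (Q.T @ A).T  # U = Q, V = A.T @ Q

    def tlr_compress(A, b, eps):
        """Store diagonal tiles densely; off-diagonal tiles as (U, V)."""
        n = A.shape[0]
        tiles = {}
        for i in range(0, n, b):
            for j in range(0, n, b):
                T = A[i:i+b, j:j+b]
                tiles[(i // b, j // b)] = T.copy() if i == j \
                    else adaptive_randomized_lowrank(T, eps)
        return tiles

As a usage sketch, a hypothetical exponential covariance kernel (whose off-diagonal tiles are numerically low rank) can be compressed and one tile checked for reconstruction error:

    x = np.linspace(0, 1, 256)
    K = np.exp(-np.abs(x[:, None] - x[None, :]))  # hypothetical kernel
    tiles = tlr_compress(K, b=64, eps=1e-2)
    U, V = tiles[(0, 3)]  # an off-diagonal tile's low rank factors
    print(U.shape[1], np.linalg.norm(K[0:64, 192:256] - U @ V.T))

In the paper's setting, many such compressions with widely varying adaptive ranks must run concurrently during the factorization, which is what motivates the dynamic batching of these operations on the GPU.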


