Distributed Out-of-Memory SVD on CPU/GPU Architectures

08/17/2022
by Ismael Boureima, et al.

We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous (CPU+GPU) high performance computing (HPC) systems. Various implementations of SVD have been proposed, but most estimate only the singular values, because also estimating the singular vectors can significantly increase the time and memory complexity of the algorithm. In this work, we propose an implementation of SVD based on the power method, which estimates both the truncated singular values and the corresponding singular vectors. Memory utilization bottlenecks in the power method are typically associated with the computation of the Gram matrix A^TA, which can be significant when A is large and dense, or when A is super-large and sparse. The proposed implementation is optimized for out-of-memory problems, where the memory required to factorize a given matrix is greater than the available GPU memory. We reduce the memory complexity of A^TA by using a batching strategy in which the intermediate factors are computed block by block. We also hide the I/O latency associated with both host-to-device (H2D) and device-to-host (D2H) batch copies by overlapping each batch copy with compute using CUDA streams. Furthermore, we use optimized NCCL-based communicators to reduce the latency associated with collective communications (both intra-node and inter-node). In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores (or tensor cores when available), resulting in an implementation with good scaling. We demonstrate the scalability of our distributed out-of-core SVD algorithm by successfully decomposing a dense matrix of size 1TB and a sparse matrix of size 128PB with 1e-6 sparsity.
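
The batching idea in the abstract can be illustrated with a minimal single-process NumPy sketch. It is not the authors' distributed CUDA/NCCL implementation: the function names (`gram_matvec_batched`, `matvec_batched`, `truncated_svd_power`), the `batch_rows` parameter, and the use of deflation to obtain more than one singular triplet are all assumptions made for illustration. The sketch also assumes that A has rank at least k. The point it shows is that the product (A^TA)v can be accumulated over row blocks of A, so the Gram matrix is never stored in full.

```python
import numpy as np


def gram_matvec_batched(A, v, batch_rows):
    """Return (A^T A) v by accumulating over row blocks of A.

    The n-by-n Gram matrix is never formed; only one row block of A is
    used at a time, standing in for a batch copied to the device.
    """
    out = np.zeros(A.shape[1])
    for start in range(0, A.shape[0], batch_rows):
        block = A[start:start + batch_rows]
        out += block.T @ (block @ v)
    return out


def matvec_batched(A, v, batch_rows):
    """Return A v, computed one row block at a time."""
    return np.concatenate([
        A[start:start + batch_rows] @ v
        for start in range(0, A.shape[0], batch_rows)
    ])


def truncated_svd_power(A, k, batch_rows=16, iters=1000, tol=1e-12, seed=0):
    """Estimate the top-k singular triplets of A with power iteration.

    Hypothetical helper: deflation against already-found right singular
    vectors is an assumption of this sketch, not a detail from the paper.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    U, S, V = np.zeros((m, k)), np.zeros(k), np.zeros((n, k))
    for j in range(k):
        found = V[:, :j]  # right singular vectors found so far
        v = rng.standard_normal(n)
        v -= found @ (found.T @ v)  # start orthogonal to earlier vectors
        v /= np.linalg.norm(v)
        for _ in range(iters):
            w = gram_matvec_batched(A, v, batch_rows)
            w -= found @ (found.T @ w)  # deflate earlier directions
            norm = np.linalg.norm(w)
            if norm == 0.0:
                break
            w /= norm
            done = np.linalg.norm(w - v) < tol
            v = w
            if done:
                break
        Av = matvec_batched(A, v, batch_rows)
        S[j] = np.linalg.norm(Av)  # sigma_j = ||A v_j||
        U[:, j] = Av / S[j]        # u_j = A v_j / sigma_j
        V[:, j] = v
    return U, S, V
```

Calling `truncated_svd_power(A, k=3)` on an in-memory array returns `U`, `S`, `V` with `A @ V` approximately equal to `U * S`. Convergence of each power iteration depends on the gap between consecutive singular values, so matrices with clustered singular values may need many more iterations than this sketch allows by default.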

