Analyzing the Performance Portability of Tensor Decomposition

We employ pressure point analysis and roofline modeling to identify performance bottlenecks and to determine an upper bound on the performance of the Canonical Polyadic Alternating Poisson Regression Multiplicative Update (CP-APR MU) algorithm in the SparTen software library. Our analyses reveal that the computation of a particular matrix, Φ^(n), is the critical performance bottleneck in the SparTen CP-APR MU implementation. Moreover, we find that atomic operations are not a critical bottleneck, whereas higher cache reuse can provide a non-trivial performance improvement. We also use grid search over the parallel policy parameters of the Kokkos library to achieve an average speedup of 2.25x over the SparTen default for the Φ^(n) computation on CPU and 1.70x on GPU. We conclude our investigation by comparing Kokkos implementations of the STREAM benchmark and the matricized tensor times Khatri-Rao product (MTTKRP) benchmark from the Parallel Sparse Tensor Algorithm (PASTA) benchmark suite to implementations using vendor libraries. We show that, with a single implementation, Kokkos achieves performance comparable to hand-tuned code for the fundamental operations that make up tensor decomposition kernels on a wide range of CPU and GPU systems. Overall, we conclude that Kokkos demonstrates good performance portability for simple data-intensive operations but requires tuning for algorithms with more complex dependencies and data access patterns.
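
As an illustration of the kind of upper bound that roofline modeling yields, here is a minimal sketch. It assumes the standard roofline formulation, in which attainable throughput is the lesser of peak compute throughput and memory bandwidth multiplied by arithmetic intensity. The function name and the hardware figures are hypothetical placeholders, not measurements from this work or values taken from SparTen.

```python
def roofline_bound(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Upper bound on attainable GFLOP/s under the standard roofline model.

    intensity is the arithmetic intensity in FLOPs per byte moved from memory.
    """
    if min(peak_gflops, bandwidth_gbs, intensity) < 0:
        raise ValueError("inputs must be non-negative")
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine (assumed values): 1000 GFLOP/s peak, 100 GB/s bandwidth.
PEAK, BW = 1000.0, 100.0
ridge = PEAK / BW  # intensity at which a kernel stops being bandwidth-bound

for ai in (0.25, 2.0, 10.0, 40.0):
    bound = roofline_bound(PEAK, BW, ai)
    regime = "memory-bound" if ai < ridge else "compute-bound"
    print(f"AI={ai:5.2f} FLOP/byte -> bound {bound:7.1f} GFLOP/s ({regime})")
```

A kernel whose arithmetic intensity falls below the ridge point is limited by memory traffic rather than by compute, which is the regime in which improved cache reuse can raise achieved performance.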


