cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs

05/03/2017
by Antti-Pekka Hynninen, et al.

We introduce the CUDA Tensor Transpose (cuTT) library, which implements high-performance tensor transposes for NVIDIA GPUs of the Kepler architecture and newer. cuTT achieves high performance by (a) using two GPU-optimized transpose algorithms that both use a shared memory buffer to reduce global memory access scatter, and (b) computing the memory positions of tensor elements with a thread-parallel algorithm. We evaluate cuTT on a variety of benchmarks with tensor ranks ranging from 2 to 12 and show that its performance is independent of the tensor rank and that it performs no worse than an approach based on code generation. We develop a heuristic scheme for choosing the optimal parameters for the tensor transpose algorithms by implementing an analytical GPU performance model that can be used at runtime without the need for performance measurements or profiling. Finally, by integrating cuTT into the tensor algebra library TAL-SH, we significantly reduce the tensor transpose overhead in tensor contractions, bringing it as low as one percent for arithmetically intensive tensor contractions.
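To make the idea of computing memory positions of tensor elements concrete, the following is a minimal serial sketch in Python. It is not cuTT code. It assumes a layout in which the first dimension varies fastest, and a permutation convention in which output dimension j is input dimension perm[j]. The function names are hypothetical. Each term of the sum in `transposed_position` depends on only one dimension, which is what makes this kind of calculation a natural candidate for evaluating the terms in parallel and then summing them.

```python
from itertools import accumulate
from operator import mul


def cumulative_strides(dims):
    """Stride of each dimension when the first dimension varies fastest."""
    return [1] + list(accumulate(dims, mul))[:-1]


def transposed_position(p, dims, perm):
    """Map linear position p in the input to its linear position in the output.

    Assumed convention (not taken from the source): output dimension j is
    input dimension perm[j].
    """
    in_strides = cumulative_strides(dims)
    out_dims = [dims[k] for k in perm]
    out_strides = cumulative_strides(out_dims)

    # Output stride seen by each *input* dimension.
    strides_for_input = [0] * len(dims)
    for j, k in enumerate(perm):
        strides_for_input[k] = out_strides[j]

    # One independent term per dimension: extract the index along that
    # dimension from p, then scale it by the matching output stride.
    return sum(
        ((p // c) % d) * t
        for c, d, t in zip(in_strides, dims, strides_for_input)
    )


def transpose_flat(data, dims, perm):
    """Reference transpose of a flat list, one element at a time."""
    out = [None] * len(data)
    for p, value in enumerate(data):
        out[transposed_position(p, dims, perm)] = value
    return out
```

For a 2 x 3 tensor stored as `[0, 1, 2, 3, 4, 5]`, `transpose_flat(list(range(6)), [2, 3], [1, 0])` returns `[0, 2, 4, 1, 3, 5]`. A GPU implementation would additionally stage elements through shared memory to keep global memory accesses contiguous, which this sketch does not attempt.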


