GPU Tensor Cores for Fast Arithmetic Reductions

01/15/2020
by Cristóbal A. Navarro, et al.

This work proposes a GPU tensor core approach that encodes the arithmetic reduction of n numbers as a set of chained m × m matrix multiply-accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is T(n) = 5·log_{m²}(n), and its speedup is S = (4/5)·log₂(m²) over the classic O(n log n) parallel reduction algorithm. Experimental performance results show that the proposed reduction method is ∼3.2× faster than a conventional GPU reduction implementation, and it preserves numerical precision because the sub-results of each chain of R MMAs are kept as 32-bit floating point values before all being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks; small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of R = 4 or 5 MMAs per block, while large thread-blocks work best with R = 1. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine-Learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
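The MMA-based reduction pattern can be emulated on the CPU to see why it works: multiplying from the left by an all-ones matrix turns a matrix product into column sums, and a second multiplication collapses those into the total. The sketch below is an illustrative NumPy emulation, not the paper's CUDA kernel; the fragment size m and chain length R are chosen for readability, whereas real tensor cores operate on fixed half-precision fragments with FP32 accumulators.

```python
import numpy as np

m = 4                                    # illustrative fragment size
ones = np.ones((m, m), dtype=np.float32)

# One m x m block of input values.
data = np.arange(m * m, dtype=np.float32).reshape(m, m)

# MMA 1: every row of (ones @ data) holds the column sums of data.
col_sums = ones @ data
# MMA 2: every entry of (col_sums @ ones) holds the total sum.
total = float((col_sums @ ones)[0, 0])   # sum of 0..15 -> 120.0

# Chained variant: accumulate R blocks with D = ones @ B_k + D
# before the final MMA, keeping partial sums in FP32 throughout
# (this is what preserves precision across the chain).
R = 3
blocks = np.arange(R * m * m, dtype=np.float32).reshape(R, m, m)
acc = np.zeros((m, m), dtype=np.float32)
for k in range(R):
    acc = ones @ blocks[k] + acc         # one MMA per block
chained_total = float((acc @ ones)[0, 0])  # sum of 0..47 -> 1128.0
```

Each pass over a block costs one MMA rather than log₂(m²) pairwise steps, which is the source of the asymptotic advantage claimed in the abstract.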

