Bandwidth-Optimal Random Shuffling for GPUs

06/11/2021
by Rory Mitchell, et al.

Linear-time algorithms traditionally used to shuffle data on CPUs, such as the Fisher-Yates method, are not well suited to implementation on GPUs because of their inherent sequential dependencies. Moreover, existing parallel shuffling algorithms perform poorly on GPU architectures because they incur a large number of read/write operations to high-latency global memory. To address this, we provide a method of generating pseudo-random permutations in parallel by fusing suitable pseudo-random bijective functions with stream compaction operations. Our algorithm, termed "bijective shuffle", trades increased per-thread arithmetic operations for reduced global memory transactions. It is work-efficient, deterministic, and requires only a single global memory read and write per shuffle input, thus maximising use of global memory bandwidth. To empirically demonstrate the correctness of the algorithm, we develop a consistent, linear-time statistical test for the quality of pseudo-random permutations based on kernel space embeddings. Empirical results show that the bijective shuffle algorithm outperforms competing algorithms on multicore CPUs and GPUs, with improvements of between one and two orders of magnitude, and approaches peak device bandwidth.
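
The core idea in the abstract, a pseudo-random bijection fused with stream compaction, can be illustrated with a short sequential sketch. This is not the authors' GPU implementation: the choice of a small Feistel network as the bijection, the round function, and the names `make_feistel_bijection` and `bijective_shuffle` are assumptions made purely for illustration. The sketch pads the input length n up to a power of two m, applies a keyed bijection to every index in [0, m), and keeps only the outputs that fall below n. Because the bijection hits each value in [0, m) exactly once, the surviving outputs form a permutation of [0, n).

```python
import random


def make_feistel_bijection(num_bits, seed, rounds=4):
    """Return a keyed bijection on [0, 2**num_bits).

    Hypothetical helper: a balanced Feistel network, so num_bits must be even.
    A Feistel network is invertible whatever round function is used.
    """
    assert num_bits >= 2 and num_bits % 2 == 0
    half = num_bits // 2
    mask = (1 << half) - 1
    rng = random.Random(seed)
    keys = [rng.getrandbits(32) for _ in range(rounds)]

    def round_fn(x, key):
        # Illustrative mixing function (an assumption, not taken from the paper).
        return ((((x ^ key) * 0x9E3779B1) & 0xFFFFFFFF) >> 16) & mask

    def bijection(i):
        left, right = i >> half, i & mask
        for key in keys:
            left, right = right, left ^ round_fn(right, key)
        return (left << half) | right

    return bijection


def bijective_shuffle(items, seed):
    """Shuffle items by compacting the outputs of a bijection on a padded range."""
    n = len(items)
    if n == 0:
        return []
    # Smallest even bit width whose range [0, 2**num_bits) covers n indices.
    num_bits = max(2, (n - 1).bit_length())
    if num_bits % 2:
        num_bits += 1
    capacity = 1 << num_bits
    f = make_feistel_bijection(num_bits, seed)
    # Compaction step: keep only mapped indices inside the real input range.
    # On a GPU each i would be handled by its own thread and the filter would
    # be a parallel stream compaction; here it is a plain sequential filter.
    return [items[j] for j in (f(i) for i in range(capacity)) if j < n]
```

Calling `bijective_shuffle(list(range(10)), seed=7)` returns the same ten values in a seed-dependent order, and repeating the call with the same seed reproduces that order. Requiring an even bit width keeps the Feistel halves balanced, at the cost of a padded range that can be up to roughly four times the input length in this sketch.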


