A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)

05/17/2021
by Berenger Bramas, et al.

How developers implement their algorithms, and how those implementations behave on modern CPUs, are governed by the design and organization of the CPUs themselves. The vectorization units (SIMD) are among the few parts of a CPU that can and must be explicitly controlled. In the HPC community, x86 CPUs and their vectorization instruction sets were the de facto standard for decades. Each new instruction set usually doubled the vector length and added new operations, and each generation pushed developers to adapt and improve previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the coming years. Second, SVE's interface differs from the x86 extensions in several respects: it provides different instructions, uses a predicate to control most operations, and has a vector size that is only known at execution time. Therefore, using SVE raises new challenges in adapting algorithms, including those that are already well optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates, how we manage the non-static vector size, and how we efficiently implement the sorting kernels. Our approach only needs an array of size O(log N) for the recursive calls in the partitioning phase, both in the sequential and in the parallel case. We test the performance of our approach on a modern ARMv8.2 processor and assess the different layers of our implementation by sorting/partitioning integers, double-precision floating-point numbers, and key/value pairs of integers. Our approach is faster than the GNU C++ sort algorithm by a factor of 4 on average.
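To make the overall structure concrete, the following is a minimal scalar sketch of a hybrid sort of this kind. It is not the authors' SVE implementation: it uses no vector intrinsics or predicates, and the names `kSmallCutoff`, `bitonic_sort_small`, `partition_range`, and `hybrid_sort`, the cutoff of 16, and the middle-element pivot are all assumptions made for illustration. It only shows how a Quicksort partitioning phase can hand small partitions to a Bitonic sorting network, and how always continuing with the smaller partition keeps the explicit stack at O(log N) entries.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical cutoff (a power of two): partitions of at most this many
// elements are sorted by the bitonic network instead of being partitioned.
constexpr std::size_t kSmallCutoff = 16;

// Scalar bitonic sorting network for data[0..n), with n <= kSmallCutoff.
// The input is padded to a power of two with the largest int value, so the
// padding ends up after the real elements once the network has run.
inline void bitonic_sort_small(int* data, std::size_t n) {
    assert(n <= kSmallCutoff);
    int buf[kSmallCutoff];
    std::size_t size = 1;
    while (size < n) size *= 2;
    for (std::size_t i = 0; i < size; ++i)
        buf[i] = (i < n) ? data[i] : std::numeric_limits<int>::max();
    for (std::size_t k = 2; k <= size; k *= 2) {          // merged block length
        for (std::size_t j = k / 2; j > 0; j /= 2) {      // compare distance
            for (std::size_t i = 0; i < size; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((buf[i] > buf[partner]) == ascending)
                        std::swap(buf[i], buf[partner]);
                }
            }
        }
    }
    for (std::size_t i = 0; i < n; ++i) data[i] = buf[i];
}

// Scalar partition of data[lo..hi) around the middle element (assumed pivot
// choice). Returns the final index of the pivot; smaller values are left of it.
inline std::size_t partition_range(int* data, std::size_t lo, std::size_t hi) {
    const std::size_t mid = lo + (hi - lo) / 2;
    std::swap(data[mid], data[hi - 1]);
    const int pivot = data[hi - 1];
    std::size_t store = lo;
    for (std::size_t i = lo; i + 1 < hi; ++i)
        if (data[i] < pivot) std::swap(data[i], data[store++]);
    std::swap(data[store], data[hi - 1]);
    return store;
}

// Hybrid sort without recursion: the larger partition is pushed on an
// explicit stack and the smaller one is processed next, so the stack never
// holds more than about log2(N) ranges.
inline void hybrid_sort(std::vector<int>& v) {
    std::vector<std::pair<std::size_t, std::size_t>> stack;
    std::size_t lo = 0, hi = v.size();
    while (true) {
        while (hi - lo > kSmallCutoff) {
            const std::size_t p = partition_range(v.data(), lo, hi);
            if (p - lo < hi - (p + 1)) {      // left part is the smaller one
                stack.emplace_back(p + 1, hi);
                hi = p;
            } else {                          // right part is the smaller one
                stack.emplace_back(lo, p);
                lo = p + 1;
            }
        }
        bitonic_sort_small(v.data() + lo, hi - lo);
        if (stack.empty()) break;
        lo = stack.back().first;
        hi = stack.back().second;
        stack.pop_back();
    }
}
```

Calling `hybrid_sort(values)` on a `std::vector<int>` sorts it in place in ascending order. In a vectorized version, the inner loops of both the partitioning and the bitonic network are the parts that would be replaced by SIMD kernels; the control structure around them can stay the same.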
