Transcoding Unicode Characters with AVX-512 Instructions

12/09/2022
by   Robert Clausecker, et al.
0

Intel includes on its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512). Some of these instructions have no equivalent in earlier instruction sets. We leverage these instructions to efficiently transcode strings between the most common formats: UTF-8 and UTF-16. With our novel algorithms, we are often twice as fast as the previous best solutions. For example, we transcode Chinese text from UTF-8 to UTF-16 at more than 5 GiB/s using fewer than 2 CPU instructions per character. To ensure reproducibility, we make our software freely available as an open source library.

READ FULL TEXT

page 1

page 8

research
09/21/2021

Transcoding Billions of Unicode Characters per Second with SIMD Instructions

In software, text is often represented using Unicode formats (UTF-8 and ...
research
11/07/2019

Efficient Computation of Positional Population Counts Using SIMD Instructions

In several fields such as statistics, machine learning, and bioinformati...
research
05/31/2023

Minotaur: A SIMD-Oriented Synthesizing Superoptimizer

Minotaur is a superoptimizer for LLVM's intermediate representation that...
research
01/21/2021

UNIT: Unifying Tensorized Instruction Compilation

Because of the increasing demand for computation in DNN, researchers dev...
research
10/06/2020

Validating UTF-8 In Less Than One Instruction Per Byte

The majority of text is stored in UTF-8, which must be validated on inge...
research
10/02/2019

Base64 encoding and decoding at almost the speed of a memory copy

Many common document formats on the Internet are text-only such as email...
research
03/30/2017

Faster Base64 Encoding and Decoding Using AVX2 Instructions

Web developers use base64 formats to include images, fonts, sounds and o...

Please sign up or login with your details

Forgot password? Click here to reset