Transcoding Billions of Unicode Characters per Second with SIMD Instructions

09/21/2021
by Daniel Lemire, et al.

In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art disks and networks. These transcoding functions make little use of the single-instruction-multiple-data (SIMD) instructions available on commodity processors. By designing transcoding algorithms for SIMD instructions, we multiply the speed of transcoding on current systems (x64 and ARM). To ensure reproducibility, we make our software freely available as an open source library.
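To make the term transcoding concrete, here is a minimal scalar sketch in Python that converts UTF-8 bytes into UTF-16 code units one character at a time. It is not the SIMD algorithm described in the paper, only an illustration of the conversion that such algorithms accelerate. The function name `utf8_to_utf16_units` is hypothetical and chosen for this example.

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Transcode UTF-8 bytes to a list of UTF-16 code units (illustrative scalar version)."""
    units = []
    i = 0
    n = len(data)
    # Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    min_code_point = (0x0, 0x80, 0x800, 0x10000)
    while i < n:
        lead = data[i]
        # The leading byte determines how many bytes the character occupies.
        if lead < 0x80:
            code_point, length = lead, 1
        elif 0xC2 <= lead <= 0xDF:
            code_point, length = lead & 0x1F, 2
        elif 0xE0 <= lead <= 0xEF:
            code_point, length = lead & 0x0F, 3
        elif 0xF0 <= lead <= 0xF4:
            code_point, length = lead & 0x07, 4
        else:
            raise ValueError(f"invalid leading byte at offset {i}")
        if i + length > n:
            raise ValueError("truncated sequence at end of input")
        # Each continuation byte has the form 10xxxxxx and adds 6 payload bits.
        for byte in data[i + 1:i + length]:
            if byte & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte after offset {i}")
            code_point = (code_point << 6) | (byte & 0x3F)
        # Reject overlong encodings, surrogates and values beyond U+10FFFF.
        if (code_point < min_code_point[length - 1]
                or code_point > 0x10FFFF
                or 0xD800 <= code_point <= 0xDFFF):
            raise ValueError(f"invalid code point at offset {i}")
        if code_point < 0x10000:
            # Basic Multilingual Plane: a single 16-bit code unit.
            units.append(code_point)
        else:
            # Supplementary planes: a high and a low surrogate.
            offset = code_point - 0x10000
            units.append(0xD800 | (offset >> 10))
            units.append(0xDC00 | (offset & 0x3FF))
        i += length
    return units
```

In everyday Python code the built-in route `data.decode("utf-8").encode("utf-16-le")` does the same job; the explicit loop is shown only to expose the per-character branching that a byte-at-a-time transcoder performs.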


research
12/09/2022

Transcoding Unicode Characters with AVX-512 Instructions

Intel includes on its recent processors a powerful set of instructions c...
research
02/22/2019

Parsing Gigabytes of JSON per Second

JavaScript Object Notation or JSON is a ubiquitous data exchange format ...
research
10/06/2020

Validating UTF-8 In Less Than One Instruction Per Byte

The majority of text is stored in UTF-8, which must be validated on inge...
research
11/14/2021

Unicode at Gigabytes per Second

We often represent text using Unicode formats (UTF-8 and UTF-16). The UT...
research
04/02/2013

Software for creating pictures in the LaTeX environment

To create a text with graphic instructions for output pictures into LATE...
research
09/22/2017

Roaring Bitmaps: Implementation of an Optimized Software Library

Compressed bitmap indexes are used in systems such as Git or Oracle to a...
research
09/16/2022

Lossless SIMD Compression of LiDAR Range and Attribute Scan Sequences

As LiDAR sensors have become ubiquitous, the need for an efficient LiDAR...
