Transcoding Billions of Unicode Characters per Second with SIMD Instructions

by Daniel Lemire, et al.

In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art disks and networks. These transcoding functions make little use of the single-instruction-multiple-data (SIMD) instructions available on commodity processors. By designing transcoding algorithms for SIMD instructions, we multiply the speed of transcoding on current systems (x64 and ARM). To ensure reproducibility, we make our software freely available as an open source library.
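
As a point of reference for what transcoding means in practice, the following minimal sketch round-trips a string between UTF-8 and UTF-16 using Python's built-in codecs. It is not the SIMD approach described in the abstract, only a conventional baseline; the helper names `utf8_to_utf16le` and `utf16le_to_utf8` are hypothetical and chosen for illustration.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to UTF-16 (little-endian, no byte-order mark)."""
    # decode() validates the input and raises UnicodeDecodeError on invalid UTF-8.
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Transcode UTF-16 (little-endian) bytes back to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


# A sample mixing ASCII, a 2-byte, two 3-byte, and one 4-byte UTF-8 character.
sample = "héllo, 世界 🙂"
utf8 = sample.encode("utf-8")
utf16 = utf8_to_utf16le(utf8)

# The round trip must reproduce the original bytes exactly.
assert utf16le_to_utf8(utf16) == utf8

# The same text occupies a different number of bytes in each format.
print(len(utf8), len(utf16))
```

The byte counts differ because UTF-8 uses one to four bytes per character, while UTF-16 uses one or two 16-bit units; a transcoder has to handle each of these cases and reject invalid input.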



Transcoding Unicode Characters with AVX-512 Instructions

Intel includes on its recent processors a powerful set of instructions c...

Parsing Gigabytes of JSON per Second

JavaScript Object Notation or JSON is a ubiquitous data exchange format ...

Validating UTF-8 In Less Than One Instruction Per Byte

The majority of text is stored in UTF-8, which must be validated on inge...

Unicode at Gigabytes per Second

We often represent text using Unicode formats (UTF-8 and UTF-16). The UT...

Software for creating pictures in the LaTeX environment

To create a text with graphic instructions for output pictures into LATE...

Roaring Bitmaps: Implementation of an Optimized Software Library

Compressed bitmap indexes are used in systems such as Git or Oracle to a...

Lossless SIMD Compression of LiDAR Range and Attribute Scan Sequences

As LiDAR sensors have become ubiquitous, the need for an efficient LiDAR...
