Transcoding Billions of Unicode Characters per Second with SIMD Instructions

by Daniel Lemire, et al.

In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art disks and networks. These transcoding functions make little use of the single-instruction-multiple-data (SIMD) instructions available on commodity processors. By designing transcoding algorithms for SIMD instructions, we multiply the speed of transcoding on current systems (x64 and ARM). To ensure reproducibility, we make our software freely available as an open source library.
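To make the term "transcoding" concrete, here is a minimal sketch that converts between UTF-8 and UTF-16 using Python's built-in codecs. It is a plain scalar illustration, not the SIMD algorithm described in the abstract and not the authors' library. The function names and the choice of little-endian UTF-16 without a byte-order mark are assumptions made for this example.

```python
def transcode_utf8_to_utf16le(data: bytes) -> bytes:
    """Convert UTF-8 bytes to UTF-16 little-endian bytes (no byte-order mark).

    Hypothetical helper for illustration only.
    """
    # decode() validates the input and raises UnicodeDecodeError on invalid UTF-8.
    text = data.decode("utf-8")
    return text.encode("utf-16-le")


def transcode_utf16le_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16 little-endian bytes back to UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    # decode() raises UnicodeDecodeError on unpaired surrogates or odd lengths.
    text = data.decode("utf-16-le")
    return text.encode("utf-8")


# Four characters whose UTF-8 encodings take 1, 2, 3 and 4 bytes respectively.
sample_text = "a\u00e9\u20ac\U0001F600"
utf8_bytes = sample_text.encode("utf-8")
utf16_bytes = transcode_utf8_to_utf16le(utf8_bytes)
round_trip = transcode_utf16le_to_utf8(utf16_bytes)

# Per-character sizes in each format, as (UTF-8 bytes, UTF-16 bytes).
sizes = [(len(c.encode("utf-8")), len(c.encode("utf-16-le"))) for c in sample_text]
print(sizes)
print(round_trip == utf8_bytes)
```

The same character can occupy a different number of bytes in each format: the first three characters each fit in one 16-bit unit in UTF-16, while the last one needs a surrogate pair. A transcoder therefore cannot copy bytes one-for-one and has to inspect every character.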




Parsing Gigabytes of JSON per Second

JavaScript Object Notation or JSON is a ubiquitous data exchange format ...

Validating UTF-8 In Less Than One Instruction Per Byte

The majority of text is stored in UTF-8, which must be validated on inge...

Unicode at Gigabytes per Second

We often represent text using Unicode formats (UTF-8 and UTF-16). The UT...

Software for creating pictures in the LaTeX environment

To create a text with graphic instructions for output pictures into LATE...

Roaring Bitmaps: Implementation of an Optimized Software Library

Compressed bitmap indexes are used in systems such as Git or Oracle to a...

Faster Base64 Encoding and Decoding Using AVX2 Instructions

Web developers use base64 formats to include images, fonts, sounds and o...

Examiner: Automatically Locating Inconsistent Instructions Between Real Devices and CPU Emulators for ARM

Emulator is widely used to build dynamic analysis frameworks due to its ...