Transcoding Billions of Unicode Characters per Second with SIMD Instructions

09/21/2021
by Daniel Lemire, et al.

In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art disks and networks. These transcoding functions make little use of the single-instruction-multiple-data (SIMD) instructions available on commodity processors. By designing transcoding algorithms for SIMD instructions, we multiply the speed of transcoding on current systems (x64 and ARM). To ensure reproducibility, we make our software freely available as an open source library.
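To make the term concrete, the following is a minimal scalar sketch of UTF-8 to UTF-16 transcoding. It is not the SIMD algorithm the abstract refers to, only a byte-at-a-time baseline of the kind such algorithms aim to outperform. The function name `utf8_to_utf16_units` is hypothetical and chosen for illustration.

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Transcode UTF-8 bytes into a list of UTF-16 code units (integers).

    Illustrative scalar baseline (hypothetical helper, not the paper's method).
    Raises ValueError on invalid UTF-8 input.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # The leading byte determines the sequence length and the payload bits.
        if b0 < 0x80:
            cp, length = b0, 1
        elif 0xC2 <= b0 <= 0xDF:      # 0xC0 and 0xC1 would be overlong
            cp, length = b0 & 0x1F, 2
        elif 0xE0 <= b0 <= 0xEF:
            cp, length = b0 & 0x0F, 3
        elif 0xF0 <= b0 <= 0xF4:
            cp, length = b0 & 0x07, 4
        else:
            raise ValueError(f"invalid leading byte at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        # Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for j in range(1, length):
            b = data[i + j]
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + j}")
            cp = (cp << 6) | (b & 0x3F)
        # Reject overlong encodings, surrogate code points, and out-of-range values.
        if (length == 3 and cp < 0x800) or (length == 4 and cp < 0x10000):
            raise ValueError(f"overlong encoding at offset {i}")
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f"invalid code point at offset {i}")
        if cp < 0x10000:
            # Basic Multilingual Plane: a single 16-bit code unit.
            out.append(cp)
        else:
            # Supplementary planes: a high and a low surrogate.
            cp -= 0x10000
            out.append(0xD800 | (cp >> 10))
            out.append(0xDC00 | (cp & 0x3FF))
        i += length
    return out


if __name__ == "__main__":
    units = utf8_to_utf16_units("héllo".encode("utf-8"))
    print([hex(u) for u in units])
```

The loop handles one character per iteration and branches on every leading byte, which is why scalar routines of this shape tend to leave most of a processor's SIMD capacity unused.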


02/22/2019

Parsing Gigabytes of JSON per Second

JavaScript Object Notation or JSON is a ubiquitous data exchange format ...
10/06/2020

Validating UTF-8 In Less Than One Instruction Per Byte

The majority of text is stored in UTF-8, which must be validated on inge...
11/14/2021

Unicode at Gigabytes per Second

We often represent text using Unicode formats (UTF-8 and UTF-16). The UT...
04/02/2013

Software for creating pictures in the LaTeX environment

To create a text with graphic instructions for output pictures into LATE...
09/22/2017

Roaring Bitmaps: Implementation of an Optimized Software Library

Compressed bitmap indexes are used in systems such as Git or Oracle to a...
03/30/2017

Faster Base64 Encoding and Decoding Using AVX2 Instructions

Web developers use base64 formats to include images, fonts, sounds and o...
05/29/2021

Examiner: Automatically Locating Inconsistent Instructions Between Real Devices and CPU Emulators for ARM

Emulator is widely used to build dynamic analysis frameworks due to its ...