Unicode at Gigabytes per Second

by Daniel Lemire, et al.

We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating systems such as Windows. Software systems frequently have to convert text from one Unicode format to the other. While recent disks have bandwidths of 5 GiB/s or more, conventional approaches transcode non-ASCII text at a fraction of a gigabyte per second. We show that we can validate and transcode Unicode text at gigabytes per second on current systems (x64 and ARM) without sacrificing safety. Our open-source library can be ten times faster than the popular ICU library on non-ASCII strings and even faster on ASCII strings.
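
As a point of reference for what "validate and transcode" means here, the following is a minimal sketch using only Python's built-in codecs. It is not the method or API of the library described above, and the function names `utf8_to_utf16le` and `is_valid_utf8` are hypothetical, chosen for illustration. It shows the conventional behavior that a faster transcoder must preserve: invalid UTF-8 is rejected, and characters outside the Basic Multilingual Plane become surrogate pairs in UTF-16.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Validate UTF-8 input and transcode it to UTF-16 (little-endian, no BOM).

    Hypothetical helper for illustration; relies on Python's built-in codecs.
    """
    # Strict decoding raises UnicodeDecodeError on invalid UTF-8
    # (overlong forms, encoded surrogates, truncated sequences).
    text = data.decode("utf-8")
    return text.encode("utf-16-le")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Mixed ASCII, a two-byte character, and a four-byte character.
sample = "héllo 🌍".encode("utf-8")
utf16 = utf8_to_utf16le(sample)

# The four-byte UTF-8 character becomes a surrogate pair (two 16-bit units).
print(len(sample), len(utf16))
```

The sample string takes 11 bytes in UTF-8 and 16 bytes (8 code units) in UTF-16, which illustrates why output size cannot be derived from input size without inspecting the content.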




Transcoding Billions of Unicode Characters per Second with SIMD Instructions

In software, text is often represented using Unicode formats (UTF-8 and ...

ImarisWriter: Open Source Software for Storage of Large Images in Blockwise Multi-Resolution Format

We publish as open source a high performance file writer library to stor...

Number Parsing at a Gigabyte per Second

With disks and networks providing gigabytes per second, parsing decimal ...

SFILES 2.0: An extended text-based flowsheet representation

SFILES is a text-based notation for chemical process flowsheets. It was ...

Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format

There exists a natural tension between encouraging a diverse ecosystem o...

FPScreen: A Rapid Similarity Search Tool for Massive Molecular Library Based on Molecular Fingerprint Comparison

We designed a fast similarity search engine for large molecular librarie...

Anonymization of Whole Slide Images in Histopathology for Research and Education

Objective: The exchange of health-related data is subject to regional la...
