Unicode at Gigabytes per Second

11/14/2021
by Daniel Lemire, et al.

We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating systems such as Windows. Software systems frequently have to convert text from one Unicode format to the other. While recent disks have bandwidths of 5 GiB/s or more, conventional approaches transcode non-ASCII text at a fraction of a gigabyte per second. We show that we can validate and transcode Unicode text at gigabytes per second on current systems (x64 and ARM) without sacrificing safety. Our open-source library can be ten times faster than the popular ICU library on non-ASCII strings and even faster on ASCII strings.
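To make the task concrete, here is a minimal sketch of validating UTF-8 to UTF-16 transcoding, together with a rough throughput measurement. It relies on Python's built-in codecs, not on the authors' library, and the function names `utf8_to_utf16le` and `throughput_gb_per_s` are hypothetical names chosen for illustration. It is meant only as a baseline to show what "validate and transcode" means, under the assumption that little-endian UTF-16 without a byte-order mark is the desired output.

```python
import time


def utf8_to_utf16le(data: bytes) -> bytes:
    """Validate UTF-8 input and transcode it to UTF-16 (little-endian).

    Raises UnicodeDecodeError on invalid UTF-8, so malformed input is
    never silently converted.
    """
    text = data.decode("utf-8")        # strict decoding validates the input
    return text.encode("utf-16-le")    # explicit endianness: no byte-order mark


def throughput_gb_per_s(data: bytes, repeat: int = 5) -> float:
    """Best-of-`repeat` transcoding speed, in gigabytes of input per second."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        utf8_to_utf16le(data)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from the timer on very small inputs.
    return len(data) / max(best, 1e-12) / 1e9


if __name__ == "__main__":
    ascii_text = b"hello world " * 100_000
    non_ascii_text = "héllo wörld 😀 ".encode("utf-8") * 100_000
    print(f"ASCII:     {throughput_gb_per_s(ascii_text):.2f} GB/s")
    print(f"non-ASCII: {throughput_gb_per_s(non_ascii_text):.2f} GB/s")
```

The measured figures depend on the machine and the interpreter, so no expected numbers are given here. Running the same measurement on ASCII and non-ASCII inputs gives a feel for the gap between the two cases that the abstract refers to.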
