Validating UTF-8 In Less Than One Instruction Per Byte

10/06/2020
by John Keiser, et al.

The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions. To ensure reproducibility, our work is freely available as open source software.
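
To make concrete what "validated" means here, the following is a minimal scalar sketch of a UTF-8 well-formedness check, one byte sequence at a time. It is not the lookup algorithm from the paper, which relies on SIMD instructions; it only illustrates the rules any validator has to enforce (no overlong encodings, no surrogates, no code points above U+10FFFF, no truncated sequences). The function name `is_valid_utf8` is a hypothetical one chosen for this illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Scalar check that `data` is well-formed UTF-8 (illustrative sketch only)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 <= 0x7F:  # ASCII: a single byte, always valid
            i += 1
            continue
        # Choose the sequence length and the allowed range of the SECOND byte.
        # The narrowed ranges are what reject overlong forms, surrogates,
        # and code points beyond U+10FFFF.
        if 0xC2 <= b0 <= 0xDF:
            length, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xE0:
            length, lo, hi = 3, 0xA0, 0xBF  # below A0 would be overlong
        elif b0 == 0xED:
            length, lo, hi = 3, 0x80, 0x9F  # above 9F would be a surrogate
        elif 0xE1 <= b0 <= 0xEF:
            length, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF0:
            length, lo, hi = 4, 0x90, 0xBF  # below 90 would be overlong
        elif 0xF1 <= b0 <= 0xF3:
            length, lo, hi = 4, 0x80, 0xBF
        elif b0 == 0xF4:
            length, lo, hi = 4, 0x80, 0x8F  # above 8F would exceed U+10FFFF
        else:
            # 80..BF (stray continuation), C0/C1 (overlong), F5..FF (too large)
            return False
        if i + length > n:  # sequence truncated at end of input
            return False
        if not (lo <= data[i + 1] <= hi):
            return False
        # Any remaining bytes must be plain continuation bytes 80..BF.
        for k in range(2, length):
            if not (0x80 <= data[i + k] <= 0xBF):
                return False
        i += length
    return True
```

A loop like this branches on every non-ASCII byte, which is the kind of per-byte work a SIMD approach aims to avoid by examining many bytes per instruction.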

research
09/21/2021

Transcoding Billions of Unicode Characters per Second with SIMD Instructions

In software, text is often represented using Unicode formats (UTF-8 and ...
research
12/09/2022

Transcoding Unicode Characters with AVX-512 Instructions

Intel includes on its recent processors a powerful set of instructions c...
research
09/22/2017

Roaring Bitmaps: Implementation of an Optimized Software Library

Compressed bitmap indexes are used in systems such as Git or Oracle to a...
research
05/27/2023

Ethical Considerations Towards Protestware

A key drawback to using an Open Source third-party library is the risk of...
research
05/11/2018

The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD

UKB is an open source collection of programs for performing, among other...
research
05/02/2021

Assessing Exception Handling Testing Practices in Open-Source Libraries

Modern programming languages (e.g., Java and C#) provide features to sep...
research
02/22/2019

Parsing Gigabytes of JSON per Second

JavaScript Object Notation or JSON is a ubiquitous data exchange format ...
