Duncode Characters Shorter

07/11/2023
by Changshang Xue, et al.

This paper investigates encoders that transform text by converting characters into bytes. Local encoders such as ASCII and GB-2312 encode a specific set of characters in few bytes, whereas universal encoders such as UTF-8 and UTF-16 cover the complete Unicode set at the cost of more space and are gaining widespread acceptance. Other encoders, including SCSU, BOCU-1, and binary encoders, lack self-synchronizing capabilities. Duncode is introduced as an encoding method that aims to encode the entire Unicode character set with high space efficiency, akin to local encoders. It can potentially compress multiple characters of a string into a single Duncode unit using fewer bytes. Although it carries less self-synchronizing identification information, Duncode surpasses UTF-8 in space efficiency. The application is available at <https://github.com/laohur/duncode>. Additionally, we have developed a benchmark for evaluating character encoders across different languages. It covers 179 languages and can be accessed at <https://github.com/laohur/wiki2txt>.
