Quantifying Character Similarity with Vision Transformers

05/24/2023
by   Xinmei Yang, et al.

Record linkage is a bedrock of quantitative social science, as analyses often require linking data from multiple, noisy sources. Off-the-shelf string matching methods are widely used because they are straightforward and cheap to implement and scale. Not all character substitutions are equally probable, however, and for some settings there are widely used handcrafted lists denoting which substitutions are more likely, which improve the accuracy of string matching. Such lists do not exist for many settings, skewing research with linked datasets towards a few high-resource contexts that are not representative of the diversity of human societies. This study develops an extensible way to measure character substitution costs for OCR'ed documents by employing large-scale self-supervised training of vision transformers (ViT) with augmented digital fonts. For each language written with the CJK script, we contrastively learn a metric space where different augmentations of the same character are represented nearby. In this space, homoglyphic characters, those with similar appearance such as “O” and “0”, have similar vector representations. Using the cosine distance between characters' representations as the substitution cost in an edit distance matching algorithm significantly improves record linkage compared to other widely used string matching methods, as OCR errors tend to be homoglyphic in nature. Homoglyphs can plausibly capture character visual similarity across any script, including in low-resource settings. We illustrate this by creating homoglyph sets for 3,000-year-old ancient Chinese characters, which are highly pictorial. Fascinatingly, a ViT is able to capture relationships, noted in the archaeological literature, in how different abstract concepts were conceptualized by ancient societies.
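
The matching step described in the abstract, an edit distance whose substitution cost is the cosine distance between character representations, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the two-dimensional `TOY_EMBEDDINGS` vectors, the unit insertion/deletion cost, and the clamping of substitution costs to [0, 1] are all hypothetical choices made for the example. In practice the vectors would come from the trained ViT.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def homoglyphic_edit_distance(s, t, embeddings, indel_cost=1.0):
    """Levenshtein-style distance with embedding-based substitution costs.

    Assumption: characters without an embedding fall back to a
    substitution cost of 1.0, as in the standard edit distance.
    """
    def sub_cost(a, b):
        if a == b:
            return 0.0
        if a in embeddings and b in embeddings:
            d = cosine_distance(embeddings[a], embeddings[b])
            # Assumption: clamp to [0, 1] so a substitution never costs
            # more than the unit cost used by plain edit distance.
            return min(1.0, max(0.0, d))
        return 1.0

    # Standard dynamic program, keeping only the previous row.
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        curr = [i * indel_cost]
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j] + indel_cost,         # delete a from s
                curr[j - 1] + indel_cost,     # insert b into s
                prev[j - 1] + sub_cost(a, b)  # substitute a with b
            ))
        prev = curr
    return prev[-1]

# Hypothetical toy vectors: "O" and "0" point in nearly the same
# direction, while "X" is orthogonal to "O".
TOY_EMBEDDINGS = {
    "O": [1.0, 0.0],
    "0": [0.98, 0.2],
    "X": [0.0, 1.0],
}

homoglyph_error = homoglyphic_edit_distance("B0B", "BOB", TOY_EMBEDDINGS)
unrelated_error = homoglyphic_edit_distance("BXB", "BOB", TOY_EMBEDDINGS)
```

With these toy vectors, the OCR-style confusion of “0” for “O” is penalized far less than the visually unrelated substitution of “X” for “O”, which is the behavior the abstract attributes to homoglyphic substitution costs.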


