Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

04/07/2023
by Queenie Luo, et al.

Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and past socio-political structures. Many efforts have been devoted to digitizing these precious manuscripts with Optical Character Recognition (OCR) technology, but most manuscripts have been blemished over the centuries, so an OCR program cannot be expected to capture faded graphs and stains on the pages. This work presents a neural spelling correction model, built on Google OCR-ed Tibetan manuscripts, that auto-corrects noisy OCR output. The paper is divided into four sections: dataset, model architecture, training, and analysis. First, we feature-engineered our raw Tibetan e-text corpus into two sets of structured data frames: a set of paired toy data and a set of paired real data. Then, we implemented a Confidence Score mechanism in the Transformer architecture to perform the spelling correction task (a sketch of one plausible formulation follows below). Measured by Loss and Character Error Rate, our Transformer + Confidence Score architecture proves superior to the plain Transformer, LSTM-2-LSTM, and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens and visualized the Attention and Self-Attention heatmaps of our model.
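The abstract does not spell out how the Confidence Score mechanism is wired into the Transformer, so the following is only a speculative sketch of one natural reading: project each token's scalar OCR confidence into the model dimension and add it to the token embedding, so that low-confidence (likely misrecognized) tokens carry a distinct signal into the encoder. The class name, dimensions, and injection point are illustrative assumptions, not the paper's published implementation.

```python
# Speculative sketch: injecting a per-token OCR Confidence Score into a
# Transformer's input embeddings. Everything below (names, dimensions,
# the additive injection) is an assumption for illustration only.
import torch
import torch.nn as nn

class ConfidenceAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        # Project the scalar confidence in [0, 1] up to the model dimension.
        self.conf_proj = nn.Linear(1, d_model)

    def forward(self, token_ids: torch.Tensor, confidences: torch.Tensor):
        # token_ids:   (batch, seq_len) integer token indices
        # confidences: (batch, seq_len) OCR confidence per token in [0, 1]
        x = self.tok(token_ids)
        c = self.conf_proj(confidences.unsqueeze(-1))
        # Low-confidence tokens get a systematically shifted embedding,
        # which a standard nn.TransformerEncoder can learn to exploit.
        return x + c
```

The combined embedding would then feed an otherwise standard Transformer encoder-decoder; positional encodings, layer counts, and other hyperparameters are omitted here.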
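Since the comparison between architectures is reported in terms of Character Error Rate, a minimal reference implementation of that metric may help: CER is the Levenshtein (edit) distance between the predicted and reference strings, normalized by the reference length. The code below is a standard textbook formulation, not taken from the paper.

```python
# Minimal sketch of Character Error Rate (CER):
# edit distance between prediction and reference, divided by the
# reference length. Standard dynamic-programming Levenshtein distance.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character Error Rate in [0, inf); lower is better."""
    return levenshtein(prediction, reference) / max(len(reference), 1)

# Example: one substituted character in a five-character reference.
print(cer("helko", "hello"))  # 0.2
```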
