Handwritten Stenography Recognition and the LION Dataset

08/15/2023
by   Raphaela Heil, et al.

Purpose: In this paper, we establish a baseline for handwritten stenography recognition, using the novel LION dataset, and investigate the impact of integrating selected aspects of stenographic theory into the recognition process. We make the LION dataset publicly available with the aim of encouraging future research in handwritten stenography recognition.

Methods: A state-of-the-art text recognition model is trained to establish a baseline. Stenographic domain knowledge is integrated by applying four different encoding methods that transform the target sequence into representations that approximate selected aspects of the writing system. Results are further improved by integrating a pre-training scheme based on synthetic data.

Results: The baseline model achieves an average test character error rate (CER) of 29.81% [...]. Error rates are reduced significantly by combining stenography-specific target sequence encodings with pre-training and fine-tuning, yielding CERs in the range of 24.5% [...].

Conclusion: The obtained results demonstrate the challenging nature of stenography recognition. Integrating stenography-specific knowledge, in conjunction with pre-training and fine-tuning on synthetic data, yields considerable improvements. Together with our precursor study on the subject, this is the first work to apply modern handwritten text recognition to stenography. The dataset and our code are publicly available via Zenodo.
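For readers unfamiliar with the metric, the character error rate is conventionally the Levenshtein (edit) distance between the predicted and the reference transcription, divided by the length of the reference. The following is a minimal sketch under that standard definition; it is not the authors' evaluation code, and the function names `edit_distance` and `cer` are illustrative choices, not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning hyp into ref."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate as a fraction of the reference length.
    Assumption: the reference transcription is non-empty."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character in a four-character reference.
    print(f"{cer('abcd', 'abxd'):.2%}")
```

Multiplying the returned fraction by 100 gives the percentage form in which CERs are usually reported. Averaging over a test set can be done per line or over all characters pooled; which variant the paper uses is not stated in the abstract.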


