Digitizing Historical Balance Sheet Data: A Practitioner's Guide

03/31/2022
by Sergio Correia, et al.

This paper discusses how to successfully digitize large-scale historical micro-data by augmenting optical character recognition (OCR) engines with pre- and post-processing methods. Although OCR software has improved dramatically in recent years thanks to advances in machine learning, off-the-shelf OCR applications still have high error rates, which limit their usefulness for accurately extracting structured information. Complementing OCR with additional methods, however, can dramatically increase its success rate, making it a powerful and cost-efficient tool for economic historians. This paper showcases these methods and explains why they are useful. We apply them to two large balance sheet datasets and introduce "quipucamayoc", a Python package containing these methods in a unified framework.
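
As a rough illustration of what a post-processing step for OCR'd balance sheet figures might look like, the sketch below normalizes commonly confused characters in numeric fields and then checks that the line items add up to the reported total. It is not taken from the paper or from quipucamayoc: the confusion table, the function names `parse_amount` and `check_total`, and the assumption that amounts are whole numbers with punctuation used only as separators are all hypothetical.

```python
import re

# Hypothetical table of characters that OCR engines often confuse with digits.
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})


def parse_amount(raw):
    """Convert an OCR'd numeric string to an int, or return None if it cannot be parsed.

    Assumes whole-number amounts, so commas, periods, spaces and "$" are
    treated as separators or currency marks and dropped.
    """
    cleaned = raw.translate(CONFUSIONS)
    cleaned = re.sub(r"[,.\s$]", "", cleaned)
    return int(cleaned) if re.fullmatch(r"[0-9]+", cleaned) else None


def check_total(items, reported_total):
    """Return True when the parsed line items sum to the parsed reported total."""
    values = [parse_amount(v) for v in items]
    total = parse_amount(reported_total)
    if total is None or any(v is None for v in values):
        return False  # an unparseable cell is flagged for review, not guessed
    return sum(values) == total
```

A row that fails `check_total` would be a candidate for manual review or a second OCR pass; a row that passes is more likely, though not guaranteed, to have been read correctly.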


