Applications of Machine Learning in Document Digitisation

02/05/2021
by Christian M. Dahl, et al.

Data acquisition forms the primary step in all empirical research. The availability of data directly impacts the quality and extent of conclusions and insights. In particular, larger and more detailed datasets provide convincing answers even to complex research questions. The main problem is that 'large and detailed' usually implies 'costly and difficult', especially when the data medium is paper and books. Human operators and manual transcription have been the traditional approach for collecting historical data. We instead advocate the use of modern machine learning techniques to automate the digitisation process. We give an overview of the potential for applying machine digitisation for data collection through two illustrative applications. The first demonstrates that unsupervised layout classification applied to raw scans of nurse journals can be used to construct a treatment indicator. Moreover, it allows an assessment of assignment compliance. The second application uses attention-based neural networks for handwritten text recognition in order to transcribe age and birth and death dates from a large collection of Danish death certificates. We describe each step in the digitisation pipeline and provide implementation insights.

research
11/08/2018

A Survey on Data Collection for Machine Learning: a Big Data - AI Integration Perspective

Data collection is a major bottleneck in machine learning and an active ...
research
12/01/2021

Learning to automate cryo-electron microscopy data collection with Ptolemy

Over the past decade, cryogenic electron microscopy (cryo-EM) has emerge...
research
12/09/2022

PATO: Policy Assisted TeleOperation for Scalable Robot Data Collection

Large-scale data is an essential component of machine learning as demons...
research
07/16/2021

Learning to Limit Data Collection via Scaling Laws: Data Minimization Compliance in Practice

Data minimization is a legal obligation defined in the European Union's ...
research
07/02/2021

Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription

Domain-specific data is the crux of the successful transfer of machine l...
research
01/06/2020

Identifying Historical Travelogues in Large Text Corpora Using Machine Learning

Travelogues represent an important and intensively studied source for sc...
