Occode: an end-to-end machine learning pipeline for transcription of historical population censuses

06/07/2021
by   Bjørn-Richard Pedersen, et al.

Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used to transcribe handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end pipeline that scales to the dataset size and a model that achieves high accuracy with few manual transcriptions. In addition, the correctness of the model's results must be verified. This paper describes the lessons we learned while developing, tuning, and using the Occode end-to-end machine learning pipeline to transcribe 7.3 million rows of handwritten occupation codes in the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual verification. We verify that the occupation code distribution in our results matches the distribution in our training data, which should be representative of the census as a whole. We believe our approach and lessons learned are useful for other transcription projects that plan to use machine learning in production. The source code is available at: https://github.com/uit-hdl/rhd-codes
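
The two verification ideas in the abstract, routing uncertain predictions to manual review and comparing the code distribution of the results against the training data, might be sketched as follows. This is a minimal illustration under stated assumptions, not the Occode implementation: the confidence threshold, the use of total variation distance as the comparison measure, the function names, and the toy codes are all hypothetical and do not come from the paper.

```python
from collections import Counter


def code_distribution(codes):
    """Return the relative frequency of each occupation code."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def split_by_confidence(predictions, threshold):
    """Split (code, confidence) pairs into accepted codes and rows for manual review."""
    accepted = [code for code, conf in predictions if conf >= threshold]
    manual = [(code, conf) for code, conf in predictions if conf < threshold]
    return accepted, manual


# Toy data: the codes and confidences are made up for illustration only.
training_codes = ["111", "111", "222", "333"]
predictions = [("111", 0.99), ("111", 0.98), ("222", 0.97), ("333", 0.40)]

# Assumed threshold; a real pipeline would tune this on held-out data.
accepted, manual = split_by_confidence(predictions, threshold=0.9)

distance = total_variation_distance(
    code_distribution(training_codes),
    code_distribution(accepted),
)
print(f"sent to manual verification: {len(manual)} of {len(predictions)}")
print(f"total variation distance to training distribution: {distance:.3f}")
```

A small distance suggests the automatically transcribed codes follow the training distribution. A large one would be a signal to inspect which codes are over- or under-represented, for example because low-confidence rows cluster in particular codes.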


