Lifelong learning for text retrieval and recognition in historical handwritten document collections

12/11/2019
by Lambert Schomaker, et al.

This chapter gives an overview of the problems that must be addressed when constructing a lifelong-learning retrieval, recognition and indexing engine for large historical document collections in multiple scripts and languages: the Monk system. The application is highly variable over time, because continuous labeling by end users changes what constitutes the 'ground truth'. Although current advances in deep learning offer huge potential in this application domain, the scale of the problem, i.e., more than 520 hugely diverse books, documents and manuscripts, precludes the meticulous and painstaking human effort currently required to design and develop successful deep-learning systems. The ball-park principle is introduced, which describes the evolution from the sparsely labeled stage, which can only be addressed by traditional methods or by nearest-neighbor methods on embedded vectors of pre-trained neural networks, to the other end of the spectrum, where massive labeling allows reliable training of deep-learning methods.

Contents: Introduction, Expectation management, Deep learning, The ball-park principle, Technical realization, Work flow, Quality and quantity of material, Industrialization and scalability, Human effort, Algorithms, Object of recognition, Processing pipeline, Performance, Compositionality, Conclusion.
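The sparsely labeled end of the ball-park principle can be illustrated with a minimal sketch: nearest-neighbor labeling on embedding vectors, plus a simple switch to model training once enough labels exist. This is not the Monk system's implementation. The function names, the cosine-similarity measure, and the `threshold` value are assumptions made only for illustration, and the embedding vectors are assumed to come from some pre-trained network.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor_label(query, labeled, k=1):
    """Return the majority label among the k labeled vectors most similar to `query`.

    `labeled` is a list of (embedding_vector, label) pairs, e.g. word images
    embedded by a pre-trained network and labeled by end users (assumption).
    """
    ranked = sorted(
        labeled,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def choose_method(n_labels, threshold=1000):
    """Pick a strategy from the number of available labels.

    The threshold is a hypothetical placeholder, not a value from the chapter.
    """
    return "nearest_neighbor" if n_labels < threshold else "train_deep_model"


# Hypothetical two-dimensional embeddings with user-provided labels.
labeled_examples = [([1.0, 0.0], "amsterdam"), ([0.0, 1.0], "groningen")]
predicted = nearest_neighbor_label([0.9, 0.1], labeled_examples)
```

With only a handful of labels, `choose_method` selects the nearest-neighbor path. The same call returns `"train_deep_model"` once the label count passes the assumed threshold, mirroring the transition the abstract describes.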


