
A Probabilistic Framework for Lexicon-based Keyword Spotting in Handwritten Text Images

by   E. Vidal, et al.

Query-by-string keyword spotting (KWS) is considered here as a key technology for indexing large collections of handwritten text images, allowing fast textual access to their contents. From this perspective, a probabilistic framework for lexicon-based KWS in text images is presented. The presentation aims to provide a tutorial view that clarifies the relations between classical statements of KWS and the relative challenges these statements entail. More specifically, the development of the proposed framework makes it evident that word recognition or classification implicitly or explicitly underlies any formulation of KWS. Moreover, it suggests that the same statistical models and training methods successfully used for handwritten text recognition can also be used to advantage for KWS, even though KWS does not generally require or rely on any previously produced image transcripts. These ideas are developed into a specific, probabilistically sound approach for segmentation-free, lexicon-based, query-by-string KWS. Experiments carried out with this approach support the consistency and general interest of the proposed framework. Several datasets traditionally used for KWS benchmarking are considered, with results significantly better than those previously published for these datasets. In addition, results are reported on two new, larger handwritten text image datasets, showing the great potential of the proposed methods for indexing and textual search in large collections of handwritten documents.
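To make the link between word classification and lexicon-based KWS concrete, the sketch below builds a small probabilistic index from word posteriors. It is only an illustration under stated assumptions, not the method of the paper: the per-region posteriors are assumed to come from some word classifier, taking the maximum over regions as the image-level relevance score is an assumed aggregation rule, and the names `relevance`, `build_index`, and `threshold` are hypothetical.

```python
from typing import Dict, List

# Assumption: each image region comes with a posterior over lexicon words,
# produced by some word classifier (not specified in the abstract).
Posterior = Dict[str, float]


def relevance(query: str, region_posteriors: List[Posterior]) -> float:
    """Score how likely `query` is written somewhere in an image.

    Assumption: the image-level score is the maximum of the word
    posterior over all regions; words outside a posterior score 0.
    """
    return max((p.get(query, 0.0) for p in region_posteriors), default=0.0)


def build_index(
    images: Dict[str, List[Posterior]], threshold: float
) -> Dict[str, Dict[str, float]]:
    """Map each lexicon word to the images where its score reaches `threshold`."""
    index: Dict[str, Dict[str, float]] = {}
    for image_id, regions in images.items():
        # Every word that appears in any region posterior is a candidate.
        candidates = {word for p in regions for word in p}
        for word in candidates:
            score = relevance(word, regions)
            if score >= threshold:
                index.setdefault(word, {})[image_id] = score
    return index


if __name__ == "__main__":
    # Toy posteriors, invented for illustration only.
    images = {
        "line1": [{"the": 0.9, "they": 0.1}, {"cat": 0.6, "cut": 0.4}],
        "line2": [{"cut": 0.7, "cat": 0.3}],
    }
    index = build_index(images, threshold=0.5)
    print(index.get("cat", {}))
```

No transcript is produced at any point: a query string is answered by looking up precomputed scores, which is why an approach of this kind can index untranscribed images.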



