Chargrid-OCR: End-to-end trainable Optical Character Recognition through Semantic Segmentation and Object Detection

09/10/2019 ∙ by Christian Reisswig, et al.

We present an end-to-end trainable approach for optical character recognition (OCR) on printed documents. It is based on predicting a two-dimensional character grid ("chargrid") representation of a document image as a semantic segmentation task. To identify individual character instances from the chargrid, we regard characters as objects and use object detection techniques from computer vision. We demonstrate experimentally that our method outperforms previous state-of-the-art approaches in accuracy while being easily parallelizable on GPU (thereby being significantly faster), as well as easier to train.




1 Introduction

Optical Character Recognition (OCR) on documents is an age-old problem for which numerous open-source (e.g. Tesseract (2018)) as well as proprietary solutions exist. Especially in the sub-domain of printed documents, the problem is often regarded as solved. Nevertheless, current state-of-the-art document-level OCR solutions (as far as the published research goes) consist of a complicated pipeline of steps, where each step is either a hand-optimized heuristic or, if trainable, requires intermediate data and annotations for training.

Deep neural networks have proven very successful in object detection tasks such as detecting dogs and cats in images Liu et al. (2016). In this work, we build on these developments and treat OCR as a semantic segmentation and object detection task for detecting and recognizing character instances on a page.1 We introduce a new end-to-end trainable OCR pipeline for (but not limited to) printed documents that is based on deep fully convolutional neural networks. Our main contribution is to frame the OCR problem as an ultra-dense instance segmentation task He et al. (2017) for characters over the full input document image. We do not rely on any pre-processing stages such as binarization or deskewing; instead, our model learns directly from the raw document pixel data. At the core of our method, we predict a chargrid representation Katti et al. (2018) of the input document, i.e. a 1-hot encoded grid of characters. Hence, we call our method "Chargrid-OCR". Additionally, we introduce two novel post-processing steps, both of which are crucial for fast and accurate dense OCR. We show that our method can outperform line-based pipelines such as Tesseract 4 Smith (2007), which rely on a combination of deep convolutional and recurrent networks with CTC loss Tesseract (2018); Breuel (2017a), in terms of accuracy, while being significantly simpler to train.

1 A related task of recognizing text in natural images, referred to as Scene Text Recognition (STR), has been faster in adopting object detection techniques from computer vision Busta et al. (2017). However, compared to STR, document OCR deals with much denser text and very high accuracy requirements Breuel (2017b).

Figure 1: Schematic representation of the Chargrid-OCR network architecture with its input and outputs. The parameters C and s denote the number of channels and the stride per convolution filter; C is referred to as the number of base channels. Colors in the output encode the predicted character class for each pixel.

2 Chargrid-OCR: OCR as an ultra-dense Object Detection Task

The Chargrid-OCR method is a lexicon-free (purely character-based), end-to-end trainable approach to OCR. Given a document image, Chargrid-OCR predicts a character segmentation mask together with object bounding boxes for characters in one single step (see Fig. 1). Both semantic segmentation and object detection are common tasks in computer vision, e.g. Ronneberger et al. (2015); Liu et al. (2016); Lin et al. (2017). The character segmentation mask classifies each pixel into a character class, and the character bounding boxes localize each individual character instance.

Both our semantic segmentation and our box detection sub-networks are fully convolutional and consist of only a single stage, similar to e.g. Liu et al. (2016) and unlike e.g. Ren et al. (2015). Being single-stage is especially important as there may be thousands of characters (i.e. objects) on a single page, which yields an ultra-dense object detection task.

2.1 Chargrid-OCR Architecture

The chargrid representation of a document image maps each pixel that is occupied by a single character on the input document to a unique index that corresponds to that character – see Katti et al. (2018) for details.
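To make the chargrid representation concrete, here is a minimal sketch of how one might rasterize ground-truth character boxes into such a grid. The vocabulary `VOCAB` and the box format are illustrative assumptions, not the paper's actual encoding:

```python
import numpy as np

# Hypothetical character vocabulary; index 0 is reserved for background.
VOCAB = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789")}

def build_chargrid(height, width, chars):
    """Rasterize characters into a chargrid: every pixel covered by a
    character's box receives that character's vocabulary index, all
    remaining pixels stay 0 (background).

    `chars` is a list of (char, x0, y0, x1, y1) tuples in pixel coordinates.
    """
    grid = np.zeros((height, width), dtype=np.int32)
    for ch, x0, y0, x1, y1 in chars:
        grid[y0:y1, x0:x1] = VOCAB.get(ch, 0)
    return grid

grid = build_chargrid(4, 10, [("h", 0, 0, 2, 4), ("i", 3, 0, 5, 4)])
```

The resulting integer grid (or its 1-hot encoding) is exactly the kind of target a per-pixel classifier can be trained against.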

Given an input document, our model predicts the chargrid representation of the complete document. This is accomplished by using the chargrid as target for a semantic segmentation network. Since the chargrid does not allow one to delineate character instances, we further use class agnostic object detection to predict individual character boxes. We are thus solving an instance segmentation task.

Concretely, the input to our model is an image with text, e.g. a scanned document. The output is a segmentation mask (the chargrid) and a set of bounding boxes. The segmentation mask classifies each pixel in the input image into a character class (Fig. 1). The bounding boxes are predicted similarly to standard object detection methods Liu et al. (2016) with (i) a box mask, whose confidence denotes the presence of a box at that pixel, (ii) box centers, which denote the offset from the location of the predicting pixel to the center of its box, and (iii) the box widths and heights. Finally, for grouping characters into words, we also predict offsets to word centers. The architecture of the model is based on a fully convolutional encoder-decoder structure, with two decoders (one for semantic segmentation, one for bounding box detection) branching out of a common encoder. Fig. 1 illustrates the architecture with an example input and its corresponding outputs. The model is trained using categorical cross-entropy for the segmentation outputs and Huber loss for the regression outputs Liu et al. (2016).
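The decoding of the dense box outputs into candidate boxes can be sketched as follows. The function name and array layout are assumptions for illustration; the paper does not specify this exact interface:

```python
import numpy as np

def decode_boxes(box_mask, dx, dy, w, h, threshold=0.5):
    """Turn dense per-pixel box predictions into candidate boxes.

    box_mask: (H, W) confidence that a box covers this pixel
    dx, dy:   (H, W) predicted offsets from the pixel to its box center
    w, h:     (H, W) predicted box width and height
    Returns a list of (score, x0, y0, x1, y1) candidates.
    """
    ys, xs = np.where(box_mask > threshold)
    boxes = []
    for y, x in zip(ys, xs):
        # Each confident pixel votes for one box via its regressed offsets.
        cx, cy = x + dx[y, x], y + dy[y, x]
        bw, bh = w[y, x], h[y, x]
        boxes.append((float(box_mask[y, x]),
                      cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2))
    return boxes
```

Since every pixel inside a character votes for a box, many near-duplicate candidates arise per character, which motivates the post-processing below.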

2.2 Post-processing

The character candidate boxes are those whose confidence in the box mask surpasses a certain threshold (e.g. 50%). This typically yields multiple boxes around the same character. To delete redundantly predicted box proposals of the same character instance, non-maximum suppression (NMS) is applied Liu et al. (2016). However, in our scenario the number of proposals per page can be very large, and as NMS runs in quadratic time, this can become a computational bottleneck.
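For reference, the standard greedy NMS used here can be written in a few lines; this is the textbook algorithm, not the paper's specific implementation:

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(candidates, iou_threshold=0.5):
    """Greedy non-maximum suppression over (score, x0, y0, x1, y1) boxes:
    keep the highest-scoring box, drop all boxes overlapping it too much,
    and repeat. Worst case quadratic in the number of candidates."""
    kept = []
    for cand in sorted(candidates, key=lambda c: -c[0]):
        if all(iou(cand[1:], k[1:]) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept
```

The quadratic pairwise-IoU comparison is exactly what becomes prohibitive with tens of thousands of candidates per page.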

To speed up the process, we introduce a preliminary step before NMS, which we call "GraphCore". Recall that each candidate pixel predicts the offset to the center of its box. We construct a directed graph where each vertex is a candidate pixel, with a directed edge from pixel p to pixel q if p predicts q as the center of its bounding box. By taking the k-core with k=1 of the resulting graph (i.e. only the loops in the graph, which can be computed in linear time), only pixels towards the center of a bounding box (typically, one or two candidate boxes per character) are retained as candidates for NMS.
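A linear-time sketch of this idea: since every candidate pixel has exactly one outgoing edge (to its predicted center), iteratively peeling vertices with in-degree zero leaves exactly the vertices that lie on loops. The dictionary-based pixel encoding is an assumption for illustration:

```python
def graph_core(center_of):
    """GraphCore sketch: `center_of` maps each candidate pixel to the pixel
    it predicts as its box center (out-degree 1 for every vertex).
    Iteratively remove vertices with in-degree 0; the surviving vertices
    are those on cycles, i.e. pixels near a box center that point (almost)
    back at themselves. Runs in time linear in the number of candidates.
    """
    indeg = {v: 0 for v in center_of}
    for v, c in center_of.items():
        if c in indeg:
            indeg[c] += 1
    frontier = [v for v, d in indeg.items() if d == 0]
    alive = set(center_of)
    while frontier:
        v = frontier.pop()
        alive.discard(v)
        c = center_of[v]
        if c in alive:
            indeg[c] -= 1
            if indeg[c] == 0:
                frontier.append(c)
    return alive
```

Only the surviving pixels are handed to NMS, shrinking its quadratic input from all confident pixels to roughly one or two per character.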

Another necessary post-processing step is to construct word boxes from character boxes. To do so, we cluster characters together based on their predicted word centers, which even allows us to predict rotated words.
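The word-grouping step can be approximated by bucketing characters whose predicted word centers (nearly) coincide. The grid-cell clustering below is a simple stand-in for the paper's clustering procedure, with an assumed tolerance parameter `tol`:

```python
from collections import defaultdict

def group_words(char_boxes, word_centers, tol=2):
    """Group character boxes into words via their predicted word centers.

    `char_boxes[i]` is (x0, y0, x1, y1); `word_centers[i]` is the (cx, cy)
    the model predicts for the word containing character i. Characters whose
    predicted centers fall into the same `tol`-sized grid cell are clustered
    into one word.
    """
    clusters = defaultdict(list)
    for box, (cx, cy) in zip(char_boxes, word_centers):
        clusters[(round(cx / tol), round(cy / tol))].append(box)
    # Sort each word's characters left-to-right for reading order.
    return [sorted(group) for group in clusters.values()]
```

Because grouping is driven by the regressed centers rather than by horizontal adjacency, the same mechanism also works for rotated words.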

3 Experiments

3.1 Data and Metric

For each input document image, we require ground-truth data in the form of character bounding boxes. (WIKI dataset) We generated a dataset by synthetically rendering pages in A4 format using English Wikipedia content and applied common data augmentation. We generated 66,486 pages; the page layout and font specifications (type, height and color) were sampled randomly. This enabled us to synthesize input images together with perfect ground-truth labels. (EDGAR dataset) We converted a vast set of publicly available scanned financial reports EDGAR into PNG images and sampled 42,918 pages with non-repetitive layouts. We processed the images with Tesseract 4 and thereby obtained noisy (i.e. including OCR errors) ground truth.

We evaluated the model both on holdout datasets from our training data (EDGAR: 77 pages, 22,521 words; Wiki: 200 pages, 76,738 words) and on benchmark OCR datasets, i.e. Business Letters (179 pages; 48,639 words; Rice et al. (1995)) and UNLV (383 pages; 133,245 words; Shahab et al. (2010)).

We use the Word Recognition Rate (WRR) to measure OCR accuracy, computed as WRR = m / (m + u), with m and u being the number of matched and unmatched words, respectively. A predicted word matches a ground-truth word if and only if their contents agree (i.e. identical strings) and their boxes intersect (IoU > 1%). If multiple predictions match the same ground-truth word, only one prediction is counted as a match. The remaining unmatched predictions and unmatched ground-truth words are added together to obtain u.
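This matching rule translates directly into code. The following sketch assumes words are given as (text, x0, y0, x1, y1) tuples; the greedy first-match tie-breaking is one reasonable reading of "only one prediction is counted":

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def word_recognition_rate(predictions, ground_truth):
    """predictions / ground_truth: lists of (text, x0, y0, x1, y1).
    A prediction matches an as-yet-unmatched ground-truth word when the
    strings are identical and the boxes intersect (IoU > 1%).
    Returns m / (m + u)."""
    unmatched_gt = list(ground_truth)
    m = 0
    for text, *box in predictions:
        for gt in unmatched_gt:
            if gt[0] == text and iou(box, list(gt[1:])) > 0.01:
                unmatched_gt.remove(gt)  # each ground truth matches once
                m += 1
                break
    # Unmatched words on either side both count towards u.
    u = (len(predictions) - m) + len(unmatched_gt)
    return m / (m + u) if m + u else 1.0
```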

3.2 Results

We trained various versions of our model on the datasets described in Sec. 3.1 and compared them to Tesseract (v3 and v4) on several datasets in Table 1 in terms of Word Recognition Rate. As a benchmark, we use off-the-shelf Tesseract without any retraining or fine-tuning; Tesseract represents a publicly available state-of-the-art OCR approach. For each model, we report the training data that was used, the validation data (which overlaps with the domain of the training data), and the test data. The test data are domain-independent from the training and validation data and serve as an indicator of model generalizability.

Our baseline, denoted "ChargridOCR-32", consists of 32 convolutional base channels (C in Fig. 1) and was trained on the Wiki dataset. This model outperforms Tesseract v3, but not Tesseract v4. The same model trained on Wiki+EDGAR is competitive with, and typically superior to, Tesseract 4 on validation and test data. Finally, a model with twice as many convolutional channels, "ChargridOCR-64", and thus with higher capacity, trained on Wiki+EDGAR outperforms Tesseract on all datasets, albeit with significant computational overhead (rightmost column in Tab. 1). In Fig. 2, we show some crops exemplifying incorrect predictions from our model.

Model | Training Data | Wiki (val) | EDGAR (val) | Letters (test) | UNLV (test) | Time (1000 pages)
Tesseract3 | Unknown | 86.3% | 83.4% | 87.7% | 72.4% | 5600s
Tesseract4 | Unknown | 94.6% | 91.0% | 92.6% | 76.8% | 14800s
ChargridOCR-32 | Wiki | 97.3% | 86.4% | 89.4% | 75.3% | 241s
ChargridOCR-32 | Wiki+EDGAR | 97.2% | 91.4% | 92.3% | 80.4% | 241s
ChargridOCR-64 | Wiki+EDGAR | 98.8% | 91.6% | 93.5% | 81.6% | 550s
Table 1: Results, reported in terms of Word Recognition Rate. Tesseract run-times are obtained using 1 Xeon E5-2698 CPU core and Chargrid-OCR’s on 1 V100 GPU.
Figure 2: Three faulty example crops (top, middle, bottom row) from the validation set. From left to right: original image, predicted segmentation mask, predicted character boxes (after post-processing), and resulting extracted words (blue if they match the ground truth, red otherwise).

4 Conclusion

We presented a new end-to-end trainable optical character recognition pipeline that is based on state-of-the-art computer vision approaches using object detection and semantic segmentation. Our pipeline is significantly simpler compared to other sequential and line-based approaches, especially those used for document-level optical character recognition such as Tesseract 4. We empirically show that our model outperforms Tesseract 4 on a number of diverse evaluation datasets by a large margin both in terms of accuracy and run-time.


  • T. M. Breuel (2017a) High performance text recognition using a hybrid convolutional-LSTM implementation. In ICDAR 2017, pp. 11–16.
  • T. M. Breuel (2017b) Robust, simple page segmentation using hybrid convolutional MDLSTM networks. In ICDAR 2017, Vol. 1, pp. 733–740.
  • M. Busta, L. Neumann, and J. Matas (2017) Deep TextSpotter: an end-to-end trainable scene text localization and recognition framework. In ICCV 2017, pp. 2223–2231.
  • EDGAR: SEC Electronic Data Gathering, Analysis, and Retrieval system.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In ICCV 2017, pp. 2961–2969.
  • A. R. Katti, C. Reisswig, C. Guder, S. Brarda, S. Bickel, J. Höhne, and J. B. Faddoul (2018) Chargrid: towards understanding 2D documents. In EMNLP 2018, pp. 4459–4469.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV 2017, pp. 2980–2988.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In ECCV 2016, pp. 21–37.
  • S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497.
  • S. V. Rice, F. R. Jenkins, and T. A. Nartker (1995) The fourth annual test of OCR accuracy. Technical Report 95.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI 2015, pp. 234–241.
  • A. Shahab, F. Shafait, T. Kieninger, and A. Dengel (2010) An open approach towards the benchmarking of table structure recognition systems. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 113–120.
  • R. Smith (2007) An overview of the Tesseract OCR engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 2, pp. 629–633.
  • Tesseract (2018) Tesseract OCR engine, v4.