Efficient, Lexicon-Free OCR using Deep Learning

06/05/2019 ∙ by Marcin Namysl, et al. ∙ Fraunhofer

Contrary to popular belief, Optical Character Recognition (OCR) remains a challenging problem when text occurs in unconstrained environments, like natural scenes, due to geometrical distortions, complex backgrounds, and diverse fonts. In this paper, we present a segmentation-free OCR system that combines deep learning methods, synthetic training data generation, and data augmentation techniques. We render synthetic training data using large text corpora and over 2000 fonts. To simulate text occurring in complex natural scenes, we augment extracted samples with geometric distortions and with a proposed data augmentation technique: alpha-compositing with background textures. Our models employ a convolutional neural network encoder to extract features from text images. Inspired by the recent progress in neural machine translation and language modeling, we examine the capabilities of both recurrent and convolutional neural networks in modeling the interactions between input elements.


I Introduction

Optical character recognition (OCR) is one of the most widely studied problems in the fields of pattern recognition and computer vision. It is not limited to printed documents, but also covers handwritten documents [1] and natural scene text [2]. The accuracy of various OCR methods has recently greatly improved due to advances in deep learning [3, 4, 5]. Moreover, many current open-source and commercial products reach a high recognition accuracy and good throughput for run-of-the-mill printed document images. While this has led the research community to regard OCR as a largely solved problem, we show that even the most successful and widespread OCR solutions are unable to robustly handle large font varieties or distorted texts, potentially superimposed on complex backgrounds. Such unconstrained environments for digital documents have already become predominant, due to the wide availability of mobile phones and various specialized video recording devices.

In contrast to popular OCR engines, methods used in scene text recognition [6, 7] exploit computationally expensive network models, aiming to achieve the best possible recognition rates on popular benchmarks. Such methods are tuned to deal with significantly smaller amounts of text per image and are often constrained to predefined lexicons. Commonly used evaluation protocols substantially limit the diversity of symbols to be recognized, e.g., by ignoring all non-alphanumeric characters or neglecting case sensitivity [8]. Hence, models designed for scene text are generally inadequate for printed document OCR, where high throughput and support for a great variety of symbols are essential.

In this paper, we address the general OCR problem and try to overcome the limitations of both printed- and scene text recognition systems. To this end, we present a fast and robust deep learning multi-font OCR engine, which currently recognizes 132 different character classes. Our models are trained almost exclusively on synthetically generated documents. We employ segmentation-free text recognition methods that require a much lower data labeling effort, making the resulting framework more readily extensible to new languages and scripts. Subsequently, we propose a novel data augmentation technique that improves the robustness of neural models for text recognition. Several large and challenging datasets, consisting of both real and synthetically rendered documents, are used to evaluate all OCR methods. The comparison with leading established commercial engines (ABBYY FineReader, https://www.abbyy.com/en-eu/ocr-sdk/; OmniPage Capture, https://www.nuance.com/print-capture-and-pdf-solutions/optical-character-recognition/omnipage/omnipage-for-developers.html) and open-source engines (Tesseract 3 [9]; Tesseract 4, https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM) shows that the proposed solutions obtain significantly better recognition results with comparable execution time.

The remaining part of this paper is organized as follows: in Section II, we highlight related research papers, while in Section III, we describe the datasets used in our experiments, as well as the data augmentation routines. In Section IV, we present the detailed system architecture, which is then evaluated and compared against several state-of-the-art OCR engines in Section V. Our conclusions, alongside a few worthwhile avenues for further investigations, are the subject of the final Section VI.

II Related work

In this section, we review related approaches for printed-, handwritten- and scene text recognition. These can be broadly categorized into segmentation-based and segmentation-free methods.

Segmentation-based OCR methods recognize individual character hypotheses, explicitly or implicitly generated by a character segmentation method. The output is a recognition lattice containing various segmentation and recognition alternatives weighted by the classifier. The lattice is then decoded, e.g., via a greedy or beam search method, and the decoding process may also make use of an external language model or allow the incorporation of certain (lexicon) constraints.

The PhotoOCR system for text extraction from smartphone imagery, proposed by Bissacco et al.[10], is a representative example of a segmentation-based OCR method. They used a deep neural network trained on extracted histogram of oriented gradient (HOG) features for character classification and incorporated a character-level language model into the score function.

The accuracy of segmentation-based methods heavily suffers from segmentation errors and the lack of context information wider than a single cropped character-candidate image during classification. Improper incorporation of an external language model or lexicon constraints can degrade accuracy[11]. While offering a high flexibility in the choice of segmenters, classifiers, and decoders, segmentation-based approaches require a similarly high effort in order to tune optimally for specific application scenarios. Moreover, the precise weighting of all involved hypotheses must be re-computed from scratch as soon as one component is updated (e.g., the language model), whereas the process of data labeling (e.g., at the character/pixel level) is usually a painstaking endeavor, with a high cost in terms of human annotation labor.

Segmentation-free OCR methods eliminate the need for pre-segmented inputs. Cropped words or entire text lines are usually geometrically normalized (III-B) and can then be directly recognized. Previous works on segmentation-free OCR [12] employed Hidden Markov Models (HMMs) to avoid the difficulties of segmentation-based techniques. Most of the recently developed segmentation-free solutions employ recurrent and convolutional neural networks.

Multi-directional, multi-dimensional recurrent neural networks (MDRNNs) currently enjoy a high popularity among researchers in the field of handwriting recognition [1] because of their ability to attain state-of-the-art recognition rates. They generalize standard Recurrent Neural Networks (RNNs) by providing recurrent connections along all spatio-temporal dimensions, making them robust to local distortions along any combination of the respective input dimensions. Bidirectional RNNs, consisting of two hidden layers that traverse the input sequence in opposite spatial directions (i.e., left-to-right and right-to-left), connected to a single output layer, were found to be well-suited for both handwriting [13] and printed text recognition [14]. In order to mitigate the vanishing/exploding gradient problem, most RNNs use Long Short-Term Memory (LSTM) units (or variants thereof) as building blocks. A noteworthy extension of the LSTM cell is the "peephole" LSTM unit [15], where the multiplicative gates compute their activations at the current time step using, in addition, the activation of the memory cell from the previous time step.
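For reference, a standard peephole formulation (following the common notation associated with [15]; not reproduced from this paper) lets the input and forget gates additionally observe the previous cell state and the output gate the current cell state:

$$\begin{aligned}
i_t &= \sigma\!\left(W_{xi} x_t + W_{hi} h_{t-1} + w_{ci} \odot c_{t-1} + b_i\right),\\
f_t &= \sigma\!\left(W_{xf} x_t + W_{hf} h_{t-1} + w_{cf} \odot c_{t-1} + b_f\right),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right),\\
o_t &= \sigma\!\left(W_{xo} x_t + W_{ho} h_{t-1} + w_{co} \odot c_t + b_o\right),\\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}$$

where $\sigma$ is the logistic sigmoid, $\odot$ denotes element-wise multiplication, and $w_{c*}$ are the peephole weight vectors.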

MDRNNs are computationally much more expensive than their basic 1-D variant, both during training and inference. Because of this, they have been less frequently explored in the field of printed document OCR. Instead, in order to overcome the issue of sensitivity to stroke variations along the vertical axis, researchers have proposed different solutions. For example, Breuel et al. [14] combined a standard 1-D LSTM network architecture with a text line normalization method for performing OCR of printed Latin and Fraktur scripts. In a similar manner, by normalizing the positions and baselines of letters, Yousefi et al. [16] achieved superior performance and faster convergence with a 1-D LSTM network over a 2-D variant for Arabic handwriting recognition.

An additional advantage of segmentation-free approaches is their inherent ability to work directly on grayscale or full-color images. This increases the robustness and accuracy of text recognition, as any information loss caused by a previously mandatory binarization step can be avoided. Asad et al. [17] applied the 1-D LSTM network directly to original, blurred document images and were able to obtain state-of-the-art recognition results.

The introduction of convolutional neural networks (CNNs) allowed for a further jump in the recognition rates. Since CNNs are able to extract latent representations of input images, thus increasing robustness to local distortions, they can be successfully employed as a substitute for MD-LSTM layers. Breuel [18] proposed a model that combined CNNs and LSTMs for printed text recognition. Features extracted by CNNs were combined and fed into the LSTM network with a Connectionist Temporal Classification (CTC) [19] output layer. A few recent methods have completely forgone the use of the computationally expensive recurrent layers and rely purely on convolutional layers for modeling the local context. Borisyuk et al. [20] presented a scalable OCR system called Rosetta, which employs a fully convolutional network (FCN) model followed by the CTC layer in order to extract the text depicted in images uploaded daily to Facebook.

In the current work, we build upon the previously mentioned techniques and propose an end-to-end segmentation-free OCR system. Our approach is purely data-driven and can be adapted with minimal manual effort to different languages and scripts. Feature extraction from text images is realized using convolutional layers. Using the extracted features, we analyze the ability to model local context with both recurrent and fully convolutional sequence-to-sequence architectures. The alignment of the extracted features with ground-truth transcripts is realized via a CTC layer. To the best of our knowledge, this is the first work that compares fully convolutional and recurrent models in the context of OCR.

III Datasets and data preparation

To train and evaluate our OCR system we prepared several datasets, consisting of both real and synthetic documents. This section describes each in detail, as well as the preparation of training, validation, and test samples, data augmentation techniques, and the geometric normalization procedure.

We collected pages of scanned historical and recent German-language newspapers as well as contemporary German invoices. All documents were deskewed and pre-processed via document layout analysis algorithms, providing us with the geometrical and logical document structure, including bounding boxes, baseline positions, and x-height values for each text line. The initial transcriptions obtained using the Tesseract [9] OCR engine were manually corrected.

Even without the need for character- or word-level ground truth, the manual annotation process proved to be error-prone and time-consuming. Motivated by the work of Jaderberg et al. [21], we developed an automatic synthetic data generation process. Two large text corpora, namely the English and German Wikipedia dump files (https://dumps.wikimedia.org/), were used as training sources for generating sentences. For validation and test purposes, we used a corpus from the Leipzig Corpora Collection [22]. The texts were rendered using a set of over 2,000 serif, sans serif, and monospace fonts (https://fonts.google.com/).

The generation process first selects a piece of text (up to a fixed maximum number of characters) from the corpus and renders it on an image with a randomly chosen font. The associated attributes (i.e., bounding boxes, baseline positions, and x-height values) used for rendering are stored in the corresponding document layout structure. A counter for the number of occurrences of every individual character in the generated dataset is maintained and used to guide the text extraction mechanism to choose text pieces containing the less frequently represented symbols. Upon generating enough text line samples to fill an image of pre-specified dimensions, the image is saved to disk together with the associated layout information. The procedure described above is repeated until the number of occurrences of each symbol reaches a required minimum level (set separately for our synthetic training, validation, and test sets), which guarantees that even rare characters are sufficiently well represented in each of the generated datasets, or until all text files have been processed. By using sentences from real corpora, we ensure that the sampled character and n-gram distribution is the same as that of natural language texts.
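A minimal sketch of this frequency-guided sampling loop (hypothetical Python; the function names, candidate-pool size, maximum text length, and the rendering callback are illustrative and not taken from the paper):

```python
from collections import Counter
import random

def select_text_piece(corpus_lines, char_counts, min_count, max_chars=100):
    """Pick, from a small random candidate pool, the corpus line that contains
    the most characters which are still under-represented in the dataset."""
    def score(line):
        return sum(1 for ch in set(line) if char_counts[ch] < min_count)
    candidates = random.sample(corpus_lines, k=min(256, len(corpus_lines)))
    piece = max(candidates, key=score)[:max_chars]
    char_counts.update(piece)
    return piece

def generate_samples(corpus_lines, charset, min_count, render_line):
    """Render text lines until every symbol of the target charset has been
    seen at least `min_count` times (or a sample budget is exhausted)."""
    char_counts = Counter()
    samples = []
    while any(char_counts[ch] < min_count for ch in charset):
        piece = select_text_piece(corpus_lines, char_counts, min_count)
        samples.append(render_line(piece))   # rasterize with a randomly chosen font
        if len(samples) >= 1_000_000:        # illustrative safety limit
            break
    return samples
```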

A summary of our data sources is presented in TABLE I. We train and recognize 132 different character classes, including basic lower- and upper-case Latin letters, the whitespace character, German umlauts, the ligature ß, digits, punctuation marks, subscripts, superscripts, as well as mathematical, currency, and other commonly used symbols. The training data consists of about million characters, of which were synthetically generated.

Newspapers Invoices Synthetic documents
Training Test Test Training Validation Test
Documents
Text lines
GT length
TABLE I: Summary of the used data sources.

The batches containing the final training and validation samples are generated on the fly, as follows. Text line images are randomly selected from the corresponding (training or validation) dataset and the associated layout information is used to normalize each sample. Note that the normalization step is optional (see also III-B), since, especially in the case of scene text, it may be too computationally expensive and error-prone to extract exact baselines and x-heights at inference time. All samples are re-scaled to a fixed height (in pixels), while maintaining the aspect ratio. This particular choice of sample height was determined experimentally. Larger sample heights did not improve recognition accuracy for skew-free text lines. However, if the targeted use case involves the recognition of relatively long, free-form text lines, the use of taller samples is warranted. Since text line lengths vary greatly, the corresponding images must be (zero-)padded appropriately to fit the widest element within the batch. We minimize the amount of padding by composing batches from text lines having similar widths. Subsequently, random data augmentation methods are dynamically applied to each sample (III-A).
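The bucketing and padding step can be sketched as follows (a hypothetical helper assuming grayscale NumPy arrays; the target height of 32 px and all names are illustrative, since the exact value is not stated in this excerpt):

```python
import numpy as np
import cv2

TARGET_HEIGHT = 32   # illustrative value; the paper's exact sample height is not given here

def rescale_to_height(img, target_h=TARGET_HEIGHT):
    """Resize a grayscale text-line image to a fixed height, keeping the aspect ratio."""
    h, w = img.shape[:2]
    new_w = max(1, int(round(w * target_h / h)))
    return cv2.resize(img, (new_w, target_h), interpolation=cv2.INTER_AREA)

def make_batches(line_images, batch_size):
    """Group lines of similar width into batches and zero-pad to the widest line."""
    scaled = [rescale_to_height(img) for img in line_images]
    scaled.sort(key=lambda im: im.shape[1])           # width bucketing
    for i in range(0, len(scaled), batch_size):
        chunk = scaled[i:i + batch_size]
        max_w = max(im.shape[1] for im in chunk)
        batch = np.zeros((len(chunk), TARGET_HEIGHT, max_w), dtype=np.float32)
        widths = np.empty(len(chunk), dtype=np.int32)
        for j, im in enumerate(chunk):
            batch[j, :, :im.shape[1]] = im
            widths[j] = im.shape[1]                   # needed later as CTC logit lengths
        yield batch, widths
```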

Fig. 1: Training and validation data samples from our system.

III-A Data augmentation

We apply standard data augmentation methods, like Gaussian smoothing, perspective distortions, morphological filtering, downscaling, additive noise, and elastic distortions [23], during training and validation. Additionally, we propose a novel augmentation technique: alpha compositing [24] with background texture images. Each time a specific sample is presented to the network, it is alpha-composited with a randomly selected background texture image (Fig. 1). By randomly altering the backgrounds of training samples, the network is guided to focus on significant text features and learns to ignore background noise. The techniques mentioned above are applied dynamically to both training and validation samples. In contrast to the approach proposed by Jaderberg et al. [21], we render undistorted synthetic documents once and then apply random data augmentations dynamically. This allows us to generate samples efficiently and eliminates the significant overhead caused by disk I/O operations.
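One simple way to realize this augmentation is a constant-alpha Porter-Duff "over" blend of the text-line image with a resized texture (a hedged sketch; the paper does not state the exact blending weights or per-pixel alpha handling it uses):

```python
import random
import numpy as np
import cv2

def alpha_composite_with_texture(line_img, texture_paths, alpha_range=(0.6, 0.95)):
    """Blend a rendered text line over a randomly chosen background texture.

    line_img: float32 grayscale image in [0, 1]. alpha_range is illustrative.
    """
    texture = cv2.imread(random.choice(texture_paths), cv2.IMREAD_GRAYSCALE)
    texture = cv2.resize(texture, (line_img.shape[1], line_img.shape[0]))
    texture = texture.astype(np.float32) / 255.0

    alpha = random.uniform(*alpha_range)
    composited = alpha * line_img + (1.0 - alpha) * texture   # "over" with constant alpha [24]
    return np.clip(composited, 0.0, 1.0)
```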

III-B Geometric normalization

Breuel [18] recommended that text line images should be geometrically normalized prior to recognition. We trained models with and without such normalization in order to verify this assumption. The normalization step is performed per text line, before feature extraction. During training, we use the saved text line attributes, whereas at inference time, the layout analysis algorithm provides the baseline position and the x-height value for each line. Using the baseline information, the skew of the text lines is corrected. The scale of each image is normalized, so that the distance between the baseline and the x-height line, as well as the heights of ascenders and descenders, are approximately constant.
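A rough sketch of this per-line normalization (hypothetical code; the target x-height value and the use of OpenCV are illustrative choices, not taken from the paper):

```python
import numpy as np
import cv2

def normalize_text_line(img, baseline_pts, x_height, target_x_height=16):
    """Deskew a text line using its baseline and rescale it so that the
    baseline-to-x-height distance becomes approximately constant.

    baseline_pts: two (x, y) points on the baseline; target_x_height is an
    illustrative constant, not a value taken from the paper.
    """
    (x0, y0), (x1, y1) = baseline_pts
    angle = np.degrees(np.arctan2(y1 - y0, x1 - x0))

    # Rotate around the image center to make the baseline horizontal.
    h, w = img.shape[:2]
    rot = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    deskewed = cv2.warpAffine(img, rot, (w, h), flags=cv2.INTER_LINEAR,
                              borderMode=cv2.BORDER_REPLICATE)

    # Scale so the x-height matches the target; ascenders/descenders follow.
    scale = target_x_height / float(x_height)
    new_size = (max(1, int(round(w * scale))), max(1, int(round(h * scale))))
    return cv2.resize(deskewed, new_size, interpolation=cv2.INTER_AREA)
```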

For the unnormalized case, the normalization procedure is skipped entirely and each cropped text line sample is further passed on to the feature extractor. This case is especially relevant for scene text images, where normalization information is usually unavailable and expensive to compute separately.

IV System architecture

The architecture of our hybrid CNN-LSTM model is depicted in Fig. 2 and is inspired by the CRNN [2] and Rosetta [20] systems. The bottom part consists of convolutional layers that extract high-level features of an image. Activation maps obtained by the last convolutional layer are transformed into a feature sequence with the map to sequence operation. Specifically, 3D maps are sliced along their width dimension into 2D maps, and each map is then flattened into a vector. The resulting feature sequence is fed to a bidirectional recurrent neural network with hidden units in both directions. The output sequences from both layers are concatenated and fed to a linear layer with a softmax activation function to produce a per-timestep probability distribution over the set of available classes. The CTC output layer is employed to compute a loss between the network outputs and the ground truth transcriptions. During inference, CTC loss computation is replaced by greedy CTC decoding. TABLE II presents the detailed structure of our recurrent model.

Fig. 2: The architecture of our system. The gray region indicates a recurrent block omitted in case of the fully convolutional model.
Operation Output volume size
Conv2d (; stride: )
Max pooling (; stride: )
Conv2d (; stride: )
Max pooling (; stride: )
Map to sequence
Dropout (50%)
Bidirectional RNN (units: )
Dropout (50%)
Linear mapping (units: num_classes)
CTC output layer output_sequence_length
TABLE II: Detailed structure of our recurrent model.
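A compact sketch of such a hybrid model in Keras (illustrative only: filter counts, pooling sizes, unit numbers, and the fixed input height are placeholders, since the exact values of TABLE II are not reproduced in this excerpt; setting use_recurrent=False removes the gray recurrent block of Fig. 2):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(num_classes, height=32, rnn_units=256, use_recurrent=True):
    """CNN encoder -> map-to-sequence -> (optional) BiLSTM -> per-timestep logits.
    The width dimension is left variable (None) to handle text lines of any length."""
    inputs = layers.Input(shape=(height, None, 1))             # H x W x 1 grayscale

    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(2, 1))(x)               # keep width resolution

    # Map to sequence: treat the (downsampled) width as the time axis and flatten
    # the remaining height x channels into the feature vector of each timestep.
    x = layers.Permute((2, 1, 3))(x)                           # W x H x C
    x = layers.Reshape((-1, x.shape[2] * x.shape[3]))(x)       # W x (H*C)

    if use_recurrent:
        x = layers.Dropout(0.5)(x)
        x = layers.Bidirectional(layers.LSTM(rnn_units, return_sequences=True))(x)
        x = layers.Dropout(0.5)(x)

    logits = layers.Dense(num_classes + 1)(x)                  # +1 for the CTC blank
    return tf.keras.Model(inputs, logits)
```

During training, the resulting logits would be paired with tf.nn.ctc_loss, and at inference time tf.nn.ctc_greedy_decoder provides the greedy decoding mentioned above.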

In the case of our fully convolutional model, feature sequences transformed by the map to sequence operation (see the previous paragraph) are fed directly to a linear layer, skipping the recurrent components entirely. TABLE III presents the detailed structure of our fully convolutional model; a usage sketch follows the table.

Operation Output volume size
Conv2d (; stride: )
Max pooling (; stride: )
Conv2d (; stride: )
Conv2d (, stride: )
Conv2d (, stride: )
Conv2d (, stride: )
Conv2d (, stride: )
Conv2d (, stride: )
Conv2d (, stride: )
Conv2d (, stride: )
Conv2d (, stride: )
Map to sequence
Linear layer (units: num_classes)
CTC output layer output_sequence_length
TABLE III: Detailed structure of our fully convolutional model.
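Under the same illustrative sketch given after TABLE II, the fully convolutional variant simply drops the recurrent block (the real model additionally uses the deeper convolutional stack of TABLE III):

```python
# Hypothetical usage of the build_model sketch from above; 132 matches the number
# of character classes reported in Section III (the CTC blank is added internally).
fcn_model = build_model(num_classes=132, use_recurrent=False)
```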

All models were trained via minibatch stochastic gradient descent using the Adaptive Moment Estimation (Adam) optimization method [25]. The learning rate is decayed by a constant factor after a fixed number of iterations, with a different initial value for the recurrent and the fully convolutional model. Batch normalization [3] is applied after every convolutional block to speed up the training. The hybrid models were trained for approximately epochs and the fully convolutional models for about epochs.
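A minimal training-step sketch with Adam, a step-wise exponential learning-rate decay, and the CTC loss (all hyperparameter values are placeholders, since the actual initial learning rates, decay factor, and decay interval are not given in this excerpt):

```python
import tensorflow as tf

# Illustrative hyperparameters only.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10_000, decay_rate=0.9, staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

@tf.function
def train_step(model, images, labels, logit_lengths, label_lengths):
    """One minibatch update with the CTC loss."""
    with tf.GradientTape() as tape:
        logits = model(images, training=True)            # (batch, time, classes + 1)
        loss = tf.reduce_mean(tf.nn.ctc_loss(
            labels=labels,                               # dense, zero-padded int labels
            logits=logits,
            label_length=label_lengths,
            logit_length=logit_lengths,
            logits_time_major=False,
            blank_index=-1))                             # blank is the last class
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```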

The Python interface of the TensorFlow [26] framework was used for training all models. The inference timings were done via TensorFlow's C++ interface.

V Evaluation and discussion

We compare the performance of our system with two established commercial OCR products, ABBYY FineReader 12 and OmniPage Capture SDK 20.2, and with a popular open-source OCR library, Tesseract (versions 3 and 4). The latest Tesseract engine uses deep learning models similar to ours. Recognition is performed at the text line level. The ground truth layout structure is used to crop samples from the document images.

Since it was shown that LSTMs learn an implicit language model [27], we evaluate our system without external language models or lexicons, although their use can likely further increase accuracy. By contrast, both examined commercial engines use language models and lexicons for English and German, and their settings have been chosen for best recognition accuracy. We use the fast integer Tesseract 4 models (https://github.com/tesseract-ocr/tessdata_fast) because they demonstrate a running time comparable to the other examined methods.

Our data sources are summarized in TABLE I. We conduct experiments on the test documents with (Type-2, Type-3) and without (Type-1) additional distortions (III-A) applied prior to decoding. We explore two different scenarios for the degradations. In the first scenario, only geometrical transformations, morphological operations, blur, noise addition and downscaling are considered (Type-2). This scenario corresponds to the typical case of printed document scans of varying quality. In the second scenario, all extracted text line images are additionally alpha-composited with a random background texture (Type-3). Different texture sets are used for training and testing. Additionally, we randomly invert the image gray values. This scenario best corresponds to scene text recognition. Note that since the distortions are applied randomly, some images obtained by this procedure may end up nearly illegible, even for human readers.

We aggregate results from multiple experiments (every text line image is randomly distorted several times) and report the average error values. TABLE IV summarizes our test datasets. We evaluate all methods on both original and distorted text lines.

Newspapers Invoices Synthetic documents
Type-1 Type-2 Type-1 Type-2 Type-1 Type-2 Type-3
TABLE IV: Ground truth lengths of datasets used in our experiments.

TABLE V compares the error rates of all examined OCR engines. We use the Levenshtein edit distance metric [28] to measure the character error rate (CER). All of our models, unless otherwise stated, are fine-tuned with real data, use geometric text line normalization, and apply all data augmentation methods (III-A) except elastic distortions. The results show that our system outperforms all other methods in terms of recognition accuracy in all scenarios. A substantial difference can primarily be observed on distorted documents alpha-composited with background textures, where Tesseract and both commercial engines exhibit a very poor recognition performance. Noisy backgrounds hinder their ability to perform an adequate character segmentation. Although Tesseract 4 was trained on augmented synthetic data, we observe that it cannot properly deal with significantly distorted inputs. The established solutions have problems recognizing subscript and superscript symbols. Both commercial engines have great difficulties in handling fonts with different, alternating styles located on the same page.
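For completeness, the CER used here reduces to a plain edit-distance ratio; a minimal sketch (function names are illustrative):

```python
def levenshtein(ref, hyp):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def character_error_rate(references, hypotheses):
    """CER = total edit distance / total number of ground-truth characters."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / max(1, total_chars)
```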

TABLE VI gives an insight into the most frequent errors (insertions, deletions, and substitutions) made by the best performing proposed and commercial methods on real versus synthetic data. All tested methods have the most difficulties in recognizing the exact number of whitespace characters due to non-uniform letter and word spacing (kerning, justified text) across documents. This problem is particularly visible on the manually corrected real documents, where a certain degree of ambiguity due to human judgment becomes apparent. The remaining errors for the hybrid models look reasonable and seem to be primarily focused on small or thin characters, which are indeed the ones most affected by distortions and background patterns. In contrast, ABBYY FineReader exhibits a clear tendency to insert spurious characters, especially for highly textured and distorted images.

Newspapers Invoices Synthetic
Type-1 Type-2 Type-1 Type-2 Type-1 Type-2 Type-3
ABBYY FineReader
OmniPage Capture
Tesseract 3
Tesseract 4
Ours
Ours
Ours
Ours
Ours
Ours
Ours
Ours
denotes the fully convolutional model.
denotes the hybrid model.
denotes that no geometric normalization was used.
denotes the use of peephole LSTM units.
denotes training with elastic distortions.
denotes training without alpha compositing with background textures.
denotes training exclusively with synthetic data.
TABLE V: Character error rates (%) on the test datasets.
Fig. 3: Runtime comparison (in seconds) on the test datasets (per standard page with 1 500 symbols). Values for the original documents (Type-1) and CPU experiments are averaged over 10 and for all other experiments over 30 trials. All CPU experiments use a batch size of images, whereas the GPU runs use a batch of images. Note that data points from different datasets are connected solely to allow easier traceability of each engine.
Error type %
Insertion of ’ ’
Substitution ’l’’i’
Substitution ’.’’,’
Insertion of ’.’
Substitution ’i’’l’
Insertion of ’_’
Substitution ’I’’l’
Insertion of ’t’
Substitution ’o’’a’
Substitution ’f’’t’
(a) Ours (real data)
Error type %
Insertion of ’ ’
Substitution ’0’’O’
Deletion of ’.’
Substitution ’O’’Ö’
Deletion of ’_’
Deletion of ’-’
Substitution ’I’’l’
Substitution ’.’’,’
Insertion of ’r’
Substitution ’©’’O’
(b) Ours (synthetic data)
Error type %
Insertion of ’ ’
Insertion of ’.’
Substitution ’,’’.’
Insertion of ’i’
Insertion of ’r’
Deletion of ’e’
Substitution ’c’’e’
Insertion of ’l’
Insertion of ’n’
Insertion of ’t’
(c) ABBYY FineReader (real data)
Error type %
Insertion of ’ ’
Insertion of ’i’
Insertion of ’e’
Insertion of ’t’
Insertion of ’r’
Insertion of ’.’
Insertion of ’l’
Insertion of ’-’
Insertion of ’n’
Insertion of ’a’
(d) ABBYY FineReader (synthetic data)
TABLE VI: Top 10 most frequent errors for the best proposed- and commercial OCR engines. Note that the real data (left side) comprises both newspapers and invoices.

Fig. 3 presents the runtime comparison. Both commercial engines and Tesseract 3 work slowly for significantly distorted images. Apparently, they make use of certain computationally expensive image restoration techniques in order to be able to handle low-quality inputs. Unsurprisingly, the GPU-accelerated models are fastest across the board. We discuss the runtime on CPU in V-A. All experiments were conducted on a workstation equipped with an Nvidia GeForce GTX 745 graphics card and an Intel Core i7-6700 CPU.

V-A Ablation study

In this section, we analyze the impact of different model components on the recognition performance of our system.

V-A1 Fully convolutional vs. recurrent model

The fully convolutional model achieves a slightly lower accuracy than the best recurrent variant. However, its inference time is significantly lower on the CPU. This clearly shows that convolutional layers are much more amenable to parallelization than recurrent units.

V-A2 Peephole connections

The model that uses peephole LSTM cells and pools feature maps along the width dimension only once exhibits a better recognition accuracy in the scene text scenario. This is not the case for typical document scans, where the peepholes do not seem to bring any additional accuracy gains compared to the vanilla LSTM model. The use of peephole connections does, however, add a significant runtime overhead in all cases.

V-A3 Alpha compositing with background textures (III-A)

We train one model without alpha compositing with background textures. The model exhibits significantly higher error rates, not only on samples with complicated backgrounds but also on those with significant distortions. This confirms our assumption that this augmentation technique generally has a positive effect on the robustness of neural OCR models.

V-A4 Geometric normalization (III-B)

The model using no geometric normalization exhibits a drop in accuracy especially for images showing stronger distortions. This indicates that geometric normalization is indeed beneficial, but not indispensable. Apparently, max pooling and strided convolution operations provide enough translational invariance.

V-A5 Training only on synthetic data

We train two models exclusively on synthetic training data. They obtain very competitive results, which indicates that training on synthetic data with proper data augmentation is sufficient for achieving a satisfactory recognition accuracy.

V-A6 Elastic distortions

We found that non-linear distortions can further reduce the error rate of models, particularly those trained exclusively on synthetic data. Hence, this augmentation method is beneficial, especially in cases where annotated real data is not available or simply too difficult to produce. We also observe that although most of our models were trained without elastic distortions applied to the training data, they can nonetheless deal with test data augmented with non-linear distortions. We attribute this to the fact that we used a substantial number of fonts to generate our synthetic training data, achieving an adequate variation of text styles.

VI Conclusions and future work

In this paper, we described our general and efficient OCR solution. Experiments under different scenarios, on both real and synthetically-generated data, showed that both proposed architectures outperform leading commercial and open-source engines. In particular, we demonstrated an outstanding recognition accuracy on severely degraded inputs.

The architecture of our system is universal and can be used to recognize printed, handwritten, or scene text. The training of models for other languages is straightforward. Via the proposed pipeline, deep neural network models can be trained using only text line-level annotations. This saves a considerable manual annotation effort, previously required for producing the character- or word-level ground truth segmentations and the corresponding transcriptions.

A novel data augmentation technique, alpha compositing with background textures, is introduced and evaluated with respect to its effects on the overall recognition robustness. Our experiments showed that synthetic data is indeed a viable and scalable alternative to real data, provided that sufficiently diverse samples are generated by the data augmentation modules. The effect of different structural choices and data augmentation on recognition accuracy and inference time is experimentally investigated. Hybrid recognition architectures proved to be more accurate, but also considerably more computationally costly than purely convolutional approaches.

The importance of a solid data generation pipeline cannot be overstated. As such, future work will involve its continuous improvement and comparison with other notable efforts from the research community, e.g., [21]. We also plan to make the synthetic data used in our experiments publicly available. We feel that fully convolutional approaches, in particular, offer great potential for future improvement. The incorporation of recent advances, such as residual connections [4] and squeeze-and-excitation blocks [5], into our general OCR architecture seems to be a promising direction.

Acknowledgment

This work was supported by the German Federal Ministry of Education and Research (BMBF) funded program KMU-innovativ in the project DeepER.

References

  • [1] A. Graves and J. Schmidhuber, “Offline handwriting recognition with multidimensional recurrent neural networks,” in Proceedings of the 21st International Conference on Neural Information Processing Systems, ser. NIPS’08.   USA: Curran Associates Inc., 2008, pp. 545–552.
  • [2] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2298–2304, Nov. 2017.
  • [3] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 770–778.
  • [5] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
  • [6] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai, “Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes,” in Proc. European Conference on Computer Vision (ECCV), September 2018.
  • [7] M. Busta, L. Neumann, and J. Matas, “Deep textspotter: An end-to-end trainable scene text localization and recognition framework,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2223–2231.
  • [8] K. Wang, B. Babenko, and S. Belongie, “End-to-end scene text recognition,” in Proc. Int. Conf. Computer Vision, Nov. 2011, pp. 1457–1464.
  • [9] R. Smith, “An overview of the tesseract ocr engine,” in Proc. Ninth Int. Conf. Document Analysis and Recognition (ICDAR 2007), vol. 2, Sep. 2007, pp. 629–633.
  • [10] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, “Photoocr: Reading text in uncontrolled conditions,” in Proc. IEEE Int. Conf. Computer Vision, Dec. 2013, pp. 785–792.
  • [11] R. Smith, “Limits on the application of frequency-based language models to ocr,” in Proc. Int. Conf. Document Analysis and Recognition, Sep. 2011, pp. 538–542.
  • [12] S. F. Rashid, F. Shafait, and T. M. Breuel, “Scanning neural network for text line recognition,” in 2012 10th IAPR International Workshop on Document Analysis Systems, March 2012, pp. 105–109.
  • [13] A. Graves et al., “A novel connectionist system for unconstrained handwriting recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, May 2009.
  • [14] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, “High-performance OCR for printed english and fraktur using lstm networks,” in Proc. 12th Int. Conf. Document Analysis and Recognition, Aug. 2013, pp. 683–687.
  • [15] F. A. Gers and J. Schmidhuber, “Recurrent nets that time and count,” in Proc. IEEE-INNS-ENNS Int. Joint Conf. Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, vol. 3, Jul. 2000, pp. 189–194 vol.3.
  • [16] M. Reza Yousefi, M. R. Soheili, T. M. Breuel, and D. Stricker, “A comparison of 1d and 2d lstm architectures for the recognition of handwritten arabic,” Proceedings of SPIE - The International Society for Optical Engineering, vol. 9402, 02 2015.
  • [17] F. Asad, A. Ul-Hasan, F. Shafait, and A. Dengel, “High performance ocr for camera-captured blurred documents with lstm networks,” in Proc. 12th IAPR Workshop Document Analysis Systems (DAS), Apr. 2016, pp. 7–12.
  • [18] T. M. Breuel, “High performance text recognition using a hybrid convolutional-lstm implementation,” in Proc. 14th IAPR Int. Conf. Document Analysis and Recognition (ICDAR), vol. 01, Nov. 2017, pp. 11–16.
  • [19] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML ’06. New York, NY, USA: ACM, 2006, pp. 369–376.
  • [20] F. Borisyuk, A. Gordo, and V. Sivakumar, “Rosetta: Large scale system for text detection and recognition in images,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD ’18.   New York, NY, USA: ACM, 2018, pp. 71–79.
  • [21] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic data and artificial neural networks for natural scene text recognition,” arXiv preprint arXiv:1406.2227, 2014.
  • [22] D. Goldhahn, T. Eckart, and U. Quasthoff, “Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages.” in LREC, vol. 29, 2012, pp. 31–43.
  • [23] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in Proc. Seventh Int. Conf. Document Analysis and Recognition, Aug. 2003, pp. 958–963.
  • [24] T. Porter and T. Duff, “Compositing digital images,” SIGGRAPH Comput. Graph., vol. 18, no. 3, pp. 253–259, Jan. 1984.
  • [25] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 2014.
  • [26] M. Abadi et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
  • [27] E. Sabir, S. Rawls, and P. Natarajan, “Implicit language model in lstm for ocr,” in Proc. 14th IAPR Int. Conf. Document Analysis and Recognition (ICDAR), vol. 07, Nov. 2017, pp. 27–31.
  • [28] V. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Soviet Physics Doklady, vol. 10, p. 707, 1966.