Text recognition is considered one of the earliest computer vision tasks tackled by researchers. For more than a century since its inception as a field of research, researchers have never stopped working on it. This can be attributed to two important factors.
First, text is pervasive and important to our everyday life: it is a visual encoding of language used extensively to communicate and preserve all kinds of human thought. Second, this necessity and pervasiveness have imposed strong requirements on how text is delivered and received, which has led to the large and ever-increasing variability of visual forms text can appear in. Text can originate either printed or handwritten, with large variability in handwriting styles, printing fonts, and formatting options. It can be found organized in documents as lines, tables, or forms, or cluttered in natural scenes. It can suffer countless types and degrees of degradation, viewing distortion, occlusion, background clutter, spacing, slanting, and curvature. Text from spoken languages alone (an important subset of human-produced text) appears in dozens of scripts corresponding to thousands of languages. All of this has contributed to the long-standing, complicated nature of unconstrained text recognition.
Since deep convolutional neural networks (CNNs) won the ImageNet image classification challenge, deep learning based techniques have invaded most tasks related to computer vision, either equating or surpassing all previous methods at a fraction of the required domain knowledge and field expertise. Text recognition was no exception: methods based on CNNs and Recurrent Neural Networks (RNNs) have dominated all text recognition tasks like OCR, handwriting recognition, scene text recognition, and license plate recognition, and have become the de facto standard for the task.
Despite their success, one can spot a number of shortcomings in these works. First, for many tasks, an RNN is required for achieving state of the art results, which brings in non-trivial latencies due to the sequential processing nature of RNNs. This is surprising given that, for purely visual text recognition, long-range dependencies have no effect and only a local neighborhood should affect the final frame or character classification results. Second, for each of these tasks we have a separate model that can, with its own set of tweaks and tricks, achieve state of the art results on a single task or a small number of tasks; no single model has been demonstrated effective on the wide spectrum of text recognition tasks. Choosing a different model or feature extractor for different input data, even inside a limited problem like text recognition, is a great burden for practitioners and clearly contradicts the idea of automatic, data-driven representation learning promised by deep learning methods.
In this work, we propose a simple, novel neural network architecture for generic, unconstrained text recognition. Our proposed architecture is a fully convolutional network that consists mostly of depthwise separable convolutions with novel inter-layer residual connections and gating, trained on full line or word labels using the CTC loss . We also propose a set of generic data augmentation techniques that are suitable for any text recognition task and show how they affect the performance of the system. We demonstrate the superior performance of our proposed system through extensive experimentation on seven public benchmark datasets. We were also able, for the first time, to demonstrate human-level performance on the reCAPTCHA dataset proposed recently in a Science paper , a more than 20% absolute increase in CAPTCHA recognition rate compared to their proposed RCN system. We also achieve state of the art performance on SVHN  (the full sequence version), the unconstrained settings of the IAM English offline handwriting dataset , the KHATT Arabic offline handwriting dataset , the University of Washington (UW3) OCR dataset , and the AOLP license plate recognition dataset  (in all divisions). Our proposed system also won the ICFHR2018 Competition on Automated Text Recognition on a READ Dataset , achieving a more than 25% relative decrease in Character Error Rate (CER) compared to the second-place entry.
To summarize, we address the unconstrained text recognition problem. In particular, we make the following contributions:
We propose a simple, novel neural network architecture that is able to achieve state of the art performance, with feed forward connections only (no recurrent connections), and using only the highly efficient convolutional primitives.
We propose a set of data augmentation procedures that are generic to any text recognition task and can boost the performance of any neural network architecture on text recognition.
We conduct an extensive set of experiments on seven benchmark datasets to demonstrate the validity of our claims about the generality of our proposed architecture. We also perform an extensive ablation study on our proposed model to demonstrate the importance of each of its submodules.
Section 2 gives an overview of related work in the field. Section 3 describes our architecture design and its training process in detail. Section 4 describes our extensive set of experiments and presents their results.
Figure 1: Detailed diagram of our proposed network. We explicitly present details of the network, with a clear description of each block in the legend below the diagram. We also highlight each dimensionality change of input tensors as they flow through the network. (a) Our overall architecture; here, we assume the network has a number of stacked instances of the repeated block. (b) Internals of the repeated block.
2 Related Work
Work on text recognition is enormous and spans over a century; here we focus on methods based on deep learning, as they have been the state of the art for at least five years now. Traditional methods are reviewed in [20, 21], and very old methods are reviewed in .
There are two major trends in deep learning based sequence transduction problems, both avoiding the need for fine-grained alignment between source and target sequences. The first is using CTC , and the second is using sequence-to-sequence (encoder-decoder) models [22, 23], usually with attention .
2.1 CTC Models
In these models, the neural network is used as an emission model: the input signal is divided into frames, and the model computes, for each input frame, the likelihood of each symbol of the target-sequence alphabet. A CTC layer then computes, for a given input sequence, a probability distribution over all possible output sequences. An output sequence can then be predicted either greedily (i.e., taking the most likely output at each time step) or using beam search. A thorough explanation of CTC can be found in .
BLSTM + CTC was used to transcribe online handwriting data in , and also for OCR in . Deep BLSTM + CTC was used for handwriting recognition (with careful preprocessing) in , and on HOG features for scene text recognition in . This architecture was mostly abandoned in favor of CNN+BLSTM+CTC.
Deep MDLSTM  (interleaved with convolutions) + CTC achieved the best results on offline handwriting recognition from their introduction in  until recently [32, 33, 34]. However, their performance was recently equated by the much faster CNN+BLSTM+CTC architecture [35, 36]. MDLSTMs have also shown good results in speech recognition .
Various flavors of CNN+BLSTM+CTC currently deliver state of the art results in many text recognition tasks. Breuel  shows how they improve over BLSTM+CTC for OCR tasks. Li et al.  show state of the art license plate recognition using them. Shi et al.  use it for scene text recognition, although it is no longer the state of the art there. Puigcerver  and Dutta et al.  used CNN+BLSTM+CTC to equate the performance of MDLSTMs in offline handwriting recognition.
There is a recent shift to recurrence-free architectures in all sequence modeling tasks  due to their much better parallelization. We can see the trend of CNN+CTC in sequence transduction problems as used in  for speech recognition, or in [42, 43] for scene text recognition. The main difference between these latter works and ours is that, while we also utilize CNN+CTC, (1) we use a completely different CNN architecture, and (2) while previous CNN+CTC work was mostly demonstrative of the viability and competitiveness of the approach and was carried out on a specific task, we show strong state of the art results on a wide range of text recognition tasks.
2.2 Encoder-Decoder Models
These models consist of two main parts. First, the encoder reads the whole input sequence; then the decoder produces the target sequence token by token using information produced by the encoder. An attention mechanism is usually utilized to allow the decoder to automatically (soft-)search for parts of the source sequence that are relevant to predicting the current target token [22, 23, 24].
In , a CNN+BLSTM is used as an encoder and a GRU  as a decoder for scene text recognition. A CNN encoder and an RNN decoder are used for reading street name signs from multiple views in . Recurrent convolutional layers  are used as an encoder and an RNN as a decoder for scene text recognition in . In , a CNN+BLSTM is used as an encoder and an LSTM as a decoder for handwriting recognition; they achieve competitive results, but their inference time is prohibitive, as they had to use beam search with a very wide beam to achieve good results. Bluche et al.  tackle full paragraph recognition (without line segmentation); they use an MDLSTM as an encoder, an LSTM as a decoder, and another MDLSTM for computing attention weights.
3 Proposed Architecture

In this section, we present the details of our proposed architecture. First, Section 3.1 gives an overview of the proposed architecture. Then, Section 3.2 presents the derivation of the gating mechanism utilized by our architecture. After that, Section 3.3 discusses the data augmentation techniques we use during training. Finally, Section 3.4 describes our model training and evaluation process.
3.1 Overall Architecture
Figure 1 presents a detailed diagram of the proposed architecture. We opted to use a very detailed diagram for illustration instead of a generic one, in order to make it straightforward to implement the proposed architecture in any of the popular deep learning frameworks without much guessing.
As shown in Figure 1(a), our proposed architecture is a fully convolutional network (FCN) that makes heavy use of both Batch Normalization and Layer Normalization  to increase convergence speed and regularize the training process. We also use Batch Renormalization  on all Batch Normalization layers; this allows us to use batch sizes as low as 2 with minor effects on final classification performance. Our main computational block is the depthwise separable convolution. Using it instead of regular convolutions gives the same or better classification performance, but at a much faster convergence speed and a drastic reduction in parameter count and computational requirements. We also found the inductive bias of depthwise convolutions to have a strong regularization effect on training: for our original architecture with regular convolutions, we had to use spatial DropOut  (i.e., randomly dropping entire channels) after nearly every convolution, in contrast to the single spatial DropOut layer after the stack of repeated blocks that we currently use. An important point to note here is that, in contrast to other works that use stacks of depthwise separable convolutions [54, 55], we found it crucial to use Batch Normalization between the depthwise convolution and the 1x1 pointwise convolution . We also note that we found spatial DropOut to provide much better regularization than regular, unstructured DropOut  applied to the whole tensor.
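The parameter savings from depthwise separable convolutions can be illustrated with a quick count (a sketch with illustrative layer sizes, not the exact shapes used in our network; biases and normalization parameters are ignored):

```python
def conv_params(k, c_in, c_out):
    # regular convolution: one k x k filter per (input, output) channel pair
    return k * k * c_in * c_out

def sep_conv_params(k, c_in, c_out):
    # depthwise separable: k x k depthwise filters + 1x1 pointwise projection
    return k * k * c_in + c_in * c_out

regular = conv_params(3, 128, 128)        # 147,456 parameters
separable = sep_conv_params(3, 128, 128)  # 1,152 + 16,384 = 17,536 parameters
```

For this illustrative 3x3, 128-to-128-channel layer, the separable variant uses roughly 8x fewer parameters.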
First, the input image is layer-normalized; input features are then projected to 16 channels using a 1x1 convolution, and the resulting tensor is softmax-normalized, which was found to greatly increase convergence speed on most datasets. Next, each channel is preprocessed independently with a 13x13 filter using a depthwise convolution; then we employ the only dense connection used in the network, concatenating, on the channel dimension, the result of this preprocessing with the layer-normalized original image. The result is fed to a stack of repeated blocks that are responsible for the majority of the input sequence modeling and target sequence generation, through a fixed number of in-place whole-sequence transformation stages. This was demonstrated to be effectively an unrolled iterative estimation of the target sequence. Spatial DropOut is applied to the output of the last block, then the channel dimension is projected to the size of the target alphabet plus one (required for the CTC blank ) using a 1x1 convolution. Finally, we perform global average pooling on the height dimension; this step enables the network to handle inputs of any height. The result is layer-normalized and softmax-normalized, then fed to the CTC loss along with its corresponding text label during training.
Stacked gated blocks are the main computational blocks of our model. Using attention gates to control inter-layer information flow, in the sense of filtering out unimportant signals (e.g., background noise) and sharpening or intensifying important signals, has received a lot of interest from researchers lately [59, 60]. However, the original concept is old [61, 62]. Many patterns of gating mechanisms have been proposed in the literature, first for connecting RNN time steps as a technique to mitigate the vanishing gradient problem [61, 63], then extended to connecting the layers of any feed-forward neural network. This was done either for training very deep networks [64, 65] or, as stated previously, for use as smart, content-aware filters.
However, most of the attention gates suggested recently for connecting CNN layers use ad-hoc mechanisms tested only on the target datasets, without comparison to other mechanisms. We opted to base our attention gates on the gating mechanism suggested by Highway Networks , as it is based on the widely used, heavily studied LSTM networks. Also, a number of variants of Highway networks were studied, compared, and contrasted in subsequent studies [66, 58]. We also validate the superiority of these gates experimentally in Section 4.
Assume a plain feed-forward L-layer neural network, with layers H_l, l in {1, ..., L}. Each of these layers applies a non-linear transformation H_l to its input y_{l-1} to get its output y_l, such that

y_l = H_l(y_{l-1})    (1)
In a Highway network, we have two additional non-linear transformations T_l and C_l. These act as gates that can dynamically pass part of their inputs and suppress the other part, conditioned on the input itself. Therefore:

y_l = H_l(y_{l-1}) · T_l(y_{l-1}) + y_{l-1} · C_l(y_{l-1})    (2)
The authors suggest tying the two gates, such that we have only one gate T_l and the other is its opposite, C_l = 1 - T_l. This way we find

y_l = H_l(y_{l-1}) · T_l(y_{l-1}) + y_{l-1} · (1 - T_l(y_{l-1}))    (3)
This can be re-factored to

y_l = (H_l(y_{l-1}) - y_{l-1}) · T_l(y_{l-1}) + y_{l-1}    (4)
We should note here that the dimensionality of y_{l-1}, H_l(y_{l-1}), T_l(y_{l-1}), and C_l(y_{l-1}) must be the same for this equation to be valid. This can be an important limitation, from both the computational and memory usage perspectives, for wide networks (e.g., CNNs in which each of those tensors has a large number of channels) or even small networks whose tensors have big spatial dimensions.
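The tied-gate form and its refactoring are algebraically identical; a small elementwise sanity check (pure Python, with toy values standing in for H(x) and T(x)) makes this concrete:

```python
def tied_gate(h, t, x):
    # y = H(x)*T(x) + x*(1 - T(x))   (tied Highway gates)
    return [hi * ti + xi * (1 - ti) for hi, ti, xi in zip(h, t, x)]

def refactored(h, t, x):
    # y = (H(x) - x)*T(x) + x        (residual-style refactoring)
    return [(hi - xi) * ti + xi for hi, ti, xi in zip(h, t, x)]

x = [0.5, -1.0, 2.0]
h = [1.0, 0.25, -0.5]   # stand-in for H(x)
t = [0.9, 0.1, 0.5]     # stand-in for the gate T(x), values in (0, 1)
assert all(abs(a - b) < 1e-12 for a, b in zip(tied_gate(h, t, x), refactored(h, t, x)))
```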
If we were to use the same original plain network but with residual connections, we would have

y_l = H_l(y_{l-1}) + y_{l-1}    (5)
Comparing Equations 4 and 5, it is evident that both equations have the same residual connection on the right-hand side, and thus we can interpret the left-hand side of Equation 4 as the transformation function applied to the input y_{l-1}. This interpretation motivates us to perform two modifications to Equation 4.
First, we note that the left-hand side of Equation 4 performs a two-stage filtering operation on its input before the residual connection is added: the output of the transformation function H_l is first subtracted from the input signal y_{l-1}, then multiplied by the gating function T_l. We argue that all transformations applied to the input signal should be learned and conditioned on the input signal itself. So instead of using a single transformation function H_l, we propose using two transformation functions H1_l and H2_l, such that:

y_l = (H1_l(y_{l-1}) - H2_l(y_{l-1})) · T_l(y_{l-1}) + y_{l-1}    (6)
In all experiments in this paper, H1_l, H2_l, and T_l are implemented as depthwise separable convolutions, each followed by its own activation function.
Second, the residual-connection interpretation, along with the dimensionality problem noted above, motivates us to make a simple modification to Equation 4. This allows us to efficiently utilize Highway gates even for wide and deep networks.
Let Q be a transformation mapping y_{l-1} to a lower-dimensional representation, and Q' the opposite transformation, mapping back to the dimensionality of y_{l-1}. We can change Equation 4 such that

z = Q(y_{l-1})    (7)
y_l = Q'((H_l(z) - z) · T_l(z)) + y_{l-1}    (8)
When using a sub-sampling transformation for Q, this allows us to retain the optimization benefits of residual connections (the right-hand side of Equation 8) whilst computing Highway gates on a lower-dimensional representation of the input, which can be done much faster and uses less memory. In all our experiments in this paper, Q and Q' are implemented as depthwise separable convolutions with the Exponential Linear Unit (ELU)  as activation function. Also, we always choose Q such that sub-sampling is done only on the channel dimension, while the spatial dimensions are kept the same.
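A rough sketch of this channel-reduced gating follows. Toy pointwise (1x1) projections stand in for Q and Q', and tanh/sigmoid stand in for the learned transformations; the actual model uses depthwise separable convolutions with ELU activations, so this is only an illustration of the shape bookkeeping and the residual path:

```python
import math
import random

def mix(x, w):
    # 1x1 (pointwise) channel mixing: maps len(x) channels to len(w) channels
    return [[sum(wi[c] * x[c][s] for c in range(len(x)))
             for s in range(len(x[0]))] for wi in w]

def gate_block(x, q, q_back):
    z = mix(x, q)                                            # z = Q(x), fewer channels
    h = [[math.tanh(v) for v in ch] for ch in z]             # stand-in for H(z)
    t = [[1 / (1 + math.exp(-v)) for v in ch] for ch in z]   # stand-in for gate T(z)
    g = [[(hv - zv) * tv for hv, zv, tv in zip(hc, zc, tc)]  # (H(z) - z) * T(z)
         for hc, zc, tc in zip(h, z, t)]
    up = mix(g, q_back)                                      # Q'(...), back to C channels
    return [[u + xi for u, xi in zip(uc, xc)]                # + residual connection
            for uc, xc in zip(up, x)]

random.seed(0)
C, C_low, S = 8, 2, 5   # channels, reduced channels, (flattened) spatial size
x = [[random.uniform(-1, 1) for _ in range(S)] for _ in range(C)]
q = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C_low)]
q_back = [[random.uniform(-1, 1) for _ in range(C_low)] for _ in range(C)]
y = gate_block(x, q, q_back)
assert len(y) == C and len(y[0]) == S   # residual keeps input dimensionality
```

Note that the gates are computed on the C_low-channel tensor z, while the residual addition happens at the full C channels.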
3.3 Data Augmentation
Encoding domain knowledge into data through label-preserving transformations has always been very successful and has provided fundamental contributions to the success of data-driven learning systems, both in the past and after the recent revival of neural networks. We here describe a number of data augmentation techniques that are generic and can benefit most text recognition problems; these techniques are applied independently to each input sample during training.
3.3.1 Projective Transforms
Text lines or words to be recognized can suffer various types of scaling or shearing and, more generally, projective transforms. We apply a random projective transform to every input training sample by randomly sampling four points that correspond to the four corners of the original image (i.e., the sampled points represent the new corners of the image), under the following criteria:
We either change the x coordinates of all four points and fix the y coordinates, or vice versa. We found it almost always better to change the image either vertically or horizontally, but not both simultaneously.
When changing the image either vertically or horizontally, the new distance between any connected pair of corners must not be less than half the original one, nor more than double it.
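A minimal sketch of corner sampling under these criteria (the offset ranges for shifting the corners are illustrative assumptions; only the edge lengths along the perturbed axis are constrained here):

```python
import random

def sample_corners(w, h, axis, rng=random):
    """Sample new image corners for a random projective transform (sketch).

    One axis is perturbed at a time, and each perturbed edge length is kept
    within [0.5, 2] times its original length. Corners are returned in the
    order: top-left, top-right, bottom-right, bottom-left.
    """
    if axis == "x":                            # move x coordinates, keep y fixed
        top = rng.uniform(0.5 * w, 2 * w)      # new top-edge length
        bot = rng.uniform(0.5 * w, 2 * w)      # new bottom-edge length
        xt = rng.uniform(-0.25 * w, 0.25 * w)  # illustrative corner shifts
        xb = rng.uniform(-0.25 * w, 0.25 * w)
        return [(xt, 0), (xt + top, 0), (xb + bot, h), (xb, h)]
    else:                                      # move y coordinates, keep x fixed
        left = rng.uniform(0.5 * h, 2 * h)
        right = rng.uniform(0.5 * h, 2 * h)
        yl = rng.uniform(-0.25 * h, 0.25 * h)
        yr = rng.uniform(-0.25 * h, 0.25 * h)
        return [(0, yl), (w, yr), (w, yr + right), (0, yl + left)]
```

The warping itself would then map the original corners onto the sampled ones with a standard 4-point homography.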
3.3.2 Elastic Distortions
Elastic distortions were first proposed in  as a way of emulating uncontrolled oscillations of hand muscles. They were then used in  for augmenting images of handwritten digits. In , the authors apply a displacement field to each pixel of the original image, such that each pixel moves by a random amount in both the x and y directions. The displacement field is built by convolving a randomly initialized field with a Gaussian kernel. An obvious problem with this technique is that it operates at a very fine level (the pixel level), which makes it more likely that the resulting image is unrealistically distorted.
To counter this problem, we take the approach used by  and , working at a coarser level to better preserve the localities of the input image. We generate a regular grid and apply a random displacement field to the control points of the grid; the original image is then warped according to the displaced control points. It is worth noting that the use of a coarse grid makes it unnecessary to apply a smoothing Gaussian to the displacement field. However, unlike , we do not align the grid to the baseline of the input image, which makes the technique much simpler to implement. We also impose the following criteria on the randomly generated displacement field, which is uniformly sampled in our implementation:
We apply the distortions either in the x direction or in the y direction, but not both. As with the projective transforms, we found it better to apply distortions to one dimension at a time.
We force the minimum width or height of a distorted grid rectangle to be strictly positive. In other words, zero or negative values, although possible, are not allowed.
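A one-axis sketch of such a coarse displacement field (the 0.49-cell displacement cap is an illustrative assumption that guarantees strictly positive cell sizes; the warping of the image against the displaced grid is omitted):

```python
import random

def distorted_grid(n_cells, cell, rng=random):
    # control-point positions of a regular 1-D grid along the chosen axis
    pts = [i * cell for i in range(n_cells + 1)]
    # displace interior control points only; capping |d| below cell/2
    # guarantees every distorted cell keeps a strictly positive size
    max_d = 0.49 * cell
    return [p + (rng.uniform(-max_d, max_d) if 0 < i < n_cells else 0)
            for i, p in enumerate(pts)]
```

The same field generator is run on either the x or the y axis, never both, matching the first criterion above.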
3.3.3 Sign Flipping
We randomly flip the sign of the input image tensor. This is motivated by the fact that text content is invariant to its color or texture, and one of the simplest methods to instill this invariance is sign flipping. It is important to stress that this augmentation improves performance even if all training and testing images have the same text and background colors. The fact that color-inversion data augmentation improves text recognition performance was previously noticed in .
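A minimal sketch (assuming a zero-centered, e.g. layer-normalized, image tensor represented as nested lists; the flip probability p is an illustrative parameter):

```python
import random

def sign_flip(img, p=0.5, rng=random):
    # for a zero-centered image, negating the tensor swaps
    # dark-text-on-light for light-text-on-dark and vice versa
    if rng.random() < p:
        return [[-v for v in row] for row in img]
    return img
```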
3.4 Model Training and Evaluation
We train our model using the CTC loss function. This enables using unsegmented pairs of line images and corresponding text transcriptions to train the model, without any character- or frame-level alignment. Every epoch, training instances are sampled without replacement from the set of training examples. The previously described augmentation techniques are applied during training in a batch-wise manner, i.e., the same randomly generated values are applied to the whole batch.
To generate target sequences, we simply use CTC greedy decoding : taking the most probable label at each time step, then mapping the result using the CTC sequence mapping function, which first removes repeated labels and then removes blanks. In all our experiments, we do not utilize any form of language modeling or any other non-visual cues.
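The greedy decoding and CTC mapping steps can be sketched as follows (pure Python; `frame_probs` is a per-frame distribution over the alphabet, with label 0 assumed to be the blank):

```python
def ctc_greedy_decode(frame_probs, blank=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], None
    for s in best:
        if s != prev and s != blank:   # repeats removed first, then blanks
            out.append(s)
        prev = s
    return out
```

Because repeats are collapsed before blanks are removed, a blank between two identical labels (e.g. frames 1, 1, blank, 1) still yields two output symbols.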
We evaluate the quality of our models using the popular Levenshtein edit distance , evaluated at the character level and normalized by the length of the ground truth, which is commonly known as the Character Error Rate (CER). Due to the inherent ambiguity of some of the datasets used for evaluation (e.g., handwriting datasets), we need another metric to illustrate how much of this ambiguity has been captured by our model, and by how much model performance could increase if another signal (other than the visual one, e.g., a linguistic one) could help rank the possible output transcriptions. For this task, we use the CER@TopN metric, which computes the CER as if we were allowed to choose, for each output sequence, the candidate with the lowest CER among those returned by a CTC beam search decoder . This gives us an upper bound on the maximum possible performance gain from using a secondary re-ranking signal.
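The CER computation can be sketched as follows (a standard dynamic-programming edit distance; a generic sketch, not tied to any particular evaluation toolkit):

```python
def levenshtein(a, b):
    # edit distance over insertions, deletions, and substitutions,
    # computed row by row to keep memory linear in len(b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(hyp, ref):
    # character error rate: edit distance normalized by ground-truth length
    return levenshtein(hyp, ref) / len(ref)
```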
4 Experiments

In this section, we evaluate the proposed architecture and compare it to previous state of the art techniques by conducting extensive experiments on seven public benchmark datasets covering a wide spectrum of text recognition tasks.
4.1 Implementation Details
For all our experiments, we use an initial learning rate of , which is exponentially decayed such that it reaches  of its value after  batches. We use the maximum batch size allowed by our platform, with a minimum size of . All our experiments are performed on a machine equipped with a single Nvidia TITAN Xp GPU. All our models are implemented using TensorFlow.
The variable part of our model is the stack of repeated blocks, which is (in our experiments) parameterized by three parameters, written as n(c1, c2): n is the number of blocks, and c1 and c2 are the numbers of channels in the first and third convolutional layers of each block. For all instances of our model used in the experiments, the number of channels is changed only in these first and third layers.
4.2 Vicarious reCAPTCHA Dataset
In a recent Science paper , Vicarious presented RCN, a system that could break most modern text-based CAPTCHAs in a very data-efficient manner; they claimed it is multiple orders of magnitude more data-efficient than rival deep-learning based systems in breaking CAPTCHAs, and also more accurate. To support those claims, they collected a number of datasets from multiple online CAPTCHA systems and had them annotated by humans. We chose to compare against the reCAPTCHA subset specifically, since they also analyzed human performance on this subset (shown in Table I). The reCAPTCHA dataset consists of 5500 images: 500 for training and 5000 for testing, with the critical aspect being generalization to unseen CAPTCHAs using only the 500 training images. In our experiments, we split the training images into 475 used for training and 25 used for validation.
Before comparing our results directly to RCN's, it is important to note that all RCN evaluations are case-insensitive, while our evaluations use the much harder case-sensitive evaluation. We did so because case-sensitive evaluation is the norm for text recognition, and we were already getting accurate results without simplifying the problem.
As shown in Table I, without data augmentation and with half the input spatial resolution, our system is already 10% more accurate than RCN. In fact, it is considerably more accurate than that, since RCN's results are case-insensitive. When we apply data augmentation and use a deeper model, our model is 20% better than RCN and, more importantly, surpasses the human performance estimated in their paper.
|Method||Input Size||Seq. Error (%)||Params (M)||Year|
|11-layer Maxout CNN ||64x64||3.96||51||2013|
|Single DRAM ||64x64||5.1||14||2014|
|Single DRAM MC ||64x64||4.4||14||2014|
|Double DRAM MC ||64x64||3.9||28||2014|
|ST-CNN Single ||64x64||3.7||33||2015|
|ST-CNN Multi ||64x64||3.6||37||2015|
|EDRAM Single ||64x64||4.36||11||2017|
|EDRAM Double ||64x64||3.6||22||2017|
4.3 Street View House Numbers (SVHN) Dataset

Street View House Numbers (SVHN)  is a challenging real-world dataset released by Google. It contains around 249k real-world images of house numbers, divided into 235k images for training and validation and 13k for testing; the task is to recognize the sequence of digits in each image. There are between 1 and 5 digits in each image, with large variability in texture, scale, and spatial arrangement.
Following , we formed a validation set of 5k images by randomly sampling images from the training set. We also convert input RGB images to grayscale, as many authors [77, 78] noticed that this does not affect performance. Although most of the literature resizes the cropped digits image to 64x64 pixels, we found that 32x32 pixels was sufficient for our model, leading to much faster processing of input images. We only applied the data augmentations introduced in Section 3.3.
As shown in Table II, after progress on this benchmark had stalled since 2015 at a full sequence error of 3.6%, we were able to advance it to 3.3% using only 0.9M parameters. That is a massive reduction in parameter count compared to the second best performing system, and at better classification accuracy. Using a larger model with 3.4M parameters, we further advance the full sequence error to 3.1%.
4.4 University of Washington Database III
|Method||Input Size||CER (%)||Line Norm.||Epochs|
|LSTM ||32 x W||0.60||True||100|
|LSTM ||48 x W||0.40||True||500|
|CNN-LSTM ||48 x W||0.17||True||500|
|CNN-LSTM ||48 x W||0.43||False||500|
|Ours: 4(128,512)||32 x W||0.10||False||5|
|Ours: 8(128,512)||32 x W||0.08||False||8|
The University of Washington Database III (UW3) dataset  contains 1600 pages of document images from scientific journals and other common sources. We use the text lines extracted by , 100k lines split into 90k for training and 10k for testing. As an OCR benchmark, it presents a different set of challenges, such as small original input resolutions and the need for extremely low CER.
Geometric line normalization was introduced in  to aid text line recognition using 1D LSTMs, which are not translationally invariant along the vertical axis. It was also found essential for CNN-LSTM networks . It mainly targets Latin scripts, which have many character pairs that are nearly identical in shape but are distinguished from each other by their position and size relative to the text line's baseline and x-height. It is a fairly complex, script-specific operation.
The only preprocessing we perform is resizing input images such that the height of the text line is 32 pixels, while maintaining the aspect ratio of the line image.
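This aspect-ratio-preserving resize amounts to the following (a trivial sketch; target height 32 as stated above):

```python
def target_size(w, h, target_h=32):
    # scale the width by the same factor used to bring the height to target_h,
    # so the line's aspect ratio is preserved
    return round(w * target_h / h), target_h
```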
As shown in Table III, even a fairly small model, without any form of input text line normalization or script-specific knowledge, achieves a state of the art CER of 0.10% on this dataset. Using a deeper model, we get a slightly better 0.08% CER. It is also noteworthy that our models take two orders of magnitude fewer epochs to converge to their final classification accuracy, and use smaller input spatial dimensions.
4.5 Application-Oriented License Plate (AOLP) Dataset
The Application-Oriented License Plate (AOLP) dataset  has 2049 images of Taiwanese license plates. It is categorized into three subsets with different levels of difficulty for detection and recognition: access control (AC), traffic law enforcement (LE), and road patrol (RP). The evaluation protocol used in all previous work on AOLP is to train on two subsets and test on the third (e.g., for the RP results, we train on AC and LE and test on RP).
As shown in Table IV, we achieve state of the art results by a small margin on AC and LE, and by a big margin on the challenging RP subset. It is important to note that while  uses extensive data augmentation on the shape and color of the input license plate, we use only our described augmentation techniques and convert the input image to grayscale (thus discarding most color information).
4.6 KHATT database
KFUPM Handwritten Arabic TexT (KHATT)  is a freely available offline handwritten text database of modern Arabic. It consists of scanned handwritten pages by different writers with different text content. Page segmentation into lines is also provided, allowing line-based recognition systems to be evaluated directly, without layout analysis. Forms were scanned at 200, 300, and 600 dpi. The database consists of 4000 paragraphs written by 1000 distinct writers across 18 countries: 2000 unique, randomly selected paragraphs with different text content, and 2000 fixed paragraphs with the same text content covering all shapes of Arabic characters.
Following [16, 82], we carry out our experiments on the 2000 unique, randomly selected paragraphs, using the database's default split of these paragraphs into a training set of 4825 text lines, a test set of 966 text lines, and a validation set of 937 text-line images.
The main preprocessing step we performed on KHATT was removing the large extra white space found in some line images, as done by . We had to perform this step because the alternative was using line images with a large height to account for this excessive space, which can be, in some images, many times the actual line height.
As in all our experiments, we compare only with methods that report pure optical results (i.e., without using any language models or syntactic constraints). As shown in Table V, we achieve a significant reduction in CER compared to the previous state of the art; this is partly due to the complexity of Arabic script, and partly due to the fact that little work has previously been done on KHATT.
4.7 IAM Handwriting Database
|Method||Preprocessing||Input Scale||Aug.||Val CER (%)||Test CER (%)|
|5-layer MLP, 1024 units each ||De-skewing, de-slanting, contrast normalization, and region height normalization||72 x W||No||-||15.6|
|7-layer BLSTM, 200 units each ||Same as above||72 x W||No||-||7.3|
|3 layers of [Conv. + 2D-LSTM], (4, 20, 100) units ||Mean and variance normalization of the pixel values|| || || || |
|3 layers of [Conv. + 2D-LSTM], wide layers ||De-slanting and inversion||Original||No||5.35||7.85|
|5 layers of [Conv. + 2D-LSTM], wide layers ||Same as above||Original||No||4.42||6.64|
|4 layers of [Conv. + 2D-LSTM], wide layers, double input conv. ||Same as above||Original||No||4.31||6.39|
|5 Convs. + 5 BLSTMs ||De-skewing and binarization||128 x W||No||5.1||8.2|
|5 Convs. + 5 BLSTMs ||Same as above||128 x W||Yes||3.8||5.8|
|STN  + ResNet18 + BLSTM stack ||De-skewing, de-slanting, model pre-training||-||Yes||-||5.7|
|Ours: 8(128,1024)||None||32 x W||Yes||4.1||5.8|
|Ours: 16(128,1024)||None||32 x W||Yes||3.3||4.9|
The IAM database  (modern English) is probably the most famous offline handwriting benchmark. It was compiled by the FKI-IAM Research Group. The dataset was handwritten by 657 different writers and is composed of 1539 scanned text pages corresponding to English texts extracted from the LOB corpus . The line images are partitioned into writer-independent training, validation, and test sets of 6161, 966, and 2915 lines respectively. The original line images in the training set have an average width of 1751 pixels and an average height of 124 pixels. There are 79 different character classes in the dataset.
Since the primary concern of this work is image-based text recognition and how to get the best CER, purely out of visual data, we do not compare to any method that uses additional linguistic data to enhance the recognition rates of their systems (e.g. through the use of language models).
In Table VII we give an overview of, and comparison to, pure optical methods in the literature. As we can see, our method achieves an absolute 1% decrease in CER (a relative 16% improvement) compared to the previous state of the art [35, 36], which used a CNN + BLSTM. This holds although we use input that is down-sampled 4 times compared to , and although  (which achieves 0.1% better CER than ) uses pre-training on a big synthetic dataset  and also uses test-time data augmentation.
An important question in visual classification problems, and text recognition is no exception, is: given an ambiguous case, can our model disentangle this ambiguity and shortlist the possible output classes? In an attempt to answer this question we propose the CER@Top-N metric, which accepts N output sequences for an input image and returns the minimum CER computed between the ground truth and each of the output sequences. CTC beam search decoding can output a list of the top N candidate output sequences, ordered by their likelihood as computed by the model. We use this list to compute CER@Top-N. For all our experiments we use a constant beam width of 10.
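Given the candidate list from beam search, the metric is straightforward to compute; a minimal sketch with a plain dynamic-programming edit distance (function names are illustrative):

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def cer(hyp, ref):
    """Character error rate of one hypothesis against the ground truth."""
    return levenshtein(hyp, ref) / max(len(ref), 1)

def cer_at_top(candidates, ref):
    """CER@Top-N: minimum CER over the top-N beam search candidates."""
    return min(cer(h, ref) for h in candidates)
```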
In Table VI we can see that when our model is allowed to give multiple answers it achieves significant gains: allowing a second choice gives a 15% relative reduction in CER for both models. For our larger model, with only 5 possible output sequences (full lines), it can match the state-of-the-art result on IAM , which uses a combination of word and character 10-gram language models, both trained on the combined LOB , Brown , and Wellington  corpora containing around 3.5M running words.
4.8 ICFHR2018 Competition on Automated Text Recognition
| Method | Konzil C | Schiller | Ricordi | Patzig | Schwerin | < 10 | Architecture | Training Aug. | Test Aug. |
|---|---|---|---|---|---|---|---|---|---|
| OSU | 3.7874 | 12.4514 | 15.0412 | 12.5371 | 3.4996 | 2 | 7 Convs. + 2 BLSTMs | Yes (grid distortion, rescaling, rotation and shearing) | - |
| ParisTech | 8.0163 | 14.5801 | 30.1986 | 15.5085 | 9.1846 | 2 | 13 Convs. + 3 BLSTMs + word LM | Yes | Yes |
| LITIS | 4.8064 | 19.5665 | 16.3744 | 12.8284 | 6.6127 | 2 | 11 Convs. + 2 BLSTMs + 9-gram LM | Yes (baseline biasing) | No |
| PRHLT | 4.9847 | 12.5486 | 28.5164 | 16.3506 | 7.1178 | 2 | 4 Convolutional Recurrent + 4 BLSTMs + 7-gram char-LM | No | No |
| RPPDI | 9.1797 | 16.2908 | 30.4939 | 28.2998 | 24.9046 | 1 | 4 layers of [Conv. + 2D-LSTM], double input conv. + 4-gram word-LM | No | No |
The purpose of this open competition is to achieve a minimum character error rate on a previously unknown text corpus from which only a few pages are available for adjusting an already pre-trained recognition engine . This is a very realistic setup, much more probable than the usual assumption that thousands of hand-annotated lines are already available, which is usually unrealistic due to the associated cost.
The dataset associated with the competition consists of 22 heterogeneous documents. Each of them was written by only one writer, but in different time periods, in Italian and in modern and medieval German. The training data is divided into a general set (17 documents) and a document-specific set (5 documents) with the same scripts as the test set. The training set comprises roughly 25 pages per document. The test set contains the same 5 documents as the document-specific training set but uses a different set of pages: 15 pages for each document.
Each participant has to submit 4 transcriptions per test set, based on 0, 1, 4, or 16 additional (specific) training pages, where 0 pages corresponds to a baseline system without additional training. To identify the winner of the competition, the CER of all four transcriptions for each of the 5 test documents is calculated (i.e., each submission comprises 20 separate files). Our system is currently the best performing system in this open competition, achieving more than a 25% relative decrease in CER compared to the second-place entry, as shown on the competition website .
In Table VIII we can see that our method achieves a significant reduction in CER compared to the other submissions reported in . We achieve the best result in every single metric reported in . Not only does our method achieve the best results in the various forms of CER reported in Table VIII, but we also achieve the best improvement in CER from 0 specific training pages to 16 specific training pages, with nearly an 80% decrease in CER, which means our method is the most data-efficient of all submissions.
An important measure of a method’s quality is its applicability to real-life use cases and resource constraints: given 16 well-transcribed pages of a given document, how far can we go? This is shown in Table IX. To assess the significance of the presented results, we use an empirical rule developed within the READ project  (mentioned in ), stating that transcriptions which achieve a CER below 10% are already helpful as an initial transcription that can be corrected quickly by a human operator. We can see that while most submitted methods manage at most 2 documents with a CER below 10% when using 16 additional pages, our method raises this number to 4 out of the 5 tested documents. For the fifth document we achieve a CER of 11.4% (slightly above 10%). It is worth noting that while all training documents and the 4 other test documents are German (from various dialects and centuries), this document (Ricordi) is Italian, and it is the only Italian resource in the whole dataset.
4.9 Ablation Study
| # | Gating variant | Val CER (%) | Test CER (%) |
|---|---|---|---|
| 6 | With residual connections, no gates | 25.53 | 37.30 |
| 7 | No residual connections, with gates | 29.85 | 43.06 |
| 8 | No residual connections, no gates | 25.89 | 38.23 |

| Normalization variant | Val CER (%) | Test CER (%) |
|---|---|---|
| No Layer Normalization | 44.92 | 68.24 |
| No Layer Normalization at start and end | 29.29 | 42.93 |
| No Batch Normalization | 35.59 | 50.81 |
| No Softmax before GateBlocks | 24.52 | 36.12 |
| Tanh instead of Softmax before GateBlocks | 24.43 | 36.11 |
We conduct an extensive ablation study here to highlight which parts of our model are the most crucial to the system’s final classification performance. All experiments in this section are performed on the IAM handwriting dataset , as it presents sufficient complexity to highlight the significance of the different components of our system. We resize input images to a height of 32 pixels while maintaining the aspect ratio; however, we limit image width to 200 pixels and down-sample feature maps to a maximum of 4 x 50 pixels to limit the computational requirements of our experiments. We also use a small baseline model - 8(32,64) - for the same purpose. All experimented models use a fixed parameter budget of roughly 71k parameters.
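The aspect-preserving rescale with a width cap can be sketched as follows (nearest-neighbor sampling in plain NumPy for self-containedness; in practice any image library's resize would do, and the function name is illustrative):

```python
import numpy as np

def resize_line_image(img, target_h=32, max_w=200):
    """Resize a grayscale line image to `target_h` pixels high, keeping the
    aspect ratio, with width capped at `max_w`. Nearest-neighbor sketch."""
    h, w = img.shape
    new_w = min(max(1, round(w * target_h / h)), max_w)
    rows = (np.arange(target_h) * h / target_h).astype(int)  # sampled rows
    cols = (np.arange(new_w) * w / new_w).astype(int)        # sampled cols
    return img[rows][:, cols]
```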
4.9.1 Gating Function
First, we experiment with gating functions other than the baseline in Equation 8. For notational simplicity we remove  and  from the equation, such that it becomes
In Table X, the second model uses the gating function utilized by most previous work that made use of inter-layer gating functions in CNNs, like [59, 93, 42, 94]. It can be seen that our baseline model achieves a 20% relative reduction in CER. This superiority may, however, be tied to our architectural choices. The third model uses a simplification of our proposed gating function with one transformation function, resembling the T-Only Highway network from . It achieves slightly better CER than the baseline; we opted for the baseline as it gives roughly the same performance with two transformation functions, and so has more expressive power. The fourth and fifth models represent other variations on our proposed scheme, where we change the input filtering order by performing multiplication before subtraction. The last three models illustrate the importance of gating and residual connections. It is interesting to note that using gates without residual connections is much worse than removing both.
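For reference, the gating form common in prior work (the second model in Table X) is the classic highway gate y = h(x)·t(x) + x·(1−t(x)). A minimal functional sketch is below; the weight matrices `Wh` and `Wt` are illustrative placeholders, and this is deliberately not our baseline gating function from Equation 8.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_gate(x, Wh, Wt):
    """Classic highway-style gating (sketch): a learned gate t in [0, 1]
    interpolates between the transformed input h and the identity path."""
    h = np.tanh(x @ Wh)   # candidate transformation
    t = sigmoid(x @ Wt)   # transform gate
    return h * t + x * (1.0 - t)
```

With zero-initialized `Wt`, the gate starts at 0.5, mixing the identity and transform paths equally; biasing `Wt` negative pushes the layer toward the identity, which is the usual highway initialization trick.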
In Table XI, we show the effect of removing different parts of the normalization mechanisms utilized by our model. The second row shows that removing Layer Normalization has the most drastic effect on the model’s performance. The third row shows that even just removing the two Layer Normalization layers at the beginning and end of the model increases CER by almost 30%. The fourth row shows that Batch Normalization is also crucial to the model’s performance, although not as important as Layer Normalization. The last two rows show the effect of removing the Softmax normalization before GateBlocks, or replacing it with just a Tanh activation function.
In this paper, we tackled the problem of general, unconstrained text recognition. We presented a simple, data- and computation-efficient neural network architecture that can be trained end-to-end on variable-sized images using variable-sized line-level transcriptions. We conducted an extensive set of experiments on seven public benchmark datasets covering a wide range of text recognition sub-tasks, and demonstrated state-of-the-art performance on each of them using the same architecture with minimal changes in hyper-parameters. The experiments demonstrated both the generality of our architecture and the simplicity of adapting it to any new text recognition task. We also conducted an extensive ablation study on our model that highlighted the importance of each of its submodules.
Our presented architecture is general enough for application to many sequence prediction problems. A further research direction that we think is worthy of investigation is using it for automatic speech recognition (ASR), especially given its sample efficiency and high noise robustness, as demonstrated by its performance on handwriting recognition and scene text recognition.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.
-  H. F. Schantz, The history of OCR, optical character recognition. Recognition Technologies Users Association Manchester, VT, 1982.
-  K. Fukushima and S. Miyake, “Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition,” in Competition and cooperation in neural nets. Springer, 1982, pp. 267–285.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, “High-performance ocr for printed english and fraktur using lstm networks,” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013, pp. 683–687.
-  A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 5, pp. 855–868, 2009.
-  M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Reading text in the wild with convolutional neural networks,” International Journal of Computer Vision, vol. 116, no. 1, pp. 1–20, 2016.
-  H. Li and C. Shen, “Reading car license plates using deep convolutional neural networks and lstms,” arXiv preprint arXiv:1601.05610, 2016.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
-  L. Sifre and S. Mallat, “Rigid-motion scattering for image classification,” Ph.D. dissertation, Citeseer, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.
-  D. George, W. Lehrach, K. Kansky, M. Lázaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang et al., “A generative vision model that trains with high data efficiency and breaks text-based captchas,” Science, vol. 358, no. 6368, p. eaag2612, 2017.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, no. 2, 2011, p. 5.
-  U.-V. Marti and H. Bunke, “The iam-database: an english sentence database for offline handwriting recognition,” International Journal on Document Analysis and Recognition, vol. 5, no. 1, pp. 39–46, 2002.
-  S. A. Mahmoud, I. Ahmad, W. G. Al-Khatib, M. Alshayeb, M. T. Parvez, V. Märgner, and G. A. Fink, “Khatt: An open arabic offline handwritten text database,” Pattern Recognition, vol. 47, no. 3, pp. 1096–1112, 2014.
-  I. Phillips, “User’s reference manual for the uw english/technical document image database iii,” UW-III English/Technical Document Image Database Manual, 1996.
-  G.-S. Hsu, J.-C. Chen, and Y.-Z. Chung, “Application-oriented license plate recognition,” IEEE transactions on vehicular technology, vol. 62, no. 2, pp. 552–561, 2013.
-  T. Strauß, G. Leifert, R. Labahn, T. Hodel, and G. Mühlberger, “Icfhr2018 competition on automated text recognition on a read dataset,” 2018, pp. 477–482.
-  Y. Zhu, C. Yao, and X. Bai, “Scene text detection and recognition: Recent advances and future trends,” Frontiers of Computer Science, vol. 10, no. 1, pp. 19–36, 2016.
-  Q. Ye and D. Doermann, “Text detection and recognition in imagery: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 7, pp. 1480–1500, 2015.
-  K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
-  I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
-  A. Hannun, “Sequence modeling with ctc,” Distill, vol. 2, no. 11, p. e8, 2017.
-  A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
-  A. Graves, M. Liwicki, H. Bunke, J. Schmidhuber, and S. Fernández, “Unconstrained on-line handwriting recognition with recurrent neural networks,” in Advances in neural information processing systems, 2008, pp. 577–584.
-  T. Bluche, “Deep neural networks for large vocabulary handwritten text recognition,” Ph.D. dissertation, Université Paris Sud-Paris XI, 2015.
-  B. Su and S. Lu, “Accurate scene text recognition based on recurrent neural network,” in Asian Conference on Computer Vision. Springer, 2014, pp. 35–48.
-  A. Graves, S. Fernández, and J. Schmidhuber, “Multi-dimensional recurrent neural networks,” in International Conference on Artificial Neural Networks. Springer, 2007, pp. 549–558.
-  A. Graves and J. Schmidhuber, “Offline handwriting recognition with multidimensional recurrent neural networks,” in Advances in neural information processing systems, 2009, pp. 545–552.
-  T. Bluche, J. Louradour, M. Knibbe, B. Moysset, M. F. Benzeghiba, and C. Kermorvant, “The a2ia arabic handwritten text recognition system at the open hart2013 evaluation,” in Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on. IEEE, 2014, pp. 161–165.
-  V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, “Dropout improves recurrent neural networks for handwriting recognition,” in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 285–290.
-  P. Voigtlaender, P. Doetsch, and H. Ney, “Handwriting recognition with large multidimensional long short-term memory recurrent neural networks,” in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on. IEEE, 2016, pp. 228–233.
-  J. Puigcerver, “Are multidimensional recurrent layers really necessary for handwritten text recognition?” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 67–72.
-  K. Dutta, P. Krishnan, M. Mathew, and C. Jawahar, “Improving cnn-rnn hybrid networks for handwriting recognition,” 2018.
-  J. Li, A. Mohamed, G. Zweig, and Y. Gong, “Exploring multidimensional lstms for large vocabulary asr,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4940–4944.
-  T. M. Breuel, “High performance text recognition using a hybrid convolutional-lstm implementation,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 11–16.
-  B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 11, pp. 2298–2304, 2017.
-  S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
-  Y. Wang, X. Deng, S. Pu, and Z. Huang, “Residual convolutional ctc networks for automatic speech recognition,” arXiv preprint arXiv:1702.07793, 2017.
-  Y. Gao, Y. Chen, J. Wang, and H. Lu, “Reading scene text with attention convolutional sequence modeling,” arXiv preprint arXiv:1709.04303, 2017.
-  F. Yin, Y.-C. Wu, X.-Y. Zhang, and C.-L. Liu, “Scene text recognition with sliding convolutional character models,” arXiv preprint arXiv:1709.01727, 2017.
-  B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, “Aster: an attentional scene text recognizer with flexible rectification,” IEEE transactions on pattern analysis and machine intelligence, 2018.
-  Z. Wojna, A. N. Gorban, D.-S. Lee, K. Murphy, Q. Yu, Y. Li, and J. Ibarz, “Attention-based extraction of structured information from street view imagery,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 844–850.
-  M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3367–3375.
-  C.-Y. Lee and S. Osindero, “Recursive recurrent nets with attention modeling for ocr in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2231–2239.
-  A. Chowdhury and L. Vig, “An efficient end-to-end neural model for handwritten text recognition,” arXiv preprint arXiv:1807.07965, 2018.
-  T. Bluche, J. Louradour, and R. Messina, “Scan, attend and read: End-to-end handwritten paragraph recognition with mdlstm attention,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 1050–1055.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
-  S. Ioffe, “Batch renormalization: Towards reducing minibatch dependence in batch-normalized models,” in Advances in Neural Information Processing Systems, 2017, pp. 1945–1953.
-  J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object localization using convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 648–656.
-  F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
-  L. Kaiser, A. N. Gomez, and F. Chollet, “Depthwise separable convolutions for neural machine translation,” arXiv preprint arXiv:1706.03059, 2017.
-  M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  K. Greff, R. K. Srivastava, and J. Schmidhuber, “Highway and residual networks learn unrolled iterative estimation,” arXiv preprint arXiv:1612.07771, 2016.
-  F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 6450–6458.
-  J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, 2017.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural computation, vol. 3, no. 1, pp. 79–87, 1991.
-  F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with lstm,” Neural Computation, vol. 12, no. 10, pp. 2451–2471, 2000.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
-  N. Kalchbrenner, I. Danihelka, and A. Graves, “Grid long short-term memory,” arXiv preprint arXiv:1507.01526, 2015.
-  Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, “Character-aware neural language models.” in AAAI, 2016, pp. 2741–2749.
-  D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.
-  P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., Aug 2003, pp. 958–963.
-  D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, June 2012, pp. 3642–3649.
-  C. Wigington, S. Stewart, B. Davis, B. Barrett, B. Price, and S. Cohen, “Data augmentation for recognition of handwritten words and lines using a cnn-lstm network,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017, pp. 639–645.
-  M. D. Bloice, C. Stocker, and A. Holzinger, “Augmentor: an image augmentation library for machine learning,” arXiv preprint arXiv:1708.04680, 2017.
-  E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation policies from data,” arXiv preprint arXiv:1805.09501, 2018.
-  V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet physics doklady, vol. 10, no. 8, 1966, pp. 707–710.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation by averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
-  I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, “Multi-digit number recognition from street view imagery using deep convolutional neural networks,” arXiv preprint arXiv:1312.6082, 2013.
-  J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
-  M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
-  A. Ablavatski, S. Lu, and J. Cai, “Enriched deep recurrent visual attention model for multiple object recognition,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 971–978.
-  C. Wu, S. Xu, G. Song, and S. Zhang, “How many labeled license plates are needed?” in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2018, pp. 334–346.
-  R. Ahmad, S. Naz, M. Z. Afzal, S. F. Rashid, M. Liwicki, and A. Dengel, “Khatt: A deep learning benchmark on arabic script,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017, pp. 10–14.
-  D. Castro, B. L. D. Bezerra, and M. Valença, “Boosting the deep multidimensional long-short-term memory network for handwritten recognition systems,” in 16th International Conference on Frontiers in Handwriting Recognition. IEEE, 2018.
-  M. Villegas, V. Romero, and J. A. Sánchez, “On the modification of binarization algorithms to retain grayscale information for handwritten text recognition,” in Iberian Conference on Pattern Recognition and Image Analysis. Springer, 2015, pp. 208–215.
-  S. Johansson, “The lob corpus of british english texts: presentation and comments,” ALLC journal, vol. 1, no. 1, pp. 25–36, 1980.
-  P. Krishnan and C. Jawahar, “Generating synthetic data for text recognition,” arXiv preprint arXiv:1608.04224, 2016.
-  S. Johansson, “The tagged LOB corpus: User’s manual,” 1986.
-  W. N. Francis, A manual of information to accompany A standard sample of present-day edited American English, for use with digital computers. Department of Linguistics, Brown University, 1971.
-  L. Bauer, Manual of information to accompany the Wellington corpus of written New Zealand English. Department of Linguistics, Victoria University of Wellington Wellington, 1993.
-  E. Chammas, C. Mokbel, and L. Likforman-Sulem, “Handwriting recognition of historical documents with few labeled data,” in 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). IEEE, 2018, pp. 43–48.
-  “ICFHR2018 competition on automated text recognition on a read dataset,” https://scriptnet.iit.demokritos.gr/competitions/10/scoreboard/, accessed: 2018-11-20.
-  “READ: Recognition and enrichment of archival documents,” https://read.transkribus.eu/, accessed: 2018-11-20.
-  M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai, “Scene text recognition from two-dimensional perspective,” arXiv preprint arXiv:1809.06508, 2018.
-  O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018.