Unlike other object recognition problems, text line patches in text images usually have a large aspect ratio, as shown in Figure 1. This property becomes problematic when the image is distorted by a planar transformation such as rotation or perspective. An algorithm based on horizontal bounding boxes usually picks out lines with a low percentage of pixels of interest from distorted images.
Some solutions attempt to handle distortion in recognition through data augmentation with more labeled samples. However, it is very expensive to collect and label a large dataset covering different rotational and perspective distortions. Moreover, larger datasets require more complicated models, which consume more computation and storage in both training and deployment. Synthetic data has recently been proposed and used for learning on scene text images. For Chinese characters, overcoming this problem through data augmentation is difficult and impractical because of the size of the character set.
An orthogonal approach is to learn the transformation parameters and recover the original view, as a human would. Methods of this kind usually rely on assumptions such as the existence of borderlines, the availability of camera parameters, or a low-rank property of text images, which may not hold when noise and complex textures are present.
Since the emergence of deep neural networks (DNN) [6, 7, 8], they have gradually reshaped the methodology of machine learning by pushing the state of the art of many tasks forward and enabling novel applications [9, 10]. Convolutional neural networks [11] with deep architectures have achieved top performance in many computer vision tasks [12, 13]. In text image analysis, deep OCR methods using LSTM have shown superior recognition performance but may fail under planar transformations.
Recently, some studies have tried to handle affine and perspective transformations with deep neural networks. One of them is the spatial transformer network (STN), which attempts to improve classification accuracy by inserting a transformer layer into an end-to-end neural network. It can detect the learned features and transform them into a better view, and hence increase accuracy. However, one drawback is that it cannot rectify planar transformations as stably as human beings do.
In this paper, we propose a new DNN framework that combines the advantages of STN and supervised learning to learn the rectification of planar transformations, as shown in Figure 1. The only assumption is that a few parallel, human-readable text lines exist in the rectified image. We give the transformation parameters directly to the DNN and train it to learn the rectification parameters stage by stage. Rotation is estimated by classification rather than regression, by discretizing the rotation range into intervals. The convolution kernels are initialized differently from commonly used methods to achieve better performance.
The advantages of this model include a much milder assumption, with no requirement on formats or prior knowledge of the dataset. Deep models are also more robust and more adaptive to variations than traditional image processing methods. In addition, we found that although no segmentation labels are provided to the model, it learns to separate the input image into different types of regions according to context. The rectification training seems to teach the model to focus on meaningful elements and regions.
The model is benchmarked on a newly collected dataset containing Chinese texts of different types under different illumination conditions. We use collected real-world data, transformed with generated parameters, for training and testing. The results indicate the effectiveness and robustness of the proposed architecture.
2 Methods and Models
2.1 Background, Notation and Formulation
Planar transformations are common in image processing and can be modeled as a perspective transformation. Consider an original image and its transformed image: each pixel at a given coordinate in one image is rectified and interpolated to a pixel at the corresponding coordinate in the other. The relevant parameters are the rotation angle, the scaling factor, the perspective parameters, and the translation. The planar transformation can be formulated as:
Most recognition algorithms use bounding boxes and object detection methods to locate the text area, which can cope with different translations and scales. While some algorithms can handle non-horizontal text lines, most of them require parallel text lines, or simply locate each word or group of neighboring texts separately. Under a perspective transformation the parallel condition no longer holds, and the crossing point of these lines can be estimated with image processing methods only under strong assumptions.
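As an illustration of such a planar transformation, the following minimal sketch composes a 3x3 matrix from a translation, a rotation with scaling, and a perspective part, and applies it to points in homogeneous coordinates. The function names, the argument names (theta, scale, p1, p2, tx, ty) and the composition order are assumptions made for this example; they are not the notation of equation (1).

```python
import numpy as np

def compose_homography(theta, scale, p1, p2, tx, ty):
    """Build a 3x3 planar transformation: translate, then rotate and scale, then perspective."""
    translation = np.array([[1.0, 0.0, tx],
                            [0.0, 1.0, ty],
                            [0.0, 0.0, 1.0]])
    c, s = np.cos(theta), np.sin(theta)
    rotation_scale = np.array([[scale * c, -scale * s, 0.0],
                               [scale * s,  scale * c, 0.0],
                               [0.0,        0.0,       1.0]])
    # Non-zero p1, p2 make parallel lines converge towards a vanishing point.
    perspective = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0],
                            [p1,  p2,  1.0]])
    return perspective @ rotation_scale @ translation

def apply_homography(H, points):
    """Map an (N, 2) array of points through H using homogeneous coordinates."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

With p1 = p2 = 0 the bottom-right element of the composed matrix is exactly 1, and with small perspective and translation values it stays close to 1.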
2.2 Parameter Entangling and Stage-wise Model
Given the labeled data, it would be straightforward to build a CNN and treat the task as a regression problem. The feature learning capability, high representational power, and robustness to noise of CNNs make them a natural choice for this task. However, it is hard to estimate all the parameters simultaneously because the mapping to approximate is highly nonlinear. The rotation parameters are seriously interfered with by the perspective transformation, since the parallel property disappears, as shown in (1).
Although it is difficult to regress all parameters simultaneously, we can build the model stage by stage, inspired by earlier pipeline-based research. If we first eliminate the perspective distortion and recover the parallel property of the text lines, estimating the rotation angle becomes feasible. Mathematically, we can decompose (1) into:
Assuming the text area is not far off-center in the image, the bottom-right element deviates only slightly from 1. This means that the transformation is equivalent to a translation, followed by a rotation and then a perspective transformation. We rectify in the inverse order: we first learn the perspective parameters, then estimate the rotation angle, and finally locate the text area. Since locating text areas has been studied extensively, we focus on the first two steps.
The transformation parameters in these two steps are learned with a two-stage training model. The CNN architecture proposed in this paper rectifies the perspective and rotation transformations of text images in two stages using the loss function
In (4), each sample in the dataset is paired with a tuple of perspective parameters and with a one-hot encoding of the angle parameter at a predefined discretization precision, which is set to 2 degrees in this paper. The loss function consists of two parts: the first is a neural network that learns to estimate the perspective parameters, and the second is an angle classifier that estimates the rotation angle using cross-entropy loss instead of regression.
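The two ingredients described above, discretizing the angle into 2-degree intervals and combining a regression term with a cross-entropy term, can be sketched as follows. This is a minimal illustration under stated assumptions: the angle range of -30 to 30 degrees, the squared-error form of the regression term, the equal weighting of the two parts, and all function names are hypothetical and are not taken from equation (4).

```python
import numpy as np

BIN_WIDTH_DEG = 2.0  # discretization precision stated in the text

def angle_to_bin(angle_deg, min_deg=-30.0, max_deg=30.0, width=BIN_WIDTH_DEG):
    """Return the index of the interval containing angle_deg (the range is an assumption)."""
    n_bins = int(round((max_deg - min_deg) / width))
    idx = int(np.floor((angle_deg - min_deg) / width))
    return min(max(idx, 0), n_bins - 1)  # clamp angles on or beyond the range limits

def one_hot(index, n_classes):
    """One-hot encoding of an interval index."""
    vec = np.zeros(n_classes)
    vec[index] = 1.0
    return vec

def two_part_loss(pred_persp, true_persp, angle_logits, true_bin):
    """Squared error on the perspective parameters plus cross-entropy on the angle interval."""
    diff = np.asarray(pred_persp, dtype=float) - np.asarray(true_persp, dtype=float)
    regression = float(np.sum(diff ** 2))
    logits = np.asarray(angle_logits, dtype=float)
    shifted = logits - np.max(logits)                       # numerically stable log-softmax
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    cross_entropy = float(-log_probs[true_bin])
    return regression + cross_entropy
```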
A supervised STN connects these two components into an end-to-end neural system, as introduced in section 2.5. We give explicit rectification values to the DNN and feed its predictions to the STN, in the hope that the end-to-end model better captures the geometric concept as humans perceive it.
2.3 Perspective Learning
Traditional methods study the relationship between various indicators and the perspective parameters. These indicators either summarize local features or attempt to find the vanishing point introduced by the perspective transformation, which measures the level of parallel distortion. The perspective parameters to estimate are of the same magnitude, so the Jacobian matrix is no longer ill-conditioned. Therefore, we can numerically approximate the mapping with fully connected neural networks.
2.4 Rotation Learning
To get an accurate prediction, it is intuitive to regress the angle on low-level features as a continuous variable, usually with the L2 norm as the regression loss. However, after several pooling layers, straight lines take a zigzag shape in digital images, and the difference between close samples on the feature map contributes very little to the loss. This mapping is therefore highly nonlinear and hard to regress with a classical neural network; in particular, it is difficult to identify a proper structure for the hidden layers.
A general method is to discretize the continuous variable into disjoint intervals and use the interval labels as a surrogate for the continuous value. Although we use discrete labels as output, what we care about is not the classification accuracy but the residual error of the estimate, since we are estimating an ordinal value rather than a cardinal one.
Clearly, discretizing the range into an infinite number of intervals is equivalent to continuous regression. We can also improve the classifier as an estimator of the angle by adding a penalty term. To test this, we add a penalty term that minimizes the distance between the ground truth and the prediction, and use a weight to trade off between this term and the cross entropy. With this additional term, the classification accuracy does not differ significantly from that of (4), but the variance of the errors decreases. Cross entropy makes no assumption about the prior distribution, while the penalty term partly introduces a Gaussian prior into the estimation. From another view, this term penalizes large deviations from the ground truth.
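One possible form of such a penalized classification loss is sketched below. The exact penalty used in the paper is not recoverable from the text, so this example assumes the penalty is the expected squared angular distance between the predicted intervals and the true interval under the softmax distribution; the names penalized_angle_loss and lam are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

def penalized_angle_loss(logits, true_bin, lam=1.0, bin_width_deg=2.0):
    """Cross entropy plus lam times an assumed squared-distance penalty."""
    probs = softmax(logits)
    cross_entropy = -np.log(probs[true_bin])
    bins = np.arange(len(probs))
    # Expected squared distance (in degrees) between predicted and true interval.
    distance_penalty = np.sum(probs * ((bins - true_bin) * bin_width_deg) ** 2)
    return float(cross_entropy + lam * distance_penalty)
```

Two predictions that put the same probability on the true interval have the same cross entropy, but the one that places its remaining mass on distant intervals receives the larger penalized loss, which is the behavior the penalty is meant to encourage.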
2.5 Supervised Spatial Transformer Network
Angle classification in our model needs features free from perspective transformation. An end-to-end neural network is more appropriate and usually performs better. We propose a supervised STN to connect the stages into an integral system.
The transformer layer of STN is a multilayer neural network whose input is the feature map output by the previous layers and whose output is a set of parameters describing the transformation it has learned. No other information is fed into the layer, and the features it learns come only from the back-propagated gradient of the classification loss. After several epochs of training, the transformer is able to transform feature maps into the form best suited for classification.
However, its transformation differs from human vision, since no information about the human notion of geometric rectification is provided during training. In practice, we find that the performance of STN is quite sensitive to the ratio of the object area to the image size. If the object of interest is small relative to the image, STN fails to locate the object and to estimate the other transformation parameters.
The difference between the transformation learned by STN and human vision remains an obstacle to better recognition and understanding of document images. The supervised STN proposed in this paper, shown in Figure 3(b), connects the hidden layer of the STN localization network to the labels of the transformation parameters and adds it as a component of the final loss function. This enables the neural network to obtain an estimator that learns rectification as humans perceive it. In other words, supervised information is provided to the hidden layers of the STN, enabling them to learn a specified objective. In an even larger neural network with more objectives, it can serve as an essential part that transforms the features into more suitable spaces.
Like STN, several supervised STNs can be inserted between network layers with different loss components. If the transformation is not feasible in a single stage, we can decompose it into several stages and assign each stage the corresponding subset of label values. In this way, highly complex transformations can be approximated and eliminated, which helps to build an end-to-end neural system.
2.6 Integral Architecture
The primary architecture of the model used in this paper is shown in Figure 5. Each stage is a supervised STN that uses the respective transformation parameters as labels and targets. The perspective parameters are estimated in the first stage, using the CNN described in section 2.3. The low-level feature map produced by convolution is transformed by the supervised STN and fed into the second stage. In back-propagation, once the first-stage parameters are trained to acceptable accuracy, the feature map can be inversely transformed backward to provide gradients to the transformation parameters of the first stage.
Inside the network, dropout is applied to improve generalization. A batch normalization layer is inserted after each pooling layer and connected to the next convolution substructure, which has a positive effect. The STN receives the transformation parameters from the regression result for the perspective parameters, together with the input feature map output by the front layers, and then inversely transforms the input tensor with linear interpolation.
To estimate the rotation angle, the output feature map of the first-stage STN is first processed with convolutions and pooling. To produce the rectified image, the input image passes through the STNs of both stages in succession, using the estimated parameters of each stage.
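The inverse transformation with linear interpolation performed by an STN sampling step can be illustrated with a small NumPy sketch. It assumes a single-channel feature map and a 3x3 matrix that maps output pixel coordinates back to input coordinates; the function name warp_bilinear and the zero padding outside the map are choices made for this example, not details taken from the paper.

```python
import numpy as np

def warp_bilinear(feature_map, H_out_to_in):
    """Resample a 2-D map: each output pixel pulls its value from the input at H @ (x, y, 1)."""
    h, w = feature_map.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = H_out_to_in @ coords
    sx, sy = src[0] / src[2], src[1] / src[2]   # source coordinates after perspective divide
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    wx, wy = sx - x0, sy - y0                   # fractional offsets used as bilinear weights
    out = np.zeros(h * w)
    neighbours = [(0, 0, (1 - wx) * (1 - wy)), (0, 1, wx * (1 - wy)),
                  (1, 0, (1 - wx) * wy),       (1, 1, wx * wy)]
    # Accumulate the four neighbours, treating samples outside the map as zero.
    for dy, dx, weight in neighbours:
        xi, yi = x0 + dx, y0 + dy
        valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
        out[valid] += weight[valid] * feature_map[yi[valid], xi[valid]]
    return out.reshape(h, w)
```

Passing the identity matrix returns the input unchanged, while a half-pixel horizontal shift averages each pixel with its right-hand neighbour.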
Unlike the traditional separated pipeline framework, this model puts all components together to form an end-to-end DNN. Training such an integral system jointly benefits from combined feature learning.
2.7 Convolution Kernel Setting
Angle detection is a special task. Unlike most tasks, which have highly diverse samples (for example, classifying birds versus chairs), here rotated copies of the same object of arbitrary shape yield nearly the same output feature map. Strings of text in an image, however, can be treated as flat rectangles or wide lines. To reduce the number of model parameters, a global average layer as in GoogLeNet is chosen instead of a dense layer. This structure helps to improve generalization with far fewer model parameters.
Suppose the convolution kernel is a tilted banded matrix with a given direction, with nonzero elements inside the band and zeros elsewhere. When it is convolved with another tilted banded matrix, the convolution is maximized at that location if both are tilted in the same direction. If they match perfectly, the maximum element of the convolution output is the maximum among all candidate angles. This initialization can help the system find a better starting point, and this derivation makes it unsurprising that global average pooling using the maximum method performs better than a fully connected model. It also greatly reduces the number of model parameters and improves generalization.
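The matching argument above can be checked with a small sketch that builds tilted banded kernels and compares their responses to a banded patch. The one-pixel band width, the normalization by the number of band pixels, and the function names are assumptions for illustration; the paper does not specify its initialization at this level of detail.

```python
import numpy as np

def banded_kernel(size, angle_deg):
    """Square kernel with a one-pixel-wide band through the centre at angle_deg, zeros elsewhere."""
    kernel = np.zeros((size, size))
    centre = (size - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    for t in np.linspace(-centre, centre, 4 * size):
        col = int(round(centre + t * np.cos(theta)))
        row = int(round(centre - t * np.sin(theta)))  # image rows grow downward
        kernel[row, col] = 1.0
    return kernel / kernel.sum()  # normalize so bands of different length are comparable

def best_matching_angle(patch, candidate_angles):
    """Return the candidate angle whose banded kernel responds most strongly to the patch."""
    size = patch.shape[0]
    responses = [np.sum(banded_kernel(size, a) * patch) for a in candidate_angles]
    return candidate_angles[int(np.argmax(responses))]
```

For example, a binary patch containing a band tilted at 60 degrees gives its largest response to the 60-degree kernel among the candidates 0, 30, 60, 90, 120 and 150 degrees.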
Another important hyper-parameter is the kernel size of the angle convolution. Smaller kernels have less capacity to represent direction. Larger kernels generalize worse because more parameters are involved. In addition, both the time complexity of the convolution and the number of model parameters grow quadratically with the kernel size. We need to balance computation, generalization, and representational power.
As far as we know, no public text image rectification dataset is appropriate for this task. The dataset must satisfy several properties: 1) no marks, borders, or boundaries of regular shape exist in the patches unless they occur in the image content itself; 2) the text comes from a large character set with diverse patterns; 3) text lines vary in font, font size, and other settings. Diversity in illumination conditions and camera parameters is also necessary.
The dataset is divided into caption samples and text samples. A caption sample contains an image with a paragraph of text under it, and the font and font size of the text are the same within a single patch. For text samples, we generate random settings for every line and draw the lines on each patch.
Six patches are arranged in a grid and drawn on a large sample image. Four different kinds of marks are drawn at the four corners of each sample image, leaving enough space that the generated patches used for training contain no interfering marks.
We then printed these samples on different types of paper, including plain A4 paper of different colors, kraft paper, and card paper with different textures. The printed samples were photographed using cameras with different configurations under various illumination conditions.
The original digital images are manually labeled at the corner marks and aligned. In total, 243 images are collected, including 158 text samples and 85 caption samples. Using the preserved location values, we extract 6 patches from each sample. For each patch, we generate several random sets of translation, scaling, rotation, and perspective parameters and apply them to the patch. The generated parameters and the transformed patches serve as target values and samples for training and verification. Some sample patches are shown in Figure 1. The only difference between this synthetic dataset and real-world data comes from the interpolation used when applying the transformations, to which deep models are not sensitive.
3.2 Implementation Details
| A | 1st stage | B | 2nd stage |
We establish the DNN architecture shown in Tab. 1. Its notation covers the number of angle intervals; conv, which denotes m connected convolution layers with a given channel count and kernel size; STN(A, B), which denotes an STN layer with A as the transformation parameters and B as the feature map to transform; and the chosen kernel size. We use the indexes in Tab. 1 to refer to layers in the later description.
Our input image is 256 pixels wide. The fully connected layers act as the approximation learner, and we need to balance their scale. It is also necessary to guarantee that the output feature map is not too small for the angle convolution layer to detect angles. We finally choose 4 convolution-pooling layers to keep the number of model parameters appropriate.
A larger kernel size can improve the accuracy of angle estimation, but it reduces generalization power and increases training and validation time. We finally choose 9 for B3. We set the loss weight to 1.0 to balance the two components of the loss.
We first train the regression network and use the trained model parameters to initialize the angle classifier. Choosing the learning rate is more art than science, and neither a large nor a small value is appropriate: a large learning rate makes learning oscillate far from the optimum, while with a small one training may get stuck in a bad local minimum. We fine-tune the learning rate carefully.
3.3 Performance and Analysis
The experiments run on a server with a GTX 1080 GPU and a Xeon-2630v3 CPU. For this paper, we use 10000 patches from the dataset, split into 8000 for training, 1000 for validation, and 1000 for testing. We generate the perspective parameters from a uniform distribution over an interval that gives a largest parallel distortion of 24 degrees. Because ReLU can only give a positive response, these parameters are transformed before regression and inversely transformed in the STN. Rotation angles of different scales are generated and used to test whether the learning ability is correlated with the range of the entangled transformation parameter. The translation and scaling parameters are generated with the guarantee that the patch does not cross the image boundaries after all four transformations are applied.
We conduct many experiments with different model configurations and hyper-parameter settings. Comparison between these settings could give more detailed understanding about the method and its functioning principles.
3.3.1 Perspective Regression
Some studies on estimating the perspective parameters need assumptions on the range of the rotation angle, or perform better only within a small range. Estimating the perspective parameters is the foundation of the entire system, so we need to check whether the range of the rotation angle has a large effect on it.
We generate datasets with different ranges of the rotation angle. If the learning capability were limited by the scale of rotation, performance would differ considerably across them. The loss and bias of the regression results are shown in Tab. 2.
The figures in Tab. 2 reflect the generalization capability under different angle scales. The outcome varies, yet within a regular range. From these results we conclude that deep regression generalizes well across scales. However, there is still bias on the validation dataset, and whether it is accurate enough to estimate the rotation angle remains to be tested.
3.3.2 Angle Classification
| Expr. | Acc | Var | Top 2 | Top 5 |
As introduced in section 3.2, the angle classifier is trained starting from the model parameters learned in the perspective stage. We design experiments to verify the angle classifier under different settings; the results are shown in Tab. 3. We choose the parameters with the best performance on the training and validation datasets and evaluate them on the test dataset. The models compared include the original shared kernel model, the independent kernel model, and the model whose loss includes the penalty term. The consistency and effectiveness of the estimates are analyzed. Furthermore, we examine the output through top-k accuracy, that is, the accuracy of the closest prediction among the k largest softmax outputs.
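The top-k measure described above can be computed as in the following sketch: a sample counts as correct when its true interval is among the k intervals with the largest softmax output, which is the case exactly when the closest of those k predictions coincides with the ground truth. The function name and the array layout (one row of probabilities per sample) are assumptions for this example.

```python
import numpy as np

def top_k_accuracy(probs, true_bins, k):
    """Fraction of samples whose true interval is among the k largest softmax outputs."""
    probs = np.asarray(probs, dtype=float)
    true_bins = np.asarray(true_bins)
    top_k = np.argsort(-probs, axis=1)[:, :k]          # indices of the k largest outputs per row
    hits = np.any(top_k == true_bins[:, None], axis=1)
    return float(np.mean(hits))
```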
Shared versus Independent Kernels
As explained in the previous section, kernel parameter sharing reuses the convolution kernels of the front layers. Besides reducing model parameters, jointly training the two tasks helps the model learn better features for both estimation tasks. Comparing the first two rows of Tab. 3, we find that parameter sharing brings faster convergence and better accuracy: within the same number of training epochs, it reaches higher training accuracy than the independent model, and its performance on the validation and test datasets also shows better generalization.
The variance of the prediction error in Tab. 3 likewise shows that the shared kernel model is more effective. The error distribution is shown as a violin plot in Figure 6. Judging from the distribution, the prediction error is more concentrated around 0 for the shared kernel model, and Tab. 3 also indicates its better robustness in all entries.
This result can be attributed to the difficulty of training a large neural network for angle prediction. With shared kernels, optimization of the front layers starts from a better initial point. For the independent model, the gradient propagated to the front layers has little effect, since the differences between samples of neighboring angles are quite small after pooling. It is then difficult to learn appropriate features for this task.
Classification With Additional Term
Another experiment focuses on the impact of the penalty term. Based on the analysis above, we expect that adding it to the loss function gives a larger penalty to large deviations. From another view, the term assumes a Gaussian prior. This constraint, if meaningful, would help improve the consistency and effectiveness of the estimate.
As shown in Tab. 3, within the same training period, the results on the training dataset suggest that the model without the penalty performs better, whereas on the validation and test datasets the model with the penalty achieves the best result. However, the outcomes are very close: the difference on the test dataset is at most 3%. The other indicators in Tab. 3 lead to a similar conclusion.
This outcome is reasonable, since the prior knowledge and the penalty may not be helpful in learning such a complicated mapping. Choice of doesn’t affect much since loss component of cross-entropy and estimation of and only takes if we choose .
3.3.3 Integral Performance
Internal Evolution Mechanism
Deep learning has always been criticized for its black-box nature and puzzling internal mechanism. Since many methods that rectify each transformation with classical image processing have been proposed, it is worth examining how the DNN evolves to learn rectification and how those methods can help us understand its approach.
We explore the internal mechanism by observing and comparing the feature maps output by the middle layers. Remarkably, even though no explicit segmentation or context information is given to the DNN, it evolves to focus on meaningful areas: some kernels segment the input features into background, text lines, space between lines, and other meaningful non-text regions.
Figure 7 illustrates this with several manually picked but typical components of the output feature maps. The first four rows come from the first stage and the last row from the second. The two patches are representative of the dataset: one contains only text lines, while the other is a caption image with lines of description under the image. From the front layers to the end layers, the kernels gradually learn to focus on different image elements. In rows 1 to 3, the feature maps describe the image at different scopes, from local to larger scale, and from edge detection towards a fuzzy segmentation of shape contours. In the third row of Figure 7 for the text line case, the convolution tries to distinguish between text regions, line spacing, and background. For the caption case, which has a complicated background to analyze, it achieves a comparable result.
How the rear layers learn to estimate the perspective parameters is hard to analyze. Some components of the feature map indicate how this works in the text line case: given the contours and segmentation from the front layers, the network locates the uppermost, bottommost, and leftmost borders, which should be perpendicular or parallel lines. The deviation from their proper appearance is used by the fully connected layers to estimate the perspective parameters. The mechanism in the caption case, in contrast, is more puzzling.
In the angle estimation stage, after processing by the first STN, the convolution in layer 2 of the shared kernel model seems to segment the text area more distinctly. In both cases, the text area differs in value from the other parts of the feature map. This implies that even without explicit segmentation labels, after learning the concept of rectification, the DNN understands the roles of different elements inside images to some extent.
In this part, we give both good and bad cases of overall rectification. We choose them from the test dataset and show them in the order of sample, perspective-rectified result, and final output. Figure 8 shows some perfectly rectified patches, including both caption and text line patches.
The bad cases, plotted in Figure 9, matter more. We find that our model works well and stably on text line patches, but for caption patches it sometimes fails to restore the original appearance. After detailed analysis, we conjecture that the problem lies mainly in the regression performance of the perspective stage: a caption patch has fewer lines of text, so the estimate relies more on information from the non-text area. The context images vary in style, as observed in the samples: some are geometric shapes while others are complicated images. With such complexity, the regression needs more data and more varied context images to learn an accurate and robust mapping.
4 Conclusion and Future Work
In this paper, a new deep architecture for rectifying images with parallel text lines is proposed, together with thorough experiments comparing different configurations and hyper-parameters. Experiments on a newly collected dataset show its effectiveness and robustness.
However, even though the dataset we use contains thousands of text image formats, more diverse data is needed to verify the robustness of the model to format variations. Common cases include name cards, course slides, and more complicated texts in natural scenes. With more data collected, we can obtain better generalization capability.
-  J. Liang, D. DeMenthon, D. Doermann, Geometric rectification of camera-captured document images, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (4) (2008) 591–605.
-  Q. Ye, D. Doermann, Text detection and recognition in imagery: A survey, IEEE transactions on pattern analysis and machine intelligence 37 (7) (2015) 1480–1500.
-  M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Synthetic data and artificial neural networks for natural scene text recognition, in: Workshop on Deep Learning, NIPS, 2014.
-  A. Cambra, A. Murillo, Towards robust and efficient text sign reading from a mobile phone, in: Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, IEEE, 2011, pp. 64–71.
-  Z. Zhang, A. Ganesh, X. Liang, Y. Ma, Tilt: Transform invariant low-rank textures, International Journal of Computer Vision 99 (1) (2012) 1–24.
-  A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
-  Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
-  Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
-  K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition.
-  S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in neural information processing systems, 2015, pp. 91–99.
-  Y. LeCun, Y. Bengio, Convolutional networks for images, speech, and time series, The handbook of brain theory and neural networks 3361 (10) (1995) 1995.
-  R. Girshick, Fast r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
-  Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, X. Bai, Multi-oriented text detection with fully convolutional networks, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in: Proceedings of the 23rd international conference on Machine learning, ACM, 2006, pp. 369–376.
-  M. Jaderberg, K. Simonyan, A. Zisserman, et al., Spatial transformer networks, in: Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
-  J. Sun, W. Cao, Z. Xu, J. Ponce, Learning a convolutional neural network for non-uniform motion blur removal, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2015, pp. 769–777.
-  X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks., in: Aistats, Vol. 9, 2010, pp. 249–256.
-  H. I. Koo, N. I. Cho, Text-line extraction in handwritten chinese documents based on an energy minimization framework, IEEE Transactions on Image Processing 21 (3) (2012) 1169–1175.
-  S. Boyd, L. Vandenberghe, Convex optimization, Cambridge university press, 2004.
-  S. Saarinen, R. Bramley, G. Cybenko, Ill-conditioning in neural network training problems, SIAM Journal on Scientific Computing 14 (3) (1993) 693–714.
-  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting., Journal of Machine Learning Research 15 (1) (2014) 1929–1958.
-  S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proceedings of The 32nd International Conference on Machine Learning, 2015, pp. 448–456.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.