Recovering Homography from Camera Captured Documents using Convolutional Neural Networks

09/11/2017 ∙ by Syed Ammar Abbas, et al. ∙ 0

Removing perspective distortion from hand-held camera captured document images is one of the primitive tasks in document analysis, but unfortunately no existing method can reliably remove the perspective distortion from document images automatically. In this paper, we propose a convolutional neural network based method for recovering homography from hand-held camera captured documents. Our proposed method works independently of the document's underlying content and is trained end-to-end in a fully automatic way. Specifically, this paper makes the following three contributions: firstly, we introduce a large scale synthetic dataset for recovering homography from document images captured under different geometric and photometric transformations; secondly, we show that a generic convolutional neural network based architecture can be successfully used for regressing the corner positions of documents captured under wild settings; thirdly, we show that L1 loss can be reliably used for corner regression. Our proposed method gives state-of-the-art performance on the tested datasets, and has the potential to become an integral part of the document analysis pipeline.




1 Introduction

Hand-held cameras and smart-phones have become an integral part of our daily life, and so have the images captured using these devices. A sizable amount of these images contain textual content and thus require an automatic document analysis pipeline for text detection and understanding. Applying perspective correction to these perspectively warped documents is an important preprocessing step before proceeding with more advanced stages of the document analysis pipeline, such as binarization, segmentation and optical character recognition.

Images captured using hand-held devices in the real world are significantly different from those captured using dedicated hardware (such as scanners) in controlled environments, due to the challenging photometric and geometric transformations these images undergo in wild settings. Therefore, traditionally developed methods [26, 27, 13] for perspective correction either completely fail or give poor performance, and if the distortion is not corrected manually (as in the majority of commercial applications), these errors can lead to failure of the complete document analysis pipeline.

This degradation in performance can be mainly attributed to the manual and complex pipelines used to estimate the geometric transformation matrix. These pipelines mainly follow a similar setup: initially, low level features such as line orientations, edges, contours, etc. are used to estimate the page layout, and in later stages images are deskewed via estimated similarity or homography transformation matrices. However, these methods completely fail in wild settings due to challenging lighting conditions, cluttered backgrounds, motion blur, etc. (c.f. Sec. 5.8).

To this end, we propose a generalized and completely automatic method, trained end-to-end, for perspective correction of document images captured in the wild.

We make the following main contributions. Firstly, we introduce and publicly release a large scale synthetic dataset for learning the homography matrix for perspective correction of textual images in wild settings (we will release the link to the dataset). Secondly, we introduce a convolutional neural network based architecture for recovering the homography matrix using the four points parameterization [3] from a single input image. Thirdly, we empirically show that the L1 loss function performs better for homography estimation.

Since our method does not use hand crafted features and is trained over a dataset large enough to capture a wide range of photometric and geometric transformations, it works independently of text layout assumptions such as the availability of page margins, parallel text lines, etc. in the captured image. Also, in comparison to earlier methods, our method is more robust and works under different lighting conditions and in the presence of different kinds of noise, as illustrated by our results. It even works on documents where a significant portion is either missing or occluded. Overall, our method gives state-of-the-art performance on the tested datasets. Figure 1 shows some sample results; please refer to Sec. 5 for detailed results.

In addition to being robust, our method is quite simple and relatively fast to train and test. Precisely, it requires around 5 hours for training and 0.04 s for a complete forward pass and perspective correction on a Tesla K40 GPU machine.

The rest of the paper is organized as follows. Sec. 2 reviews the related work, while Sec. 3 provides details on our synthetic dataset. Sec. 4 explains our CNN model and architecture. Sec. 5 discusses in detail different experimental choices, parameter settings and our results. Finally, Sec. 6 concludes the paper with relevant discussion.

2 Related Work

Traditionally, to find the homography transformation between a pair of reference and transformed images, a set of corresponding features is first built, and then based on this set either the direct linear transform [10] or cost based methods have been used to estimate the projective transformation [8]. This step is followed by other post-processing steps to remove false matches or outliers. Over the years, researchers have used corner points, lines, and conics for defining correspondences between pairs of images. However, the whole pipeline depends on the quality of the detected feature sets and their repeatability, where false correspondences or a lack of quality correspondences can lead to large errors in the computed transformation matrix.

For the problem at hand, the above mentioned pipeline cannot be directly used due to the absence of a reference image. Although a canonical image with a white background could be used as the reference image, the absence of text in the canonical image can lead to misfiring of corner detectors and indirectly to failure of the complete methodology.

In contrast, in document analysis different manual pipelines have been used to restore a perspectively distorted image. We can broadly classify these approaches into two classes. The first class of methods [27, 16, 26] makes assumptions about the image capturing process to recover the transformation matrix. For instance, Zhang et al [27] develop a method for 3D reconstruction of the paper from the shading information in a single image. For this purpose they use special hardware consisting of light sources and sensors.

In contrast, the second class of methods makes assumptions about document layout [13, 17, 22, 23, 4]. For example, Jagannathan & Jawahar [13] extract clues about document layout such as document boundaries, text orientation, page layout information, etc. to either impose constraints for solving the system of linear equations or find vanishing points and lines for homography estimation. Liang et al [17] perform projection profile analysis for detecting the orientations of text lines. These text line orientations are then used for the identification of vanishing points, which are in turn used for the estimation of the affine matrix. In 2007, Shafait et al [22] launched a competition at the Camera-Based Document Analysis and Recognition (CBDAR) conference to evaluate different image dewarping algorithms on a standard dataset. In this competition, the coordinate transform model produced the best results among the three entries. This method also uses the principle of text line detection for the rectification of the document. Another method [4], based on a ridges based coupled snakes model, was later applied to the dataset and obtained even better results. However, all these methods share the same limitations and fail when applied to real world images of captured documents. This is because the CBDAR dataset does not contain enough variation to capture the distribution of real world examples.

Recently, Simon et al [23] have proposed another method for dewarping of document images. Their complex pipeline for perspective rectification involves binarization, blob and line detection, and application of morphological operators to produce a final perspectively rectified image.

Almost all of the methods discussed above work on images captured in controlled settings with considerable textual cues such as lines, etc. In addition, these methods use hand crafted features, involve tuning of multiple parameters and are not robust to the variations and noise introduced when images are captured in wild settings. In comparison, our proposed CNN based method can reliably estimate homography from perspectively distorted images without making any assumptions about image content or capturing environments. Recently, [7] have also successfully used convolutional neural networks to estimate the homography transformation between a pair of natural scene images. Since their method requires a pair of reference and transformed images as input to the network for homography estimation, it cannot directly work on textual images. In contrast, our method estimates homography from a single input image without the need for a reference image.

Figure 2: Sample images from synthetic dataset.

3 Synthetic Dataset Generation

Availability of large scale datasets (such as ImageNet [6]) has played a significant role in the recent upsurge of deep networks. Unfortunately, as discussed above, no large enough dataset is available for the problem at hand. Currently, the largest available public datasets are CBDAR 2007, CBDAR 2011 [22], and SmartDoc-QA [18]. These datasets either contain very few images or have a very limited amount of variability in their images. For instance, CBDAR 2007 and CBDAR 2011 contain around 100 grayscale images of books and periodicals, with limited variations. Although the SmartDoc-QA dataset contains a large number of document images captured under different conditions, these conditions are quite limited compared to wild settings (c.f. Sec. 5.7). However, to build a generic image rectification algorithm we need a large dataset of RGB images with large variations in illumination conditions, background clutter, geometric transformations, etc., to model the distribution of real world image capturing conditions.

Recently, researchers have produced and used synthetic datasets to solve data scarcity problem [19, 9, 7]. Peng et al [19] build a synthetic dataset to train deep CNNs for learning deep object detectors from 3D models due to limitations of the available dataset. Gupta et al [9] also train a CNN on a synthetic dataset to solve the problem of text localization in natural images for the same reason. Motivated by the success of these methods, we have developed our own synthetic dataset for the problem at hand.

We use the 3000 document images captured using hand held cameras [12] (this dataset was graciously donated by its authors) for building our synthetic dataset. These document images contain different types of content such as text, figures, equations, etc., and thus serve ideally to capture content variations in the dataset.

Geometric Transformations:

We first apply different random geometric transformations to these documents to produce perspectively distorted documents. We sample the values of the homography matrix coefficients from uniform distributions with different ranges.


Precisely, and are randomly sampled from to , and from - to , and and from the range - to . We further add variable length horizontal and vertical margins to these images to give them camera captured appearance.
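The sampling step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the grouping of coefficients into scale, shear and perspective terms, the function names `sample_homography` and `warp_points`, the 384x256 page size, and every numeric range below are hypothetical placeholders, not the values used for the dataset.

```python
import numpy as np

def sample_homography(rng, scale_range=(0.7, 1.3), shear_range=(-0.3, 0.3),
                      persp_range=(-1e-3, 1e-3)):
    """Draw a random 3x3 homography; all ranges are illustrative placeholders."""
    H = np.eye(3)
    H[0, 0], H[1, 1] = rng.uniform(*scale_range, size=2)   # scaling terms
    H[0, 1], H[1, 0] = rng.uniform(*shear_range, size=2)   # shear/rotation terms
    H[2, 0], H[2, 1] = rng.uniform(*persp_range, size=2)   # perspective terms
    return H

def warp_points(H, pts):
    """Apply H to an (N, 2) array of points using homogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

rng = np.random.default_rng(0)
H = sample_homography(rng)
# Corners of a hypothetical 384x256 page, in (x, y) order.
corners = np.array([[0, 0], [383, 0], [383, 255], [0, 255]], dtype=float)
warped_corners = warp_points(H, corners)   # ground-truth corner labels after warping
```

The warped corner positions double as regression targets, which is why the sketch tracks points rather than only warping pixels.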

Background Variations:

To introduce the background clutter and variations, as a next step, we add randomly sampled textured backgrounds to these images. For this purpose, we use the Describable Textures Dataset (DTD) [5] which contains over 5000 textures from different categories like fibrous, woven, lined, etc. We have also used Brodatz dataset [2] which contains 112 textures of different colors and patterns. However, we show in section 5.6 that simple textures alone are not enough to represent the variety of backgrounds that appear in camera captured documents.

As it turns out, the document images are mainly captured in indoor environments consisting of more complex backgrounds than simple textures. Thus, to model complex indoor backgrounds, we also use MIT Indoor scenes dataset [20] to sample backgrounds for synthetic images.
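Placing a warped page on a sampled background can be sketched with a simple mask-based composite. The function name `composite_on_background`, the image sizes and the rectangular mask are assumptions made for illustration; in practice the mask would be the warped page region.

```python
import numpy as np

def composite_on_background(document, mask, background):
    """Paste document pixels over a background wherever mask is 1."""
    mask3 = mask[..., None].astype(float)       # broadcast mask over colour channels
    return mask3 * document + (1.0 - mask3) * background

rng = np.random.default_rng(0)
background = rng.random((120, 160, 3))          # stand-in for a sampled texture or scene
document = np.ones((120, 160, 3))               # stand-in for a warped white page
mask = np.zeros((120, 160))
mask[20:100, 30:130] = 1.0                      # region covered by the page
scene = composite_on_background(document, mask, background)
```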

Photometric Transformations:

Images produced using the above pipeline appear as real as those captured using hand-held cameras, but they still lack the illumination variations and noise (such as motion and defocus blur) encountered while capturing images in the wild. To this end, we add motion blur of variable angles and magnitudes to the resultant images to simulate camera shake and movement effects. We also add Gaussian blur to the images to model dirty lenses and defocus blur. To introduce different lighting variations, we create different filters based on gamma transformations in the spatial domain (we apply the gamma transformation as a function of displacement from a randomly sampled image position instead of pixel intensity), with variable pixel intensities in different directions and shapes. Next we use alpha blending, with a uniformly sampled alpha, to merge these filters with the geometrically transformed image. This introduces the effects of different lighting variations into the resultant image. Some sample images from our synthetic dataset are shown in Figure 2.
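A minimal sketch of these photometric steps is given below. The exact filter shapes, gamma values, alpha range and blur parameters are not stated here, so the radial fall-off, `gamma=2.0`, `alpha=0.4`, the kernel length of 7 and all function names are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def spatial_gamma_filter(height, width, center, gamma):
    """Brightness map driven by distance from `center`, not by pixel intensity."""
    ys, xs = np.mgrid[0:height, 0:width]
    dist = np.hypot(ys - center[0], xs - center[1])
    dist = dist / dist.max()              # normalise distances to [0, 1]
    return (1.0 - dist) ** gamma          # bright near the centre, darker far away

def alpha_blend(image, light_map, alpha):
    """Blend an image in [0, 1] with a single-channel lighting map."""
    return (1.0 - alpha) * image + alpha * light_map[..., None]

def motion_blur_kernel(length):
    """Horizontal motion-blur kernel: one normalised row inside a square kernel."""
    kernel = np.zeros((length, length))
    kernel[length // 2, :] = 1.0 / length
    return kernel

rng = np.random.default_rng(0)
image = rng.random((64, 96, 3))                        # stand-in for a warped document
center = (rng.integers(0, 64), rng.integers(0, 96))    # randomly sampled light position
light = spatial_gamma_filter(64, 96, center, gamma=2.0)
lit = alpha_blend(image, light, alpha=0.4)
kernel = motion_blur_kernel(7)                         # rotate it to vary the blur angle
```

Convolving `lit` with `kernel` (for example with an image-processing library of choice) would then add the motion blur.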

4 Proposed Method

In this section, we introduce our convolutional neural network architecture. We experimented with different design choices (such as number of layers, filters, nonlinearities) for our CNN architecture before arriving at final architecture. Figure 3 shows our final architecture.

Figure 3: Our architecture consists of 11 convolutional layers and a fully connected layer. The initial two layers share one filter size, all of the remaining layers except the last one share a second filter size, and the final layer uses a third. The fully connected layer uses 8 neurons to regress the corner positions.

Our final architecture consists of 11 convolutional layers with interleaved max pooling layers and draws inspiration from the VGG [24] and FAST YOLO [21] architectures. We use a ReLU nonlinearity after each convolutional layer except the last one. We use max pooling after each of the first three convolutional layers and after two of the later layers. The initial two layers use filters of one size, whereas all the remaining layers except the last one use filters of a second size. The final convolutional layer uses its own filter size for efficient computation and storage. This is followed by a final fully connected regression layer of 8 neurons. We apply dropout after the last convolutional layer.

For our loss function, we use the L1 distance to measure the displacement of the eight corner coordinates from their canonical positions, i.e. the sum of absolute differences between the predicted and original coordinate values. As mentioned earlier, this formulation is similar to the four points homography formulation of [3].
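As a small illustration, an L1 loss over the eight corner coordinates could look like the sketch below. The function name `l1_corner_loss` and the choice to sum (rather than average) over coordinates are assumptions; the network would use the equivalent operation in its training framework.

```python
import numpy as np

def l1_corner_loss(predicted, target):
    """Sum of absolute differences over the 8 corner coordinates of each sample."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.abs(predicted - target).sum(axis=-1)

# Four corners flattened as (x1, y1, ..., x4, y4); values are made up.
target = [0, 0, 383, 0, 383, 255, 0, 255]
predicted = [2, -1, 380, 3, 385, 250, 1, 255]
loss = l1_corner_loss(predicted, target)
```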

5 Experiments and Results

We randomly split our synthetic dataset into three sets: training, validation and testing. All configuration choices were made using the validation set, and we report the final performance on the test set.

In the literature, researchers [7, 22] have used different metrics to measure the performance of their methods. For instance, [7] used mean average corner error, which is measured using the distance between the estimated and original corner positions. In comparison, [22] used mean edit distance to compare the methods on the CBDAR 2007 dataset. In our experiments, we use Mean Displacement Error (MDE), computed as the average distance between the ground truth and predicted corner coordinates of a document; this measure is then averaged over the complete dataset to get the final score for the dataset. MDE gives better intuition and insight into the performance of the algorithm for the given problem, as we can directly know in pixel units how well the system is performing.
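The metric can be sketched as follows. Which norm defines the per-corner distance is an assumption here, so it is exposed as the `ord` parameter (Euclidean by default); the function name and array layout are likewise illustrative.

```python
import numpy as np

def mean_displacement_error(pred, truth, ord=2):
    """Average corner displacement per document, then averaged over the dataset."""
    pred = np.asarray(pred, dtype=float).reshape(-1, 4, 2)
    truth = np.asarray(truth, dtype=float).reshape(-1, 4, 2)
    per_corner = np.linalg.norm(pred - truth, ord=ord, axis=-1)   # shape (N, 4)
    per_document = per_corner.mean(axis=1)                        # shape (N,)
    return per_document.mean()

truth = np.zeros((2, 4, 2))
pred = truth + np.array([3.0, 4.0])      # every corner is off by (3, 4) pixels
mde = mean_displacement_error(pred, truth)
```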

5.1 Implementation Details

We have implemented our system using TensorFlow [1]. Although our synthetic dataset is composed of images of different resolutions, for network training and evaluation we use images of a fixed resolution; this helps us to train our network with limited resources while employing multiple max pooling layers. For initialization of our networks we use the He et al [11] initialization scheme for ReLU based CNNs. We use a batch size of 4 during training. We use the Adam optimization method [14] with default parameters to train our networks. We reduce the learning rate by half from its initial value whenever the loss stops decreasing. We repeat this reduction process until the absolute change in loss is very small over a few hundred iterations. We also experimented with RMSProp [25] as the optimization method, but in our experiments Adam consistently gave better performance.
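The halving rule can be sketched in a framework-independent way as below. The function name, the `patience` and `min_delta` parameters and the example numbers are hypothetical; the text only specifies that the rate is halved when the loss stops decreasing.

```python
def halve_on_plateau(losses, initial_lr, patience=3, min_delta=1e-4):
    """Return the learning rate used after each loss value is observed."""
    lr = initial_lr
    best = float("inf")
    stale = 0                      # checks since the last meaningful improvement
    history = []
    for loss in losses:
        if best - loss > min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:  # loss has plateaued: halve the learning rate
                lr *= 0.5
                stale = 0
        history.append(lr)
    return history

schedule = halve_on_plateau([1.0, 0.5, 0.5, 0.5, 0.5, 0.4], initial_lr=1.0)
```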

Training our method for around 10 epochs takes on average five hours. During testing, our method takes on average 0.04 seconds on a GPU (NVIDIA Tesla K40) machine per image which translates to roughly 25 images per second.

5.2 Homography vs 4-Points Estimation Method

Initially, we trained our convolutional neural networks to directly predict the homography matrix (c.f. Eq. (1)) as the output. However, these CNNs were not able to produce the desired results and were difficult to train. The reason is that the homography matrix is extremely sensitive to some of its coefficients: even a very small change in their values results in an incomprehensible resultant image. Later on, we adopted the 4-points method to recover homography from the input image. CNNs trained using this method were more robust to errors in coordinate values, gave much better results and were relatively easier to train than the direct approach.
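To make the 4-points idea concrete, the sketch below recovers a homography from four predicted corners and their canonical positions by solving the standard 8x8 linear system with the last matrix entry fixed to 1. The corner values, the 384x256 canonical size and the function names are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

def project(H, pts):
    """Map (x, y) points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def homography_from_4_points(src, dst):
    """Solve for H (with H[2, 2] = 1) such that project(H, src) equals dst."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u * (h31*x + h32*y + 1) = h11*x + h12*y + h13, and likewise for v.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

# Hypothetical network output (corner positions) and canonical page corners.
predicted = [(40.0, 25.0), (350.0, 60.0), (330.0, 240.0), (20.0, 200.0)]
canonical = [(0.0, 0.0), (383.0, 0.0), (383.0, 255.0), (0.0, 255.0)]
H = homography_from_4_points(predicted, canonical)
rectified = project(H, predicted)
```

Because the regression targets are pixel coordinates, a small prediction error moves a corner by a small amount, whereas a comparable error in a raw matrix coefficient can distort the whole image.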

5.3 Evaluation of Different Architectures

Initially, we transfer-learned a CNN from VGG-13 [24] by replacing the last layer with our prediction layer. This network obtained an MDE of 6.95 on the test set. We also trained a variant of FAST YOLO [21], which reported an MDE of 10.46 pixels on the test dataset. In our initial analysis, we found that having a large number of filters with large receptive fields in the initial layers plays a critical role in the performance of our system. This is because large filters with large receptive fields are able to capture local co-occurrence statistics much better in the initial layers. Secondly, we found that going deeper leads to much better results. Therefore, based on these findings we designed our own architecture, as already discussed in Sec. 4. This final architecture includes more filters and convolutional layers than YOLO and larger receptive fields than VGG. It obtained a state-of-the-art MDE of 2.45 pixels on our test set (c.f. Figure 4).

Figures 1 and 7 show the results of our proposed architecture on unseen real world images.

Figure 4: MDE for different architectures on test set.
Figure 5: Results of experiment with border pixels set to zeros.

5.4 Evaluation of Different Loss Functions

We evaluated different loss functions (such as L1, L2 and reverse Huber) to find the best one for the problem at hand. The L1 loss function achieved a better MDE (2.59 pixels) on the validation set compared to an MDE of 3.30 pixels with L2 loss. In fact, in all our initial experiments, L1 performed better than L2 loss. This can be attributed to the fact that, for the problem at hand, L1 handles extreme scenarios better than L2. For instance, if a corner is occluded or lies outside the image frame, L2 will give a relatively high penalty and thus force the network to overfit these conditions. We also tried the reverse Huber loss, as discussed in [15, 28], to train our convolutional neural networks. This loss function is a piecewise function of L1 and L2 loss.

We validated this loss for different values of its threshold parameter, but all the results were worse than the ones obtained via the L1 loss function. In fact, we found that the best value of this parameter is extremely sensitive to network initialization, i.e. this loss function under different network initializations, with an identical parameter value, produces different results.
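For reference, the reverse Huber (berHu) loss in its commonly used form behaves like L1 for small residuals and like a scaled L2 for large ones. The sketch below assumes that form; the threshold name `c` and the example values are hypothetical.

```python
import numpy as np

def reverse_huber(residual, c):
    """berHu: |r| when |r| <= c, otherwise (r^2 + c^2) / (2c)."""
    r = np.abs(np.asarray(residual, dtype=float))
    return np.where(r <= c, r, (r ** 2 + c ** 2) / (2.0 * c))

values = reverse_huber([0.5, -1.0, 3.0], c=1.0)
```

The two branches meet at |r| = c, so the loss is continuous, but its behaviour for large errors depends strongly on the chosen threshold.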

5.5 RGB vs Grayscale Images

Most of the earlier methods, discussed in section 2, convert the input image to grayscale image before proceeding with perspective correction pipeline. This removes pivotal color information that can significantly help in homography estimation. For instance, color can act as an important clue for distinguishing the document from its background since majority of documents are usually white in color.

For this experiment, we first converted training and validation set RGB images into grayscale images. Next, we trained a CNN network with same architecture as discussed in Sec. 4 on the grayscale training set. The MDE obtained on the grayscale validation set was 11.59 pixels. This is far worse compared to MDE obtained using RGB images. This supports our original hypothesis that color plays an important role in recovering homography in textual documents.

5.6 Evaluation of Synthetic Dataset Design Choices

For evaluating the synthetic dataset design choices, we designed a set of experiments by selecting different subsets from our final training dataset. We used the same CNN architecture for all these experiments. In our first experiment, we build a training dataset excluding the lighting variations, motion and Gaussian blurs. The model trained on this dataset gave MDE of 19.52 pixels on the validation set. From this, we can infer that having a dataset that covers large photometric variations can help in better homography estimation. Next, we created a training dataset where we only included background textures from the DTD and Brodatz datasets without including the background images from MIT indoor scenes. The network trained on this dataset gave MDE of 4.54 pixels on the validation set. This error is greater than the error obtained when indoor scenes are also added to represent background.

Our method worked well for documents whose corners were present inside the image, regardless of the level of noise, background clutter, lighting variation, absence of page layout clues such as text lines, etc. However, it occasionally misfired for documents with more than one corner occluded or outside the image boundaries. To tackle this problem, we ran an experiment where we set the pixels in the image margins to zero to simulate occluded corners in our synthetic dataset. Precisely, we set 30 pixels from the left and right image margins and 40 pixels from the top and bottom margins to zero. However, we did not change the true annotation positions. Although the model trained on this version of the dataset was able to reliably estimate unseen corners of the documents, its MDE was relatively higher than that of the model trained on the dataset without occluded corners. Figure 5 shows the results of our experiment on some sampled documents. We have not yet explored this any further.
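The margin-zeroing step can be sketched in a few lines. The margin widths follow the text (30 pixels left and right, 40 pixels top and bottom); the function name and the image size are assumptions.

```python
import numpy as np

def zero_borders(image, side=30, top_bottom=40):
    """Return a copy of the image with its margins set to zero."""
    out = image.copy()
    out[:top_bottom, :] = 0        # top margin
    out[-top_bottom:, :] = 0       # bottom margin
    out[:, :side] = 0              # left margin
    out[:, -side:] = 0             # right margin
    return out

image = np.ones((200, 300, 3))     # stand-in for a synthetic training image
occluded = zero_borders(image)     # corner annotations are left unchanged
```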

These experiments and their results validate that the design choices we made during dataset creation indeed represent the diversity of background textures and of photometric and geometric transformations that the system is likely to encounter in the wild.

5.7 Performance on SmartDoc-QA Dataset

The SmartDoc-QA dataset [18] is a recently proposed dataset for evaluating the performance of OCR systems on camera captured document images. Although it contains document images captured under different simulated conditions, these conditions are quite limited compared to wild settings and to our proposed dataset. That is, all images in this dataset are captured: (i) against a fixed red background and thus have clear contrast with the background; (ii) with only a fixed set of blurs (6 different blurs) and lighting conditions (5 different). In short, this dataset lacks the variations in image appearance encountered in wild settings.

Figure 6 shows the results of our algorithm on a set of sampled images from SmartDoc-QA dataset. As expected, our method is able to correctly rectify all the warped documents due to presence of strong document boundary cues.

To thoroughly and analytically evaluate the effect of our system on OCR performance, we designed another experiment where we replace the Orientation and Script Detection (OSD) module of a publicly available OCR system (Tesseract) with ABBYY Reader and with the proposed perspective correction algorithm.

Table 1 shows the results of these different configurations on the SmartDoc-QA dataset. Here, we use the fraction of character matches as a metric to measure the performance of OCR, computed from the number of character matches and the total number of characters in both documents. Our algorithm improves the performance of Tesseract OCR over the default OSD system and gives performance on par with ABBYY Reader. Note that here the difference is entirely due to the superior performance of the proposed image rectification algorithm. Furthermore, the poor overall performance of Tesseract OCR is due to the presence of significant motion blur at the character level in the SmartDoc-QA dataset, which leads to failure of the character recognition pipeline.
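One way to compute such a score is sketched below, assuming it takes the form 2M/T, where M is the number of matching characters and T the total number of characters in both documents; this is how Python's `difflib` defines its similarity ratio, and the exact formula used in the experiments may differ. The function name is hypothetical.

```python
from difflib import SequenceMatcher

def character_match_fraction(ocr_text, reference_text):
    """Return 2 * M / T for matched characters M and combined length T."""
    matcher = SequenceMatcher(None, ocr_text, reference_text, autojunk=False)
    matches = sum(block.size for block in matcher.get_matching_blocks())
    total = len(ocr_text) + len(reference_text)
    return 2.0 * matches / total if total else 1.0

score = character_match_fraction("abcd", "bcde")   # three characters match
```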

Figure 6: Results of our already trained algorithm on a set of sampled images from SmartDoc-QA dataset.
Methods                       SmartDoc Dataset
OSD + Tesseract OCR           11.61%
ABBYY + Tesseract OCR         16.18%
Our Method + Tesseract OCR    16.14%
Table 1: Performance of Tesseract OCR with different perspective correction modules on SmartDoc-QA dataset.
Methods                       Simple    Complex
OSD + Tesseract OCR           10.99%    2.94%
ABBYY + Tesseract OCR         23.78%    10.04%
CamScanner + Tesseract OCR    30.64%    7.55%
Our Method + Tesseract OCR    31.58%    14.35%
Table 2: OCR performance for different image rectification algorithms on Simple and Complex variants of datasets from SmartDoc-QA.
Figure 7: Visual comparison of results produced by our method and popular commercial software (columns, left to right: Original, Our Method, CamScanner, ABBYY FineReader). On this sample set, our method outperforms both of the other applications.
Figure 8: Visual comparison of results produced by our method and popular commercial software on test images (columns, left to right: Original, Our Method, CamScanner, ABBYY FineReader). The proposed method gives better results over a wide range of photometric and geometric transformations and works independently of the document's underlying content.

5.8 Comparison with Commercial Software Applications

Algorithms for rectifying perspectively distorted documents are also used by many commercial software applications for the purpose of optical character recognition and document digitization. Here we compare our method with two popular commercial applications, i.e. CamScanner (one of the most widely used applications, with around 50 million downloads, the highest number of downloads for an Android scanner application) and ABBYY-Reader.

We compared our method with these commercial applications at three levels. At the first level, we did the comparison via visual inspection of rectified images. CamScanner performed well for the cases where corners or edges of the documents were clearly visible and the documents could be distinguished from the background. In other cases, where there were strong illumination artifacts or background clutter, or the corners were not visible, the application failed to remove perspective distortion from the documents (c.f. Figures 7 and 8, third column, for more details). ABBYY-Reader gave good performance for the cases where document edges were strong and documents could be easily differentiated from the background clutter. However, it failed in the cases where there were: (i) no textual line cues; (ii) strong illumination artifacts; or (iii) large scale geometric transformations in the captured documents.

At the second level, we first randomly sampled 400 images from our test set and passed them to CamScanner, ABBYY-Reader and our method for perspective correction. We then manually annotated the true corner positions in these rectified documents and finally used these annotations to measure MDE w.r.t. the ground truth. Compared to MDEs of 21.5 and 20.6 pixels for ABBYY-Reader and CamScanner respectively, our method achieves an MDE of 2.56 pixels on this dataset.

At the third level, we compare the OCR performance of these methods on a pair of test datasets (named Simple and Complex) generated from the high-resolution ground truth images of the SmartDoc-QA dataset. The Simple version of the dataset was generated with the least variation in photometric and geometric transformations, whereas the Complex version includes the same level of variability (except motion blur) as our original synthetic dataset. Table 2 compares the performance of Tesseract OCR with different image rectification algorithms on these datasets. Our algorithm once again consistently gives better performance than the competing methods. Although CamScanner gives comparable performance on the Simple variant, once large variations are introduced both CamScanner and ABBYY-Reader perform much worse than our method.

These results show that our method is indeed generic and gives excellent results for document images captured under a wide range of wild settings.

6 Conclusions

In this paper, we have proposed a simple and efficient method to recover homography from perspectively distorted document images. We have performed extensive experiments and shown that our proposed method gives excellent results in a wide range of realistic image capturing settings. In comparison to earlier methods, our method works independently of document content and is fully automatic, as it does not require any manual input.

Furthermore, for training deep networks, we have introduced a new synthetic dataset of warped camera captured documents that contains a large number of images compared to existing ones. Overall, the major findings of this study are as follows: (i) a rich dataset, even a synthetic one, that captures the true underlying real world distribution of the problem plays a critical role in the overall performance of deep networks; (ii) large receptive fields in the filters of the initial layers are crucial for improved performance, and a large number of filters and convolutional layers is necessary to achieve state-of-the-art performance for the problem at hand; (iii) L1 loss can be reliably used for regressing corner positions compared to the traditionally used L2 loss, however the overall difference in performance is not statistically significant; (iv) similar to [7], we found that the 4-points homography parameterization works better than the traditionally used matrix representation and results in a stable loss function that gives state-of-the-art performance.

We are of the view that our method can become an integral part of complete document analysis pipeline.