1 Introduction
Smartphone cameras have simplified the capture of various physical documents in digital form. The ease of sharing digital documents (e.g., via messaging/networking apps) has made them a popular medium of information dissemination. However, the readability of such digitized documents is hampered when the (original) physical document is degraded. For instance, the physical document may contain extraneous elements like stains, wrinkles, and ink spills, or may degrade over time. As a result, while scanning such documents (e.g., via a flat-bed scanner), these elements also get incorporated into the document image. When document images are captured via mobile cameras, the images are prone to being impacted by shadows, non-uniform lighting, light from multiple sources, light source occlusion, etc. Such noisy elements not only affect the comprehensibility of the corresponding digitized document for human readers, but may also break the automatic (document-image) processing/understanding pipeline in various applications (e.g., OCR, bar code reading, form detection, table detection, etc.). A few instances of noisy document images are shown in Fig. 1.
Given an input noisy document image, the aim of document image cleanup is to improve its readability and visibility by removing the noisy elements. While general (natural scene) image restoration has been traditionally explored by the computer vision community, recent works have also focused on developing cleanup techniques for document images depending on the type of noise and document-class.
These include foreground-background separation and noise-specific corrections such as illumination and shadow removal, discussed further in Sec. 2.
Fig. 1: Instances of noisy document images: (a), (b).
Recent works [gangeh2019document, illu_cor2019, Skip-Connected_ICPR18] view document cleanup as an image-to-image translation problem, modeled using deep networks. A general direction of research has been to explore deeper and more complicated networks in order to achieve better accuracy.
In this work, we propose a light-weight encoder-decoder based convolutional neural network (CNN) with skip-connections for cleaning up document images.
Focusing on memory constrained mobile and embedded devices, we design a light-weight deep network architecture.
It should be noted that a light-weight deep network architecture usually comes at the cost of generalization performance when compared with deeper networks.
Hence, in order to obtain a healthy interplay between resource/latency and accuracy, we propose to employ perceptual loss function (instead of the more popular per-pixel loss function) for document image cleanup.
The perceptual loss function enables transfer learning by comparing high-level representations of images, obtained from a pre-trained CNN (e.g., one trained on an image classification task).
We empirically show the effectiveness of the proposed network on several real-world benchmark datasets.
2 Related work
In this section, we briefly discuss existing approaches that aim to recover/enhance images of degraded documents via techniques involving binarization, illumination/shadow correction, and deblurring, among others.
Document image binarization:
A popular framework for document image cleanup is background-foreground separation [pg_seg_book], where the foreground pixels are preserved and enhanced and the background is made uniform. Binarization is a technique to segment the foreground from the background pixels.
Analytical techniques for document image binarization segment foreground pixels from background pixels based on some thresholding. Traditional image binarization techniques such as [otsu] compute a global threshold assuming that the pixel intensity distribution follows a bi-modal histogram. As estimating such a threshold may be difficult for degraded document images, Moghaddam and Cheriet proposed an adaptive generalization of Otsu's method.
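As a concrete illustration of this classical global-thresholding step, the following minimal sketch (assuming an 8-bit gray scale input and OpenCV's built-in implementation of [otsu]) computes a single global threshold from the intensity histogram and binarizes the page:

```python
import cv2

# Read the page in gray scale (placeholder path) and apply Otsu's global threshold.
gray = cv2.imread("document.png", cv2.IMREAD_GRAYSCALE)
threshold, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# `threshold` is the single global value chosen under the bi-modal histogram assumption;
# `binary` contains the resulting foreground/background segmentation.
```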
Deep convolutional neural networks (CNNs) have become all-pervasive in computer vision ever since AlexNet [krizhevsky12a] won the ILSVRC 2012 ImageNet Challenge. For binarization, one line of work employed a long short-term memory (LSTM) network to classify each pixel as background or foreground by considering the image to be a two-dimensional sequence of pixels.
Another approach proposed a multi-resolution attention model to learn the relationship between text regions and the background through a convolutional conditional random field.
Document image enhancement:
In addition to working within the background-foreground separation framework, existing works have developed noise-specific document image cleanup methods such as shadow removal. Bako et al. [Bako16] assume a constant background color and generate a shadow map that matches local background colors to a global reference. Similar to Bako's method, local and background colors are estimated to remove shadows from document images in [shadow_icip2019, shadow_icassp20]. Inspired by the topological surface filled by water, Jung et al. proposed an illumination correction algorithm for document images in [Water-Filling_ACCV18]. A document image enhancement approach has been proposed by Kligler et al. by representing the input image as a 3D point cloud and adopting a visibility detection technique to detect the pixels to enhance [doc_enhance_pointcloud_cvpr18]. Recently, Lin et al. [BEDSR-Net_CVPR20] proposed a deep architecture to estimate (i) the global background color of the document, and (ii) an attention map which computes the probability of a pixel belonging to the shadow-free background. An illumination correction and document rectification technique using a patch based encoder-decoder network has also been proposed.
Existing works have also explored deep networks for overall document enhancement rather than focusing on correcting specific document degradations.
A skip-connection based deep convolutional auto-encoder is proposed in [Skip-Connected_ICPR18].
Instead of learning the transformation function from input to output, this network learns the residual between input and output.
This residual, when subtracted from the input image, results in a noise-free enhanced image.
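This residual formulation can be summarized by the following minimal sketch (the single convolution layer is only a stand-in for the deep residual branch; the patch size and channel count are assumptions, not the configuration of [Skip-Connected_ICPR18]):

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(256, 256, 1))            # noisy gray scale patch (size assumed)
residual = layers.Conv2D(1, 3, padding="same")(inp)  # stand-in for the deep residual branch
clean = layers.Subtract()([inp, residual])           # enhanced image = input - predicted residual
model = tf.keras.Model(inp, clean)
```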
An end-to-end document enhancement framework using conditional Generative Adversarial Networks (cGAN) is proposed in DE-GAN [DEGAN_TPAMI20].
Document image cleanup for mobile and embedded applications: Low-resource consuming models are desirable for mobile document image processing, e.g., in apps like Adobe Lens, CamScanner, Microsoft Office Lens, etc. However, the existing CNN based methods [DEGAN_TPAMI20, Skip-Connected_ICPR18] discussed above employ deep architectures with a huge number of parameters, making them unsuitable for memory and energy constrained devices. In this work, we propose a comparatively light-weight deep encoder-decoder based network for the document image enhancement task. We employ a perceptual loss based transfer learning technique to compensate for the generalization performance lost due to the low network capacity.
3 Proposed approach
As discussed, we propose a light-weight deep network, suitable for mobile document image cleanup applications.
3.1 Network architecture
We design an encoder-decoder based image-to-image translation network.
The encoder part of the model consists of three convolution layers followed by five residual blocks.
The residual blocks were first introduced in [RESNET_CVPR2016] for generic image processing tasks.
We modify the residual blocks from the original design [RESNET_CVPR2016] to suit our network design.
The decoder part of the model consists of five convolution layers along with skip connections from the encoder layers.
This type of skip connection helps mitigate the vanishing gradient and exploding gradient issues [RESNET_CVPR2016, Skip-Connected_ICPR18, DEGAN_TPAMI20].
Hence, the skip connections help to simplify the overall learning of the network.
Each convolution layer is followed by a batch normalization layer and a ReLU6 activation.
The kernel size of the convolution layers is fixed, and the stride for all the layers is set to a common value.
The padding at each layer is set so as to keep the spatial dimension of the convolution layer output the same as that of its input.
At the end of the network, a sigmoid activation function is used to obtain a normalized output between 0 and 1.
We term our models as M-x, where x represents the maximum width of the network. In our experiments, we have considered x to be 16, 32, and 64. The M-64 model is shown in Fig. 2. In the M-32 model's architecture, the output dimensions of the residual blocks are set to 32 and the CNN blocks with output dimension 64 are removed. Similarly, in the case of M-16, the output dimensions of the residual blocks are set to 16 and the CNN blocks with output dimensions 32 and 64 are removed.
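A minimal Keras sketch of the M-64 variant is given below; the kernel size, stride, exact skip-connection points, and the internal structure of our modified residual block are assumptions, since those details are specified in Fig. 2 rather than in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu6(x, filters):
    x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)  # 3x3, stride 1 assumed
    x = layers.BatchNormalization()(x)
    return layers.ReLU(max_value=6.0)(x)                         # ReLU6, as stated above

def residual_block(x, filters):
    y = conv_bn_relu6(x, filters)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU(max_value=6.0)(layers.Add()([x, y]))      # modified residual block (details assumed)

def build_m64(input_shape=(256, 256, 3), out_channels=1):
    inp = tf.keras.Input(shape=input_shape)
    # Encoder: three convolution layers of increasing width, then five residual blocks.
    e1 = conv_bn_relu6(inp, 16)
    e2 = conv_bn_relu6(e1, 32)
    e3 = conv_bn_relu6(e2, 64)
    x = e3
    for _ in range(5):
        x = residual_block(x, 64)
    # Decoder: five convolution layers with skip connections from the encoder layers.
    x = conv_bn_relu6(layers.Concatenate()([x, e3]), 64)
    x = conv_bn_relu6(x, 64)
    x = conv_bn_relu6(layers.Concatenate()([x, e2]), 32)
    x = conv_bn_relu6(layers.Concatenate()([x, e1]), 16)
    out = layers.Conv2D(out_channels, 3, padding="same", activation="sigmoid")(x)
    return tf.keras.Model(inp, out)
```

Dropping the 64-channel blocks (and, for M-16, also the 32-channel blocks) yields the narrower M-32 and M-16 variants.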
3.2 Loss function
The network is optimized by minimizing the loss function computed using Eq. 1:
$$\mathcal{L}(y, \hat{y}) = \lambda_1\, \mathcal{L}_{pixel}(y, \hat{y}) + \lambda_2\, \mathcal{L}_{feat}(y, \hat{y}) + \lambda_3\, \mathcal{L}_{style}(y, \hat{y}) \qquad (1)$$
Here, $\mathcal{L}_{pixel}$ is the norm-based pixel-level loss between the translated image $\hat{y}$ and the ground truth image $y$, computed in color space for color image cleanup and on gray-scale intensities for gray scale cleanup. In addition to the pixel-level loss function $\mathcal{L}_{pixel}$, we also employ the perceptual loss functions $\mathcal{L}_{feat}$ and $\mathcal{L}_{style}$ in Eq. 1.
Perceptual loss functions [Johnson2016Perceptual, rad19a, zhao17a] compute the difference between images $\hat{y}$ and $y$ at high-level feature representations extracted from a pre-trained CNN, such as one trained on the ImageNet image classification task. They are more robust in computing the distance between images than pixel-level loss functions.
In the context of developing light-weight document image cleanup models, perceptual loss functions serve an additional role of enabling transfer learning. The perceptual loss functions in Eq. 1 allow our low-capacity network to leverage representations learned by a much larger pre-trained network.
The perceptual loss has two components [Johnson2016Perceptual]: the feature reconstruction loss $\mathcal{L}_{feat}$ and the style loss $\mathcal{L}_{style}$. The feature reconstruction loss encourages the transformed image to be similar to the ground truth image at a high-level feature representation as computed by a pre-trained network $\phi$. Let $\phi_j(x)$ be the activations of the $j$-th layer of the pre-trained network $\phi$ for an input image $x$. Then, Eq. 2 represents the feature reconstruction loss:
$$\mathcal{L}_{feat}^{\phi, j}(\hat{y}, y) = \frac{1}{C_j H_j W_j} \left\lVert \phi_j(\hat{y}) - \phi_j(y) \right\rVert_2^2 \qquad (2)$$
where the shape of $\phi_j(x)$ is $C_j \times H_j \times W_j$. The feature reconstruction loss penalizes the transformed image when it deviates from the content of the ground truth image. Additionally, we should also penalize the transformed image if it deviates from the ground truth image in terms of common features, textures, etc. To achieve this, the style loss is incorporated as proposed in [Johnson2016Perceptual]. The style loss is represented in Eq. 3 as follows:
$$\mathcal{L}_{style}^{\phi}(\hat{y}, y) = \sum_{j \in J} \left\lVert G_j^{\phi}(\hat{y}) - G_j^{\phi}(y) \right\rVert_F^2 \qquad (3)$$
where $\phi$ represents the pre-trained CNN, $J$ represents the set of layers of $\phi$ used to compute the style loss, and $G_j^{\phi}$ represents a Gram matrix containing second-order feature covariances. Let $\phi_j(x)$ be the activation of the $j$-th layer of the pre-trained network $\phi$, where the shape of $\phi_j(x)$ is $C_j \times H_j \times W_j$. Then, the shape of the Gram matrix $G_j^{\phi}(x)$ is $C_j \times C_j$ and each of its elements is computed according to Eq. 4 as follows:
$$G_j^{\phi}(x)_{c, c'} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \phi_j(x)_{h, w, c}\, \phi_j(x)_{h, w, c'} \qquad (4)$$
In our work, we use the VGG network [simonyan2014deep] trained on the ImageNet classification task [imagenet_cvpr09] as our pre-trained network $\phi$ for the perceptual loss.
Here, feature reconstruction loss is computed at layer conv1-2 and style reconstruction loss is computed at layers conv1-1, conv2-1, conv3-1, conv4-1, and conv5-1.
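A sketch of how Eqs. 1-4 can be implemented on top of Keras' pre-trained VGG16 is shown below; the specific VGG variant, the loss weights, the $\ell_1$ pixel term, and the tiling of 1-channel outputs are assumptions, while the layer choices follow the conv1-2 and conv1-1 to conv5-1 selection stated above:

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
FEAT_LAYER = "block1_conv2"                                    # conv1-2
STYLE_LAYERS = ["block1_conv1", "block2_conv1", "block3_conv1",
                "block4_conv1", "block5_conv1"]                # conv1-1 ... conv5-1
extractor = tf.keras.Model(
    vgg.input, [vgg.get_layer(n).output for n in [FEAT_LAYER] + STYLE_LAYERS])
extractor.trainable = False

def gram_matrix(phi):
    # Eq. 4: channel-by-channel second-order statistics of the activations.
    b, h, w, c = tf.unstack(tf.shape(phi))
    feats = tf.reshape(phi, [b, h * w, c])
    return tf.matmul(feats, feats, transpose_a=True) / tf.cast(h * w * c, tf.float32)

def _prep(img):
    # VGG expects 3-channel inputs in its own preprocessing range; network outputs are in [0, 1].
    if img.shape[-1] == 1:
        img = tf.tile(img, [1, 1, 1, 3])
    return tf.keras.applications.vgg16.preprocess_input(img * 255.0)

def total_loss(y_true, y_pred, w_pix=1.0, w_feat=1.0, w_style=1.0):   # weights assumed
    f_true, f_pred = extractor(_prep(y_true)), extractor(_prep(y_pred))
    l_pix = tf.reduce_mean(tf.abs(y_true - y_pred))                   # pixel-level term of Eq. 1
    l_feat = tf.reduce_mean(tf.square(f_true[0] - f_pred[0]))         # Eq. 2 (mean-normalized)
    l_style = tf.add_n([tf.reduce_mean(tf.square(gram_matrix(a) - gram_matrix(b)))
                        for a, b in zip(f_true[1:], f_pred[1:])])     # Eq. 3
    return w_pix * l_pix + w_feat * l_feat + w_style * l_style
```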
4 Experimental results and discussion
We evaluate the generalization performance of the proposed models on binarization, gray scale, and color cleanup tasks.
Model | Mult-Adds (in billions) | Parameters (in millions) | Size (in KB) | Inference time (in seconds) | Load time (in seconds)
---|---|---|---|---|---
DE-GAN [DEGAN_TPAMI20] | |||||
SkipNetModel [Skip-Connected_ICPR18] | |||||
M-64 (proposed) | |||||
M-32 (proposed) | |||||
M-16 (proposed) |
Experimental setup: In our experiments, the input to the network is a 3-channel RGB image of a fixed spatial size, whereas the number of output channels is set to 1 or 3 depending on the downstream task. If the downstream task is to obtain a gray scale or binary image, then the output has 1 channel; for the color cleanup task, the output has 3 channels.
To handle different types of noise at various resolutions, the training images are processed at three different scales.
Further, at each scale, the training images are divided into overlapping blocks.
During training, a few random patches from the training images are also used for data augmentation with random brightness-contrast, JPEG noise, ISO noise, and various types of blur [albumination].
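A minimal sketch of this patch extraction and augmentation step is given below; the patch size, stride, scales, and augmentation probabilities are assumptions, and the transforms come from the albumentations library [albumination]:

```python
import albumentations as A
import cv2

# Random brightness/contrast, JPEG noise, ISO noise, and blur, as listed above.
augment = A.Compose([
    A.RandomBrightnessContrast(p=0.5),
    A.ImageCompression(p=0.3),
    A.ISONoise(p=0.3),
    A.OneOf([A.GaussianBlur(), A.MotionBlur()], p=0.3),
])

def training_patches(image, patch=256, stride=128, scales=(0.75, 1.0, 1.25)):
    """Yield augmented overlapping patches of `image` at several scales."""
    for s in scales:
        scaled = cv2.resize(image, None, fx=s, fy=s)
        h, w = scaled.shape[:2]
        for y in range(0, max(h - patch + 1, 1), stride):
            for x in range(0, max(w - patch + 1, 1), stride):
                crop = scaled[y:y + patch, x:x + patch]
                if crop.shape[:2] == (patch, patch):
                    yield augment(image=crop)["image"]
```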
A randomly selected portion of the training patches is used to train the model while the remaining patches are kept for validation.
The model with the best validation performance is saved as the final model.
The network is optimized using the Adam algorithm [adam] with default parameter settings.
The weighting parameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ of Eq. 1 are kept fixed throughout our experiments.
During inference, an input image is divided into overlapping blocks, each block is processed by the trained model, and the resulting patches are merged to obtain the final output. We use simple averaging for the overlapping pixels of the patches.
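The overlap-and-average inference procedure can be sketched as follows (patch size, stride, and the zero-padding of border patches are assumptions; `model` is a Keras model such as the one sketched in Sec. 3.1):

```python
import numpy as np

def predict_full_image(model, image, patch=256, stride=192):
    """Run the model patch-wise over `image` (H x W x C, float in [0, 1]) and average overlaps."""
    h, w, c = image.shape
    out = np.zeros((h, w, model.output_shape[-1]), dtype=np.float32)
    weight = np.zeros((h, w, 1), dtype=np.float32)
    ys = sorted(set(list(range(0, max(h - patch, 0) + 1, stride)) + [max(h - patch, 0)]))
    xs = sorted(set(list(range(0, max(w - patch, 0) + 1, stride)) + [max(w - patch, 0)]))
    for y in ys:
        for x in xs:
            block = image[y:y + patch, x:x + patch]
            ph, pw = block.shape[:2]
            padded = np.zeros((patch, patch, c), dtype=np.float32)
            padded[:ph, :pw] = block                      # zero-pad border patches (assumption)
            pred = model.predict(padded[None], verbose=0)[0]
            out[y:y + ph, x:x + pw] += pred[:ph, :pw]
            weight[y:y + ph, x:x + pw] += 1.0
    return out / np.maximum(weight, 1e-6)                 # simple averaging of overlaps
```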
Compared algorithms:
We compare the proposed models with recently proposed deep CNN based document image cleanup models: SkipNetModel [Skip-Connected_ICPR18] and DE-GAN [DEGAN_TPAMI20].
Table 1 presents a comparative analysis of our proposed models with SkipNetModel and DE-GAN in terms of: (i) the number of multiplication and addition operations (Mult-Adds) associated with the model [howard17a], (ii) the number of parameters, (iii) the actual size on device, (iv) the model load time, and (v) the model inference time. A comparison with respect to these factors is essential if the applicability of a model to memory and energy constrained devices is to be determined. We implemented the models using TensorFlow Lite.
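Since the models are implemented with TensorFlow Lite, the size, load time, and single-patch inference time of a converted model can be measured as in the following sketch (the model path is a placeholder and this is not necessarily the exact measurement setup used for Table 1):

```python
import time
import numpy as np
import tensorflow as tf

keras_model = tf.keras.models.load_model("m64_keras_model")   # placeholder path
tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(keras_model).convert()
with open("m64.tflite", "wb") as f:
    f.write(tflite_bytes)
print("size on disk (KB):", len(tflite_bytes) / 1024)

start = time.time()
interpreter = tf.lite.Interpreter(model_path="m64.tflite")
interpreter.allocate_tensors()
print("load time (s):", time.time() - start)

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=np.float32))
start = time.time()
interpreter.invoke()
print("inference time (s):", time.time() - start)
```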
Model | F-measure | Pseudo F-measure ($F_{ps}$) | PSNR | DRD
---|---|---|---|---
Otsu [otsu] | ||||
Sauvola et.al. [Sauvola00adaptivedocument] | ||||
Tensmeyer et.al. [Tensmeyer_Binary_ICDAR17] | ||||
Vo et.al. [Vo_bin_PR18] | ||||
DE-GAN [DEGAN_TPAMI20] | ||||
SkipNetModel [Skip-Connected_ICPR18] | ||||
M-64 (proposed) | ||||
M-32 (proposed) | ||||
M-16 (proposed) |
[Figure: qualitative results. Columns: input | ground truth | SkipNetModel [Skip-Connected_ICPR18] | M-32]
4.1 Binarization
We begin by discussing our results on document image binarization task.
For our experiment, we have considered the publicly available binarization datasets DIBCO13 [DIBCO13] and DIBCO17 [DIBCO17] as test sets.
The proposed models and SkipNetModel are trained on the datasets [DIBCO09], [DIBCO10], [DIBCO11], [DIBCO12], [DIBCO14], [DIBCO16] and [DIBCO18].
While training the models for the test set DIBCO13 [DIBCO13], we also include the dataset DIBCO17 [DIBCO17] into our training data.
The models for this task are trained using the augmentation strategy described in Sec. 4.
The same training strategy is also followed while training the models for the task DIBCO17 [DIBCO17].
The models are compared using the DIBCO13 [DIBCO13] evaluation criteria: F-measure, pseudo F-measure ($F_{ps}$), peak signal to noise ratio (PSNR), and distance reciprocal distortion (DRD).
For the metrics F-measure, pseudo F-measure, and PSNR, higher values indicate better performance, whereas for DRD, lower values are better.
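For reference, the F-measure and PSNR can be computed as in the following sketch (DRD and the pseudo F-measure require the official DIBCO weight and skeleton maps and are therefore omitted; foreground text pixels are assumed to be encoded as 1):

```python
import numpy as np

def f_measure(pred, gt):
    """pred, gt: binary arrays with foreground (text) pixels equal to 1."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

def psnr(pred, gt, max_val=1.0):
    """PSNR between two images with intensities in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10 * np.log10((max_val ** 2) / max(mse, 1e-12))
```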
[Figure: qualitative results. Columns: input | ground truth | SkipNetModel [Skip-Connected_ICPR18] | M-32]
Model | F-measure | Pseudo F-measure ($F_{ps}$) | PSNR | DRD
---|---|---|---|---
10 [DIBCO17] | ||||
17a [DIBCO17] | ||||
12 [DIBCO17] | ||||
1b [DIBCO17] | ||||
1a [DIBCO17] | ||||
DE-GAN [DEGAN_TPAMI20] | ||||
SkipNetModel [Skip-Connected_ICPR18] | ||||
M-64 (proposed) | ||||
M-32 (proposed) | ||||
M-16 (proposed) |
[Figure: qualitative results. Columns: input | ground truth | SkipNetModel [Skip-Connected_ICPR18] | M-32]
[Figure: results on real noisy documents. Columns: input | SkipNetModel [Skip-Connected_ICPR18] | M-32]
4.2 Gray scale and color cleanup
For the purpose of document cleanup, we first show the effectiveness of the proposed methods in gray scale. The gray scale cleanup part of our experiment is conducted on the publicly available NoisyOffice dataset [Dua:2019, NoisyOffice]. This dataset consists of two parts: a set of real noisy images and a synthetic dataset. As no ground-truth images are available for the real data, we are unable to include the real dataset in the quantitative analysis of our experiment. The models are trained and evaluated on the synthetic data, which we divide into a training set and a test set. The authors of [DEGAN_TPAMI20] did not share the saved model for gray scale cleanup; therefore, for this experiment, only the method proposed in [Skip-Connected_ICPR18] is used for comparison. To measure the capability of the proposed models in removing noise, we adopt the peak signal to noise ratio (PSNR) as the quality metric. In order to determine the dependence of model performance on the amount of available training data, we train SkipNetModel and the proposed models M-64, M-32, and M-16 while varying the amount of training data. The performance of the models with respect to the PSNR score is shown in Fig. 7. It can be observed from this figure that, given sufficient training data, SkipNetModel [Skip-Connected_ICPR18] performs better than the proposed models. However, it can also be observed that the performance of the proposed models remains more or less the same across training-set sizes, whereas the performance of SkipNetModel varies considerably with the amount of training data. Typical inputs and ground truths from the test set, along with the corresponding model outputs, are shown in Fig. 5. We also show a few examples of inputs and outputs of the trained models on real data in Fig. 6. It can be seen from this figure that the model M-32 performs better than SkipNetModel in a few of the examples. From Figs. 6 and 7, we conclude that the proposed model generalizes better and performs more robustly than SkipNetModel.
Model | SSIM | PSNR |
---|---|---
M-64 (proposed) | ||
M-32 (proposed) | ||
M-16 (proposed) |
Finally, we present our experimental results on the document color cleanup task. One challenging aspect of color cleanup is the preservation of the color of the foreground pixels. For this experiment, we used a color dataset consisting of 250 mobile-captured images, each of which is manually cleaned. We followed the same training strategy as described in Sec. 4. Random real-life images (not belonging to the train/test set) and their corresponding outputs are shown in Fig. 8. From this figure, we can observe that the proposed model performs a decent color cleanup of document images. To provide a quantitative measure of our method, we also report the PSNR and structural similarity index (SSIM) scores on the test set of the data in Table 4.
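These scores can be computed on a test pair with off-the-shelf implementations, e.g., scikit-image, as in the following sketch (file paths are placeholders):

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

pred = cv2.imread("cleaned.png")        # model output
gt = cv2.imread("ground_truth.png")     # manually cleaned reference
print("PSNR:", peak_signal_noise_ratio(gt, pred))
print("SSIM:", structural_similarity(gt, pred, channel_axis=-1))  # channel_axis for RGB images
```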
[Fig. 8: color cleanup results on real-life document images: panels (a) and (b)]
5 Conclusion
We have proposed an encoder-decoder based document cleanup model for resource constrained environments. To this end, we design a light-weight deep network with only a few residual blocks and skip connections. Our loss function incorporates the perceptual loss, which enables transfer learning from pre-trained deep CNN networks.
We develop three models based on our network design, with varying network width. In terms of the number of parameters and product-sum operations, our models are, respectively, 65-1030 and 3-27 times smaller than a recently proposed GAN based document enhancement model [DEGAN_TPAMI20]. In spite of our relatively low network capacity, the generalization performance of our models on various benchmarks is encouraging and comparable with several document image cleanup techniques with deep architectures, such as [Tensmeyer_Binary_ICDAR17, Skip-Connected_ICPR18]. In addition, our models are more robust to the low training data regime than [Skip-Connected_ICPR18]. Hence, the proposed models offer a favorable trade-off between memory/latency and accuracy, making them suitable for mobile document image cleanup applications.