Light-weight Document Image Cleanup using Perceptual Loss

by   Soumyadeep Dey, et al.

Smartphones have enabled effortless capturing and sharing of documents in digital form. The documents, however, often undergo various types of degradation due to aging, stains, or shortcomings of the capturing environment such as shadow, non-uniform lighting, etc., which reduce the comprehensibility of the document images. In this work, we consider the problem of document image cleanup in embedded applications such as smartphone apps, which usually have memory, energy, and latency limitations due to the device and/or the need for a good user experience. We propose a light-weight encoder-decoder based convolutional neural network architecture for removing the noisy elements from document images. To compensate for the loss in generalization performance caused by a low network capacity, we incorporate the perceptual loss in our loss function for knowledge transfer from a pre-trained deep CNN. In terms of the number of parameters and product-sum operations, our models are 65-1030 and 3-27 times smaller, respectively, than existing state-of-the-art document enhancement models. Overall, the proposed models offer a favorable resource versus accuracy trade-off, and we empirically illustrate the efficacy of our approach on several real-world benchmark datasets.




1 Introduction

Smartphone cameras have simplified the capture of various physical documents in digital form. The ease of sharing digital documents (e.g., via messaging/networking apps) has made them a popular means of information dissemination. However, the readability of such digitized documents is hampered when the (original) physical document is degraded. For instance, the physical document may contain extraneous elements like stains, wrinkles, and ink spills, or may degrade over time. As a result, when such documents are scanned (e.g., via a flat-bed scanner), these elements also get incorporated into the document image. When document images are captured via mobile cameras, they are additionally prone to shadow, non-uniform lighting, light from multiple sources, light source occlusion, etc. Such noisy elements not only affect the comprehensibility of the digitized document for human readers, but may also break the automatic (document-image) processing/understanding pipeline in various applications (e.g., OCR, bar code reading, form detection, table detection, etc.). A few instances of noisy document images are shown in Fig. 1.

Given an input noisy document image, the aim of document image cleanup is to improve its readability and visibility by removing the noisy elements. While general (natural scene) image restoration has been traditionally explored by the computer vision community, recent works have also focused on developing cleanup techniques for document images depending on the type of noise and document class. These include foreground-background separation [adotsu_pr12, binary_fcmean_icdar19, Sauvola00adaptivedocument], the differential fading problem [Liu_Binary_ICDAR17, sayed10], removal of shadow/smear/stain [BEDSR-Net_CVPR20, Liu_Binary_ICDAR17, Water-Filling_ACCV18, Tensmeyer_Binary_ICDAR17, valizadeh_binarization, shadow_icip2019, shadow_icassp20], and handling ink bleed [silva08, Tensmeyer_Binary_ICDAR17].

(a) (b)
Figure 1: Typical examples of noisy document images; (a) real world degraded images from various datasets [Dua:2019, DIBCO17, NoisyOffice], (b) noisy real world images captured via mobile devices.

Recent works [gangeh2019document, illu_cor2019, Skip-Connected_ICPR18] view document cleanup as an image-to-image translation problem, modeled using deep networks. A general direction of research has been to explore deeper and more complicated networks in order to achieve better accuracy [DEGAN_TPAMI20, Skip-Connected_ICPR18]. However, such deep networks often require high computational resources, which are beyond the reach of many mobile and embedded applications running on computationally limited platforms. Deeper networks also usually entail a higher inference time (latency), which document image processing mobile apps such as Adobe Lens, CamScanner, Microsoft Office Lens, etc., aim to minimize for a good user experience.

In this work, we propose a light-weight encoder-decoder based convolutional neural network (CNN) with skip-connections for cleaning up document images. Focusing on memory constrained mobile and embedded devices, we design a light-weight deep network architecture. It should be noted that a light-weight architecture usually comes at the cost of generalization performance when compared with deeper networks. Hence, in order to obtain a healthy balance between resource/latency and accuracy, we propose to employ a perceptual loss function (instead of the more popular per-pixel loss function) for document image cleanup. The perceptual loss function enables transfer learning by comparing high-level representations of images obtained from a pre-trained CNN (e.g., one trained on an image classification task). We empirically show the effectiveness of the proposed network on several real-world benchmark datasets.

The outline of the paper is as follows. We discuss the existing literature in Section 2. In Section 3, we detail our methodology. The empirical results are presented in Section 4, while Section 5 concludes the paper.

2 Related work

In this section, we briefly discuss existing approaches that aim to recover/enhance images of degraded documents via techniques involving binarization, illumination/shadow correction, and deblurring, among others.

Document image binarization: A popular framework for document image cleanup is background-foreground separation [pg_seg_book], where the foreground pixels are preserved and enhanced and the background is made uniform. Binarization is a technique to segment the foreground pixels from the background pixels. Analytical techniques for document image binarization segment the foreground and background pixels based on some form of thresholding. Traditional image binarization techniques such as [otsu] compute a global threshold, assuming that the pixel intensity distribution follows a bi-modal histogram. As estimating such thresholds may be difficult for degraded document images, Moghaddam and Cheriet [adotsu_pr12] proposed an adaptive generalization of Otsu’s method [otsu] for document image binarization. In [Sauvola00adaptivedocument], Sauvola and Pietikäinen proposed a local adaptive thresholding method for the image binarization task. To improve the performance of Sauvola’s algorithm in low contrast settings, Lazzara and Geraud developed its multi-scale generalization in [lazzara_IJDAR14]. Recent works on document image binarization have also explored techniques based on conditional random fields [peng_binary_icdar13], fuzzy C-means clustering [binary_fcmean_icdar19], robust regression [Vo_bin_PR18], and maximum entropy classification [Liu_Binary_ICDAR17].
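To make the global-thresholding idea concrete, the following minimal sketch computes an Otsu-style threshold by maximizing the between-class variance of the gray-level histogram. It is an illustration in plain Python and not the implementation used in any of the cited works; the function names `otsu_threshold` and `binarize`, as well as the toy pixel list, are assumptions made for this example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the darker class so far
    sum0 = 0  # sum of gray levels in the darker class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the split is not meaningful
        mean0 = sum0 / w0
        mean1 = (total_sum - sum0) / w1
        # Between-class variance, up to a constant factor 1 / total**2.
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, t):
    """Map pixels above the threshold to 1 (paper) and the rest to 0 (ink)."""
    return [1 if p > t else 0 for p in pixels]


# Toy bimodal "image": dark ink around 20-30, bright paper around 200-220.
toy = [20, 25, 30, 22, 210, 200, 220, 215, 205, 28]
t = otsu_threshold(toy)
binary = binarize(toy, t)
```

On a clean bimodal histogram like the toy one, any threshold between the two modes separates ink from paper. On degraded documents the histogram is rarely this well separated, which is the motivation for the adaptive and local methods discussed above.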

Deep convolutional neural networks (CNNs) have become all-pervasive in computer vision ever since AlexNet [krizhevsky12a] won the ILSVRC 2012 ImageNet Challenge [russakovsky15a]. Tensmeyer and Martinez [Tensmeyer_Binary_ICDAR17] posed document image binarization as a pixel classification problem and developed a fully convolutional network for it. An encoder-decoder network was proposed in [deep_otsu_pr19] to estimate the background of a document image; Otsu’s global thresholding technique [otsu] is then used to obtain a binarized image with a uniform background. Afzal et al. [afzal_LSTM_HiP15] employed a long short-term memory (LSTM) network to classify each pixel as background or foreground by treating an image as a two-dimensional sequence of pixels. In [binary_attention_icdar19], Peng et al. proposed a multi-resolution attention model to learn the relationship between the text regions and the background through a convolutional conditional random field [krahenbuhl11a, teichmann19a]. To bypass the need for large training datasets with ground truths, Kang et al. [cascade_unet_binary_icdar19] employed modular U-Nets [ronneberger15a] pre-trained for specific tasks such as dilation, erosion, histogram equalization, etc. These U-Nets are cascaded using inter-module skip connections, and the final network is fine-tuned for the document image binarization task.

Document image enhancement: In addition to working within the background-foreground separation framework, existing works have developed noise-specific document image cleanup methods such as shadow removal. Bako et al. [Bako16] assume a constant background color and generate a shadow map that matches local background colors to a global reference. Similar to Bako’s method, local and global background colors are estimated to remove shadow from document images in [shadow_icip2019, shadow_icassp20]. Inspired by the way water fills a topological surface, Jung et al. proposed an illumination correction algorithm for document images in [Water-Filling_ACCV18]. A document image enhancement approach has been proposed by Kligler et al., which represents the input image as a 3D point cloud and adopts a visibility detection technique to detect the pixels to enhance [doc_enhance_pointcloud_cvpr18]. Recently, Lin et al. [BEDSR-Net_CVPR20] proposed a deep architecture to estimate (i) the global background color of the document, and (ii) an attention map which computes the probability of a pixel belonging to the shadow-free background. An illumination correction and document rectification technique using a patch-based encoder-decoder network has also been proposed.


Existing works have also explored deep networks for overall document enhancement rather than focusing on correcting specific document degradations. A skip-connected deep convolutional auto-encoder is proposed in [Skip-Connected_ICPR18]. Instead of learning the transformation function from input to output, this network learns the residual between input and output. This residual, when subtracted from the input image, results in a noise-free enhanced image. An end-to-end document enhancement framework using conditional Generative Adversarial Networks (cGAN) is proposed in [DEGAN_TPAMI20], where a U-Net based encoder-decoder architecture is used for the generator network.

Document image cleanup for mobile and embedded applications: Models with low resource consumption are desirable for mobile document image processing, e.g., in apps like Adobe Lens, CamScanner, Microsoft Office Lens, etc. However, the existing CNN based methods [DEGAN_TPAMI20, Skip-Connected_ICPR18] discussed above propose deep architectures with a huge number of parameters, making them unsuitable for memory and energy constrained devices. In this work, we propose a comparatively light-weight deep encoder-decoder based network for the document image enhancement task. We employ a perceptual loss based transfer learning technique to compensate for the loss in generalization performance caused by a low network capacity.

3 Proposed approach

As discussed, we propose a light-weight deep network, suitable for mobile document image cleanup applications.

3.1 Network architecture

We design an encoder-decoder based image-to-image translation network. The encoder part of the model consists of three convolution layers followed by five residual blocks. Residual blocks were first introduced in [RESNET_CVPR2016] for generic image processing tasks. We modify the residual blocks from the original design [RESNET_CVPR2016] to suit our network design. The decoder part of the model consists of five convolution layers along with skip connections from the encoder layers. This type of skip connection helps mitigate the vanishing gradient and exploding gradient issues [RESNET_CVPR2016, Skip-Connected_ICPR18, DEGAN_TPAMI20]. Hence, the skip connections help to simplify the overall learning of the network. Each convolution layer is followed by a batch normalization layer and a ReLU6 [relu6] activation layer.

All convolution layers use the same kernel size, and the stride of every layer is set to 1. The padding at each layer is set to “same”, which pads the input such that it is fully covered by the filter. “Same” padding with a stride of 1 keeps the spatial dimensions of a convolution layer’s output the same as those of its input. At the end of the network, a sigmoid activation function is used to obtain a normalized output between 0 and 1. The output dimension of the last layer of the decoder is either one or three, depending on the end task of the network. If the network is trained for binarization or gray scale cleanup, the output dimension of the last layer is set to one. For the color cleanup task, the output dimension of the last layer is set to three.
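The layer-level conventions above can be illustrated numerically. The sketch below is a plain-Python illustration of “same” padding, ReLU6, and the final sigmoid, not the network code itself; all helper names are hypothetical, and the padding rule follows the common convention in which the output size is the input size divided by the stride, rounded up.

```python
import math


def same_padding_output_size(input_size, stride):
    """Spatial output size of a convolution with "same" padding.

    With "same" padding the output size depends only on the input size and
    the stride, not on the kernel size: ceil(input_size / stride).
    """
    return math.ceil(input_size / stride)


def same_padding_total(input_size, kernel_size, stride):
    """Total zero padding (both sides combined) needed along one dimension."""
    out = same_padding_output_size(input_size, stride)
    return max((out - 1) * stride + kernel_size - input_size, 0)


def relu6(x):
    """ReLU6 activation: clip the input to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def sigmoid(x):
    """Logistic sigmoid, mapping any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Illustrative values only: a 128-pixel-wide input and a 3-wide kernel.
width_after_conv = same_padding_output_size(128, stride=1)
padding_needed = same_padding_total(128, kernel_size=3, stride=1)
```

With a stride of 1 the spatial size is unchanged, which is what allows encoder and decoder feature maps of equal resolution to be combined through skip connections. The bounded range of ReLU6 is often preferred on mobile hardware because it behaves well under low-precision arithmetic.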

We term our models M-x, where x represents the maximum width of the network. In our experiments, we have considered x to be 64, 32, and 16. The M-64 model is shown in Fig. 2. In the M-32 architecture, the output dimensions of the residual blocks are set to 32 and the CNN blocks with output dimension 64 are removed. Similarly, in the case of M-16, the output dimensions of the residual blocks are set to 16 and the CNN blocks with output dimensions 32 and 64 are removed.

Figure 2: Proposed light-weight CNN architecture used for document image cleanup

3.2 Loss function

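The network is optimized by minimizing the loss function given in Eq. 1:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{pix}} + \lambda_2 \mathcal{L}_{\mathrm{feat}} + \lambda_3 \mathcal{L}_{\mathrm{style}} \qquad (1)$$

Here, $\mathcal{L}_{\mathrm{pix}}$ is the norm loss between the translated image and the ground truth image, computed in color space for color image cleanup and in gray scale for gray scale cleanup. In addition to the pixel-level loss function $\mathcal{L}_{\mathrm{pix}}$, we also employ the perceptual loss functions $\mathcal{L}_{\mathrm{feat}}$ and $\mathcal{L}_{\mathrm{style}}$ in Eq. 1; $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weights of the three terms.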

Perceptual loss functions [Johnson2016Perceptual, rad19a, zhao17a] compute the difference between two images at high-level feature representations extracted from a pre-trained CNN, such as one trained on the ImageNet image classification task. They are more robust in computing the distance between images than pixel-level loss functions. In the context of developing light-weight document image cleanup models, perceptual loss functions serve the additional role of enabling transfer learning. The perceptual loss functions in Eq. 1 help to transfer the semantic knowledge already learned by the pre-trained CNN to our smaller network.

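The perceptual loss has two components [Johnson2016Perceptual]: the feature reconstruction loss $\mathcal{L}_{\mathrm{feat}}$ and the style loss $\mathcal{L}_{\mathrm{style}}$. The feature reconstruction loss encourages the transformed image $\hat{y}$ to be similar to the ground truth image $y$ at the high-level feature representation computed by a pre-trained network $\phi$. Let $\phi_j(x)$ be the activations of the $j$-th layer of the pre-trained network $\phi$ for an input image $x$. Then, Eq. 2 represents the feature reconstruction loss:

$$\mathcal{L}_{\mathrm{feat}}^{\phi,j}(\hat{y}, y) = \frac{1}{C_j H_j W_j} \left\| \phi_j(\hat{y}) - \phi_j(y) \right\|_2^2 \qquad (2)$$

where the shape of $\phi_j(x)$ is $C_j \times H_j \times W_j$. The feature reconstruction loss penalizes the transformed image when it deviates from the content of the ground truth image. Additionally, we should also penalize the transformed image if it deviates from the ground truth image in terms of common features, texture, etc. To achieve this, the style loss is incorporated as proposed in [Johnson2016Perceptual]. The style loss is represented in Eq. 3 as follows:

$$\mathcal{L}_{\mathrm{style}}^{\phi,J}(\hat{y}, y) = \sum_{j \in J} \left\| G_j^{\phi}(\hat{y}) - G_j^{\phi}(y) \right\|_F^2 \qquad (3)$$

where $\phi$ represents the pre-trained CNN, $J$ represents the set of layers of $\phi$ used to compute the style loss, and $G_j^{\phi}$ represents a Gram matrix containing second-order feature covariances. Let $\phi_j(x)$ be the activations of the $j$-th layer of the pre-trained network $\phi$, where the shape of $\phi_j(x)$ is $C_j \times H_j \times W_j$. Then, the shape of the Gram matrix $G_j^{\phi}(x)$ is $C_j \times C_j$ and each element of $G_j^{\phi}(x)$ is computed according to Eq. 4 as follows:

$$G_j^{\phi}(x)_{c,c'} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \phi_j(x)_{c,h,w} \, \phi_j(x)_{c',h,w} \qquad (4)$$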
In our work, we use the VGG network [simonyan2014deep] trained on the ImageNet classification task [imagenet_cvpr09] as our pre-trained network for the perceptual loss. Here, the feature reconstruction loss is computed at layer conv1-2 and the style reconstruction loss is computed at layers conv1-1, conv2-1, conv3-1, conv4-1, and conv5-1.
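As a rough illustration of how the feature reconstruction loss, the Gram matrix, and the style loss described above fit together, consider the plain-Python sketch below. It follows the formulation of [Johnson2016Perceptual] under stated assumptions: the activations are toy lists rather than outputs of a pre-trained network, every function name is hypothetical, and the weights passed to `total_loss` are placeholders rather than the values used in our experiments.

```python
def feature_reconstruction_loss(feat_pred, feat_true):
    """Squared difference between two feature maps, normalized by C*H*W.

    Each feature map is a list of C channels; each channel is a flat list
    of H*W activations, so the total element count equals C*H*W.
    """
    n = sum(len(ch) for ch in feat_pred)
    sq = sum((a - b) ** 2
             for ch_p, ch_t in zip(feat_pred, feat_true)
             for a, b in zip(ch_p, ch_t))
    return sq / n


def gram_matrix(feat):
    """C x C matrix of channel-wise inner products, scaled by 1/(C*H*W)."""
    c = len(feat)
    hw = len(feat[0])
    scale = 1.0 / (c * hw)
    return [[scale * sum(a * b for a, b in zip(feat[i], feat[j]))
             for j in range(c)] for i in range(c)]


def style_loss(feats_pred, feats_true):
    """Sum over layers of the squared Frobenius distance of Gram matrices."""
    total = 0.0
    for fp, ft in zip(feats_pred, feats_true):
        gp, gt = gram_matrix(fp), gram_matrix(ft)
        total += sum((x - y) ** 2
                     for row_p, row_t in zip(gp, gt)
                     for x, y in zip(row_p, row_t))
    return total


def total_loss(pixel_loss, feat_loss, sty_loss, weights):
    """Weighted sum of the three loss terms; the weights are placeholders."""
    w1, w2, w3 = weights
    return w1 * pixel_loss + w2 * feat_loss + w3 * sty_loss


# Toy activations: 2 channels on a 2x2 grid, flattened to 4 values each.
pred = [[1.0, 0.0, 0.0, 1.0], [0.0, 2.0, 0.0, 0.0]]
true = [[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
l_feat = feature_reconstruction_loss(pred, true)
l_style = style_loss([pred], [true])  # a single "layer" for illustration
```

The Gram matrix discards spatial positions and keeps only channel co-activation statistics, which is why the style term captures texture-like properties while the feature term is sensitive to content layout.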

4 Experimental results and discussion

We evaluate the generalization performance of the proposed models on binarization, gray scale, and color cleanup tasks.

Model | Mult-Adds (in billions) | Parameters (in millions) | Size (in KB) | Inference time (in seconds) | Load time (in seconds)
SkipNetModel [Skip-Connected_ICPR18]
M-64 (proposed)
M-32 (proposed)
M-16 (proposed)
Table 1: No. of parameters, product-sum operations, and other statistics of various models. Average per-patch inference time is reported. As an example, a image has 100 patches.

Experimental setup: In our experiments, the network takes a fixed-size, 3-channel RGB image patch as input, whereas the number of output channels is set to one or three depending on the downstream task. If the downstream task is to obtain a gray scale or a binary image, the output has one channel. For the color cleanup task, the output has three channels.

To handle different types of noise at various resolutions, the training images are rescaled to three different scales. At each scale, the training images are divided into overlapping fixed-size blocks. During training, a few random patches from the training images are also used for data augmentation using random brightness-contrast, JPEG noise, ISO noise, and various types of blur [albumination]. A randomly selected subset of the training patches is used to train the model, while the remainder is kept for validation. The model with the best validation performance is saved as the final model. The network is optimized using the Adam algorithm [adam] with default parameter settings. The weights of the three loss terms in Eq. 1 are set to fixed values. During inference, an input image is divided into overlapping blocks of the same size. Each patch is processed by the trained model. Finally, all the patches are merged to obtain the final result. We use simple averaging for the overlapping pixels of the patches.
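The patch-based inference procedure described above (split into overlapping blocks, run the model on each, average the overlaps) can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: `patch_starts` and `infer_by_patches` are hypothetical names, the image is a 2-D list of floats no smaller than the patch, and the patch size and stride in the usage lines are arbitrary toy values rather than the settings used in our experiments.

```python
def patch_starts(length, patch, stride):
    """Start offsets of overlapping patches that cover [0, length)."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # align the last patch to the border
    return starts


def infer_by_patches(image, patch, stride, model):
    """Run `model` on overlapping patches and average overlapping pixels.

    `image` is a list of rows (2-D list of floats); `model` maps a patch
    (2-D list) to a cleaned patch of the same shape.
    """
    h, w = len(image), len(image[0])
    acc = [[0.0] * w for _ in range(h)]  # running sum of predictions
    cnt = [[0] * w for _ in range(h)]    # number of patches covering a pixel
    for top in patch_starts(h, patch, stride):
        for left in patch_starts(w, patch, stride):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            out = model(tile)
            for i, out_row in enumerate(out):
                for j, value in enumerate(out_row):
                    acc[top + i][left + j] += value
                    cnt[top + i][left + j] += 1
    return [[acc[i][j] / cnt[i][j] for j in range(w)] for i in range(h)]


# Toy usage: a 5x6 image, 4x4 patches, stride 2, and an identity "model".
image = [[float(r * 6 + c) for c in range(6)] for r in range(5)]
restored = infer_by_patches(image, patch=4, stride=2, model=lambda t: t)
```

Averaging the overlapping predictions smooths out seams at patch borders, at the cost of running the model on more patches than a non-overlapping tiling would need.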

Compared algorithms: We compare the proposed models with recently proposed deep CNN based document image cleanup models: SkipNetModel [Skip-Connected_ICPR18] and DE-GAN [DEGAN_TPAMI20]. Table 1 presents a comparative analysis of our proposed models with SkipNetModel and DE-GAN in terms of: (i) number of multiplication and addition operations (Mult-Adds) associated with the model [howard17a], (ii) number of parameters, (iii) actual size on device, (iv) model load time, and (v) model inference time. A comparison with respect to these quantities is essential if the applicability of a model to memory and energy constrained devices is to be determined. We implemented the models using TensorFlow Lite on an Android device with a Qualcomm SM8150 Snapdragon 855 chipset and 6GB of RAM. On the device, we observe that our models are 65-1090 and 3-55 times lighter in size than DE-GAN and SkipNetModel, respectively. Similarly, our models have fewer product-sum operations and a lower prediction time during the inference stage, making them suitable for mobile and embedded applications.
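For readers unfamiliar with these cost measures, the sketch below shows how the parameter count and the Mult-Adds of a single standard convolution layer are typically computed, following the usual accounting in [howard17a]. The function names are hypothetical and the layer dimensions in the usage lines are illustrative assumptions; they are not taken from Table 1.

```python
def conv2d_params(kernel, c_in, c_out, bias=True):
    """Number of trainable parameters in a standard 2-D convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def conv2d_mult_adds(kernel, c_in, c_out, out_h, out_w):
    """Multiply-accumulate operations of a standard 2-D convolution.

    Each of the out_h * out_w * c_out output values requires
    kernel * kernel * c_in multiply-adds.
    """
    return kernel * kernel * c_in * c_out * out_h * out_w


# Illustrative layer: 3x3 kernel, 3 input channels, 16 output channels,
# applied to a 256x256 feature map with "same" padding and stride 1.
params = conv2d_params(3, 3, 16)
mult_adds = conv2d_mult_adds(3, 3, 16, 256, 256)
```

Both quantities scale with the product of input and output channels, so halving the network width reduces the cost of an interior layer by roughly a factor of four. This is why the narrower M-32 and M-16 variants are so much cheaper than M-64.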

Model F-measure PSNR DRD
Otsu [otsu]
Sauvola [Sauvola00adaptivedocument]
Tensmeyer [Tensmeyer_Binary_ICDAR17]
Vo [Vo_bin_PR18]
SkipNetModel [Skip-Connected_ICPR18]
M-64 (proposed)
M-32 (proposed)
M-16 (proposed)
Table 2: Results on DIBCO13 [DIBCO13]
input ground truth SkipNetModel [Skip-Connected_ICPR18] M-32
Figure 3: Typical examples of DIBCO13 [DIBCO13].

4.1 Binarization

We begin by discussing our results on the document image binarization task. For our experiments, we consider the publicly available binarization datasets DIBCO13 [DIBCO13] and DIBCO17 [DIBCO17] as test sets. The proposed models and SkipNetModel are trained on the datasets [DIBCO09], [DIBCO10], [DIBCO11], [DIBCO12], [DIBCO14], [DIBCO16], and [DIBCO18]. While training the models for the test set DIBCO13 [DIBCO13], we also include the dataset DIBCO17 [DIBCO17] in our training data. The models for this task are trained using the augmentation strategy described in Sec. 4. The same training strategy is also followed while training the models for the test set DIBCO17 [DIBCO17]. The models are compared using the DIBCO13 [DIBCO13] evaluation criteria: F-measure, pseudo F-measure ($F_{ps}$), peak signal to noise ratio (PSNR), and distance reciprocal distortion (DRD). For the metrics F-measure, $F_{ps}$, and PSNR, higher values correspond to better performance, whereas for DRD lower is better. On the DIBCO13 dataset, we compare our methods with traditional binarization algorithms [otsu, Sauvola00adaptivedocument], state-of-the-art binarization techniques [Tensmeyer_Binary_ICDAR17, Vo_bin_PR18], DE-GAN [DEGAN_TPAMI20], and SkipNetModel [Skip-Connected_ICPR18]. The overall performance of these methods is reported in Table 2. In this table, the performance of the methods [otsu, Sauvola00adaptivedocument, DEGAN_TPAMI20, Tensmeyer_Binary_ICDAR17, Vo_bin_PR18] is taken from [DEGAN_TPAMI20]. From this table, it is evident that DE-GAN outperforms all other methods in terms of all metrics. However, the proposed models perform better than the traditional binarization algorithms [otsu, Sauvola00adaptivedocument] and more or less on par with the other state-of-the-art techniques. Moreover, from Tables 1 and 2, we can observe that although the proposed models do not outperform the state-of-the-art techniques, they perform similarly to most of them at a much lower computational and memory cost. We also compare the proposed models with the top 5 methods of the DIBCO17 competition [DIBCO17], SkipNetModel, and DE-GAN in Table 3. A similar trend relative to the state-of-the-art techniques is observed in this table. Typical examples from the datasets DIBCO13 and DIBCO17 are shown in Figs. 3 and 4.
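Two of the metrics above are simple enough to state in a few lines. The sketch below computes the F-measure and PSNR for images given as flat pixel lists; it is an illustration only and not the official DIBCO evaluation tool. The function names, the convention that 1 denotes a text pixel, and the toy data are assumptions made for this example, and the pseudo F-measure and DRD are omitted because they need additional weighting schemes.

```python
import math


def f_measure(pred, truth):
    """F-measure for binary images given as flat lists, with 1 = text pixel."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    if tp == 0:
        return 0.0  # no correctly detected text pixels
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def psnr(pred, truth, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Toy example: 8 pixels, one missed text pixel and one false alarm.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
score_f = f_measure(pred, truth)
score_psnr = psnr(pred, truth)
```

Because text usually occupies a small fraction of a page, the F-measure (computed on text pixels) is more sensitive to missed strokes than PSNR, which averages the error over all pixels.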

input ground truth SkipNetModel [Skip-Connected_ICPR18] M-32
Figure 4: Examples from DIBCO17 [DIBCO17].
Model F-measure PSNR DRD
10 [DIBCO17]
17a [DIBCO17]
12 [DIBCO17]
1b [DIBCO17]
1a [DIBCO17]
SkipNetModel [Skip-Connected_ICPR18]
M-64 (proposed)
M-32 (proposed)
M-16 (proposed)
Table 3: Results on DIBCO17 [DIBCO17]. Here 10, 17a, 12, 1b, and 1a are the top 5 methods from DIBCO 2017 competition [DIBCO17]
input ground truth SkipNetModel [Skip-Connected_ICPR18] M-32
Figure 5: Typical examples of noisy images from our test set of synthetic data from [NoisyOffice].
input SkipNetModel [Skip-Connected_ICPR18] M-32
Figure 6: Typical examples of noisy images of real data from [NoisyOffice].

4.2 Gray scale and color cleanup

For the purpose of document cleanup, we first show the effectiveness of the proposed models in gray scale. The gray scale cleanup part of our experiment is conducted on the publicly available dataset NoisyOffice [Dua:2019, NoisyOffice]. This dataset consists of two parts: a set of real noisy images and a set of synthetic noisy images. No ground truth images are available for the real data; therefore, we are not able to include the real dataset in the quantitative analysis of our experiment. The models are trained and evaluated on the synthetic data, which we divide into a training part and a testing part. The authors of [DEGAN_TPAMI20] did not share the saved model for gray scale cleanup. Therefore, for this experiment, only the method proposed in [Skip-Connected_ICPR18] is used for comparison. To measure the capability of the proposed models with respect to removing noise, we adopt peak signal to noise ratio (PSNR) as the quality metric. In order to determine the dependence of the model performance on the amount of available training data, we train SkipNetModel and the proposed models M-64, M-32, and M-16 while varying the amount of training data. The performance of the models with respect to the PSNR score is shown in Fig. 7. It can be observed from this figure that, with enough training data, SkipNetModel [Skip-Connected_ICPR18] performs better than the proposed models. However, the performance of the proposed models remains more or less the same across training-set sizes, whereas the performance of SkipNetModel varies considerably with the amount of training data. Typical examples of inputs and ground truths from the test set, along with the outputs of the trained models, are shown in Fig. 5. We also show a few examples of inputs and outputs of the trained models on real data in Fig. 6. It can be seen from this figure that the model M-32 performs better than SkipNetModel in a few of the examples. From Figs. 6 and 7, we conclude that the proposed models generalize better and are more robust than SkipNetModel.

Figure 7: PSNR scores of the models on the NoisyOffice dataset [NoisyOffice] with varying amounts of training data
M-64 (proposed)
M-32 (proposed)
M-16 (proposed)
Table 4: Color cleanup performance of the proposed model: SSIM and PSNR score of the noisy input images with respect to the groundtruth are and respectively.

Finally, we present our experimental results on the document color cleanup task. One challenging aspect of color cleanup is preserving the color of the foreground pixels. For this experiment, we used a color dataset consisting of 250 mobile captured images, each of which was manually cleaned. We followed the same training strategy as described in Sec. 4. Random real-life images (not belonging to the train/test set) and their corresponding outputs are shown in Fig. 8. From this figure, we can observe a decent performance of the proposed model in performing color cleanup of document images. To provide a quantitative measure of our method, we also report the PSNR and structural similarity index (SSIM) scores on the test set in Table 4.

Figure 8: Typical examples of color cleanup. (a) random input images; (b) cleaned outputs using the M-32 based model on a mobile device

5 Conclusion

We have proposed an encoder-decoder based document cleanup model for resource constrained environments. To this end, we design a light-weight deep network with only a few residual blocks and skip connections. Our loss function incorporates the perceptual loss, which enables transfer learning from pre-trained deep CNN networks.

We develop three models based on our network design, with varying network width. In terms of the number of parameters and product-sum operations, our models are 65-1030 and 3-27 times smaller, respectively, than a recently proposed GAN based document enhancement model [DEGAN_TPAMI20]. In spite of our relatively low network capacity, the generalization performance of our models on various benchmarks is encouraging and comparable with several document image cleanup techniques with deep architectures such as [Tensmeyer_Binary_ICDAR17, Skip-Connected_ICPR18]. In addition, our models are more robust in the low training data regime than [Skip-Connected_ICPR18]. Hence, the proposed models offer a favorable trade-off between memory/latency and accuracy, making them suitable for mobile document image cleanup applications.