Image super-resolution (SR) is an important and challenging low-level vision task that underlies many real-world problems. In this paper, we focus on super-resolution for document images, which are among the most pervasive types of input in daily life [mancas2007introduction]. Low-quality document images can degrade OCR results and lead to low recognition accuracy. Among the many kinds of marred document inputs, low-resolution images are a common case. To improve OCR accuracy, super-resolution is therefore usually applied as a pre-processing enhancement stage.
Super-resolution involves adding details while keeping a smooth structure based on the original low-resolution (LR) images. Predicting the unseen pixels of the real high-resolution (HR) images is a typical ill-posed problem [arxiv:1902.06068]. Traditional super-resolution methods usually employ interpolation-based approaches such as bilinear and bicubic interpolation. Recently, the application of deep learning and generative networks to computer vision has produced significant breakthroughs in many fields. For super-resolution of natural images, deep models such as SRCNN [Chao2014Learning, Dong2016Image] and SRGAN [Ledig2016Photo] have achieved state-of-the-art performance. However, natural images and document images have different attributes, and the causes of low resolution also differ. Previous methods tend to improve the overall similarity with the HR images, which sometimes causes blurry edges and brings no improvement to OCR accuracy.
Many previous methods use a single network with consecutive up-sampling blocks after the convolution blocks. After a single up-sampling pass, the intermediate image features may not be adequately extracted, and low-resolution text regions may be turned into characters that are unrecognizable to the OCR system. In this paper, we propose to use cascaded networks; the pipeline is illustrated in Fig. 1. Each Detail-Preserving Network (DPNet) aims to preserve detail at a small magnification. The networks are trained with the same architecture but different parameters and are then assembled into a pipeline model with a larger magnification. The low-resolution images are upscaled gradually by passing through each DPNet until the final high-resolution images are obtained. For each DPNet, a loss function with perceptual terms is designed to simultaneously preserve the content and enhance the edges of the characters. We conduct extensive experiments against state-of-the-art image super-resolution methods on two scanned document image datasets and demonstrate superiority in terms of PSNR and SSIM [Wang2004SSIM] over previous approaches. Moreover, combining our Cascaded Detail-Preserving Networks framework with a standard OCR system also leads to significant improvements in recognition results.
The rest of this paper is organized as follows. Section II introduces the background of super-resolution. Section III discusses the model design, network architecture, and training process in detail. Section IV presents qualitative and quantitative studies of the proposed network. We conclude in Section V.
II. Related Work
Super-resolution is a typical image restoration task that aims to convert low-resolution images into high-resolution ones. It is useful for many applications, especially optical character recognition (OCR): the loss of image details can seriously affect both text detection and text recognition in document images. Therefore, super-resolution methods are usually introduced as a pre-processing step and can improve a modern OCR system.
Image super-resolution is an ill-posed problem, and the super-resolution of document images is a domain-specific task. Traditional super-resolution can be addressed using priors. These methods include prediction-based approaches [Tomer2014A], gradient profile-based approaches [Jian2008Image], image statistics based approaches [Efrat2013Accurate, Fernandez2013Super], patch-based models [Qiang2005Patch, Aodha2012Patch], and external learning or example-based super-resolution [Freeman2002Example].
In recent years, advances in deep learning have benefited vision problems, and a number of models have been built for super-resolution using deep convolutional neural networks (CNNs). For instance, [Chao2014Learning] and [Dong2016Image] proposed a CNN-based method to super-resolve natural images by using the network to learn the mapping between bicubic-interpolated LR images and the corresponding HR images. The VDSR network [Kim2016Accurate] is designed to predict residuals instead of pixel values, with fast convergence. With a deeply-recursive convolutional architecture, DRCN [Kim2015Deeply] reported high performance with fewer model parameters. More recently, SRGAN [Ledig2016Photo] introduced residual networks for single image super-resolution (SISR) combined with a generative adversarial network (GAN). GAN-based methods extract texture features from images with a deep CNN, such as VGG-16 [simonyan2014deep], and make the super-resolved images have proper texture and good perceptual quality. The discriminator network also drives the super-resolution network to learn to transfer low-resolution images into detailed high-resolution images.
The resolution of document images is an important factor for both OCR systems and human vision in recognizing text and characters. As a general rule, the lower the text resolution, the more visual information is lost and the lower the recognition accuracy. Conversely, extremely high resolution may bring not higher accuracy but a higher computational burden. Therefore, considering real-world OCR applications, a super-resolution model should have an adjustable magnification to handle varying degrees of low resolution in text patches: if the text resolution is especially low, the model should proceed with a higher magnification. As a pre-processing step, an efficient super-resolution model also benefits the whole OCR pipeline. This motivates us to design a light-weight network architecture and further build our composable model.
III. Cascaded Detail-Preserving Networks

The goal of our framework is to super-resolve document images and text patches with adjustable magnification. It is designed as a cascade process. As shown in Fig. 1, the full model is composed of multiple networks, each DPNet having a small super-resolution magnification (2×). The networks trained for different scales of document images share the same architecture but have different parameters. The whole model is formed by connecting the DPNets trained at neighboring scales. The input low-resolution image is magnified successively, resulting in a multiplicatively magnified high-resolution image.
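The cascade amounts to simple function composition: each stage upscales by its own factor, and the factors multiply. A minimal sketch in plain Python, where `upscale2x` is a toy nearest-neighbor stand-in for a trained 2× DPNet (both function names are ours, not the paper's):

```python
def cascade(nets):
    """Compose per-stage super-resolution networks into one pipeline.
    The total magnification is the product of the per-stage factors,
    e.g. two 2x stages yield a 4x model."""
    def run(image):
        for net in nets:
            image = net(image)
        return image
    return run

def upscale2x(image):
    """Toy 2x nearest-neighbor upscaler standing in for a trained DPNet.
    `image` is a grayscale image as a nested list of pixel values."""
    return [[v for v in row for _ in (0, 1)]   # duplicate each column
            for row in image for _ in (0, 1)]  # duplicate each row

# A 4x pipeline built from two 2x stages.
sr4x = cascade([upscale2x, upscale2x])
```

A 1×1 input passed through `sr4x` comes out 4×4, illustrating how the 2× stages multiply into the final magnification.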
III-A Detail-Preserving Network
As shown in Fig. 2, the Detail-Preserving Network employs a generative CNN architecture that follows a common single-image super-resolution pattern and consists of three parts. The first part extracts features at the same spatial size as the input image. Here we use a single convolutional layer with a kernel size of 9 to produce a low-level feature map from the input image. Residual blocks then extract high-level features from the low-level feature map; the number of residual blocks is chosen in our experiments as a trade-off between performance and model efficiency, with a kernel size of 3 for the convolutional layers. A skip connection is also included in this part; it aids the training of the residual blocks and fuses low-level and high-level features. The second part is the upsampling. Since a series of upsample blocks cannot make the most of the features between scales, we employ a single upsample block with a sub-pixel convolutional layer [Shi2016Real]. (Suppose the magnification of the upsample block is r = 2, the single-channel input size is H × W, and the input/output channel numbers are C_in / C_out. Given the input, the convolutional layer generates an H × W × (4·C_out) tensor, which the pixel shuffle operation then converts to an output of size 2H × 2W × C_out.) The final part generates the output map, consisting of a single convolutional layer and a sigmoid function.
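The pixel shuffle operation behind the sub-pixel convolutional layer [Shi2016Real] can be illustrated directly. The sketch below (plain Python on nested lists; the helper name is ours) rearranges an input of shape (C·r², H, W) into (C, r·H, r·W), matching the channel-to-space mapping used by sub-pixel upsampling:

```python
def pixel_shuffle(x, r):
    """Rearrange x of shape (C*r*r, H, W), given as nested lists,
    into (C, H*r, W*r): channel c*r*r + i*r + j supplies the pixel
    at offset (i, j) inside each r x r output cell of channel c."""
    c_in = len(x)
    h, w = len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out
```

With r = 2, four input channels of size 1×1 become a single 2×2 output channel, which is exactly how the convolutional layer's 4·C_out channels turn into a 2H × 2W × C_out image.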
III-B Model Training
Due to the cascade structure in this work, we divide the training process into two phases, i.e., parallel training and penetrating fine-tuning. An overview of the model training strategy is shown in Fig. 3.
III-B1 Parallel Training
Each network takes a lower-resolution image as input and returns a higher-resolution image. In the first phase, we assume the networks for different scales are independent and train them separately. We take a 4× model as the example in Fig. 3(a). After down-sampling, 2× and 4× low-resolution images are generated from the original high-resolution images. The 4× low-resolution images are the input to DPNet1, and the resulting super-resolved images are used to compute the loss against the 2× low-resolution images. Backward propagation of the loss then optimizes the parameters of this network.
In the same way, DPNet2 is trained in parallel, using the 2× low-resolution and the original high-resolution images. For models with larger magnification, the networks can be trained in parallel in the same manner, which is convenient when multiple GPUs are available.
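The LR training pairs described above are built by repeatedly down-sampling the HR image. A minimal sketch using 2×2 average pooling (the function name is ours; real pipelines would more likely use bicubic down-sampling):

```python
def downsample2x(img):
    """Halve a grayscale image (nested lists, even dimensions)
    by 2x2 average pooling."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] +
              img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

# HR -> 2x LR (input to DPNet2, target for DPNet1)
#    -> 4x LR (input to DPNet1)
hr = [[0, 2, 4, 6], [2, 4, 6, 8], [4, 6, 8, 10], [6, 8, 10, 12]]
lr2 = downsample2x(hr)
lr4 = downsample2x(lr2)
```

Each DPNet is then trained on the (coarser, finer) pair at its own scale, e.g. (lr4, lr2) for DPNet1 and (lr2, hr) for DPNet2.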
III-B2 Penetrating Fine-Tuning

The parallel training in the previous phase enables each DPNet to super-resolve images with a small magnification. However, image restoration tasks are ill-posed, and no model quickly finds a perfect solution equal to the original high-resolution image. Therefore, we design this phase to adapt the network parameters, as shown in Fig. 3(b).
In each step, all of the networks connected by arrows are used for fine-tuning. The parameter weights of DPNet2 are initially frozen, and the whole model takes low-resolution images as input and outputs super-resolved 4× images to update the weights of DPNet1. The networks are fine-tuned sequentially in this phase, from the second to the N-th (e.g., the parameters of DPNet1 and DPNet2 are updated in Fig. 3(b)).
III-C Loss Function
For each phase and each network, we employ the same loss function, which incorporates three terms as described below.
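Although the combined objective is not reproduced here, a loss with these three terms presumably takes the standard weighted-sum form below; the weights λ_p and λ_e are our placeholders, not values from the paper:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{pixel}}
  \;+\; \lambda_{p}\,\mathcal{L}_{\mathrm{perc}}
  \;+\; \lambda_{e}\,\mathcal{L}_{\mathrm{edge}}
```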
The first term of the loss function is the pixel loss, defined by the pixel-wise MSE. Inspired by [johnson2016perceptual], the second term is the perceptual loss, which is based on the difference between the feature maps of the generated and target images from a VGG19 network [simonyan2014deep] pre-trained on ImageNet [ILSVRC15]. Formally, the perceptual loss is defined as

    L_perc = (1 / (C_j · H_j · W_j)) · || φ_j(I^HR) − φ_j(G(I^LR)) ||_1,

where I^HR and I^LR indicate the high-resolution and low-resolution images, φ_j represents the j-th layer that outputs feature maps of size C_j × H_j × W_j, and G is the super-resolution function. We take the feature maps before the activation layer. Both the pixel and perceptual terms represent the content of the images. Here we use the ℓ1 metric, as we found in our early experiments that a network trained using the perceptual loss only, or with the ℓ2 metric, may produce unrealistic textures on the generated images (as also reported in previous work such as [mechrez2018contextual]).
The last term is the edge loss. We employ the class-balanced cross-entropy loss [Xie2015Holistically]: the original high-resolution image and the super-resolved image are mapped to their corresponding edge maps with holistically-nested edge detection (HED) [Xie2015Holistically], and the loss is computed between them. The benefits of the edge loss are twofold. First, enhancing the edge information helps preserve the detail information at a small magnification. Second, as observed in our experiments, incorporating the edge loss accelerates convergence during model training. The loss function is defined as

    L_edge = Σ_m ℓ_ce( E_m(I^HR), E_m(G(I^LR)) ),

where E_m(·) denotes the edge maps from the m-th side-output layer of the HED network and ℓ_ce indicates the class-balanced cross-entropy loss. We use a reduced number of side-output layers in the HED model to shorten training and inference time.
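The class-balanced cross-entropy from HED [Xie2015Holistically] reweights the two classes so that the rare edge pixels are not swamped by the background: positives are weighted by the fraction of negatives, and vice versa. A minimal sketch on flattened pixel lists (the function name is ours):

```python
import math

def class_balanced_bce(pred, target, eps=1e-7):
    """Class-balanced binary cross-entropy as in HED.
    pred: predicted edge probabilities in (0, 1); target: binary edge
    labels (1 = edge pixel). Both are flat lists of equal length."""
    n = len(target)
    n_pos = sum(target)
    beta = (n - n_pos) / n  # fraction of negatives -> weight for positives
    loss = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        if t == 1:
            loss -= beta * math.log(p)
        else:
            loss -= (1.0 - beta) * math.log(1.0 - p)
    return loss / n
```

Near-perfect predictions give a loss close to zero, while uninformative predictions (all 0.5) are penalized much more heavily.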
III-D Implementation Details
We implement our model in PyTorch (https://pytorch.org/). The experiments are conducted on an Intel Xeon E5 CPU and NVIDIA Titan Xp GPUs. We evaluate several methods with different fine-tuned network parameters but the same training dataset and configuration.
IV. Experiments

IV-A Datasets and Evaluation Metrics
To validate the effectiveness of the proposed framework, we collect two document image datasets and design two groups of experiments.
IV-A1 RVL-CDIP Region
The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset [harley2015icdar] consists of 16 document categories with 25K document images per category. Among these 400K grayscale document images, 80% are used for training, 10% for validation, and the rest for testing. In our experiment, we randomly sample 32K fixed-size regions from the original RVL-CDIP dataset for training, 4K regions for validation, and 4K regions for testing. As the document images vary in fonts and structures, we focus on both quantitative and qualitative evaluation of the super-resolution results, with PSNR and SSIM as the metrics.
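PSNR follows directly from the pixel-wise MSE. A minimal reference implementation for grayscale images as nested lists (ours, not the paper's evaluation code):

```python
import math

def psnr(img1, img2, max_val=255.0):
    """Peak signal-to-noise ratio between two equally sized grayscale
    images: 10 * log10(MAX^2 / MSE). Higher is better; identical
    images give infinity."""
    se, n = 0.0, 0
    for r1, r2 in zip(img1, img2):
        for a, b in zip(r1, r2):
            se += (a - b) ** 2
            n += 1
    mse = se / n
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)
```

For example, two 8-bit images differing by 10 at every pixel have MSE = 100 and a PSNR of about 28.13 dB.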
IV-A2 ICDAR17-Textline

We also construct a dataset containing textline regions with recognition annotations to evaluate both the super-resolution metrics and the OCR accuracy after super-resolution. We randomly select 20 pages from the ICDAR 2017 proceedings, print them on paper, and then scan them to full-page digital images at 300 dpi and 75 dpi. For each page, 30 textline regions are randomly cropped and labeled with their text by annotators. The text region images are divided into 400 patches for training and 200 patches for testing.
Besides the PSNR and SSIM used in the RVL-CDIP experiments, here we also evaluate the OCR performance achieved with the help of image super-resolution. After the super-resolution process, the output images are fed into a commercial OCR system (ABBYY FineReader 14: https://www.abbyy.com/en-apac/finereader/). Two recognition precision metrics are defined, namely the LCS score and the Levenshtein score, with values falling within the interval [0, 1]. The LCS score is based on the Longest Common Subsequence (LCS) and is defined as

    Score_LCS(p, t) = |LCS(p, t)| / max(|p|, |t|),

where p and t indicate the predicted and target text, respectively. The LCS score is the ratio of the LCS length to the maximum of the lengths of p and t, i.e., max(|p|, |t|). It reaches the maximum value of 1.0 only when p is exactly the same as t. The Levenshtein score is derived from the Levenshtein distance, also referred to as edit distance, a string metric for measuring the difference between two sequences. We use the normalized Levenshtein distance subtracted from 1 to evaluate the similarity between p and t, i.e.,

    Score_Lev(p, t) = 1 − Lev(p, t) / max(|p|, |t|).
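Both recognition scores reduce to classic dynamic programming over the predicted and target strings. A sketch assuming the normalized definitions above (helper names are ours):

```python
def lcs_len(p, t):
    """Length of the longest common subsequence of strings p and t."""
    dp = [[0] * (len(t) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(t) + 1):
            if p[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def levenshtein(p, t):
    """Edit distance between p and t (single-row DP)."""
    dp = list(range(len(t) + 1))
    for i in range(1, len(p) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(t) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (p[i - 1] != t[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def lcs_score(p, t):
    return lcs_len(p, t) / max(len(p), len(t))

def lev_score(p, t):
    return 1.0 - levenshtein(p, t) / max(len(p), len(t))
```

For instance, the pair ("kitten", "sitting") has LCS length 4 and edit distance 3, giving both scores a value of 4/7; identical strings score 1.0 under both metrics.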
IV-B Results and Comparison
Table I presents the comparison of our full model with state-of-the-art super-resolution approaches. We compare against the classical bicubic method as well as the recent deep learning based models SRCNN [Dong2016Image] and SRGAN [Ledig2016Photo]. All of the baseline methods and the proposed framework are compared at the same magnification (4×). Notably, our Cascaded DPNets perform better on both datasets under all metrics. Fig. 4 and Fig. 5 show qualitative evaluations of our approach on the testing sets. We succeed in preserving the detail of text regions across different document types and character fonts, especially when small characters appear. However, there are also failure cases in which some characters are extremely small or the model fails to separate multiple adjacent characters. The recognition results on the ICDAR17-Textline dataset are also illustrated in Fig. 5. We observe that combining the proposed Cascaded DPNets with the OCR system further boosts recognition accuracy. Overall, the super-resolution results improve on the full-reference image quality metrics compared with the baseline methods, and the text characters and image details are of high enough quality for further post-processing such as layout extraction and character recognition. During inference, the Cascaded DPNet model achieves 75 FPS while consuming 2840 MB of memory on an NVIDIA Titan Xp GPU with an LR image as input.
Table II: Ablation results on the RVL-CDIP Region dataset.

(a) Edge loss

Method                      | PSNR  | SSIM
----------------------------|-------|-------
Cascaded DPNet without Edge | 24.96 | 0.7487
Cascaded DPNet with Edge    | 25.27 | 0.7541

(b) Different cascade structures

Method                      | PSNR  | SSIM
----------------------------|-------|-------
Bicubic (4×)                | 20.74 | 0.7113
Bicubic (2×) + DPNet (2×)   | 21.12 | 0.7218
DPNet (2×) + Bicubic (2×)   | 22.95 | 0.7361
Cascaded DPNet (2× + 2×)    | 25.27 | 0.7541
IV-C Ablation Study
In this subsection, we evaluate the alternative implementations for the document image super-resolution. We report results on the RVL-CDIP Region dataset as it is larger and more diversified than ICDAR17-Textline.
Recall that the edge term represents the edge information, which is of great importance as discussed in Section III-C. Super-resolved images and their corresponding metrics with and without the edge loss are shown in Fig. 6 and Table II(a). The cascaded networks even without the edge loss outperform the SRGAN framework, indicating the effectiveness of the cascade architecture on document images. We observe further performance gains when adding the edge term, and the super-resolved text regions have sharper contours and clearer characters, which helps further recognition.
We also evaluate the effect of the components within the cascade super-resolution structure. Fig. 7 and Table II(b) show the comparison when DPNets are replaced with bicubic interpolation. Quantitatively, the model with DPNets at every stage performs best among the different cascade settings: the multiple stages of DPNet bring a 10.5% gain in PSNR over the cascade of bicubic and DPNet, and a significant improvement of the SR results as illustrated in Fig. 7.
V. Conclusion

We have introduced Cascaded DPNets, a deep super-resolution framework for document images. A Detail-Preserving Network with small magnification preserves the content and enhances the edges of the characters, and a cascade of such networks is assembled into a pipeline model with a larger magnification. Through an extensive set of document super-resolution experiments, we have shown that Cascaded DPNets are more effective than the baseline deep learning approaches, generating highly competitive results from low-resolution document images.