Binary Document Image Super Resolution for Improved Readability and OCR Performance

12/06/2018 ∙ by Ram Krishna Pandey, et al. ∙ ERNET India, Indian Institute of Science

There is a need for information retrieval from large collections of low-resolution (LR) binary document images, which can be found in digital libraries across the world, where the high-resolution (HR) counterpart is not available. This gives rise to the problem of binary document image super-resolution (BDISR). The objective of this paper is to address the interesting and challenging problem of super resolution of binary Tamil document images for improved readability and better optical character recognition (OCR). We propose multiple deep neural network architectures to address this problem and analyze their performance. The proposed models are all single image super-resolution techniques, which learn a generalized spatial correspondence between the LR and HR binary document images. We employ convolutional layers for feature extraction followed by transposed convolution and sub-pixel convolution layers for upscaling the features. Since the outputs of the neural networks are gray scale, we utilize the advantage of power law transformation as a post-processing technique to improve the character level pixel connectivity. The performance of our models is evaluated by comparing the OCR accuracies and the mean opinion scores given by human evaluators on LR images and the corresponding model-generated HR images.




I Introduction

The task of super resolution (SR) has been a classic problem ever since the earliest work of Tsai [1]. Digital images are composed of tiny picture elements called “pixels”, and their density in representing the image is commonly referred to as the spatial resolution of that image. The higher the resolution, the more detail is perceivable by the human eye. The task of SR is said to be ill-posed, since there is no exact and unique solution to the problem: many distinct high-resolution (HR) images can be obtained from a single low-resolution (LR) image. Super resolution techniques come into play where high resolution is really useful but is not available, as in the case of old fax images. SR techniques enhance the image details and quality by performing non-linear transformations and by eliminating the artifacts generated by the imaging procedures.

Enhancing the resolution of gray images of text has been approached in a variety of ways, as presented in [48], to improve the OCR accuracy. However, the data handled in those works does not encompass binary document images that have been scanned at a low resolution. In the following sections, we describe a unique dataset of Tamil binary document images that we have created and propose convolutional neural network (CNN) based SR models for these document images. By exploiting the proposed methodologies, one can improve the readability of scanned documents even when they are scanned at a low resolution due to time, hardware or bandwidth constraints, which in turn improves the accuracy of the optical character recognizer (OCR). These improvements in the perceptual quality and OCR accuracy can be of great value for voluntary initiatives such as “Project Madurai”, which aim to preserve ancient Tamil literature by creating their e-text versions online [3].

It is generally advisable to scan text documents at a resolution of around 300-600 dots per inch (dpi) on a flatbed scanner for the OCR to achieve its best performance. However, there already exist large collections of documents that were scanned at a low resolution and whose originals have since been destroyed or lost, which prevents us from scanning them again. Also, scanning at a higher resolution means representing the document with a larger number of pixels. This takes time, and the files consume a lot of system memory for storage or bandwidth for transmission, thus limiting the number of documents a user can scan in a given period of time and with a given hardware capacity. For example, as mentioned above, in Project Madurai [3], committed volunteers across the world have shared scanned images of over 500 ancient literary works in Tamil. However, the documents were scanned as binary images at a low resolution of 75 dpi. This quality was good enough for the volunteers to type in the text by visually inspecting the document image, but is grossly inadequate for OCRs to give a good recognition performance.

To address the above problem with the objective of achieving higher resolution, mainly two kinds of techniques are used: multiple image super resolution (MISR) and example based super resolution (EBSR). In MISR, we combine multiple images, which are misaligned at the sub-pixel level, to obtain an HR image [4],[5],[6]. EBSR, on the other hand, refers to improving the resolution of an image by learning the correspondence between low and high resolution image patches from a database [7]. An improvement over the EBSR methodology can be found in [8], which describes a single image super resolution (SISR) technique for improving the resolution of single LR images. SISR algorithms can be classified based on their field of usage [9]. Domain-specific SISR algorithms focus on a particular type of data: for instance, artwork [10] and faces [11] [12]. Generic SISR algorithms have priors based on common and primitive image properties [13] and can be applied to different kinds of images [8] [14] [15]. Owing to the recent advances in the field of deep learning, we take into account the advantages of the SISR methodology and propose example learning based SISR neural network models, which learn the correspondence between low and high-resolution binary document images and increase the details of any scanned Tamil test document, irrespective of the scanner settings.

Since in super-resolution our aim is to transform data from the low-resolution to the high-resolution space, algorithms such as [16, 17] initially upscale the down-sampled image to the required resolution by bicubic interpolation, or by the combination of interpolations proposed in [66], and then pass it through the neural network models. In contrast, models like [20] use transposed convolution at a later stage of the network architecture to upscale the features. An alternative to the transposed convolution layer, called the sub-pixel convolution layer [22], has also shown promising results for upscaling. Its advantage over the other upscaling layers lies in its computational efficiency and the absence of trainable parameters. We take these factors into account while building our models and analyze their performance in the later sections. The key takeaways from our work are as follows:

  • The methodologies found in the literature aim to improve the quality of noisy document images in order to achieve higher OCR accuracy. To our knowledge, there is no reported work in the literature on super-resolution of binary document images with the objective criterion of better OCR accuracy and improved readability, and there is no standard dataset available for this problem. The ICDAR2015-TextSR dataset [49], which is available for text image super-resolution, contains camera captured gray images. We have therefore created our own dataset, which captures multiple variations in low resolution input images so that the network can generalize well during testing. The complete details of the created dataset are given in Section III.

  • We extend the advantages of the synthesis of residual learning and the sub-pixel layer proposed for natural images in [21, 22], for computationally efficient upscaling of binary document images. Residual learning facilitates the learning of residuals instead of the entire pixel-pixel mapping, which significantly reduces the amount of information that needs to be processed. The usage of sub-pixel convolution layer as the final upscaling function eliminates the need to interpolate the input image in the initial stages, thus allowing the model to learn the complex correspondence between the LR-HR patch pairs.

  • After performing extensive experiments, we have designed, implemented and proposed 12 different architectures (3 for an upscaling factor of 2 with ReLU as the activation function and 3 more by replacing ReLU with PReLU; similarly 6 more architectures for upscaling the document images by a factor of 4) of various structural complexities, which can reconstruct HR Tamil document images directly from images scanned at a LR. Even though all the architectures perform well in all the cases, our subpixel based architecture outperforms the others.

  • By employing the proposed methodologies, we can store the low-resolution version of a document image (originally scanned at a high resolution) in the system memory and pass it through the model only when we need to perform OCR. Thus, it saves memory and facilitates the storage of many more documents.

  • The HR output images of the DNN models have significantly improved perceptual quality of the text in the document. This implies that the document now has much better readability than its low-resolution counterpart. The details are given in Section VI.

The rest of the paper is organized as follows: Section II provides information about the related work on super resolution. Section III provides a detailed insight into the dataset creation process. In Section IV, we explain our proposed CNN models, followed by Section V, in which we explain the experimentation process. The results are shown and discussed in Section VI, followed by our conclusion in Section VII and the possible future work in Section VIII.

II Related Work

The need to improve the details of a low-resolution image for applications in medical imaging, satellite image processing, multimedia, document content retrieval, surveillance, etc. has led to the development of many super-resolution techniques.

Image super-resolution is an ill-posed problem and can be addressed by using priors. The image SR methodologies can be broadly classified into: prediction based [23], gradient profile based [13], image statistics based [24, 25], patch based models [26] [27], internal [28] and external learning or example based super resolution [7].

Text image super resolution is a domain-specific SISR task, where the training data comprises only the gray document images of the characters of a language. This is unlike the generic SISR methodologies, where the domain is huge. Text image SR methodologies include sparse coding based approaches [29, 30, 31, 32], edge directed tangent fields using Markov random fields [29], Bayesian learning approaches [33], convolutional neural networks [34] [35] [36] [46, 66] and the iterative synthesis of the median estimator.


Internal learning based super-resolution requires only the input image itself to reconstruct its high resolution counterpart, using the cross-scale self-similarity property [8]. This property states that small patches of an image are highly likely to be found in the down-sampled version of the same image. An example of an internal learning based approach is high frequency transfer, wherein the initial HR image is obtained by bicubic interpolation of the input image and the high frequency (HF) components separated from the input image are transferred patch-wise to the bicubic interpolated HR image. In the case of neighbor embedding (locally linear embedding), in contrast, images are super resolved by assuming similarities between the local geometries. This implies that, since there is a correspondence between the LR and HR patches, an HR patch can be obtained as a weighted linear combination of its neighbors, using the same weights as those used with the neighbors of the corresponding LR patch [38].

On the other hand, example based super-resolution uses a database of LR and HR patch pairs. In this case, the dataset is in a compact format (representation) in terms of the LR and HR dictionaries. The dictionary based approach for SISR was first proposed in [39] and was refined in [40], which uses sparse coding to find a joint or coupled [41] representation of LR and HR patches and uses this representation to find the HR patch. Zeyde et al. [42] use K-SVD for dictionary update and orthogonal matching pursuit for sparse coding. Anchored neighbor regression and its variant [44] use smaller dictionaries in place of one larger dictionary to speed up the process. Dong et al. [17] propose CNN based natural image super resolution, which uses bicubic interpolation to resize the LR images to the size of the HR images and then learns a mapping between the resized LR and the corresponding HR images. A lot of improvement in CNN-based super-resolution has been reported in [16]. In [19], the authors have proposed recursive-supervision and skip connections to make the training easy and to avoid the problems of vanishing and exploding gradients. Another example is the work on fast super resolution CNNs [20], which has shown restoration quality and speed superior to SR convolutional neural networks [17].

A recent development in image super resolution is photo-realistic SISR using a generative adversarial network (GAN) [45]. The authors have shown that the super resolved image is of good perceptual quality. However, it is not guaranteed that the network will produce the true high resolution details, as the generator’s goal is to fool the discriminator by generating a good quality image similar to the natural image. The GAN based approach makes use of a deep CNN, such as VGG-16 [67], for extracting texture information from the images. Thus, when these models are applied to binary document images, which have low texture content, the generator produces symbols whose structure is retained but whose pixel connectivity is not. This goes against the objective of our work.

We are addressing the problem of SISR for a task distinctly different from the above papers: to enhance the quality of the input binary document images so that the generated images have better readability and OCR accuracy. Our work builds on what has been reported in [46] [47], which addresses the above mentioned problem for the first time and is able to obtain good PSNR, OCR character and word level accuracies starting from a downsampled version of the document image (gray in nature), the results being similar to that of the corresponding ground truth image. In the current work, we have addressed the realistic problem of obtaining an upscaled version of a binary document image, actually scanned at a low resolution. We have achieved this by creating a new dataset and by designing CNNs for binary document image super resolution (BDISR).

III Dataset

Fig. 1: Magnified sample frames from our dataset. (a) 200-dpi high resolution patch. (b) Low resolution version of (a), generated by taking alternate pixels. (c) A 200-dpi high resolution Tamil character. (d) Low resolution version of (c). (e) High resolution version of a Tamil character. (f) Low resolution version of (e), generated by applying the mask of random ones and zeros. (g) 150-dpi high resolution image. (h) Low resolution image (directly scanned at 75 dpi), corresponding to the image in (g). (i) 300-dpi high resolution image. (j) Low resolution image (directly scanned at 75 dpi), corresponding to the image in (i).

III-A Creation Methodology

Our training dataset consists of overlapping patches from the low and high resolution binary images of the same document, which serve as the training input and the ground truth (GT), respectively. The LR patches are created by taking overlapping patches with a stride of 1 from the binary LR document image. If an upscaling factor of r is required from the network, we obtain the corresponding HR patches by taking overlapping patches with a stride of r from the binary HR (GT) image. We have created our dataset under the assumption that a function that upscales by a factor of 2 or 4 is being modeled to super-resolve the LR image. We have created a rich and diverse set of about five million patch pairs from various types of LR images, obtained by choosing alternate pixels, random deletion of pixels, cropping from Tamil character images and direct scanning at low spatial resolution.
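The pairing of stride-1 LR patches with stride-r HR patches can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `extract_patch_pairs`, the nested-list image representation and the assumption that the HR image is exactly r times the size of the LR image are all hypothetical. The default LR patch size of 16 is taken from the input dimensions listed in Tables I to VI.

```python
def extract_patch_pairs(lr, hr, r=2, lr_size=16):
    """Pair every stride-1 LR patch with the HR patch covering the same region.

    lr, hr: binary images as lists of rows (assumption: hr is exactly
    r times larger than lr along both axes).
    """
    hr_size = lr_size * r
    pairs = []
    for i in range(len(lr) - lr_size + 1):          # stride 1 in the LR image
        for j in range(len(lr[0]) - lr_size + 1):
            lr_patch = [row[j:j + lr_size] for row in lr[i:i + lr_size]]
            top, left = i * r, j * r                # stride r in the HR image
            hr_patch = [row[left:left + hr_size]
                        for row in hr[top:top + hr_size]]
            pairs.append((lr_patch, hr_patch))
    return pairs
```

Moving the LR window by one pixel moves the HR window by r pixels, so both patches always cover the same region of the document.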

III-A1 Data for upscaling by a factor of 2

We have scanned Tamil documents at 200 dpi, so that the resulting binary images can be considered as high resolution images for training purposes. Separate copies of these digitized HR document images are then converted to low resolution by selecting only the alternate pixels in the HR image in both the x and y directions. Thus, the LR image has one-fourth the pixels of the HR image. We have created around 2 million corresponding pairs of overlapping patches of high and low resolution and train our networks on these patches, instead of the entire image. The dimensions of the LR and HR patches are $16 \times 16$ and $32 \times 32$, respectively. Since the same document is scanned and converted to a lower resolution, the content of the HR image is reproduced in the LR image, but with a reduction in the number of pixels, and hence in clarity. Let $X_1$ be one of the 2 million LR patches of dimension $16 \times 16$ and $Y_1$ be its HR ground truth of dimension $32 \times 32$. The two are related as,

$$X_1(x, y) = Y_1(2x, 2y) \tag{1}$$

where $(x, y)$ are the co-ordinates of the binary image. Since alternate pixels are considered, the dimensions of $Y_1$ are ensured to be even. Picking alternate pixels to create LR images can be thought of as the scanner skipping the alternate pixels from the original image. As a result, we observe a loss in the structure or shape of the symbols.
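The alternate-pixel selection described above amounts to keeping every second row and every second column of the HR image. A minimal sketch, with a hypothetical function name and images represented as lists of rows:

```python
def take_alternate_pixels(hr):
    """Keep every second pixel along both axes: lr[x][y] == hr[2*x][2*y]."""
    return [row[::2] for row in hr[::2]]


# Usage: a 4 x 4 HR image shrinks to 2 x 2, i.e. one-fourth of the pixels.
hr_image = [
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
lr_image = take_alternate_pixels(hr_image)
```

The usage example also shows why the shape of a symbol can be damaged: the rows and columns that are skipped are lost, whatever they contained.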

Additionally, we create separate HR and LR patches using a different method. In this procedure, we initially generate the patch pairs following the same methodology as mentioned above, i.e. by skipping alternate pixels. Then, we apply a mask of randomly distributed ones and zeros to each of the LR patches. Since the distribution is random, the mask for each LR patch is different, but the dimensions of the mask are the same as those of the LR patches, i.e. $16 \times 16$. The resulting image patch is the pixel-wise product of the mask and the LR image patch. Using (1), we get the LR patch $X_1$ through alternate pixel removal. If $X_2$ is the patch obtained after applying the mask $M$ on $X_1$, then

$$X_2 = M \cdot X_1$$

where the dot operator ($\cdot$) represents the element-wise multiplication of $M$ and $X_1$.

$M$ is a matrix of randomly placed ones and zeros. The ones and zeros have been generated by non-uniform probability distributions such as a Gaussian distribution:

$$f(z) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(z - \mu)^2}{2\sigma^2}\right)$$

The values of $\mu$ and $\sigma$ are initially chosen to be 0 and 1, respectively, and are varied to generate different masks, resulting in the creation of diverse low-resolution data. One can observe more discontinuities in the pixel structure of $X_2$ than in $X_1$. This helps the model to be trained in such a way that it can tackle super-resolution of documents with randomly lost pixel data. The ground truth for these new patches is denoted as $Y_2$.
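The masking step can be sketched as below. The text does not say how the Gaussian samples are turned into ones and zeros, so the thresholding rule, the `threshold` parameter and the function names here are assumptions made purely for illustration.

```python
import random


def gaussian_mask(height, width, mu=0.0, sigma=1.0, threshold=-1.0, seed=None):
    """Binary mask from Gaussian samples.

    Assumption (not from the paper): an entry is 1 when the sample drawn
    from N(mu, sigma) exceeds `threshold`, and 0 otherwise.
    """
    rng = random.Random(seed)
    return [[1 if rng.gauss(mu, sigma) > threshold else 0
             for _ in range(width)]
            for _ in range(height)]


def apply_mask(mask, patch):
    """Element-wise product of the mask and the LR patch."""
    return [[m * p for m, p in zip(mask_row, patch_row)]
            for mask_row, patch_row in zip(mask, patch)]
```

Because the mask only contains ones and zeros, the product can delete foreground pixels but never create new ones, which is what produces the extra discontinuities in the masked patches.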

In order to specifically improve the resolution of the characters (and thus to further enhance the performance of the OCR), we make use of individual Tamil character data. This data facilitates improvement in pixel connectivity between the strokes of the symbols in the high resolution output. The HR data consists of 200 Tamil symbols, each having 150 samples on average. Further, each sample is manipulated to create 15 different rotated variants. We follow the two previously mentioned procedures to create the low-resolution patches for this character data also. Let this entire low-dimensional data be denoted as $X_3$ and the ground truth as $Y_3$.

Finally, to generalize the upscaling function and to make it independent of the input resolution, font and thickness of the symbols in the document image, we create an additional dataset of 2 million LR-HR pairs from images scanned at 75 and 150 dpi, respectively. Let this data be represented as $X_4$ and $Y_4$. Our entire dataset for upscaling by two is thus the combination of all the low-resolution data:

$$X = \{X_1, X_2, X_3, X_4\}$$

where $X_1$, $X_2$ and $X_3$ are the alternate-pixel, masked and character LR patches described above. The ground truth data comprises

$$Y = \{Y_1, Y_2, Y_3, Y_4\}$$


III-A2 Data for upscaling by a factor of 4

While generating the data for upscaling by four, the same procedures are used but the 200 dpi images are replaced by 300 dpi images as the ground truth. Our entire dataset for upscaling by four is the combination of the low-resolution data produced by each of these procedures, and the ground truth comprises the corresponding high-resolution patches.


III-A3 Test data

OCR performance does not differ much between 300 and 600 dpi images; but its performance on 75 and 100 dpi images is significantly lower than those on higher resolution images. Thus, without loss of generality, we choose the test data to be full length Tamil document images of 75 dpi resolution. When we pass these LR images through an OCR, the accuracy with which the OCR predicts the letters is low, resulting in an output of incorrect information. Now, by utilizing this dataset, one can train CNN models to convert LR Tamil images into HR images with better readability and OCR performance. Figure 1 illustrates a few character samples from this dataset, after scaling them for visual clarity.

IV Proposed CNN Models

Fig. 2: Binary document image super resolution using convolution-transposed convolution architecture (CTC)

We propose six (three, each with ReLU or PReLU as activation functions) neural network models to upscale the binary, low-resolution document images by a factor of two. We have also extended these to obtain six more architectures for upscaling by a factor of four.

IV-A Convolution-Transposed Convolution Model (CTC)

In this convolution-transposed convolution (CTC) architecture, we make use of two convolution layers (conv1 and conv2) without padding, followed by two transposed convolution layers (trconv1 and trconv2) for upscaling the input image by a factor of two, as shown in Fig. 2. A deeper model would consume a lot of time to train and test and hence, we have designed this architecture to have a balance between performance and speed. We have also used ReLU and PReLU activation functions and evaluated the performance of the model, as explained in the following sections. In order to upscale by a factor of 4, we add a new transposed convolution layer instead of replicating the entire model again. This approach has been followed in order to reduce the network depth, while achieving an upscale factor of 4.

IV-A1 Transposed convolution

A transposed convolution (TC) layer, also called fractionally strided convolution layer, operates by interchanging the forward and backward passes of the convolution process [55]. It has found its application in semantic segmentation [56], representation learning [57], mid-level and high-level feature learning [58], etc.

To enhance the resolution, we need a function that maps every pixel in the low-dimensional space to multiple pixels in the high-dimensional space. This can be achieved by introducing the transposed convolution layer after extracting features in the low-dimensional space. Unfortunately, this method has some demerits: the kernel can have uneven overlaps with the input feature map when the kernel size (the output window size) is not divisible by the stride (spacing between the input neurons) [59]. These overlaps occur in two dimensions, resulting in checkerboard-like patterns of varying magnitude. To tackle this issue, we use unit-stride TC layers, along with increasing kernel sizes, for our task. There are other alternatives to tackle this problem. For instance, by upscaling the LR image using bilinear interpolation and then utilizing the convolution layers for feature computation, we can prevent the occurrence of these checkerboard patterns. However, naively using this process may lead to the high-frequency image features being ignored during the upscaling [59].

In the proposed architectures, the dimension of the output feature of a transposed convolution layer can be calculated as given in [60]. According to [60], a convolution described by stride $s = 1$, padding $p = 0$ and filter size $k$ has an associated TC described by filter size $k' = k$, stride $s' = s$ and padding $p' = k - 1$, and its output size is given by,

$$o' = i' + (k - 1)$$

where $i'$ is the dimension of the tensor input to the transposed convolution layer.
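As a worked check of this size arithmetic, the sketch below traces the spatial dimensions of the CTC model listed in Table I (16, 12, 8, 16, 32). The kernel sizes used (5, 5, 9 and 17) are not quoted from the paper: they are inferred here as the only values that reproduce Table I under unit stride and no padding, and should be read as an assumption. The function names are hypothetical.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output size of an ordinary convolution with kernel k, stride s, padding p."""
    return (i + 2 * p - k) // s + 1


def transposed_conv_output_size(i, k):
    """Transpose of a unit-stride, zero-padding convolution: o' = i' + (k - 1)."""
    return i + (k - 1)


# Trace the CTC model for an upscale factor of 2 (kernel sizes are inferred).
size = 16
size = conv_output_size(size, k=5)               # conv1:   16 -> 12
size = conv_output_size(size, k=5)               # conv2:   12 -> 8
size = transposed_conv_output_size(size, k=9)    # trconv1:  8 -> 16
size = transposed_conv_output_size(size, k=17)   # trconv2: 16 -> 32
```

With unit stride, each TC layer can only grow the feature map by k - 1 pixels, which is why the kernel sizes have to increase from layer to layer to double the resolution.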

IV-B Parallel Stream Convolution Model (PSC)

Following the previous approach, we add a convolution layer (conv3) to the output of the trconv1 layer, whose output is then merged with the input (see Fig. 3). Since we are merging the feature output with the input image, the dimension of the merged feature map is the same as that of the input. Therefore, we make use of a transposed convolution upscaling layer (trconv3) at the end to upscale by a factor of two. We now have two parallel feature maps, which are merged to obtain the final high resolution output, as shown in Fig. 3. We therefore call this the parallel stream convolution (PSC) architecture. The performance of this method is also evaluated using ReLU and PReLU activation functions.

IV-B1 Residual training

Residual learning is most useful when there is a chance of exploding or vanishing gradients while training the network. Simply stacking more layers does not improve the performance of the network as much as combining residual blocks of layers does. In residual learning, the network does not learn the exact pixel-pixel correspondence; instead, it learns the residual output, which consists mostly of zeros or negligible numbers [21]. Thus, the network can be trained at a higher learning rate to predict the residuals rather than the actual pixels, while using more layers than the usual CNNs [16, 19, 21]. In our architecture, we have a residual connection from the input to one of the intermediate layers, instead of the typical connection to the final output layer. Thus, we can represent these connections as,

$$R = F_{\text{conv3}} + X$$

where $R$ is the output tensor after merging the output of conv3, $F_{\text{conv3}}$, and the input, $X$. We can now write the outputs of trconv2 and trconv3, in terms of the outputs of trconv1 and conv3 as,

$$F_{\text{trconv2}} = \mathrm{TC}_2(F_{\text{trconv1}}), \qquad F_{\text{trconv3}} = \mathrm{TC}_3(R)$$

where $\mathrm{TC}_2$ and $\mathrm{TC}_3$ denote the operations of the trconv2 and trconv3 layers, respectively.

The predicted output is finally given by,

$$\hat{Y} = F_{\text{trconv2}} + F_{\text{trconv3}}$$


Since we are combining the input image with the intermediate feature tensor, it is sufficient for the network to learn those extra set of features that are required for efficient upscaling, thus obviating the need to learn the redundant features already present in the input image. Here, we show the effectiveness of using residual connections between the intermediate features instead of initially upscaling the input image and combining it with the CNN model’s final output features.

IV-C CTC-Sub-pixel Convolution Model (CTS)

In this case, we deploy a “sub-pixel convolution” upscaling layer to perform the upscaling from the low to the high dimensional feature space. We replace one of the transposed convolution layers (trconv2) with a single sub-pixel layer (SubPixel1) in the same architecture as that of CTC, as shown in Fig. 4. Since the sub-pixel operation does not have trainable parameters as other upscaling layers do, the computational complexity is less than that of CTC. To achieve further upscaling, we need to increase the number of feature maps in the layer before the sub-pixel convolution. With this technique, only a single sub-pixel layer is sufficient to upscale by a factor of two or four.

IV-C1 Sub-pixel convolution

An alternative to fractionally strided convolution, interpolation and un-pooling methods for increasing the dimensionality is the sub-pixel convolution operation [22]. This layer is non-trainable, since it only implements matrix manipulations to change the feature dimensions and does not have any weights to learn. Let us assume that the input tensor to the sub-pixel convolution layer has dimensions $H \times W \times C r^2$, where $H$ and $W$ are the height and width of the tensor, respectively, $C$ is the number of channels and $r$ is the upscaling factor that we initially set out to achieve. After the sub-pixel operation, this feature tensor is periodically shuffled to dimensions $rH \times rW \times C$, thus resulting in an upscaled image. Let us first consider the following equation:

$$\hat{Y} = \mathcal{PS}\left(W_t \ast F_2(X) + b_t\right)$$

where $X$ is the input low-resolution binary document image, $\hat{Y}$ is the final upscaled image of $X$, $W_t$ and $b_t$ are the weights and bias of the transposed convolution layer, respectively, $F_2(X)$ is the output feature tensor of the second convolution layer in the CTS model, and $\mathcal{PS}$ is the periodic shuffling function.

Let $T$ be the input tensor to the sub-pixel convolution layer. Then the periodic shuffling function is given by,

$$\mathcal{PS}(T)_{x, y, c} = T_{\lfloor x/r \rfloor,\ \lfloor y/r \rfloor,\ C \cdot r \cdot \mathrm{mod}(y, r) + C \cdot \mathrm{mod}(x, r) + c}$$

where $x$, $y$ and $c$ are the co-ordinates of the periodically shuffled image. For further explanation of the function, please refer to [22].
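A direct, unoptimized sketch of periodic shuffling on a nested-list tensor indexed as `t[x][y][channel]` is given below. It follows the indexing rule of [22], in which output pixel (x, y, c) is read from input position (x // r, y // r) and channel C*r*(y mod r) + C*(x mod r) + c. The function name is hypothetical, and deep learning frameworks may order the sub-pixel channels differently, so this should not be read as matching any particular library call.

```python
def periodic_shuffle(t, r):
    """Rearrange an H x W x (C*r*r) tensor into an (r*H) x (r*W) x C tensor."""
    height, width = len(t), len(t[0])
    c_out = len(t[0][0]) // (r * r)   # number of output channels C
    return [[[t[x // r][y // r][c_out * r * (y % r) + c_out * (x % r) + c]
              for c in range(c_out)]
             for y in range(width * r)]
            for x in range(height * r)]


# Usage: a 2 x 2 x 4 tensor becomes a 4 x 4 x 1 image for r = 2.
features = [[[100 * i + 10 * j + k for k in range(4)]
             for j in range(2)]
            for i in range(2)]
image = periodic_shuffle(features, 2)
```

No value is computed or learned in this step: the r*r channels at each low-resolution position are simply spread over an r x r block of the output, which is why the layer adds no trainable parameters.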

Fig. 3: Binary document image super resolution using parallel-stream convolution architecture (PSC)
Fig. 4: Binary document image super resolution using convolution-transposed convolution-sub-pixel convolution (CTS)

V Experiments

The dataset we have created consists of approximately 5 million image pairs with diverse low resolution properties for training. It has been saved in a compressed format, from which the different neural network models are trained.

V-A The Activation Functions Used

The biologically inspired rectified linear units (ReLU) have been an effective part of neural network architectures since the publication of [61]. ReLU converges faster during training than other activation functions and also avoids the vanishing gradient problem. A more generalized (data-dependent) non-linear activation is the PReLU, where the network learns the parameters of the activation function during training [62]. The PReLU function is given by,

$$f(y) = \begin{cases} y, & y > 0 \\ a\,y, & y \le 0 \end{cases}$$

where $a$ is the data-dependent, learnable parameter.
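The two activations can be compared with a small scalar sketch. The function names are hypothetical, and the slope `a` is passed in by hand here, whereas in a network it would be learned during training.

```python
def relu(y):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return y if y > 0 else 0.0


def prelu(y, a):
    """Parametric ReLU: identity for positive inputs, slope `a` otherwise."""
    return y if y > 0 else a * y
```

Setting `a` to zero recovers ReLU, which is the sense in which PReLU is the more general of the two.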

V-B The Loss Function Used

We use the standard mean square error (MSE) function as the loss function to train the model.

$$\mathrm{MSE} = \frac{1}{N h w} \sum_{n=1}^{N} \sum_{x=1}^{h} \sum_{y=1}^{w} \left(Y_n(x, y) - \hat{Y}_n(x, y)\right)^2$$

Here, $Y_n(x, y)$ and $\hat{Y}_n(x, y)$ refer to the ground truth and the final layer output, respectively, of the CTC, PSC or CTS model, whichever is being used, at the co-ordinates $(x, y)$. $h$ and $w$ are the height and width of the ground truth/high resolution image, $N$ is the batch size, and $n$ is the index of the training data in a particular batch. We optimize the MSE function using the ADAM optimizer with its default parameter values.
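A minimal sketch of the MSE over a batch of images follows. The normalization, a mean over every pixel of every image in the batch, is an assumption, since the exact constant used by the authors is not recoverable from the text. The function name and the nested-list batch layout are hypothetical.

```python
def mse_loss(truth, pred):
    """Mean squared error over a batch of equally sized 2-D images.

    truth, pred: lists of images, each image a list of rows of pixel values.
    """
    total, count = 0.0, 0
    for truth_img, pred_img in zip(truth, pred):
        for truth_row, pred_row in zip(truth_img, pred_img):
            for t, p in zip(truth_row, pred_row):
                total += (t - p) ** 2
                count += 1
    return total / count


# Usage: one 2 x 2 image with a single wrong pixel.
ground_truth = [[[1, 0], [0, 1]]]
prediction = [[[1, 0], [0, 0]]]
loss = mse_loss(ground_truth, prediction)
```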

V-C Implementation Details

In this subsection, we describe the implementation details of all the models for upscaling by 2 and 4 times.

V-C1 Convolution-transposed convolution architecture

The numbers of filters in the layers of the CTC model (shown in Fig. 2) for an upscale factor of 2 are as follows: conv1: 48; conv2: 16; trconv1: 16; trconv2: 1. Table I gives the dimensions of the resulting intermediate feature maps.

Table II gives the dimensions of the feature maps for upscaling by four times. For this, we increase the number of feature maps in the second transposed convolution layer from 1 to 8. This is followed by the addition of an extra TC layer (trconv3) with a depth of 1.

V-C2 Parallel stream convolution architecture

The numbers of filters in the layers of the PSC architecture (shown in Fig. 3) for upscaling by two are as follows: conv1: 48; conv2: 16; trconv1: 16; trconv2: 1; conv3: 1; trconv3: 1. Table III gives the dimensions of the resulting intermediate feature maps.

To upscale by four times, we increase the number of filters in the trconv2 layer to 8 and add another transposed convolution layer (trconv4) to it, which has 1 filter. In the second stream also, we add an extra TC layer (trconv5) with 1 filter, and merge the outputs of trconv4 and trconv5 to obtain the final super-resolved image. Table IV gives the dimensions of the feature maps in this case.

V-C3 Convolution-transposition-sub-pixel architecture

The number and sizes of the filters in the CTS architecture (shown in Fig. 4) for upscaling by two are: , , . This framework is followed by a sub-pixel convolution layer for upscaling by a factor of . Table V gives the dimensions of the resulting intermediate feature maps.

To upscale by four times, we increase the number of filters in the TC layer 1: and add a third convolution layer: , followed by the sub-pixel convolution layer with upscaling factor 4. Table VI gives the dimensions of the different feature maps.
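The sub-pixel convolution layer ends with a rearrangement of channels into space (often called depth-to-space or pixel shuffle). The following minimal sketch shows that rearrangement for a single output channel; the channel ordering used here is one common convention and is an assumption, as is the function name `depth_to_space`.

```python
def depth_to_space(features, r):
    """Rearrange an H x W x (r*r) feature map into an (H*r) x (W*r) image.

    Channel k of input pixel (i, j) is written to output pixel
    (i*r + k // r, j*r + k % r). This ordering is an assumed convention.
    """
    h, w = len(features), len(features[0])
    assert all(len(px) == r * r for row in features for px in row)
    out = [[0.0] * (w * r) for _ in range(h * r)]
    for i in range(h):
        for j in range(w):
            for k, value in enumerate(features[i][j]):
                out[i * r + k // r][j * r + k % r] = value
    return out


# A 16 x 16 x 4 map becomes 32 x 32 (r = 2); 16 x 16 x 16 becomes 64 x 64 (r = 4).
fmap2 = [[[0.0] * 4 for _ in range(16)] for _ in range(16)]
fmap4 = [[[0.0] * 16 for _ in range(16)] for _ in range(16)]
up2 = depth_to_space(fmap2, 2)
up4 = depth_to_space(fmap4, 4)

# One input pixel with four channels fills a 2 x 2 output block.
tiny = depth_to_space([[[1.0, 2.0, 3.0, 4.0]]], 2)  # [[1.0, 2.0], [3.0, 4.0]]
```

This is why the layer feeding the sub-pixel step needs 4 channels for a factor of 2 and 16 channels for a factor of 4.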

Layer             Dimensions of features (channels last)
Input             16 × 16 × 1
conv1             12 × 12 × 48
conv2             8 × 8 × 16
trconv1           16 × 16 × 16
trconv2 (output)  32 × 32 × 1
TABLE I: Dimensions of the intermediate feature maps of the CTC model for an upscale factor of 2.

Layer             Dimensions of features (channels last)
Input             16 × 16 × 1
conv1             12 × 12 × 48
conv2             8 × 8 × 16
trconv1           16 × 16 × 16
trconv2           32 × 32 × 8
trconv3 (output)  64 × 64 × 1
TABLE II: Dimensions of the intermediate feature maps of the CTC architecture for an upscale factor of 4.

Layer                      Dimensions of features (channels last)
Input                      16 × 16 × 1
conv1                      12 × 12 × 48
conv2                      8 × 8 × 16
trconv1                    16 × 16 × 16
trconv2                    32 × 32 × 1
conv3                      16 × 16 × 1
trconv3                    32 × 32 × 1
Output (trconv2+trconv3)   32 × 32 × 1
TABLE III: Dimensions of the intermediate feature maps of the PSC architecture for an upscale factor of 2.

Layer                      Dimensions of features (channels last)
Input                      16 × 16 × 1
conv1                      12 × 12 × 48
conv2                      8 × 8 × 16
trconv1                    16 × 16 × 16
trconv2                    32 × 32 × 8
trconv4                    64 × 64 × 1
conv3                      16 × 16 × 1
trconv3                    32 × 32 × 1
trconv5                    64 × 64 × 1
Output (trconv4+trconv5)   64 × 64 × 1
TABLE IV: Dimensions of the intermediate feature maps of the PSC model for upscaling by a factor of 4.

Layer               Dimensions of features (channels last)
Input               16 × 16 × 1
conv1               12 × 12 × 48
conv2               8 × 8 × 16
trconv1             16 × 16 × 4
sub-pixel (output)  32 × 32 × 1
TABLE V: Dimensions of the intermediate feature maps of the CTS architecture for upscaling by a factor of 2.

Layer               Dimensions of features (channels last)
Input               16 × 16 × 1
conv1               12 × 12 × 48
conv2               8 × 8 × 16
trconv1             16 × 16 × 48
conv3               16 × 16 × 16
sub-pixel (output)  64 × 64 × 1
TABLE VI: Dimensions of the intermediate feature maps of the CTS model for an upscale factor of 4.

V-D Power-Law Transformation

In document images, some characters may split into multiple segments, making them unsuitable for the OCR to recognize properly. Thus, it helps to have a method of increasing the spread and connectivity of the pixels in each character before feeding the document image to the OCR. We utilize the power-law transformation to fulfill that need. The basic form of the power-law transformation [63] is:

$$s(x, y) = c \, r(x, y)^{\gamma}$$

where $r$ and $s$ are the input and output intensities, respectively, $(x, y)$ are the co-ordinates of the gray-scale images, and $c$ and $\gamma$ are positive constants. In our case, $r$ is the CNN model output. The exponent in the power-law equation is referred to as $\gamma$ (gamma). Hence, this process was originally called gamma correction. In our experiments, $\gamma$ is varied in the range of 0 to 1 in steps of 0.1, while the value of $c$ is fixed as 1. When $\gamma = 1$, the image pixel intensities are unchanged and thus the output undergoes normal binarization, the same as that of the input. When $\gamma < 1$, we observe that the split characters get merged in the output, which results in better OCR performance. If we increase $\gamma$ to values higher than 1, the individual split components of characters may further split into multiple components, leading to poorer performance of the OCR in recognizing the characters and words in the document images.
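The effect of the exponent can be seen on a toy example. The sketch below applies the power-law mapping (output intensity = c × input intensity raised to gamma) to one row of a stroke and then binarizes it. It assumes intensities scaled to [0, 1], text pixels that are bright, and a fixed threshold of 0.5; these conventions and the function names are assumptions for illustration, not details given in the paper.

```python
def power_law(image, gamma, c=1.0):
    """Apply s = c * r**gamma to every pixel of a gray-scale image in [0, 1]."""
    return [[c * (r ** gamma) for r in row] for row in image]


def binarize(image, threshold=0.5):
    """Map each pixel to 1 (text) if it reaches the threshold, else 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in image]


# One row of a stroke with a faint gap (0.3) between two bright segments.
# Assumption: text pixels are bright (close to 1) and background is dark.
row = [[0.9, 0.8, 0.3, 0.8, 0.9]]
plain = binarize(power_law(row, gamma=1.0))   # [[1, 1, 0, 1, 1]]: stroke is split
merged = binarize(power_law(row, gamma=0.5))  # [[1, 1, 1, 1, 1]]: gap is bridged
```

With gamma below 1 the faint pixel is lifted above the threshold (0.3 raised to 0.5 is about 0.55), so the two segments join into one stroke.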

VI Results and Discussion

Fig. 5: Binary document image super resolution using multi-parallel stream architecture.

We compare the OCR accuracy on the input low resolution, binary image with those on the images reconstructed by the proposed methods. We consider the OCR accuracy to be the highest priority comparison metric, since it is an objective measure of the quality of any document image. We also obtain the mean opinion score (MOS) from twenty human evaluators, ten each of non-Tamils and native Tamils.

Figure 6 gives the results of the various proposed models for a small cropped region of one of the input test images. From Fig. 6, we can qualitatively observe the major and minor differences in the character level predictions of the proposed models. The left figure in the top panel shows the input, which has been cropped from the Tamil document image and zoomed for the purpose of visualization; beside it is the corresponding zoomed ground truth image. For the sake of visual comparison, we have used bicubic interpolation as a baseline and given the images interpolated by factors of 2 and 4. The second row displays the outputs of the CTC model and its variants. The first image, C2, is the two-times upscaled output of the CTC model. The second image, C4, is the result of four-times upscaling. The third result, CP2, is the two-times upscaled output using PReLU as the activation function. Similarly, CP4 is the image obtained after four-times upscaling using PReLU. The third and fourth rows show the output images of the different variants of the PSC and CTS models, respectively.

Figure 7 shows a part of a test image, its output image and the corresponding text outputs obtained from the Google online OCR (Google Drive based). Figure 7 (a) shows the poor quality of the 75-dpi binary input image, which even native Tamils cannot read easily. As clearly revealed by the output text given in Fig. 7 (b), there are many errors arising from poor image segmentation during the OCR process. Roman and Chinese characters, Indo-Arabic numerals and certain other symbols are wrongly present in the recognized output. Figure 7 (c) illustrates the relatively high quality, upscaled image produced by the sub-pixel convolution architecture with PReLU activation. The human readability of the resultant image is clearly high, and a native Tamil can read the text easily, in spite of some strokes still missing. Accordingly, the text output by Google OCR (shown in Fig. 7 (d)) is also significantly better, with not even a single Roman character or numeral present.

Model  LR input  2× ReLU  2× PReLU  4× ReLU  4× PReLU  4× PReLU + power law
CTC    25.83     28.01    34.35     46.97    52.31     53.46
PSC    25.83     31.70    36.74     44.46    52.19     53.1
CTS    25.83     44.09    44.27     55.08    62.06     63.68
TABLE VII: Character level accuracies (%) obtained by the OCR on the images output by the different proposed techniques.

Model  2× ReLU  2× PReLU  4× ReLU  4× PReLU
CTC    3        4.1       5.3      5.8
PSC    3.5      3.8       5.3      4.9
CTS    7.2      7.1       7.9      8.2
TABLE VIII: MOS obtained (on a scale of 1 to 10) from 10 human evaluators (non-Tamils) on the images output by the different proposed techniques.

Model  2× ReLU  2× PReLU  4× ReLU  4× PReLU
CTC    4.5      5.8       4.8      5.8
PSC    6.1      6.3       5.8      5.5
CTS    8.5      9         8.3      9.6
TABLE IX: MOS obtained from 10 human evaluators (native Tamils) on the images output by the different proposed techniques.

Fig. 6: First row depicts the input low resolution, binary image, its corresponding high resolution ground truth and bicubic interpolated images with upscale factors of 2 and 4, respectively. The other rows illustrate the images output by the following models: C2, C4: convolution-transposed convolution architecture, CTC for upscale factors of 2 and 4; CP2, CP4: CTC using PReLU for upscale factors of 2 and 4; R2, R4: Parallel stream convolution architecture for upscale factors of 2 and 4; RP2, RP4: PSC using PReLU for upscale factors of 2 and 4; S2, S4: CT-subpixel convolution architecture for upscale factors of 2 and 4; SP2, SP4: CTS using PReLU for upscale factors of 2 and 4.
Fig. 7: Illustration of the significant improvement in readability and OCR performance of the Tamil binary document image, after enhancement by upscaling using the CTS architecture. (a) A small part of the 75-dpi, binary input image to our CTS model. (b) Output of Google OCR for the input image segment. (c) The corresponding segment of the output image generated by our model. (d) Text output of Google OCR for our generated output image segment.

Table VII compares the mean OCR character level accuracies of the outputs. We observe that the performance of the sub-pixel stream (CTS) is, on average, nearly 10% better than that of the transposed convolution stream (CTC) and 8% better than that of the residual-connection stream (PSC); hence, its outputs are illustrated in Fig. 6 with a black background to differentiate them from those of the other methods. Two-times upscaling with CTS-PReLU results in a 71.4% relative improvement in OCR accuracy over the input image, whereas CTC-PReLU achieves 33%. Results of upscaling by 4 have more image details than those of upscaling by 2, which facilitates character recognition by the OCR software and thus yields higher character level accuracies. A final processing step is the application of the power-law transformation to the network output. This results in marginally better recognition due to the improved connectivity of the image pixels. With this step, the four-times CTC-PReLU output attains a relative improvement in recognition accuracy of 107%, whereas the four-times CTS-PReLU output achieves a significant relative increase of 146.5%.
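The relative improvements quoted here follow directly from the accuracies in Table VII. As a worked check, the sketch below recomputes two of them from the input accuracy of 25.83%. The pairing of 34.35% and 44.27% with the two-times PReLU outputs of CTC and CTS is inferred from the percentages quoted in the text, and the helper name is hypothetical.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


input_acc = 25.83  # OCR character accuracy (%) on the low-resolution input
gains = {
    "CTC-PReLU": relative_improvement(input_acc, 34.35),
    "CTS-PReLU": relative_improvement(input_acc, 44.27),
}
# gains["CTC-PReLU"] is about 33.0 and gains["CTS-PReLU"] is about 71.4
```

The same function applied to 53.46% and 63.68% gives roughly 107% and 146.5%, matching the figures reported after the power-law step.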

Tables VIII and IX list the mean opinion scores for the quality of the output images, given by ten non-Tamils and ten native Tamils, respectively. Both groups have subjectively rated (the outputs of) CTS to be the best of the three models. Further, barring a few exceptions, the PReLU outputs have been rated better than the ReLU outputs. Also, the non-Tamil evaluators, who decide purely based on the image features, have consistently rated the four-times upscaled outputs to be better than the two-times upscaled outputs.

The primary issue we initially faced was the unavailability of a diverse dataset containing corresponding binary patches of low and high resolutions to train the neural networks. Therefore, we have created our own dataset (which will be made publicly available) and built CNN architectures specific to the task at hand. Methods in the literature using convolutional neural networks have been trained and tested on different datasets (derived from natural, colour images or gray level document images) with different input and output dimensions. Thus, an attempt to compare the results of those models on our binary dataset would require modification of their existing architectures, which may fail to demonstrate the maximum potential of the originally proposed models.

While developing different architectures, we have also implemented a three-stream, parallel neural network as shown in Fig. 5, in which the outputs of CTC, PSC, and CTS are merged to get the final output. We observe that the sub-pixel convolution layer contributes more details than the other two streams to the overall output, while training on either two-times or four-times upscaling data.

In the previous sections, we have mentioned the poor performance of the OCR on sparsely connected symbols in the document images. The primary reason for this is the following: when a low quality image is passed to the OCR, the pixels representing a symbol are not properly connected, so many symbols are segmented into multiple pieces during the segmentation stage. Each of these split components is wrongly classified by the OCR as one of the Tamil symbols, leading to the poor classification of the binary document image.
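The splitting described above can be made concrete by counting connected components in a binarized glyph. The sketch below is a simplified stand-in for an OCR segmentation stage, not the procedure used by any particular OCR engine; it assumes text pixels are stored as 1 and uses 8-connectivity, and the function name is hypothetical.

```python
def count_components(binary):
    """Count 8-connected groups of 1-pixels in a binary image (list of rows)."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy][sx] != 1 or seen[sy][sx]:
                continue
            count += 1                      # found an unvisited component
            stack = [(sy, sx)]
            seen[sy][sx] = True
            while stack:                    # flood-fill the whole component
                y, x = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count


# A vertical stroke with a one-pixel break, and the same stroke intact.
broken = [[0, 1, 0],
          [0, 1, 0],
          [0, 0, 0],
          [0, 1, 0]]
intact = [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]]
pieces_broken = count_components(broken)  # 2 pieces
pieces_intact = count_components(intact)  # 1 piece
```

A segmenter that treats each component as a separate symbol would pass two fragments to the classifier for the broken stroke, which is the failure mode the super-resolved output is meant to reduce.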

VII Conclusion

In this paper, we have proposed effective architectures for the problem of binary document image super-resolution, using artificial neural networks [68]. We initially build a basic CNN model to perform two-times upscaling of the input low resolution image. We progressively modify this architecture by incorporating additional upscaling layers, residually connecting the input to the intermediate feature maps, changing the activation function from ReLU to PReLU, and changing the upscaling function from transposed convolution to sub-pixel convolution. We observe that the model employing sub-pixel convolution as the upscaling function and PReLU as the activation function outperforms the other models. Its four-times upscaled output image results in a significant relative improvement in OCR accuracy of approximately 140%. For further enhancement of the image details, we perform power-law transformation on the neural network output and observe a marginal improvement in the OCR accuracy. An additional benefit of the enhanced image quality is the improved readability of the document content, which makes it easier for people to read the super-resolved version of a low quality document image.

VIII Future Work

We expect to continue this work with a much larger and more diverse dataset encompassing various languages, to test the scalability of the approach to other languages. We will also work on building more efficient CNN models.


  • [1] Tsai, R. Y, “Multiframe image restoration and registration,” Adv. Comput. Vis. Image Process. 1.2 (1984): 317-339.
  • [2] Yang, Jianchao, and Thomas Huang, “Image super-resolution: Historical overview and future challenges,” Super-resolution imaging (2010): 20-34.
  • [3] Project Madurai for ancient Tamil literary works. Last accessed Sep. 30, 2017.
  • [4] Farsiu, Sina, et al, “Fast and robust multiframe super resolution,” IEEE Trans. image Process. 13.10 (2004): 1327-1344.
  • [5] Irani, Michal, and Shmuel Peleg, “Improving resolution by image registration,” CVGIP: Graphical models and image Process. 53.3 (1991): 231-239.
  • [6] Park, Sung Cheol, Min Kyu Park, and Moon Gi Kang, “Super-resolution image reconstruction: a technical overview,” IEEE signal Process. magazine 20.3 (2003): 21-36.
  • [7] Freeman, William T., Thouis R. Jones, and Egon C. Pasztor, “Example-based super-resolution,” IEEE Comput. graphics and Applicat. 22.2 (2002): 56-65.
  • [8] Glasner, Daniel, Shai Bagon, and Michal Irani, “Super-resolution from a single image,” Computer Vision, 12th Int. Conf. IEEE, 2009.
  • [9] Yang, Chih-Yuan, Chao Ma, and Ming-Hsuan Yang, “Single-Image Super-Resolution: A Benchmark,” ECCV (4). 2014.
  • [10] Kopf, Johannes, and Dani Lischinski, “Depixelizing pixel art,” ACM Trans. graphics (TOG). Vol. 30. No. 4. ACM, 2011.
  • [11] Tappen, Marshall, and Ce Liu, “A Bayesian approach to alignment-based image hallucination,” Comput. Vis.–ECCV 2012 (2012): 236-249.
  • [12] Yang, Chih-Yuan, Sifei Liu, and Ming-Hsuan Yang, “Structured face hallucination,” Proc. IEEE Conf. Comput. Vis. and Pattern Recog. 2013.
  • [13] Sun, Jian, Zongben Xu, and Heung-Yeung Shum, “Image super-resolution using gradient profile prior,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
  • [14] Freedman, Gilad, and Raanan Fattal, “Image and video upscaling from local self-examples,” ACM Trans. Graphics (TOG) 30.2 (2011): 12.
  • [15] Yang, Chih-Yuan, and Ming-Hsuan Yang, “Fast direct super-resolution by simple functions,” Proc. IEEE Int. Conf. Comput. Vis. 2013.
  • [16] Kim, Jiwon, Jung Kwon Lee, and Kyoung Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” Proc. IEEE Conf. Comput. Vis. and Pattern Recognition. 2016.
  • [17] Dong, Chao, Chen Change Loy, Kaiming He, and Xiaoou Tang, “Learning a deep convolutional network for image super-resolution,” In European conf. on comput. vis., pp. 184-199. Springer, Cham, 2014.
  • [18] Dong, Chao, et al,“Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Analysis Machine Intelligence 38.2 (2016): 295-307.
  • [19] Kim, Jiwon, Jung Kwon Lee, and Kyoung Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” Proc. IEEE Conf. Comput. Vis. and Pattern Recognition. 2016.
  • [20] Dong, Chao, Chen Change Loy, and Xiaoou Tang, “Accelerating the super-resolution convolutional neural network,” Proc. ECCV, Netherlands, Oct. 11-14, 2016.
  • [21] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition.” Proc. of the IEEE conf. on comput. vis. and pattern recog. 2016.
  • [22] Shi, Wenzhe, et al, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” Proc. IEEE Conf. Comput. Vis. Pattern Recog. 2016.
  • [23] Peleg, Tomer, and Michael Elad, “A statistical prediction model based on sparse representations for single image super-resolution,” IEEE Trans. image Process. 23.6 (2014): 2569-2582.
  • [24] Efrat, Netalee, et al, “Accurate blur models vs. image priors in single image super-resolution,” Proc. IEEE ICCV. 2013.
  • [25] Fernandez-Granda, Carlos, and Emmanuel J. Candes, “Super-resolution via transform-invariant group-sparse regularization,” Proc. IEEE Int. Conf. Comput. Vis. 2013.
  • [26] Wang, Qiang, Xiaoou Tang, and Harry Shum, “Patch based blind image super resolution,” Comput. Vis. ICCV 2005. 10th IEEE Int. Conf. Vol. 1. IEEE, 2005.
  • [27] Mac Aodha, Oisin, et al, “Patch based synthesis for single depth image super-resolution,” Comput. Vision. ECCV 2012 (2012): 71-84.
  • [28] Bevilacqua, Marco, et al, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” 2012: 135-1.
  • [29] Walha, Rim, et al, “Resolution enhancement of textual images via multiple coupled dictionaries and adaptive sparse representation selection,” Int. J. Doc. Anal. Recog. 18.1 (2015): 87-107.
  • [30] Zeyde, Roman, Michael Elad, and Matan Protter, “On single image scale-up using sparse-representations,” Int. conf. on curves and surfaces. Springer, Berlin, Heidelberg, 2010.
  • [31] Aharon, Michal, Michael Elad, and Alfred Bruckstein, “K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation,” IEEE Trans. signal Process. 54.11 (2006): 4311-4322.
  • [32] Walha, Rim, et al, “Super-resolution of single text image by sparse representation.” Proc. workshop Doc. Anal. Recog. ACM, 2012.
  • [33] Banerjee, Jyotirmoy and C. V. Jawahar, “Super-Resolution of Text Images Using Edge-Directed Tangent Field,” Document Analysis Systems (2008).
  • [34] Donaldson, Katherine, and Gregory K. Myers, “Bayesian super-resolution of text in video with a text-specific bimodal prior,” Proc. IEEE Comput. Vision Pat. Recog. 2005.
  • [35] Drucker, Harris, Robert Schapire, and Patrice Simard, “Improving performance in neural networks using a boosting algorithm,” Advances in neural inform. process. systems. 1993.
  • [36] Dong, Chao, et al, “Boosting optical character recognition: A super-resolution approach.” arXiv preprint arXiv:1506.02211 (2015).
  • [37] Zomet, Assaf, Alex Rav-Acha, and Shmuel Peleg, “Robust super-resolution,” Comput. Vis. and Pattern Recog., 2001. Proc. of the 2001 IEEE Comput. Society Conf. on. Vol. 1. IEEE, 2001.
  • [38] Roweis, Sam T., and Lawrence K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science 290.5500 (2000): 2323-2326.
  • [39] Yang, Jianchao, et al, “Image super-resolution as sparse representation of raw image patches,” Comput. Vis. Pat. Recog. 2008.
  • [40] Yang, Jianchao, John Wright, Thomas S. Huang, and Yi Ma, “Image super-resolution via sparse representation,” IEEE Trans. image Process. 19(11), 2010: 2861-2873.
  • [41] Yang, Jianchao, Zhaowen Wang, Zhe Lin, Scott Cohen, and Thomas Huang, “Coupled dictionary training for image super-resolution,” IEEE Trans. Img. Process. 21(8) 2012: 3467-3478.
  • [42] Zeyde, Roman, Michael Elad, and Matan Protter, “On single image scale-up using sparse-representations.” Int. conf. on curves and surfaces. Springer, Berlin, Heidelberg, 2010.
  • [43] Timofte, Radu, Vincent De Smet, and Luc Van Gool, “Anchored neighborhood regression for fast example-based super-resolution,” Proc. IEEE Int. Conf. Comput. Vision. 2013.
  • [44] Timofte, Radu, Vincent De Smet, and Luc Van Gool. “A+: Adjusted anchored neighborhood regression for fast super-resolution,” Asian Conf. Comput. Vis. Springer, Cham, 2014.
  • [45] Ledig, Christian, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken et al, “Photo-realistic single image super-resolution using a generative adversarial network,” arXiv preprint arXiv:1609.04802 (2016).
  • [46] Ram Krishna Pandey and A. G. Ramakrishnan, “Language independent single document image super-resolution using CNN for improved recognition,” arXiv preprint arXiv:1701.08835 (2017).
  • [47] Ram Krishna Pandey, and A. G. Ramakrishnan, “Efficient document-image super-resolution using convolutional neural network,” Sadhana, 2018.
  • [48] Peyrard, Clément et al, “ICDAR2015 competition on Text Image Super-Resolution.” ICDAR (2015).
  • [49] Karatzas, Dimosthenis, et al, “ICDAR 2015 competition on robust reading,” Document Analysis and Recog. (ICDAR), 2015 13th Int. Conf. on. IEEE, 2015.
  • [50] LeCun, Yann, et al, “Backpropagation applied to handwritten zip code recognition,” Neural computation 1.4 (1989): 541-551.
  • [51] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural informat. process. systems. 2012.
  • [52] Ouyang, Wanli, et al, “Deepid-net: Deformable deep convolutional neural networks for object detection.” Proc. IEEE Conf. Comput. Vis. and Pattern Recognition. 2015.
  • [53] Zhou, Bolei, et al, “Learning deep features for scene recognition using places database,” Advances in neural inform. Process. systems. 2014.
  • [54] Szegedy, Christian, et al, “Going deeper with convolutions.” Proc. IEEE conf. on comput. vis. and pattern recognition. 2015.
  • [55] Zeiler, Matthew D., et al,“Deconvolutional networks.” Comput. Vis. and Pattern Recognition (CVPR). IEEE, 2010.
  • [56] Long, Jonathan, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” Proc. IEEE Conf. Comput. Vis. and Pattern Recognition. 2015.
  • [57] Bengio, Yoshua, Aaron Courville, and Pascal Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. pattern analysis and machine intelligence 35.8 (2013): 1798-1828.
  • [58] Zeiler, Matthew D., Graham W. Taylor, and Rob Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” Comput. Vis. (ICCV), IEEE Int. Conf., 2011.
  • [59] Odena, et al., “Deconvolution and Checkerboard Artifacts,” Distill, 2016.
  • [60] Dumoulin, Vincent, and Francesco Visin, “A guide to convolution arithmetic for deep learning,” arXiv:1603.07285 (2016).
  • [61] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio, “Deep sparse rectifier neural networks,” Proc. 14th Int. Conf. Artificial Intelligence and Stat. 2011.
  • [62] He, Kaiming, et al, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” Proc. of the IEEE Int. Conf. on comput. vis. 2015.
  • [63] Kumar, Deepak, and A. G. Ramakrishnan, “Power-law transformation for enhanced recognition of born-digital word images.” Signal Process. and Commun. (SPCOM), Int. Conf. IEEE, 2012.
  • [64] Ali Abedi and Ehsanollah Kabir, “Text-image super-resolution through anchored neighborhood regression with multiple class-specific dictionaries,” Signal, Image and Video Processing, 2017, 11(2), pp. 275-282.
  • [65] Ali Abedi and Ehsanollah Kabir, “Text image super resolution using within-scale repetition of characters and strokes,” Multimedia Tools and Applications, Volume 76 Issue 15, August 2017, Pages 16415-16438.
  • [66] Pandey, Ram Krishna, and A. G. Ramakrishnan, “A hybrid approach of interpolations and CNN to obtain super-resolution,” arXiv preprint arXiv:1805.09400 (2018).
  • [67] Simonyan, Karen, and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556 (2014).
  • [68] Ram Krishna Pandey and A G Ramakrishnan, “Method and system for enhancing binary document image quality for improving readability and OCR performance,” Provisional Patent Application No. 201841030740 (TEMP/E-1/33470/2018-CHE), Indian Patent Office.