UDBNET: Unsupervised Document Binarization Network via Adversarial Game

07/14/2020 ∙ by Amandeep Kumar, et al.

Degraded document image binarization is one of the most challenging tasks in the domain of document image analysis. In this paper, we present a novel approach towards document image binarization by introducing a three-player min-max adversarial game. We train the network in an unsupervised setup, assuming that no paired training data is available. In our approach, an Adversarial Texture Augmentation Network (ATANet) first superimposes the texture of a degraded reference image over a clean image. The clean image, along with its generated degraded version, constitutes the pseudo paired data used to train the Unsupervised Document Binarization Network (UDBNet). Following this approach, we enlarge the document binarization datasets, since the method generates multiple images having the same content but different noisy textures. The generated noisy images are then fed into the UDBNet to recover the clean version. The joint discriminator, the third player of our three-player min-max adversarial game, tries to couple the ATANet and the UDBNet. The game stops when the distributions modelled by the ATANet and the UDBNet align to the same joint distribution over time. Thus, the joint discriminator enforces the UDBNet to perform better on real degraded images. The experimental results indicate the superior performance of the proposed model over existing state-of-the-art algorithms on the widely used DIBCO datasets. The source code of the proposed system is publicly available at https://github.com/VIROBO-15/UDBNET.

I Introduction

Document image binarization is a fundamental problem in the field of document analysis. Binarization itself is the preprocessing backbone of many document image processing systems (DIPSs) [9, 21]. The performance of high-level processing tasks, such as image segmentation [29], word recognition [5, 4], optical character recognition (OCR) [2], and document layout analysis (DLA) [43], greatly depends on the success of the binarization task. Technically, document image binarization is the technique of converting color or gray-level document images into a binary representation, where the main objective is to classify each pixel as foreground (text/ink) or background (parchment/paper). In other words, it is the process of discarding unnecessary noisy information while preserving the meaningful visual information.

Document image binarization can be considered an easy task for images with a uniform distribution. However, in real-world scenarios with significant image noise and uneven backgrounds, binarization is a quite challenging problem. Moreover, document images suffer from various degradations due to faint characters, bleed-through, clutter and artifacts, dark patches, creases, faded ink, non-uniform intensity variation, inadequate maintenance, aging effects, ink stains, lighting conditions, warping effects during acquisition, etc. Faded ink makes it difficult to distinguish light text from the background. Bleed-through occurs when content from the back of a page becomes visible or ‘leaks’ through; it complicates labeling foreground and background during binarization, as background can be misinterpreted as foreground. Uneven illumination happens when the image suffers from shadow effects or inconsistent lighting during acquisition. In addition to the above, dark patches are quite difficult to remove for various reasons. Firstly, these patches are of varying sizes and intensities. Secondly, they appear as stains of arbitrary shapes. Thirdly, they are often present in areas containing characters. Therefore, the study of binarization for document images, especially in the context of degraded images, is highly essential.

In general, binarization methods [47, 48, 18] work in a supervised setup. In the supervised setup, we need the ground-truth binarized image along with the degraded image. However, it is difficult to obtain the corresponding ground-truth binary image in many scenarios, as in the case of historical document images. To address these drawbacks, Bhunia et al. [6] first attempted to introduce an unsupervised setup in the domain of document image binarization. For this purpose, they employ a Texture Augmentation Network (TANet) that superimposes the noisy appearance of the degraded document on the clean binary image to generate multiple degraded images of the same textual content with various noisy textures, and later utilize a Binarization Network (BiNet) to recover the clean version of the document image. Although this method has shown better results than the previous state-of-the-art methods, it has several limitations. Firstly, the TANet is completely unaware of the content on which it is conditioned. Thus, the corresponding discriminator cannot verify whether the content of the generated noisy image remains consistent. Secondly, there exists no performance quantifier that validates the performance of the BiNet on real degraded noisy images. Finally, the BiNet has a dataset bias towards generated noisy images, yet it does not use any formulation or other technique to address this bias. In our observation, these limitations stem from the fact that the TANet and BiNet both employ straightforward two-player Generative Adversarial Network (GAN) [16] objectives and model two different uncorrelated conditional distributions. In this paper, we address these limitations by introducing an adversarial min-max game in the domain of unsupervised document image binarization. Similar to the TANet and BiNet, we propose an Adversarial Texture Augmentation Network (ATANet) and an Unsupervised Document Binarization Network (UDBNet), which utilize three-player GAN objectives. The proposed third player is a joint discriminator that tries to couple the ATANet and the UDBNet. Our three-player min-max adversarial game comes to an end when the distributions modelled by the ATANet and the UDBNet align to the same joint distribution over time. The contributions of this paper are as follows:

  • To the best of our knowledge, we are the first to introduce an adversarial game in the domain of document image binarization, by proposing the Adversarial Texture Augmentation Network (ATANet) and the Unsupervised Document Binarization Network (UDBNet).

  • We introduce a joint discriminator that tries to couple the ATANet and the UDBNet, so that the model can tackle the dataset-bias problem and perform well on real degraded document images.

  • Our approach shows superior performance on the widely used DIBCO datasets compared to the existing state-of-the-art methods.

The remainder of the paper is organized as follows: In Section II, we discuss related work in the field of document image binarization. In Section III, we describe the proposed framework. The datasets, implementation details, baseline methods and performance analysis are discussed in Section IV. Section V concludes the paper.

II Related Works

Document image binarization is a classical research problem in computer-aided document analysis and has been studied extensively over the past few decades. It aims at labeling each part of the document image as either foreground text or background. The simplest and most widely used approach is thresholding, which sets the pixels below a threshold value to 0 and the rest to 1. Thresholding methods are primarily of three types: global, local and hybrid. In high-quality images, global algorithms can effectively estimate a threshold based on the entire image. Global thresholds can be calculated using the gray-level histogram [29], circular statistics [26], error minimization [25], histogram entropy [23] and the moment-preserving principle [44]. Clustering models [30] can also learn mappings in an unsupervised manner based on global features to separate background and foreground. However, performance degrades when they are applied to images having variations in background due to illumination, occlusion or degradation. For such cases, local adaptive methods perform better. Some of the common local thresholding approaches can be seen in the works of Bernsen et al. [3], Niblack et al. [27] and Sauvola et al. [39]. A major drawback of the work by Niblack et al. [27] is that if the foreground text is sparse, a lot of background noise remains in the binary image. Sauvola et al. [39] alleviate this by assuming the foreground pixels to be closer to the background ones. On a similar note, Wolf et al. [49] normalize the contrast and the mean gray level of the neighbourhood to modify the threshold. The global-versus-local distinction is illustrated in the sketch below.
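The following minimal sketch, using scikit-image, contrasts a global (Otsu) threshold with a local (Sauvola) one; the input file name, window size and k value are illustrative placeholders, not settings from any of the cited works.

```python
# Sketch: global vs. local thresholding with scikit-image.
# "degraded_doc.png" is a hypothetical input file.
from skimage import io
from skimage.filters import threshold_otsu, threshold_sauvola

gray = io.imread("degraded_doc.png", as_gray=True)

# Global (Otsu): a single threshold estimated from the whole histogram;
# works well when the background is uniform.
binary_global = gray > threshold_otsu(gray)

# Local (Sauvola): a per-pixel threshold computed over a sliding window;
# more robust to uneven illumination and local degradation.
binary_local = gray > threshold_sauvola(gray, window_size=25, k=0.2)
```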

Apart from threshold-based techniques, non-threshold-based strategies have also been studied extensively in the literature. Some notable approaches include Markov Random Field (MRF) modeling of an input image, which minimizes a cost function by regarding the target binarized image as a binary MRF. Howe et al. [20] proposed an algorithm in which the cost function combines the Laplacian energy of the image intensity, for computing the local likelihood of foreground-background pixels, with Canny edge detection for detecting discontinuities. The cost function is minimized by graph cut computation. Howe’s method [20] is efficient and yields good results; however, it is parameter dependent. Howe et al. [19] improved on this method by adaptively tuning two parameters to yield better performance. Howe’s technique formed the basis of the winning algorithm proposed by Kliger and Tal in the DIBCO 2016 competition [35]. They combined Howe’s algorithm with a novel pre-processing step based on a linear transformation of the image onto a spherical surface, where concavities correspond to foreground in the original image. The concavities are estimated using the Hidden Point Removal operator [24], which outputs the probability of a pixel belonging to a concavity.

All these techniques perform well in the context to which they are applied, but they fail to generalize to binarizing arbitrary documents subjected to varying degrees of illumination, background noise and degradation. Recently, pixel-wise binarization approaches have been proposed in the literature, in which each pixel is classified as text or background. Pastor-Pellicer et al. [31] proposed a CNN framework consisting of two groups of convolution layers and a fully connected layer; each pixel is classified as text or background using a sliding window centred at the classified pixel. Such an approach has also been used for binarizing musical documents by Calvo-Zaragoza et al. [8]. These pixel-wise classification techniques have shown good performance; however, their most conspicuous drawbacks are that they are computationally very expensive, since they label each pixel in the document image, and that they classify each pixel independently, without exploiting contextual information in the pixel’s neighbourhood. To incorporate this contextual information, Afzal et al. [1] proposed a pixel-wise classification method that formulates the binarization procedure as a sequence learning problem. They use a 2D LSTM model which takes a 2D sequence of pixels as input and classifies each pixel as foreground or background. This achieved better results but still suffered from huge computational complexity. To alleviate this, Tensmeyer et al. [42] proposed a novel multi-scale fully convolutional network for document image binarization. Recently, Calvo-Zaragoza et al. [7] proposed a fully convolutional selectional auto-encoder model trained to learn a patch-wise mapping of the document image to its corresponding binarized version. This performs a fine-grained categorization in which each pixel gets a different activation value depending on whether its target label is text or background. Other convolutional approaches include the winning algorithm of the DIBCO 2017 competition [36], where the winning team used a U-Net encoder-decoder architecture for accurate pixel classification. Vo et al. [47] introduced a hierarchical deep supervised network for document binarization, which achieves state-of-the-art performance on several benchmark datasets. Westphal et al. [48] proposed a Grid LSTM network for binarization, yet it achieves lower performance than Vo’s method [47]. To learn the document degradation, He et al. [18] proposed an iterative fine-tuning technique that learns the mapping from a degraded input document image to the expected clean and uniform image, followed by a classifier that outputs the binarized image.

For the unsupervised image-to-image translation task, one of the first major works using deep networks is CycleGAN [51]. Following this work, there have been numerous attempts to design unsupervised or semi-supervised frameworks for different computer vision tasks, such as depth estimation [50] and image captioning [17, 10]. Most of these works use the popular cycle-consistency loss to learn an unsupervised mapping between two different domains. In contrast to all such works, Bhunia et al. [6] employ a Texture Augmentation Network (TANet) that superimposes the noisy appearance of the degraded document on the clean binary image to generate multiple degraded images of the same textual content with various noisy textures, and later utilize a Binarization Network (BiNet) to recover the clean version of the document image.

Fig. 1: Illustration of our proposed framework. The yellow highlighted region marks our contribution over Bhunia et al. [6]. We have two networks: ATANet takes a clean image $I_c$ and a degraded reference image $I_d$ as inputs and generates a noisy degraded image $I_n$. UDBNet, on the other hand, tries to recover the clean image from the generated noisy image $I_n$. Thus, $(I_c, I_n)$ acts as pseudo paired data to train the binarization network. We also feed the real degraded image $I_d$ into UDBNet to get the corresponding binarized image $G_B(I_d)$. Then, we concatenate the image pairs $(I_c, I_n)$ and $(G_B(I_d), I_d)$ and feed them into the joint discriminator to couple ATANet and UDBNet. This enforces UDBNet to generalize better to real degraded images.

III Proposed Framework

In this section, we first briefly present the binarization model proposed by Bhunia et al. [6] as the base model. Next, we describe the limitations of the base model. Finally, we introduce our novel unsupervised adversarial game and describe how the Adversarial Texture Augmentation Network (ATANet) and the Unsupervised Document Binarization Network (UDBNet) address these limitations efficiently.

III-A Background: Base Models

The base model consists of two networks: a Texture Augmentation Network (TANet) and a Binarization Network (BiNet). Let $I_c$ denote the binarized clean image sampled from the marginal distribution $P(I_c)$ and $I_d$ denote a degraded document image sampled from the marginal distribution $P(I_d)$. TANet tries to model $P(I_d \mid I_c)$, i.e., given a clean image, it tries to generate a degraded version of it while keeping the content the same. BiNet, on the other hand, tries to model $P(I_c \mid I_d)$, i.e., given a degraded image, it tries to generate the clean image. During inference, only BiNet is used.

Texture Augmentation Network: The TANet exploits a two-player GAN consisting of a generator network $G_T$ and a discriminator network $D_T$. The conditional distribution $P(I_d \mid I_c)$ is approximately modelled by $G_T$, where the noisy texture of the degraded image $I_d$ is superimposed on the clean image $I_c$ to generate a noisy version $I_n = G_T(I_c, I_d)$ of the clean image. Note that the content of $I_c$ and $I_n$ remains similar, and we do not use any paired data here. On the other side, the discriminator $D_T$ tries to discriminate between the output image $I_n$ and the degraded reference image $I_d$. The generator of the TANet uses a content encoder and a style encoder to explicitly encode the semantic content of $I_c$ and the noisy texture of $I_d$. Next, the two encoded features are concatenated to obtain a mixed feature representation. Finally, this mixed representation is passed through a decoder network that outputs the noisy generated image $I_n$. To ensure that the generated image contains the same textual content as the clean image $I_c$ and the same texture elements as the degraded image $I_d$, the TANet utilizes the following loss functions:

Adversarial loss: The objective of the adversarial loss is to constrain the output $I_n$ to be similar to the degraded reference image $I_d$. The adversarial loss is defined as:

$\mathcal{L}_{adv}^{T} = \mathbb{E}_{I_d \sim P(I_d)}[\log D_T(I_d)] + \mathbb{E}_{I_c \sim P(I_c),\, I_d \sim P(I_d)}[\log(1 - D_T(G_T(I_c, I_d)))]$   (1)

where the discriminator $D_T$ tries to discriminate the output image $I_n$ from the degraded reference image $I_d$.

Style loss: While the adversarial loss focuses on getting the overall structure of the generated image right, an additional style loss ensures successful transfer of the texture from the degraded reference image $I_d$ to the input binarized clean image $I_c$. For this purpose, Gram matrices [15, 14] are computed in the "conv1_1", "conv2_1", "conv3_1", "conv4_1" and "conv5_1" layers of the encoder networks. Mathematically,

$\mathcal{L}_{style} = \sum_{l} \frac{1}{N_l^2} \sum_{i,j} \big( \mathcal{G}_{ij}^{l}(I_n) - \mathcal{G}_{ij}^{l}(I_d) \big)^2, \quad \text{with } \mathcal{G}_{ij}^{l} = \sum_{k} F_{ik}^{l} F_{jk}^{l}$   (2)

where $F_{ik}^{l}$ is the activation of the $i$-th filter at position $k$ in layer $l$, the Gram matrix $\mathcal{G}_{ij}^{l}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$, and $N_l$ is the number of feature maps.
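As an illustration of Eq. (2), the PyTorch sketch below computes Gram matrices and sums the squared differences over a list of encoder activations; the normalisation constant and function names are our assumptions, not the paper's exact implementation.

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    # feat: (B, C, H, W) activations of one encoder layer.
    b, c, h, w = feat.size()
    f = feat.view(b, c, h * w)                      # vectorise feature maps
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(feats_gen, feats_ref):
    # Sum of squared Gram differences over the chosen layers
    # (e.g. conv1_1 ... conv5_1 of the encoder).
    return sum(torch.sum((gram_matrix(fg) - gram_matrix(fr)) ** 2)
               for fg, fr in zip(feats_gen, feats_ref))
```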

Content loss: To ensure the generated image $I_n$ contains the same textual content as the clean binarized image $I_c$, a content loss is defined as follows:

$\mathcal{L}_{content} = \| M \odot (I_n - I_c) \|_1$   (3)

Here, $M$ denotes a binary mask that has value 0 in the background and 1 in the text region, and $\odot$ denotes element-wise multiplication.
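A minimal sketch of this masked content loss, assuming the images and the mask are same-shaped PyTorch tensors; the normalisation by the number of elements is an illustrative choice.

```python
import torch

def content_loss(i_n: torch.Tensor, i_c: torch.Tensor,
                 mask: torch.Tensor) -> torch.Tensor:
    # L1 distance between generated noisy image I_n and clean image
    # I_c, restricted to text pixels (mask == 1).
    return torch.mean(mask * torch.abs(i_n - i_c))
```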

The overall objective of the TANet is defined as follows:

$\mathcal{L}_{TANet} = \mathcal{L}_{adv}^{T} + \lambda_s \mathcal{L}_{style} + \lambda_c \mathcal{L}_{content}$   (4)

where $\lambda_s$ and $\lambda_c$ are tunable hyper-parameters that balance the multiple objectives.

Binarization Network: Similar to TANet, BiNet exploits a two-player GAN and employs an image-to-image translation framework consisting of a generator $G_B$ and a discriminator $D_B$. While the generator of the BiNet tries to model $P(I_c \mid I_n)$, where $I_b = G_B(I_n)$ is the binarized clean version of the newly generated noisy image, the discriminator determines how good the generator is at generating binarized images. The adversarial loss of the BiNet is:

$\mathcal{L}_{adv}^{B} = \mathbb{E}_{I_c \sim P(I_c)}[\log D_B(I_c)] + \mathbb{E}_{I_n}[\log(1 - D_B(G_B(I_n)))]$   (5)

During training, for each input image $I_n$ there is a corresponding ground-truth image $I_c$. Thus, an additional $L_1$ loss is utilized to fully supervise the predicted binarization results along with the adversarial loss:

$\mathcal{L}_{L1} = \| I_c - G_B(I_n) \|_1$   (6)

While the pixel loss helps to preserve the content, the adversarial loss guides the network towards sharper output images when de-noising the input noisy image $I_n$.

The overall objective of the BiNet is as follows:

$\mathcal{L}_{BiNet} = \mathcal{L}_{adv}^{B} + \lambda \mathcal{L}_{L1}$   (7)

where $\lambda$ is a tunable hyper-parameter.
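The generator-side part of Eq. (7) can be sketched in PyTorch as below; the non-saturating BCE form of the adversarial term and all function names are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def binet_generator_loss(d_fake_logits: torch.Tensor,
                         pred: torch.Tensor,
                         target: torch.Tensor,
                         lam: float = 100.0) -> torch.Tensor:
    # Adversarial term: push the discriminator's logits on the
    # generated binarization towards the "real" label.
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    # L1 pixel term of Eq. (6), weighted by the hyper-parameter lambda.
    return adv + lam * F.l1_loss(pred, target)
```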

III-B Limitations of the Base Model

The limitations of the base model are given below:

Limitation 1. Although the Texture Augmentation Network (TANet) of the base model tries to model $P(I_d \mid I_c)$ and generates noisy images, it is completely unaware of the content on which it is conditioned. Thus, the corresponding discriminator cannot verify whether the content of the generated noisy image remains consistent.

Limitation 2. For an unpaired real degraded noisy image, the corresponding ground-truth binarized clean image is absent. The absence of ground-truth images limits the scope of the Binarization Network (BiNet) of the base model. Firstly, the BiNet cannot be trained with real degraded noisy images, as the $L_1$ loss cannot be utilized. Secondly, since the TANet and the BiNet model two different uncorrelated conditional distributions and are trained separately, these models are prone to overfitting. Finally, there exists no performance quantifier that validates the performance of the BiNet on real degraded noisy images.

Limitation 3. There exists a gap between the generated noisy image distribution and the real degraded noisy image distribution. As the Binarization Network (BiNet) is trained entirely on generated noisy images, it has a dataset bias towards them, and a model trained on generated data can hardly perform well on real data. This problem is quite similar to the domain-shift problem [38]. However, the base model does not use any formulation or other technique to minimize this generated-real domain shift in the context of document image binarization.

III-C Adversarial Game

In our unsupervised setup, we do not have any paired training data. Thus, we have no access to the real joint distribution of clean and degraded images, $P(I_c, I_d)$. However, we can approximate this real distribution through the texture augmentation network and the binarization network. The joint distribution can be factorized in two ways, namely

$P(I_c, I_d) = P(I_c)\, P(I_d \mid I_c) = P(I_d)\, P(I_c \mid I_d)$   (8)

Please note that $P(I_d \mid I_c)$ is modelled by the texture augmentation network and $P(I_c \mid I_d)$ is modelled by the binarization network. A sample from the first factorization, denoted $P_A(I_c, I_d)$, is obtained when the generated noisy image $I_n = G_A(I_c, I_d)$ from the Adversarial Texture Augmentation Network is concatenated with its corresponding input clean image $I_c$. On the other side, when we pass a real degraded image $I_d$ through the binarization network to get a corresponding clean image $G_B(I_d)$, we constitute a sample from the second factorization, denoted $P_B(I_c, I_d)$, by concatenating them together. Thus, the texture augmentation network plays a role in modelling $P_A$ and the binarization network plays a role in modelling $P_B$; both are approximated joint distributions of clean and degraded images. If we can properly align these two approximated joint distributions $P_A$ and $P_B$, they will closely align with the real joint distribution. In order to align $P_A$ and $P_B$, we propose a joint discriminator $D_J$ that distinguishes whether an input sample comes from the distribution $P_A$ or $P_B$. The objective of the Adversarial Texture Augmentation Network (ATANet) and the Unsupervised Document Binarization Network (UDBNet) is to fool this discriminator such that it cannot distinguish whether the input sample is from $P_A$ or $P_B$. Thus, the distributions $P_A$ and $P_B$ get aligned over time.
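One plausible realisation of such a joint discriminator is sketched below: the two images of a pair are concatenated along the channel axis and scored by a small convolutional network. The layer widths, depths and patch-level output are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class JointDiscriminator(nn.Module):
    """Scores a (clean, degraded) image pair; used to tell apart
    samples of P_A from samples of P_B."""
    def __init__(self, in_ch: int = 2):  # two single-channel images
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # patch logits
        )

    def forward(self, clean: torch.Tensor, degraded: torch.Tensor):
        # Channel-wise concatenation of the pair, as in Fig. 1.
        return self.net(torch.cat([clean, degraded], dim=1))
```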

Adversarial Texture Augmentation Network: The ATANet consists of three components: 1) a generator $G_A$ that characterizes the conditional distribution $P(I_d \mid I_c)$ and generates the noisy image $I_n$; 2) a discriminator $D_A$ that discriminates the output image $I_n$ from the degraded reference image $I_d$; 3) a joint discriminator $D_J$ that distinguishes whether a pair of data comes from $P_A$ or $P_B$.

Similar to our base model, the generator consists of a content encoder and a style encoder, into which we pass the clean image $I_c$ and the degraded reference image $I_d$ as inputs, respectively. The latent representations after encoding the images are simply concatenated and fed into the decoder. The architecture of the decoder is symmetrical to the encoder, with skip connections between the layers of the content encoder and the decoder, as in our base model. The discriminator $D_A$ tries to discriminate between the generated image $I_n$ from the generator and the degraded reference image $I_d$. The pseudo clean-generated image pair $(I_c, I_n)$ is fed into the joint discriminator $D_J$, which tries to distinguish whether the input sample is from distribution $P_A$ or $P_B$.

In this game, a clean-degraded image pair $(I_c, I_d)$ is sampled from the distributions $P(I_c)$ and $P(I_d)$, and the generator produces a pseudo generated noisy image $I_n$ given $(I_c, I_d)$, following the conditional distribution $P(I_d \mid I_c)$. Hence, the pseudo clean-generated image pair $(I_c, I_n)$ is a sample from the joint distribution $P_A(I_c, I_d)$.

To obtain realistic noisy, degraded document images having texture appearance similar to the degraded reference image $I_d$ and textual content similar to the clean image $I_c$, we define the adversarial loss of our ATANet as:

$\mathcal{L}_{adv}^{A} = \mathbb{E}_{I_d}[\log D_A(I_d)] + \mathbb{E}_{I_c, I_d}[\log(1 - D_A(G_A(I_c, I_d)))] + \mathbb{E}_{I_c, I_d}[\log(1 - D_J(I_c, G_A(I_c, I_d)))] + \mathbb{E}_{I_d}[\log D_J(G_B(I_d), I_d)]$   (9)

where $G_B$ is the generator of the Unsupervised Document Binarization Network. The adversarial loss is trained in a flip-flop fashion. The game ends when the distributions $P_A$ and $P_B$ reach equilibrium and get aligned over time.

Therefore, the overall objective of the ATANet is defined as:

$\mathcal{L}_{ATANet} = \mathcal{L}_{adv}^{A} + \lambda_s \mathcal{L}_{style} + \lambda_c \mathcal{L}_{content}$   (10)

where $\mathcal{L}_{style}$ and $\mathcal{L}_{content}$ are the style loss and content loss as in our base model, and $\lambda_s$ and $\lambda_c$ are tunable hyper-parameters that balance the multiple objectives.

Unsupervised Document Binarization Network: Similar to ATANet, UDBNet consists of three components: 1) a generator $G_B$ that characterizes the conditional distribution $P(I_c \mid I_d)$ and generates the binarized clean images $I_b$ and $G_B(I_d)$ corresponding to $I_n$ and $I_d$, respectively; 2) a discriminator $D_B$ that determines how good the generator is at generating binarized images; 3) a joint discriminator $D_J$ that distinguishes whether a pair of data comes from distribution $P_A$ or $P_B$.

We use similar network architectures for the generator and the discriminator as in the base model. The generated noisy image $I_n$ from the ATANet is fed into the generator of the UDBNet, which generates the binarized image $I_b$. The discriminator $D_B$ tries to discriminate between the binarized image $I_b$ and the original clean image $I_c$. The pseudo degraded-clean image pair $(G_B(I_d), I_d)$ is fed into the joint discriminator $D_J$, which tries to distinguish whether the input sample is from distribution $P_A$ or $P_B$.

In this game, a real degraded image $I_d$ is sampled from the distribution $P(I_d)$, and the generator produces a pseudo binarized clean image $G_B(I_d)$ given $I_d$, following the conditional distribution $P(I_c \mid I_d)$. Hence, the pseudo degraded-clean image pair $(G_B(I_d), I_d)$ is a sample from the joint distribution $P_B(I_c, I_d)$.

To attain a proper binarized image, we define the adversarial loss of our UDBNet as:

$\mathcal{L}_{adv}^{U} = \mathbb{E}_{I_c}[\log D_B(I_c)] + \mathbb{E}_{I_c, I_d}[\log(1 - D_B(G_B(G_A(I_c, I_d))))] + \mathbb{E}_{I_c, I_d}[\log D_J(I_c, G_A(I_c, I_d))] + \mathbb{E}_{I_d}[\log(1 - D_J(G_B(I_d), I_d))]$   (11)

where $G_A$ is the generator of the Adversarial Texture Augmentation Network. The adversarial loss is trained in a flip-flop fashion. Therefore, the overall objective of the UDBNet is defined as:

$\mathcal{L}_{UDBNet} = \mathcal{L}_{adv}^{U} + \lambda \mathcal{L}_{L1}$   (12)

where $\mathcal{L}_{L1}$ is the $L_1$ loss and $\lambda$ is a tunable hyper-parameter.

III-D Training the Joint Discriminator via Flipped Labels

Similar to the Texture Augmentation Network (TANet) of the base model, we feed the clean image $I_c$ and the degraded image $I_d$ into the content encoder and the style encoder, respectively. The encoded representations are simply concatenated and passed through the decoder, which outputs the noisy version $I_n$ of the clean image. The degraded reference image $I_d$ is then fed into the generator of the Unsupervised Document Binarization Network (UDBNet), which generates the binarized version $G_B(I_d)$ of the image. To feed the joint discriminator $D_J$, we perform a simple concatenation of the input pairs $(I_c, I_n)$ and $(G_B(I_d), I_d)$. Both the Adversarial Texture Augmentation Network (ATANet) and the Unsupervised Document Binarization Network (UDBNet) try to fool the joint discriminator such that it cannot discriminate whether the input sample is from distribution $P_A$ or $P_B$. This enforces the UDBNet to generalize better to real degraded images, although we are not using any paired training data. The joint discriminator is trained using flipped labels, as utilized in [45]; one plausible update step is sketched below.
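The sketch below shows one possible discriminator-side update; the BCE formulation, label assignment and function names are assumptions, and it is this label assignment that is flipped when updating the two generators.

```python
import torch
import torch.nn.functional as F

def joint_disc_step(d_joint, pair_ata, pair_udb, optimizer):
    """One update of the joint discriminator D_J.
    pair_ata: (I_c, I_n) sample from P_A (ATANet side).
    pair_udb: (G_B(I_d), I_d) sample from P_B (UDBNet side)."""
    optimizer.zero_grad()
    logits_a = d_joint(*pair_ata)
    logits_b = d_joint(*pair_udb)
    # D_J tries to separate the two joint distributions; for the
    # generator updates these target labels are flipped, as in [45].
    loss = (F.binary_cross_entropy_with_logits(
                logits_a, torch.ones_like(logits_a)) +
            F.binary_cross_entropy_with_logits(
                logits_b, torch.zeros_like(logits_b)))
    loss.backward()
    optimizer.step()
    return loss.item()
```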

IV Experiment

IV-A Datasets

The experiments are conducted on the publicly available DIBCO datasets [33]. We train our model on the DIBCO 2009 [13], DIBCO 2013 [34], H-DIBCO 2012 [37] and H-DIBCO 2014 [28] datasets. On the other hand, the challenging historical datasets H-DIBCO 2016 [35] and DIBCO 2011 [40] are selected for evaluation purposes. We resize the images from these datasets into fixed-size patches before feeding them to our model. To evaluate the performance of our method, we adopt four evaluation metrics: F-measure, pseudo F-measure ($F_{ps}$), distance reciprocal distortion metric (DRD), and peak signal-to-noise ratio (PSNR). Similar to [6], we augment the training patches by rotating them by 90, 180 and 270 degrees, as sketched below.
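This rotation augmentation can be written compactly with torch.rot90 (a sketch, assuming patches are tensors whose last two dimensions are spatial):

```python
import torch

def rotation_augment(patch: torch.Tensor) -> list:
    # Returns the patch plus its 90/180/270-degree rotations.
    return [torch.rot90(patch, k, dims=(-2, -1)) for k in range(4)]
```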

IV-B Implementation Details

We have implemented the entire model in PyTorch [32], and the experiments were run on a server with an Nvidia Titan X GPU with 12 GB of memory. We adopt a step-wise training protocol. At first, we train ATANet for 15 epochs and generate the noisy versions of the clean images. After that, we freeze the network so that its weights do not change. Next, we train UDBNet on the generated noisy images for 20 epochs to produce the corresponding binary clean images. Then, we unfreeze the ATANet. In the next stage, we jointly train both networks along with the joint discriminator for several more epochs in the flip-flop fashion (the freeze/unfreeze mechanism is sketched below). At last, we fine-tune the model for 30 epochs. During the coupled training, ATANet tries to generate more challenging adversarial samples, which are used as pseudo image pairs for training the UDBNet. Thus, the training procedure helps the model to learn various degradations, including aging effects, noise etc. During training, we use the Adam optimizer, and we set the three loss-balancing hyper-parameters to 0.5, 10 and 100 throughout the experiments.
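A minimal sketch of the freeze/unfreeze step used in this protocol (the network variable names are illustrative):

```python
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    # Freeze (flag=False) or unfreeze (flag=True) all parameters.
    for p in module.parameters():
        p.requires_grad = flag

# Step-wise protocol, schematically:
# 1) train ATANet alone
# set_requires_grad(atanet, False)   # 2) freeze it, train UDBNet
# set_requires_grad(atanet, True)    # 3) unfreeze for joint training
```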

Methods      F-Measure  $F_{ps}$  PSNR  DRD
UDBNet-CL    92.7       95.8      19.9  2.6
UDBNet-GRL   93.2       96.0      20.1  2.4
Ours         93.4       96.2      20.1  2.2
TABLE I: Comparison of our method with the baseline methods on H-DIBCO 2016

IV-C Baseline Methods

In this section, we present two alternative baselines to justify the effectiveness of our method:

UDBNet-CL: The joint discriminator exploits the domain confusion loss [12] to address the limitations described in Section III-B. The domain confusion loss gives equal importance to ATANet and UDBNet.

UDBNet-GRL: The Gradient Reversal Layer [11] ensures that the adversarial discriminator views the two domains identically. Here, the joint discriminator utilizes the Gradient Reversal Layer; a standard implementation is sketched below.
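For reference, a common PyTorch realisation of the Gradient Reversal Layer of [11] is an identity in the forward pass with a negated, scaled gradient in the backward pass:

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, lam: float = 1.0):
        ctx.lam = lam
        return x.view_as(x)          # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing to the features.
        return -ctx.lam * grad_output, None

def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lam)
```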

Fig. 2: Comparison of the qualitative results of predicted binarized images by Bhunia et al. [6] and our framework on the evaluation set

IV-D Performance Analysis

From Table I, we observe that UDBNet-CL achieves improvements of 1.3 and 0.4 in F-Measure over DeepOtsu [18] and Bhunia [6], respectively, on the H-DIBCO 2016 [35] dataset. On the other hand, UDBNet-GRL performs better than UDBNet-CL, with improvements of 1.8 and 0.9 in F-Measure over DeepOtsu [18] and Bhunia [6] on H-DIBCO 2016 [35]. Our method outperforms all previous state-of-the-art methods because DeepOtsu [18] only uses stacked refinement blocks and Bhunia [6] simply generates uncontrolled synthetic noisy image samples for training. In contrast, our ATANet generates realistic degraded images, including hard samples, and also guides UDBNet to adapt to the real noise distribution, as depicted in Figures 2 and 3. However, out of the three proposed approaches including the baselines, the flipped-label approach (Ours) is found to be the best on all four evaluation criteria because of its learning strategy. From Table II, it is evident that our method achieves significant improvements of 2.0, 1.9 and 0.5 over DeepOtsu [18], and of 1.1, 0.8 and 0.2 over Bhunia [6], in F-Measure, $F_{ps}$ and PSNR on the H-DIBCO 2016 [35] dataset. Also, our method improves over Vo [47] by 1.9, 1.5 and 0.3 in F-Measure, $F_{ps}$ and PSNR on the DIBCO 2011 dataset. In both cases, the low DRD value of our method implies robustness with regard to visual distortion.

Fig. 3: Binarization results on real test images by passing through UDBNet.
Methods          H-DIBCO 2016 Dataset                   DIBCO 2011 Dataset
                 F-Measure  $F_{ps}$  PSNR  DRD         F-Measure  $F_{ps}$  PSNR  DRD
Otsu [29]        86.6       89.9      17.8  5.6         82.1       84.8      15.7  9.0
Sauvola [39]     84.6       88.4      17.1  6.3         82.1       87.7      15.6  8.5
Howe [19]        87.5       92.3      18.1  5.4         91.7       92.0      19.3  3.4
Su [41]          84.8       88.9      17.6  5.6         87.8       90.0      17.6  4.8
Jia [22]         90.5       93.3      19.3  3.9         91.9       95.1      19.0  2.6
Vo [46]          87.3       90.5      17.5  4.4         88.2       90.3      20.1  2.9
Vo [47]          90.1       93.6      19.0  3.5         93.3       96.4      20.1  2.0
Westphal [48]    88.8       92.5      18.4  3.9         -          -         -     -
DeepOtsu [18]    91.4       94.3      19.6  2.9         93.4       95.8      19.9  1.9
Bhunia [6]       92.3       95.4      19.9  2.7         93.7       96.8      20.1  1.8
Ours             93.4       96.2      20.1  2.2         95.2       97.9      20.4  1.5
TABLE II: Quantitative results on the H-DIBCO 2016 and DIBCO 2011 datasets

V Conclusion

In this paper, we have proposed a novel approach towards document binarization by introducing a three-player min-max adversarial game. We introduce a joint discriminator that tries to couple the Adversarial Texture Augmentation Network (ATANet) and the Unsupervised Document Binarization Network (UDBNet), so that the model can tackle the dataset-bias problem and perform well on real degraded document images. The proposed framework is simple and easy to implement. We demonstrate the effectiveness of our system by conducting experiments on the publicly available DIBCO datasets. The experimental results show the superiority of our proposed model over the existing methods.

References

  • [1] M. Z. Afzal, J. Pastor-Pellicer, F. Shafait, T. M. Breuel, A. Dengel, and M. Liwicki (2015) Document image binarization using lstm: a sequence learning approach. In HIP, Cited by: §II.
  • [2] S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri, and D. K. Basu (2009) A hierarchical approach to recognition of handwritten bangla characters. Pattern Recognition. Cited by: §I.
  • [3] J. Bernsen (1986) Dynamic thresholding of gray-level images. In ICPR, Cited by: §II.
  • [4] S. Bhowmik, S. Malakar, R. Sarkar, and M. Nasipuri (2014) Handwritten bangla word recognition using elliptical features. In CICN, Cited by: §I.
  • [5] S. Bhowmik, S. Polley, M. G. Roushan, S. Malakar, R. Sarkar, and M. Nasipuri (2015) A holistic word recognition technique for handwritten bangla words. International Journal of Applied Pattern Recognition. Cited by: §I.
  • [6] A. K. Bhunia, A. K. Bhunia, A. Sain, and P. P. Roy (2019) Improving document binarization via adversarial noise-texture augmentation. In ICIP, Cited by: §I, Fig. 1, §II, §III, Fig. 2, §IV-A, §IV-D, TABLE II.
  • [7] J. Calvo-Zaragoza and A. Gallego (2019) A selectional auto-encoder approach for document image binarization. Pattern Recognition. Cited by: §II.
  • [8] J. Calvo-Zaragoza, G. Vigliensoni, and I. Fujinaga (2017) Pixel-wise binarization of musical documents with convolutional neural networks. In MVA, Cited by: §II.
  • [9] Y. Chen and L. Wang (2017) Broken and degraded document images binarization. Neurocomputing. Cited by: §I.
  • [10] Y. Feng, L. Ma, W. Liu, and J. Luo (2019) Unsupervised image captioning. In CVPR, Cited by: §II.
  • [11] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In ICML, Cited by: §IV-C.
  • [12] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research. Cited by: §IV-C.
  • [13] B. Gatos, K. Ntirogiannis, and I. Pratikakis (2009) ICDAR 2009 document image binarization contest (dibco 2009). In ICDAR, Cited by: §IV-A.
  • [14] L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In CVPR, Cited by: §III-A.
  • [15] L. Gatys, A.S. Ecker, and M. Bethge (2015) Texture synthesis using convolutional neural networks. In NIPS, Cited by: §III-A.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §I.
  • [17] J. Gu, S. Joty, J. Cai, H. Zhao, X. Yang, and G. Wang (2019) Unpaired image captioning via scene graph alignments. arXiv preprint arXiv:1903.10658. Cited by: §II.
  • [18] S. He and L. Schomaker (2019) DeepOtsu: document enhancement and binarization using iterative deep learning. Pattern Recognition. Cited by: §I, §II, §IV-D, TABLE II.
  • [19] N.R. Howe (2013) Document binarization with automatic parameter tuning. IJDAR. Cited by: §II, TABLE II.
  • [20] N. R. Howe (2011) A laplacian energy for document binarization. In ICDAR, Cited by: §II.
  • [21] F. Jia, C. Shi, K. He, C. Wang, and B. Xiao (2016) Document image binarization using structural symmetry of strokes. In ICFHR, Cited by: §I.
  • [22] F. Jia, C. Shi, K. He, C. Wang, and B. Xiao (2018) Degraded document image binarization using structural symmetry of strokes. Pattern Recognition. Cited by: TABLE II.
  • [23] J.N. Kapur, P.K. Sahoo, and A.K.C. Wong (1985) A new method for gray-level picture thresholding using the entropy of the histogram. Computer vision, graphics, and image processing. Cited by: §II.
  • [24] S. Katz, A. Tal, and R. Basri (2007) Direct visibility of point sets. In TOG, Cited by: §II.
  • [25] J. Kittler and J. Illingworth (1986) Minimum error thresholding. Pattern recognition. Cited by: §II.
  • [26] Y. Lai and P. L. Rosin (2014) Efficient circular thresholding. IEEE Transactions on Image Processing. Cited by: §II.
  • [27] W. Niblack (1985) An introduction to digital image processing. Strandberg Publishing Company. Cited by: §II.
  • [28] K. Ntirogiannis, B. Gatos, and I. Pratikakis (2014) ICFHR2014 competition on handwritten document image binarization (h-dibco 2014). In ICFHR, Cited by: §IV-A.
  • [29] N. Otsu (1979) A threshold selection method from gray-level histograms. IEEE transactions on systems, man, and cybernetics. Cited by: §I, §II, TABLE II.
  • [30] N. Papamarkos (2001) A technique for fuzzy document binarization. In DocEng, Cited by: §II.
  • [31] J. Pastor-Pellicer, S. España-Boquera, F. Zamora-Martínez, M. Z. Afzal, and M. J. Castro-Bleda (2015) Insights on the use of convolutional neural networks for document image binarization. In ICANN, Cited by: §II.
  • [32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §IV-B.
  • [33] I. Pratikakis, B. Gatos, and K. Ntirogiannis (2011) ICDAR 2011 document image binarization contest (dibco 2011). In ICDAR, Cited by: §IV-A.
  • [34] I. Pratikakis, B. Gatos, and K. Ntirogiannis (2013) ICDAR 2013 document image binarization contest (dibco 2013). In ICDAR, Cited by: §IV-A.
  • [35] I. Pratikakis, K. Zagoris, G. Barlas, and B. Gatos (2016) ICFHR2016 handwritten document image binarization contest (h-dibco 2016). In ICFHR, Cited by: §II, §IV-A, §IV-D.
  • [36] I. Pratikakis, K. Zagoris, G. Barlas, and B. Gatos (2017) ICDAR2017 competition on document image binarization (dibco 2017). In ICDAR, Cited by: §II.
  • [37] I. Pratikakis, B. Gatos, and K.Ntirogiannis (2012) ICFHR 2012 competition on handwritten document image binarization. In ICFHR, Cited by: §IV-A.
  • [38] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. Lawrence (2008) Covariate shift and local learning by distribution matching. MIT Press. Cited by: §III-B.
  • [39] J. Sauvola and M. Pietikäinen (2000) Adaptive document image binarization. Pattern recognition. Cited by: §II, TABLE II.
  • [40] A. Shahab, F. Shafait, and A. Dengel (2011) ICDAR 2011 robust reading competition challenge 2: reading text in scene images. In ICDAR, Cited by: §IV-A.
  • [41] B. Su, S. Lu, and C. L. Tan (2010) Binarization of historical document images using the local maximum and minimum. In DAS, Cited by: TABLE II.
  • [42] C. Tensmeyer and T. Martinez (2017) Document image binarization with fully convolutional neural networks. In ICDAR, Cited by: §II.
  • [43] T. A. Tran, K. Oh, I. Na, G. Lee, H. Yang, and S. Kim (2017) A robust system for document layout analysis using multilevel homogeneity structure. Expert Systems With Applications. Cited by: §I.
  • [44] W. Tsai et al. (1985) Moment-preserving thresholding-a new approach. Computer Vision Graphics and Image Processing. Cited by: §II.
  • [45] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In CVPR, Cited by: §III-D.
  • [46] G. D. Vo and C. Park (2018) Robust regression for image binarization under heavy noise and nonuniform background. Pattern Recognition. Cited by: TABLE II.
  • [47] Q.N. Vo, S.H. Kim, H.J. Yang, and G. Lee (2018) Binarization of degraded document images based on hierarchical deep supervised network. Pattern Recognition. Cited by: §I, §II, §IV-D, TABLE II.
  • [48] F. Westphal, N. Lavesson, and H. Grahn (2018) Document image binarization using recurrent neural networks. In DAS, Cited by: §I, §II, TABLE II.
  • [49] C. Wolf, J. Jolion, and F. Chassaing (2002) Text localization, enhancement and binarization in multimedia documents. In Object recognition supported by user interaction for service robots, Cited by: §II.
  • [50] C. Zheng, T. Cham, and J. Cai (2018) T2net: synthetic-to-realistic translation for solving single-image depth estimation tasks. In ECCV, Cited by: §II.
  • [51] J. Zhu, T. Park, P. Isola, and A.A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint. Cited by: §II.