Scene Text Eraser

05/08/2017 ∙ by Toshiki Nakamura, et al. ∙ Kyushu University ∙ University of Electro-Communications

Character information in natural scene images often contains personal information, such as telephone numbers and home addresses. Publishing such images carries a high risk of leaking this information. In this paper, we propose a scene text erasing method that hides the information via an inpainting convolutional neural network (CNN) model. The input is a scene text image, and the expected output is a text-erased image in which all character regions are filled with the colors of the surrounding background pixels. This is accomplished by a CNN model that goes from convolution to deconvolution with interconnections between the two. The training samples and their corresponding inpainted images serve as teaching signals for training. To evaluate the text erasing performance, a novel scene text detection method is run on the output images, and the standard text detection measures are computed on the ICDAR 2013 benchmark dataset. Compared with text detection on the original images, detection after scene text erasing shows a drastic decrease in precision, recall and f-score, which demonstrates the effectiveness of the proposed method for erasing text in natural scene images.




I Introduction

Nowadays, personal private information such as telephone numbers, ID numbers, home addresses and car numbers [1] serves as a person's unique identity. Such information may be captured incidentally and appear in natural scene images. Once published on the internet, it is at high risk of being collected automatically by machines and criminals for illegal use. To prevent the leakage of personal information, especially text in scene images, information hiding technology is in great demand. Unlike scene text detection [2], text hiding techniques do not need to extract whole text lines, so perfect detection accuracy is not required. Only the characters, or parts of them, are removed, and the processed parts should not stand out from the background. An example is shown in Fig. 1.

Fig. 1: Hide text in scene images.

The goal is to erase the text regions and make them hard to detect. Simple image processing such as blurring with a Gaussian filter [5] is only effective for text with specific shapes and strokes. However, scene text varies widely in appearance [3], including color, font, size and orientation. In addition, background clutter affects the text/non-text judgement. These challenges make the task difficult. In this paper, we propose a novel method that erases scene text via an inpainting deep neural network (DNN).

The problem is cast as image transformation, that is, transforming images from a source image space to a target image space. In our case, the method needs only the input images, and the outputs are text-erased images in which the non-text regions remain as in the original. The inpainting DNN acts as the eraser. It is composed of convolutional neural network (CNN) layers at the front, followed by deconvolutional neural network (DeCNN) layers [4] that recover the image resolution. The CNN is used to represent the features of the image [6]. If only the top-level features are used for the transformation, some details may be lost. To tackle this problem, interconnections are built between deconvolutional layers and convolutional layers of the same size, and the combined result is fed to the next deconvolutional layer. The model is trained in an end-to-end fashion. We use inpainting [7] and dilation to obtain the ground truth for training.

A text detection method [8] is then applied to the text-erased images, and its performance is evaluated in the standard manner [9] in terms of precision, recall and f-score. Compared with the detection results on the original ICDAR 2013 images, all measures decrease drastically on the text-erased images, which demonstrates the effectiveness of the proposed method.

The major contributions of our work are as follows:

  • We propose a scene text eraser that hides text in images without text detection. The concept is novel for preventing the leakage of private text-based information, and the eraser removes text naturally and effectively.

  • The scene text eraser is implemented as an image transformation. The convolution-to-deconvolution structure adds a summation process between layers for better image quality.

  • Dilation and inpainting are applied to label the ground truth automatically and accurately.

The rest of the paper is structured as follows: A selection of related work is reviewed in Sect. II. Sect. III presents our proposed method in detail. In Sect. IV, we give the experimental results, including the details of the datasets and the experimental setup. Finally, Sect. V summarizes and concludes the paper.

II Related work

Two strategies can be used for scene text erasing. One follows text detection pipelines [10, 11], which extract the text regions and then erase them in a post-processing step. The other follows the idea of image transformation [12, 13], which treats the output image as a different style in which the text is removed and the other parts remain as in the original.

Generally, text detection methods detect text through either a connected component analysis (CCA)-based procedure or a sliding window-based procedure. CCA-based methods [14, 15] involve character candidate extraction, character/non-character classification, and text grouping. Sliding window-based methods [16, 17] extract regional textual features, such as HoG, LBP [18] and CNN features, from regions scanned across the image at multiple scales and aspect ratios, and then score the regions by feeding the features to a pre-trained text/non-text classifier. Regions with high text scores are eventually grouped into text regions. Image pre-processing or post-processing techniques are sometimes added to the two pipelines. For text erasing, further processing is required, for instance to fill the text regions with the background color.

In recent years, many classic problems have been framed as image transformation tasks [13], where a system receives some input image and transforms it into an output image. Examples from image processing include denoising, super-resolution, and colorization, where the input is a degraded image (noisy, low-resolution, or grayscale) and the output is a high-quality color image. Examples from computer vision include semantic segmentation and depth estimation, where the input is a color image and the output image encodes semantic or geometric information about the scene. The related algorithms either transfer the tone (color, contrast, saturation, etc.) of an image while preserving its patterns and details, or uniformly distort the texture of an image to create a style. Scene text erasing can also be treated as style transfer. Owing to the richness of the features a deep CNN can represent, such tasks are typically solved by training a feedforward DNN in a supervised manner. Examples include Ref. [19], which automatically converts complex rough sketches to line drawings, Ref. [20], which converts the image style, and Ref. [21], which colorizes black-and-white images. In this paper, we use image transformation to hide the characters in an image with a DNN of a special structure.

III Proposed method

The flowchart of the proposed method is shown in Fig. 2. Since the purpose of text erasing differs from that of accurate text detection, a single-scale sliding window is applied to the original input images, with a stride of half the window size. We cut the whole image into 64×64 patches and feed them into a pre-trained DNN. Each output patch is also 64×64. To avoid ambiguity in the overlapping regions, only the central 32×32 pixels of each output are considered valid and are put back at their original location. After this process, a single text-hidden image is generated.
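The patch-wise procedure described above can be sketched roughly as follows. This is a minimal illustration only: `erase_patch` is a hypothetical stand-in for the trained DNN, the image is a plain list of pixel rows, and the handling of the outer 16-pixel border (left unchanged here) is an assumption, since the text does not say how borders are treated.

```python
WINDOW = 64                      # patch size fed to the network
STRIDE = WINDOW // 2             # sliding stride is half the window size
MARGIN = (WINDOW - STRIDE) // 2  # 16: offset of the valid 32x32 centre


def window_origins(length):
    """Top-left coordinates of the 64-pixel windows along one axis."""
    return list(range(0, length - WINDOW + 1, STRIDE))


def erase_image(image, erase_patch):
    """Run erase_patch on every window and paste back only the centres.

    image is a list of rows; erase_patch (hypothetical) maps a 64x64
    patch to a 64x64 patch, standing in for the trained network.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]  # border pixels stay as in the input
    for top in window_origins(height):
        for left in window_origins(width):
            patch = [row[left:left + WINDOW] for row in image[top:top + WINDOW]]
            out = erase_patch(patch)
            # Only the central 32x32 block of the output is treated as valid.
            for dy in range(MARGIN, MARGIN + STRIDE):
                for dx in range(MARGIN, MARGIN + STRIDE):
                    result[top + dy][left + dx] = out[dy][dx]
    return result
```

Because the stride equals the size of the valid centre, the pasted 32×32 blocks tile the image interior without gaps or overlaps.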

Fig. 2: The proposed method for scene text erasing.

III-A The structure of the scene text eraser

A feedforward DNN composed of a convolution half and a deconvolution half is used as the eraser in our approach. The architecture of the DNN is shown in Fig. 3.

Fig. 3: The architecture of DNN in our proposed method.

The convolution part contains four convolutional layers. The filter size of each convolutional layer is 4×4, and the stride and padding are set to 2 and 1, respectively. Therefore, each layer halves the size of the feature maps relative to the previous layer. The deconvolution part has the same structure but replaces the convolutional layers with deconvolutional layers, with exactly the same filter size, stride and padding as in the convolution part. Thus, each deconvolutional layer doubles the size of the feature maps. Because the reduction by convolution is undone by the enlargement by deconvolution, the output image has the same size as the original image.
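The halving and doubling of the feature maps can be checked with the standard output-size formulas for convolution and transposed convolution. The sketch below assumes a 64×64 input patch and four deconvolutional layers mirroring the four convolutional ones; the function names are illustrative only.

```python
KERNEL, STRIDE, PAD = 4, 2, 1  # filter size, stride and padding from the text


def conv_out(size):
    """Spatial size after one convolution layer."""
    return (size + 2 * PAD - KERNEL) // STRIDE + 1


def deconv_out(size):
    """Spatial size after one deconvolution (transposed convolution) layer."""
    return (size - 1) * STRIDE - 2 * PAD + KERNEL


def trace_sizes(size=64, layers=4):
    """Feature-map side length after each layer of the two halves."""
    sizes = [size]
    for _ in range(layers):
        sizes.append(conv_out(sizes[-1]))
    for _ in range(layers):
        sizes.append(deconv_out(sizes[-1]))
    return sizes


print(trace_sizes())  # [64, 32, 16, 8, 4, 8, 16, 32, 64]
```

Each deconvolution output has the same side length as one of the convolution outputs, which is what makes an element-wise sum between the two halves possible.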

However, with a purely linear structure, in which the size reduction and enlargement operations are performed in isolation, much of the information in the original image may be lost. In the convolution part, only part of the information in the input image is retained as the feature maps shrink, and in the deconvolution part, only this retained information is available to recover the image content. This results in information loss and low resolution of the output image.

To tackle this problem, we use the skip connection technique [22], which is effective for restoring images with less deterioration. A skip connection sums feature maps from different layers and feeds the result to the next layer. The feature maps in the convolutional layers carry more detailed information, such as the positions of objects. By adding these earlier features during image recovery, it is possible to restore some of the image information lost in the reduction and enlargement procedure, which is expected to prevent the resolution from being lowered.

As shown in Fig. 3, the skip connection is implemented by adding a summation layer after each deconvolutional layer. As expressed in Eq. 1, the features of the deconvolutional layer and those of the convolutional layer are added element-wise and then fed to the next deconvolutional layer. The summation layer requires its inputs from the different layers to have the same size, so the convolution-to-deconvolution structure is symmetric.
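The body of Eq. 1 is not reproduced in this copy of the text. One form consistent with the description, written in assumed notation that is not the authors' own, would be the following, where C_k is the output of the k-th convolutional layer, D_l is the output of the l-th deconvolutional layer, L = 4 is the number of layers in each half, and the sum is fed to deconvolutional layer l + 1:

```latex
% Assumed notation (not from the original paper):
%   C_k : output of the k-th convolutional layer
%   D_l : output of the l-th deconvolutional layer
%   L   : number of layers in each half (L = 4 here)
\hat{D}_l = D_l + C_{L-l}, \qquad l = 1, \dots, L-1
```

With 64×64 input patches, D_1 and C_3 are both 8×8, D_2 and C_2 are both 16×16, and D_3 and C_1 are both 32×32, so each sum is well defined. Whether the last deconvolution output is also summed with the input image cannot be determined from the text.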


A rectified linear unit (ReLU) [24] follows each layer, and normalization is performed as well, so the output of each layer is non-negative. The loss function for back propagation is the mean square error [23], as expressed in Eq. 2, where N is the total number of training samples and the error is measured between the output of the skip-connection DNN model and the text-removed ground truth. We implement and train the network on Caffe. Stochastic gradient descent (SGD) with a learning rate of [value missing in the source text] is used in the training phase.
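The body of Eq. 2 is likewise not reproduced in this copy. A mean square error of the kind described could be written as below, in assumed notation: x_i is the i-th input patch, F(x_i) is the output of the skip-connection DNN, and y_i is the corresponding text-removed ground truth. The normalization constant (1/N here) is an assumption, and some formulations use 1/(2N) instead.

```latex
% Assumed notation (not from the original paper):
%   x_i    : i-th input patch
%   F(x_i) : output of the skip-connection DNN for x_i
%   y_i    : text-removed ground truth for x_i
%   N      : total number of training samples
E = \frac{1}{N} \sum_{i=1}^{N} \left\| F(x_i) - y_i \right\|_2^2
```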


III-B Training

Since the DNN takes image patches as input, the training samples must be collected at the patch level. The aim of the system is to hide the text in natural scene images. For positive samples, the inputs are scene text images and the ground truth is the same images with the text removed. For negative samples, the input and the ground truth are the same background images. To generate the training samples automatically, image inpainting is used. Inpainting is a technique that fills in defects in an image and makes them inconspicuous; in particular, it is frequently used to restore noisy images. In our case, the text is treated as the defect and is filled with the surrounding background color by inpainting.

Fig. 4: Ground truth generation. (a) Original image. (b) Character-level ground truth. (c) Binary character ground truth. (d) Inpainting result. (e) Result of one dilation of the binary character ground truth image. (f) Inpainting result based on the binary image in (e). (g) Result of three dilations of the binary character ground truth image. (h) Inpainting result based on the binary image in (g).

Fig. 4 shows the details of the processing. Given the pixel-level character ground truth (Fig. 4(b)), inpainting is applied to the original scene text image. The binary character ground truth serves as the base mask, as shown in Fig. 4(c), and the processing result is shown in Fig. 4(d). The pixels on the character strokes are inpainted with the surrounding background color. To make the boundaries between characters and background less conspicuous, an additional dilation is applied to the base mask before inpainting. Fig. 4(e) and 4(g) show the results of performing dilation once and three times, respectively, and Fig. 4(f) and 4(h) show the final images generated by dilation followed by inpainting.
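The dilation step applied to the binary character mask can be illustrated with a small pure-Python sketch. The 3×3 square structuring element is an assumption, since the text does not specify one, and the subsequent inpainting step (for example the exemplar-based method of [7]) is not implemented here.

```python
def dilate(mask):
    """One pass of binary dilation with an assumed 3x3 square element."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # A pixel is set if any pixel in its 3x3 neighbourhood is set.
            if any(mask[ny][nx]
                   for ny in range(max(0, y - 1), min(height, y + 2))
                   for nx in range(max(0, x - 1), min(width, x + 2))):
                out[y][x] = 1
    return out


def dilate_n(mask, times):
    """Apply dilation repeatedly, e.g. once or three times as in Fig. 4."""
    for _ in range(times):
        mask = dilate(mask)
    return mask
```

Under this assumption, each pass widens the masked strokes by one pixel on every side, so three passes hand the inpainting step a 3-pixel margin around each character.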

To collect the patch-level training samples, a sliding window of 64×64 pixels is used. Batch formation is performed as well. Each pair of input and output images is cropped from the same position in the original image and the corresponding inpainted image. The character ground truth guides the classification of patches into positive and negative samples: if the corresponding cropped region of the character ground truth image contains any text, the patch is classified as a text sample; otherwise, it is a background sample. Examples of the training data are shown in Fig. 5. With this process, the training samples can be collected and classified automatically.
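The sample collection described above might look like the following sketch. The function names and the stride of 32 pixels are assumptions (the text does not state the stride used for training patches), and images are plain lists of pixel rows.

```python
def crop(image, top, left, size):
    """Cut a size x size patch out of a list-of-rows image."""
    return [row[left:left + size] for row in image[top:top + size]]


def make_training_pairs(image, inpainted, char_mask, size=64, stride=32):
    """Split an image into (input, ground truth) patch pairs.

    A patch is positive if the character mask has any text pixel inside
    it; its ground truth is then the inpainted patch. Otherwise the patch
    is negative and input and ground truth are the same background patch.
    The stride value is an assumption, not taken from the paper.
    """
    positives, negatives = [], []
    for top in range(0, len(image) - size + 1, stride):
        for left in range(0, len(image[0]) - size + 1, stride):
            source = crop(image, top, left, size)
            mask_patch = crop(char_mask, top, left, size)
            if any(any(row) for row in mask_patch):
                positives.append((source, crop(inpainted, top, left, size)))
            else:
                negatives.append((source, source))
    return positives, negatives
```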

Fig. 5: Examples of training samples for DNN learning. (a) Positive samples. (b) Negative samples.

IV Experimental results

IV-A Dataset

In the experiment, a Flickr image dataset containing more than 3000 scene images and the benchmark dataset ICDAR 2013 [9], which contains 229 images, are used for training. Most of the images in this dataset show signboards and billboards with text on them. The fonts, colors and positions of the characters, as well as the backgrounds, vary widely, which benefits model training. To evaluate the performance, ICDAR 2013 images that differ from those used in training are tested.

IV-B Qualitative Evaluation

Fig. 6 shows some text-erased images produced by our proposed method. In Fig. 6, text is successfully erased even against complicated backgrounds, such as glass and trees. However, our proposed model fails in some cases, as also shown in Fig. 6. Our work uses only a single-scale sliding window to obtain the subregions. When a character is much larger than the window, the captured part of the character may be regarded as background, so the output of the DNN leaves that subregion unchanged. This results in poor erasing performance on images with large characters. Comparing the results of the differently trained DNNs, the one trained with three-times dilation and inpainting ground truth performs best. As shown in Fig. 6, the text in the images of the last column is mostly erased and cannot be distinguished by humans. Since the dilation causes more pixels on the character boundary to be treated as part of the character, the text erasing result looks smooth and natural.

Fig. 6: Examples of text erased images. Images in the columns from left to right correspond to the original images, text erased images by training with only inpainting ground truth, text erased images by training with one time dilatation and inpainting ground truth, text erased images by training with three times dilatation and inpainting ground truth.

IV-C Quantitative Evaluation

To evaluate the scene text erasing performance, a modified text detection method [8] is used to detect the text in the images after erasing. It is an object proposal-based deep neural network that predicts discrete regions with different aspect ratios and scales from multiple feature maps. To adapt it to text detection, we select six aspect ratios, 0.7, 1, 2, 3, 5 and 7, for designing the default boxes. The scales of the prediction layers range from 0.06 to 0.85. In total, 38124 regions are estimated, most of which are non-text regions. Only detections with a text probability higher than 0.7 are retained as text. We test this text detection method on the scene-text-erased ICDAR 2013 images and compare the results with those on the original scene text images. The datasets are named according to how they are generated, as follows:

  • Original images dataset: focused scene text images in ICDAR 2013.

  • Erased0 images dataset: Scene text erased images of ICDAR 2013 by network trained with inpainting ground truth.

  • Erased1 images dataset: Scene text erased images of ICDAR 2013 by network trained with one time dilatation and inpainting ground truth.

  • Erased3 images dataset: Scene text erased images of ICDAR 2013 by network trained with three times dilatation and inpainting ground truth.
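The default-box settings of the detector described before the dataset list can be illustrated as follows. The aspect ratios, scale range and 0.7 threshold come from the text; the linear spacing of scales and the box-shape formula follow the usual SSD convention, and the number of prediction layers (six) is an assumption, so this sketch is not expected to reproduce the 38124 regions reported.

```python
import math

ASPECT_RATIOS = (0.7, 1, 2, 3, 5, 7)  # from the text
S_MIN, S_MAX = 0.06, 0.85             # scale range from the text
TEXT_THRESHOLD = 0.7                  # minimum text probability kept


def layer_scales(num_layers=6):
    """Linearly spaced scales, one per prediction layer (count assumed)."""
    step = (S_MAX - S_MIN) / (num_layers - 1)
    return [S_MIN + step * k for k in range(num_layers)]


def default_box_shapes(scale):
    """(width, height) of the default boxes at one scale, relative to the image."""
    return [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in ASPECT_RATIOS]


def keep_text(detections):
    """Keep (box, probability) pairs whose text probability exceeds 0.7."""
    return [d for d in detections if d[1] > TEXT_THRESHOLD]
```

Most of the chosen ratios are greater than 1, which gives wide boxes suited to horizontal text lines.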

We follow the standard text detection performance measurement by computing the precision, recall and f-score under two protocols, DetEval [25] and the ICDAR 2013 evaluation [9]. Precision is the proportion of correctly detected text regions among all detected regions. Recall is the proportion of correctly detected text regions among the ground truth text regions. The f-score is a trade-off between precision and recall, computed as their harmonic mean. Table I shows the results. After text erasing, the recall of text detection decreases by more than 70%, which shows that fewer text regions are detected. The precision decreases by about 30%, indicating that the proportion of non-text regions among all detected regions becomes higher. Compared with the text detection results on the original images, detection performs worse on all three text-erased image datasets. The overall measure, the f-score, drops drastically after text erasing. Conversely, this demonstrates the effectiveness of the proposed method. As explained above, adding the dilation operation to the training samples allows the text to be erased more smoothly and naturally. Without sharp boundaries between the background and the text regions, the erased text regions are much more difficult to detect. Examples of text detection results are displayed in Fig. 7. The proposed text eraser distinguishes text regions from non-text regions well. From the results, we can see that most text regions go through the erasing process and are hidden afterwards.
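The three measures can be written out as a short sketch. The count-based matching below is a simplification and an assumption: the actual ICDAR 2013 and DetEval protocols decide which detections count as correct using area-overlap rules that are not reproduced here.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f(num_correct, num_detected, num_ground_truth):
    """Precision, recall and f-score from simple region counts."""
    precision = num_correct / num_detected if num_detected else 0.0
    recall = num_correct / num_ground_truth if num_ground_truth else 0.0
    return precision, recall, f_score(precision, recall)
```

For example, applying `f_score` to the precision (83.70%) and recall (82.56%) reported for the original images under the ICDAR protocol gives about 83.13%, in agreement with Table I.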

In this work, we used only a single-scale sliding window-based method to perform text erasing. It has some weakness in erasing large text. In future work, a real end-to-end system will be employed, in which the input is a complete scene text image and the output is the text-erased image. For training, the full images and the corresponding inpainted images will be used as training samples instead of cropped image patches. Additionally, we will propose a new evaluation method that measures character erasing performance beyond text detection evaluation alone.

Fig. 7: Text detection performance on original images and text erased images. The detection results in the columns from left to right correspond to the Original images dataset, Erased0 images dataset, Erased1 images dataset and Erased3 images dataset.
Image dataset | ICDAR Eval Recall | ICDAR Eval Precision | ICDAR Eval f-score | DetEval Recall | DetEval Precision | DetEval f-score
Original images | 82.56% | 83.70% | 83.13% | 81.90% | 87.15% | 84.45%
Erased0 images | 21.74% | 69.31% | 33.09% | 22.25% | 70.17% | 33.78%
Erased1 images | 13.88% | 59.20% | 22.49% | 14.45% | 60.48% | 23.32%
Erased3 images | 8.35% | 54.07% | 14.46% | 8.89% | 54.53% | 15.30%
TABLE I: The text detection performance on four datasets.

V Conclusion

To protect the privacy of text-based information in natural scene images, we proposed a novel scene text eraser. It uses image transformation to convert scene text images into text-erased images via an inpainting deep neural network. The network processes image patches, cropped by a sliding window, through convolution followed by deconvolution. To improve the resolution of the output images and preserve more information from the non-text parts of the original images, we use skip connections that sum the feature maps of the deconvolutional layers with those of the corresponding convolutional layers. For model training, dilation and inpainting are applied in sequence to generate the training samples. A text detection method was used to evaluate the text erasing performance on the ICDAR 2013 dataset. The precision, recall and f-score dropped drastically after the text was erased, which demonstrates the effectiveness of the text eraser. In future work, we will develop the model in an end-to-end fashion and devise a new evaluation method to better measure the performance of the scene text eraser.

VI Acknowledgments

The pictures at the bottom left of Fig. 5(a) and the top left of Fig. 5(b) are taken from Flickr under the copyright license. The authors would like to thank the contributors of those pictures. Bottom left of Fig. 5(a) and top left of Fig. 5(b): alykat


  • [1] K. Inai, M. Pålsson, V. Frinken, Y. Feng, and S. Uchida, “Selective concealment of characters for privacy protection,” in Pattern Recognition (ICPR), 2014 22nd International Conference on. IEEE, 2014, pp. 333-338.
  • [2] Q. Ye and D. Doermann, “Text detection and recognition in imagery: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 7, pp. 1480-1500, 2015.
  • [3] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Reading text in the wild with convolutional neural networks,” International Journal of Computer Vision, vol. 116, no. 1, pp. 1-20, 2016.
  • [4] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520-1528.
  • [5] M. Á. Carreira-Perpiñán, “Fast nonparametric clustering with Gaussian blurring mean-shift,” in International Conference on Machine Learning, 2006, pp. 153-160.
  • [6] G. Yang and H. Jing, “Multiple convolutional neural network for feature extraction,” Proceedings of the IEEE International Conference on Image Processing, pp. 104-114, 2015.
  • [7] A. Criminisi, P. Pérez, and K. Toyama, “Region filling and object removal by exemplar-based image inpainting,” IEEE Transactions on Image Processing, vol. 13, no. 9, pp. 1200-1212, 2004.
  • [8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision. Springer, 2016, pp. 21-37.
  • [9] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. de las Heras, “ICDAR 2013 robust reading competition,” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013, pp. 1484-1493.
  • [10] Y. Zhu, C. Yao, and X. Bai, “Scene text detection and recognition: Recent advances and future trends,” Frontiers of Computer Science, vol. 10, no. 1, pp. 19-36, 2016.
  • [11] X.-C. Yin, X. Yin, K. Huang, and H.-W. Hao, “Robust text detection in natural scene images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 970-983, 2014.
  • [12] M. Oka and Y. Kurauchi, “Method and system for image transformation,” Oct. 23 1990, US Patent 4,965,844.
  • [13] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision. Springer, 2016, pp. 694-711.
  • [14] W. Huang, Y. Qiao, and X. Tang, “Robust scene text detection with convolution neural network induced MSER trees,” in European Conference on Computer Vision. Springer, 2014, pp. 497-511.
  • [15] B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2963-2970.
  • [16] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, “End-to-end text recognition with convolutional neural networks,” in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 3304-3308.
  • [17] L. Neumann and J. Matas, “Scene text localization and recognition with oriented stroke detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 97-104.
  • [18] G. Gan and J. Cheng, “Pedestrian detection based on HOG-LBP feature,” in Computational Intelligence and Security (CIS), 2011 Seventh International Conference on. IEEE, 2011, pp. 1184-1187.
  • [19] E. Simo-Serra, S. Iizuka, K. Sasaki, and H. Ishikawa, “Learning to simplify: fully convolutional networks for rough sketch cleanup,” ACM Transactions on Graphics (TOG), vol. 35, no. 4, p. 121, 2016.
  • [20] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414-2423.
  • [21] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification,” ACM Transactions on Graphics (TOG), vol. 35, no. 4, p. 110, 2016.
  • [22] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440.
  • [23] X.-J. Mao, C. Shen, and Y.-B. Yang, “Image restoration using convolutional auto-encoders with symmetric skip connections,” arXiv preprint arXiv:1606.08921, 2016.
  • [24] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.
  • [25] C. Wolf, E. Lombardi, J. Mille, O. Celiktutan, M. Jiu, E. Dogan, G. Eren, M. Baccouche, E. Dellandréa, C.-E. Bichot et al., “Evaluation of video activity localizations integrating quality and quantity measurements,” Computer Vision and Image Understanding, vol. 127, pp. 14-30, 2014.