Text-Aware Single Image Specular Highlight Removal

by   Shiyu Hou, et al.

Removing undesirable specular highlights from a single input image is of crucial importance to many computer vision and graphics tasks. Existing methods typically remove specular highlights from medical images and specific-object images; however, they cannot handle images with text. In addition, the impact of specular highlights on text recognition is rarely studied by the text detection and recognition community. Therefore, in this paper, we raise and study, for the first time, the text-aware single image specular highlight removal problem. The core goal is to improve the accuracy of text detection and recognition by removing the highlight from text images. To tackle this challenging problem, we first collect three high-quality datasets with fine-grained annotations, which will be released to facilitate the relevant research. Then, we design a novel two-stage network, which contains a highlight detection network and a highlight removal network. The output of the highlight detection network provides additional information about highlight regions to guide the subsequent highlight removal network. Moreover, we suggest a measurement set including the end-to-end text detection and recognition evaluation and an auxiliary visual quality evaluation. Extensive experiments on our collected datasets demonstrate the superior performance of the proposed method.




1 Introduction

Specular highlights often exist in real-world images due to the material properties of objects and the capturing environments. It is desirable to reduce or eliminate these specular highlights to improve the visual quality and to facilitate vision and graphics tasks, such as stereo matching [10, 12], text recognition [16], image segmentation [3, 6] and photo-consistency [30, 31]. Fig. 1 shows examples: the performance of end-to-end text detection and recognition drops due to the existence of highlights in the images, while our method is designed to detect and remove the highlight so as to improve the subsequent OCR performance.

In the last decades, many approaches have been proposed to address this challenging specular highlight removal problem. These existing works can be roughly classified into three categories: dichromatic reflection model-based methods [28, 32, 23, 7, 26], inpainting-based methods [27, 21, 4, 18], and deep learning-based methods [8, 15, 20]. The dichromatic reflection model [24] linearly combines the diffuse and specular reflections, and many methods have subsequently been proposed based on this model. These methods usually require some simplifying assumptions. In addition, they often need pre-processing operations, e.g., segmentation, when encountering images with diverse colors and complex textures, which results in low efficiency and weak practicability. Inpainting-based methods mainly recover the original image contents behind the highlight by borrowing techniques from the image inpainting community. This kind of method has limited performance under large highlight contamination. Considering the complexity of single image specular highlight removal, some recent works [8, 15, 20] are based on deep neural networks, e.g., convolutional neural networks (CNNs) and generative adversarial networks (GANs). With the aid of the powerful learning capacity of deep models, these deep learning-based methods usually perform better than traditional optimization-based methods. However, they require large-scale training data, especially paired real-world images with the necessary annotations, which are time-consuming and even difficult to collect.

Figure 1: Selected single image specular highlight removal results of our method. Panels: Highlight Image, OCR Result, Mask, Ours, OCR Result. “Mask” and “Ours” are the outputs of our first-stage and second-stage networks, respectively.

Existing specular highlight removal methods mainly process medical images, natural images, and specific-object images; however, no existing work focuses on text images. On the other hand, for end-to-end text detection and recognition, many approaches have been proposed to handle texts with arbitrary shapes and various orientations. To our knowledge, the case of text images with specular highlight contamination is rarely studied. Therefore, in this paper, we conduct an extensive study on the text-aware single image specular highlight removal problem, including dataset collection, network architecture, training losses, and evaluation metrics. The main contributions of our work are as follows:

  • We are the first in the literature to raise and study the text-aware single image specular highlight removal problem. To study this challenging problem, we collect three high-quality datasets with fine-grained annotations.

  • We propose a novel two-stage framework for highlight region detection and removal, implemented with two sub-networks. The highlight detection network provides useful location information to facilitate the subsequent highlight removal network. For the training objectives, we jointly exploit a detection loss, a reconstruction loss, a GAN loss, and a text-related loss to achieve good performance.

  • For the result comparison, we suggest a comprehensive measurement set, which contains the end-to-end text detection and recognition performance and an auxiliary visual quality evaluation.

2 Related Work

2.1 Dichromatic Reflection Model-Based Methods

The dichromatic reflection model [24] assumes that the image intensity can be represented by a linear combination of diffuse and specular reflections. This model has been widely used for specular highlight removal. Based on the distribution of diffuse and specular points in the maximum chromaticity-intensity space, Tan et al. [28] separated the reflection components by identifying the diffuse maximum chromaticity and then applying a specular-to-diffuse mechanism. Inspired by the observation that the diffuse maximum chromaticity in a local patch of color images changes smoothly, Yang et al. [32] enhanced the real-time performance and robustness of the chromaticity estimation by applying bilateral filtering. To exploit the global information of color images for specular reflection separation, Ren et al. [23] proposed a global color-lines constraint based on the dichromatic reflection model. Fu et al. [7] reformulated the estimation of the diffuse and specular images as an energy minimization with sparse constraints, which can be approximately solved. Recently, Son et al. [26] proposed a convex optimization framework to effectively remove the specular highlight from chromatic and achromatic regions of natural images. These dichromatic reflection model-based approaches often have limited performance on images with diverse colors and complex textures.
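The linear combination assumed by the dichromatic reflection model can be illustrated for a single RGB pixel. The following is only a minimal sketch: the function name, the weights `m_d` and `m_s`, and all numeric values are hypothetical and are not taken from any of the cited methods.

```python
# Sketch of the dichromatic reflection model for one RGB pixel.
# All names and numeric values here are illustrative assumptions.

def dichromatic_pixel(diffuse_rgb, illum_rgb, m_d, m_s):
    """Return m_d * diffuse colour + m_s * illumination colour, per channel."""
    return tuple(m_d * d + m_s * s for d, s in zip(diffuse_rgb, illum_rgb))

diffuse = (0.8, 0.2, 0.1)   # surface (body) colour
illum = (1.0, 1.0, 1.0)     # white light source

matte = dichromatic_pixel(diffuse, illum, m_d=0.9, m_s=0.0)
shiny = dichromatic_pixel(diffuse, illum, m_d=0.9, m_s=0.5)

# Under white light the specular term adds the same offset to every channel,
# which pushes the pixel toward the illumination colour.
specular_only = tuple(s - m for s, m in zip(shiny, matte))
```

Separating `shiny` back into its diffuse and specular parts from the observed pixel alone is the ill-posed step that the methods above address with additional assumptions.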

2.2 Inpainting-Based Methods

Inpainting completes the missing regions of an image by propagating information from the known regions, and this technique can be used to restore damaged paintings or remove specific objects [5]. Tan et al. [27] first proposed an inpainting-based method for highlight removal by incorporating illumination-based constraints. Ortiz and Torres [21] designed a connected vectorial filter integrated into the inpainting process to eliminate the specular reflectance. Park and Lee [22] introduced a highlight inpainting method based on color line projection; however, this method needs two images taken with different exposure times. Inpainting-based highlight removal methods were also proposed to handle medical images, such as endoscopic images [4] and colposcopic images [18]. However, these inpainting-based methods are only effective for images with small areas of highlight contamination.

2.3 Deep Learning-Based Methods

Different from the aforementioned two kinds of methods, deep learning-based methods do not require a specular highlight model assumption, and thus have the potential to handle various scenarios. Lee et al. [14] proposed a perceptron artificial neural network to detect the specular reflections of tooth images and then applied a smoothing spatial filter to recursively correct the specular reflections. Due to the lack of paired training data, Funke et al. [8] adopted the cycle GAN framework [33] and introduced a self-regularization loss aiming to reduce image modification in non-specular regions. Similarly, Lin et al. [15] also adopted a GAN framework and proposed a multi-class discriminator, which distinguishes the generated diffuse images from real ones and from the original input images as well. Muhammad et al. [20] proposed two deep models (Spec-Net and Spec-CGAN) for specularity removal from faces. The former takes the intensity channel as input while the latter takes the RGB image as input. These methods were mainly proposed for medical images, specific-object images, or facial images, whereas our work focuses on text images.

Figure 2: The collection pipelines of real dataset (a) and synthetic dataset (b), and an example of paired data sample (c).

3 Datasets

In the literature, there is no publicly available dataset for studying the text-aware single image specular highlight removal problem. Therefore, in this work, we collect three high-quality datasets including a real dataset and two synthetic datasets. The pipelines of datasets collection and an example of paired data sample are shown in Fig. 2.

3.1 Real Dataset

Fig. 2(a) illustrates the pipeline of real dataset collection. For the real dataset, we collect 2,025 image pairs, each consisting of an image with text-aware specular highlight, the corresponding highlight-free image, and a binary mask image indicating the location of the highlight. The image contents include ID cards and driver’s licenses, which contain a lot of text information. We first put a transparent plastic film on the picture and turn on the light. Then, the camera shoots to obtain a highlight image. Correspondingly, we obtain a highlight-free image by turning off the light. The shapes and intensities of the highlights are varied by adjusting the location of the plastic film. The binary mask image is obtained from the highlight image and the highlight-free image through differencing and multiple threshold screening. We randomly split this dataset (named RD) into a training set (1,800 images) and a test set (225 images).
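The mask derivation by differencing and thresholding can be sketched as follows. This is a hedged illustration only: the text does not give the actual thresholds or screening rules, so the two thresholds `diff_thresh` and `bright_thresh`, the function name, and the grayscale toy data are all assumptions.

```python
# Sketch: binary highlight mask from a (highlight, highlight-free) image pair.
# Threshold values and the two-condition rule are assumptions for illustration.

def highlight_mask(highlight, clean, diff_thresh=40, bright_thresh=200):
    """highlight, clean: 2D lists of grayscale values in [0, 255], same shape."""
    mask = []
    for row_h, row_c in zip(highlight, clean):
        mask.append([
            # A pixel is marked as highlight if it became much brighter
            # than in the highlight-free image AND is bright in absolute terms.
            1 if (h - c) >= diff_thresh and h >= bright_thresh else 0
            for h, c in zip(row_h, row_c)
        ])
    return mask

highlight = [[250, 120], [90, 230]]
clean = [[100, 110], [85, 215]]
mask = highlight_mask(highlight, clean)
```

Only the top-left pixel passes both conditions in this toy pair; the bottom-right pixel is bright but barely changed, so it is treated as bright content rather than highlight.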

3.2 Synthetic Datasets

To further enrich the diversity of our dataset, we construct two sets of synthetic images using the 3D computer graphics software Blender. Fig. 2(b) shows the pipeline of synthetic dataset collection. We first collect 3,679 images with texts from supermarkets and streets, and 2,025 images mentioned in Sec. 3.1. Then, we use the Blender with Cycles engine to automatically generate 27,700 groups of text-aware specular highlight images, the corresponding highlight-free images and highlight mask images. In particular, the highlight shapes include circles, triangles, ellipses, and rings to simulate the lighting conditions in real-world scenes. The material roughness is randomly set in the range [0.1,0.3], and the illumination intensity is randomly chosen from the range [40,70]. To force the specular highlight on the text areas of the image, we provide the Blender with the location information of the text areas obtained via the text detection model CTPN [29].
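The randomized rendering parameters described above can be sketched in plain Python. This sketch deliberately does not call the Blender API; the function name, the dictionary layout, the uniform sampling, and the choice to centre the highlight on a detected text box are assumptions made for illustration, while the shape list and the two ranges come from the text.

```python
import random

SHAPES = ("circle", "triangle", "ellipse", "ring")

def sample_highlight_params(text_boxes, rng):
    """Draw one highlight configuration.

    text_boxes: list of (x, y, w, h) tuples, e.g. from a text detector.
    rng: a random.Random instance, passed in for reproducibility.
    """
    x, y, w, h = rng.choice(text_boxes)
    return {
        "shape": rng.choice(SHAPES),
        "roughness": rng.uniform(0.1, 0.3),   # material roughness range
        "intensity": rng.uniform(40, 70),     # illumination intensity range
        # Assumption: place the highlight at the centre of a text box so that
        # it overlaps text.
        "center": (x + w / 2, y + h / 2),
    }

rng = random.Random(0)
params = sample_highlight_params([(10, 20, 100, 30), (40, 80, 60, 20)], rng)
```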

The product and street-view images contain fewer texts per image, while the texts in ID cards and driver’s licenses are denser. Under the same illumination condition, the difficulty of restoring the text information corrupted by the specular highlight therefore differs between these two kinds of images. Accordingly, we divide the two types of images into two datasets, namely SD1 (product and street-view images) and SD2 (ID cards and driver’s licenses). SD1 contains 12,000 training samples and 2,000 test samples. SD2 contains 12,000 training samples and 1,700 test samples. Note that the image contents of RD and SD2 are the same.
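As a quick consistency check, the split sizes reported for the three datasets can be tallied. The dictionary layout and variable names are assumptions; the counts are the ones stated in Sec. 3.1 and Sec. 3.2.

```python
# Split sizes as reported in the text.
SPLITS = {
    "RD": {"train": 1800, "test": 225},
    "SD1": {"train": 12000, "test": 2000},
    "SD2": {"train": 12000, "test": 1700},
}

totals = {name: s["train"] + s["test"] for name, s in SPLITS.items()}
synthetic_total = totals["SD1"] + totals["SD2"]
```

The synthetic total matches the 27,700 generated groups, and the RD total matches the 2,025 collected real pairs.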

4 Proposed Method

In this work, we propose a two-stage framework to detect and remove the specular highlight from text images. The whole architecture is shown in Fig. 3. In the following, we describe the details of our network architecture and the loss functions.

Figure 3: The whole structure of our proposed specular highlight removal framework, which consists of a highlight detection network, a highlight removal network, and a patch-based discriminator. Symbol means the channel-wise concatenation.

4.1 Network Architecture

Highlight Detection Network.

The highlight detection network takes the text image with specular highlight as input and outputs a mask indicating the highlight regions. Each element of the mask is in [0,1], and a larger value stands for a higher probability that the corresponding location of the input image is covered by the specular highlight. Because the input image and the mask have the same width and height, we adopt a fully convolutional architecture for this network, consisting of three downsampling and three upsampling layers. Each downsampling layer is followed by two convolutional layers, and each upsampling layer is followed by three convolutional layers.

Highlight Removal Network.

After obtaining the highlight mask, the highlight removal network is applied to remove the specular highlight and recover the text information. It takes the input text image and the detected highlight mask as input and outputs a highlight-free image. With the mask as guidance, the removal network can pay more attention to the highlight regions and achieve better removal performance. For its architecture, we adopt an encoder-decoder network with skip connections, consisting of two downsampling layers, four residual blocks, and two upsampling layers. To further enhance the removal performance, we also apply a patch-based discriminator [19]. The discriminator includes one convolutional layer and five downsampling layers with a kernel size of 5 and a stride of 2. Spectral normalization is used to stabilize the training of the discriminator.
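A patch-based discriminator outputs a grid of real/fake scores rather than a single scalar, and the grid size follows from the number of stride-2 layers. The sketch below traces the spatial size through five such layers. It assumes "same"-style padding (so each layer maps a side length n to ceil(n / 2)) and a hypothetical 256-pixel input side; neither is stated in the text.

```python
import math

def patch_map_size(size, num_down=5, stride=2):
    """Side length after num_down strided convolutions.

    Assumption: 'same'-style padding, so each layer maps n to ceil(n / stride).
    """
    for _ in range(num_down):
        size = math.ceil(size / stride)
    return size

out = patch_map_size(256)  # 256 is a hypothetical input side length
```

Under these assumptions a 256 x 256 input yields an 8 x 8 score map, so each score judges one local region of the restored image.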

4.2 Loss Functions

Next, we illustrate the loss functions used for training our network.

Highlight Detection Loss.

For the objective function of highlight detection network, we use loss, i.e., , where is the ground truth of highlight mask.

Reconstruction Loss.

The reconstruction loss adds constraints in both the pixel space and the feature space. The pixel-aware loss consists of a pixel-wise difference term and a total variation (TV) term: . The feature-aware loss includes a perceptual loss [11] and a style loss [9]: , where is the feature maps of pre-trained VGG-16 [25] and is the Gram matrix [9]. The feature-aware loss improves the visual quality of results.
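The building blocks named above (a pixel-wise difference, a total variation term, and a Gram matrix for the style loss) can be sketched on tiny arrays. This is an illustration under assumptions, not the paper's exact formulation: the use of absolute differences, the normalizations, and all function names are chosen here for clarity, and real feature maps would come from a pre-trained VGG-16 rather than hand-written lists.

```python
def l1_loss(a, b):
    """Mean absolute pixel difference between two 2D lists of equal shape."""
    n = sum(len(row) for row in a)
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def tv_loss(img):
    """Total variation: summed absolute differences of horizontal and
    vertical neighbours, divided by the number of pixels (an assumed scaling)."""
    h, w = len(img), len(img[0])
    dx = sum(abs(img[i][j + 1] - img[i][j]) for i in range(h) for j in range(w - 1))
    dy = sum(abs(img[i + 1][j] - img[i][j]) for i in range(h - 1) for j in range(w))
    return (dx + dy) / (h * w)

def gram_matrix(features):
    """features: list of C channels, each a flat list of N activations.
    Returns the C x C matrix of channel inner products divided by N."""
    n = len(features[0])
    return [[sum(p * q for p, q in zip(fi, fj)) / n for fj in features]
            for fi in features]

pred = [[0.0, 1.0], [1.0, 0.0]]
target = [[0.0, 0.0], [1.0, 1.0]]
pixel = l1_loss(pred, target)
smooth = tv_loss(pred)
gram = gram_matrix([[1.0, 2.0], [0.0, 1.0]])
```

A style loss would then compare the Gram matrices of the output and ground-truth features, while a perceptual loss compares the feature maps directly.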

GAN Loss.

In the highlight removal network, we use a patch-based discriminator to enhance the visual realism of results. For the GAN loss, we adopt the hinge loss. Therefore, the adversarial loss for is . The loss used for training the discriminator is formulated as .
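For reference, the commonly used hinge formulation of the GAN loss can be sketched as below. This is the standard form and is offered as an assumption about what the paper's equations express; the function names and the toy score lists are hypothetical, and the scores stand in for the per-patch outputs of the discriminator.

```python
def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1, fake below -1."""
    real = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real + fake

def g_hinge_loss(fake_scores):
    """Generator (removal network) adversarial loss: raise scores on its outputs."""
    return -sum(fake_scores) / len(fake_scores)

real_scores = [1.5, 0.5]    # discriminator outputs on ground-truth patches
fake_scores = [-2.0, 0.0]   # discriminator outputs on restored patches
d_loss = d_hinge_loss(real_scores, fake_scores)
g_loss = g_hinge_loss(fake_scores)
```

Scores already beyond the margin (1.5 for real, -2.0 for fake) contribute nothing to the discriminator loss, which is the property that distinguishes the hinge loss from the original cross-entropy GAN loss.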

Text-Related Loss.

In this work, our specular highlight removal is text-aware. This means that the highlight removal network needs to pay more attention to recovering the texts hidden behind the highlights. To do this, we apply pre-trained text detection and recognition models to provide supervision on the text recovery. More specifically, we add consistency constraints on the feature maps of and extracted from the above two pre-trained models, and the text-related loss is formulated as , where stands for the -th layer feature map from the pre-trained CTPN model [1] and denotes the -th layer feature map from the pre-trained DenseNet [1].

To this end, the total objective function of and is . In all experiments, we set and .

5 Experiments

5.1 Implementation Settings

Our network is implemented with TensorFlow 1.15 and trained on an NVIDIA TITAN RTX GPU. The Adam optimizer [13] with a batch size of 4 is used to train our network, where and . The learning rate is initialized as 0.0001. In our experiments, all the images are of size of . Note that the text recognition model used for result evaluation is different from the model used in the text-related loss.

5.2 Qualitative Evaluation

We qualitatively compare our method with two recent advanced specular highlight removal methods, Multi [15] and SPEC (Spec-CGAN [20]), on our three collected datasets. The results are shown in Fig. 4. Among these three methods, our method removes the highlight better and achieves superior end-to-end text detection and recognition performance. For example, our method successfully recovers the name, address, and ID number in the third row. Multi leaves apparent highlight remnants in the third and fourth rows due to its blind removal property, whereas our method can better perceive the highlight regions. Compared with Multi, the results of SPEC contain fewer highlights; however, its capability to recover the texts is limited by the cycleGAN framework that SPEC follows.

Figure 4: Qualitative comparisons of our method with Multi [15] and SPEC [20]. Panels: Ground Truth, Highlight Image, Multi, SPEC, Ours.

5.3 Quantitative Evaluation

In addition, we quantitatively compare the above three methods in terms of the end-to-end text detection and recognition performance and visual quality. For the end-to-end text detection and recognition evaluation, we adopt the common metrics [17]: recall, precision, and f-measure. We choose the current advanced text detection and recognition algorithm Paddle OCR [2] to calculate these three metrics. For visual quality evaluation, we utilize the PSNR and SSIM.
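The standard definitions behind these metrics can be sketched as follows. This is a minimal illustration with hypothetical function names and toy inputs: the matching rule that decides when a detected and recognized word counts as correct is protocol-specific and not reproduced here, the reported numbers may be aggregated differently (for example per image), and SSIM is omitted because it is considerably longer.

```python
import math

def prf(num_correct, num_predicted, num_ground_truth):
    """Precision, recall and f-measure from match counts."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_ground_truth if num_ground_truth else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between two 2D lists of equal shape."""
    sq = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    mse = sum(sq) / len(sq)
    return float("inf") if mse == 0 else 10 * math.log10(max_val ** 2 / mse)

# Toy example: 8 correct words out of 10 predicted and 16 annotated.
precision, recall, f_measure = prf(8, 10, 16)
noise_db = psnr([[255, 0]], [[0, 0]])
```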

Table 1 reports the numerical results of the three methods on our three datasets. Because RD and SD2 share the same image contents, for the real dataset (RD) we fine-tune the model trained on SD2 using the training set of RD for all three methods. From Table 1, we can find that our method achieves the best recall and f-measure on all three datasets, with competitive precision (see columns 3-5). Taking recall as an example, our method improves the end-to-end text detection and recognition performance over the unprocessed highlight image (Light Image) by 6.89 (SD1), 3.07 (SD2), and 13.65 (RD) percentage points, respectively. This improvement indicates that our method can better recover the original texts hidden behind the specular highlight. In addition, the end-to-end detection and recognition performance of Multi and SPEC is sometimes lower than that of Light Image. The reason is that these two methods remove texts along with the highlight. For PSNR and SSIM, SPEC is the worst, while our method and Multi are competitive on the synthetic datasets, and our method is better than Multi on the real dataset. PSNR and SSIM of our method are sometimes lower than those of Multi; however, these two metrics do not exactly match the visual quality perceived by human eyes. More importantly, we focus on the end-to-end text detection and recognition performance after highlight removal, and the visual quality is an auxiliary aspect.

Datasets  Methods       Recall ↑  Precision ↑  F-measure ↑  PSNR ↑  SSIM ↑
SD1       Light Image   85.03     94.70        88.70        17.58   82.37
SD1       Multi (2019)  86.28     94.76        89.30        26.29   89.86
SD1       SPEC (2020)   82.39     93.12        86.31        15.61   68.82
SD1       Ours          91.92     96.32        93.57        22.65   88.33
SD2       Light Image   80.50     95.89        87.10        11.79   66.42
SD2       Multi (2019)  79.21     93.82        84.88        28.99   91.81
SD2       SPEC (2020)   78.87     95.10        85.55        9.66    53.95
SD2       Ours          83.57     95.00        88.42        29.21   90.67
RD        Light Image   64.85     90.60        73.49        17.05   65.04
RD        Multi (2019)  61.58     87.63        70.72        17.17   64.23
RD        SPEC (2020)   70.59     91.62        78.38        14.82   52.49
RD        Ours          78.50     91.34        83.34        21.62   77.19

Table 1: Quantitative comparison of our method and two recent state-of-the-art methods: Multi [15] and SPEC [20]. All three methods are trained and tested on our three collected datasets separately. Recall, precision, f-measure, and SSIM are in %.
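The recall gains quoted in the text can be recomputed from the Table 1 values as the difference between the "Ours" and "Light Image" rows. The dictionary layout is an assumption; the numbers are copied from Table 1.

```python
# Recall values (in %) copied from Table 1.
RECALL = {
    "SD1": {"Light Image": 85.03, "Ours": 91.92},
    "SD2": {"Light Image": 80.50, "Ours": 83.57},
    "RD": {"Light Image": 64.85, "Ours": 78.50},
}

# Absolute gain in percentage points, rounded to two decimals to absorb
# floating-point error.
gains = {name: round(v["Ours"] - v["Light Image"], 2) for name, v in RECALL.items()}
```

The gain is largest on the real dataset, where the unprocessed highlight images have the lowest recall.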

5.4 Ablation Study

To verify the effectiveness of the text-related loss, we perform the ablation experiments and report the corresponding results in Table 2. We observe that the end-to-end text detection and recognition performance of our method with text-related loss is consistently improved for three datasets. This indicates that the text-related loss can enforce the highlight removal network to conduct the text-aware restoration. In addition, we can find that the end-to-end text detection and recognition performance of our method is already better than that of Multi and SPEC (comparing the first row of each dataset in Table 2 with the corresponding rows in Table 1), even though there is no text-related loss.

Datasets  Methods        Recall ↑  Precision ↑  F-measure ↑  PSNR ↑  SSIM ↑
SD1       w/o text loss  91.43     94.12        92.75        21.88   87.19
SD1       Ours           91.92     96.32        93.57        22.65   88.33
SD2       w/o text loss  82.69     93.48        87.76        28.12   89.93
SD2       Ours           83.57     95.00        88.42        29.21   90.67
RD        w/o text loss  77.04     89.66        82.87        21.38   76.11
RD        Ours           78.50     91.34        83.34        21.62   77.19

Table 2: Performance of our method without and with text-related loss on our three collected datasets.

6 Conclusion and Future Work

In this work, we studied and addressed the challenging problem of specular highlight removal from a single text image. To facilitate this study, we collected three high-quality datasets with fine-grained annotations. We proposed a two-stage framework including a highlight detection network and a highlight removal network. The output of the highlight detection network is used as auxiliary information, which guides the highlight removal network to pay more attention to the highlight regions. In addition, a text-related loss was introduced to improve the recovery of texts. Our source code and datasets are available at https://github.com/weizequan/TASHR.

In the future, we would like to construct larger and richer datasets to promote the development of related research. We would also like to design more effective networks and loss functions. Furthermore, an exciting research problem is to devise more complete and accurate visual quality measurements.


This work was supported by the National Key R&D Program of China (2019YFB2204104) and the National Natural Science Foundation of China (61772523).


  • [1] Chineseocr: Ctpn plus densenet plus ctc based chinese ocr, https://github.com/YCG09/chinese_ocr. Accessed 30 April 2021.
  • [2] Paddleocr: Awesome multilingual ocr toolkits based on paddlepaddle, https://github.com/PaddlePaddle/PaddleOCR. Accessed 30 April 2021.
  • [3] Arbeláez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(5), 898–916 (2011)
  • [4] Arnold, M., Ghosh, A., Ameling, S., Lacey, G.: Automatic segmentation and inpainting of specular highlights for endoscopic imaging. Journal on Image and Video Processing 2010 (2010)
  • [5] Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: ACM SIGGRAPH. pp. 417–424 (2000)
  • [6] Fleyeh, H.: Shadow and highlight invariant colour segmentation algorithm for traffic signs. In: IEEE Conference on Cybernetics and Intelligent Systems (2006)
  • [7] Fu, G., Zhang, Q., Song, C., Lin, Q., Xiao, C.: Specular highlight removal for real-world images. Computer Graphics Forum 38(7), 253–263 (2019)
  • [8] Funke, I., Bodenstedt, S., Riediger, C., Weitz, J., Speidel, S.: Generative adversarial networks for specular highlight removal in endoscopic images. In: Medical Imaging 2018: Image-Guided Procedures, Robotic Interventions, and Modeling. vol. 10576, pp. 8 – 16 (2018)
  • [9] Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2414–2423 (2016)

  • [10] Guo, X., Chen, Z., Li, S., Yang, Y., Yu, J.: Deep eyes: Binocular depth-from-focus on focal stack pairs. In: Chinese Conference on Pattern Recognition and Computer Vision. pp. 353–365 (2019)
  • [11] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Proceedings of the European Conference on Computer Vision. pp. 694–711 (2016)

  • [12] Khanian, M., Boroujerdi, A.S., Breuß, M.: Photometric stereo for strong specular highlights. arXiv preprint arXiv:1709.01357 (2017)
  • [13] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (2015)
  • [14] Lee, S.T., Yoon, T.H., Kim, K.S., Kim, K.D., Park, W.: Removal of specular reflections in tooth color image by perceptron neural nets. In: International Conference on Signal Processing Systems. vol. 1, pp. V1–285–V1–289 (2010)
  • [15] Lin, J., El Amine Seddik, M., Tamaazousti, M., Tamaazousti, Y., Bartoli, A.: Deep multi-class adversarial specularity removal. In: Image Analysis. pp. 3–15 (2019)
  • [16] Long, S., He, X., Yao, C.: Scene text detection and recognition: The deep learning era. International Journal of Computer Vision 129(1), 161–184 (2021)
  • [17] Lucas, S., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R.: Icdar 2003 robust reading competitions. In: International Conference on Document Analysis and Recognition. pp. 682–687 (2003)
  • [18] Meslouhi, O.E., Kardouchi, M., Allali, H., Gadi, T., Benkaddour, Y.A.: Automatic detection and inpainting of specular reflections for colposcopic images. Central European Journal of Computer Science 1 (2011)
  • [19] Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. In: International Conference on Learning Representations (2018)
  • [20] Muhammad, S., Dailey, M.N., Farooq, M., Majeed, M.F., Ekpanyapong, M.: Spec-net and spec-cgan: Deep learning models for specularity removal from faces. Image and Vision Computing 93, 103823 (2020)
  • [21] Ortiz, F., Torres, F.: A new inpainting method for highlights elimination by colour morphology. In: International Conference on Pattern Recognition and Image Analysis. pp. 368–376 (2005)
  • [22] Park, J.W., Lee, K.H.: Inpainting highlights using color line projection. IEICE Transactions on Information and Systems 90(1), 250–257 (2007)
  • [23] Ren, W., Tian, J., Tang, Y.: Specular reflection separation with color-lines constraint. IEEE Transactions on Image Processing 26(5), 2327–2337 (2017)
  • [24] Shafer, S.A.: Using color to separate reflection components. Color Research & Application 10(4), 210–218 (1985)
  • [25] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
  • [26] Son, M., Lee, Y., Chang, H.S.: Toward specular removal from natural images based on statistical reflection models. IEEE Transactions on Image Processing (2020)
  • [27] Tan, P., Lin, S., Quan, L., Shum, H.Y.: Highlight removal by illumination-constrained inpainting. In: IEEE International Conference on Computer Vision. pp. 164–169 (2003)
  • [28] Tan, R.T., Nishino, K., Ikeuchi, K.: Separating reflection components based on chromaticity and noise analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(10), 1373–1379 (2004)
  • [29] Tian, Z., Huang, W., He, T., He, P., Qiao, Y.: Detecting text in natural image with connectionist text proposal network. In: European Conference on Computer Vision. pp. 56–72 (2016)
  • [30] Wang, T.C., Efros, A.A., Ramamoorthi, R.: Occlusion-aware depth estimation using light-field cameras. In: IEEE International Conference on Computer Vision. pp. 3487–3495 (2015)
  • [31] Wang, W., Deng, R., Li, L., Xu, X.: Image aesthetic assessment based on perception consistency. In: Chinese Conference on Pattern Recognition and Computer Vision. pp. 303–315 (2019)
  • [32] Yang, Q., Wang, S., Ahuja, N.: Real-time specular highlight removal using bilateral filtering. In: European Conference on Computer Vision. pp. 87–100 (2010)
  • [33] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: IEEE International Conference on Computer Vision. pp. 2242–2251 (2017)