1 Introduction
When taking a photo of objects behind a glass window, unwanted reflections often appear. They are not only visually disturbing but may also degrade the performance of other computer vision algorithms (e.g., object detection, scene parsing). To solve this problem, reflection removal has been studied by a number of existing works [15, 16, 17, 10, 25]. Single image reflection removal is a challenging problem: it takes a single image as input and aims to separate it into two outputs, the clear background and the reflection. Specifically, given an input image with reflection, denoted as $I$, we need to separate it into the background $B$ and the reflection $R$ [16, 10]:
$I = B + R$   (1)

Apparently, Eqn. 1 is ill-posed: there are two unknowns ($B$ and $R$) for only one known ($I$). To obtain meaningful solutions, existing methods either introduce various low-level priors or use a deep neural network. For example, Li and Brown [16] adopt a smoothness prior which assumes the reflection is always smoother than the background. Shih et al. [20] propose a ghosting prior which models the double reflection caused by the two sides of a glass pane. However, such priors are all based on low-level cues and are not robust in many real scenes. Later, Fan et al. [10] trained a two-stage deep learning approach with low-level losses on color and edges to learn the mapping between mixture images and clean images. Recently, Zhang et al. [31] extract features from the first few layers of a pre-trained VGG-19 network and use them as a perceptual feature. Low-level information is insufficient for reflection separation when there is low-level appearance ambiguity. As shown in Fig. 1, both the reflection (bus) and the background (child) contain a variety of complex textures and colors which share similar statistics, and therefore the reflection cannot be easily removed by existing methods.
In this paper, we take inspiration from human cognition: humans can easily separate visual appearance into reflection and background. We notice that human vision achieves this capability by understanding objectness. In Fig. 1, we understand that the head, torso, hands and legs all belong to the same person and therefore to the same layer. This enables us to know that the red and blue coat belongs to the background, while the light black-and-white components belong to the reflection.
Implementing this idea is not trivial, because understanding the semantics of an image with reflection and later using them to separate the appearance is a “chicken and egg” problem. In other words, a naive semantic estimation network is not guaranteed to work robustly in the presence of reflection, while a cleaner image benefits semantic estimation. To solve this, like all existing works, we assume the intensity of the background is stronger than that of the reflection. We then propose the multi-task Semantic guided Reflection Removal Network (SRRN), which simultaneously learns semantic estimation and reflection removal and thereby resolves the “chicken and egg” problem. In our implementation of the multi-task learning, the semantic task and the reflection removal task share the same encoder and hidden parameters. Furthermore, we explicitly let the semantics guide the reflection removal, which directly reflects our idea.
To evaluate the effectiveness of SRRN, we conduct systematic experiments on three datasets: first, a real image dataset proposed by Zhang et al. [31]; second, a real benchmark proposed by Wan et al. [23]; third, our synthetic dataset. The experiments show consistent and significant performance improvements on all three datasets. Rigorous experiments also show that our implementation of multi-task learning outperforms the baselines.
Contributions. We summarize our contributions as follows:
- To the best of our knowledge, we are the first to use an object semantic prior for reflection removal, jointly solving semantic estimation and reflection removal from a single image.
- We propose a novel multi-task, end-to-end network structure for single image reflection removal (main task) with semantic guidance (sub-task).
- We demonstrate the consistent effectiveness of the method through systematic experiments on two existing datasets and our new dataset.
2 Related Work
Multiple-view methods. Many works solve Eqn. 1 with multiple inputs. The methods of [22, 15, 11] assume the reflection and background layers lie at different depth planes and can be separated via multi-view depth estimation. To align multiple inputs, optical flow has been adopted for reflection removal [27, 29]. The method in [21] removes reflections from in-vehicle black-box footage to obtain a cleaner video of the outside of the car. Recently, low-rank matrix completion [17] has also been used for reflection removal in videos.
Multiple-modality methods. Another group of works uses a pair of images captured with and without flash to remove reflections, such as [1]. Schechner et al. [19] use a group of images with different focal lengths and remove reflections by solving for the depth of the different layers. The works [14, 26] exploit polarization and capture multiple images to solve for the optimal separation through angular filtering.
Non-CNN single-image methods. Eqn. 1 is not directly solvable from a single image. To tackle this, Li and Brown [16] assume that the reflection layer is blurrier than the background layer and model the two layers with different gradient distributions for separation. Shih et al. [20] explore the ghosting effect in the reflection layer and design a GMM model for reflection removal. Arvanitopoulos et al. [2] perform reflection suppression through a relative gradient prior between the layers. Sandhan et al. [18] use the symmetry of the human face to remove reflections on eyeglasses. Yun et al. [30] propose an algorithm that removes virtual points in large-scale 3D point clouds using reflection symmetry and geometric similarity.
CNN-based single-image methods. Fan et al. [10] propose the Cascaded Edge and Image Learning Network (CEILNet) for reflection removal, in which the background's edges are predicted first and then used to guide the reflection separation. Wan et al. design a benchmark [23] for reflection removal and train an end-to-end model called CRRN [24] to separate the layers. Yang et al. [28] present the Bidirectional Network (BDN), which predicts the background and reflection layers sequentially, letting the two layers constrain each other and refining them from coarse to fine. Baslamisli et al. [3] design a CNN-based reflection and Retinex model to decompose intrinsic images in a two-stage method. Zhang et al. [31] propose a perceptual loss extracted from the first layers of VGG, combining feature, adversarial and exclusion losses. The main difference between the perceptual loss and ours is that we explicitly utilize high-level semantic information to guide reflection removal during training.
3 Semantic Guided Reflection Removal

3.1 A Case Study on Prior Based Methods
Before introducing our proposed method, we study existing methods to see their limitations. An example is illustrated in Fig. 2, where a human face is occluded by reflection interference. In this real case, all existing methods based on low-level features, such as the smoothness prior [16] and the ghosting effect [20], fail to remove the reflection. The worst result is from [2], which severely over-smooths the image content. Even recent CNN-based methods [10, 31, 28] cannot handle this case well. This reveals that neither the prior-based nor the direct image-to-image training-based methods are general enough for the reflection separation problem. However, even with this reflection, we find that the semantic information can still be reliably estimated, as shown in the last row. This is likely because semantic estimation gathers more global information and can recognize the human upper body as a whole. With this help from semantic information, our method (detailed later) generates the cleanest reflection separation (more comparisons can be found in Sec. 5).
3.2 Study on Semantic Information with Reflection Interference

[Fig. 3: semantic segmentation quality (mIoU) under different reflectance intensities, based on 5000 test cases, with one visual example. CI is the confidence interval. Semantic estimation degrades for strong reflections but remains robust for observations with low reflectance intensity.]
It is not guaranteed that semantic estimation remains robust in the presence of reflection. Following the above study, we therefore validate the robustness of semantic segmentation against different reflectance intensities. We randomly sample images from the Pascal VOC dataset [9], where ground-truth semantic labels are provided for 21 categories. Based on these, we synthesize images with reflection by linearly blending two images with a weight $\alpha$, where a larger $\alpha$ simulates a stronger reflectance intensity. In total, we generate 5000 test cases spanning a range of $\alpha$ values. Fig. 3 illustrates the relationship between semantic segmentation quality and reflectance intensity. The semantic estimation is robust for small $\alpha$, but the mIoU drops rapidly once $\alpha$ becomes large. We observe that current semantic estimation does not work well when the features are completely occluded, i.e., when the reflectance intensity is so strong that non-transmitted reflections with low transmittance occur.
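For readers who wish to reproduce a study of this kind, a minimal sketch is given below. It assumes NumPy images in [0, 1] and an arbitrary pre-trained segmentation model (`segment`); the blending formula, the choice of $\alpha$ values and the mIoU implementation are illustrative and not the exact protocol used in the paper.

```python
import numpy as np

def blend_with_reflection(background: np.ndarray,
                          reflection: np.ndarray,
                          alpha: float) -> np.ndarray:
    """Linearly blend a clean background with a reflection image.

    Both inputs are float arrays in [0, 1] with the same shape;
    a larger alpha simulates a stronger reflectance intensity.
    """
    return np.clip((1.0 - alpha) * background + alpha * reflection, 0.0, 1.0)

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 21) -> float:
    """Mean IoU over the classes present in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def robustness_curve(background, reflection, gt_label, segment, alphas):
    """Sweep reflectance intensities and record segmentation quality.

    `segment` stands for any pre-trained semantic segmentation model that
    maps an image to a per-pixel class map (a placeholder, not a real API).
    """
    return {a: mean_iou(segment(blend_with_reflection(background, reflection, a)),
                        gt_label)
            for a in alphas}
```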

3.3 Multi-task Learning for Simultaneous Reflection Removal and Semantic Estimation
From the two studies presented in Sec. 3.1 and Sec. 3.2, we can see the benefit of using semantic information for reflection removal, and we confirm that semantic segmentation is relatively robust to reflection interference of moderate intensity. Our method targets these common cases, leaving the rare extreme cases as future work.
Given an input image $I$ with reflection interference, we perform two tasks: (1) semantic estimation, which extracts the background semantic map $S$ from the input $I$; and (2) layer reconstruction, which recovers the background layer $B$ (as well as the reflection layer $R$) from the input $I$ together with the semantic information $S$ obtained in the first task. This is denoted as:
$S = \mathcal{F}_{\mathrm{sem}}(I), \qquad (B, R) = \mathcal{F}_{\mathrm{rec}}(I, S)$   (2)
Using multi-task learning, we train a convolutional neural network (CNN) to perform these two tasks jointly.
Network architecture. Our SRRN layout is illustrated in Fig. 4, containing a Feature Extraction module, a Layer Reconstruction module, and a Semantic Estimation module. The Feature Extraction module extracts features for the subsequent tasks: we use ResNet-101 [12] as the feature extractor and append the Atrous Spatial Pyramid Pooling (ASPP) module of DeepLab [5] to capture information at different scales. The Semantic Estimation module estimates the semantic information, which is further used in the reflection removal task. Finally, the Layer Reconstruction module utilizes both the extracted features and the semantic information to recover the background and reflection layers. Fully convolutional layers are used for both tasks, and skip connections (green arrows) between the Feature Extractor and the Reconstruction module forward and fuse features from lower levels.
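To make the layout concrete, the following PyTorch-style sketch mirrors the described structure (shared ResNet-101 encoder, a simplified ASPP stand-in, a semantic head, and a reconstruction head fed with the semantic logits). It is our own illustration, not the released SRRN code: the channel widths, fusion by concatenation, and the omission of the skip connections are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class SRRNSketch(nn.Module):
    """Hypothetical sketch of the SRRN layout described in the text."""

    def __init__(self, num_classes: int = 21):
        super().__init__()
        backbone = torchvision.models.resnet101(weights="IMAGENET1K_V1")  # torchvision >= 0.13
        # Shared feature extractor (conv1 .. layer4), frozen as in the paper.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Stand-in for the ASPP module of DeepLab (a single 1x1 conv here).
        self.aspp = nn.Conv2d(2048, 256, kernel_size=1)
        # Semantic estimation head.
        self.semantic_head = nn.Conv2d(256, num_classes, kernel_size=1)
        # Layer reconstruction head: shared features + semantic logits in,
        # background (3 channels) and reflection (3 channels) out.
        self.reconstruct = nn.Sequential(
            nn.Conv2d(256 + num_classes, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 6, 3, padding=1),
        )

    def forward(self, image):
        feat = self.aspp(self.encoder(image))
        sem_logits = self.semantic_head(feat)
        fused = torch.cat([feat, sem_logits], dim=1)   # semantic guidance
        out = self.reconstruct(fused)
        # Upsample predictions back to the input resolution.
        size = image.shape[-2:]
        out = nn.functional.interpolate(out, size=size, mode="bilinear", align_corners=False)
        sem_logits = nn.functional.interpolate(sem_logits, size=size, mode="bilinear", align_corners=False)
        return out[:, :3], out[:, 3:], sem_logits
```

For example, `b, r, s = SRRNSketch()(torch.rand(1, 3, 256, 256))` returns the background, reflection, and semantic logits at the input resolution.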
Loss function design. As we jointly perform two tasks, the final loss is built from both tasks:
$\mathcal{L} = \mathcal{L}_B + \mathcal{L}_R + \frac{1}{\sigma^2}\,\mathcal{L}_S$   (3)
where $\mathcal{L}_B$, $\mathcal{L}_R$ and $\mathcal{L}_S$ are the losses enforced on $B$, $R$ and $S$, respectively, and $\sigma$ is the sub-task's observation-noise parameter: a large $\sigma$ decreases the contribution of $\mathcal{L}_S$, and vice versa. Detailed definitions of $\mathcal{L}_B$, $\mathcal{L}_R$ and $\mathcal{L}_S$ are provided in Sec. 4.2.
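As a hedged illustration of how such an observation-noise weighting can be implemented, the sketch below scales only the sub-task loss by $1/\sigma^2$ and learns $\log\sigma$; the exact formulation in the paper (e.g., whether a $\log\sigma$ regularizer is added, as in Kendall-style uncertainty weighting) may differ.

```python
import torch
import torch.nn as nn

class WeightedMultiTaskLoss(nn.Module):
    """Sketch of combining the background, reflection and semantic losses.

    The semantic (sub-task) loss is scaled by 1 / sigma^2, so a large
    observation-noise parameter sigma reduces its contribution. Learning
    log(sigma) keeps sigma positive.
    """

    def __init__(self):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.zeros(1))  # sub-task noise parameter

    def forward(self, loss_b, loss_r, loss_s):
        weight = torch.exp(-2.0 * self.log_sigma)      # = 1 / sigma^2
        # Uncertainty-weighting formulations often also add self.log_sigma
        # as a regularizer; that term is omitted here.
        return loss_b + loss_r + weight * loss_s
```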
4 Implementation Detail
4.1 Multi-task Information Fusion Study

We design three different models to implement multi-task information fusion between semantic estimation and reflection removal.
Basic guidance. As shown on the left of Fig. 5, in the basic semantic-guided version the semantic map is estimated first, and its features are then merged into the reflection removal branch.
Representation sharing without fusion. To perform reflection removal and semantic estimation simultaneously, we let the two tasks share a common representation, each followed by a task-specific branch; in this way, semantic segmentation and reflection removal are trained jointly. Experiments show that the results of this version are comparable to the state of the art (see Sec. 5 for details). The full SRRN combines both ideas: shared representation plus explicit semantic guidance, as sketched below.
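The toy module below contrasts the two variants; it is a deliberately tiny stand-in (a single-convolution "encoder" and heads) meant only to show where the semantic logits enter the reconstruction path, not to reflect the actual SRRN capacity.

```python
import torch
import torch.nn as nn

class TwoHeadSketch(nn.Module):
    """Toy illustration of the fusion strategies (not the paper's code).

    Both variants share the same feature extractor; they differ only in
    whether the semantic logits are fed back into the reconstruction head.
    """

    def __init__(self, feat_ch=256, num_classes=21, fuse=True):
        super().__init__()
        self.fuse = fuse
        self.features = nn.Conv2d(3, feat_ch, 3, padding=1)       # stand-in encoder
        self.semantic_head = nn.Conv2d(feat_ch, num_classes, 1)
        rec_in = feat_ch + (num_classes if fuse else 0)
        self.reconstruct = nn.Conv2d(rec_in, 6, 3, padding=1)      # B (3 ch) + R (3 ch)

    def forward(self, image):
        feat = torch.relu(self.features(image))
        sem = self.semantic_head(feat)
        rec_in = torch.cat([feat, sem], dim=1) if self.fuse else feat
        out = self.reconstruct(rec_in)
        return out[:, :3], out[:, 3:], sem

shared_only = TwoHeadSketch(fuse=False)  # representation sharing without fusion
guided = TwoHeadSketch(fuse=True)        # semantic-guided variants
```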
4.2 Loss Detail
Our background loss $\mathcal{L}_B$ penalizes the difference between the currently estimated background $\hat{B}$ and the corresponding ground truth $B$:
$\mathcal{L}_B = \big(1 - \mathrm{SSIM}(\hat{B}, B)\big) + \|\hat{B} - B\|_1 + \|\nabla \hat{B} - \nabla B\|_F$   (4)
Following [24], we use SSIM (the structural similarity index [32]), and $\|\cdot\|_1$ is the L1 norm. $\nabla$ denotes the Canny operator [8], which is used to constrain the difference between $\hat{B}$ and $B$ at the gradient level, and $\|\cdot\|_F$ is the matrix Frobenius norm.
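A possible implementation of this combined term is sketched below; we substitute a Sobel gradient magnitude for the Canny operator and a simplified, non-windowed SSIM, so it should be read as an approximation of Eqn. 4 rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified (non-windowed) SSIM over whole images in [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def image_gradients(img):
    """Sobel gradient magnitude as a stand-in for the Canny edge operator."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gray = img.mean(dim=1, keepdim=True)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def background_loss(b_pred, b_gt):
    """Sketch of Eqn. 4: SSIM term + L1 term + edge-consistency term."""
    ssim_term = 1.0 - global_ssim(b_pred, b_gt)
    l1_term = (b_pred - b_gt).abs().mean()
    edge_term = torch.norm(image_gradients(b_pred) - image_gradients(b_gt), p="fro")
    return ssim_term + l1_term + edge_term
```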
As the reflection in the input contains less information than the background, the reflection loss $\mathcal{L}_R$ is simply the L1 distance between the estimate $\hat{R}$ and the ground truth $R$:
$\mathcal{L}_R = \|\hat{R} - R\|_1$   (5)
For the semantic task, we use the cross-entropy loss:
$\mathcal{L}_S = -\frac{1}{M}\sum_{m=1}^{M}\sum_{c=1}^{C} s_{m,c}\,\log \hat{s}_{m,c}$   (6)
where $M$ is the size of the training batch, the inner summation runs over the $C$ classes, $\hat{s}$ is the prediction, and $s$ is the ground-truth label.
To prevent over-fitting, we follow the settings in [4] and add L2 regularization on the parameters; the final loss is organized as follows:
$\mathcal{L}_{\mathrm{final}} = \frac{1}{\sigma_B^2}\mathcal{L}_B + \frac{1}{\sigma_R^2}\mathcal{L}_R + \frac{1}{\sigma_S^2}\mathcal{L}_S + \lambda\sum_{j=1}^{N}\theta_j^2$   (7)
where $\lambda\sum_{j=1}^{N}\theta_j^2$ is the L2 regularization term, $N$ is the total number of trainable parameters $\theta$ in SRRN, and each $\sigma$ is the variance of the corresponding loss. We set the values of the $\sigma$ terms and $\lambda$ to balance the loss items in our experiments. We employ ResNet-101 pre-trained on ILSVRC-2012-CLS [7], with the parameters of this part frozen; we train the ASPP module, the Reconstruction module and the Semantic module of SRRN. Convolution weights are initialized as in CZ18 [6]. A momentum optimizer [13] is employed, with the learning rate initially set to 0.007 and decayed every 30000 iterations until reaching 0.0001.
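Putting the pieces together, a hypothetical training loop consistent with this description might look as follows; it reuses the `SRRNSketch` and `background_loss` sketches from the previous sections, and the momentum value, decay factor and `train_loader` are assumptions (the paper only specifies the initial learning rate of 0.007, decay every 30000 iterations, and the final value of 0.0001).

```python
import torch
import torch.nn.functional as F

model = SRRNSketch()                                          # sketch module from Sec. 3.3
params = [p for p in model.parameters() if p.requires_grad]   # backbone is frozen
optimizer = torch.optim.SGD(params, lr=0.007, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30000, gamma=0.5)

for step, (image, b_gt, r_gt, s_gt) in enumerate(train_loader):
    b_pred, r_pred, s_logits = model(image)
    loss_b = background_loss(b_pred, b_gt)                    # Eqn. 4 sketch
    loss_r = (r_pred - r_gt).abs().mean()                     # Eqn. 5
    loss_s = F.cross_entropy(s_logits, s_gt)                  # Eqn. 6
    loss = loss_b + loss_r + loss_s    # plain sum; Eqn. 7 would weight by 1/sigma^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```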

4.3 Training Data Generation
For each sample in our training set, we require four items: (1) the image with reflection $I$, (2) the clear background $B$, (3) the reflection $R$, and most importantly (4) the semantic labels $S$.
To build a dataset to train the proposed model, we make use of two existing datasets and also synthesize our own data. First, we use the dataset proposed by Zhang et al. [31], which contains 110 real image sets with $I$ and $B$ provided. We then generate $S$: we use the state-of-the-art semantic segmentation method DeeplabV3+ [6] to generate semantic labels from the clear background, and further manually fix errors in the generated labels to obtain high-quality annotations. They cover 21 categories and are treated as ground truth in our study.
Second, we use the dataset proposed by Wan et al. [24], which contains 454 image sets with $I$, $B$ and $R$ already provided. We then generate the semantic labels $S$ in the same way as described above.
Noticing that the existing datasets contain only 564 images in total, we additionally generate a synthetic dataset with semantic labels for reflection removal. We use clear images from Pascal VOC [9] as background and reflection, for which semantic ground truth is provided. We then blend a background image and a reflection image together to form the input image. In total, we generate 5000 image sets. Fig. 6 illustrates our generated images, and Table 1 gives a brief summary of all the datasets.
Dataset source | Volume | $R$ | $S$
---|---|---|---
Zhang [31] | 110 | w/o GT | w/o GT
Wan [24] | 454 | GT | w/o GT
Ours | 5000 | GT | GT
Our final dataset is the combination of all three datasets. For each dataset, we randomly choose 80% as the training set. Images are randomly cropped to a fixed size before being fed into the network; a sketch of the synthesis and cropping pipeline is given below.
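This sketch assumes the VOC images have been pre-loaded and resized to a common size larger than the crop; the blending-weight range, the default crop size, and the convention of storing the scaled layers as ground truth are our assumptions rather than the paper's exact recipe.

```python
import random
import numpy as np

def synthesize_sample(images, labels, alpha_range=(0.1, 0.4), crop=256):
    """Build one synthetic training quadruple (I, B, R, S) from Pascal VOC.

    `images` / `labels` stand for pre-loaded, equally sized VOC images and
    their 21-class annotations (hypothetical containers, not a real API).
    """
    b_idx, r_idx = random.sample(range(len(images)), 2)
    background = images[b_idx].astype(np.float32) / 255.0
    reflection = images[r_idx].astype(np.float32) / 255.0
    semantic = labels[b_idx]                       # labels follow the background
    alpha = random.uniform(*alpha_range)
    mixture = np.clip((1.0 - alpha) * background + alpha * reflection, 0.0, 1.0)
    # Random crop, matching the random cropping used for network input.
    h, w = mixture.shape[:2]
    top, left = random.randint(0, h - crop), random.randint(0, w - crop)
    sl = np.s_[top:top + crop, left:left + crop]
    # The scaled terms are stored as layer ground truth so that I = B + R holds.
    return mixture[sl], (1.0 - alpha) * background[sl], alpha * reflection[sl], semantic[sl]

def train_val_split(num_samples, train_ratio=0.8, seed=0):
    """Random 80/20 split of sample indices."""
    idx = list(range(num_samples))
    random.Random(seed).shuffle(idx)
    cut = int(train_ratio * num_samples)
    return idx[:cut], idx[cut:]
```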
5 Experiments
In this section, we first evaluate our approach on single image reflection removal quantitatively and qualitatively against previous methods [16, 10, 31, 28, 24], demonstrating state-of-the-art performance. For the numerical analysis, we employ the peak signal-to-noise ratio (PSNR) and SSIM as evaluation metrics. Second, we analyse the effect of the different parts of SRRN. Next, we run additional experiments on how the reflectance intensity affects the final performance of the semantic segmentation and reflection removal tasks. Finally, we show additional applications of our model and discuss failure cases.
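For reference, PSNR can be computed as below; for SSIM we rely on an off-the-shelf implementation (the exact evaluation scripts used in the paper are not reproduced here).

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio for images in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# SSIM can be taken from scikit-image (recent versions):
# from skimage.metrics import structural_similarity
# score = structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)
```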
5.1 Comparison with Previous Works

[Fig. 7: qualitative comparison on synthetic images. Columns: Input, AN17 [2], FY17 [10], YG18 [28], ZN18 [31], Ours.]

[Fig. 8: qualitative comparison on real images. Columns: Input, B (FY17 [10]), B (ZN18 [31]), R (ZN18 [31]), (Ours), B (Ours), R (Ours).]
We make qualitative and quantitative comparisons with prior works on our dataset. We compare our method with the layer separation method by Li and Brown [16] and the reflection suppression method by Arvanitopoulos et al. [2], using their default parameters. We re-trained the method of Zhang et al. [31] and fine-tuned CEILNet [10] from its released pre-trained model on our training set. We use the pre-trained BDN [28] directly to evaluate on our validation set because its training code has not been published. We sent the test images to the authors of CRRN [24], who kindly provided their results. The quantitative and qualitative comparisons are presented below:
Method | Background SSIM | Background PSNR | Reflection SSIM | Reflection PSNR | Runtime
---|---|---|---|---|---
Input | 0.801 | 19.02 | N/A | N/A | N/A
LB14 [16] | 0.763 | 17.77 | 0.231 | 16.58 | 0.475
AN17 [2] | 0.786 | 19.28 | 0.285 | 15.74 | 99.3
FY17 [10] | 0.820 | 21.65 | N/A | N/A | 0.095
WS18 [24] | 0.812 | 19.03 | N/A | N/A | 0.619
YG18 [28] | 0.800 | 20.03 | 0.221 | 9.75 | 0.024
ZN18 [31] | 0.849 | 22.16 | 0.463 | 18.50 | 0.332
Ours | 0.860 | 23.09 | 0.559 | 20.19 | 0.061
Method | Background SSIM | Background PSNR | Reflection SSIM | Reflection PSNR
---|---|---|---|---
Input | 0.783 | 19.86 | N/A | N/A
FY17 [10] | 0.832 | 22.04 | N/A | N/A
WS18 [24] | 0.725 | 18.98 | N/A | N/A
YG18 [28] | 0.766 | 18.97 | 0.065 | 7.25
ZN18 [31] | 0.852 | 23.14 | 0.420 | 21.60
Ours | 0.886 | 25.53 | 0.654 | 28.51

Method | Background SSIM | Background PSNR | Reflection SSIM | Reflection PSNR
---|---|---|---|---
Input | 0.869 | 22.15 | N/A | N/A
FY17 [10] | 0.873 | 21.87 | N/A | N/A
WS18 [24] | 0.820 | 18.87 | N/A | N/A
YG18 [28] | 0.858 | 21.71 | 0.256 | 8.92
ZN18 [31] | 0.881 | 22.39 | 0.266 | 17.84
Ours | 0.898 | 22.76 | 0.479 | 21.07
Quantitative Comparison: As shown in Table 2, we compare our SRRN with previous works on our dataset; background-only results are reported for the methods [10, 24] that output only the background layer. Results on the two real datasets are shown in Table 3.
Qualitative Comparison: We qualitatively compare the results of our proposed method against previous state-of-the-art methods on synthetic and real-world images with reflection. We present results on synthetic data in Fig. 7 and on real data in Fig. 8.
Next, we measure the running time of prior works and ours and report it in the last column of Table 2. All approaches are tested on images of the same size, using an Intel i7-7700 CPU and a GPU card. The comprehensive comparison is illustrated in Fig. 9.
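Runtime numbers of this kind are sensitive to warm-up and GPU asynchrony; a measurement helper along the following lines (our own sketch, not the paper's timing code) can be used.

```python
import time
import torch

def measure_runtime(model, image, warmup=5, runs=20):
    """Average forward-pass time in seconds (synchronises if on GPU)."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(image)
        if image.is_cuda:
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(image)
        if image.is_cuda:
            torch.cuda.synchronize()
    return (time.time() - start) / runs
```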

5.2 Ablation Study
In this section, to verify the effectiveness of semantic guidance, we re-train the network under several ablations: without semantic information (w/o $S$), without semantic guidance as shown in the middle of Fig. 5 (w/o fusion), and with the ground-truth semantic map provided (with GT $S$) to explore how the quality of the semantic map affects reflection removal. Furthermore, we conduct an ablation on the edge term $\nabla$ in Eqn. 4 (w/o $\nabla$).
As shown in Fig. 10, we observe that with semantic guidance the layers are separated cleanly even where color or structure is ambiguous. We list the numerical results in Table 4 to show the effectiveness of our SRRN; the results also show that SRRN performs well even without extremely high-quality semantic information.

Method | Background SSIM | Background PSNR | Reflection SSIM | Reflection PSNR
---|---|---|---|---
Input | 0.801 | 19.02 | N/A | N/A
w/o $S$ | 0.820 | 20.98 | 0.317 | 15.46
w/o $\nabla$ | 0.833 | 21.91 | 0.451 | 18.53
w/o fusion | 0.854 | 22.97 | 0.513 | 19.33
SRRN | 0.860 | 23.09 | 0.559 | 20.19
with GT $S$ | 0.867 | 23.85 | 0.571 | 19.71
5.3 Exploration of Performance vs the Reflectance
In this section, we examine the relationship between SRRN performance and the reflectance intensity. We generate a series of image quadruples $(I, B, R, S)$ with different blending weights $\alpha$, and report the mIoU of DeeplabV3+ [6] as well as the SSIM/PSNR of our method and of the baseline [31] on these images. As presented in Fig. 11, the proposed SRRN achieves higher scores than the baseline in most cases across different $\alpha$ values. Furthermore, SRRN is more robust to different reflectance intensities, as illustrated in Fig. 12.


5.4 Extended Applications
We extend our method to two other image enhancement tasks, image dehazing and color enhancement, using our trained SRRN without any fine-tuning on dehazing or color enhancement datasets. Both tasks can be treated as image layer separation, where the semantic segmentation module provides guidance for reconstructing color and structure priors. For image dehazing, we aim to remove the haze layer, which causes visibility degradation due to particle-scattered light. For color enhancement, we aim to recover the scene colors from color shifting, contrast loss and saturation attenuation. The results are presented in Fig. 13.

5.5 Failure Cases and Discussion
Although SRRN achieves state-of-the-art results on these three datasets, challenging cases remain, as illustrated in Fig. 14. One such scenario occurs when the reflection in the input is too strong and the background is heavily contaminated, so that our model may not separate the layers successfully. Note that the reflections cannot be totally removed by any of these methods, but our result is still superior to [31] (e.g., the person in the background is more distinguishable and the reflection layer is cleaner).
6 Conclusion
In this paper, we have presented an approach that uses semantic cues for single image reflection separation. Unlike prior works that use only low-level information, we employ semantic information as guidance to extract the background layer and the reflection layer. We design a deep encoder-decoder network for image feature extraction and use a semantic segmentation network in parallel. With the two kinds of information fused together, our separation network can correctly separate the background and reflection layers. We evaluate our method against prior works extensively on three different datasets, and the comparison results show that our approach outperforms existing methods both quantitatively and visually on all three datasets.
References
- [1] A. Agrawal, R. Raskar, S. K. Nayar, and Y. Li. Removing photography artifacts using gradient projection and flash-exposure sampling. TOG, 24(3):828–835, 2005.
- [2] N. Arvanitopoulos, R. Achanta, and S. Susstrunk. Single image reflection suppression. In CVPR, 2017.
- [3] A. S. Baslamisli, H.-A. Le, and T. Gevers. Cnn based learning using reflection and retinex models for intrinsic image decomposition. In CVPR, 2018.
- [4] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 40(4):834–848, 2016.
- [5] L. C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587v3, 2017.
- [6] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
- [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
- [8] L. Ding and A. Goshtasby. On the canny edge detector. Pattern Recognition, 34(3):721–725, 2001.
- [9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
- [10] Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf. A generic deep architecture for single image reflection removal and image smoothing. In ICCV, 2017.
- [11] X. Guo, X. Cao, and Y. Ma. Robust separation of reflection from multiple images. In CVPR, 2014.
- [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [13] I. Sutskever, J. Martens, G. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
- [14] N. Kong, Y. W. Tai, and J. S. Shin. A physically-based approach to reflection separation: from physical modeling to constrained optimization. TPAMI, 36(2):209–221, 2014.
- [15] Y. Li and M. S. Brown. Exploiting reflection change for automatic reflection removal. In ICCV, 2013.
- [16] Y. Li and M. S. Brown. Single image layer separation using relative smoothness. In CVPR, 2014.
- [17] A. Nandoriya, M. Elgharib, C. Kim, M. Hefeeda, and W. Matusik. Video reflection removal through spatio-temporal optimization. In ICCV, 2017.
- [18] T. Sandhan and Y. C. Jin. Anti-glare: Tightly constrained optimization for eyeglass reflection removal. In CVPR, 2017.
- [19] Y. Y. Schechner, N. Kiryati, and R. Basri. Separation of transparent layers using focus. In ICCV, 1998.
- [20] Y. Shih, D. Krishnan, F. Durand, and W. T. Freeman. Reflection removal using ghosting cues. In CVPR, 2015.
- [21] C. Simon and I. K. Park. Reflection removal for in-vehicle black box videos. In CVPR, 2015.
- [22] S. N. Sinha, J. Kopf, M. Goesele, D. Scharstein, and R. Szeliski. Image-based rendering for scenes with reflections. TOG, 31(4):1–10, 2012.
- [23] R. Wan, B. Shi, L. Y. Duan, A. H. Tan, and A. C. Kot. Benchmarking single-image reflection removal algorithms. In IEEE ICCV, 2017.
- [24] R. Wan, B. Shi, L.-Y. Duan, A.-H. Tan, and A. C. Kot. Crrn: Multi-scale guided concurrent reflection removal network. In CVPR, 2018.
- [25] P. Wieschollek, O. Gallo, J. Gu, and J. Kautz. Separating reflection and transmission images in the wild. In ECCV, 2018.
- [26] P. Wieschollek, O. Gallo, J. Gu, and J. Kautz. Separating reflection and transmission images in the wild. In ECCV, 2018.
- [27] T. Xue, M. Rubinstein, C. Liu, and W. T. Freeman. A computational approach for obstruction-free photography. TOG, 34(4):1–11, 2015.
- [28] J. Yang, D. Gong, L. Liu, and Q. Shi. Seeing deeply and bidirectionally: A deep learning approach for single image reflection removal. In ECCV, 2018.
- [29] J. Yang, H. Li, Y. Dai, and R. T. Tan. Robust optical flow estimation of double-layer images under transparency or reflection. In CVPR, 2016.
- [30] J.-S. Yun and J.-Y. Sim. Reflection removal for large-scale 3d point clouds. In CVPR, 2018.
- [31] X. Zhang, R. Ng, and Q. Chen. Single image reflection separation with perceptual losses. In CVPR, 2018.
- [32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.