It is often desirable to remove glass reflection as it may contaminate the visual quality of a photograph. Reflection separation is also arguably important for robots to work robustly in the real world as the content in reflection usually does not exist in the viewing frustum of a camera. One intriguing property of reflection is that reflected light is often polarized, which may facilitate reflection removal. In this paper, we study reflection removal with polarized sensors by designing a customized deep learning framework.
An image with reflection is a mixture of reflection and transmission, as shown in Fig. 1. In raw data space, the mixed image $M$ can be formulated as
$$M = T + R, \tag{1}$$
where $T$ and $R$ are transmission and reflection, respectively. We name the light behind glass as background $B$ and the light that passes through glass as transmission $T$. Although most prior work treats $T$ as the same as $B$ [34, 28], we argue that $T$ and $B$ are different. $T$ is darker than $B$ as some light is reflected or absorbed by glass, and there is a spatial shift between $T$ and $B$ due to refraction.
A common issue of many existing reflection removal methods [34, 32, 7, 31, 24] is that they impose strict assumptions on the reflection. These assumptions make the methods work well in special cases but fail in many others. For example, many works assume that reflection images are out of focus [7, 34]. As a result, these approaches may not remove reflection properly when the reflection is sharp and strong. Another common assumption relies on ghost cues that result from multiple reflections inside thick glass. However, ghost cues do not exist in thin glass.
The lack of diverse and high-quality real-world data is another challenge. Zhang et al. and Wei et al. have collected a small set of real-world data where only background images ($B$ in Fig. 1 and Fig. 4) are captured as the ground-truth transmission images. However, background images are not perfectly aligned with the mixed images due to refraction, and they also suffer from intensity decay ($T$ appears darker than $B$) and color distortion (colored glass). Misalignment makes it hard to train a machine learning model, and the intensity difference makes it even harder. Moreover, since the type of reflection depends on the glass type and only one type of glass is used to collect data, models trained on these data cannot generalize well to other types of glass.
To relax the assumptions about the appearance of reflection, we leverage polarization, which inherently exists in almost all reflected light. Fig. 2 shows an example polarized image. Existing works based on polarization often impose strict assumptions. A common one is that all light sources are unpolarized, which is easily violated in the real world because reflection happens on many types of surfaces in addition to glass, and polarized or partially polarized light sources, such as LED lights, are common. As can be seen in Fig. 2, polarization exists both inside and outside the glass, so we cannot rely solely on this information. To handle the case where the transmission is also polarized, our work removes this assumption. Therefore, our method is more general and applicable to more scenarios.
To ensure the diversity and quality of real-world data, we propose a new data collection pipeline called M-R, based on the principle that raw image space is linear. We capture $M$ and $R$ only and obtain the transmission through $T = M - R$. Note that we capture the raw sensor data so that Eq. 1 holds. Our formulation is physically faithful to image formation and eases the process of data collection. With our M-R pipeline, it is easy to capture reflection caused by a diverse set of glass. We use the M-R pipeline to build a real-world polarization dataset, collected with a polarization sensor, for reflection removal.
With the collected dataset, we propose a two-stage framework for reflection removal from polarized images. Our approach first estimates the reflection and then uses it to infer the transmission image. Our PNCC (perceptual NCC) loss is used to minimize the similarity between the output reflection and transmission. Experiments demonstrate that our method achieves state-of-the-art performance on various metrics. The ablation study shows that our approach benefits from polarized data, PNCC, and the two-stage framework design. Our contributions are summarized as follows:
- We observe two important factors for the task of reflection removal: 1) the difference between transmission and background is noticeable; 2) the linearity from reflection to mixed image holds on raw data.
- We design a new data collection pipeline called M-R, which helps us collect diverse real-world data with perfect alignment by utilizing glass in the real world.
- We propose a deep learning method for reflection removal based on polarization data. Our method does not impose any assumption on the appearance of reflection. A two-stage framework is adopted to get better performance. We design a PNCC loss, which can be applied to many image decomposition tasks. Experiments show that our method outperforms the compared state-of-the-art methods and generalizes better.
[Figure: panels RGB M, ISP M-R, Gamma M-R, Raw M-R, Pol M, ISP M-R, Raw M-R; second row RGB R and Pol R with close-ups.]
2 Related Work
Single image reflection removal.
Most single image reflection removal methods [7, 34, 31, 32] rely on various assumptions. Considering image gradients, Arvanitopoulos et al. propose the idea of suppressing the reflection, and Yang et al. propose a faster method based on convex optimization. These methods fail to remove sharp reflection. Under the assumption that transmission is always in focus, Punnappurath et al. design a method based on dual-pixel camera input. For most deep learning based approaches, training data is critical for good performance. CEILNet, Zhang et al. and BDN assume reflection is out of focus and synthesize images to train their neural networks. CEILNet estimates target edges first and uses them as guidance to predict the transmission layer. Zhang et al. use perceptual and adversarial losses to capture the difference between reflection and transmission. BDN estimates the reflection image, which is then used to estimate the transmission layer. These methods [34, 7, 31] work well when reflection is more defocused than transmission but fail otherwise. To break the limitation of using solely synthetic data, Zhang et al. and Wei et al. collect real-world datasets for training. However, their datasets have misalignment issues and do not contain sufficient diversity. Wei et al. propose to calculate losses on high-level features that are less sensitive to small misalignment. To obtain more realistic and diverse data, Wen et al. and Ma et al. propose methods to synthesize data using a deep neural network and achieve better performance and generalization. Though the data is more perceptually appealing, its physical authenticity remains in doubt.
Polarization-based reflection removal.
Early methods utilize independent component analysis to separate reflection and transmission images. With the assumption of unpolarized light sources, Kong et al. propose an optimization method to automatically find the optimal separation of the reflection and transmission layers. Wieschollek et al. combine deep learning with a polarization-based reflection removal method. Different from previous works, they eliminate a number of assumptions (e.g., the glass must be perfectly flat) and propose a pipeline to synthesize data with polarization information from regular RGB images. However, all light sources are still assumed to be unpolarized.
Multi-image reflection removal.
Polarization-based reflection removal methods are a special category of multi-image approaches. Agrawal et al. use a pair of flash/no-flash images. Many works [26, 22, 21, 16, 10, 11, 30] move the camera to exploit the relative motion between reflection and transmission, and most of them assume that the motion of the reflection layer is larger than that of the transmission layer. Sarel and Irani [21, 22] assume that both reflection and transmission are static. Li et al. use SIFT-flow to align the images and make a pixel-wise comparison under the assumption that the background dominates in the mixed image. Xue et al. also require that objects in reflection and transmission are roughly static. Han et al. require the transmission to be more dominant than the reflected scenes.
|Dataset|Glass type|Data format|Scene|Alignment|Intensity decay|Raw|
|Zhang et al.|1| |110|Misalignment (calibrated)|Yes|No|
|Wei et al.| | | | | | |
|SIR benchmark|3| |100+20+20|Misalignment (calibrated)|Yes|No|
3 M-R Dataset
Real-world reflection removal datasets [34, 28] are limited in quantity and diversity because of the complicated data collection procedure and the difficulty of acquiring ground-truth reflection and transmission. We propose a new method named M-R to collect paired data for reflection removal. A triple $\{M, R, T\}$ is collected for each scene, where $M$, $R$, and $T$ are the mixed image, the reflection image, and the transmission image, respectively.
We use the PHX050S-P polarization camera, which is equipped with an IMX250MZR CMOS sensor. This sensor captures an image with four different polarizer angles in a single shot. Each polarization pixel consists of $2 \times 2$ units with four sub-pixels corresponding to the polarization angles $0^\circ$, $45^\circ$, $90^\circ$, and $135^\circ$. The light intensity passing through a polarizer follows Malus' law:
$$I_{out} = I_{in} \cos^2(\alpha - \phi), \tag{2}$$
where $\alpha$ is the angle of the polarizer and $\phi$ is the polarization angle of the incoming light. Note that the equations related to polarization hold only for raw data that is linear to light intensity, and thus we adopt the RAW format in our dataset. The resolution of each captured RAW image is $2448 \times 2048$. We extract sub-pixels with the same polarization angle to form an image, which gives four images with resolution $1224 \times 1024$. The value range of each pixel is from 0 to 4095. Let $I$ be the light intensity, and let $I_{unpol}$ and $I_{pol}$ be the intensities of the unpolarized and linearly polarized light. The degree of polarization $\rho$ equals $I_{pol} / I$. We define $I_1, I_2, I_3, I_4$ as the light intensity passed through the four angles $0^\circ, 45^\circ, 90^\circ, 135^\circ$. According to the properties of polarization, these four measurements determine the polarization quantities given in Eq. 3, 4, 5, 6.
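As a rough illustration of how four polarizer measurements relate to the polarization state, the sketch below uses the standard Stokes-parameter relations for an ideal linear polarizer. It is not necessarily identical to Eq. 3 to 6 of this paper. The sub-pixel layout in `LAYOUT`, the function names, and the synthetic values are all assumptions made for the example.

```python
import numpy as np

# Assumed 2x2 sub-pixel layout: polarizer angle in degrees -> (row offset, column offset).
# This layout is illustrative only; the real one comes from the sensor documentation.
LAYOUT = {90: (0, 0), 45: (0, 1), 135: (1, 0), 0: (1, 1)}


def split_mosaic(raw, layout=LAYOUT):
    """Split a polarization mosaic into four half-resolution images, one per angle."""
    return {angle: raw[r::2, c::2].astype(np.float64) for angle, (r, c) in layout.items()}


def through_polarizer(i_unpol, i_pol, phi, alpha):
    """Intensity behind an ideal linear polarizer at angle alpha (radians).

    The unpolarized part is halved; the linearly polarized part follows Malus' law.
    """
    return 0.5 * i_unpol + i_pol * np.cos(alpha - phi) ** 2


def polarization_quantities(i0, i45, i90, i135, eps=1e-8):
    """Recover total intensity, degree of polarization, and polarization angle."""
    total = 0.5 * (i0 + i45 + i90 + i135)   # each orthogonal pair sums to the total
    s1 = i0 - i90                           # equals i_pol * cos(2 * phi)
    s2 = i45 - i135                         # equals i_pol * sin(2 * phi)
    rho = np.sqrt(s1 ** 2 + s2 ** 2) / (total + eps)
    phi = 0.5 * np.arctan2(s2, s1)
    return total, rho, phi


# Round trip on one synthetic pixel: 0.6 unpolarized plus 0.4 polarized at 0.3 rad.
angles = np.deg2rad([0.0, 45.0, 90.0, 135.0])
i0, i45, i90, i135 = (through_polarizer(0.6, 0.4, 0.3, a) for a in angles)
total, rho, phi = polarization_quantities(i0, i45, i90, i135)
print(total, rho, phi)
```

In this round trip the recovered values should match the synthetic total intensity, degree of polarization, and angle up to numerical error, which is a quick sanity check before applying the same functions to real sub-images from `split_mosaic`.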
[Figure: panels Zhang et al., Wei et al., and Ours.]
Data collection pipeline
Fig. 4 shows the comparison between our pipeline and previous ones [34, 28, 27]. Previous methods take a photo in front of glass as the mixed image and then remove the glass to take another one as the transmission, so the difference between background and transmission is ignored. As mentioned before, $M$ is the sum of $T$ and $R$ (not $B$). Therefore, inferring $T$ is believed to be easier than inferring $B$. However, it is relatively difficult to capture $T$ directly because all the reflection must be blocked. Therefore, we capture $M$ and $R$ only and then obtain $T = M - R$.
While prior work [29, 17] claims that the combination of reflection and transmission is beyond linearity, we argue that the non-linearity is introduced by the ISP pipeline when operating in RGB space. There is no such problem for raw data, since the voltage on the sensor is linearly correlated with the intensity of light. Therefore, Eq. (1) holds, and we can obtain a transmission image directly by $T = M - R$. Fig. 3 shows the difference between RGB data and raw data. Our formulation conforms with reality, and the direct subtraction on raw data removes the reflection. To the best of our knowledge, we are the first to use $M - R$ as ground truth on raw data.
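The linearity argument can be checked numerically with a toy example. The sketch below uses synthetic linear images and a plain gamma curve as a stand-in for an ISP tone curve; the curve, the value 2.2, and all names are assumptions chosen for illustration rather than the camera's actual processing.

```python
import numpy as np


def gamma_encode(raw, gamma=2.2):
    """Toy stand-in for an ISP tone curve: maps linear [0, 1] values to display space."""
    return np.clip(raw, 0.0, 1.0) ** (1.0 / gamma)


rng = np.random.default_rng(0)
T = rng.uniform(0.0, 0.5, size=(4, 4))   # synthetic linear transmission
R = rng.uniform(0.0, 0.5, size=(4, 4))   # synthetic linear reflection
M = T + R                                # Eq. 1 in raw space

# In raw space, subtracting R from M recovers T up to floating-point error.
raw_error = np.abs((M - R) - T).max()

# After a non-linear tone curve, the same subtraction no longer matches the encoded T.
isp_error = np.abs((gamma_encode(M) - gamma_encode(R)) - gamma_encode(T)).max()

print(raw_error, isp_error)
```

The raw-space error stays at floating-point precision, while the error after the tone curve is clearly non-zero, which mirrors the contrast between the raw and RGB subtractions described above.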
To ensure perfect alignment between $M$ and $R$, we fix the camera on a tripod and capture the polarized images remotely from a computer. We first cover the back of the glass with a piece of black cloth to block all transmission and obtain a clean reflection $R$. Then we remove the cloth to collect the mixed image $M$. To ensure that the intensity of the reflection is the same in $M$ and $R$, we set the camera to manual mode and use a relatively long exposure time to avoid noise.
Analysis of M-R
Table 1 shows the comparison between our dataset and previous datasets. Compared with previous methods, M-R has the following advantages:
a) More diversity. Previous methods require the glass to be thin, non-colored, removable, and flat. As long as the transmission is clear, we do not make such assumptions on the glass. Therefore, we can utilize many kinds of glass in daily life, such as glass doors and windows. The glass can be flat or curved, thin or thick, colored or non-colored. We are even able to record dynamic scenes if the reflection is static.
b) Simplified task. Since $B$ might differ from $T$ in color, intensity, and position, using $B$ as ground truth introduces extra problems in reflection removal. Estimating $T$ is an equally useful and simpler task. Our dataset provides perfectly aligned paired data.
c) Improved simulation. Even with our method, collecting paired data is time-consuming. Since previous methods have the misalignment problem, they cannot correctly obtain $R$ by $M - B$. Besides, they use RGB images instead of raw images, so non-linearity in intensity is introduced. Based on the linearity discussed above, we can use Eq. 1 directly to simulate various realistic data by combining unpaired $T$ and $R$ with scaling factors varied over a range.
To improve the quality of our dataset, we calculate the mean intensity ratio for each pair of $R$ and $T$, and discard the pair if the ratio is greater than 10 or smaller than 0.1, because in this situation either $R$ or $T$ is perceptually invisible. Negative values after subtraction, which are caused by noise, are set to zero. If there is more than one layer of glass, we crop the image to keep only the part with a single layer. Polarization can be calculated correctly only if each polarization image is correct. Hence, we need to pay special attention to overexposed areas. We calculate an overexposure mask $O$ by comparing the intensity of $M$ with a threshold $\delta$.
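The cleaning steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the direction of the ratio (reflection over transmission), the mask convention (1 for well-exposed pixels), the per-pixel maximum over the four polarization images, and the value passed as `delta` are all hypothetical choices, since the exact mask formula and threshold are not given here.

```python
import numpy as np


def make_transmission(mixed, reflection):
    """T = M - R in raw space; negative values caused by noise are set to zero."""
    return np.clip(mixed.astype(np.float64) - reflection.astype(np.float64), 0.0, None)


def keep_pair(reflection, transmission, low=0.1, high=10.0, eps=1e-8):
    """Keep a pair only if the mean intensity ratio lies within [low, high]."""
    ratio = reflection.mean() / (transmission.mean() + eps)
    return bool(low <= ratio <= high)


def overexposure_mask(mixed_stack, delta):
    """mixed_stack has shape (4, H, W), one image per polarizer angle.

    Returns 1.0 where all four measurements are below delta (assumed convention).
    """
    return (mixed_stack.max(axis=0) < delta).astype(np.float64)


mixed = np.array([[100.0, 4095.0], [50.0, 10.0]])
reflection = np.array([[40.0, 2000.0], [60.0, 10.0]])

transmission = make_transmission(mixed, reflection)
usable = keep_pair(reflection, transmission)
mask = overexposure_mask(np.stack([mixed] * 4), delta=4000.0)  # delta is a made-up value
print(transmission, usable, mask)
```

Because the ratio bounds are reciprocal (0.1 and 10), choosing transmission over reflection instead would give the same keep-or-discard decision apart from the small `eps` term.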
4.1 Reflection-Based Framework
Unpolarized light reflected from the glass surface or passed through the glass becomes partially polarized. The degree of polarization $\rho$ depends on the property of the glass and the angle of incidence. For a specific type of glass with a given refractive index, Fig. 6 shows how the degree of polarization changes. Based on this fact, Kong et al. and Wieschollek et al. propose two methods for reflection removal. However, in the real world, the assumption of unpolarized light sources does not hold well because partially polarized light sources are common, and reflection occurs not only on glass surfaces. These methods [19, 14] would then fail. Different from Wieschollek et al. and Kong et al., we do not assume that all light sources are unpolarized. We utilize the fact that the polarization of reflection and transmission is quite different. Hence we propose a deep learning based, two-stage method to capture the differences between reflection and transmission and separate them.
Fig. 5 shows an overview of our framework. Our method takes a multi-channel image as input. The first 4 channels, $I_1, I_2, I_3, I_4$, are extracted from the mixed image $M$, one for each polarization angle. The next 4 channels are the polarization quantities calculated from Eq. 3, 4, 5, 6. The final network output is a one-channel image, the recovered transmission $\hat{T}$, which has the same size as each polarization image, that is, half of $M$ in width and height.
There are two stages in our process. The first stage is dedicated to estimating the reflection $\hat{R}$, and the second estimates the transmission $\hat{T}$ given the estimated $\hat{R}$. We use a two-stage design for two reasons. First, the reflection contributes substantially to the mixed image and is tied to it by a strict relationship in RAW space (Eq. 1). Second, as discussed above, reflection and transmission are quite different in terms of polarization, so separate decoders help to learn features specific to each. BDN also observes the importance of reflection and improves performance by training a bidirectional network. However, its performance relies on an assumption that makes $R$ and $T$ more different: the reflection is blurry. As a result, the model cannot distinguish $R$ and $T$ well when the reflection is sharp. Note that without polarization, such a design may deteriorate the performance, as the difference between them becomes subtle in regular image data.
4.2 Loss function
In general, reflection and transmission images differ on most pixels. We propose a perceptual normalized cross-correlation (PNCC) loss to minimize the correlation between the estimated reflection and transmission on different feature maps of VGG-19. Given two images, we calculate the NCC of their feature maps. In practice, the monotonicity of this measure breaks down in extreme cases where the intensities of the two images differ greatly. Therefore, we normalize them first. The PNCC loss is then the NCC between the feature maps of the two normalized images, accumulated over layers, where $v_l$ denotes the $l$-th layer feature maps of VGG-19. In practice, we use three layers: 'conv2_2', 'conv3_2', and 'conv4_2'. PNCC can also be applied using another pre-trained neural network.
To verify this behavior, we compare two images while a mixing coefficient is sampled from 0.01 to 1. When the two images are completely different, PNCC is the lowest. When one image contains most of the other, PNCC is the largest, whereas the non-normalized version is not. Our PNCC loss can also be applied to other image decomposition tasks. More results are demonstrated in the experiments.
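To make the idea of a correlation loss on feature maps concrete, here is a small NumPy sketch. It is only an approximation of the idea: `toy_features` is a hypothetical stand-in for VGG-19 feature maps (the image and its gradients), min-max scaling is an assumed normalization, and averaging the per-layer NCC is an assumed way of combining layers.

```python
import numpy as np


def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation between two arrays of the same shape."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps))


def normalize01(x, eps=1e-8):
    """Assumed normalization: rescale an image to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + eps)


def toy_features(img):
    """Hypothetical stand-in for deep feature maps: the image and its two gradients."""
    grad_rows, grad_cols = np.gradient(img)
    return [img, grad_rows, grad_cols]


def pncc(x, y, features=toy_features):
    """Average NCC between corresponding feature maps of two normalized images."""
    feats_x = features(normalize01(x))
    feats_y = features(normalize01(y))
    return float(np.mean([ncc(p, q) for p, q in zip(feats_x, feats_y)]))


rng = np.random.default_rng(0)
image_a = rng.uniform(size=(64, 64))
image_b = rng.uniform(size=(64, 64))

same = pncc(image_a, image_a)                 # identical content
scaled = pncc(image_a, 3.0 * image_a + 2.0)   # same content, different intensity
unrelated = pncc(image_a, image_b)            # independent content
print(same, scaled, unrelated)
```

With these linear toy features the normalization step has no effect, since NCC is already invariant to intensity scaling; it matters once the features come from a non-linear network such as VGG-19. To use the score as a loss, a feature extractor from a differentiable framework would replace `toy_features`.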
The perceptual loss has been proven effective on various computer vision tasks [15, 34, 5]. In our task, we modify it to account for the overexposed area. Given the overexposure mask $O$, the perceptual loss is defined as the weighted sum over layers of the distance between the VGG-19 feature maps of the prediction and the ground truth, computed only where the mask marks the pixels as not overexposed. Here $\lambda_l$ is the weight for the $l$-th layer. Following Chen et al., we initialize $\lambda_l$ based on the number of parameters in each layer, and we adopt 6 layers: 'conv1_1', 'conv1_2', 'conv2_2', 'conv3_2', 'conv4_2', and 'conv5_2'.
In total, the loss function we optimize is the sum of the PNCC loss between $\hat{R}$ and $\hat{T}$ and the perceptual loss.
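A masked, layer-weighted feature loss of this kind can be sketched as below. The L1 distance, the mask convention (1 for valid pixels), the normalization by the number of valid pixels, and the assumption that every feature map has the same spatial size as the mask are simplifications made for the example; with real VGG-19 features the mask would need to be resized per layer.

```python
import numpy as np


def masked_perceptual_loss(pred_feats, gt_feats, mask, weights, eps=1e-8):
    """Weighted L1 distance between feature maps, ignoring masked-out pixels.

    pred_feats, gt_feats: lists of (H, W) arrays, one per layer (toy features).
    mask: (H, W) array with 1 for valid pixels and 0 for overexposed pixels.
    weights: one scalar weight per layer.
    """
    valid = mask.sum() + eps
    loss = 0.0
    for f_pred, f_gt, weight in zip(pred_feats, gt_feats, weights):
        loss += weight * (np.abs(f_pred - f_gt) * mask).sum() / valid
    return float(loss)


def total_loss(pncc_value, perceptual_value):
    """Sum of the two terms, as described in the text."""
    return pncc_value + perceptual_value


ground_truth = [np.zeros((2, 2))]
prediction = [np.array([[1.0, 0.0], [0.0, 0.0]])]

# The only wrong pixel is overexposed, so it does not contribute.
ignored = masked_perceptual_loss(prediction, ground_truth,
                                 np.array([[0.0, 1.0], [1.0, 1.0]]), weights=[2.0])
# With a full mask, the same error is counted.
counted = masked_perceptual_loss(prediction, ground_truth, np.ones((2, 2)), weights=[2.0])
print(ignored, counted, total_loss(0.1, counted))
```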
To improve the performance of our model, we augment the input to the network with hypercolumn features extracted from the VGG-19 network. In particular, we extract 'conv1_2' from the VGG-19 network for the input and upsample the layers bilinearly to match the resolution of the input image. Since our data is in RAW format and the pre-trained VGG-19 was trained on the ImageNet dataset in RGB space, we first apply a gamma correction to the raw input and then feed it into the network. We adopt U-Net as the network architecture for both stages. We modify the kernel size of the first layer and use it to reduce the dimensionality of the augmented input. During training, we first train both stages together for 200 epochs using the Adam optimizer with a learning rate of 0.0001. Then we decay the learning rate to 0.00001 and train for 50 more epochs.
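The two-phase schedule described above can be written as a small helper. The function name and the zero-based epoch indexing are assumptions; the epoch counts and learning rates are the ones stated in the text.

```python
def learning_rate(epoch, base_lr=1e-4, decayed_lr=1e-5, decay_epoch=200, total_epochs=250):
    """Learning rate for a zero-based epoch index: 200 epochs at 1e-4, then 50 at 1e-5."""
    if not 0 <= epoch < total_epochs:
        raise ValueError(f"epoch must be in [0, {total_epochs}), got {epoch}")
    return base_lr if epoch < decay_epoch else decayed_lr


schedule = [learning_rate(epoch) for epoch in range(250)]
print(schedule[0], schedule[199], schedule[200], schedule[249])
```

In a training loop, the returned value would be assigned to the optimizer before each epoch.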
5.1 Experimental procedure
We compare our method with several state-of-the-art reflection removal approaches, including both deep learning and traditional methods. Specifically, in the deep learning track, we choose Zhang et al., Wei et al., BDN, Wieschollek et al., and Wen et al. For fairness, we re-train the models of Zhang et al. and Wei et al. on our M-R dataset using their official source code. For BDN and Wieschollek et al., we directly use the available pre-trained models since no training code is available. For Wen et al., as their training requires additional alpha matting masks that are not available in our task, we also use their pre-trained model.
For polarization-based methods, we choose Kong et al., Schechner et al. and Farid et al. Third-party implementations by Wieschollek et al. are used. We also evaluate the convex optimization based method by Yang et al. using their official source code.
DoubleDIP is an unsupervised image decomposition model, but it fails in our setting. A possible reason is that DoubleDIP assumes that a mixed image is composed of two images with spatially invariant coefficients, and real-world data break this assumption.
|Method|Transmission PSNR|Transmission SSIM|Reflection PSNR|Reflection SSIM|
|Farid et al.**|21.99|0.714|6.48|0.241|
|Schechner et al.**|23.42|0.655|12.40|0.247|
|Kong et al.**|18.76|0.402|12.96|0.271|
|Yang et al.|25.42|0.780|-|-|
|Wieschollek et al.*|22.15|0.711|15.93|0.462|
|Wen et al.*|26.62|0.827|-|-|
|Wei et al.|30.13|0.899|-|-|
|Zhang et al.|31.91|0.903|32.02|0.88|
|Ours (3 inputs)|33.91|0.930|33.53|0.903|
* uses the released pre-trained model; ** uses a third-party implementation.
The experiments are mainly conducted on our M-R dataset since it is the only available raw image dataset. We select 100 and 107 pairs of data as the validation set and the testing set, respectively. All data are stored in the 16-bit PNG format to avoid precision loss.
Most existing works train their models in RGB space. To minimize the gap between training and testing data for these methods, we average the intensity of $I_1, I_2, I_3, I_4$ and apply gamma correction before feeding the result to their models. Note that the domain gap between RGB images and gray images may degrade the performance of some methods. All the input images and results are saved as 16-bit PNG or NPY files to avoid accuracy loss.
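The conversion for the RGB-trained baselines can be sketched as follows. The gamma value of 2.2 is a placeholder assumption, as are the function name and the rounding to 16-bit integers; the 12-bit input range (0 to 4095) follows the sensor description.

```python
import numpy as np


def to_baseline_input(i0, i45, i90, i135, gamma=2.2, max_value=4095.0):
    """Average the four polarization images, gamma-correct, and quantize to 16 bits.

    gamma=2.2 is a placeholder; max_value matches the 12-bit raw range.
    """
    mean = (i0 + i45 + i90 + i135) / 4.0
    linear = np.clip(mean / max_value, 0.0, 1.0)
    encoded = linear ** (1.0 / gamma)
    return np.round(encoded * 65535.0).astype(np.uint16)


dark = np.zeros((2, 2))
bright = np.full((2, 2), 4095.0)

gray_dark = to_baseline_input(dark, dark, dark, dark)
gray_bright = to_baseline_input(bright, bright, bright, bright)
gray_mixed = to_baseline_input(dark, bright, dark, bright)
print(gray_dark.dtype, gray_dark[0, 0], gray_bright[0, 0], gray_mixed[0, 0])
```

The resulting array can be written with any image library that supports 16-bit PNG, or stored directly with `np.save`.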
5.2 Comparisons with baselines
Table 2 summarizes the evaluation results on our dataset. Our method achieves the best scores on all reported metrics. The performance of traditional polarization-based methods [23, 14, 8] ranks low since their assumption that all light sources are unpolarized is oversimplified for real-world data. An interesting phenomenon is that BDN scores badly in reflection despite its bidirectional network design. After analysis, we find that BDN confuses transmission and reflection in many cases, which affects the performance significantly. The scores of Zhang et al. and Wei et al. are the closest to ours. In addition to being retrained on our dataset, another common characteristic of the two methods is that they are designed for not only synthetic data but also real data.
Fig. 8 shows several samples produced by different methods in different situations. We choose the best two single image models and the best polarization method for perceptual comparisons. As seen in Fig. 8, our method handles different types of reflection well and removes the reflection without introducing artifacts. Wieschollek et al. can also remove different types of reflection based on polarization, but their results have visible artifacts, and they even amplify the reflection in the third case. For Zhang et al. and Wei et al., the results have visible residual reflection. Fig. 9 shows a hard case where the mixed image is a bit blurry. Previous methods [32, 31, 28, 29] that assume the reflection is blurry perform poorly and tend to remove too much content. Our result shows better generalization without such an assumption. Our model also achieves good performance on curved glass and on non-ideal data collected by Wieschollek et al., as shown in Fig. 11 and Fig. 12.
5.3 Ablation study
[Figure: panels Input, without pol, with pol; second row GT R, without pol, with pol.]
To study the influence of polarization information, we replace all the input channels with the intensity image and keep the network structure the same. To study the effect of our two-stage structure, we remove the loss on the estimated reflection. Finally, we conduct an experiment without PNCC. The results are shown in Table 3. Polarization information improves the performance most. Fig. 10 shows a sample: without the support of polarization information, the model confuses reflection and transmission. The two-stage design also boosts the performance of our model by a large margin. Our proposed PNCC further increases the performance of our model on reflection removal.
As an additional evaluation, we compare PNCC with the exclusion loss proposed by Zhang et al. The experiment is conducted in the DoubleDIP framework, which adopts the exclusion loss to decompose images. By replacing the exclusion loss with our PNCC, we get the evaluation results in Table 4. Our approach outperforms their official implementation and still performs better after we tune their hyperparameters.
We propose a two-stage polarized reflection removal model with perfect alignment of input-output image pairs. With a new reflection formulation to bypass the misalignment problem between the background and mixed images, we build a polarized reflection removal dataset that covers more than 100 types of glass in the real world. A general decomposition loss called PNCC is proposed to minimize the correlation of two images at different feature levels. We have conducted thorough experiments to demonstrate the effectiveness of our model. We hope our novel model formulation and the M-R dataset can inspire research in reflection removal in the future.
We thank SenseTime Group Limited for supporting this research project.
- [1] (2005) Removing photography artifacts using gradient projection and flash-exposure sampling. TOG 24 (3), pp. 828–835.
- [2] (2017) Single image reflection suppression. In CVPR.
- [3] (2006) Recovery of surface orientation from diffuse polarization. TIP 15 (6), pp. 1653–1664.
- [4] (2005) Sparse ICA for blind separation of transmitted and reflected images. IJIST 15 (1), pp. 84–91.
- [5] (2017) Photographic image synthesis with cascaded refinement networks. In ICCV.
- [6] (2009) ImageNet: A large-scale hierarchical image database. In CVPR.
- [7] (2017) A generic deep architecture for single image reflection removal and image smoothing. In ICCV.
- [8] (1999) Separating reflections and lighting using independent components analysis. In CVPR.
- [9] (2019) "Double-DIP": unsupervised image decomposition via coupled deep-image-priors. In CVPR.
- [10] (2014) Robust separation of reflection from multiple images. In CVPR.
- [11] (2017) Reflection removal using low-rank matrix completion. In CVPR.
- [12] (2002) Optics. Pearson education, Addison-Wesley.
- [13] Perceptual losses for real-time style transfer and super-resolution. In ECCV.
- [14] (2014) A physically-based approach to reflection separation: from physical modeling to constrained optimization. TPAMI 36 (2), pp. 209–221.
- [15] Fully automatic video colorization with self-regularization and diversity. In CVPR.
- [16] (2013) Exploiting reflection change for automatic reflection removal. In ICCV.
- [17] (2019) Learning to jointly generate and separate reflections. In ICCV.
- [18] (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI.
- [19] (2018) Separating reflection and transmission images in the wild. In ECCV.
- [20] (2019) Reflection removal using a dual-pixel sensor. In CVPR.
- [21] (2004) Separating transparent layers through layer information exchange. In ECCV.
- [22] (2005) Separating transparent layers of repetitive dynamic behaviors. In ICCV.
- [23] (2000) Polarization and statistical analysis of scenes containing a semireflector. J. Opt. Soc. Am. 17, pp. 276–84.
- [24] (2015) Reflection removal using ghosting cues. In CVPR.
- [25] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
- [26] (2000) Layer extraction from multiple images containing reflections and transparency. In CVPR.
- [27] (2017) Benchmarking single-image reflection removal algorithms. In ICCV.
- [28] (2019) Single image reflection removal exploiting misaligned training data and network enhancements. In CVPR.
- [29] (2019) Single image reflection removal beyond linearity. In CVPR.
- [30] (2015) A computational approach for obstruction-free photography. TOG 34 (4), pp. 79.
- [31] (2018) Seeing deeply and bidirectionally: a deep learning approach for single image reflection removal. In ECCV.
- [32] (2019) Fast single image reflection suppression via convex optimization. In CVPR.
- [33] (1999) Polarization-based decorrelation of transparent layers: the inclination angle of an invisible surface. In ICCV.
- [34] (2018) Single image reflection separation with perceptual losses. In CVPR.