Intrinsic image decomposition is the inverse problem of recovering the image formation components, such as reflectance and shading (Barrow1978). The shading component consists of light effects such as direct illumination, geometry, shadow casts and ambient light. The reflectance component represents the (albedo) color of an object and is free of any lighting effect. Intrinsic images are favorable for various computer vision tasks. For example, albedo images are beneficial for semantic segmentation algorithms because of their illumination invariant representation (Baslamisli2018ECCV). Similarly, most of the scene editing applications, such as recoloring, rely on albedo images (Ye2014), whereas shading images are preferred for relighting tasks (Shu2017).
The pioneering work on intrinsic image computation is the Retinex algorithm by Land1971, which uses a heuristic based on the rectilinear Mondrian world assumption. In a Mondrian world, where surfaces have piece-wise constant colors, strong gradients correspond to albedo changes, while weaker gradients are attributed to shading variations. The albedo component is then computed by applying a re-integration algorithm (i.e. Poisson) over the strong (albedo) gradients. However, classifying image gradients as albedo or shading is not trivial due to various photometric effects such as strong cast shadows, illuminant color, surface geometry changes or weak albedo transitions. For instance, shadow boundaries or abrupt changes in surface geometry may cause strong intensity shifts and may therefore be misinterpreted as albedo changes. Moreover, the Mondrian world assumption does not hold for real world scenes. Other traditional approaches usually formulate an optimization process by introducing constraints on the intrinsic components (Gehler2011; Shen2011; Barron2015). Most of the priors aim at constraining the albedo component, such as global reflectance sparsity, piece-wise constant reflectance or chromaticity-reflectance correlation. On the other hand, the shading intrinsic is usually constrained by a smoothness prior.
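The gradient-classification heuristic described above can be illustrated with a minimal one-dimensional sketch. It is not the original Retinex implementation: the function name `retinex_1d`, the threshold value, and the use of a cumulative sum in place of Poisson re-integration are assumptions chosen for illustration.

```python
import math


def retinex_1d(intensity, threshold=0.3):
    """Split a 1D intensity row into albedo and shading (toy Retinex).

    Log-gradients whose magnitude exceeds `threshold` are treated as
    albedo edges; the rest are attributed to shading. The threshold is
    an illustrative assumption, not a value taken from the paper.
    """
    log_i = [math.log(v) for v in intensity]
    grads = [b - a for a, b in zip(log_i, log_i[1:])]

    # Keep only the strong gradients for the albedo component.
    albedo_grads = [g if abs(g) > threshold else 0.0 for g in grads]

    # Re-integrate: a cumulative sum is the 1D analogue of Poisson integration.
    log_albedo = [0.0]
    for g in albedo_grads:
        log_albedo.append(log_albedo[-1] + g)

    # Whatever remains in the log domain is assigned to shading.
    log_shading = [li - la for li, la in zip(log_i, log_albedo)]
    albedo = [math.exp(v) for v in log_albedo]
    shading = [math.exp(v) for v in log_shading]
    return albedo, shading


# Two flat albedo patches (0.2 and 0.8) under a slowly varying shading ramp.
true_shading = [1.0, 0.95, 0.9, 0.85, 0.8, 0.75]
true_albedo = [0.2, 0.2, 0.2, 0.8, 0.8, 0.8]
row = [s * a for s, a in zip(true_shading, true_albedo)]
albedo, shading = retinex_1d(row)
```

In this toy row the recovered albedo is piece-wise constant and defined only up to a global scale. The small shading change at the edge is absorbed into the albedo step, which is exactly the kind of misclassification the paragraph above warns about.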
More recent methods rely on deep learning models, specialized loss functions, and large scale datasets. For example, Baslamisli2018CVPR provide an end-to-end solution to the Retinex approach in a deep learning framework, Li2018ECCV combine four datasets with specialized loss functions to impose constraints, and Lettry2018b investigate adversarial learning. With the availability of densely annotated synthetic datasets and multiple constraints on the albedo component, CNN-based methods are capable of estimating high quality albedo maps. However, CNN-based shading estimations regularly suffer from texture and intensity ambiguities (e.g. albedo leakage), which introduce (color) artifacts in the shading profiles. See Figure 1 for an illustration.
In the early days of photometric invariance in computer vision, invariant image descriptors were widely used for different vision tasks. These descriptors are invariant to certain image capturing conditions, such as illumination color, surface geometry or camera position, so that vision algorithms are not affected by them. Successful results were demonstrated for object recognition (Gevers1997; Gevers2000) and shadow removal (Finlayson2006). As CNN-based shading estimations suffer from (color) artifacts, physics-based invariant features may be useful to steer the intrinsic image decomposition process.
Therefore, we investigate the use of photometric invariance and deep learning to compute intrinsic images (albedo and shading). We propose albedo and shading gradient descriptors which are derived from physics-based models. Using the descriptors, albedo transitions are masked out and an initial shading map is calculated directly from the corresponding image gradients in a learning-free manner (unsupervised). Then, an optimization method is proposed to reconstruct the full shading map. Finally, we integrate the shading map into a deep learning model to achieve full intrinsic image decomposition.
Contributions. 1. We are the first to use photometric invariance and deep learning to address the intrinsic image decomposition task. 2. We propose albedo and shading gradient descriptors using physics-based models. 3. The shading map is calculated directly from the corresponding image gradients in a learning-free (unsupervised) manner. 4. We propose a novel deep learning model to leverage the physics-based shading map for the intrinsic image decomposition task. 5. By doing so, we are the first to directly address the color leakage problem in the estimated shading maps. 6. Finally, we extend the dataset of Baslamisli2018CVPR from 15,000 to 50,000 images to train our models, which will be publicly available.
2 Related Work
Intrinsic image decomposition is an ill-posed and under-constrained problem. The pioneering work is the Retinex algorithm by Land1971, based on the assumption that albedo changes cause large gradients, whereas shading variations result in smaller ones. In general, traditional approaches use different optimization processes to constrain the intrinsic components together with the Retinex heuristic. For example, Gehler2011 impose constraints on the global albedo sparsity. SIRFS estimates shape, chromatic illumination, albedo, and shading from a single image by applying seven different constraints on the intrinsic components (Barron2015). The Intrinsic Images in the Wild (IIW) model combines commonly used priors together with a dense conditional random field (Bell2014). Shen2011 use optimization to constrain neighboring pixels with similar intensity values to have similar albedo values. Shen2008 exploit non-local texture cues by constraining distinct points with the same intensity-normalized textures to have the same albedo values. Furthermore, user interactions are investigated as additional priors to specify albedo values (Bousseau2009; Shen2013). Finally, image sequences of the same scene under varying illumination are used to impose constant albedo (Weiss2001; Matsushita2004). Most of the priors mentioned above are related to the albedo intrinsic. This is partially due to color information being more descriptive for robust computer vision algorithms (Sande2009). It is also relatively harder to define priors for the shading intrinsic, because geometry and lighting information are entangled in the representation.
With the introduction of large-scale synthetic datasets, recent research uses convolutional neural networks (Shi2017; Baslamisli2018ECCV; Li2018ECCV). Narihia2015 are the first to use CNNs to learn the task end-to-end in a data-driven manner. Shi2017 make use of a very large scale dataset along with a specialized network to exploit correlations between the intrinsic components. Baslamisli2018CVPR convert the Retinex approach into a deep learning framework together with a physics-based image formation loss. Cheng2018 use a Laplacian pyramid inspired neural network architecture to exploit scale space properties. Lettry2018b explore adversarial residual networks. Fan2018 apply a domain filter guided by a learned edge map to flatten the albedo estimations. Li2018ECCV combine four datasets with specialized loss functions. Janner2017 explore the problem in a self-supervised setting by estimating albedo, shape, and lighting, where shape and lighting estimations are used to train a differentiable shading function. Baslamisli2020 further decompose the shading into different photometric effects. Image sequences of the same scene under varying illumination are also explored by deep learning approaches (Lettry2018; Li2018CVPR). Recent work focusing on inverse rendering tasks also achieves superior albedo estimations (Sengupta2019; Li2020). Nonetheless, these methods are limited to indoor settings and require surface normal and environmental lighting supervision.
CNN-based methods are capable of estimating high quality albedo maps that are mostly free of photometric effects. However, their shading estimations are often negatively affected by albedo transitions causing texture ambiguities and intensity variations, as illustrated in Figure 1. To mitigate the problem, Zhou2019, for example, shift the problem of predicting shading to predicting surface normals and lighting properties. Yet, their work is limited to indoor settings and requires additional modalities and supervision, similar to inverse rendering works. Another example is CGIntrinsics, which over-smooths the shading estimations, which in turn causes structure loss in the shading maps (Li2018ECCV). As CNN-based shading estimations suffer from albedo artifacts, invariant image representations may be favorable to steer the process. They were widely used for various image understanding tasks (Drew1998; Finlayson1995; Finlayson2006; Gevers1997; Gevers1998; Gevers2000). One example is the illumination invariant color ratio features used for robust object recognition (Finlayson1992). Stricker1992 combine ratio histograms with boundary histograms for a more robust framework. Nayar1996 utilize color ratios for pose estimation. Matas1995 embed ratio information into a graph representation, also for efficient object recognition. Barnard2000 identify probable shadow regions using color ratios. Gevers2001 exploit ratio gradients for image retrieval. As invariant image representations are independent of certain imaging conditions, they may be useful to improve CNN-based shading estimation as part of intrinsic image decomposition. To this end, in this paper, we investigate the use of photometric invariance and deep learning to compute intrinsic images (albedo and shading).
3.1 Image Formation Model
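We use the dichromatic reflection model of Shafer1985 to describe an image. The model defines a surface (image) as a combination of diffuse and specular reflections as follows:
$$I = I_{\text{diffuse}} + I_{\text{specular}}. \qquad (1)$$
We assume that the diffuse reflection component dominates the imaging conditions and hence the effect of the specular reflection component is negligible, i.e. $I_{\text{specular}} \approx 0$. Then, an image I over the visible spectrum $\omega$ is modelled by:
$$I_c(\vec{x}) = m(\vec{n}, \vec{s}) \int_{\omega} f_c(\lambda)\, e(\lambda, \vec{x})\, \rho(\lambda, \vec{x})\, d\lambda \qquad (2)$$
for three color channels $c \in \{R, G, B\}$, where $\vec{n}$ indicates the surface normal, $\vec{s}$ denotes the incoming light source direction, and m is a function of the geometric dependencies (e.g. Lambertian $\vec{n} \cdot \vec{s}$). Furthermore, $\lambda$ represents the wavelength, $f_c(\lambda)$ indicates the camera spectral sensitivity, and $e(\lambda, \vec{x})$ describes the spectral power distribution of the light source. Finally, $\rho(\lambda, \vec{x})$ denotes the reflectance, i.e. the albedo. Then, assuming a linear sensor response and narrow band filters ($f_c(\lambda) = \delta(\lambda - \lambda_c)$), the equation can be simplified as follows:
$$I_c(\vec{x}) = m(\vec{n}, \vec{s})\, e_c(\vec{x})\, \rho_c(\vec{x}). \qquad (3)$$
This equation models an image by the multiplication of its geometry $m(\vec{n}, \vec{s})$, albedo $\rho_c(\vec{x})$ and light source properties $e_c(\vec{x})$ at pixel $\vec{x}$. Then, these characteristics are used to define intrinsic images as follows:
$$I(\vec{x}) = S(\vec{x}) \times A(\vec{x}), \qquad (4)$$
where an image at $\vec{x}$ can be modelled by the element-wise product of its shading $S(\vec{x})$ and albedo $A(\vec{x})$ components. If the light source is colored, then the color information is embedded in the shading component.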
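The product form of the image formation model can be illustrated for a single pixel. The sketch below assumes a Lambertian geometric term, unit-length direction vectors, and a shading component that absorbs the light source color; the function name `lambertian_pixel` and all numeric values are hypothetical.

```python
def lambertian_pixel(normal, light_dir, light_rgb, albedo_rgb):
    """Return (image, shading, albedo) RGB triplets for one pixel.

    Assumes unit-length `normal` and `light_dir` and a Lambertian
    geometric term m = max(0, n . s).
    """
    m = max(0.0, sum(n * s for n, s in zip(normal, light_dir)))
    # Shading carries geometry and the (possibly colored) light source.
    shading = [m * e for e in light_rgb]
    # The image is the element-wise product of shading and albedo.
    image = [s * a for s, a in zip(shading, albedo_rgb)]
    return image, shading, list(albedo_rgb)


image, shading, albedo = lambertian_pixel(
    normal=(0.0, 0.0, 1.0),
    light_dir=(0.0, 0.6, 0.8),
    light_rgb=(1.0, 0.9, 0.8),    # slightly warm light (hypothetical)
    albedo_rgb=(0.5, 0.25, 0.1),  # hypothetical surface color
)
```

With a white light source (equal `light_rgb` entries) the shading triplet becomes achromatic, which matches the remark that light color, when present, ends up in the shading component.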
3.2 Albedo Gradients
Using Equation 3, the image formation model for the three color channels becomes:
$$R(\vec{x}) = m(\vec{n}, \vec{s})\, e_R(\vec{x})\, \rho_R(\vec{x}), \quad G(\vec{x}) = m(\vec{n}, \vec{s})\, e_G(\vec{x})\, \rho_G(\vec{x}), \quad B(\vec{x}) = m(\vec{n}, \vec{s})\, e_B(\vec{x})\, \rho_B(\vec{x}). \qquad (5)$$
Considering only neighbouring pixels $\vec{x}_1$ and $\vec{x}_2$, locally constant illumination can be assumed: $e_c(\vec{x}_1) = e_c(\vec{x}_2)$ (Land1971). By taking the difference of the logarithmic transformation of each color channel, the albedo descriptors are defined as follows:
where the remaining factor is only the albedo difference between two channels. The albedo change is a measure that is invariant to surface geometry $\vec{n}$, illumination direction $\vec{s}$, and illumination intensity and color $e_c$. If there is no albedo change (homogeneously colored patch), then the difference is zero. Sensor artifacts or noise may cause the value to deviate slightly from zero. Therefore, the index can be used to identify regions with constant albedo. On the other hand, when the difference deviates significantly from zero, it corresponds to a true albedo change. Hence, this measure encodes spatial information of an image, emphasizing (illumination invariant) albedo edges. Then, we propose the albedo gradient index as follows:
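The invariance argument can be made explicit with a short worked derivation. The display below is a sketch under assumed notation (subscripts for the color channel, $\vec{x}_1$ and $\vec{x}_2$ for two neighbouring pixels, $m(\vec{x})$ as shorthand for the geometric term at a pixel). It shows one log-difference descriptor for the red and green channels that is consistent with the description above, and it is not necessarily the exact form used by the authors.

```latex
\begin{align*}
\log R(\vec{x}) - \log G(\vec{x})
  &= \log\frac{m(\vec{x})\, e_R(\vec{x})\, \rho_R(\vec{x})}
              {m(\vec{x})\, e_G(\vec{x})\, \rho_G(\vec{x})}
   = \log\frac{e_R(\vec{x})}{e_G(\vec{x})}
   + \log\frac{\rho_R(\vec{x})}{\rho_G(\vec{x})}, \\
\bigl[\log R - \log G\bigr](\vec{x}_1) - \bigl[\log R - \log G\bigr](\vec{x}_2)
  &= \log\frac{\rho_R(\vec{x}_1)}{\rho_G(\vec{x}_1)}
   - \log\frac{\rho_R(\vec{x}_2)}{\rho_G(\vec{x}_2)}
  \quad \text{when } e_c(\vec{x}_1) = e_c(\vec{x}_2).
\end{align*}
```

The geometric term cancels inside the channel ratio and the illumination term cancels between the two pixels, so the result vanishes whenever both pixels share the same albedo.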
We calculate the albedo gradients over a local neighbourhood (patch) by using derivative filters (e.g. the derivative of a 2D Gaussian or Laplacian) to identify the changes. As a result, the average response of the albedo gradients is calculated. A neighborhood with a higher albedo gradient index value indicates a stronger albedo change, which is also illustrated in Figure 2. Patches with a constant index correspond to homogeneous regions. The albedo gradient index is intuitive and can be computed in real time. A small threshold is applied to suppress responses caused by sensor artifacts and noise.
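A thresholded index of this kind can be sketched on a single row of RGB pixels. The sketch is illustrative only: it uses finite differences between neighbours instead of Gaussian derivative filters, and the function name `albedo_gradient_index`, the Euclidean combination of the three channel pairs, and the threshold value are assumptions rather than the authors' definition.

```python
import math


def albedo_gradient_index(pixels, threshold=0.05):
    """Toy albedo gradient index along a row of RGB pixels.

    For each pair of neighbours, the finite difference of log(R/G),
    log(R/B) and log(G/B) is taken and combined with a Euclidean norm.
    The norm and the threshold value are assumptions of this sketch.
    """
    def log_ratios(rgb):
        r, g, b = (math.log(v) for v in rgb)
        return (r - g, r - b, g - b)

    index = []
    for left, right in zip(pixels, pixels[1:]):
        diffs = [b - a for a, b in zip(log_ratios(left), log_ratios(right))]
        value = math.sqrt(sum(d * d for d in diffs))
        # Responses below the threshold are treated as noise.
        index.append(value if value > threshold else 0.0)
    return index


# Two homogeneous patches under varying shading and a colored light source.
shading = [1.0, 0.7, 0.4, 0.9, 0.6, 0.3]
light = (1.0, 0.9, 0.7)
albedos = [(0.8, 0.2, 0.2)] * 3 + [(0.2, 0.3, 0.8)] * 3
pixels = [
    tuple(m * e * a for e, a in zip(light, rho))
    for m, rho in zip(shading, albedos)
]
index = albedo_gradient_index(pixels)
```

Only the neighbour pair that straddles the color change gives a non-zero response, even though the shading varies strongly inside both patches and the light is not white.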
3.3 Shading Gradients
So far, we have described how the albedo gradient index can be used to identify uniformly colored (homogeneous) patches. In a color image, if the pixel values share the same albedo, then the only source causing those pixel values to change is the shading component. For constant albedo over an image neighborhood (i.e. a patch whose albedo gradient index stays below the threshold), the shading gradient can be computed by taking the difference of the logarithmic transformation of each color channel. We illustrate it on the red channel as follows (the same also holds for the green and blue channels):
Note that it is only applied on the homogeneous patches. Logarithms are usually preferred to avoid numerical instabilities, yet the derivatives of the channels can also be taken directly to yield the following shading gradient index:
Similar to the albedo gradient index, the average response is calculated, which yields a representation of the gradient field of an image. Note that (non-colored) shadows are included in the shading difference component.
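The red-channel case can be written out as a short worked step. This is a sketch under assumed notation ($\nabla$ for the spatial derivative, $m$ for the geometric term, $e_R$ and $\rho_R$ for the red-channel illumination and albedo). It is consistent with the image formation model of Section 3.1 but is not necessarily the authors' exact expression.

```latex
\begin{align*}
\nabla \log R
  &= \nabla \log\bigl(m\, e_R\, \rho_R\bigr)
   = \nabla \log m + \nabla \log e_R + \nabla \log \rho_R \\
  &= \nabla \log m
  \quad \text{when } \nabla \log e_R = 0 \text{ and } \nabla \log \rho_R = 0 .
\end{align*}
```

On a homogeneous patch under locally constant illumination, the log-gradient of a channel therefore depends only on the geometric (shading) term.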
After obtaining the shading gradient, we reconstruct the shading map from its shading gradient fields. We use a publicly available algorithm to compute the global least squares reconstruction (Harker2008; Harker2011). Note that the albedo gradient index is used to detect uniformly colored (homogeneous) patches first. Then, the shading gradients are calculated only on the homogeneous patches. As a result, the reconstructed shading map is computed directly from the shading gradient fields of an image in an unsupervised manner. Since it is computed only on the homogeneous image regions (those whose albedo gradient index stays below the threshold), a sparse shading map is obtained. Therefore, the representation is not affected by the albedo changes. The process is illustrated in Figure 3. In the end, we obtain a sparse shading map that is computed directly from the RGB image and is also very close to the ground-truth representation.
Then, a shading smoothness constraint is used to fill in the gaps based on the neighboring pixel information. To achieve that, we adapt a publicly available optimization framework that was originally designed for the depth completion task (Zhang2018). We modify the model to impose the shading smoothness constraint to achieve a full (dense) shading map. The objective function is defined as the sum of squared errors with two terms.
The first term is a data term computed over the pixels that are available (not empty) in the initial sparse shading map, which are reconstructed from the gradient fields over the homogeneous regions. It measures the distance between the final shading map and the initial (sparse) shading map at each such pixel, i.e. per-pixel reconstruction accuracy. The second term is a smoothness term computed over a neighbourhood. It encourages adjacent pixels to have the same shading values.
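The interplay of the two terms can be illustrated with a small one-dimensional sketch. It is not the adapted framework of Zhang2018: the function name `complete_shading`, the weights `data_weight` and `smooth_weight`, and the Gauss-Seidel solver are assumptions made for illustration.

```python
def complete_shading(sparse, data_weight=1000.0, smooth_weight=1.0, sweeps=200):
    """Fill the gaps (None) in a 1D sparse shading row.

    Minimises
        data_weight   * sum over observed p of (S[p] - S0[p])^2
      + smooth_weight * sum over adjacent (p, q) of (S[p] - S[q])^2
    with Gauss-Seidel sweeps. Weights and sweep count are assumptions.
    """
    observed = [v for v in sparse if v is not None]
    fill = sum(observed) / len(observed)
    dense = [fill if v is None else v for v in sparse]

    for _ in range(sweeps):
        for p, s0 in enumerate(sparse):
            neighbours = [
                dense[q] for q in (p - 1, p + 1) if 0 <= q < len(dense)
            ]
            # Closed-form minimiser for S[p] with all other values fixed.
            numerator = smooth_weight * sum(neighbours)
            denominator = smooth_weight * len(neighbours)
            if s0 is not None:
                numerator += data_weight * s0
                denominator += data_weight
            dense[p] = numerator / denominator
    return dense


dense = complete_shading([1.0, None, None, 4.0])
```

With a large data weight the observed pixels stay close to their initial values and the gap is filled by a near-linear ramp. This also shows why large gaps lose geometric detail: the smoothness term alone cannot recover structure that was never observed.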
4 Intrinsic Image Decomposition
Since the sparse shading map is completed by only a smoothness constraint, the reconstructed dense map may suffer from geometry loss if the initial gaps are too large. It may also suffer from scale problems due to the least squares fitting. Therefore, we integrate the completed dense shading map into a deep learning framework to refine it and also to predict the corresponding albedo image to achieve intrinsic image decomposition. The network is expected to further improve the shading maps by supervised training and also by the differentiation of additional albedo cues. It is also expected to generate better albedo maps as the dense shading map is robust to color leakages and intensity ambiguities. As stated earlier, deep learning based shading estimations are not as good as albedo estimations. They suffer from albedo color leakages mostly due to texture ambiguities and intensity variations, as demonstrated in Figure 1. On the other hand, our physics-based generated shading map is more robust to those leakages as it is computed only on homogeneous regions. As a result, we design a CNN model such that the image only refines the initial shading estimation, and it is not directly involved in the reconstruction phase to avoid any further critical color leakage. The model is illustrated in Figure 4.
Encoder blocks use strided convolution layers for downsampling (4 times). Each convolution is followed by residual blocks (He2016). They are preferred as the deviations from the input are rather small. The image encoder uses 4 consecutive residual blocks, while the shading encoder uses 1 block with different dilation rates. A residual block is composed of a Batch Norm-ReLU-Conv(3x3) sequence, repeated twice.
Fusion. The final layers of the encoders are fused with a convolution and a contextual attention module (Yu2018) to create a bottleneck such that the related features can properly guide the shading estimation. As a result, the image features are fused with the shading features (1) as a (learnable) weighted combination using a convolution, and (2) by the contextual attention module. The contextual attention module was originally proposed for the image inpainting task, where it learns where to use feature information from known background patches to generate missing patches. We adapt the module to our problem such that the shading features use the information from the image features. This is expected to help because, in a homogeneously colored patch, the only source causing pixel values to change is the shading component. Therefore, in those regions, the shading map and the image are highly correlated. Fusion happens at the final (lowest) encoder resolution. Preliminary experiments suggested that fusing at lower resolutions cannot reconstruct a decent shading map (too blurry), whereas fusing at higher resolutions causes further critical color leakage in the shading.
Decoders. The fusion output is fed to the shading decoder, while the albedo decoder takes the image encoder's final layer as input. Both decoders share the same structure. Encoder features are passed through a Conv(3x3)-Batch Norm-LeakyReLU sequence. Then, the feature maps are (bilinearly) up-sampled and concatenated with their encoder counterparts by skip connections. The process is repeated 4 times to reach the final resolution. The shading decoder only receives shading encoder features through skip connections so that it is not affected by high resolution color features. The albedo decoder only receives image encoder features through skip connections. In this way, the network is specialized for robust shading estimation within the intrinsic image decomposition task.
Loss Functions. The model is trained with a weighted sum of five loss terms: a pixel-wise reconstruction loss, which is a weighted combination of mean-squared-error loss and scale-invariant MSE loss; a gradient-wise reconstruction loss; a term that assesses the structural dissimilarity; a term that measures the reconstruction distance in several feature spaces of a pre-trained VGG16 (Simonyan2015); and an image formation loss that forces the estimated reflectance and shading images to reconstruct the original image (i.e. through their element-wise product). Each term has its own weight. Note that the loss functions are standard reconstruction modules and do not impose any intrinsic image characteristics. The implementation details and other training details are provided in the supplementary material (https://drive.google.com/file/d/1Bl09bJfDS5KayTyHljmHgAmgV5UcxF4E/view?usp=sharing).
Dataset. To train our models, we use the ShapeNet dataset of Baslamisli2018CVPR. The dataset includes around 20,000 (synthetic) images of man-made objects randomly sampled from the original ShapeNet dataset (Chang2015). Following the setup of Baslamisli2018CVPR, we render additional images to reach around 50,000 images for training. The dataset will be made publicly available.
5 Experiments and Evaluation
We conduct experiments on four datasets of real world objects with ground-truth intrinsics: MIT Intrinsics (Grosse2009), NIR-RGB Intrinsics (Cheng2019), Multi-Illuminant Intrinsic Images (Beigpour2015) and Spectral Intrinsic Images (Chen2017). In addition, we provide experiments on two scene-level datasets: As Realistic As Possible (Bonneel2017), a synthetic ground-truth dataset, and Intrinsic Images in the Wild (Bell2014), a complex real world dataset with relative human annotations. Finally, we provide further qualitative evaluations on real world in-the-wild images. Comparisons are provided against several state-of-the-art intrinsic image decomposition algorithms. We pick three optimization based methods: (i) STAR, a structure and texture aware advanced Retinex model (Xu2020), (ii) IIW, a framework based on clustering and a dense CRF (Bell2014), and (iii) SIRFS, a model imposing seven different priors on reflectance, shape and illumination (Barron2015). We include four deep learning based methods: (i) ShapeNet, which uses specialized decoder links to correlate intrinsics and is trained on 2.5M synthetic objects (Shi2017), (ii) IntrinsicNet, which uses deep VGG16 encoder-decoders and an image formation loss, trained on 20K synthetic objects, (iii) RetiNet, which provides an end-to-end solution to the Color Retinex approach using gradients, trained on 20K synthetic objects, and (iv) CGIntrinsics, which combines two real world scene datasets (around 3000) and two synthetic scene-level datasets (around 20K) for training with additional smoothness constraints to achieve better intrinsics. We use the publicly available models and the original outputs without any fine-tuning or post-processing stages for comparison.
To evaluate our proposed method, we follow the common practice (Grosse2009). When dense ground-truths are available, we use the mean squared error (MSE), where the absolute brightness of each image is adjusted by least squares because the ground-truth is only defined up to a scale factor, and the local mean squared error (LMSE) with window size 20. For the human annotations of the Intrinsic Images in the Wild (IIW) dataset, we use the Weighted Human Disagreement Rate (WHDR) metric as provided by the authors (Bell2014). All the images are resized to a common resolution for fair comparison.
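The least-squares brightness adjustment behind both metrics can be sketched as follows. This is a simplified illustration on flattened (1D) lists, not the reference evaluation code of Grosse2009: the function names, the half-window step, and the absence of any normalisation are assumptions of this sketch.

```python
def scale_invariant_mse(estimate, truth):
    """MSE after rescaling `estimate` by the least-squares optimal factor."""
    dot_et = sum(e * t for e, t in zip(estimate, truth))
    dot_ee = sum(e * e for e in estimate)
    # alpha minimises sum((alpha * e - t)^2) over all pixels.
    alpha = dot_et / dot_ee if dot_ee > 0 else 0.0
    squared = [(alpha * e - t) ** 2 for e, t in zip(estimate, truth)]
    return sum(squared) / len(truth)


def local_mse(estimate, truth, window=20):
    """Average scale-invariant MSE over sliding windows.

    The step of half a window is an assumption of this sketch.
    """
    step = max(1, window // 2)
    starts = range(0, max(1, len(truth) - window + 1), step)
    errors = [
        scale_invariant_mse(estimate[s:s + window], truth[s:s + window])
        for s in starts
    ]
    return sum(errors) / len(errors)
```

As a usage example, an estimate that differs from the ground truth only by a global brightness factor yields an error of (numerically) zero under both functions, whereas a constant estimate does not:

```python
truth = [0.1 * i for i in range(1, 41)]
scaled = [2.0 * t for t in truth]
flat = [1.0] * 40
print(scale_invariant_mse(scaled, truth), scale_invariant_mse(flat, truth))
```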
5.1 Evaluations on Object-level Datasets
5.1.1 MIT Intrinsic Images Dataset
The dataset contains 20 real-world objects with ground-truth intrinsic images. Objects are lit by a single directional white light source. We follow the recommendation of the authors and exclude apple, pear, phone and potato objects as they are marked as problematic (Grosse2009). The quantitative results are provided in Table 1. The table also includes the effect of the contextual attention (CA) module as an ablation study.
| OURS (w/o CA) | 0.0075 | 0.0070 | 0.0073 | 0.0454 | 0.0458 | 0.0456 |
The results show that, compared with the deep learning based estimations, our proposed model achieves better performance at generating albedo and shading maps on the dataset. The optimization based SIRFS outperforms the other learning based models, and its shading estimations yield the best results on the MSE metric. It is known that SIRFS achieves superior performance on single and masked objects, yet it generalizes poorly to real scenes (Narihia2015; Li2018ECCV). Nonetheless, our estimations are superior to those of SIRFS on all other metrics. On average, we achieve the best results by a substantial margin. Furthermore, the contextual attention module by Yu2018 leads to a further performance boost on all metrics. It emerges as a fundamental building block of our proposed method.
In addition, our model is considerably more efficient than the optimization-based methods. To process a single image of the MIT dataset, on average, SIRFS takes 111.38 seconds, whereas our model takes 1.79 seconds including the albedo gradient estimation, initial shading recovery from the gradients, filling the initial shading with the smoothness prior, and finally estimating complete intrinsic images. All in all, our model is about 62 times faster than SIRFS. As a side note, the IIW model takes 18.09 seconds and STAR takes 2.78 seconds to process a single image of the MIT dataset. The timings are measured on an Intel Xeon CPU E5-2640 v3 @ 2.60GHz.
Finally, we provide qualitative evaluations. Figure 5 demonstrates the effect of the proposed model from the initial step to the final shading map with progressive improvement. The results show that our framework first generates an initial shading map where the color transitions are masked out by the physics-based albedo gradient descriptors. Then the initial shading maps are filled (inpainted/interpolated) with the shading smoothness prior. They are free of color leakages and intensity ambiguities. However, they suffer from scale problems due to the least squares fitting and they are rather blurry due to the neighbourhood smoothness filling. Finally, our deep learning model is able to refine the initially filled shading maps. It makes them sharper, adjusts the scale, and makes finer geometry details visible. Figure 7 provides the qualitative comparison results against several state-of-the-art models. It shows that we achieve better shadow and shading handling in albedo predictions and our albedo estimations are significantly better. We attribute this to our physics-based shading reconstructions as they handle color leakage and intensity ambiguity problems. Thereby, our shading predictions have no or minimal color leakage. Moreover, the shading map estimations by the deep learning methods tend to severely overfit to the image, producing strong color leakages as texture artefacts and intensity ambiguities.
5.1.2 NIR-RGB Intrinsic Images Dataset
We provide additional cross dataset experiments on NIR-RGB Intrinsic Images dataset, which was mainly generated for near-infrared imagery research (Cheng2019). It includes seven real-world objects with corresponding ground-truth intrinsics. The quantitative results are provided in Table 2.
The results show that our proposed model achieves better performance than the other models on all metrics. In particular, we achieve significantly better albedo estimations. The results further demonstrate the improved generalization ability of our proposed method. On this dataset, deep learning based methods are as good as SIRFS, and even superior in some cases. Finally, Figure 6 shows qualitative comparisons for a number of images.
The qualitative results further support the quantitative evaluations. Our model predictions are closer to the ground-truth images. The colors of our albedo estimations appear more natural and vivid, and closer to the chromaticity patterns of the input images. Our shading estimations do not include intensity ambiguities or texture artefacts. On the other hand, the intensity ambiguity problem in the shading maps can be observed on ShapeNet and IntrinsicNet estimations on the candle and house images. CGIntrinsics’s shading smoothness constraint tends to generate over-smoothed estimations and cannot capture fine-grained geometric patterns. For example, the balcony of the house object is not visible anymore. SIRFS tends to generate incorrect colors on albedo estimations when a scene is dominated by a single color as in the cases of lion and house objects. The colors of the CGIntrinsics albedo maps tend to shift towards red.
5.1.3 Multi-Illuminant Intrinsic Images (MIII) Dataset
MIT Intrinsic Images and NIR-RGB Intrinsic Images datasets provide images with uniform white illumination. In this experiment, we further test the ability of our proposed method to generalize to complex multi-illuminant scenarios. The dataset includes five real-world scenes with multi-colored non-uniform lighting, complex geometry, large specularities, and challenging colored shadows (Beigpour2015). Each scene includes two objects and is captured under 6 single-illuminant and 9 two-illuminant conditions. The colors of the illuminants vary from orange to blue. In total, there are 75 images with ground-truth intrinsics. The quantitative results are provided in Table 3.
The quantitative results show that our proposed model achieves better performance on almost all metrics. Only the reflectance estimations of CGIntrinsics (Li2018ECCV) are better on the LMSE metric, but their shading estimations are significantly worse. Thus, compared with other works, on average we achieve the best results by a large margin. Note that the optimization based SIRFS (Barron2015) and the learning based ShapeNet (Shi2017) are inherently modelled to estimate multi-colored illumination. Nevertheless, our model proves more robust to real-world images with multi-colored non-uniform lighting. The results further demonstrate the improved generalization ability of our proposed method.
5.1.4 Spectral Intrinsic Images Dataset (SIID)
The dataset was mainly generated for spectral intrinsic image decomposition research (Chen2017). It includes nine objects illuminated with two kinds of light sources, one white and one warm-tone white. In total, it has 18 spectral images with corresponding shading ground-truths. The dataset also provides corresponding images synthesized from the spectral images that are used as inputs to the models. The quantitative results are provided in Table 4.
The results show that our reconstructed shading maps are closer to the ground-truths on all metrics. Similar to the MIII dataset experiments with multi-colored non-uniform lighting, our models also achieve more robust results on a different illumination setting of warm-tone white. Finally, Figure 8 shows qualitative comparisons for a number of images.
The qualitative results further support the quantitative evaluations. Our model predictions are closer to the ground-truth images. Our albedo estimations appear more natural and vivid and they are free of geometric effects. Our model is also capable of removing the cast shadows on the platforms of the gypsum and cube objects from the albedo estimations. Since our model is trained only on white light, the color of the light source is also embedded in the albedo. The same behaviour is also observed for the other models. To overcome this issue, a white balancing algorithm can be applied to the input images as a pre-processing step. Nonetheless, it does not cause significant problems for the reconstruction quality as the ground-truths are not absolute and only defined up to a scale factor (Grosse2009; Narihia2015). SIRFS can handle the issue, but it tends to confuse albedo and the color of the light source when a scene is dominated by a single color, as demonstrated in the previous section. Additional examples can be found in the upcoming sections. Likewise, as mentioned in the previous section, ShapeNet (Shi2017) is inherently modelled to estimate multi-colored illumination. However, it also fails to differentiate the color of the light source and albedo in this case. It also generates undesired color artefacts on the albedo maps.
As for the shading maps, our model estimations are free of any texture artefacts and intensity ambiguities. The text on the heart of the baymax object is correctly attributed to the albedo map, whereas the ShapeNet estimation is contaminated with the texture artefact, and the IntrinsicNet and CGIntrinsics estimations both contain texture artefacts and intensity ambiguities. The intensity ambiguity problem is more severe on the shading estimations of the cube object. Our model and the optimization-based SIRFS can handle those. Nevertheless, our contribution is more significant on the gypsum object, where SIRFS tends to generate over-smoothed and overly bright estimations in which the geometry is distorted and fine-grained structures are no longer visible. Our model is also not flawless. For example, we cannot capture the fine geometric details of the cube image and our estimation appears more rigid. That is because of the shading smoothness constraint that is used to fill in the gaps of the initial shading map based on the neighboring pixel information. Since the color changes happen near the holes, shading smoothness interpolation also fills in those gaps. Therefore, the shading estimation appears more rigid in those cases.
5.1.5 Amsterdam Library of Object Images (ALOI) Dataset
We also provide additional visual comparisons for real world images without ground-truths. For this task, we use the Amsterdam Library of Object Images (ALOI) dataset (Geusebroek2005). Figure 9 provides a number of example objects with different properties to demonstrate the effectiveness of our method. Rows (1,2,3) provide examples with textures and rows (4,5) provide examples with strong shading patterns.
Deep learning based methods have severe color leakages in the shading map estimations for textured objects. The shading smoothness constraint of CGIntrinsics negatively affects the shading maps when strong shading patterns are present. It generates homogeneously smooth images such that it cannot properly capture darker regions where the surface normals (geometry) significantly deviate from the incoming light source direction. It can be observed from the cup image that the right part of the handle should be covered by the shading pattern and should not be visible. Our proposed work is the only model that can capture that pattern. Similar behaviour is also observed for the wooden cube in the last row. Likewise, the other models cannot generate a decent albedo map in those cases. The albedo maps generated by ShapeNet are rather dull colored and blurry. Similarly, the albedo maps generated by CGIntrinsics and IntrinsicNet tend to be polluted with color artefacts. On the other hand, our model is better at avoiding attributing surface texture to the shading maps, and our albedo estimations are sharper, have better color augmentation and are more natural in all cases. The SIRFS model is also capable of producing decent shading maps for textured objects. However, its albedo predictions are not as decent when an image is dominated by a single color, as in the 1st and 5th rows. Similarly, it tends to fail to capture decent shading maps when an image has strong shading patterns.
5.2 Evaluations on Scene-level Datasets
Several aspects of scene-level intrinsic image decomposition are challenging for our current setup. Firstly, a scene is composed of multiple objects, so the behaviour of the illumination component is more complex. In particular, the ambient light (inter-reflection) effect is much stronger. In addition, our optimization process, which uses the smoothness constraint to fill in the gaps of the initial shading map, may be negatively affected if the gaps are filled from different surfaces (e.g. filled with object boundaries). Similarly, cluttered objects may create gaps that are too large to fill. Moreover, since scene-level objects have different scales, a single threshold might not be sufficient to obtain proper gradients. Nonetheless, for the sake of completeness, we also evaluate our model on scene-level images to provide additional insights.
5.2.1 As Realistic As Possible (ARAP) Dataset
With the current technology, it is not possible to generate dense ground-truth intrinsic images for any real world scene. Ground-truth intrinsics can only be collected at object level and in a fully-controlled (indoor) laboratory setting, which demands extreme care (Grosse2009; Chen2017; Cheng2019). That is the reason why those datasets contain few samples. Therefore, to evaluate our model on scene-level images, we utilize the synthetic dataset of Bonneel2017. The dataset provides 53 high quality realistic scene-level renderings with corresponding per-pixel ground-truth intrinsics. Some of the scenes were re-rendered with different illumination settings. Thus, the evaluation is provided for the full dataset of 152 images. The quantitative results are provided in Table 5. The evaluations do not include the CGIntrinsics model, as it uses ARAP for training (Li2018ECCV), nor the SIRFS model, as it is specifically designed for single objects and generalizes poorly to real scenes (Narihia2015; Li2018ECCV). Compared with the other frameworks, our proposed model achieves better performance on all metrics on scene-level images as well, which further demonstrates our improved generalization ability.
5.2.2 Intrinsic Images in the Wild (IIW) Dataset
We follow the common practice and utilize the test set used by previous work (Zhou2015; Li2018ECCV). The test split includes 1046 images with relative human annotations. The quantitative results are provided in Table 6. We also train our model with less data (20K) to provide a fairer comparison against the models of Baslamisli2018CVPR.
| Method | Training set | WHDR |
|---|---|---|
| ShapeNet | ShapeNet (2.5M) | 59.4% |
| IntrinsicNet | ShapeNet (50K) | 32.1% |
| RetiNet | ShapeNet (50K) | 37.9% |
| OURS | ShapeNet (20K) | 28.7% |
| OURS | ShapeNet (50K) | 28.9% |
| OURS* | ShapeNet (50K) | 26.8% |
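Since IIW only provides relative human annotations, results on it are commonly reported as the weighted human disagreement rate (WHDR), where lower is better. The sketch below assumes that the percentages in the table above are WHDR values and uses the commonly adopted threshold of 10%. The function name `whdr` and the tuple layout of `comparisons` are hypothetical and do not follow the dataset's actual file format.

```python
def whdr(comparisons, delta=0.10):
    """Weighted human disagreement rate over pairwise reflectance judgements.

    comparisons: iterable of (r1, r2, human_label, weight), where r1 and r2 are
    the predicted reflectance intensities at the two annotated points,
    human_label is "1" (point 1 darker), "2" (point 2 darker) or "E" (equal),
    and weight is the confidence of the human judgement.
    """
    tiny = 1e-10  # guards against division by zero
    total_weight = 0.0
    disagreeing_weight = 0.0
    for r1, r2, human_label, weight in comparisons:
        if r2 / max(r1, tiny) > 1.0 + delta:
            predicted_label = "1"  # point 1 is predicted to be darker
        elif r1 / max(r2, tiny) > 1.0 + delta:
            predicted_label = "2"  # point 2 is predicted to be darker
        else:
            predicted_label = "E"  # roughly equal reflectance
        total_weight += weight
        if predicted_label != human_label:
            disagreeing_weight += weight
    return disagreeing_weight / total_weight if total_weight > 0 else 0.0
```

For example, three comparisons with weights 1, 2 and 1, of which only the last one disagrees with the human label, give a disagreement rate of 1/4.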
Compared with the models trained on the object-level ShapeNet dataset, our proposed model achieves significantly better reflectance predictions. An additional performance boost is achieved by applying a post-processing step to enforce piecewise constant reflectance (Nestmeyer2017). Decreasing the training sample size does not significantly affect the performance of our model’s albedo estimations on IIW. Furthermore, our proposed model is significantly better than the structure and texture aware advanced Retinex model, and also the DirectIntrinsics model trained on the scene-level Sintel dataset. We also achieve on par results with the CGIntrinsics model when it is trained on the scene-level SUNCG dataset. That model achieves superior performance by combining the refined and improved renderings of scene-level SUNCG with the ARAP dataset to create their final dataset, CGI. It is also worthwhile to note that all the learning based models use data augmentation through random flips, shifts, resizings, and crops, whereas we do not apply any augmentation technique. Finally, Figure 11 provides qualitative comparisons for albedo estimations, and Figure 10 for shading estimations.
ShapeNet estimations are contaminated with artefacts and do not appear natural. The shading of the bed image includes texture artefacts and the text AWAI is directly copied to the shading map in the girl image. Similar patterns are also observed in the IntrinsicNet estimations. The shading maps generated by IntrinsicNet also suffer from intensity ambiguities, which can be observed in the girl image, where the neck of the t-shirt has a darker color. Its albedo estimations are better than ShapeNet’s, yet they contain undesired brightness artefacts. IIW’s albedo estimations appear natural and free of geometry effects. However, its shading generations directly overfit to the inputs, and all the texture patterns are clearly visible in the shading maps.
CGIntrinsics trained on scene-level imagery achieves decent albedo predictions with proper smoothing effects, and compared with the others, they appear more natural. However, its shading estimations appear too smooth and hazy and most of the structures are no longer visible; see the stairs or the fine-grained pillars of the church. It also suffers from the same intensity ambiguity problem as IntrinsicNet. On the other hand, our model is also capable of producing scene-level shading maps that are free of texture artefacts and intensity ambiguities. The first image shows that our model also works on outdoor scenes and is capable of handling geometry differences and different light properties. We can also handle the text on the t-shirt of the girl image and the text on the salt box and correctly attribute them to the albedo maps. The windows of the bed image are an example where our shading map is negatively affected, as our model tries to fill in the gaps with insufficient gradient information. Although we did not enforce smoothness as CGIntrinsics does, our albedo estimations also appear smooth. However, our method still makes mistakes; for example, the face of the girl and the right side of the church appear blurry. Finally, our model is the only one that can handle the strong shadow cast under the bed. Our albedo estimations are free of strong shadow casts in this example, whereas all other models fail to handle it.
We investigated the use of photometric invariance to steer a deep learning model for intrinsic image decomposition (albedo and shading). We proposed albedo and shading gradient descriptors which are derived from physics-based models. Using the descriptors, albedo transitions are masked out and an initial shading map is calculated directly from the corresponding image gradients in a learning-free unsupervised manner. Then, an optimization method was proposed to reconstruct the full dense shading map. Finally, we integrated the generated shading map into a novel deep learning framework to refine it and also to predict the corresponding albedo image to achieve intrinsic image decomposition. Additionally, to train our model, a large-scale dataset of synthetic images of man-made objects was extended from 20K to 50K.
The evaluations were provided on five different object-level datasets (MIT, NIR-RGB, MIII, SIID, and ALOI) and two scene-level datasets (ARAP and IIW) with comprehensive setups, without any fine-tuning or domain adaptation stage. The evaluations showed that the shading maps generated by our proposed model are more robust to texture artefacts and intensity ambiguities, which have been a long-standing problem in the intrinsic image decomposition task. Since our model handles the undesired artefacts in the shading estimations, we also better differentiate albedo changes and achieve superior quantitative results.
Another conclusion is that deep learning based methods tend to overfit to the input image, resulting in critical color leakages in the shading maps. In quantitative evaluations, the leakage effect may not be reflected numerically. That suggests that future work should focus on proposing better metrics for evaluation. In addition, the color leakage effect may not be observed when a model is trained and tested (or fine-tuned) on the same dataset (Narihia2015; Cheng2018). Therefore, it is important for intrinsic image decomposition methods to provide cross-dataset or in-the-wild evaluations. Finally, in our preliminary experiments we also tried to adapt several guided image-to-image translation and feature modulation techniques to refine our initial shading maps with the features. In particular, we tried the end-to-end trainable guided filter by Wu2018, bi-directional guided image-to-image translation by AlBahar2019, spatially-adaptive normalization by Park2019, and deep spatial feature transform by Wang2018. Unfortunately, none of them was able to address the color leakage problem in the shading maps.
This project was funded by the EU Horizon 2020 program No. 688007 (TrimBot2020). We thank Partha Das for his contribution to the experiments.