ShadingNet: Image Intrinsics by Fine-Grained Shading Decomposition

12/09/2019 ∙ by Anil S. Baslamisli, et al. ∙ University of Amsterdam

In general, intrinsic image decomposition algorithms interpret shading as one unified component including all photometric effects. As shading transitions are generally smoother than albedo changes, these methods may fail in distinguishing strong (cast) shadows from albedo variations, which in turn may leak into the albedo map predictions. Therefore, in this paper, we propose to decompose the shading component into direct shading (illumination) and indirect shading (ambient light and shadows). The aim is to distinguish strong cast shadows from reflectance variations. Two end-to-end supervised CNN models (ShadingNets) are proposed that exploit the fine-grained shading model. Furthermore, surface normal features are jointly learned by the proposed CNN networks. Surface normals are expected to assist the decomposition task. A large-scale dataset of scene-level synthetic images of outdoor natural environments is provided with intrinsic image ground-truths. Large scale experiments show that our CNN approach using fine-grained shading decomposition outperforms state-of-the-art methods using unified shading.


I Introduction

Intrinsic image decomposition is the problem of recovering the image formation components in terms of reflectance (albedo) and shading (illumination) [2]. The reflectance component contains information about the real color (i.e. albedo) of an object and is independent of illumination. On the other hand, the shading component contains different types of photometric effects such as direct light, ambient light (inter-reflections) and cast shadows. As a result, using intrinsic images rather than raw images can be favourable for different computer vision tasks. For instance, reflectance images (i.e. illumination invariant) are useful for semantic segmentation [3], whereas shading images are a source of information for 3D shape reconstruction tasks [34].

The problem of intrinsic image decomposition is ill-posed, because there are multiple solutions to compute the intrinsic components. As a consequence, most of the traditional methods impose priors on the intrinsic components to constrain the search space by means of an optimization process [1, 11, 31]. More recent approaches use large-scale datasets with powerful deep learning methods [4, 9, 24, 32].

In general, existing (traditional and new) algorithms assume that strong image variations are due to albedo changes and that smooth image variations are caused by shading. However, this assumption does not always hold for real images, as they may suffer from strong photometric changes due to environmental conditions such as cast shadows and inter-reflections. As a consequence, existing methods may fail to correctly distinguish strong (cast) shadows from albedo variations. Mistaking strong shadows for reflectance variations may negatively influence the quality of the resulting intrinsic image decomposition.

According to Lambert's law, shading is directly proportional to the cosine of the angle between the light source direction and the surface normal, which only models direct shading. Nonetheless, most intrinsic image decomposition methods assume a single unified shading component including all photometric effects. Assuming a single shading term, these methods may fail to distinguish strong (cast) shadows from albedo variations. Therefore, our aim is to represent the different illumination effects separately as direct (i.e. light source) and indirect light (i.e. ambient light and strong shadows) components. By explicitly modelling the photometric effects that may cause drastic changes in pixel values, extra cues are provided to the albedo map estimation for a better disentanglement of color changes from those strong intensity variations.

In this paper, instead of decomposing images into shading and reflectance only, we propose to decompose the shading component into three separate components to represent the different photometric effects. To this end, the body term of the dichromatic reflection model [30] is extended to decompose the shading component into direct (light source) and indirect light conditions (ambient light and cast shadows). Based on the fine-grained model, two different end-to-end deep convolutional neural networks (CNNs) are proposed. Furthermore, surface normal features are jointly learned by the proposed CNN networks. Surface normals are expected to assist (1) the shading prediction as they are part of the shading formation process and (2) the reflectance prediction as they are invariant to photometric effects.

Supervised deep learning methods heavily depend on the availability of large-scale datasets with ground-truth annotations. Recently, supervised intrinsic image decomposition research has been advanced by the availability of large-scale datasets [5, 19, 24, 32, 23]. A drawback is that these large-scale datasets are limited in the number of different photometric effects (e.g. cast shadows): they are either object centered, focus solely on indoor scenes, or are limited by sparse annotations. On the other hand, outdoor scenes usually contain complex shading and shadow patterns. Additionally, the lighting conditions can be very diverse in outdoor scenes. Lately, [3, 28] introduced a dataset of natural environments (NED) under varying illumination conditions containing dense intrinsic image decomposition labels, surface normals, light source properties, semantic segmentation and optical flow maps. In this paper, we extend a subset of that dataset to generate direct shading (shading due to surface geometry and illumination conditions), cast shadows, and ambient light (inter-reflections) maps. The dataset includes more than 30K images with corresponding intrinsic components and surface normal ground-truths. It is the first available large-scale dataset providing dense photometric maps.

In summary, our contributions are: (1) a reflection model is used to represent fine-grained shading decomposition, (2) two end-to-end supervised CNN models are proposed exploiting the fine-grained model, (3) an analysis is provided on the gains of the different shading decompositions, (4) an analysis is given of the contribution of surface normals which are jointly learned by the CNNs, and (5) the creation of a dataset of scene-level synthetic images of outdoor natural environments with fine-grained intrinsic image and surface normal ground-truths.

II Related Work

Intrinsic image decomposition is an ill-posed and under-constrained problem. Traditional work usually aims to constrain the search space by imposing priors on the intrinsic components. One of the pioneering works is the Retinex algorithm [21]. It assumes that reflectance changes cause large gradients, whereas shading variations result in smaller ones. Since then, many priors have been introduced to approach the problem, such as reflectance sparsity [11], texture [31] and depth [7]. It has also been shown that using image sequences (video) is favorable for intrinsic image decomposition, as it imposes a constant reflectance prior and varying shading over the same pixels within the sequence [35]. On the other hand, recent research uses large-scale datasets and supervised CNNs. [27] is the first work that directly predicts albedo and shading maps from images. Since then, many deep learning based methods have been proposed. For instance, [32] exploits the correlation between the intrinsic components, [4] considers both a physics-based reflection model and intrinsic gradient supervision to steer the learning process, and [8] introduces a Laplacian pyramid inspired neural network architecture to exploit scale space properties. Finally, [22, 25] exploit image sequences over time to constrain the reflectance within a deep learning framework.

Surface normals are used by a number of intrinsic image decomposition methods as they provide valuable information about an object’s geometry. For example, [18] computes low-dimensional, neighborhood-preserving embeddings by using surface normals to ensure both local and global shading consistency. [7] uses surface normal information to model the spatial and angular coherence of direct illumination. Recently, for face intrinsics, surface normal features are considered by SfSNet to achieve inverse rendering [29]. However, to the best of our knowledge, no existing work provides a thorough analysis on the influence of surface normals on the intrinsic image decomposition problem.

Most intrinsic image decomposition algorithms represent shading as one unified component including all photometric effects. However, there are a number of optimization-based methods that disentangle the shading problem by performing additional decompositions. For instance, [1] presents a model that recovers shape, albedo and chromatic illumination. [20] proposes a model that not only separates albedo from illumination, but also factorizes the illumination into a sun, a sky and an indirect layer. [7] decomposes an image into albedo, shading, direct irradiance, indirect irradiance and illumination color by using a multiplicative model that prevents further decomposition of cast shadows and ambient light. Moreover, their method requires all components to be individually regularized. On the other hand, there are a few deep learning based methods that perform additional decompositions. [17] decomposes single images into albedo, shape, and lighting conditions. However, instead of modelling the photometric effects, they approximate the shading process of a rendering engine, which again aims at estimating a unified shading component. Finally, [14] decomposes an object centered image into albedo, occlusion, diffuse illumination, specular shading and surface normals for user friendly photo editing. In contrast to existing methods, we propose to decompose (scene-level) shading into direct and indirect shading terms to model the different photometric effects, such as shading caused by object geometry, ambient light and shadows, without any specialized regularization.

Both traditional and recent approaches may fail to disentangle different photometric effects such as surface albedo transitions and strong cast shadows [1, 4, 7, 11, 27, 32]. In fact, [16] shows that intrinsic image decomposition methods perform relatively poorly in detecting cast shadows. To separate these photometric effects (i.e. direct light, ambient light, and cast shadows), in this paper, the body reflection term [30] is extended to incorporate a shading component which is decomposed into direct and indirect light. Then, based on the extended image formation model, we propose to represent the photometric changes separately by using end-to-end supervised CNN models.

Recently, supervised CNN methods use large-scale datasets [3, 24, 25, 32]. Outdoor scenes are frequently influenced by cast shadows and varying lighting conditions. Unfortunately, existing large-scale datasets lack variation in these types of photometric effects: except for [3, 28], other datasets are either object centered or taken from indoor scenes. The dataset of [3, 28] contains natural (outdoor) environments under varying illumination with intrinsic ground-truths. We extend a subset of this dataset to generate direct shading (shading due to surface geometry and illumination conditions), cast shadows, and ambient light (inter-reflections) ground-truth images.

III Image Formation Models

III-A Standard Image Formation Model

We use the diffuse (Lambertian) component of the dichromatic reflection model [30] as the basis of our image formation model. Then, an image I over the visible spectrum is modelled by:

I = m(\vec{n}, \vec{s}) \int_{\lambda} f(\lambda)\, e(\lambda)\, \rho(\lambda)\, d\lambda ,    (1)

where \vec{n} indicates the surface normal, \vec{s} denotes the light source direction, and m is a function of the geometric dependencies (e.g. Lambertian m = \vec{n} \cdot \vec{s}). Furthermore, \lambda represents the wavelength, f(\lambda) indicates the camera spectral sensitivity, and e(\lambda) describes the spectral power distribution of the illuminant. Finally, \rho(\lambda) denotes the reflectance, i.e. the albedo (intrinsic color). Then, assuming a linear sensor response, a single light source, and narrow band filters, the equation can be simplified as follows:

I = S \times R ,    (2)

where an image I is modelled as the element-wise product of its unified shading S and reflectance R components. If the light source is colored, then the color information is embedded in the illumination (shading) component. In general, in the context of intrinsic image decomposition, the shading component is only defined for direct light (i.e. no occlusion) as follows:

S(\vec{n}, \vec{s}) = e_d\, (\vec{n} \cdot \vec{s}) ,    (3)

where e_d is the intensity of the light source. Obviously, Equation (3) does not include photometric effects such as ambient light or cast shadows. However, this assumption is often violated for real images. To compute intrinsic images, explicitly modelling these photometric effects may help to correctly distinguish strong (cast) shadows from albedo variations.
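
For concreteness, the direct shading term of Equation (3) can be computed per pixel as in the following minimal NumPy sketch; the function name, the assumption of unit-length normals, and the clamping of back-facing surfaces to zero are ours and not taken from the paper:

```python
import numpy as np

def direct_shading(normals, light_dir, e_d=1.0):
    """Lambertian direct shading (Eq. 3): S_d = e_d * (n . s), clamped at 0.

    normals:   H x W x 3 array of unit surface normals.
    light_dir: length-3 unit vector pointing towards the light source.
    e_d:       scalar light source intensity.
    """
    light_dir = np.asarray(light_dir, dtype=np.float64)
    cos_angle = np.einsum("hwc,c->hw", normals, light_dir)   # n . s per pixel
    return e_d * np.clip(cos_angle, 0.0, None)                # no negative shading
```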

III-B Extended Image Formation Model with Composite Shading

To incorporate the photometric effects of ambient light and (cast) shadows, two terms are added to Equation (2):

I = \big( e_d\,(\vec{n} \cdot \vec{s}) + e_a\,(\vec{n} \cdot \vec{a}_1) - e_{cs}\,(\vec{n} \cdot \vec{a}_2) \big) \times R ,    (4)

where e_d is the intensity of direct light as defined by Equation (3), and \vec{a}_1 and \vec{a}_2 are the ambient light directions (if any). If we assume that the ambient light is uniform (not directional), then it is independent of the ambient light direction:

I = \big( e_d\,(\vec{n} \cdot \vec{s}) + e_a - e_{cs} \big) \times R .    (5)

The indirect light consists of ambient light, denoted by e_a, resulting in an additive term. Shadows are modelled by a negative term e_{cs}. Ambient light (e_a) causes objects to appear brighter, whereas shadows (e_{cs}) cause objects to appear dimmer. Then, the intensity of the composite lighting effects is given by e_c = e_d\,(\vec{n} \cdot \vec{s}) + e_a - e_{cs}. Finally, we obtain the composite shading model:

I = S_c \times R = (S_d + S_a - S_{cs}) \times R ,    (6)

where the fine-grained shading component S_c distinguishes the three photometric effects (direct shading S_d, ambient light S_a, and cast shadows S_{cs}) according to Equation (4) or Equation (5).
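
A minimal sketch of how an image would be re-composed under Equations (5) and (6), assuming H x W per-pixel shading maps and an H x W x 3 albedo; the function and variable names are illustrative:

```python
import numpy as np

def compose_image(albedo, s_direct, s_ambient, s_shadow):
    """Composite shading model (Eq. 5/6), applied element-wise:

        S_c = S_d + S_a - S_cs
        I   = S_c * R
    """
    s_composite = s_direct + s_ambient - s_shadow    # fine-grained shading (Eq. 6)
    return albedo * s_composite[..., None]            # broadcast shading over RGB
```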

IV ShadingNet

Using the image formation model of Equation (6), we propose two different modifications that can be applied to any regular encoder-decoder type CNN architecture designed for the standard intrinsic image decomposition task (simultaneous estimation of the intrinsics of Equation (2)). Both modifications yield end-to-end trainable encoder-decoder CNN models, ShadingNets. Figure 1 illustrates the different models. First, we extend the shading decoder to contain multiple outputs for the photometric effects (intrinsic modification). Secondly, we extend the entire architecture by adding extra decoder blocks for each photometric effect (extrinsic modification). We show the effectiveness of the extensions by modifying ShapeNet [32], a state-of-the-art architecture specifically engineered for intrinsic image decomposition. We use this model because it is lightweight and the cross links in the decoders further enforce correlations between intrinsic components. Both modifications can be directly applied to any encoder-decoder type of network. For the evaluation phase, all models are trained including the image formation loss (IMF), which also includes the shading formation process.

Fig. 1: On the left, a standard encoder-decoder architecture (Eq. (2)); in the middle, the intrinsic modification with SE blocks [13]; on the right, the extrinsic modification with extra decoders. S_d denotes direct shading, S_cs cast shadows, S_a ambient light, S_u unified shading, and R albedo.

IV-A Intrinsic Modification

We extend the shading decoder to generate multiple outputs for the different photometric effects. To this end, we incorporate Squeeze-and-Excitation (SE) blocks [13] into the shading decoder module. The motivation is that, for the standard intrinsic image decomposition task, shading is taken as a single, unified component including all photometric effects. The shading decoder thus includes all shading features, which can be further decomposed into the photometric effects. Therefore, we integrate SE blocks at the end of the shading decoder to perform feature re-calibration. By using SE blocks, predictions of the photometric cues are conditioned on one unified shading decoder, enhancing feature discriminability.
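
A rough PyTorch sketch of this idea, assuming the shared shading decoder already produces a feature map; the class name, channel width and reduction ratio are illustrative and not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SEOutputHead(nn.Module):
    """One Squeeze-and-Excitation [13] re-calibration followed by a 1x1
    prediction layer; three such heads on the same shading features would
    yield direct shading, ambient light and cast shadow outputs."""

    def __init__(self, channels=16, reduction=4, out_channels=1):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.predict = nn.Conv2d(channels, out_channels, kernel_size=1)

    def forward(self, shading_features):
        b, c, _, _ = shading_features.shape
        weights = self.excite(self.squeeze(shading_features).view(b, c))
        recalibrated = shading_features * weights.view(b, c, 1, 1)
        return self.predict(recalibrated)
```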

IV-B Extrinsic Modification

We extend the entire architecture by adding extra decoder blocks per photometric effect. As a result, the architecture has 1 encoder and 4 distinct decoders: for albedo, direct shading, cast shadow and ambient light predictions. Unlike the intrinsic modification, shading features are not shared within one decoder. In this way, the gradient flow from the separate decoder blocks individually boosts the feature discriminability. Furthermore, we follow the design of ShapeNet [32] and interconnect all the decoder blocks with each other to reinforce joint learning of features.
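
The structure can be sketched as follows (a skeleton only, with hypothetical names; the cross links between decoder blocks used by the actual model are omitted for brevity):

```python
import torch.nn as nn

class MultiDecoderNet(nn.Module):
    """Extrinsic modification skeleton: one shared encoder and four decoders
    (albedo, direct shading, cast shadows, ambient light)."""

    def __init__(self, encoder: nn.Module, make_decoder):
        super().__init__()
        self.encoder = encoder
        self.decoders = nn.ModuleDict(
            {name: make_decoder() for name in
             ("albedo", "direct_shading", "cast_shadow", "ambient_light")})

    def forward(self, image):
        latent = self.encoder(image)
        return {name: decoder(latent) for name, decoder in self.decoders.items()}
```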

IV-C Joint Learning of Surface Normal Features

We also explore the usage of surface normals. The surface normals are expected to assist (1) the shading prediction as they are part of the shading formation process and (2) the reflectance prediction as they are invariant to photometric effects. We explore two different ways to utilize surface normals as an input to a network. First, we use a single encoder with a 6-channel input: color image and surface normal ground-truths are concatenated and fed to the network for joint learning (joint learning by early fusion). Second, we use one separate encoder per input source. Then, the latent representations of both branches are combined to create a joint representation of the image (joint learning by intermediate fusion).
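
The two fusion strategies can be summarized with the following sketch (hypothetical helper names; the actual fusion layer of the network is described in Section IV-D):

```python
import torch

def early_fusion(rgb, normals):
    """Early fusion: a single encoder receives a 6-channel input
    (RGB image concatenated with the surface normal map)."""
    return torch.cat([rgb, normals], dim=1)                  # B x 6 x H x W

def intermediate_fusion(rgb_latent, normal_latent, fuse):
    """Intermediate fusion: each modality has its own encoder; the latent
    codes are concatenated and fused (e.g. by a 1x1 convolution) into one
    joint representation that feeds the decoders."""
    return fuse(torch.cat([rgb_latent, normal_latent], dim=1))
```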

IV-D Network Details

Following the design of ShapeNet [32], the initial encoder block takes an input image of 3 channels (RGB) and produces 16 feature maps. Then, the number of feature maps is doubled for each following convolution block until the penultimate block, so that the bottleneck has 256 feature maps. Convolution kernels have a size of 3×3 with a stride of 2 (downsampling). On the decoder side, the encoder part is mirrored. Furthermore, for all blocks, the convolutions are followed by batch normalization [15] and ReLU activation. Decoder convolutions have a size of 3×3 with a stride of 1. Bilinear upsampling is used to recover the sizes of the feature maps. Moreover, before applying a convolution in a decoder block, the corresponding encoder block is concatenated channel-wise with the output from the previous block as a skip connection [26]. For the cases where surface normals are used as input, the encoder part is replicated. All the block parameters remain the same. After the bottleneck, the features are concatenated together, fused with a 1×1 convolution, and fed into the decoder blocks. Skip connections are used for both encoders. Finally, the Squeeze-and-Excitation blocks use a fixed reduction ratio, and are followed by a final convolution to obtain the estimations.
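
A sketch of one encoder and one decoder block under these specifications (channel widths and padding are our assumptions; the exact ShapeNet [32] configuration may differ):

```python
import torch.nn as nn

def encoder_block(in_ch, out_ch):
    """3x3 convolution with stride 2 (downsampling), batch norm, ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

def decoder_block(in_ch, out_ch):
    """Bilinear upsampling followed by a 3x3 stride-1 convolution; the caller
    concatenates the mirrored encoder features (skip connection) into the
    input channels before applying this block."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
```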

IV-E Training Details

All kernel weights are initialized using a normal distribution. To train the models, the Adadelta optimizer [37] is used with a decaying learning rate. Moreover, the input and output image sizes are fixed to 256×256 pixels and the intensities are normalized to a fixed range. Images that are not 256×256 are resized to this resolution for all the models in all the experiments for a fair comparison. A fixed batch size is used and the networks are trained for 35 epochs.

IV-F Loss Functions

IV-F1 Intrinsic Component Reconstruction Loss

We train our models using the mean squared error (MSE) together with its scale-invariant version (SMSE). Nowadays, this is the standard reconstruction loss for the intrinsic image decomposition task, as the combination yields better results than using just MSE. Let J be the ground-truth intrinsic image and \hat{J} be the estimation of the network. Then, MSE is defined as:

\mathcal{L}_{MSE}(J, \hat{J}) = \frac{1}{n} \sum_{x, c} \| J_{x,c} - \hat{J}_{x,c} \|_2^2 ,    (7)

where c is the color channel index, x denotes the pixel coordinate, and n is the total number of valid pixels. Then, SMSE first scales \hat{J} and then compares its MSE with J:

\alpha = \operatorname{argmin}_{\alpha} \; \mathcal{L}_{MSE}(\alpha \hat{J}, J) ,    (8)
\mathcal{L}_{SMSE}(J, \hat{J}) = \mathcal{L}_{MSE}(\alpha \hat{J}, J) .    (9)

Then, we combine MSE and SMSE into one loss function \mathcal{L}_{CL}. Thus, to evaluate the quality of the estimation of an intrinsic component, the loss becomes:

\mathcal{L}_{CL}(J, \hat{J}) = \gamma_{SMSE}\, \mathcal{L}_{SMSE}(J, \hat{J}) + \gamma_{MSE}\, \mathcal{L}_{MSE}(J, \hat{J}) ,    (10)

where the \gamma s are the loss weights. For the experiments, we follow common practice and set the two loss weights to 0.95 and 0.05. Then, for fine-grained intrinsics, one such loss is assigned to each intrinsic component, yielding 4 distinct loss functions (albedo, direct shading, ambient light and cast shadows). In the end, all the loss functions are added up without any weight tuning (all the weights are set to 1).
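
The reconstruction loss of Equations (7)–(10) can be sketched as below (PyTorch, with the closed-form scale for SMSE; the assignment of 0.95 to the SMSE weight is our assumption, since the text only states the two values):

```python
import torch

def mse_loss(pred, gt):
    """Eq. (7): mean squared error over all pixels and channels."""
    return torch.mean((pred - gt) ** 2)

def smse_loss(pred, gt):
    """Eq. (8)-(9): scale-invariant MSE; the prediction is rescaled by the
    closed-form alpha that minimises the MSE against the ground truth."""
    alpha = (pred * gt).sum() / (pred * pred).sum().clamp(min=1e-8)
    return mse_loss(alpha * pred, gt)

def reconstruction_loss(pred, gt, w_smse=0.95, w_mse=0.05):
    """Eq. (10): weighted combination of SMSE and MSE (weight assignment
    assumed). One such loss is used per intrinsic component."""
    return w_smse * smse_loss(pred, gt) + w_mse * mse_loss(pred, gt)
```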

IV-F2 Shading and Image Formation Loss

Here, we present the image formation loss (IMF) that is used for the final model, which also involves the shading formation process and loss. IMF compares the input image I with the image reconstructed from the predicted albedo and shading images (\hat{R} \times \hat{S}):

\mathcal{L}_{IMF}(I, \hat{R}, \hat{S}) = \mathcal{L}_{MSE}(I, \hat{R} \times \hat{S}) .    (11)

In the case of the extended image formation model, we do not predict one unified shading component, but we estimate fine-grained shading components (direct shading \hat{S}_d, ambient light \hat{S}_a and cast shadows \hat{S}_{cs}). As a result, to use the IMF, the unified shading needs to be reconstructed from the fine-grained shading components. Thus, the reconstructed unified shading \hat{S}_u is defined as:

\hat{S}_u = \hat{S}_d + \hat{S}_a - \hat{S}_{cs} .    (12)

Then, the image formation loss for the extended image formation model (\mathcal{L}_{eIMF}) becomes:

\mathcal{L}_{eIMF}(I, \hat{R}, \hat{S}_d, \hat{S}_a, \hat{S}_{cs}) = \mathcal{L}_{MSE}(I, \hat{R} \times \hat{S}_u) ,    (13)

which also embeds the shading formation process and loss by nature. For the image formation loss, only MSE is used (no SMSE). Then, for the final model that uses the image formation loss, we essentially add \mathcal{L}_{eIMF} to the overall loss with a weight factor of 0.05.
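
A self-contained sketch of the extended image formation loss (Equations (12)–(13)); tensor shapes are assumed to broadcast, e.g. B x 1 x H x W shading maps against a B x 3 x H x W albedo:

```python
import torch

def extended_image_formation_loss(image, albedo, s_direct, s_ambient, s_shadow):
    """Rebuild unified shading from the fine-grained components (Eq. 12),
    re-render the image with Eq. (2), and compare it with the input using
    plain MSE (Eq. 13, no scale-invariant term)."""
    shading_unified = s_direct + s_ambient - s_shadow    # Eq. (12)
    reconstruction = albedo * shading_unified             # Eq. (2)
    return torch.mean((reconstruction - image) ** 2)      # Eq. (13)
```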

Fig. 2: Sample scene from NED with ground-truth intrinsics and fine-grained shading components.

V Dataset

V-A Natural Environments Dataset (NED)

To train our models and baselines, we extend a subset of the (synthetic) Natural Environment Dataset (NED) introduced by [3, 28] to generate albedo, surface normal, direct shading (shading due to surface geometry and illumination conditions), cast shadow, and ambient light (inter-reflections) ground-truth images. The dataset contains garden/park like natural (outdoor) scenes including trees, plants, bushes, fences, etc. Furthermore, scenes are rendered with different types of terrains, landscapes, and lighting conditions. Additionally, real HDR sky images with a parallel light source are used to provide realistic ambient light. Moreover, light source properties are designed to model daytime lighting conditions to enrich the photometric effects. Figure 2 illustrates a sample scene from the dataset with dense ground-truth annotations. For the experiments, the dataset is randomly split by scene, resulting in 15 gardens for training (around 25k images) and 3 gardens for testing (around 6k images).

V-B Fine-grained Shading Rendering Pipeline

From NED, we randomly pick 15 garden models and re-render them to obtain fine-grained shading components with dense ground-truth annotations. Scenes are rendered using the physics-based Blender Cycles engine (https://www.blender.org/). The rendering pipeline is modified to output albedo and (unified) shading ground-truth intrinsic images, ground-truth surface normal images, and light source properties (color, position, and intensity). Then, we use Lambert's law to form the direct shading (S_d) component:

S_d = e_d\, c_d\, (\vec{n} \cdot \vec{s}) ,    (14)

where e_d is the intensity of the light source, c_d is the color of the light source, \vec{n} is the surface normal, and \vec{s} denotes the light source position (direction). Since the Blender Cycles engine is modified to output surface normals and light source properties, the ground-truth direct shading component can be created using the above equation.

The next step is to create the ground-truth indirect light effects (i.e. ambient light and cast shadows). For this task, we use the ground-truth unified shading component, which is already made available by the rendering engine. Ambient light adds extra light on top of the direct shading, while cast shadows cause a reduction in intensity values. As a result, by subtracting the direct shading ground-truth from the unified shading ground-truth, we are left with the indirect light effects that are modelled on top of the direct shading component. After the subtraction, the resulting component has both positive (due to extra indirect light) and negative (due to lack of direct light) pixel values. We classify positive values as ambient light, whereas negative values are classified as cast shadows. In the end, Equation (12) is satisfied, and the input image is created by element-wise multiplying the unified shading and albedo components to obtain a composite image that follows the physics-based image formation model. Finally, it is also worth mentioning that in our case a pixel is not classified as either in shadow or not, but has a continuous value. In that sense, umbra and penumbra regions can also be observed in the shadow map depending on the intensity. Nonetheless, the formulation can be modified to produce a binary shadow map if desired.
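
The splitting step can be expressed compactly as follows (a NumPy sketch of the procedure described above; function and variable names are ours):

```python
import numpy as np

def split_indirect_light(shading_unified, shading_direct):
    """Derive ambient light and cast shadow ground truths from the rendered
    unified shading: the positive residual becomes ambient light, the
    negative residual becomes cast shadows. Values stay continuous, so
    umbra and penumbra regions are preserved."""
    residual = shading_unified - shading_direct
    ambient = np.clip(residual, 0.0, None)     # extra indirect light
    shadow = np.clip(-residual, 0.0, None)     # missing direct light
    # Consistency with Eq. (12): S_u = S_d + S_a - S_cs
    assert np.allclose(shading_unified, shading_direct + ambient - shadow)
    return ambient, shadow
```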

VI Experiments and Evaluation

We perform experiments on NED [3, 28] and MPI Sintel [6], both datasets of synthetic outdoor scenes with completely different rendering processes, and on the 3DRMS [28] dataset of real world outdoor garden scenes. All reported models are trained on the extended NED for a fair comparison. We report the mean squared error (MSE), where the absolute brightness of each image is adjusted to minimize the error, the local mean squared error (LMSE) with a window size of 20, and the structural dissimilarity index (DSSIM) for perceptual visual quality comparison.
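
The scale-adjusted MSE and LMSE can be sketched as follows (a simplified NumPy variant: a single global brightness scale and non-overlapping 20-pixel windows; the original protocol of [12] may use overlapping windows):

```python
import numpy as np

def scaled_mse(pred, gt):
    """MSE after adjusting the absolute brightness of the prediction
    with a single global scale that best matches the ground truth."""
    alpha = (pred * gt).sum() / max((pred * pred).sum(), 1e-8)
    return float(((alpha * pred - gt) ** 2).mean())

def lmse(pred, gt, window=20):
    """Local MSE: scaled MSE averaged over local windows."""
    h, w = gt.shape[:2]
    errors = [scaled_mse(pred[y:y + window, x:x + window],
                         gt[y:y + window, x:x + window])
              for y in range(0, h, window) for x in range(0, w, window)]
    return float(np.mean(errors))
```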

VI-A Influence of Surface Normals

In this experiment, we provide ablation studies to evaluate the influence of surface normals as an extra source of information for intrinsic image decomposition. We train the ShapeNet [32] architecture with modifications to the input branch by using NED. First, we use a single encoder with a 6-channel input; the color image and the surface normal ground-truth image are concatenated and used as the input (early fusion). Second, one separate encoder is used per input source to create a joint image representation (intermediate fusion). In this case, skip connections [26] are used for both encoders. We compare the results with the baseline that only uses one encoder branch with color images as input. All networks produce one reflectance map and one unified shading map. For the experiments, ground-truth surface normals are used. The results are summarized in Table I.

ShapeNet [32]                 MSE                  LMSE                 DSSIM
                              Albedo    Shading    Albedo    Shading    Albedo    Shading
RGB only                      0.0055    0.0051     0.0875    0.1309     0.2931    0.2936
Intermediate Fusion (-)       0.0054    0.0048     0.0636    0.0987     0.2917    0.2765
Early Fusion (+)              0.0044    0.0056     0.0596    0.1102     0.3280    0.3285
Intermediate Fusion (+)       0.0043    0.0035     0.0581    0.0854     0.2502    0.2500

TABLE I: Influence of surface normals on the accuracy of intrinsic images. The intermediate fusion strategy appears to be the best approach as it outperforms the single RGB-only encoder. (+) denotes ground-truth and (-) denotes predicted surface normals.

Table I shows that ground-truth surface normals provide strong cues for the intrinsic image decomposition task. However, the early fusion strategy (even with ground-truth normals) yields a decline in performance for half of the measures. As a result, simply concatenating surface normals to the network input is not helpful. In contrast, intrinsic image decomposition highly benefits from surface normals as an additional input when using the intermediate fusion strategy, i.e. assigning each input source its own encoder (with skip connections).

The ablation study demonstrates the positive influence of surface normals when modelled carefully. However, in practice, accurate surface normals are never available. Even if a depth sensor is available, normals are still noisy and may not cover the whole image. To this end, we train an encoder-decoder VGG16 [33] network to predict surface normals and use them instead of the ground-truths, to observe the influence of predicted surface normals and to simulate real world cases.

For the training, we use a similar strategy as [36]: a combination of L1, SMSE and angular error. All three errors are equally weighted with 1 without any tuning. Table II provides quantitative results for the performance of the surface normal prediction network as a reference for quality assessment. For this, we follow the evaluation criteria introduced in [10]. Mean, median, and RMSE are provided as the difference in degrees (the lower the better), and the 11.25°, 22.5° and 30° intervals count the percentage of pixels within the specified angle thresholds (the higher the better). The network achieves decent error levels.

TABLE II: Quantitative evaluation of the surface normal prediction on the garden test split.
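
A sketch of the surface normal training loss described above (equal weights; tensor shapes and clamping are our assumptions):

```python
import torch
import torch.nn.functional as F

def surface_normal_loss(pred, gt):
    """Equally weighted L1, scale-invariant MSE and angular error for
    B x 3 x H x W (roughly unit-length) normal maps."""
    l1 = F.l1_loss(pred, gt)
    alpha = (pred * gt).sum() / (pred * pred).sum().clamp(min=1e-8)
    smse = F.mse_loss(alpha * pred, gt)                       # scale-invariant MSE
    cos = F.cosine_similarity(pred, gt, dim=1).clamp(-1.0, 1.0)
    angular = torch.acos(cos).mean()                           # mean angle (radians)
    return l1 + smse + angular
```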

For the intrinsic image decomposition task, the results denoted by (-) in Table I show that better performance than the RGB-only baseline is achieved even with estimated normals.

The results show that intrinsic image decomposition benefits from surface normals as an additional input using the intermediate fusion strategy (joint learning). Therefore, we provide ShadingNet modifications that include a surface normal encoder employing the intermediate fusion strategy, so that surface normals are jointly learned with the intrinsic images. A similar strategy was successfully applied for face intrinsics by SfSNet to achieve inverse rendering [29]. Before integrating the surface normal encoder branch, the VGG16 model is first pre-trained for 150k iterations for stabilization. In this way, instead of using an external surface normal estimator, the process is embedded in our solution in an end-to-end fashion. In the next sections, ShadingNets using surface normals based on joint learning are denoted by (*).

VI-B Intrinsic vs. Extrinsic Modification

In this experiment, we evaluate the architectural modifications. We train the ShadingNet architectures with the intrinsic modification, where a single shading decoder contains multiple outputs for the photometric effects with SE blocks, and the extrinsic modification, where photometric effects are estimated via individual decoder blocks. Modifications with surface normal features using the intermediate fusion strategy are denoted by (*). The results are summarized in Table III.

                      MSE                  LMSE                 DSSIM
                      Albedo    Shading    Albedo    Shading    Albedo    Shading
ShapeNet [32]         0.0055    0.0051     0.0875    0.1309     0.2931    0.2936
Intrinsic M.          0.0059    0.0052     0.0667    0.1143     0.3200    0.3158
Extrinsic M.          0.0057    0.0052     0.0667    0.1133     0.3000    0.3158
Intrinsic M. (*)      0.0046    0.0053     0.0743    0.1366     0.2122    0.2822
Extrinsic M. (*)      0.0045    0.0050     0.0630    0.1100     0.1990    0.2721

TABLE III: Influence of the intrinsic and extrinsic modifications. (*) denotes the model with surface normal features. Both modifications improve albedo estimations due to the presence of extra photometric cues.

The results show that albedo estimations are improved by the proposed fine-grained shading model. Both the intrinsic and the extrinsic modification improve the reflectance results. Thus, intrinsic image decomposition benefits from the fine-grained shading decomposition. Further, the contribution of surface normal features is more significant here than for the standard Lambertian model: estimating additional shading components significantly increases the number of parameters to estimate, and the surface normal features act as a regularizer to constrain the predictions.

To evaluate the accuracy of the shading component for both the intrinsic and the extrinsic modification, the 3 photometric effects are combined to form the unified shading. For the Lambertian model, there is only one shading component estimated, which may explain the drop in some of the shading results for the fine-grained shading models.

Additionally, the quality of the photometric effects and the influence of the surface normals are presented for both modifications in Table IV. Finally, we present visual results of ShadingNet with both modifications for the fine-grained shading components on NED. Figure 3 shows cast shadow predictions. Both models can differentiate most of the shadow cues. In general, EM estimates proper shadow maps. However, IM sometimes has difficulties in distinguishing strong cast shadows from objects with very dark albedo; see the second row of the figure. Figure 4 provides direct shading predictions. Both models can differentiate most of the direct shading cues. As in the case of the cast shadow prediction, EM estimates are better than IM ones. Finally, Figure 5 illustrates ambient light predictions. Both models can differentiate ambient light cues. As in the previous cases, EM estimates are better than IM ones.

                      Intrinsic M.            Extrinsic M.
                      SN (+)      SN (-)      SN (+)      SN (-)
Cast Shadow           0.0261      0.0273      0.0258      0.0270
Ambient Light         0.0021      0.0024      0.0021      0.0024
Direct Shading        0.0375      0.0389      0.0351      0.0381

TABLE IV: Quality of the direct and indirect shading effects. SN (+) denotes the model with surface normal prediction, SN (-) the model without.

Table IV shows that surface normal features positively influence the quality of the fine-grained shading intrinsics. The extrinsic model with individual decoders appears to be the better option for direct shading and the photometric effects, and appears to differentiate intrinsic cues better.

Fig. 3: Qualitative results for cast shadow estimations. IM is for intrinsic modification, EM is for extrinsic modification. Both models can differentiate shadow cues. EM appears better.
Fig. 4: Qualitative results for direct shading estimations. IM is for intrinsic modification, EM is for extrinsic modification. Both models can differentiate direct shading cues. EM appears better.
Fig. 5: Qualitative results for ambient light estimations. IM is for intrinsic modification, EM is for extrinsic modification. Both models can differentiate ambient light cues from other photometric effects. EM appears better.

VI-C Influence of Fine-Grained Shading

In this experiment, we evaluate our (additive) extended image formation model. We compare our model with [7], as it is the only work that performs additional shading decompositions with a specialized image formation model. To that end, we train the ShadingNet architecture with the multiplicative image formation model of [7] and compare it with our model. Both setups have 3 decoders: reflectance, direct shading, and composite indirect shading. The results are summarized in Table V.

                        MSE                  LMSE                 DSSIM
                        Albedo    Shading    Albedo    Shading    Albedo    Shading
Multiplicative [7]      0.0057    0.0292     0.0791    0.6989     0.2195    0.3943
Additive (ours)         0.0045    0.0050     0.0630    0.1100     0.1990    0.2721

TABLE V: Influence of the extended image formation model for fine-grained shading. Our (additive) model yields better results than the (multiplicative) model of [7].

The results show that our (additive) model yields better results than the (multiplicative) model of [7]. The reason is that their shading estimates are very noisy: the multiplicative model yields unstable indirect shading values.

VI-D Synthetic Outdoor Images

In this section, we compare our method with the color version of Retinex [12] (a threshold-based traditional method), IIW [5] (using an optimization-based dense CRF), DirectIntrinsics [27] (the pioneering coarse-to-fine multi-scale CNN), and ShapeNet [32] with interconnections (all layers are connected to promote correlation between components). All models are trained on the same dataset (NED). Table VI shows the quantitative evaluation results (6000 images) and Figure 6 displays the visual comparison results for NED. In addition, Table VII shows quantitative evaluation results (890 images) for the MPI Sintel dataset.

Fig. 6: Qualitative results on NED. ShadingNets produce better reflectance images with minimal shadow leakage compared with the ShapeNet [32] baseline using the standard model.

                          MSE                  LMSE                 DSSIM
                          Albedo    Shading    Albedo    Shading    Albedo    Shading
Color Retinex [12]        0.0114    0.0193     0.1204    0.2334     0.328     0.3515
IIW [5]                   0.0095    0.0111     0.1343    0.1861     0.2098    0.3511
DirectIntrinsics [27]     0.0073    0.0065     0.1205    0.1798     0.3756    0.3843
ShapeNet [32]             0.0055    0.0051     0.0875    0.1309     0.2931    0.2936
ShadingNet (IM)           0.0059    0.0052     0.0667    0.1143     0.3200    0.3158
ShadingNet (EM)           0.0057    0.0052     0.0667    0.1133     0.3000    0.3158
ShadingNet (IM) (*)       0.0046    0.0053     0.0743    0.1366     0.2122    0.2822
ShadingNet (EM) (*)       0.0045    0.0050     0.0630    0.1100     0.1990    0.2721

TABLE VI: Quantitative results for NED. (*) denotes the model with surface normal features, IM denotes the intrinsic modification, and EM denotes the extrinsic modification.

                          MSE                  LMSE                 DSSIM
                          Albedo    Shading    Albedo    Shading    Albedo    Shading
Color Retinex [12]        0.0537    0.0617     0.0719    0.0665     0.2999    0.2646
IIW [5]                   0.0371    0.0388     0.0720    0.0656     0.2673    0.2360
DirectIntrinsics [27]     0.0269    0.0315     0.0607    0.0943     0.3140    0.2895
ShapeNet [32]             0.0366    0.0467     0.0648    0.0724     0.2556    0.2275
ShadingNet (IM)           0.0243    0.0287     0.0489    0.0644     0.2253    0.2011
ShadingNet (EM)           0.0227    0.0335     0.0477    0.0893     0.2215    0.2121
ShadingNet (IM) (*)       0.0277    0.0376     0.0686    0.0912     0.2440    0.2750
ShadingNet (EM) (*)       0.0220    0.0325     0.0451    0.0731     0.2210    0.2259

TABLE VII: Quantitative results for Sintel. (*) denotes the model with surface normal features, IM stands for the intrinsic modification and EM for the extrinsic modification.

Table VI shows that our proposed models outperform the other baselines on all metrics for the NED dataset. Table VII demonstrates the generalization ability of our models, as the ShadingNet models again obtain better albedo and shading results. Further, Figure 6 shows that ShadingNet with the extrinsic modification (extra decoders) obtains better reflectance images. Moreover, our models remove cast shadows, and shading leakage into the reflectance images is minimal.

VI-E In-the-wild Real World Outdoor Images

We present a comparison on a real world (outdoor) garden dataset [28] in Figure 7. Regions of interest are marked by red bounding boxes. The extrinsic modification produces better reflectance images with minimal shadow leakage, while the ShapeNet [32] baseline almost completely fails. To conclude, our approach, which is based on the extended shading model, is more stable and achieves better results than the baselines using the standard model for in-the-wild, complex, real world outdoor images. The extrinsic modification is capable of extracting strong cast shadows.

Finally, fine-grained shading estimations of the EM model for the real world garden data are presented in Figure 8. The shadow predictions detect strong shadows where the bushes meet the ground, as well as the complex shadow patterns of the bushes due to their intrinsic geometry. Further, the sky is completely ignored by the shadow predictions. Ambient light estimations mostly focus on the sky and the extra light reflected from the buildings in the background. The extended dataset excludes the sky from the inputs and the outputs, as it is not possible to acquire ground-truth values for the sky. Nonetheless, the results suggest that the network is capable of detecting the extra light present in scenes while ignoring shadow regions. That also shows the complementary relation between the ambient light and shadow predictions. Direct shading estimations yield smooth ground patterns, as the ground pixels share the same normals and light source direction. However, they are not as good at handling internal shadow leakages. Finally, the unified shading is obtained by adding up the 3 photometric components.

Fig. 7: Qualitative results on real world garden images. The extrinsic modification produces better reflectance images with minimal shadow leakage. The ShapeNet [32] baseline almost completely fails, producing artifacts.
Fig. 8: Qualitative results of the fine-grained shading components on real world garden images for the EM model. Shadow predictions detect strong shadows where the bushes meet the ground, and complex shadow patterns of the bushes due to their intrinsic geometry. Ambient light estimations mostly focus on the sky and extra reflected light from the buildings in the background. Direct shading estimations yield smooth ground patterns as they share the same normals and light source direction. Finally, the unified shading is obtained by adding up the 3 photometric components.

VI-F Significance of the Results

Finally, we conduct statistical significance tests to verify that our quantitative results are meaningful. The tests show that, for all metrics of all experiments, our improvements are significant at the 95% confidence level, except for the MSE of the shading maps. For the shading estimations, the 3 fine-grained photometric effects (direct shading, ambient light and cast shadows) are combined to form the unified shading, whereas for the Lambertian model there is only one shading component estimated, which may explain the drop in performance in some of the shading results for the fine-grained shading models. The significance of the results is even more compelling when evaluated visually. For instance, consider Figure 7, where the baseline completely fails on real world outdoor images.

VII Conclusion

Our aim was to improve intrinsic image decomposition quality by extending the standard (Lambertian) image formation model. To that end, we proposed to separate shading into direct shading (caused by object geometry and the light source) and indirect shading (shadows and ambient light). Two end-to-end supervised CNN models, ShadingNets, were proposed to exploit the fine-grained shading model. Further, surface normals were considered as an input to the models using joint feature learning. To train the models, a large-scale dataset of synthetic images of outdoor natural environments was extended with fine-grained intrinsic image and surface normal ground-truths.

The proposed models were evaluated on synthetic and real world in-the-wild images. The evaluation results show that intrinsic image decomposition highly benefits from (1) surface normals as an input to a CNN model and (2) the proposed fine-grained shading model. For almost all cases, ShadingNet with the extrinsic modification, where photometric effects are estimated via individual decoder blocks, achieves better results than the intrinsic modification, where a single shading decoder contains multiple outputs for the photometric effects. Both of our approaches outperform the existing unified shading methods. Moreover, visual analysis shows that the proposed method reduces the leakage of photometric effects into the reflectance images and appears more stable.

Acknowledgment

This project was funded by the EU Horizon 2020 program No. 688007 (TrimBot2020).

References

  • [1] J. T. Barron and J. Malik (2015) Shape, illumination, and reflectance from shading. IEEE Trans. on Pattern Analysis and Machine Intelligence, pp. 1670–1687.
  • [2] H. G. Barrow and J. M. Tenenbaum (1978) Recovering intrinsic scene characteristics from images. Computer Vision Systems, pp. 3–26.
  • [3] A. S. Baslamisli, T. T. Groenestege, P. Das, H. A. Le, S. Karaoglu, and T. Gevers (2018) Joint learning of intrinsic images and semantic segmentation. In European Conference on Computer Vision.
  • [4] A. S. Baslamisli, H. A. Le, and T. Gevers (2018) CNN based learning using reflection and retinex models for intrinsic image decomposition. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [5] S. Bell, K. Bala, and N. Snavely (2014) Intrinsic images in the wild. ACM Trans. on Graphics (TOG).
  • [6] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black (2012) A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision.
  • [7] Q. Chen and V. Koltun (2013) A simple model for intrinsic image decomposition with depth cues. In IEEE International Conference on Computer Vision.
  • [8] L. Cheng, C. Zhang, and Z. Liao (2018) Intrinsic image transformation via scale space decomposition. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [9] Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf (2018) Revisiting deep intrinsic image decompositions. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [10] D. F. Fouhey, A. Gupta, and M. Hebert (2013) Data-driven 3D primitives for single image understanding. In IEEE International Conference on Computer Vision.
  • [11] P. V. Gehler, C. Rother, M. Kiefel, L. Zhang, and B. Schölkopf (2011) Recovering intrinsic images with a global sparsity prior on reflectance. In Advances in Neural Information Processing Systems.
  • [12] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman (2009) Ground truth dataset and baseline evaluations for intrinsic image algorithms. In IEEE International Conference on Computer Vision.
  • [13] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [14] C. Innamorati, T. Ritschel, T. Weyrich, and N. J. Mitra (2017) Decomposing single images for layered photo retouching. Computer Graphics Forum, pp. 15–25.
  • [15] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning.
  • [16] C. Isaza, J. Salas, and B. Raducanu (2012) Evaluation of intrinsic image algorithms to detect the shadows cast by static objects outdoors. Sensors, pp. 13333–13348.
  • [17] M. Janner, J. Wu, T. D. Kulkarni, I. Yildirim, and J. B. Tenenbaum (2017) Self-supervised intrinsic image decomposition. In Advances in Neural Information Processing Systems.
  • [18] J. Jeon, S. Cho, X. Tong, and S. Lee (2014) Intrinsic image decomposition using structure-texture separation and surface normals. In European Conference on Computer Vision.
  • [19] B. Kovacs, S. Bell, N. Snavely, and K. Bala (2018) Shading annotations in the wild. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [20] P. Y. Laffont, A. Bousseau, and G. Drettakis (2013) Rich intrinsic image decomposition of outdoor scenes from multiple views. IEEE Transactions on Visualization and Computer Graphics, pp. 210–224.
  • [21] E. H. Land and J. J. McCann (1971) Lightness and retinex theory. Journal of the Optical Society of America, pp. 1–11.
  • [22] L. Lettry, K. Vanhoey, and L. van Gool (2018) Unsupervised deep single-image intrinsic decomposition using illumination-varying image sequences. In International Pacific Conference on Computer Graphics and Applications.
  • [23] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger (2018) InteriorNet: mega-scale multi-sensor photo-realistic indoor scenes dataset. In British Machine Vision Conference.
  • [24] Z. Li and N. Snavely (2018) CGIntrinsics: better intrinsic image decomposition through physically-based rendering. In European Conference on Computer Vision.
  • [25] Z. Li and N. Snavely (2018) Learning intrinsic image decomposition from watching the world. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [26] X. Mao, C. Shen, and Y. Yang (2016) Image restoration using very deep fully convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems.
  • [27] T. Narihira, M. Maire, and S. X. Yu (2015) Direct intrinsics: learning albedo-shading decomposition by convolutional regression. In IEEE International Conference on Computer Vision.
  • [28] T. Sattler, R. Tylecek, T. Brox, M. Pollefeys, and R. B. Fisher (2017) 3D reconstruction meets semantics - reconstruction challenge 2017. In IEEE International Conference on Computer Vision Workshop.
  • [29] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs (2018) SfSNet: learning shape, reflectance and illuminance of faces in the wild. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [30] S. Shafer (1985) Using color to separate reflection components. Color Research and Application, pp. 210–218.
  • [31] L. Shen, P. Tan, and S. Lin (2008) Intrinsic image decomposition with non-local texture cues. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [32] J. Shi, Y. Dong, H. Su, and S. X. Yu (2017) Learning non-Lambertian object intrinsics across ShapeNet categories. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [33] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
  • [34] T. Wada, H. Ukida, and T. Matsuyama (1995) Shape from shading with interreflections under proximal light source: 3D shape reconstruction of unfolded book surface from a scanner image. In IEEE International Conference on Computer Vision.
  • [35] Y. Weiss (2001) Deriving intrinsic images from image sequences. In IEEE International Conference on Computer Vision.
  • [36] Y. Yoon, G. Choe, N. Kim, J. Y. Lee, and I. S. Kweon (2016) Fine-scale surface normal estimation using a single NIR image. In European Conference on Computer Vision.
  • [37] M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.