Semantic segmentation of outdoor scenes is a challenging problem in computer vision. Variations in imaging conditions, such as shading, shadows, inter-reflections, and the color and intensity of the illuminant, may negatively influence the segmentation process. As image segmentation is the process of identifying and semantically grouping pixels, drastic changes in pixel values may hinder a successful segmentation. To address this problem, several methods have been proposed that mitigate the effects of illumination to obtain more robust image features for semantic segmentation [1, 2, 3, 4]. Unfortunately, these methods provide illumination invariance artificially through hand-crafted features. Instead of using narrow and specific invariant features, in this paper we focus on image formation invariance induced by a full intrinsic image decomposition.
Intrinsic image decomposition is the process of decomposing an image into its image formation components, such as albedo (reflectance) and shading (illumination). The reflectance component contains the true color of objects in a scene. In fact, albedo is invariant to illumination, while the shading component heavily depends on object geometry and illumination conditions in a scene. As a result, using reflectance images for the semantic segmentation task can be favorable, as they do not contain any illumination effects. Moreover, not only may segmentation benefit from reflectance, but segmentation may also be useful for reflectance computation. Information about an object reveals strong priors about its intrinsic properties. Each object label constrains the color distribution, and this constraint is expected to carry over to class-specific reflectance values. Therefore, distinct object labels provided by semantic segmentation can guide the intrinsic image decomposition process by yielding object-specific color distributions per label. Furthermore, semantic segmentation can act as an object boundary guidance map for intrinsic image decomposition by enhancing cues that differentiate between reflectance and occlusion edges in a scene. In addition, regions within an object segment that are homogeneous in color should have similar reflectance values. Therefore, in this paper, the tasks of semantic segmentation and intrinsic image decomposition are considered as a combined process by exploring their mutual relationship in a joint fashion.
To this end, we propose a supervised end-to-end convolutional neural network (CNN) architecture to jointly learn intrinsic image decomposition and semantic segmentation. The joint learning includes an end-to-end trainable encoder-decoder CNN with one shared encoder and three separate decoders: one for reflectance prediction, one for shading prediction, and one for semantic segmentation prediction. In addition to joint learning, we explore new cascade CNN architectures that use reflectance to improve semantic segmentation, and semantic segmentation to steer the process of intrinsic image decomposition.
To train the proposed supervised network, a large dataset is needed with ground-truth images for both semantic segmentation (i.e. class labels) and intrinsic properties (i.e. reflectance and shading). However, no such dataset exists. Therefore, we have created a large-scale dataset featuring plants and objects that are mostly found in natural environments, rendered under varying illumination conditions. The dataset is at scene level and provides ground truth for both intrinsic image decomposition and semantic segmentation. It contains 35K synthetic images with corresponding albedo and shading (intrinsics), as well as semantic labels (segmentation) assigned to each object/scene.
Our contributions are: (1) a CNN architecture for joint learning of intrinsic image decomposition and semantic segmentation, (2) analysis on the gains of addressing those two problems jointly, (3) new cascade CNN architectures for intrinsic-for-segmentation and segmentation-for-intrinsic, and (4) a very large-scale dataset of synthetic images of natural environments with scene level intrinsic image decomposition and semantic segmentation ground-truths.
2 Related Work
Intrinsic Image Decomposition. Intrinsic image decomposition is an ill-posed and under-constrained problem, since an infinite number of combinations of photometric and geometric properties of a scene can produce the same 2D image. Therefore, most of the work on intrinsic image decomposition considers priors about scene characteristics to constrain a pixel-wise optimization task. For instance, both  and  use non-local texture cues, whereas  and  constrain the problem with the assumption of sparsity of reflectance. In addition, the use of multiple images helps to resolve the ambiguity where the reflectance is constant and the illumination changes [10, 11]. Nonetheless, with the success of supervised deep CNNs [12, 13], more recent research on intrinsic image decomposition has shifted towards using deep learning.  is the first work that uses end-to-end trained CNNs to address the problem. They argue that the model should learn both local and global cues together with a multi-scale architecture. In addition,  proposes a model by introducing inter-links between decoder modules, based on the expectation that intrinsic components are correlated. Moreover,  demonstrates the capability of generative adversarial networks for the task. On the other hand, in more recent work,  considers an image formation loss together with gradient supervision to steer the learning process to achieve more vivid colors and sharper edges.
In contrast, our proposed method jointly learns intrinsic properties and segmentation. Additionally, the success of supervised deep CNNs depends not only on a successful model, but also on the availability of annotated data. Generating ground-truth intrinsic images is only possible in a fully-controlled setup and requires enormous effort and time. As a result, the most popular real-world dataset for intrinsic image decomposition includes only 20 object-centered images with their ground-truth intrinsics, which alone is not sufficient for deep learning. On the other hand,  presents scene-level real-world relative reflectance comparisons over point pairs of indoor scenes. However, it does not include ground-truth intrinsic images. The most frequently used scene-level synthetic dataset for intrinsic image decomposition is the MPI Sintel Dataset. It provides around a thousand cartoon-like images with their ground-truth intrinsics. Therefore, a new dataset is created consisting of 35K synthetic (outdoor) images with 16 distinct object types/scenes, recorded under different illumination conditions. The dataset contains intrinsic properties and object segmentation ground-truth labels. The dataset is described in detail in the experimental section.
Semantic Segmentation. [21, 22, 23]. On the other hand, contemporary semantic segmentation methods such as [24, 25, 26] benefit from powerful CNN models and large-scale datasets such as [27, 28]. A detailed review of deep learning techniques applied to the semantic segmentation task can be found in .
Photometric changes, which are due to varying illumination conditions, cause changes in the appearance of objects. Consequently, these appearance changes create problems for the semantic segmentation task. Therefore, several methods have been proposed to mitigate the effects of varying illumination and achieve more robust semantic segmentation by incorporating illumination invariance in their algorithms [1, 2, 3, 4]. However, these methods provide invariance artificially through hand-crafted features. Therefore, they are limited in compensating for possible changes in photometry (i.e. illumination). Deep learning based methods may learn to accommodate photometric changes through data exploration. However, they are constrained by the amount of data. In this paper, we propose to use the intrinsic reflectance property (i.e. full illumination invariance) for semantic segmentation.
Joint Learning. Semantic segmentation has been used for joint learning tasks as it provides useful cues about objects and scenes. For instance, [30, 31, 32] propose joint depth prediction and semantic segmentation models. Joint semantic segmentation and 3D scene reconstruction is proposed by . Furthermore,  formulates dense stereo reconstruction and semantic segmentation in a joint framework.
For intrinsic image decomposition,  introduces the first unified model for recovering shape, reflectance, and chromatic illumination in a joint optimization framework. Other works [36, 37] jointly predict depth and intrinsic properties. Finally,  exploits the relation between the intrinsic properties and objects (i.e. attributes and segments). The authors propose to address these problems in a joint optimization framework. Using hand-crafted priors,  designs energy terms per component and combines them in one global energy to be minimized. In contrast to previous methods, our proposed method is an end-to-end solution and does not rely on any hand-crafted priors. Additionally,  does not optimize their energy function for each component separately. Therefore, the analysis of the influence of intrinsic image decomposition on semantic segmentation is omitted. In this paper, an in-depth analysis for each component is given.
3.1 Image Formation Model
To formulate our intrinsic image decomposition, the diffuse reflectance component is considered. Then, an image, $I$, over the visible spectrum $\omega$, is defined by:

$$I = m(\vec{n}, \vec{s}) \int_{\omega} f_c(\lambda)\, e(\lambda)\, \rho_b(\lambda)\, d\lambda$$

In the equation, $\vec{n}$ denotes the surface normal, whereas $\vec{s}$ is the light source direction; together they form the geometric dependencies $m$, which in turn form the shading component under white light. Additionally, $\lambda$ represents the wavelength, $f_c(\lambda)$ is the camera spectral sensitivity, $e(\lambda)$ specifies the spectral power distribution of the illuminant, and $\rho_b(\lambda)$ represents the diffuse surface reflectance. Then, using narrow band filters and considering a linear sensor response under white light, intrinsic image decomposition can be formulated as:

$$I(\vec{x}) = R(\vec{x}) \times S(\vec{x})$$

Then, for a position $\vec{x}$, $I(\vec{x})$ can be approximated by the element-wise product of its intrinsic components, the reflectance $R(\vec{x})$ and the shading $S(\vec{x})$. When the light source is colored, it is also included in the shading component.
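As a small illustration of the element-wise product described above, the following sketch composes an image from a reflectance map and a shading map with NumPy. The function name `compose_image`, the array shapes, and the random inputs are assumptions made for the example and are not part of the paper's pipeline.

```python
import numpy as np

def compose_image(reflectance, shading):
    """Approximate an image as the element-wise product of its intrinsics.

    reflectance: H x W x 3 array (albedo).
    shading:     H x W array (white light) or H x W x 3 array (colored light).
    """
    reflectance = np.asarray(reflectance, dtype=np.float64)
    shading = np.asarray(shading, dtype=np.float64)
    if shading.ndim == 2:
        # Grayscale shading: add a channel axis so it broadcasts over RGB.
        shading = shading[..., np.newaxis]
    return reflectance * shading

# Hypothetical toy inputs, only to exercise the function.
rng = np.random.default_rng(0)
reflectance = rng.uniform(0.0, 1.0, size=(4, 6, 3))
gray_shading = rng.uniform(0.0, 1.0, size=(4, 6))
image = compose_image(reflectance, gray_shading)
```

With a colored light source, the same call accepts an H x W x 3 shading array, which matches the remark that the illuminant color is absorbed into the shading component.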
3.2 Baseline Model Architectures
Intrinsic Image Decomposition. We use the model proposed by , , without the specular highlight module. The model is shown in the dotted rectangle part of Figure 1. The model provides state-of-the-art results for the intrinsic image decomposition task. Early features in the encoder block are connected with the corresponding decoder layers; these connections are called mirror links. They prove to be useful for keeping visual details and producing sharp outputs. Furthermore, the features across the decoders are linked to each other (inter-connections) to further strengthen the correlation between the components.
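To train the model for the intrinsic image decomposition task, we use a combination of the standard reconstruction loss (MSE) and its scale-invariant version (SMSE). Let $\hat{J}$ be the prediction of the network and $J$ be the ground-truth intrinsic image. Then, the standard reconstruction loss is given by:

$$\mathcal{L}_{MSE}(J, \hat{J}) = \frac{1}{n} \sum_{\vec{x}, c} \left\| \hat{J}(\vec{x}, c) - J(\vec{x}, c) \right\|_2^2$$

where $\vec{x}$ denotes the pixel coordinate, $c$ is the color channel index, and $n$ is the total number of evaluated pixels. Then, SMSE scales $\hat{J}$ first and compares its MSE with $J$:

$$\mathcal{L}_{SMSE}(J, \hat{J}) = \mathcal{L}_{MSE}(J, \alpha \hat{J}), \qquad \alpha = \arg\min_{\alpha} \mathcal{L}_{MSE}(J, \alpha \hat{J})$$

Then, the combined loss for training an intrinsic component becomes:

$$\mathcal{L}_{CL}(J, \hat{J}) = \gamma_{SMSE}\, \mathcal{L}_{SMSE}(J, \hat{J}) + \gamma_{MSE}\, \mathcal{L}_{MSE}(J, \hat{J})$$

where the $\gamma$s are the corresponding loss weights. The final loss for training the model for the intrinsic image decomposition task becomes:

$$\mathcal{L}_{IL}(R, \hat{R}, S, \hat{S}) = \gamma_{R}\, \mathcal{L}_{CL}(R, \hat{R}) + \gamma_{S}\, \mathcal{L}_{CL}(S, \hat{S})$$

where $R$ and $S$ denote the ground-truth reflectance and shading, and $\hat{R}$ and $\hat{S}$ their predictions.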
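The reconstruction losses described in this subsection can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scale factor is taken as the closed-form least-squares solution applied to the prediction, and the default weights `gamma_smse` and `gamma_mse` are placeholders (the values actually used are not given in this section).

```python
import numpy as np

def mse(pred, target):
    """Squared error summed over channels, averaged over the H*W pixels."""
    n_pixels = pred.shape[0] * pred.shape[1]
    return float(np.sum((pred - target) ** 2) / n_pixels)

def smse(pred, target, eps=1e-12):
    """Scale-invariant MSE: rescale the prediction before comparing.

    alpha minimizes ||alpha * pred - target||^2 and has the closed form
    <pred, target> / <pred, pred> (eps guards against division by zero).
    """
    alpha = float(np.sum(pred * target) / (np.sum(pred * pred) + eps))
    return mse(alpha * pred, target)

def combined_loss(pred, target, gamma_smse=0.5, gamma_mse=0.5):
    """Weighted sum of SMSE and MSE; the weights here are placeholders."""
    return gamma_smse * smse(pred, target) + gamma_mse * mse(pred, target)

# Toy check: a prediction that is off only by a global scale factor.
rng = np.random.default_rng(0)
target = rng.uniform(0.1, 1.0, size=(8, 8, 3))
pred = 0.5 * target
```

For this toy prediction, `smse` is numerically zero while `mse` is not, which is the behavior that motivates combining the two terms.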
Semantic Segmentation. The same architecture is used as the baseline for the semantic segmentation task. However, one of the decoders is removed from the architecture, because there is only one task. As a consequence, inter-connection links are not used for the semantic segmentation task. Furthermore, as a second baseline, we train an off-the-shelf segmentation algorithm , , that is specifically engineered for the semantic segmentation task.
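To train the model for semantic segmentation, we use the cross entropy loss:

$$\mathcal{L}_{CE} = -\frac{1}{n} \sum_{\vec{x}} \log\left(p_{\vec{x}}^{L_{\vec{x}}}\right)$$

where $p_{\vec{x}}^{L}$ is the output of the softmax function, i.e. the posterior probability of a given pixel $\vec{x}$ belonging to class $L$, $L_{\vec{x}} \in O$ is the ground-truth label of the pixel, $O$ is the category set for the pixel-level class labels, and $n$ is the number of evaluated pixels.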
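A per-pixel softmax cross entropy of the kind described above can be sketched as follows. The function names, the logits layout (H x W x C), and the unweighted average over pixels are assumptions for illustration; the per-class weighting mentioned in the implementation details is not included.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last (class) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels, eps=1e-12):
    """Mean negative log-probability of the ground-truth class per pixel.

    logits: H x W x C float array of raw network outputs.
    labels: H x W integer array with values in [0, C).
    """
    probs = softmax(logits)
    # Pick, for every pixel, the probability assigned to its true class.
    picked = np.take_along_axis(probs, labels[..., np.newaxis], axis=-1)
    return float(-np.mean(np.log(picked + eps)))

# Uninformative logits over 16 classes give a loss of log(16).
logits = np.zeros((4, 5, 16))
labels = np.zeros((4, 5), dtype=np.int64)
loss = cross_entropy(logits, labels)
```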
3.3 Joint Model Architecture
In this section, a new joint model architecture is proposed. It is an extension of the base model architecture for the intrinsic image decomposition task, , that combines the two tasks, i.e. intrinsic image decomposition and semantic segmentation. We modify the baseline model architecture to have one encoder and three distinct decoders, i.e. one for reflectance prediction, one for shading prediction, and one for semantic segmentation prediction. We maintain the mirror links and inter-connections. This allows the network to be constrained by different outputs, and thus reinforces the features learned from the different tasks. As a result, the network is forced to learn joint features for the two tasks at hand not only in the encoding phase, but also in the decoding phase. Both the encoder and the decoder parts contain both intrinsic property and semantic segmentation characteristics. This setup is expected to be exploited by the individual decoder blocks to learn extra cues for the task at hand. Figure 1 illustrates the joint model architecture. To train the model jointly, we combine the task-specific loss functions by summing them together:

$$\mathcal{L}_{JL} = \gamma_{CE}\, \mathcal{L}_{CE} + \gamma_{IL}\, \mathcal{L}_{IL}$$

where $\mathcal{L}_{CE}$ is the cross entropy loss for semantic segmentation, $\mathcal{L}_{IL}$ is the loss for intrinsic image decomposition, and the $\gamma$s are the corresponding weights.
The effect of the gamma parameters of Equation 6 and more implementation details can be found in the supplementary materials.
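To make the weighted sum of task losses concrete, here is a minimal sketch in plain Python. The function name `joint_loss`, the split of the intrinsic term into reflectance and shading parts, and all default weights of 1.0 are illustrative assumptions, not values taken from the paper.

```python
def joint_loss(ce_loss, reflectance_loss, shading_loss,
               gamma_ce=1.0, gamma_il=1.0, gamma_r=1.0, gamma_s=1.0):
    """Combine the task losses of the three decoders into one scalar.

    ce_loss:          segmentation cross entropy.
    reflectance_loss: combined reconstruction loss of the reflectance decoder.
    shading_loss:     combined reconstruction loss of the shading decoder.
    All gamma weights are hypothetical placeholders.
    """
    intrinsic_loss = gamma_r * reflectance_loss + gamma_s * shading_loss
    return gamma_ce * ce_loss + gamma_il * intrinsic_loss

# Example with made-up loss values.
total = joint_loss(ce_loss=2.0, reflectance_loss=0.01, shading_loss=0.02)
```

Because the shared encoder receives gradients from all three terms, the relative size of the weights decides how strongly each task shapes the shared features.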
4.1 New Synthetic Dataset of Natural Environments
A large set of synthetic images is created featuring plants and objects that are mostly found in natural environments such as parks and gardens. The dataset contains different species of vegetation, such as trees and flowering plants, with different types of terrains and landscapes under different lighting conditions. Furthermore, scenarios are created that involve human intervention, such as the presence of bushes (like rectangular hedges or spherical topiaries), fences, flowerpots, and planters (16 classes in total). There is a substantial variety of object colors and geometry. The dataset is constructed by using the parametric tree models  (implemented as add-ons in Blender software), and several manually-designed models from the Internet that aim for realistic natural scenes and environments. Ambient lighting is provided by real HDR sky images with a parallel light source. Light source properties are designed to correspond to daytime lighting conditions such as clear sky, cloudy, sunset, and twilight. For each virtual park/garden, we captured the scene from different perspectives with motion blur effects. Scenes are rendered with the physics-based Blender Cycles engine (https://www.blender.org/). To obtain annotations, the rendering pipeline is modified to output images, their corresponding albedo and shading profiles (intrinsics), and semantic labels (segmentation). The dataset consists of 35K images, depicting 40 different parks/gardens under 5 lighting conditions. A number of samples are shown in Figure 2. For the experiments, the dataset is randomly split into 80% training and 20% testing (scene split).
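One way to read the "scene split" above is that all images of a given park/garden land on the same side of the split. The sketch below implements that reading; the interpretation itself, the function name `scene_split`, the seed, and the `(scene, lighting)` sample layout are assumptions for illustration.

```python
import random

def scene_split(scene_ids, train_fraction=0.8, seed=0):
    """Randomly assign whole scenes to the training or the test set."""
    scenes = sorted(set(scene_ids))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(scenes)
    n_train = round(train_fraction * len(scenes))
    return set(scenes[:n_train]), set(scenes[n_train:])

# Hypothetical index: 40 scenes, each rendered under 5 lighting conditions.
samples = [(scene, light) for scene in range(40) for light in range(5)]
train_scenes, test_scenes = scene_split(scene for scene, _ in samples)

train_samples = [s for s in samples if s[0] in train_scenes]
test_samples = [s for s in samples if s[0] in test_scenes]
```

Splitting by scene rather than by image avoids evaluating on a park/garden whose geometry was already seen during training under a different viewpoint or lighting condition.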
4.2 Error Metrics
To evaluate our method for the intrinsic image decomposition task, we report the mean squared error (MSE), its scale-invariant version (SMSE), the local mean squared error (LMSE), and the dissimilarity version of the structural similarity index (DSSIM). DSSIM accounts for the perceptual visual quality of the results. Following , for MSE, the absolute brightness of each image is adjusted to minimize the error. Further, a window size of 20 is used for LMSE. For the semantic segmentation task, we report global pixel accuracy, mean class accuracy, and mean intersection over union (mIoU).
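The three segmentation metrics listed above can all be derived from a confusion matrix. The following NumPy sketch shows one common way to compute them; the function names and the choice to skip classes that never occur are assumptions, and the authors' evaluation code may differ in such details.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_scores(conf):
    """Return (global pixel accuracy, mean class accuracy, mIoU)."""
    conf = conf.astype(np.float64)
    true_pos = np.diag(conf)
    gt_total = conf.sum(axis=1)    # pixels that belong to each class
    pred_total = conf.sum(axis=0)  # pixels predicted as each class
    global_acc = true_pos.sum() / conf.sum()
    present = gt_total > 0         # ignore classes absent from ground truth
    class_acc = np.mean(true_pos[present] / gt_total[present])
    union = gt_total + pred_total - true_pos
    valid = union > 0              # ignore classes absent from both maps
    miou = np.mean(true_pos[valid] / union[valid])
    return float(global_acc), float(class_acc), float(miou)

# Tiny two-class example: one of four pixels is misclassified.
target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
scores = segmentation_scores(confusion_matrix(pred, target, num_classes=2))
```

In this toy case the global and class accuracies are both 0.75, while the per-class IoUs are 1/2 and 2/3, so mIoU penalizes the error more strongly than the accuracy measures do.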
5.1 Influence of Reflectance on Semantic Segmentation
In this experiment, we evaluate the performance of reflectance and RGB color images as input for the semantic segmentation task. We train an off-the-shelf segmentation algorithm  using (i) ground-truth reflectance and (ii) RGB color images separately, and (iii) RGB color images and reflectance together, as input. The results are summarized in Table 1 and illustrated in Figure 3. Further, confusion matrices for the RGB and reflectance inputs are provided in Figure 4.
Table 1. Columns: Methodology | Global Pixel | Class Average | mIoU
The results show that the semantic segmentation algorithm highly benefits from illumination-invariant intrinsic properties (i.e. reflectance). The combined input outperforms the single RGB input. On the other hand, the results with reflectance as single input are superior to the results with inputs that include color images in all metrics. The combined input is not better than using only reflectance, because the network may be negatively influenced by the varying photometric cues introduced by the RGB input. Although the CNN framework may learn illumination invariance to a certain degree, it is not possible to cover all the variations caused by the illumination. Therefore, a fully illumination-invariant representation (i.e. reflectance) helps the CNN to improve semantic segmentation performance. Moreover, the confusion matrices show that the network is unable to distinguish a number of classes based on RGB input. Using reflectance, the same network gains the ability to correctly classify the ground class, and makes fewer mistakes with the similar-looking box and topiary classes.
5.2 Influence of Semantic Segmentation on Intrinsic Decomposition
In this experiment, we evaluate the performance of intrinsic image decomposition using ground-truth semantic segmentation labels as an extra source of information in addition to the images. We compare the performance of intrinsic image decomposition trained with images only as input and intrinsic image decomposition trained with images and ground-truth semantic segmentation labels together as input. For the latter, four input channels (i.e. color image and semantic segmentation labels) are provided as input. The results are summarized in Table 2.
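The four-channel input described above can be formed by appending the label map to the RGB image as an extra channel. The sketch below shows one plausible construction; how the label channel is encoded is not stated in this section, so the scaling of class indices to [0, 1] and the function name `stack_rgb_and_labels` are assumptions.

```python
import numpy as np

def stack_rgb_and_labels(rgb, labels, num_classes=16):
    """Concatenate an H x W x 3 image and an H x W label map into H x W x 4.

    The label map is scaled to [0, 1] so that it shares the value range of
    the normalized image; this encoding is an assumption.
    """
    rgb = np.asarray(rgb, dtype=np.float32)
    label_channel = np.asarray(labels, dtype=np.float32) / max(num_classes - 1, 1)
    return np.concatenate([rgb, label_channel[..., np.newaxis]], axis=-1)

# Hypothetical inputs at the network resolution used in the experiments.
rgb = np.zeros((352, 480, 3), dtype=np.float32)
labels = np.full((352, 480), 15, dtype=np.int64)
net_input = stack_rgb_and_labels(rgb, labels)
```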
As shown in Table 2, intrinsic image decomposition clearly benefits from segmentation labels. The model that also receives segmentation labels outperforms the image-only model in all metrics. The DSSIM metric, which accounts for the perceptual visual quality, shows an improvement in reflectance predictions, which indicates that the semantic segmentation process can act as an object boundary guidance map for reflectance prediction. A number of qualitative comparisons between the two models are shown in Fig. 5.
5.3 Joint Learning of Semantic Segmentation and Intrinsic Decomposition
In this section, we evaluate the influence of joint learning on intrinsic image decomposition and semantic segmentation performances. We perform three experiments. First, we evaluate the effectiveness of joint learning of intrinsic properties and semantic segmentation considering semantic segmentation performance. Second, we evaluate the effectiveness of joint learning of intrinsic property and semantic segmentation to obtain intrinsic property prediction. Finally, we study the effects of the weights of the loss functions for the tasks.
Experiment I. In this experiment, we evaluate the performance of the proposed joint learning-based semantic segmentation algorithm, an off-the-shelf semantic segmentation algorithm , and the baseline of one encoder and one decoder, ShapeNet . All CNNs receive color images as their input. The two baselines output only pixel-level object class label predictions, whereas the proposed method predicts intrinsic properties (i.e. reflectance and shading) in addition to the object class labels. We compare the accuracy of the models in Table 3. As shown in Table 3, the proposed joint learning framework outperforms the single-task frameworks in all metrics. Further, a visual comparison between a single-task baseline and the proposed framework is provided in Fig. 6. In addition, confusion matrices are provided in the supplementary material.
Table 3. Columns: Methodology | Global Pixel | Class Average | mIoU
By analyzing the 3rd and 4th rows of the figure, it can be derived that unusual lighting conditions negatively influence the results of the compared single-task model. In contrast, our proposed method is not affected by varying illumination due to the joint learning scheme. Furthermore, our method better preserves object shapes and boundaries when compared to that model (rows 1, 2 and 5). Note that the joint network does not perform any additional fine-tuning operations (e.g. CRF etc.). Additionally, the compared architecture is deeper than our proposed model. However, our method still outperforms it. Finally, the joint network outperforms the single-task cascade network (mIoU of 0.6332 vs. 0.5810, see Table 1 and Table 3), as the joint scheme forces the network to augment joint features.
Experiment II. In this experiment, we evaluate the performance of the proposed joint learning-based method and the state-of-the-art intrinsic image decomposition algorithm . Both CNNs receive color images as input. The latter outputs only intrinsic properties (i.e. reflectance and shading), whereas the proposed method predicts pixel-level object class labels as well as intrinsic properties. We train both methods using ground-truth reflectance and shading labels on the training set of the proposed dataset. We compare the accuracy of both methods in Table 4.
As shown in Table 4, the proposed joint learning framework outperforms single-task learning in all the metrics for reflectance (albedo) and shading estimation. Further, our joint model obtains lower standard deviation values. To give more insight into the reflectance prediction performance, a number of visual comparisons between the single-task model and the proposed framework are given in Fig. 7. From the first two columns of the figure, it can be derived that the semantic segmentation process acts as an object boundary guidance map for the intrinsic image decomposition task by enhancing cues to differentiate between reflectance and occlusion edges in a scene. Hence, object boundaries are better preserved by the proposed method (e.g. the separation between pavement and ground in the first image and the space between fences in the second image). In addition, information about an object reveals strong priors about its intrinsic properties. Each object label corresponds to a constrained color distribution. That can be observed in the third and fourth columns. Semantic segmentation guides the intrinsic image decomposition process by steering the trees to be closer to green and the flowers to be closer to pink. Moreover, for class-level intrinsics, the best improvement (3.3 times better) is obtained for concrete step blocks, which have achromatic colors. Finally, as in segmentation, the joint network outperforms the single-task cascade network, see Table 2 and Table 4.
Experiment III. In this experiment, we study the effects of the weightings of the loss functions. As the cross entropy loss is an order of magnitude higher than the SMSE loss, we first normalize them by multiplying the intrinsic loss by 100. Then, we evaluate different weights on top of the normalization. See Table 5 for the results. If higher weights are assigned to intrinsics, the performance of both tasks increases. However, weights that are too high negatively influence the mIoU values. Therefore,  appears to be the proper setting for both tasks.
5.4 Real World Outdoor Dataset
Finally, our model is evaluated on real world garden images provided by the 3D Reconstruction meets Semantics challenge . The images are captured by a robot driving through a semantically-rich garden with fine geometric details. Results of  are provided as a visual comparison on the performance in Fig. 8. It shows that our method generates better results on real images with sharper reflectance images having more vivid and realistic colors. Moreover, our method mitigates sharp shadow effects better. Note that our model is trained fully on synthetic images and still provides satisfactory results on real, natural scenes. For semantic segmentation comparison, we fine-tuned SegNet  and our approach on the real world dataset after pre-training on the garden dataset. Since we only have the ground-truth for segmentation, we (only) unfreeze the segmentation branch. Results show that SegNet and our approach obtain 0.54 and 0.54 for mIoU and a global pixel accuracy of 0.85 and 0.88 respectively. Note that our model is much smaller in size and predicts the intrinsics together with the segmentation. More results are provided in the supplementary material.
Our approach jointly learns intrinsic image decomposition and semantic segmentation. New CNN architectures are proposed for joint learning, as well as for single intrinsic-for-segmentation and segmentation-for-intrinsic learning. A dataset of 35K synthetic images of natural environments has been created with corresponding albedo and shading (intrinsics), and semantic labels (segmentation). The experiments show that performing the two tasks (intrinsics and semantics) in a joint manner benefits both for natural scenes.
Acknowledgements: This project was funded by the EU Horizon 2020 program No. 688007 (TrimBot2020). We thank Gjorgji Strezoski for his contributions to the website.
-  Upcroft, B., McManus, C., Churchill, W., Maddern, W., Newman, P.: Lighting invariant urban street classification. In: IEEE International Conference on Robotics and Automation. (2014)
-  Wang, C., Tang, Y., Zou, X., Situ, W., Feng, W.: A robust fruit image segmentation algorithm against varying illumination for vision system of fruit harvesting robot. Optik-International Journal for Light and Electron Optics 131 (2017) 626–631
-  Suh, H.K., Hofstee, J.W., van Henten, E.J.: Shadow-resistant segmentation based on illumination invariant image transformation. In: International Conference of Agricultural Engineering. (2014)
-  Ramakrishnan, R., Nieto, J., Scheding, S.: Shadow compensation for outdoor perception. In: IEEE International Conference on Robotics and Automation. (2015)
-  Land, E.H., McCann, J.J.: Lightness and retinex theory. Journal of Optical Society of America (1971) 1–11
-  Shen, L., Tan, P., Lin, S.: Intrinsic image decomposition with non-local texture cues. In: IEEE Conference on Computer Vision and Pattern Recognition. (2008)
-  Zhao, Q., Tan, P., Dai, Q., Shen, L., Wu, E., Lin, S.: A closed-form solution to retinex with nonlocal texture constraints. IEEE Trans. on Pattern Analysis and Machine Intelligence (2012) 1437–1444
-  Gehler, P.V., Rother, C., Kiefel, M., Zhang, L., Schölkopf, B.: Recovering intrinsic images with a global sparsity prior on reflectance. In: Advances in Neural Information Processing Systems. (2011)
-  Shen, L., Yeo, C.: Intrinsic images decomposition using a local and global sparse representation of reflectance. In: IEEE Conference on Computer Vision and Pattern Recognition. (2011)
-  Weiss, Y.: Deriving intrinsic images from image sequences. In: IEEE International Conference on Computer Vision. (2001)
-  Matsushita, Y., Lin, S., Kang, S.B., Shum, H.Y.: Estimating intrinsic images from image sequences with biased illumination. In: European Conference on Computer Vision. (2004)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations. (2015)
-  Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014)
-  Narihira, T., Maire, M., Yu, S.X.: Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In: IEEE International Conference on Computer Vision. (2015)
-  Shi, J., Dong, Y., Su, H., Yu, S.X.: Learning non-lambertian object intrinsics across shapenet categories. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)
-  Lettry, L., Vanhoey, K., Gool, L.V.: Darn: a deep adversarial residual network for intrinsic image decomposition. In: IEEE Winter Conference on Applications of Computer Vision. (2018)
-  Baslamisli, A.S., Le, H.A., Gevers, T.: Cnn based learning using reflection and retinex models for intrinsic image decomposition. In: IEEE Conference on Computer Vision and Pattern Recognition. (2018)
-  Grosse, R., Johnson, M.K., Adelson, E.H., Freeman, W.T.: Ground truth dataset and baseline evaluations for intrinsic image algorithms. In: IEEE International Conference on Computer Vision. (2009)
-  Bell, S., Bala, K., Snavely, N.: Intrinsic images in the wild. In: ACM Trans. on Graphics (TOG). (2014)
-  Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: European Conference on Computer Vision. (2012)
-  Fulkerson, B., Vedaldi, A., Soatto, S.: Class segmentation and object localization with superpixel neighborhoods. In: IEEE International Conference on Computer Vision. (2009)
-  Csurka, G., Perronnin, F.: An efficient approach to semantic segmentation. International Journal of Computer Vision, 95(2) (2011) 198–212
-  Shotton, J., Winn, J., Rother, C., Criminisi, A.: Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1) (2009) 2–23
-  Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence (2017)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition. (2015)
-  Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016)
-  Everingham, M., Eslami, S.M.A., van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1) (2015) 98–136
-  Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016)
-  Garcia-Garcia, A., Orts-Escolano, S., Oprea, S.O., Villena-Martinez, V., Garcia-Rodriguez, J.: A survey on deep learning techniques for image and video semantic segmentation. Applied Soft Computing, 70 (2018) 41–65
-  Jafari, O.H., Groth, O., Kirillov, A., Yang, M.Y., Rother, C.: Analyzing modular cnn architectures for joint depth prediction and semantic segmentation. In: IEEE International Conference on Robotics and Automation. (2017)
-  Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: IEEE International Conference on Computer Vision. (2015)
-  Mousavian, A., Pirsiavash, H., Kosecka, J.: Joint semantic segmentation and depth estimation with deep convolutional networks. In: IEEE International Conference on 3D Vision. (2016)
-  Kundu, A., Li, Y., Dellaert, F., Li, F., Rehg, J.M.: Joint semantic segmentation and 3d reconstruction from monocular video. In: European Conference on Computer Vision. (2014)
-  Ladicky, L., Sturgess, P., Russell, C., Sengupta, S., Bastanlar, Y., Clocksin, W., Torr, P.H.S.: Joint optimization for object class segmentation and dense stereo reconstruction. International Journal of Computer Vision, 100(2) (2012)
-  Barron, J.T., Malik, J.: Color constancy, intrinsic images, and shape estimation. In: European Conference on Computer Vision. (2012)
-  Kim, S., Park, K., Sohn, K., Lin, S.: Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016)
-  Shelhamer, E., Barron, J.T., Darrell, T.: Scene intrinsics and depth from a single image. In: IEEE International Conference on Computer Vision Workshop. (2015)
-  Vineet, V., Rother, C., Torr, P.H.S.: Higher order priors for joint intrinsic image, objects, and attributes estimation. In: Advances in Neural Information Processing Systems. (2013)
-  Shafer, S.: Using color to separate reflection components. Color research and applications (1985) 210–218
-  Weber, J., Penn, J.: Creation and rendering of realistic trees. In: Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH). (1995)
-  Sattler, T., Tylecek, R., Brok, T., Pollefeys, M., Fisher, R.B.: 3d reconstruction meets semantics - reconstruction challenge 2017. In: IEEE International Conference on Computer Vision Workshop. (2017)
S1 Implementation Details
Our models are trained using the Adadelta optimizer [supp_adadelta] with a learning rate of 0.01. Convolution weights are initialized from a normal distribution, and a weight decay factor of 1e-9 is used. The input to the networks is fixed to a resolution of 352×480. As a pre-processing step, the images are normalized to the range [0, 1]. The batch size is fixed at 16 for all experiments. In the last decoder block, the feature dimension is reduced to the expected output dimension. For the semantic segmentation task, the loss is weighted per class, since the classes are not equally distributed. Furthermore, for the output of the semantic segmentation task, the feature dimensions are left unchanged and the features are simply convolved one additional time at the same feature dimension. This produces an output corresponding to the 16 class labels.
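The per-class loss weighting mentioned above can be illustrated with a minimal sketch. The text does not state how the class weights are computed, so the inverse-frequency rule below is an assumption chosen for illustration, and the function names `inverse_frequency_weights` and `weighted_cross_entropy` are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Assumed weighting rule (not from the paper):
    weight_c = N / (num_classes * count_c), so rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(probs, labels, weights):
    """Class-weighted cross-entropy averaged over pixels.

    probs:   one list of class probabilities per pixel
    labels:  one ground-truth class id per pixel
    weights: one weight per class
    """
    weighted_sum = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    weight_total = sum(weights[y] for y in labels)
    return weighted_sum / weight_total


# Toy example: class 0 is three times as frequent as class 1.
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels, num_classes=2)
probs = [[0.5, 0.5]] * 4
loss = weighted_cross_entropy(probs, labels, weights)
```

With this rule the rare class receives the larger weight, so errors on infrequent classes are not dominated by errors on frequent ones.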
Baseline network architecture: The encoder part is composed of 6 convolution blocks with 3×3 kernels and stride of 2 having [16, 32, 64, 128, 256, 256] feature maps. The encoder part is mirrored to build the decoder. Before convolving the feature maps of the decoder part, the previous layer's feature maps are first up-sampled and concatenated with their corresponding encoder features. All convolutions are followed by batch normalization [supp_batchnorm] and ReLU.
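To make the encoder description concrete, the following sketch tracks the feature map shapes through the six stride-2 blocks. It assumes a padding of 1 for each 3×3 convolution and an input of height 352 and width 480; neither the padding nor the axis order is stated in the text, and the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of one convolution (standard formula).
    padding=1 is an assumption; the text does not specify it."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(height, width, channels=(16, 32, 64, 128, 256, 256)):
    """Return (channels, height, width) after each encoder block."""
    shapes = []
    for c in channels:
        height, width = conv_out(height), conv_out(width)
        shapes.append((c, height, width))
    return shapes


# Assumed input layout: height 352, width 480.
shapes = encoder_shapes(352, 480)
for c, h, w in shapes:
    print(f"{c:>3} feature maps at {h}x{w}")
```

Under these assumptions the fifth block yields an odd spatial size (11×15), so the mirrored decoder would need to up-sample to the size of the matching encoder features rather than simply doubling, in order for the concatenation to be well defined.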
Gamma parameters for SMSE: For all the experiments involving the intrinsic image decomposition task, to form the combined MSE and SMSE loss, we followed the setup of [supp_shi] and set to 0.95 and to 0.05 for Equation 6 in the main manuscript. Nonetheless, we conducted a small experiment to see the effect of the gamma parameters for SMSE. Table 1 provides the average intrinsic image decomposition errors. The small experiment suggests that giving higher weight to tends to improve the results.
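A minimal sketch of a combined MSE and scale-invariant MSE (SMSE) loss is given below. It assumes the common definition in which the prediction is first rescaled by the least-squares optimal scalar before the MSE is taken; Equation 6 itself is not reproduced in this supplement, so this is an illustration rather than the exact formulation. Because the symbols attached to 0.95 and 0.05 are missing from the text above, both weights are left as explicit arguments (`gamma_smse`, `gamma_mse`, hypothetical names), and the values used in the example are placeholders.

```python
def mse(pred, target):
    """Mean squared error between two equally sized flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def smse(pred, target):
    """Scale-invariant MSE: rescale pred by the scalar alpha that
    minimizes ||alpha * pred - target||^2, then compute the MSE."""
    denom = sum(p * p for p in pred)
    alpha = sum(p * t for p, t in zip(pred, target)) / denom if denom else 0.0
    return mse([alpha * p for p in pred], target)


def combined_loss(pred, target, gamma_smse, gamma_mse):
    """Weighted sum of the SMSE and MSE terms."""
    return gamma_smse * smse(pred, target) + gamma_mse * mse(pred, target)


# The prediction is off by a global factor of 2: SMSE ignores it, MSE does not.
pred = [1.0, 2.0, 3.0]
target = [2.0, 4.0, 6.0]
scale_invariant = smse(pred, target)
# Placeholder weights, not the values used in the paper.
total = combined_loss(pred, target, gamma_smse=0.5, gamma_mse=0.5)
```

The example shows why the two terms are combined: the SMSE term is insensitive to a global intensity scale, while the MSE term still penalizes it.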
S2 Confusion Matrix for Joint Learning of Semantic Segmentation and Intrinsic Decomposition
Confusion matrices for and the proposed model are provided in Figure S1. The confusion matrices show that joint learning further improves the ability to distinguish classes of similar color under different lighting conditions. Similar to the case of using albedo as input for the SegNet architecture, joint learning also improves the semantic segmentation performance significantly for certain classes. For the ground class, confusion is reduced remarkably by also learning intrinsics. Likewise, the box and topiary classes, which look similar in terms of shape and color, are better distinguished. In addition, most of the small confusions are eliminated.
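For reference, a confusion matrix of the kind discussed above can be computed from per-pixel labels as in the following sketch. The row normalization is an assumption about how such matrices are typically displayed, and the toy labels are illustrative only, not data from the experiments.

```python
def confusion_matrix(true_labels, pred_labels, num_classes):
    """matrix[t][p] counts pixels of true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        matrix[t][p] += 1
    return matrix


def row_normalize(matrix):
    """Convert counts to per-class fractions; each non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Toy labels for three classes (illustrative only).
true_labels = [0, 0, 1, 1, 2]
pred_labels = [0, 1, 1, 1, 2]
cm = confusion_matrix(true_labels, pred_labels, num_classes=3)
cm_norm = row_normalize(cm)
```

Off-diagonal entries in a row show which classes a given ground-truth class is mistaken for, which is how confusions such as those between the box and topiary classes can be read off.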
S3 Results in Higher Resolutions
In this part of the supplementary material, the results of reflectance prediction in higher resolution are presented for better visual comparisons.
S3.1 Influence of Semantic Segmentation on Intrinsic Image Decomposition
In this experiment, we evaluate the performance of intrinsic image decomposition using ground-truth semantic segmentation labels as an extra source of information in addition to the color images. Qualitative comparisons between predictions made from images as input () and predictions made from images along with segmentation labels as input () are provided in higher resolution in Fig. S2 and Fig. S3.
S3.2 Joint Learning of Semantic Segmentation and Intrinsic Decomposition
In this experiment, we compare the performance of the proposed joint learning-based model and a state-of-the-art intrinsic image decomposition algorithm [supp_shi]. Both CNNs receive color images as input. The proposed method provides sharper outputs, especially at object boundaries, and predicts colors that are closer to the ground-truth reflectance. Higher resolution results are provided in Fig. S4 to Fig. S8.
S4 More Results on Real World Images
In this part, additional results on real-world garden images are presented. The proposed method generates reflectance images with more vivid and realistic colors. Moreover, our method mitigates sharp shadow effects better and produces sharper images. Additional reflectance results are shown in Fig. S9, and semantic segmentation results are shown in Fig. S10. For segmentation, joint learning performs comparably to the baseline, yet we achieve sharper results.