Algorithms for automatic recovery of physical scene properties from an input image are of interest for many applications across computer vision and graphics; examples include material recognition and re-rendering. The intrinsic image model assumes that a color image I is the point-wise product of albedo A and shading S:

I = A ⊙ S (1)
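The model can be illustrated with a minimal NumPy sketch (array shapes and values are our own example); it also exposes the inherent scale ambiguity of the decomposition:

```python
import numpy as np

# Intrinsic image model: a color image is the per-pixel (point-wise)
# product of albedo (reflectance) and shading.
rng = np.random.default_rng(0)
albedo = rng.uniform(0.2, 1.0, size=(4, 4, 3))   # surface reflectivity
shading = rng.uniform(0.1, 1.0, size=(4, 4, 1))  # illumination effects
image = albedo * shading                          # I = A ⊙ S, broadcast over RGB

# The decomposition is ambiguous up to a scale factor:
# (c * A, S / c) reproduces the same image for any c > 0.
c = 2.0
assert np.allclose((c * albedo) * (shading / c), image)
```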
Here, albedo is the physical reflectivity of surfaces in the scene. Considerable research focuses on automated recovery of A and S given as input only the color image I [15, 11], or given I and a depth map for the scene [19, 1, 2, 5]. Our work falls into the former category, as we predict the decomposition using only color input. Yet, we outperform modern approaches that rely on both color and depth input [19, 1, 2, 5].
We achieve such results through a drastic departure from most traditional approaches to the intrinsic image problem. Many works attack this problem by incorporating strong physics-inspired priors. One expects albedo and material changes to be correlated, motivating priors such as piecewise constancy of albedo [18, 20, 1, 2] or sparseness of the set of unique albedo values in a scene [25, 10, 26]. One also expects shading to vary smoothly over the image. Tang et al. explore generative learning of priors using deep belief networks. Though learning aligns with our philosophy, we take a discriminative approach.
Systems motivated by physical priors are usually formulated as optimization routines solving for a point-wise decomposition that satisfies Equation 1 and also fits with priors imposed over an extended spatial domain. Hence, graph-based inference algorithms, and conditional random fields (CRFs) in particular, are often used.
We forgo both physical modeling constraints and graph-based inference methods. Our direct intrinsics approach is purely data-driven and learns a convolutional regression which maps a color image input to its corresponding albedo and shading outputs. It is instantiated in the form of a multiscale fully convolutional neural network (Figure 1).
Key to enabling our direct intrinsics approach is the availability of a large-scale dataset with example ground-truth albedo-shading decompositions. Unfortunately, collecting such ground-truth for real images is a challenging task as it requires full control over the lighting environment in which images are acquired. This is possible in a laboratory setting, but difficult for more realistic scenes.
The Intrinsic Images in the Wild (IIW) dataset attempts to circumvent the lack of training data through a large-scale human labeling effort. However, its ground-truth is not in the form of actual decompositions, but only relative reflectance judgements over a sparse set of point pairs. These are human judgements rather than physical properties. They may be sufficient for training models with strong priors, or, most recently, CNNs for replicating human judgements. But they are insufficient for data-driven learning of intrinsic image decompositions from scratch.
We circumvent the data availability roadblock by training on purely synthetic images and testing on both real and synthetic images. The MPI Sintel dataset provides photo-realistic rendered images and corresponding albedo-shading ground-truth derived from the underlying 3D models and art assets. These were first used as training data by Chen and Koltun for deriving a more accurate physics-based intrinsics model. Figure 2 shows examples.
Section 2 describes the details of our CNN architecture and learning objectives for direct intrinsics. Our design is motivated by recent work on using CNNs to recover depth and surface normal estimates from a single image [28, 8, 21, 7]. Section 3 provides experimental results and benchmarks on the Sintel dataset, and examines the portability of our model to real images. Section 4 concludes.
2 Direct Intrinsics
We break the full account of our system into specification of the CNN architecture, description of the training data, and details of the loss function during learning.
Intrinsic decomposition requires access to the precise details of an image patch as well as the overall gist of the entire scene. The multiscale model of Eigen and Fergus for predicting scene depth has both ingredients, and we build upon their network architecture. In their two-scale network, they first extract global contextual information in a coarse subnetwork (scale 1), and use that subnetwork's output as an additional input to a finer-scale network (scale 2). As Figure 3 shows, we adopt a Multiscale CNN Regression (MSCR) architecture with important differences from theirs:
Instead of fully connected layers in scale 1, we use a convolution layer following the upsampling layer. This choice enables our model to run on arbitrary-sized images in a fully convolutional fashion.
For nonlinear activations, we use Parametric Rectified Linear Units (PReLUs). With PReLUs, the negative slope for each activation map channel is a learnable parameter:

f(x_i) = max(0, x_i) + a_i min(0, x_i)

where x_i is the pre-activation value at the i-th channel of a feature map and a_i is the learned slope for that channel. During experiments, we observe better convergence with PReLUs compared to ReLUs.
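A minimal sketch of the PReLU activation, assuming a channels-last feature map (shapes and values are our own):

```python
import numpy as np

def prelu(x, a):
    """Parametric ReLU: identity for positive pre-activations,
    learnable per-channel slope `a` for negative ones.
    x: feature map of shape (H, W, C); a: slopes of shape (C,)."""
    return np.maximum(0, x) + a * np.minimum(0, x)

x = np.array([[[-2.0, 3.0]]])   # one pixel, two channels
a = np.array([0.25, 0.25])      # learned negative slopes
y = prelu(x, a)                 # channel 0: 0.25 * -2 = -0.5; channel 1: 3.0
```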
Our network has two outputs, albedo and shading (-a and -s in Figure 3), which it predicts simultaneously.
Rather than an output layer that upsamples via fixed bilinear interpolation, we optionally use a learned deconvolutional layer, keeping multiple channels and learning to upsample from this richer representation.
In addition to these basic changes, we explore a variant of our model, shown on the right side of Figure 3, that connects multiple layers of the scale 2 subnetwork directly to that subnetwork's output. The reasoning follows that of Maire et al. and Hariharan et al., with the objective of directly capturing a representation of the input at multiple levels of abstraction. We adopt the “hypercolumn” (HC) terminology to designate this modification to MSCR.
The remaining architectural details are as follows. For convolutional layers conv1 through conv5 in the scale 1 net, we take the common AlexNet design. Following those, we upsample the feature map to a quarter of the original image size and feed it to a further convolutional layer (conv6). Scale 2 consists of convolutional layers for feature extraction followed by albedo and shading prediction. The first of these layers extracts an initial set of feature maps from the image. Subsequently, we concatenate the output of the scale 1 subnetwork with these maps and feed the result into the remaining convolutional and prediction layers. The optional learned deconvolutional layer uses strided filters to upsample. Whether using deconvolution or simple upsampling, we evaluate our output on a grid of the same spatial resolution as the original image.
2.2 MPI Sintel Dataset
For training data, we follow Chen and Koltun and use the “clean pass” images of the MPI Sintel dataset instead of their “final” versions, which are the result of additional computer graphics effects that distract from our application. This eliminates effects such as depth of field, motion blur, and fog. Ground-truth shading images are generated by rendering the scene with all elements assigned a constant grey albedo.
Some images contain defective pixels due to software rendering issues. We follow prior work and exclude images with defects from evaluation. However, the limited variation within the Sintel dataset is a concern for data-driven learning. Hence, we do use defective images in training, masking out defective pixels (ignoring their contribution to training error).
2.3 MIT Intrinsic Image Dataset
To demonstrate the adaptability of our model to real-world images, we use the MIT intrinsic image dataset. Images in this dataset are acquired with a special apparatus, yielding ground-truth reflectance and shading components for real-world objects. Here, reflectance is synonymous with our terminology of albedo.
Due to the limited scalability of the collection method, the MIT dataset contains only 20 different objects, with each object having 11 images under different light sources. Only 1 of each object's images has shading ground-truth. We generate each of the remaining shading images from the corresponding original image I and reflectance image A (identical for all images of an object, since they are taken with the same camera settings) by element-wise division: S = c · Ī / Ā, where Ī and Ā denote the mean values over RGB channels of I and A, respectively, and c is the value that minimizes the sum of squared error of I − A ⊙ S.
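Under the definitions above, the generation step can be sketched as follows (a NumPy sketch; the epsilon guard and the least-squares closed form for c are our implementation choices):

```python
import numpy as np

def generate_shading(I, A, eps=1e-8):
    """Generate a grayscale shading image from color image I and
    reflectance A (both HxWx3): divide the channel-mean images
    element-wise, then fit a single scale c by least squares so
    that A * (c * S_raw) best reproduces I."""
    S_raw = I.mean(axis=2) / (A.mean(axis=2) + eps)     # per-pixel division
    recon = A * S_raw[..., None]                        # A ⊙ S_raw, broadcast to RGB
    c = np.sum(I * recon) / (np.sum(recon ** 2) + eps)  # argmin_c ||I - c*recon||^2
    return c * S_raw
```

When I exactly equals A ⊙ S, the fitted scale comes out as c = 1 and the original shading is recovered; otherwise c absorbs the global intensity mismatch.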
For our models trained on MIT, we denote inclusion of these additional generated examples in training by appending the designation GenMIT to the model name. We find that some images in the MIT dataset do not exactly follow I = A ⊙ S, but including these generated shadings still improves overall performance.
2.4 Data Synthesis: Matching Sintel to MIT
Even after generating shading images, the MIT dataset remains small enough to be problematic for data-driven approaches. While we can simply train on Sintel and test on MIT, we observed differences in dataset characteristics. Specifically, the rendering procedure generating Sintel ground-truth produces output that does not satisfy I = A ⊙ S. In order to shift the Sintel training data into a domain more representative of real images, we resynthesized input images I = A ⊙ S from the ground-truth A and S. In experiments, we denote this variant by ResynthSintel and find benefit from training with it when testing on MIT.
2.5 Data Augmentation
Throughout all experiments, we crop and mirror training images to generate additional training examples. We optionally utilize further data augmentation, denoted DA in experiments, consisting of scaling and rotating images.
2.6 Learning

Given an image I, we denote our dense prediction of albedo and shading maps as:

(A, S) = f(I; θ)

where θ consists of all CNN parameters to be learned.
2.6.1 Scale Invariant L2 Loss
Since the intensity of our ground-truth albedo and shading is not absolute, imposing a standard regression loss (L2 error) does not work. Hence, to learn θ, we use the scale-invariant L2 loss described in Eigen et al. Let Y* be a ground-truth image, of either albedo or shading, and Y be a prediction map. Denoting their difference by d = Y* − Y, the scale-invariant L2 loss is:

L_SI(Y*, Y) = (1/n) Σ_{i,c} d_{i,c}² − (λ/n²) (Σ_{i,c} d_{i,c})²

where i ranges over image coordinates, c is the channel index (RGB), and n is the number of evaluated pixels. λ is a coefficient balancing the scale-invariant term: the loss is simply least squares when λ = 0, fully scale-invariant when λ = 1, and an average of the two when λ = 0.5. We select λ = 0.5 for training on MIT or Sintel separately, as this setting has been found to produce good absolute-scale predictions while slightly improving qualitative output. We select λ = 1 for training on MIT and Sintel jointly, as the intensity scales of the two datasets differ and the generated images no longer preserve the original intensity scale. Note that n is not necessarily equal to the number of image pixels, because we ignore defective pixels in the training set.
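A direct NumPy transcription of this loss (variable names are ours):

```python
import numpy as np

def scale_invariant_l2(Y_true, Y_pred, lam=0.5):
    """Scale-invariant L2 loss over all pixels and channels.
    lam = 0: plain least squares; lam = 1: fully scale-invariant
    (a constant offset d incurs zero loss); lam = 0.5: average."""
    d = (Y_true - Y_pred).ravel()   # differences d_{i,c}
    n = d.size
    return np.sum(d ** 2) / n - lam * np.sum(d) ** 2 / n ** 2
```

With lam = 1, shifting a prediction by a constant leaves the loss unchanged, which is exactly the invariance the text describes.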
The loss function for our MSCR model is:

L = L_SI(A*, A) + L_SI(S*, S)

where A*, S* denote ground-truth and A, S our predictions.
2.6.2 Gradient L2 Loss
We also consider training with a loss that favors recovery of piecewise constant output. To do so, we use the gradient loss, an L2 error between the gradient of the prediction and that of the ground-truth. Letting ∇_x and ∇_y be derivative operators in the x- and y-dimensions of an image, and again writing d = Y* − Y, the gradient L2 loss is:

L_grad(Y*, Y) = (1/n) Σ_{i,c} [ (∇_x d)_{i,c}² + (∇_y d)_{i,c}² ]
Shading cannot be assumed piecewise constant, so we do not use the gradient loss for it. Our objective with gradient loss is:

L = L_SI(A*, A) + L_grad(A*, A) + L_SI(S*, S)
We denote as MSCR+GL the version of our model using this objective.
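A NumPy sketch of the gradient loss, using simple finite differences as the derivative operators (an assumption on our part):

```python
import numpy as np

def gradient_l2(Y_true, Y_pred):
    """L2 error between spatial gradients of prediction and ground-
    truth: zero whenever the two differ only by a constant, so it
    penalizes spurious edges rather than absolute intensity."""
    d = Y_true - Y_pred
    dx = d[:, 1:] - d[:, :-1]   # finite-difference gradient in x
    dy = d[1:, :] - d[:-1, :]   # finite-difference gradient in y
    return (np.sum(dx ** 2) + np.sum(dy ** 2)) / d.size
```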
Though large compared to other datasets for intrinsic image decomposition, MPI Sintel is still small compared to the large-scale datasets for image classification on which deep networks have seen success. We find it necessary to add regularization during training, and employ dropout with probability 0.5 for all convolutional layers except conv1 through conv5 in scale 1.
2.7 Implementation Details
We implement our algorithms in the Caffe framework. We use stochastic gradient descent with random initialization and momentum to optimize our networks. Learning rates for each layer are tuned by hand to obtain reasonable convergence. We train networks in mini-batches, with the number of iterations depending on convergence speed and dataset. We randomly crop training images and mirror them horizontally. For additional data augmentation (DA), we also randomly rotate images by a small angle and zoom by a random factor.
Due to the architecture of our scale 1 subnetwork, our CNN may take as input any image whose width and height are each a multiple of the network's overall downsampling factor. For testing, we pad images to fit this requirement and then crop the output map back to the original input size.
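The pad-then-crop procedure can be sketched as follows (a NumPy sketch; the stride value 32 is a placeholder, not necessarily the network's actual figure):

```python
import numpy as np

def pad_to_multiple(image, stride):
    """Zero-pad H and W up to the next multiple of `stride`."""
    h, w = image.shape[:2]
    ph = (-h) % stride                 # extra rows needed
    pw = (-w) % stride                 # extra cols needed
    return np.pad(image, ((0, ph), (0, pw), (0, 0)))

def crop_to(output, h, w):
    """Crop a padded output map back to the original input size."""
    return output[:h, :w]

img = np.zeros((100, 130, 3))
padded = pad_to_multiple(img, 32)      # height/width rounded up to multiples of 32
restored = crop_to(padded, 100, 130)   # back to the original spatial size
```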
3 Empirical Evaluation
|Sintel Training & Testing: Image Split||MSE||LMSE||DSSIM|
|Baseline: Shading Constant|
|Baseline: Albedo Constant|
|Chen and Koltun |
|Sintel Training & Testing: Scene Split||MSE||LMSE||DSSIM|
Key: GL = gradient loss HC = hypercolumns DA = data augmentation (scaling, rotation) GenMIT = add MIT w/generated shading to training
|MIT Training & Testing: Our Split||MSE||LMSE|
|Ours without deconv|
|Ours without DA|
|Ours without GenMIT|
|Ours + Sintel|
|Ours + ResynthSintel|
|MIT Training & Testing: Barron et al.'s Split||MSE||LMSE|
|Naive Baseline (uniform shading)|
|Ours + ResynthSintel|
Key: DA = data augmentation (scaling, rotation) GenMIT / Sintel / ResynthSintel = add MIT generated shading / Sintel / resynthesized Sintel to training
MPI Sintel dataset: We use a total of 890 images in the Sintel albedo/shading dataset, from 18 scenes with 50 frames each (one of the scenes has only 40 frames). We use two-fold cross validation, that is, training on half of the images and testing on the remaining images, to obtain test results on all images. Our training/testing split is a scene split, placing an entire scene (all images it contains) either completely in training or completely in testing. For comparison to prior work, we also retrain with the less stringent, historically-used image split of Chen and Koltun, which randomly assigns each image to the train/test set.
MIT intrinsic image dataset: MIT has 20 objects with 11 different light-source images each, for 220 images total. For MIT evaluation, we also split into two and use two-fold cross validation. Following best practices, we split by objects rather than by images.
We adopt the same three error measures as Chen and Koltun: MSE, LMSE, and DSSIM.
LMSE is the local mean-squared error, which is the average of the scale-invariant MSE errors computed on overlapping square windows of size 10% of the image along its larger dimension.
DSSIM is the dissimilarity version of the structural similarity index (SSIM), defined as (1 − SSIM)/2. SSIM characterizes image similarity as perceived by human observers. It combines errors from independent aspects of luminance, contrast, and structure, which are captured by the mean, variance, and covariance of patches.
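A sketch of the LMSE measure described above (our implementation; the half-window stride and per-window scale fitting follow common practice for this metric and are assumptions here):

```python
import numpy as np

def si_mse(t, p):
    """Scale-invariant MSE: rescale the prediction by the single
    optimal factor before comparing to the target."""
    alpha = np.sum(t * p) / (np.sum(p * p) + 1e-12)
    return np.mean((t - alpha * p) ** 2)

def lmse(t, p, frac=0.1):
    """Local MSE: average si_mse over overlapping square windows
    sized at `frac` of the image's larger dimension, with stride
    of half a window."""
    k = max(1, int(frac * max(t.shape[:2])))
    step = max(1, k // 2)
    errs = []
    for i in range(0, t.shape[0] - k + 1, step):
        for j in range(0, t.shape[1] - k + 1, step):
            errs.append(si_mse(t[i:i+k, j:j+k], p[i:i+k, j:j+k]))
    return float(np.mean(errs))
```

Because each window is rescaled independently, a prediction that is correct only up to a per-region scale still achieves near-zero LMSE.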
On Sintel, we compare our model with two trivial decomposition baselines in which either shading or albedo is assumed uniform grey, the classical Retinex algorithm, which obtains intrinsics by thresholding gradients, and three state-of-the-art intrinsics approaches which use not only RGB image input but also depth input. Barron and Malik estimate the most likely intrinsics using a shading rendering engine and learned priors on shapes and illuminations. Lee et al. estimate intrinsic image sequences from RGB+D video subject to additional shading and temporal constraints. Chen and Koltun use a refined shading model, decomposing shading into direct irradiance, indirect irradiance, and a color component. On MIT, we compare with Barron and Malik as well as the trivial baseline.
The top panel of Table 1 (image-split case) shows that, evaluated on Chen and Koltun's test set, our MSCR+dropout+GL model significantly outperforms all competing methods according to MSE and LMSE. It is also overall better according to DSSIM than the current state-of-the-art method of Chen and Koltun: while our albedo DSSIM is larger, our shading DSSIM is smaller. Note that Chen and Koltun's method utilizes depth information and is also trained on the DSSIM measure directly, whereas ours is based on the color image alone and is not trained to optimize the DSSIM score.
The bottom panel of Table 1 (scene-split case) is more indicative of an algorithm's out-of-sample generalization performance, since the test scenes have not been seen during training. These results show that: 1) the out-of-sample errors in the scene-split case are generally larger than the in-sample errors in the image-split case; 2) while HC has negligible effect, each of dropout, gradient loss, learned deconvolution, and data augmentation improves performance; 3) training on Sintel and MIT together provides a small improvement when testing on Sintel.
Figure 2 shows sample results from our best model, while Figure 4 displays a side-by-side comparison with three other approaches. An important distinction is that our results are based on RGB alone, while the other approaches require both RGB and depth input. Across a diversity of scenes, each of the three RGB+D approaches breaks down on either albedo or shading in some scene: Lee et al.'s method on the bamboo scene, Barron and Malik's method on the dragon scene, and Chen and Koltun's method on the old man scene. The quality of our results is even across scenes and remains overall consistent with both albedo and shading ground-truth.
Table 2 shows that our model gracefully adapts to real images. Trained on MIT alone, it produces reasonable results. Naively adding Sintel data to training hurts performance, but mixing our resynthesized version of Sintel into training yields noticeable improvements to albedo estimation when testing on MIT. The behavior of ablated system variants on MIT mirrors our findings on Sintel. On MIT, the learned deconvolutional layer is especially important: the output in Figure 5 exhibits clear visual degradation upon its removal. Figure 5 also illustrates a tradeoff when using resynthesized Sintel training data: there is an overall benefit, but a Sintel-specific shading prior (bluish tint) leaks in.
In addition to Sintel and MIT, we briefly experiment with testing, but not training, our models on the IIW dataset. Here, performance as measured by WHDR is less than satisfactory compared to both our own prior work and the current state-of-the-art, which are trained specifically for IIW. We speculate that there may be some discrepancy between the tasks of predicting human reflectance judgements (the WHDR metric) and physically-correct albedo-shading decompositions. As we observed when moving from Sintel to MIT, there may also be a domain shift between Sintel/MIT and IIW for which we are not compensating. We leave these interesting issues for future work.
We propose direct intrinsics, a new intrinsic image decomposition approach that is not based on the physics of image formation or the statistics of shading and albedo priors, but learns the dual associations between the image and the albedo+shading components directly from training data.
We develop a two-scale feed-forward CNN architecture based on a successful prior model for RGB-to-depth prediction, in which a coarse network predicts global context and a finer network uses the coarse network's output to predict the result at finer resolution. Combined with well-designed loss functions, data augmentation, dropout, and learned deconvolution, we demonstrate that direct intrinsics outperforms state-of-the-art methods that rely not only on more complex priors and graph-based inference, but also on the additional input of scene depth.
Our data-driven learning approach is more flexible and generalizable, and easier to build. It needs only training data, requires no hand-designed features or representations, and can adapt to unrealistic illuminations and complex albedo, shape, and lighting patterns. Our model works with both synthetic and real images, and further improves on real images when training is augmented with synthetic examples.
Acknowledgments. We thank Ayan Chakrabarti for valuable discussion. We thank both Qifeng Chen and Jon Barron for providing their code and accompanying support.
-  J. T. Barron and J. Malik. Intrinsic scene properties from a single RGB-D image. IEEE Conference on Computer Vision and Pattern Recognition, 2013.
-  J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
-  S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. ACM Trans. on Graphics, 2014.
-  D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al. (Eds.), editor, European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, Oct. 2012.
-  Q. Chen and V. Koltun. A simple model for intrinsic image decomposition with depth cues. International Conference on Computer Vision, 2013.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
-  D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. Neural Information Processing Systems, 2014.
-  E. Garces, A. Munoz, J. Lopez-Moreno, and D. Gutierrez. Intrinsic images by clustering. In Computer Graphics Forum (Eurographics Symposium on Rendering), 2012.
-  P. Gehler, C. Roth, M. Kiefel, L. Zhang, and B. Scholkopf. Recovering intrinsic images with a global sparsity prior on reflectance. In Neural Information Processing Systems, 2011.
-  R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In International Conference on Computer Vision, 2009.
-  B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv:1502.01852, 2015.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
-  B. Horn. Determining lightness from an image. Computer Graphics and Image Processing, 3:277–99, 1974.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In arXiv preprint arXiv:1408.5093, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, 2012.
-  E. H. Land and J. J. McCann. Lightness and retinex theory. Journal of Optical Society of America, 61(1):1–11, 1971.
-  K. J. Lee, Q. Zhao, X. Tong, M. Gong, S. Izadi, S. U. Lee, P. Tan, and S. Lin. Estimation of intrinsic image sequences from image+depth video. European Conference on Computer Vision, 2012.
-  Z. Liao, J. Rock, Y. Wang, and D. Forsyth. Non-parametric filtering for geometric detail extraction and material representation. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
-  F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. arXiv:1502.07411, 2015.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  M. Maire, S. X. Yu, and P. Perona. Reconstructive sparse code transfer for contour detection and semantic labeling. Asian Conference on Computer Vision, 2014.
-  T. Narihira, M. Maire, and S. X. Yu. Learning lightness from human judgement on relative reflectance. In Computer Vision and Pattern Recognition (CVPR), 2015.
-  I. Omer and M. Werman. Color lines: image specific color representation. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.
-  L. Shen and C. Yeo. Intrinsic images decomposition using a local and global sparse representation of reflectance. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
-  Y. Tang, R. Salakhutdinov, and G. Hinton. Deep lambertian networks. International Conference on Machine Learning, 2012.
-  X. Wang, D. F. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. arXiv:1411.4958, 2014.
-  S. X. Yu. Angular embedding: from jarring intensity differences to perceived luminance. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2302–9, 2009.
-  T. Zhou, P. Krähenbühl, and A. A. Efros. Learning data-driven reflectance priors for intrinsic image decomposition. International Conference on Computer Vision, 2015.