Direct Intrinsics: Learning Albedo-Shading Decomposition by Convolutional Regression

12/08/2015 · Takuya Narihira et al. · University of California, Berkeley; Toyota Technological Institute at Chicago; Sony

We introduce a new approach to intrinsic image decomposition, the task of decomposing a single image into albedo and shading components. Our strategy, which we term direct intrinsics, is to learn a convolutional neural network (CNN) that directly predicts output albedo and shading channels from an input RGB image patch. Direct intrinsics is a departure from classical techniques for intrinsic image decomposition, which typically rely on physically-motivated priors and graph-based inference algorithms. The large-scale synthetic ground-truth of the MPI Sintel dataset plays a key role in training direct intrinsics. We demonstrate results on both the synthetic images of Sintel and the real images of the classic MIT intrinsic image dataset. On Sintel, direct intrinsics, using only RGB input, outperforms all prior work, including methods that rely on RGB+Depth input. Direct intrinsics also generalizes across modalities; it produces quite reasonable decompositions on the real images of the MIT dataset. Our results indicate that the marriage of CNNs with synthetic training data may be a powerful new technique for tackling classic problems in computer vision.


1 Introduction

Algorithms for automatic recovery of physical scene properties from an input image are of interest for many applications across computer vision and graphics; examples include material recognition and re-rendering. The intrinsic image model assumes that a color image I is the point-wise product of albedo A and shading S:

I(x, y) = A(x, y) · S(x, y)    (1)

Here, albedo A is the physical reflectivity of surfaces in the scene. Considerable research focuses on automated recovery of A and S given as input only the color image I [15, 11], or given I and a depth map for the scene [19, 1, 2, 5]. Our work falls into the former category, as we predict the decomposition using only color input. Yet we outperform modern approaches that rely on color and depth input [19, 1, 2, 5].
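As a concrete illustration of Equation 1, the following minimal numpy sketch composes and inverts the point-wise product. It is not part of the original system; the single grey shading channel is a simplifying assumption made only for this example.

```python
import numpy as np

# Minimal sketch of the intrinsic image model (Equation 1): an image is the
# point-wise product of albedo and shading. Shading is a single grey channel
# here purely for illustration.
H, W = 4, 4
albedo = 0.2 + 0.8 * np.random.rand(H, W, 3)   # per-pixel reflectivity (RGB)
shading = np.random.rand(H, W, 1)              # per-pixel illumination
image = albedo * shading                       # I = A . S, broadcast over channels

# Given the image and one component, the other follows by point-wise division.
recovered_shading = image / albedo
assert np.allclose(recovered_shading, shading)
```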

Figure 1: Direct intrinsics. We construct a convolutional neural network (CNN) that, acting across an input image, directly predicts the decomposition into albedo and shading images. It essentially encodes nonlinear convolutional kernels for the output patches (green boxes) from a much larger receptive field in the input image (cyan box). We train the network on computer graphics generated images from the MPI Sintel dataset [4] (Figure 2).

We achieve such results through a drastic departure from most traditional approaches to the intrinsic image problem. Many works attack this problem by incorporating strong physics-inspired priors. One expects albedo and material changes to be correlated, motivating priors such as piecewise constancy of albedo [18, 20, 1, 2] or sparseness of the set of unique albedo values in a scene [25, 10, 26]. One also expects shading to vary smoothly over the image [9]. Tang et al. [27] explore generative learning of priors using deep belief networks. Though their use of learning aligns with our philosophy, we take a discriminative approach.

Systems motivated by physical priors are usually formulated as optimization routines solving for a point-wise decomposition that satisfies Equation 1 and also fits with priors imposed over an extended spatial domain. Hence, graph-based inference algorithms [29] and conditional random fields (CRFs) in particular [3] are often used.

 

Figure 2 columns: Image, Ground-truth Albedo, Our Albedo, Ground-truth Shading, Our Shading.
Figure 2: Albedo-shading decomposition on the MPI Sintel dataset. Top: A sampling of frames from different scenes comprising the Sintel movie. Bottom: Our decomposition results alongside ground-truth albedo and shading for some example frames.

We forgo both physical modeling constraints and graph-based inference methods. Our direct intrinsics approach is purely data-driven and learns a convolutional regression which maps a color image input to its corresponding albedo and shading outputs. It is instantiated in the form of a multiscale fully convolutional neural network (Figure 1).

Key to enabling our direct intrinsics approach is the availability of a large-scale dataset with example ground-truth albedo-shading decompositions. Unfortunately, collecting such ground-truth for real images is a challenging task, as it requires full control over the lighting environment in which images are acquired. This is possible in a laboratory setting [11], but difficult for more realistic scenes.

The Intrinsic Images in the Wild (IIW) dataset [3] attempts to circumvent the lack of training data through a large-scale human labeling effort. However, its ground-truth is not in the form of actual decompositions, but only relative reflectance judgements over a sparse set of point pairs. These are human judgements rather than physical properties. They may be sufficient for training models with strong priors [3], or most recently, CNNs for replicating human judgements [24]. But they are insufficient for data-driven learning of intrinsic image decompositions from scratch.

We circumvent the data availability roadblock by training on purely synthetic images and testing on both real and synthetic images. The MPI Sintel dataset [4] provides photo-realistic rendered images and corresponding albedo-shading ground-truth derived from the underlying 3D models and art assets. These were first used as training data by Chen and Koltun [5] for deriving a more accurate physics-based intrinsics model. Figure 2 shows examples.

Section 2 describes the details of our CNN architecture and learning objectives for direct intrinsics. Our design is motivated by recent work on using CNNs to recover depth and surface normal estimates from a single image [28, 8, 21, 7]. Section 3 provides experimental results and benchmarks on the Sintel dataset, and examines the portability of our model to real images. Section 4 concludes.

2 Direct Intrinsics

We break the full account of our system into specification of the CNN architecture, description of the training data, and details of the loss function during learning.


Left panel: MSCR. Right panel: MSCR+HC.
Figure 3: CNN architectures. We explore two architectural variants for implementing the direct intrinsics network shown in Figure 1. Left: Motivated by the multiscale architecture used by Eigen and Fergus [7] for predicting depth from RGB, we adapt a similar network structure for direct prediction of albedo and shading from RGB and term it Multiscale CNN Regression (MSCR). Right: Recent work [23, 12] shows value in directly connecting intermediate layers to the output. We experiment with a version of such connections, adopting the hypercolumn (HC) terminology [12]; the remainder of the network is identical to that on the left. M denotes the input size factor.

2.1 Model

Intrinsic decomposition requires access to the precise details of an image patch as well as the overall gist of the entire scene. The multiscale model of Eigen and Fergus [7] for predicting scene depth has both ingredients, and we build upon their network architecture. In their two-scale network, they first extract global contextual information in a coarse subnetwork (scale 1), and use that subnetwork's output as an additional input to a finer-scale network (scale 2). As Figure 3 shows, we adopt a Multiscale CNN Regression (MSCR) architecture with important differences from [7]:

  • Instead of fully connected layers in scale 1, we use a convolution layer following the upsampling layer. This choice enables our model to run on arbitrary-sized images in a fully convolutional fashion.

  • For nonlinear activations, we use Parametric Rectified Linear Units (PReLUs) [13]. With PReLUs, the negative slope for each activation map channel is a learnable parameter:

    f(y_i) = y_i if y_i > 0;  f(y_i) = a_i · y_i otherwise    (2)

    where y_i is the pre-activation value at the i-th dimension of a feature map and a_i is the learned negative slope for that channel. During experiments, we observe better convergence with PReLUs compared to ReLUs.

  • Our network has two outputs, albedo and shading (-a and -s in Figure 3), which it predicts simultaneously.

  • We optionally use deconvolution to learn to upsample the scale 2 output to the resolution of the original images [22]. Without deconvolution, the network produces a low-resolution RGB output that is upsampled with fixed bilinear interpolation. With deconvolution, the output layer carries more channels and the network learns to upsample from this richer representation.

In addition to these basic changes, we explore a variant of our model, shown on the right side of Figure 3, that connects multiple intermediate layers directly to the output. The reasoning follows that of Maire et al. [23] and Hariharan et al. [12], with the objective of directly capturing a representation of the input at multiple levels of abstraction. We adopt the "hypercolumn" (HC) terminology [12] to designate this modification to MSCR.

The remaining architectural details are as follows. For convolutional layers conv1 through conv5 in the scale 1 net, we take the common AlexNet [17] design. Following those, we upsample the feature map to a quarter of the original image size and feed it to an additional convolutional layer (conv6). Scale 2 consists of convolutional layers for feature extraction followed by albedo and shading prediction. The first of these layers extracts features directly from the RGB input; we then concatenate the output of the scale 1 subnetwork with these features and feed the result into the remaining convolutional and prediction layers. The optional learned deconvolutional layer upsamples this output in place of fixed interpolation. Whether using deconvolution or simple upsampling, we evaluate our output on a grid of the same spatial resolution as the original image.
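To make the architecture concrete, here is a hedged PyTorch sketch of a two-scale, fully convolutional regressor in the spirit of MSCR. It is not the authors' Caffe implementation: the channel widths, kernel sizes, strides, and the dropout placement are placeholder assumptions, and the final upsampling uses fixed bilinear interpolation (the learned-deconvolution variant would swap in a transposed convolution).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Scale1(nn.Module):
    """Coarse subnetwork: an AlexNet-style conv1-conv5 trunk followed by a
    1x1 conv6 head (in place of fully connected layers, so the model runs
    on arbitrary-sized inputs). All channel widths here are illustrative."""
    def __init__(self, out_ch=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4, padding=5), nn.PReLU(96),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(96, 256, 5, padding=2), nn.PReLU(256),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(256, 384, 3, padding=1), nn.PReLU(384),
            nn.Conv2d(384, 384, 3, padding=1), nn.PReLU(384),
            nn.Conv2d(384, 256, 3, padding=1), nn.PReLU(256),
        )
        self.conv6 = nn.Conv2d(256, out_ch, 1)

    def forward(self, x):
        return self.conv6(self.features(x))   # coarse, low-resolution context


class MSCRSketch(nn.Module):
    """Scale 2 extracts local features from the RGB input, concatenates
    upsampled scale-1 context, and predicts albedo and shading jointly."""
    def __init__(self, ctx_ch=64, mid_ch=64):
        super().__init__()
        self.scale1 = Scale1(ctx_ch)
        self.local = nn.Sequential(
            nn.Conv2d(3, mid_ch, 9, stride=4, padding=4), nn.PReLU(mid_ch))
        self.fuse = nn.Sequential(
            nn.Conv2d(mid_ch + ctx_ch, mid_ch, 5, padding=2), nn.PReLU(mid_ch),
            nn.Dropout2d(0.5),   # dropout on later conv layers (illustrative)
            nn.Conv2d(mid_ch, mid_ch, 5, padding=2), nn.PReLU(mid_ch),
        )
        self.pred_albedo = nn.Conv2d(mid_ch, 3, 5, padding=2)
        self.pred_shading = nn.Conv2d(mid_ch, 3, 5, padding=2)

    def forward(self, x):
        loc = self.local(x)                          # local features, ~1/4 res
        ctx = F.interpolate(self.scale1(x), size=loc.shape[2:],
                            mode='bilinear', align_corners=False)
        f = self.fuse(torch.cat([loc, ctx], dim=1))
        # Fixed bilinear upsampling back to input resolution; the deconv
        # variant would replace this with a learned transposed convolution.
        up = lambda y: F.interpolate(y, size=x.shape[2:], mode='bilinear',
                                     align_corners=False)
        return up(self.pred_albedo(f)), up(self.pred_shading(f))


net = MSCRSketch()
albedo, shading = net(torch.randn(1, 3, 256, 256))   # each 1 x 3 x 256 x 256
```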

2.2 MPI Sintel Dataset

For training data, we follow Chen and Koltun [5] and use the "clean pass" images of the MPI Sintel dataset instead of the "final" images, which are the result of additional computer graphics effects that distract from our application. This choice eliminates effects such as depth of field, motion blur, and fog. Ground-truth shading images are generated by rendering the scene with all elements assigned a constant grey albedo.

Some images contain defective pixels due to software rendering issues. We follow [5] and do not use images with defects in evaluation. However, limited variation within the Sintel dataset is a concern for data-driven learning. Hence, we do use defective images in training, by masking out defective pixels (ignoring their contribution to the training error).

2.3 MIT Intrinsic Image Dataset

To demonstrate the adaptability of our model to real-world images, we use the MIT intrinsic image dataset [11]. Images in this dataset are acquired with a special apparatus, yielding ground-truth reflectance and shading components for real-world objects. Here, reflectance is synonymous with our terminology of albedo.

Due to the limited scalability of the collection method, the MIT dataset contains only 20 different objects, with each object imaged under 11 different light sources. Only 1 of the 11 images per object has shading ground-truth. We generate each of the remaining 10 shading images from the corresponding original image I and the object's reflectance image R (identical for all images of an object, since they are taken of the same object with the same camera settings) by element-wise division of Ī by R̄, where Ī and R̄ denote the mean values over the RGB channels of I and R respectively, scaled by the factor α that minimizes the sum of squared error of the reconstruction α R · S ≈ I.
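A minimal numpy sketch of this shading-generation step follows. It reflects our reading of the procedure above; in particular, the exact placement of the scale factor α is an assumption, so treat this as illustrative rather than the authors' exact recipe.

```python
import numpy as np

def generate_shading(image, reflectance, eps=1e-6):
    """Sketch: derive a shading image for one lighting condition from the
    original image I and the object's shared reflectance image R.
    Assumes image and reflectance are float arrays of shape (H, W, 3)."""
    i_bar = image.mean(axis=2, keepdims=True)        # mean over RGB channels
    r_bar = reflectance.mean(axis=2, keepdims=True)
    shading = i_bar / np.maximum(r_bar, eps)         # element-wise division
    # Fit a single scale alpha by least squares so that
    # alpha * reflectance * shading best reconstructs the image.
    recon = reflectance * shading
    alpha = (recon * image).sum() / max((recon * recon).sum(), eps)
    return alpha * shading
```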

For our models trained on MIT, we denote the inclusion of these additional generated examples in training by appending the designation GenMIT to the model name. We find that some images in the MIT dataset do not exactly satisfy Equation 1, but including these generated shadings still improves overall performance.

2.4 Data Synthesis: Matching Sintel to MIT

Even after generating shading images, the MIT dataset remains small enough to be problematic for data-driven approaches. While we can simply train on Sintel and test on MIT, we observed differences in dataset characteristics. Specifically, the rendering procedure generating Sintel ground-truth produces output that does not satisfy Equation 1. In order to shift the Sintel training data into a domain more representative of real images, we resynthesized one component of the Sintel ground-truth from the other two, so that each triple satisfies Equation 1 exactly. In experiments, we denote this variant by ResynthSintel and find benefit from training with it when testing on MIT.

2.5 Data Augmentation

Throughout all experiments, we crop and mirror training images to generate additional training examples. We optionally utilize further data augmentation, denoted DA in experiments, consisting of scaling and rotating images.
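The essential constraint in this setting is that every random transform must be applied identically to the image and to both ground-truth maps. The sketch below illustrates this; the crop size, rotation range, and zoom range are assumptions for illustration only, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def augment(image, albedo, shading, crop=224, rng=np.random):
    """Sketch of joint augmentation: the same random rotation, zoom, mirror,
    and crop are applied to the image and to both targets. Assumes float
    arrays of shape (H, W, 3) that remain larger than `crop` after zooming."""
    angle = rng.uniform(-15, 15)          # rotation range: an assumption
    scale = rng.uniform(0.8, 1.2)         # zoom range: an assumption
    flip = rng.rand() < 0.5

    def transform(x):
        x = rotate(x, angle, axes=(0, 1), reshape=False, order=1, mode='reflect')
        x = zoom(x, (scale, scale, 1), order=1)
        return x[:, ::-1] if flip else x

    image, albedo, shading = map(transform, (image, albedo, shading))
    top = rng.randint(0, image.shape[0] - crop + 1)
    left = rng.randint(0, image.shape[1] - crop + 1)
    window = np.s_[top:top + crop, left:left + crop]
    return image[window], albedo[window], shading[window]
```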

2.6 Learning

Given an image I, we denote our dense prediction of the albedo and shading maps as:

(Â, Ŝ) = F(I; θ)    (3)

where θ consists of all CNN parameters to be learned.

2.6.1 Scale Invariant L2 Loss

Since the intensity of our ground-truth albedo and shading is not absolute, imposing a standard regression loss (L2 error) does not work. Hence, to learn θ, we use the scale-invariant L2 loss described in [7]. Let Y* be a ground-truth image, in log space, of either albedo or shading, and let Y be the corresponding prediction map. Denoting their difference by d = Y − Y*, the scale-invariant L2 loss is:

L_SI(Y*, Y) = (1/n) Σ_{i,c} d_{i,c}² − (λ/n²) ( Σ_{i,c} d_{i,c} )²    (4)

where i ranges over image coordinates, c is the channel index (RGB), and n is the number of evaluated pixels. λ is a coefficient for balancing the scale-invariant term: the loss is simply least squares when λ = 0, fully scale invariant when λ = 1, and an average of the two when λ = 0.5. We select λ = 0.5 for training on MIT or Sintel separately, as this setting has been found to produce good absolute-scale predictions while slightly improving qualitative output [7]. We select λ = 1 for training on MIT and Sintel jointly, as the intensity scales of the two datasets differ and the generated images no longer preserve the original intensity scale. Note that n is not necessarily equal to the number of image pixels, because we ignore defective pixels in the training set.
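A short PyTorch sketch of this loss (assuming log-space inputs of shape (B, 3, H, W) and an optional binary mask marking valid, non-defective pixels; computed over the whole batch for brevity):

```python
import torch

def scale_invariant_l2(pred, target, mask=None, lam=0.5):
    """Scale-invariant L2 loss (Equation 4). lam = 0 gives plain L2,
    lam = 1 the fully scale-invariant loss, lam = 0.5 their average."""
    d = pred - target
    if mask is not None:
        d = d * mask                     # zero out defective pixels
        n = mask.expand_as(d).sum()      # count only evaluated entries
    else:
        n = torch.tensor(float(d.numel()))
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / (n ** 2)

def mscr_loss(pred_a, pred_s, gt_a, gt_s, mask=None, lam=0.5):
    """Total MSCR loss (Equation 5): sum over the albedo and shading outputs."""
    return (scale_invariant_l2(pred_a, gt_a, mask, lam) +
            scale_invariant_l2(pred_s, gt_s, mask, lam))
```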

The loss function for our MSCR model is the sum of the scale-invariant losses on both outputs:

L_MSCR = L_SI(A*, Â) + L_SI(S*, Ŝ)    (5)

2.6.2 Gradient L2 Loss

We also consider training with a loss that favors recovery of piecewise constant output. To do so, we use the gradient loss, an L2 error between the gradient of the prediction and that of the ground-truth. Letting ∇x and ∇y be derivative operators in the x- and y-dimensions of an image, the gradient L2 loss is:

L_grad(Y*, Y) = (1/n) Σ_{i,c} [ (∇x d)_{i,c}² + (∇y d)_{i,c}² ]    (6)

Shading cannot be assumed piecewise constant, so we do not use the gradient loss for it. Our objective with gradient loss adds the albedo gradient term to Equation 5:

L = L_SI(A*, Â) + L_SI(S*, Ŝ) + L_grad(A*, Â)    (7)

We denote as MSCR+GL the version of our model using this objective.
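A matching sketch of the gradient term, using simple finite differences in place of the derivative operators (same input conventions as the loss sketch above):

```python
import torch

def gradient_l2(pred, target, mask=None):
    """Gradient L2 loss (Equation 6): L2 error between finite-difference
    gradients of prediction and ground-truth; applied to albedo only."""
    d = pred - target
    if mask is not None:
        d = d * mask
    dx = d[:, :, :, 1:] - d[:, :, :, :-1]    # horizontal differences
    dy = d[:, :, 1:, :] - d[:, :, :-1, :]    # vertical differences
    n = mask.expand_as(d).sum() if mask is not None else torch.tensor(float(d.numel()))
    return ((dx ** 2).sum() + (dy ** 2).sum()) / n

# Objective with gradient loss (Equation 7):
#   total = mscr_loss(pred_a, pred_s, gt_a, gt_s, mask) + gradient_l2(pred_a, gt_a, mask)
```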

2.6.3 Dropout

Though large compared to other datasets for intrinsic image decomposition, MPI Sintel, with 890 examples, is still small compared to the large-scale datasets for image classification [6] on which deep networks have seen success. We find it necessary to add regularization during training and employ dropout [14] for all convolutional layers except conv1 through conv5 in scale 1.

2.7 Implementation Details

We implement our algorithms in the Caffe framework [16]. We use stochastic gradient descent with random initialization and momentum to optimize our networks. Learning rates for each layer are tuned by hand to obtain reasonable convergence. We train networks in mini-batches, with the number of iterations depending on convergence speed and dataset. We randomly crop training images to a fixed size and mirror them horizontally. For additional data augmentation (DA), we also randomly rotate images within a small range of angles and zoom by a random factor.

Due to the architecture of our scale 1 subnetwork, our CNN may take as input any image whose width and height are each a multiple of the network's overall downsampling factor. For testing, we pad images to meet this requirement and then crop the output map back to the original input size.
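A small sketch of this pad-then-crop procedure; the downsampling factor of 32 below is an assumed placeholder, since the exact value depends on the network's strides.

```python
import torch
import torch.nn.functional as F

def predict_full_resolution(net, image, multiple=32):
    """Pad the input so its height and width are multiples of the network's
    overall downsampling factor (value assumed here), run the fully
    convolutional model, then crop both outputs back to the original size."""
    _, _, h, w = image.shape
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    padded = F.pad(image, (0, pad_w, 0, pad_h), mode='reflect')
    albedo, shading = net(padded)
    return albedo[:, :, :h, :w], shading[:, :, :h, :w]
```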

3 Empirical Evaluation

Sintel training & testing, image split (metrics: MSE, LMSE, and DSSIM, each reported for Albedo, Shading, and their average) — methods compared: Baseline (constant shading); Baseline (constant albedo); Retinex [11]; Lee et al. [19]; Barron et al. [2]; Chen and Koltun [5]; MSCR+dropout+GL (ours).

Sintel training & testing, scene split (same metrics) — our architectural variants: MSCR; MSCR+dropout; MSCR+dropout+HC; MSCR+dropout+GL; MSCR+dropout+deconv+DA; MSCR+dropout+deconv+DA+GenMIT.

Key: GL = gradient loss; HC = hypercolumns; DA = data augmentation (scaling, rotation); GenMIT = add MIT with generated shading to training.

Table 1: MPI Sintel benchmarks. We report the standard MSE, LMSE, and DSSIM metrics (lower is better) as used in [5]. The upper portion displays test performance for the historical split in which frames from Sintel are randomly assigned to train or test sets; our method significantly outperforms competitors. The lower portion compares our architectural variants on a more stringent dataset split which ensures that images from a single scene are either all in the training set or all in the test set. Figures 2 and 4 display results from our best model.
MIT training & testing, our split (metrics: MSE for Albedo, Shading, and their average; LMSE for Albedo, Shading, and the reweighted total of [11]) — methods compared: Ours (MSCR+dropout+deconv+DA+GenMIT); Ours without deconv; Ours without DA; Ours without GenMIT; Ours + Sintel; Ours + ResynthSintel.

MIT training & testing, Barron et al.'s split (same metrics) — methods compared: Naive baseline (uniform shading, from [2]); Barron et al. [2]; Ours + ResynthSintel.

Key: DA = data augmentation (scaling, rotation); GenMIT / Sintel / ResynthSintel = add MIT generated shading / Sintel / resynthesized Sintel to training.

Table 2: MIT Intrinsic benchmarks. On the real images of the MIT dataset [11], our system is competitive with Barron et al. [2] according to MSE, but lags behind in LMSE. Ablated variants (upper portion, middle rows) highlight the importance of replacing upsampling with learned deconvolutional layers. Variants using additional sources of training data (upper portion, bottom rows) show gain from training with resynthesized Sintel ground-truth that obeys the same invariants as the MIT data. Note that the last LMSE column is the reweighted score of [11] rather than a simple average. For a visual comparison among the base system, its no-deconv ablation, and the ResynthSintel variant, see Figure 5.

MPI Sintel dataset: We use a total of 890 images in the Sintel albedo/shading dataset, drawn from 18 scenes of 50 frames each (one scene has only 40 frames). We use two-fold cross validation, that is, training on half of the images and testing on the remaining images, to obtain test results on all images. Our training/testing split is a scene split, placing an entire scene (all images it contains) either completely in training or completely in testing. For comparison to prior work, we also retrain with the less stringent, historically used image split of Chen and Koltun [5], which randomly assigns each image to the train or test set.

MIT intrinsic image dataset: MIT has 20 objects, each imaged under 11 different light sources, for 220 images total. For MIT evaluation, we also use two-fold cross validation. Following best practices, we split by objects rather than by images.

We adopt the same three error measures as [5]:

MSE is the mean-squared error between albedo/shading results and their ground-truth. Following [11, 5], we use scale-invariant measures when benchmarking intrinsics results; the absolute brightness of each image is adjusted to minimize the error.

LMSE is the local mean-squared error, which is the average of the scale-invariant MSE errors computed on overlapping square windows of size 10% of the image along its larger dimension.

DSSIM is the dissimilarity version of the structural similarity index (SSIM), defined as (1 − SSIM)/2. SSIM characterizes image similarity as perceived by human observers. It combines errors from independent aspects of luminance, contrast, and structure, which are captured by the mean, variance, and covariance of patches.
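For reference, a numpy sketch of the first two measures (scale-invariant MSE and windowed LMSE); the half-window step and minimum window size are assumptions. DSSIM can be obtained from any standard SSIM implementation as (1 − SSIM)/2.

```python
import numpy as np

def si_mse(pred, gt):
    """Scale-invariant MSE: rescale the prediction by the least-squares
    optimal factor before comparing against ground-truth."""
    alpha = (pred * gt).sum() / max((pred * pred).sum(), 1e-12)
    return float(((alpha * pred - gt) ** 2).mean())

def lmse(pred, gt, window_frac=0.1):
    """Local MSE: average scale-invariant MSE over overlapping square windows
    sized at 10% of the image's larger dimension."""
    k = max(2, int(window_frac * max(gt.shape[:2])))
    step = max(1, k // 2)                      # half-window overlap (assumed)
    errs = [si_mse(pred[r:r + k, c:c + k], gt[r:r + k, c:c + k])
            for r in range(0, gt.shape[0] - k + 1, step)
            for c in range(0, gt.shape[1] - k + 1, step)]
    return float(np.mean(errs))
```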

On Sintel, we compare our model with two trivial decomposition baselines, in which either shading or albedo is assumed uniform grey; the classical Retinex algorithm ([11] version), which obtains intrinsics by thresholding gradients; and three state-of-the-art intrinsics approaches that use not only RGB image input but also depth input. Barron et al. [2] estimate the most likely intrinsics using a shading rendering engine and learned priors on shapes and illuminations. Lee et al. [19] estimate intrinsic image sequences from RGB+D video subject to additional shading and temporal constraints. Chen and Koltun [5] use a refined shading model, decomposing shading into direct irradiance, indirect irradiance, and a color component. On MIT, we compare with Barron et al. [2] as well as the trivial baseline.

3.1 Results

Figure 4 columns: Input (Color, Depth), Ours, Chen and Koltun [5], Barron et al. [2], Lee et al. [19], Ground-truth; Albedo and Shading shown for each method.
Figure 4: Comparison on MPI Sintel. We compare our intrinsic image decompositions with those of Lee et al. [19], Barron et al. [2], and Chen and Koltun [5]. Our algorithm is unique in using only RGB and not depth input channels, yet it generates decompositions superior to those of the other algorithms, which all rely on full RGB+D input (inverse depth shown above). See Table 1 for quantitative benchmarks.
Figure 5 columns: Image; then Ground-truth, w/o deconv, Base System, and w/ResynthSintel for Albedo and for Shading.
Figure 5: Adaptability to real images. Our base system (MSCR+dropout+deconv+DA+GenMIT) produces quite reasonable results when trained on the MIT intrinsic image dataset. Without learned deconvolution, both albedo and shading quality suffer noticeably. Including resynthesized Sintel data during training improves albedo prediction, but biases shading towards Sintel-specific lighting conditions.

The top panel of Table 1 (image-split case) shows that, evaluated on Chen and Koltun's test set, our MSCR+dropout+GL model significantly outperforms all competing methods according to MSE and LMSE. It is also better overall according to DSSIM than the current state-of-the-art method of Chen and Koltun: while our albedo DSSIM is larger, our shading DSSIM is smaller. Note that Chen and Koltun's method utilizes depth information and is also trained on the DSSIM measure directly, whereas ours is based on the color image alone and is not trained to optimize the DSSIM score.

The bottom panel of Table 1 (scene-split case) is more indicative of an algorithm's out-of-sample generalization performance, since the test scenes have not been seen during training. These results show that: 1) out-of-sample errors in the scene-split case are generally larger than the in-sample errors in the image-split case; 2) while HC has negligible effect, each of dropout, gradient loss, learned deconvolutional layers, and data augmentation improves performance; 3) training on Sintel and MIT together provides a small improvement when testing on Sintel.

Figure 2 shows sample results from our best model, while Figure 4 displays a side-by-side comparison with three other approaches. An important distinction is that our results are based on RGB alone, while the other approaches require both RGB and depth input. Across a diversity of scenes, each of the three RGB+D approaches breaks down on either albedo or shading in at least one scene: Lee et al.'s method on the bamboo scene, Barron et al.'s method on the dragon scene, and Chen and Koltun's method on the old man scene. The quality of our results is even across scenes and remains overall consistent with both the albedo and shading ground-truth.

Table 2 shows that our model gracefully adapts to real images. Trained on MIT alone, it produces reasonable results. Naively adding Sintel data to training hurts performance, but mixing our resynthesized version of Sintel into training noticeably improves albedo estimation when testing on MIT. The behavior of ablated system variants on MIT mirrors our findings on Sintel. On MIT, the learned deconvolutional layer is especially important; output in Figure 5 exhibits clear visual degradation upon its removal. Figure 5 also illustrates a tradeoff when using resynthesized Sintel training data: there is an overall benefit, but a Sintel-specific shading prior (bluish tint) leaks in.

In addition to Sintel and MIT, we briefly experiment with testing, but not training, our models on the IIW dataset [3]. Here, performance as measured by WHDR is less than satisfactory compared to both our own prior work [24] and the current state-of-the-art [30], which are trained specifically for IIW. We speculate that there may be a discrepancy between the tasks of predicting human reflectance judgements (the WHDR metric) and physically correct albedo-shading decompositions. As we observed when moving from Sintel to MIT, there may also be a domain shift between Sintel/MIT and IIW for which we are not compensating. We leave these interesting issues for future work.

4 Conclusion

We propose direct intrinsics, a new intrinsic image decomposition approach that is not based on the physics of image formation or the statistics of shading and albedo priors, but learns the dual associations between the image and the albedo+shading components directly from training data.

We develop a two-scale feed-forward CNN architecture based on a successful previous model for RGB-to-depth prediction, in which the coarse-scale network captures global context and the finer-scale network uses the coarse network's output to predict the full-resolution result. Combined with well-designed loss functions, data augmentation, dropout, and deconvolution, we demonstrate that direct intrinsics outperforms state-of-the-art methods that rely not only on more complex priors and graph-based inference, but also on the additional input of scene depth.

Our data-driven learning approach is more flexible and generalizable, and simpler to formulate. It needs only training data, requires no hand-designed features or representations, and can adapt to unrealistic illuminations and complex albedo, shape, and lighting patterns. Our model works with both synthetic and real images, and can further improve on real images when training is augmented with synthetic examples.


Acknowledgments. We thank Ayan Chakrabarti for valuable discussion. We thank both Qifeng Chen and Jon Barron for providing their code and accompanying support.

References

  • [1] J. T. Barron and J. Malik. Intrinsic scene properties from a single RGB-D image. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [2] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
  • [3] S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. ACM Trans. on Graphics, 2014.
  • [4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, Oct. 2012.
  • [5] Q. Chen and V. Koltun. A simple model for intrinsic image decomposition with depth cues. International Conference on Computer Vision, 2013.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • [7] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [8] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. Neural Information Processing Systems, 2014.
  • [9] E. Garces, A. Munoz, J. Lopez-Moreno, and D. Gutierrez. Intrinsic images by clustering. In Computer Graphics Forum (Eurographics Symposium on Rendering), 2012.
  • [10] P. Gehler, C. Roth, M. Kiefel, L. Zhang, and B. Scholkopf. Recovering intrinsic images with a global sparsity prior on reflectance. In Neural Information Processing Systems, 2011.
  • [11] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In International Conference on Computer Vision, 2009.
  • [12] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. arXiv:1502.01852, 2015.
  • [14] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
  • [15] B. Horn. Determining lightness from an image. Computer Graphics and Image Processing, 3:277–99, 1974.
  • [16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, 2012.
  • [18] E. H. Land and J. J. McCann. Lightness and retinex theory. Journal of Optical Society of America, 61(1):1–11, 1971.
  • [19] K. J. Lee, Q. Zhao, X. Tong, M. Gong, S. Izadi, S. U. Lee, P. Tan, and S. Lin. Estimation of intrinsic image sequences from image+depth video. European Conference on Computer Vision, 2012.
  • [20] Z. Liao, J. Rock, Y. Wang, and D. Forsyth. Non-parametric filtering for geometric detail extraction and material representation. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [21] F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. arXiv:1502.07411, 2015.
  • [22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [23] M. Maire, S. X. Yu, and P. Perona. Reconstructive sparse code transfer for contour detection and semantic labeling. Asian Conference on Computer Vision, 2014.
  • [24] T. Narihira, M. Maire, and S. X. Yu. Learning lightness from human judgement on relative reflectance. In Computer Vision and Pattern Recognition (CVPR), 2015.
  • [25] I. Omer and M. Werman. Color lines: image specific color representation. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.
  • [26] L. Shen and C. Yeo. Intrinsic images decomposition using a local and global sparse representation of reflectance. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
  • [27] Y. Tang, R. Salakhutdinov, and G. Hinton. Deep lambertian networks. In International Conference on Machine Learning, 2012.
  • [28] X. Wang, D. F. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. arXiv:1411.4958, 2014.
  • [29] S. X. Yu. Angular embedding: from jarring intensity differences to perceived luminance. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2302–9, 2009.
  • [30] T. Zhou, P. Krähenbühl, and A. A. Efros. Learning data-driven reflectance priors for intrinsic image decomposition. International Conference on Computer Vision, 2015.