1 Introduction
Perceiving the physical properties of a scene undoubtedly plays a fundamental role in understanding real-world imagery. Such inherent properties include the 3D geometric configuration, the illumination or shading, and the reflectance or albedo of each scene surface. Depth prediction and intrinsic image decomposition, which aims to recover shading and albedo, are thus two fundamental yet challenging tasks in computer vision. While they address different aspects of scene understanding, there exist strong consistencies between depth and intrinsic images, such that information about one provides valuable prior knowledge for recovering the other.
In the intrinsic image decomposition literature, several works have exploited measured depth information to make the decomposition problem more tractable [1, 2, 3, 4, 5]. These techniques have all demonstrated better performance than using RGB images alone. On the other hand, in the literature for single-image depth prediction, illumination-invariant features have been utilized for greater robustness in depth inference [6, 7], and shading discontinuities have been used to detect surface boundaries [8], suggesting that intrinsic images can be employed to enhance depth prediction performance. Although the two tasks are mutually beneficial, previous research has solved them only in sequence, by using estimated intrinsic images to constrain depth prediction [8], or vice versa [9]. We propose in this paper to instead jointly predict depth and intrinsic images in a manner where the two complementary tasks can assist each other.
We address this joint prediction problem using convolutional neural networks (CNNs), which have yielded state-of-the-art performance for the individual problems of single-image depth prediction [6, 7] and intrinsic image decomposition [9, 10, 11], but are hampered by ambiguity issues that arise from limited training sets. In our work, the two tasks are formulated synergistically in a joint conditional random field (CRF) that is solved using a novel CNN architecture, called the joint convolutional neural field (JCNF) model. This architecture differs from previous CNNs in several ways tailored to our particular problem. One is the sharing of convolutional activations and layers between the networks for each task, which allows each network to account for inferences made in the other networks. Another is to perform learning in the gradient domain, where there exist stronger correlations between depth and intrinsic images than in the image value domain, which helps to deal with the ambiguity problem caused by limited training sets. A third is the incorporation of a gradient scale network which jointly learns the confidence of the estimated gradients, to more robustly balance them in the solution. These networks of the JCNF model are jointly learned using a unified energy function in a joint CRF.
Within this system, depth, shading, and albedo are predicted in a coarse-to-fine manner that yields more globally consistent results. Our experiments show that this joint prediction outperforms existing depth prediction methods and intrinsic image decomposition techniques on various benchmarks.
2 Related Work
2.0.1 Depth Prediction from a Single Image
Traditional methods for this task have formulated depth prediction as a Markov random field (MRF) learning problem [12, 13, 14]. As exact MRF learning and inference are intractable in general, most of these approaches employ approximation methods, such as linear regression of depth with image features [12], learning image-depth correlation with a non-linear kernel function [13], and training category-adaptive model parameters [14]. Although these parametric models infer plausible depth maps to some extent, they cannot estimate the depth of natural scenes reliably due to their limited learning capability.
By leveraging the availability of large RGB-D databases, data-driven approaches have been actively researched [15, 16]. Konrad et al. [15] proposed a depth fusion scheme to infer the depth map by retrieving the nearest images in the dataset, followed by an aggregation via weighted median filtering. Karsch et al. [16] presented the depth transfer (DT) approach, which retrieves the nearest similar images and warps their depth maps using dense SIFT flow. Inspired by this method, Choi et al. [17] proposed the depth analogy (DA) approach that transfers depth gradients from the nearest images, demonstrating the effectiveness of gradient-domain learning. Although these methods can extract reliable depth for certain scenes, there exist many others for which the nearest images are dissimilar and unsuitable. Recently, Kong et al. [8] extended the DT approach [16] by using albedo and shading for image matching as well as for detecting contours at surface boundaries. In contrast to our approach, the intrinsic images are estimated independently from the depth prediction.
More recently, methods have been proposed based on CNNs. Eigen et al. [6] proposed multi-scale CNNs (MSCNNs) for predicting depth maps directly from a single image. Other CNN models were later proposed for depth estimation [18], including a deep convolutional neural field (DCNF) by Fayao et al. [7] that estimates depth on each superpixel while enforcing smoothness within a CRF. CNN-based methods clearly outperform conventional techniques, and we aim to elevate the performance further by accounting for intrinsic image information.
2.0.2 Intrinsic Image Decomposition
The notion of intrinsic images was first introduced in [19]. Conventional methods are largely based on Retinex theory [20, 21, 22], which attributes large image gradients to albedo changes and smaller gradients to shading. More recent approaches have employed a variety of techniques, based on gradient distribution priors [23], dense CRFs [24], and hybrid optimization to separate albedo and shading gradients [25]. These single-image based methods, however, are inherently limited by the fundamental ill-posedness of the problem. To partially alleviate this limitation, several approaches have utilized additional input, such as multiple images [26, 27, 28], user interaction [29, 30], and measured depth maps [1, 2, 3, 4, 5]. The use of additional data such as measured depth clearly increases performance but reduces applicability.
Related to our work is the method of Barron and Malik [31], which estimates object shape in addition to intrinsic images. To regularize the estimation, the method utilizes statistical priors on object shape and albedo which are not generally applicable to images of full scenes.
More recently, intrinsic image decomposition has been addressed using CNNs [9, 10, 11]. Zhou et al. [10] proposed a multi-stream CNN to predict the relative reflectance ordering between image patches from large-scale human annotations. Narihira et al. [11] learned a CNN that directly predicts albedo and shading from an RGB image patch. Shelhamer et al. [9] estimated depth through a fully convolutional network and used it to constrain the intrinsic image decomposition. Unlike our approach, the depth and intrinsic images are estimated sequentially.
3 Formulation
3.1 Problem Statement and Model Architecture
Let us define a color image such that for pixel , where is a discrete image domain. Similarly, depth, albedo and shading can be defined as and . All images are defined in the log domain. Given a training set of color, depth, albedo, and shading images denoted by , where is the number of training images, we first aim to learn a prediction model that approximates depth , albedo , and shading from each color image . This prediction model will then be used to infer reliable depth , albedo , and shading simultaneously from a single query image .
We specifically learn the joint prediction model in the gradient domain, where depth and intrinsic images generally exhibit stronger correlation than in the value domain, as exemplified in Fig. 1. This greater correlation and reduced discrepancy among the gradient fields facilitate joint learning of the two tasks by allowing them to better leverage information from each other (the gradient operator is a differential operator defined in the x- and y-directions). We therefore formulate our model to predict the depth, albedo, and shading gradient fields from the color image. Our method additionally learns the confidence of predicted gradients based on their consistency with one another in the training set.
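As a minimal illustration of the gradient-domain representation, the x- and y-direction differential operator can be implemented with forward differences (our discretization choice; a sketch, not the paper's code):

```python
import numpy as np

def grad(img):
    """Forward-difference gradients in the x and y directions.
    A common discretization; the paper's exact operator is an assumption here."""
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    gx[:, :-1] = img[:, 1:] - img[:, :-1]  # horizontal differences
    gy[:-1, :] = img[1:, :] - img[:-1, :]  # vertical differences
    return gx, gy
```

Learning then operates on `(gx, gy)` of the depth, albedo, and shading images rather than on their raw values.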
We formulate this joint prediction using convolutional neural networks (CNNs) in a joint conditional random field (CRF). Our system architecture is structured as three cooperating networks, namely a depth prediction network, an intrinsic prediction network, and a gradient scale network. The depth prediction network is modeled by two feedforward processes and , where and represent the network parameters for depth and depth gradients. The intrinsic prediction network is similarly modeled by feedforward processes and , where and represent the network parameters for albedo gradients and shading gradients. The gradient scale network learns the confidence of depth, albedo and shading gradients using a feedforward process for each, denoted by , , and , where , , and are their respective network parameters. The three networks in our system are jointly learned in a manner where each can leverage information from the other networks.
3.2 Joint Conditional Random Field
The networks in our model are jointly learned by minimizing the energy function of a joint CRF. The joint CRF is formulated so that each task can leverage information from the other complementary task, leading to improved prediction in comparison to separate estimation models. Our energy function is defined as unary potentials and pairwise potentials for each task:
(1) 
where , , and are weights for each pairwise potential. In the training procedure, this energy function is minimized over all the training images, i.e., by minimizing . For testing, given a query image and the learned network parameters, the final solutions of , , and are estimated by minimizing the energy function .
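One plausible rendering of this energy, using illustrative symbols of our own (D, A, S for depth, albedo, and shading, I for the input image, and lambda weights for the pairwise terms), is:

```latex
E(D, A, S; I) \;=\; E_{d}(D) \;+\; E_{as}(A, S)
  \;+\; \lambda_{d}\, E_{\nabla d}(D)
  \;+\; \lambda_{a}\, E_{\nabla a}(A)
  \;+\; \lambda_{s}\, E_{\nabla s}(S)
```

During training this is summed and minimized over all training images to learn the network parameters; at test time it is minimized over D, A, and S for the query image.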
3.2.1 Unary Potentials
The unary potentials consist of two energy functions, and . The depth unary function is formulated as
(2) 
which represents the squared differences between depths and predicted depths from the global depth network, where the local neighborhood of a pixel is defined as its receptive field through the CNNs [33]. It can be considered a Dirichlet boundary condition for the depth pairwise potentials, which will be described shortly.
The unary function for intrinsic images minimizes the reconstruction error of the color image from the albedo and shading:
(3) 
where the per-pixel weight denotes the luminance of the color image. It has been noted that luminance weighting balances out the influence of the unary potential across the image [1, 28], and that treating the image formation equation as a soft constraint can bring greater stability in optimization [25], especially for dark pixels whose chromaticity can be greatly distorted by sensor noise.
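Since all images are defined in the log domain (Sec. 3.1), the multiplicative image formation model becomes additive, and the unary term can be sketched (with assumed symbols; the luminance weight follows the description above) as:

```latex
I = A + S \quad (\text{log domain}), \qquad
E_{as}(A, S) \;=\; \sum_{p} w_p \,\bigl( I_p - A_p - S_p \bigr)^2
```

Here $w_p$ is the luminance-derived weight at pixel $p$; the exact weighting scheme in the paper is not reproduced by this sketch.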
3.2.2 Pairwise Potentials
The pairwise potentials, which include , , and , represent differences between gradients and estimated gradients in the depth, albedo, and shading images. The pairwise potential for depth gradients is defined as
(4) 
where the estimated gradients are combined via the Hadamard product with a confidence factor learned in the gradient scale network, reducing the impact of erroneous gradients; the estimated depth gradients thereby provide a guidance gradient field for depth, similar to a Poisson equation [34, 35]. This gradient scale is similar to the derivative-level confidence employed in [36] for image restoration, except that our gradient scale is learned non-locally with CNNs and different types of guidance images, as later described in Sec. 3.4. The pairwise potentials for albedo gradients and shading gradients are defined in the same manner. Since the gradient scales are estimated jointly with the other tasks, these pairwise potentials are computed within an iterative solver, which will be described in Sec. 4.1.
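A sketch of how such a confidence-weighted pairwise term could be computed (the function name and elementwise weighting are illustrative, not the paper's implementation):

```python
import numpy as np

def pairwise_depth_loss(grad_d, grad_d_pred, scale):
    """Squared difference between depth gradients and confidence-weighted
    guidance gradients. `scale` is the learned per-pixel gradient scale,
    applied as a Hadamard (elementwise) product to the predicted gradients."""
    return np.sum((grad_d - scale * grad_d_pred) ** 2)
```

A scale near zero suppresses an unreliable predicted gradient, while a scale near one trusts it fully.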
3.3 Joint Depth and Intrinsic Prediction Network
Our joint depth and intrinsic prediction network utilizes the aforementioned energy function to predict depth, albedo, shading, and their gradients from a single image. The joint network consists of a depth prediction network and an intrinsic prediction network. In contrast to previous methods for single-image depth prediction [6, 37, 11], our system jointly estimates the gradient fields, which are used to reduce ambiguity in the solution and obtain more edge-preserving results. To allow the different estimation tasks to leverage information from one another, we design the depth and intrinsic networks to share concatenated convolutional activations, and share convolutional layers between the albedo and shading networks, as illustrated in Fig. 2.
3.3.1 Depth Prediction Network
The depth prediction network consists of a global depth network and a depth gradient network. For the global depth network, we learn its parameters for predicting an overall depth map from the entire image structure. Similar to [6, 37, 11], it provides coarse, spatially varying depth that may be lacking in fine detail. This coarse depth will later be refined using the output of the depth gradient network.
The global depth network consists of five convolutional layers, three pooling layers, six nonlinear activation layers, and two fully-connected (FC) layers. For the first five layers, the pretrained parameters from the AlexNet architecture [38] are employed and then fine-tuned on the dataset. Rectified linear units (ReLUs) are used for the nonlinear layers, and the pooling layers employ max pooling. The first FC layer encodes the network responses into fixed-dimensional features, and the second FC layer infers a coarse global depth map at a reduced scale of the original depth map.

The depth gradient network predicts fine-detail depth gradients for each pixel. Its parameters are learned using an end-to-end patch-level scheme inspired by [39, 35], where the network input is an image patch and the output is a depth gradient patch. For inference of depth gradients at the pixel level, the depth gradient network consists of five convolutional layers followed by ReLUs, without strided convolutions or pooling layers. The first convolutional layer is identical to the first convolutional layer in the AlexNet architecture [38]. Four additional convolutional layers are also used, as shown in Fig. 2. The depth gradient patches output by this network will be used for depth reconstruction in Sec. 4.2. Note that in the testing procedure, the depth gradient network is applied to overlapping patches over the entire image, which are aggregated in the last convolutional layer to yield the full gradient field.

Table 1. Network configurations (kernel sizes and numbers of channels) of the global depth, gradient scale, depth gradient, and intrinsic gradient networks.
3.3.2 Intrinsic Prediction Network
The intrinsic prediction network has a structure similar to the depth gradient prediction network, with its parameters learned for predicting the albedo and shading gradients at each pixel. To jointly infer the depth and intrinsic image gradients, the second convolutional activations for each task are concatenated and passed to their third convolutional layers, as shown in Fig. 2. In the training procedure, the depth and intrinsic networks are iteratively learned, which enables each task to benefit from the other's activations and provide more reliable estimates. Furthermore, similar to [11], the albedo and shading gradient networks share their first three convolutional layers, while the last two are separate. Since the albedo and shading images have related properties, these shared convolutional layers benefit their estimation. Details on kernel sizes and the number of channels for each layer are provided in Table 1 for all the networks.
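The sharing scheme can be sketched structurally as follows (a toy illustration with identity stand-in layers; class and layer names are our own, not the paper's implementation):

```python
class ConvLayer:
    """Identity stand-in for a learned convolutional layer.
    A structural sketch only; a real layer would convolve its input."""
    def __init__(self, name):
        self.name = name
    def __call__(self, x):
        return x

def build_intrinsic_net():
    # Albedo and shading share their first three convolutional layers
    # (their second-layer activations would also be concatenated with the
    # depth network's, per the text), then split into separate heads.
    shared = [ConvLayer(f"conv{i}") for i in (1, 2, 3)]
    albedo_head = [ConvLayer("albedo_conv4"), ConvLayer("albedo_conv5")]
    shading_head = [ConvLayer("shading_conv4"), ConvLayer("shading_conv5")]
    return shared, albedo_head, shading_head
```

The shared trunk means gradient updates from either the albedo or shading loss adjust the same early-layer parameters, which is how the related properties of the two decompositions are exploited.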
3.4 Gradient Scale Network
The estimated gradients from the depth and intrinsic prediction networks might contain errors due to the ill-posed nature of their problems. To help in identifying such errors, our system additionally learns the confidence of estimated gradients, specifically, whether a gradient exists at a particular location or not. The basic idea is to learn from the training data about the consistencies that exist among the different types of gradients within a local neighborhood. From this, we can determine the confidence of a gradient (e.g., a depth gradient) based on the other estimated gradients (e.g., the albedo, shading, and image gradients). This confidence is modeled as a gradient scale that is similar to the scale map used in [36] to model derivative-level confidence for image restoration. It can be noted that in some depth and intrinsic image decomposition methods [1, 4, 7], the solutions are filtered with fixed parameters using the color image as guidance. Our system instead learns a network for defining the parameters, using not only a color image but also depth and intrinsic images as guidance.
The gradient scale network consists of three convolutional layers and one nonlinear activation layer. For the case of depth gradients, the output of the gradient scale network is estimated as a convolution of the network parameters with the gradient magnitudes of the guidance images, followed by a nonlinear activation that bounds the output. With the learned parameters, the confidence of a depth gradient is thus estimated from the albedo, shading, and image gradients. This can alternatively be viewed as a guidance filtering weight for depth with the color, albedo, and shading images as guidance; the albedo and shading gradient scales are defined similarly.

Some properties of gradient scales are as follows. A gradient scale can be either positive or negative. A large positive value indicates high confidence in the presence of a gradient. A large negative value also indicates high confidence, but for the reversed gradient direction. In addition, when a gradient field contains extra erroneous regions, gradient scales near zero can help to disregard them.
4 Unified Depth and Intrinsic Image Prediction
4.1 Training
The energy function from (1) is used to simultaneously learn the depth and intrinsic network parameters and the gradient scale network parameters. Although the overall form of the energy is non-quadratic, it has a quadratic form with respect to each of its terms. The energy function can thus be minimized by alternating among its terms.
4.1.1 Loss Functions
For the global depth unary potential of (2), the global depth network parameters can be learned by minimizing the following loss function
(5) 
We note that the intrinsic image unary term does not contain network parameters to be learned, so it is used only in the testing procedure.
The pairwise potentials each incorporate two networks, namely the gradient prediction network and gradient scale network, so they are iteratively trained. The loss function for the depth gradient pairwise potential of (4) is defined as
(6) 
The loss functions for the pairwise potentials of the albedo gradients and shading gradients are similarly defined.
These loss functions are minimized using stochastic gradient descent with standard backpropagation [40]. First, the global depth network is trained. Then the gradient networks and gradient scale networks are iteratively estimated. In each iteration, the loss functions are defined differently according to the other network outputs, and the network parameters are initialized with the values obtained from the previous iteration. In this way, the networks are trained jointly and account for the improving outputs of the other networks.
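The alternating scheme can be sketched as a simple loop (function names are illustrative placeholders, not the paper's API; each `update_*` stands for one round of SGD on the corresponding loss with the other networks' outputs held fixed):

```python
def train_jointly(update_depth, update_intrinsic, n_iters=5):
    """Alternating training sketch: each step re-optimizes one set of
    network parameters while holding the other networks' latest
    outputs fixed, so each task sees the other's improving estimates."""
    state = {"depth": None, "intrinsic": None}
    for _ in range(n_iters):
        state["depth"] = update_depth(state["intrinsic"])
        state["intrinsic"] = update_intrinsic(state["depth"])
    return state
```

In the actual model the gradient scale networks would be updated within the same loop, since their losses depend on the current gradient predictions.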



Algorithm 1: Unified Depth and Intrinsic Image Prediction  


Input: training image set , query color image  
Output: depth , albedo , shading .  
Training Procedure  
For training set , learn parameters using backwardpropagation.  
Initialize parameters of , , and to provide constant values.  
while not converged do  
For , update parameters , , with fixed , , .  
For , update parameters , , with fixed , , .  
end while  
Testing Procedure  
for do  
Estimate , using forwardpropagation , .  
Estimate , using forwardpropagation , .  
while not converged do  
Estimate , , and using forwardpropagation.  
Estimate , , by optimizing and .  
Compute , , , from , , .  
end while  
Interpolate the estimated depth, albedo, and shading to the size of the input image using bilinear interpolation.  
end for  
Estimate depth , albedo , shading as , , .  

4.2 Testing
4.2.1 Iterative Joint Prediction
In the testing procedure, the depth, albedo, and shading outputs for a given input image are predicted by minimizing the energy function from (1) with constraints from the estimates computed using the learned network parameters and forward-propagation. Similar to the training procedure, we minimize with an iterative scheme due to its non-quadratic form, where the depth and intrinsic energies are minimized in alternation.
For the depth prediction, is defined as a data term for global depth and a pairwise term for depth gradients:
(7) 
where denotes network outputs, and is the gradient scale of derived from . We note that since is computed with , , and , all of the predictions need to be iteratively estimated.
For the intrinsic prediction, is also defined as data and pairwise terms, with the image formation equation and the albedo and shading gradients:
(8) 
where and are defined similarly to . This energy function can be optimized with an existing linear solver [1]. These two energy functions and are iteratively minimized while providing information in the form of depth, albedo, and shading gradients to each other.
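The kind of linear system solved here, balancing a value data term against a gradient (Poisson-like) pairwise term, can be illustrated in one dimension (a sketch under our own formulation, not the solver of [1]):

```python
import numpy as np

def reconstruct_from_gradients(coarse, grad, lam=1.0):
    """Screened-Poisson-style reconstruction (1-D sketch): find d that
    stays close to the coarse estimate while matching a target gradient
    field, posed as linear least squares."""
    n = len(coarse)
    # Data term: d ~= coarse ; pairwise term: finite differences of d ~= grad
    G = np.zeros((n - 1, n))
    for i in range(n - 1):
        G[i, i], G[i, i + 1] = -1.0, 1.0
    A = np.vstack([np.eye(n), np.sqrt(lam) * G])
    b = np.concatenate([coarse, np.sqrt(lam) * grad])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d
```

In the full model the guidance gradients are additionally weighted by the learned gradient scales before entering the pairwise rows, which down-weights unreliable gradients in the reconstruction.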
4.2.2 Coarse-to-Fine Joint Prediction
In estimating depth and intrinsic images, enforcing a degree of global consistency can lead to performance gains [1, 5]. Although our JCNF model is solved by global energy minimization, its global consistency is limited because gradients are defined only between pixel neighbors. For greater global consistency, we apply our joint prediction model in a coarse-to-fine manner, where color images are constructed at image pyramid levels, and the depth and intrinsic images are predicted at each level. Coarser-scale results are then used as guidance for finer levels.
Specifically, we reformulate as :
(9) 
Similarly, is reformulated as :
(10) 
where the multi-scale unary functions lead to more reliable solutions and faster convergence. The high-level algorithm for the training and testing procedures is provided in Algorithm 1.
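The coarse-to-fine scheme can be sketched as follows (nearest-neighbor upsampling and power-of-two scales are simplifying assumptions; `predict` stands in for one full joint prediction at a pyramid level):

```python
import numpy as np

def coarse_to_fine(image, predict, levels=3):
    """Coarse-to-fine sketch: predict at the coarsest pyramid level,
    then upsample each result to guide the next finer level."""
    pyramid = [image[::2 ** k, ::2 ** k] for k in reversed(range(levels))]
    guide = None
    for img in pyramid:
        est = predict(img, guide)               # joint prediction at this level
        # Upsample to the next level's resolution (nearest-neighbor)
        guide = np.kron(est, np.ones((2, 2)))[: img.shape[0] * 2,
                                              : img.shape[1] * 2]
    return est
```

The coarse result enters the finer level as an extra unary constraint, which is what the reformulated energy functions above express.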
5 Experimental Results
For our experiments, we implemented the JCNF model using the VLFeat MatConvNet toolbox [40]. Our code with pretrained parameters will be made publicly available upon publication.
The inputs of the global depth network were color images and the corresponding depth images at a reduced fraction of the original scale. The inputs of each gradient network were randomly cropped patches from the training images and the corresponding gradient maps; the gradient map patches were smaller than the color image patches because boundary regions are not processed by convolution [39]. The energy function weights were set by cross-validation. The filter weights of each network layer were initialized by drawing randomly from a zero-mean Gaussian distribution, and the final layer of the gradient networks was assigned a smaller learning rate than the other layers. We additionally augmented the training data by applying random transforms to it, including scalings, in-plane rotations, translations, RGB scalings, image flips, and different gammas.
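A sketch of such augmentation for image-depth pairs (only flips and gamma are shown, and the gamma range is illustrative, not the paper's exact values):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, depth):
    """Augmentation sketch: random horizontal flip (applied consistently
    to image and depth) and a random gamma change on the image. The full
    set described above also includes scalings, rotations, translations,
    and RGB scalings."""
    if rng.random() < 0.5:                     # horizontal flip
        image, depth = image[:, ::-1], depth[:, ::-1]
    gamma = rng.uniform(0.8, 1.2)              # illustrative gamma range
    image = np.clip(image, 0.0, 1.0) ** gamma
    return image, depth
```

Geometric transforms must be applied identically to the image and its depth map, while photometric transforms (gamma, RGB scaling) apply to the image alone.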
In the following, we evaluated our system through comparisons to stateoftheart depth prediction and intrinsic image decomposition methods on the MPI SINTEL [41], NYU v2 [42], and Make3D [43] benchmarks. We additionally examined the performance contributions of the joint network learning (wo/jnl), the gradient scale network (wo/gsn), and the coarsetofine scheme (wo/ctf).



Methods  Error  Accuracy  
rel  log  rms  rms(log)  δ<1.25  δ<1.25²  δ<1.25³  
Depth Transfer [16]  0.448  0.193  9.242  3.121  0.524  0.712  0.735 
Depth Analogy [17]  0.432  0.167  8.421  2.741  0.621  0.799  0.812 
DCNFFCSP(NYU) [7]  0.424  0.164  8.112  2.421  0.652  0.782  0.824 
JCNF(NYU)  0.293  0.131  7.421  1.812  0.715  0.812  0.831 
JCNF wo/jnl  0.292  0.138  7.471  1.973  0.714  0.783  0.839 
JCNF wo/gsn  0.271  0.119  7.451  1.921  0.724  0.793  0.893 
JCNF wo/ctf  0.252  0.101  7.233  1.622  0.729  0.812  0.878 
JCNF  0.183  0.097  6.118  1.037  0.823  0.834  0.902 



Methods  MSE  LMSE  DSSIM  
albedo  shading  avg.  albedo  shading  avg.  albedo  shading  avg.  
Retinex [44]  0.053  0.049  0.051  0.033  0.028  0.031  0.214  0.206  0.210 
Li et al. [23]  0.042  0.041  0.037  0.024  0.031  0.034  0.242  0.224  0.194 
Shen et al. [30]  0.043  0.039  0.048  0.028  0.027  0.032  0.221  0.210  0.232 
Zhao et al. [22]  0.047  0.041  0.031  0.028  0.029  0.031  0.210  0.257  0.214 
IIW [24]  0.041  0.032  0.041  0.032  0.031  0.027  0.281  0.241  0.284 
SIRFS [31]  0.042  0.047  0.043  0.029  0.026  0.028  0.210  0.206  0.208 
Jeon et al. [4]  0.042  0.033  0.032  0.021  0.021  0.023  0.204  0.181  0.193 
Chen et al. [1]  0.031  0.028  0.029  0.019  0.019  0.019  0.196  0.165  0.181 
MSCR [11]  0.020  0.017  0.021  0.016  0.011  0.011  0.201  0.150  0.176 
JCNF wo/jnl  0.012  0.015  0.016  0.014  0.010  0.010  0.149  0.123  0.141 
JCNF wo/gsn  0.008  0.011  0.011  0.010  0.009  0.008  0.146  0.112  0.132 
JCNF wo/ctf  0.008  0.012  0.010  0.009  0.008  0.008  0.127  0.110  0.119 
JCNF  0.007  0.009  0.007  0.006  0.007  0.007  0.092  0.101  0.097 

5.1 MPI SINTEL Benchmark
We evaluated our JCNF model on both depth prediction and intrinsic image decomposition on the MPI SINTEL benchmark [41], which consists of image sequences from an animated film. For a fair evaluation, we followed the same experimental protocol as in [1, 11], with their two-fold cross-validation and training/testing image splits. Fig. 3 and Fig. 4 exhibit predicted depth and intrinsic images from a single image, respectively. Table 2 and Table 3 present quantitative evaluations for both tasks using a variety of metrics, including average relative difference (rel), average log error (log), root-mean-squared error (rms), its log-scale version (rms(log)), and accuracy with thresholds [7]. For quantitatively evaluating intrinsic image decomposition performance, we used mean-squared error (MSE), local mean-squared error (LMSE), and the dissimilarity version of the structural similarity index (DSSIM) [11].
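For reference, the depth error and accuracy metrics can be computed as follows (the 1.25 threshold shown is the common convention following [7]; the paper's exact thresholds are an assumption here):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard single-image depth metrics: average relative difference,
    root-mean-squared error, and threshold accuracy (fraction of pixels
    with max(pred/gt, gt/pred) below a threshold, 1.25 assumed)."""
    rel = np.mean(np.abs(pred - gt) / gt)
    rms = np.sqrt(np.mean((pred - gt) ** 2))
    delta = np.maximum(pred / gt, gt / pred)
    acc = np.mean(delta < 1.25)
    return rel, rms, acc
```

Lower is better for the error metrics, higher is better for the accuracy metric, matching the table layouts below.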


Methods  Error  Accuracy  
rel  log  rms  rms(log)  δ<1.25  δ<1.25²  δ<1.25³  
Make3D [12]  0.349    1.214  0.409  0.447  0.745  0.897 
Depth Transfer [16]  0.350  0.134  1.1  0.378  0.460  0.742  0.893 
Depth Analogy [17]  0.328  0.132  1.31  0.392  0.471  0.799  0.891 
MSCNNs [6]  0.228    0.901  0.293  0.611  0.873  0.961 
DCNFFCSP [7]  0.221  0.095  0.760  0.281  0.604  0.885  0.974 
JCNF(MPI)  0.214  0.093  0.716  0.241  0.677  0.879  0.927 
JCNF wo/jnl  0.216  0.101  0.753  0.241  0.625  0.896  0.925 
JCNF wo/gsn  0.210  0.091  0.728  0.254  0.621  0.890  0.975 
JCNF wo/ctf  0.208  0.106  0.708  0.237  0.681  0.901  0.972 
JCNF  0.201  0.077  0.711  0.212  0.690  0.910  0.979 

For the depth prediction task, data-driven approaches (DT [16] and DA [17]) provided limited performance due to their low learning capacity. CNN-based depth prediction (DCNF-FCSP [7]) using a pretrained model from NYU v2 [42] showed better performance, but is restricted by depth ambiguity problems. Our JCNF model achieved the best results both quantitatively and qualitatively, whether pretrained on the MPI SINTEL or NYU v2 datasets. Furthermore, omitting the gradient scale network, coarse-to-fine processing, or joint learning significantly reduced depth prediction performance.
In intrinsic image decomposition, existing single-image based methods [44, 23, 30, 22, 24] produced the lowest quality results as they do not benefit from any additional information. RGB-D based methods [1, 5, 4] performed better with measured depth as input. CNN-based intrinsic decomposition [11] surpassed RGB-D based techniques even without having depth as an input, but its results exhibit some blur, likely due to ambiguity from limited training datasets. Thanks to its gradient-domain learning and leverage of estimated depth information, our JCNF model provides more accurate and edge-preserving results, with the best qualitative and quantitative performance.



Methods  Error (thresholded depth)  Error (entire image)  
rel  log  rms  rms(log)  rel  log  rms  rms(log)  
Make3D [12]  0.412  0.165  11.1  0.451  0.407  0.155  16.1  0.486 
Depth Transfer [16]  0.355  0.127  9.20  0.421  0.438  0.161  14.81  0.461 
Depth Analogy [17]  0.371  0.121  8.11  0.381  0.410  0.144  14.52  0.479 
DCNFFCSP [7]  0.331  0.119  8.60  0.392  0.307  0.125  12.89  0.412 
JCNF(MPI)  0.273  0.110  7.70  0.351  0.263  0.117  8.62  0.347 
JCNF(NYU)  0.274  0.097  7.22  0.352  0.287  0.127  8.22  0.341 
JCNF  0.262  0.092  6.61  0.321  0.243  0.091  6.34  0.302 

5.2 NYU v2 RGB-D Benchmark
For further evaluation, we obtained a set of RGB, depth, and intrinsic images by applying RGB-D based intrinsic image decomposition methods [1, 4] on the NYU v2 RGB-D database [42]. Of its RGB-D images of indoor scenes, we used the standard training/testing split for the dataset.
For depth prediction, comparisons are made to the ground truth depth in Fig. 5 and Table 3 using the same experimental settings as in [7]. The state-of-the-art CNN-based methods [6, 7] clearly outperformed other previous methods. The performance of our JCNF model was even higher, with pretraining on either MPI SINTEL or NYU v2. Our depth prediction network is similar to [6], but it additionally predicts depth gradients and leverages intrinsic image estimates to elevate performance.
5.3 Make3D RGB-D Benchmark
We also evaluated our JCNF model on the Make3D dataset [43], which contains images depicting outdoor scenes. To account for a limitation of this dataset [12, 45, 7], we calculated depth errors in two ways [45, 7]: on only regions with ground truth depth below a threshold, and over the entire image. From the depth prediction results in Fig. 7 and Table 5, our JCNF model is found to yield the highest accuracy, even when pretrained on MPI SINTEL [41] or NYU v2 [42] (i.e., JCNF(MPI) and JCNF(NYU)). For the intrinsic image decomposition results given in Fig. 8, JCNF also outperforms the comparison techniques.
6 Conclusion
We presented Joint Convolutional Neural Fields (JCNF) for jointly predicting depth, albedo, and shading maps from a single input image. Its high performance can be attributed to its shared network architecture, its gradient-domain inference, and the incorporation of a gradient scale network. It is shown through extensive experimentation that synergistically solving for these physical scene properties through the JCNF leads to state-of-the-art results in both single-image depth prediction and intrinsic image decomposition. In future work, JCNF can potentially benefit shape refinement and image relighting from a single image.
References
 [1] Chen, Q., Koltun, V.: A simple model for intrinsic image decomposition with depth cues. ICCV (2013)
 [2] Laffont, P.Y., Bousseau, A., Paris, S., Durand, F., Drettakis, G.: Coherent intrinsic images from photo collections. ACM TOG 31(6) (2012) 1–11
 [3] Lee, K.J., Zhao, Q., Tong, X., Gong, M., Izadi, S., Lee, S.U., Tan, P., Lin, S.: Estimation of intrinsic image sequences from image + depth video. ECCV (2012)
 [4] Jeon, J., Cho, S., Tong, X., Lee, S.: Intrinsic image decomposition using structuretexture separation and surface normals. ECCV (2014)
 [5] Barron, J.T., Malik, J.: intrinsic scene properties from a single rgbd image. CVPR (2013)
 [6] Eigen, D., Puhrsch, C., Ferus, R.: Depth map prediction from a single image using a multiscale deep network. NIPS (2014)
 [7] Fayao, L., Chunhua, S., Guosheng, L.: Deep convolutional neural fields for depth estimation from a single images. CVPR (2015)
 [8] Kong, N., Black, M.J.: Intrinsic depth: Improving depth transfer with intrinsic images. ICCV (2015)
 [9] Shelhamer, E., Barron, J., Darrell, T.: Scene intrinsics and depth from a single image. ICCV workshop (2015)
 [10] Zhou, T., Krahenbuhl, P., Efros, A.A.: Learning data-driven reflectance priors for intrinsic image decomposition. ICCV (2015)
 [11] Narihira, T., Maire, M., Yu, S.X.: Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. ICCV (2015)
 [12] Saxena, A., Sun, M., Ng, A.Y.: Make3D: Learning 3D scene structure from a single still image. IEEE Trans. PAMI 31(5) (2009) 824–840
 [13] Wang, Y., Wang, R., Dai, Q.: A parametric model for describing the correlation between single color images and depth maps. IEEE SPL 21(7) (2014) 800–803
 [14] Li, X., Qin, H., Wang, Y., Zhang, Y., Dai, Q.: DEPT: Depth estimation by parameter transfer for single still images. CVPR (2014)
 [15] Konrad, J., Wang, M., Ishwar, P., Wu, C., Mukherjee, D.: Learning-based, automatic 2D-to-3D image and video conversion. IEEE Trans. IP 22(9) (2013) 3485–3496
 [16] Karsch, K., Liu, C., Kang, S.B.: Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Trans. PAMI 36(11) (2014) 2144–2158
 [17] Choi, S., Min, D., Ham, B., Kim, Y., Oh, C., Sohn, K.: Depth analogy: Data-driven approach for single image depth estimation using gradient samples. IEEE Trans. IP 24(12) (2015) 5953–5966
 [18] Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A.: Towards unified depth and semantic prediction from a single image. CVPR (2015)
 [19] Barrow, H.G., Tenenbaum, J.M.: Recovering intrinsic scene characteristics from images. CVS (1978)
 [20] Land, E.H., Mccann, J.J.: Lightness and retinex theory. JOSA 61(1) (1971) 1–11
 [21] Shen, J., Tan, P., Lin, S.: Intrinsic image decomposition with non-local texture cues. CVPR (2008)
 [22] Zhao, Q., Tan, P., Dai, Q., Shen, L., Wu, E., Lin, S.: A closed-form solution to Retinex with non-local texture constraints. IEEE Trans. PAMI 34(7) (2012) 1437–1444
 [23] Li, Y., Brown, M.S.: Single image layer separation using relative smoothness. CVPR (2014)
 [24] Bell, S., Bala, K., Snavely, N.: Intrinsic images in the wild. ACM TOG 33(4) (2014)
 [25] Bonneel, N., Sunkavalli, K., Tompkin, J., Sun, D., Paris, S., Pfister, H.: Interactive intrinsic video editing. ACM Trans. Graphics (SIGGRAPH ASIA) (2014)
 [26] Weiss, Y.: Deriving intrinsic images from image sequences. ICCV (2001)
 [27] Laffont, P.Y., Bousseau, A., Drettakis, G.: Rich intrinsic image decomposition of outdoor scenes from multiple views. IEEE TVCG 19(2) (2013) 1–11
 [28] Kong, N., Gehler, P.V., Black, M.J.: Intrinsic video. ECCV (2014)
 [29] Bousseau, A., Paris, S., Durand, F.: User-assisted intrinsic images. ACM TOG 28(5) (2009) 1–11
 [30] Shen, J., Yang, X., Jia, Y.: Intrinsic images using optimization. CVPR (2011)
 [31] Barron, J., Malik, J.: Shape, albedo, and illumination from a single image of an unknown object. CVPR (2012)
 [32] Butler, D., Wulff, J., Stanley, G., Black, M.: A naturalistic open source movie for optical flow evaluation. ECCV (2012)
 [33] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. PAMI 37(9) (2015) 1904–1916
 [34] Perez, P., Gangnet, M., Blake, A.: Poisson image editing. ACM TOG 22(3) (2003)
 [35] Xu, L., Ren, J., Yan, Q., Liao, R., Jia, J.: Deep edge-aware filters. ICML (2015)
 [36] Shen, X., Yan, Q., Xu, L., Ma, L., Jia, J.: Multi-spectral joint image restoration via optimizing a scale map. IEEE Trans. PAMI 31(9) (2015) 1582–1599
 [37] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. ICCV (2015)
 [38] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. NIPS (2012)

 [39] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Trans. PAMI 37(3) (2015) 597–610
 [40] Online.: http://www.vlfeat.org/matconvnet/.
 [41] Online.: http://sintel.is.tue.mpg.de/.
 [42] Online.: http://cs.nyu.edu/~silberman/datasets/.
 [43] Online.: http://make3d.cs.cornell.edu/.
 [44] Grosse, R., Johnson, M.K., Adelson, E.H., Freeman, W.T.: Ground truth and baseline evaluations for intrinsic image algorithms. ICCV (2009)
 [45] Liu, M., Salzmann, M., He, X.: Discrete-continuous depth estimation from a single image. CVPR (2014)