
Unified Depth Prediction and Intrinsic Image Decomposition from a Single Image via Joint Convolutional Neural Fields

by   Seungryong Kim, et al.

We present a method for jointly predicting a depth map and intrinsic images from single-image input. The two tasks are formulated in a synergistic manner through a joint conditional random field (CRF) that is solved using a novel convolutional neural network (CNN) architecture, called the joint convolutional neural field (JCNF) model. Tailored to our joint estimation problem, JCNF differs from previous CNNs in its sharing of convolutional activations and layers between networks for each task, its inference in the gradient domain where there exists greater correlation between depth and intrinsic images, and the incorporation of a gradient scale network that learns the confidence of estimated gradients in order to effectively balance them in the solution. This approach is shown to surpass state-of-the-art methods both on single-image depth estimation and on intrinsic image decomposition.



1 Introduction

Perceiving the physical properties of a scene undoubtedly plays a fundamental role in understanding real-world imagery. Such inherent properties include the 3-D geometric configuration, the illumination or shading, and the reflectance or albedo of each scene surface. Depth prediction and intrinsic image decomposition, which aims to recover shading and albedo, are thus two fundamental yet challenging tasks in computer vision. While they address different aspects of scene understanding, there exist strong consistencies among depth and intrinsic images, such that information about one provides valuable prior knowledge for recovering the other.

In the intrinsic image decomposition literature, several works have exploited measured depth information to make the decomposition problem more tractable [1, 2, 3, 4, 5]. These techniques have all demonstrated better performance than using RGB images alone. On the other hand, in the literature for single-image depth prediction, illumination-invariant features have been utilized for greater robustness in depth inference [6, 7], and shading discontinuities have been used to detect surface boundaries [8], suggesting that intrinsic images can be employed to enhance depth prediction performance. Although the two tasks are mutually beneficial, previous research has solved for them only in sequence, by using estimated intrinsic images to constrain depth prediction [8], or vice versa [9]. We propose in this paper to instead jointly predict depth and intrinsic images in a manner where the two complementary tasks can assist each other.

We address this joint prediction problem using convolutional neural networks (CNNs), which have yielded state-of-the-art performance for the individual problems of single-image depth prediction [6, 7] and intrinsic image decomposition [9, 10, 11], but are hampered by ambiguity issues that arise from limited training sets. In our work, the two tasks are formulated synergistically in a joint conditional random field (CRF) that is solved using a novel CNN architecture, called the joint convolutional neural field (JCNF) model. This architecture differs from previous CNNs in several ways tailored to our particular problem. One is the sharing of convolutional activations and layers between networks for each task, which allows each network to account for inferences made in other networks. Another is to perform learning in the gradient domain, where there exist stronger correlations between depth and intrinsic images than in the image value domain, which helps to deal with the ambiguity problem from limited training sets. A third is the incorporation of a gradient scale network which jointly learns the confidence of the estimated gradients, to more robustly balance them in the solution. These networks of the JCNF model are jointly learned using a unified energy function in a joint CRF.

Within this system, depth, shading and albedo are predicted in a coarse-to-fine manner that yields more globally consistent results. Our experiments show that this joint prediction outperforms existing depth prediction methods and intrinsic image decomposition techniques on various benchmarks.

2 Related Work

2.0.1 Depth Prediction from a Single Image

Traditional methods for this task have formulated the depth prediction as a Markov random field (MRF) learning problem [12, 13, 14]. As exact MRF learning and inference are intractable in general, most of these approaches employ approximation methods, such as through linear regression of depth with image features [12], learning image-depth correlation with a non-linear kernel function [13], and training category-adaptive model parameters [14]. Although these parametric models infer plausible depth maps to some extent, they cannot estimate the depth of natural scenes reliably due to their limited learning capability.

By leveraging the availability of large RGB-D databases, data-driven approaches have been actively researched [15, 16]. Konrad et al. [15] proposed a depth fusion scheme to infer the depth map by retrieving the nearest images in the dataset, followed by an aggregation via weighted median filtering. Karsch et al. [16] presented the depth transfer (DT) approach which retrieves the nearest similar images and warps their depth maps using dense SIFT flow. Inspired by this method, Choi et al. [17] proposed the depth analogy (DA) approach that transfers depth gradients from the nearest images, demonstrating the effectiveness of gradient domain learning. Although these methods can extract reliable depth for certain scenes, there exist many others for which the nearest images are dissimilar and unsuitable. Recently, Kong et al. [8] extended the DT approach [16] by using albedo and shading for image matching as well as for detecting contours at surface boundaries. In contrast to our approach, the intrinsic images are estimated independently from the depth prediction.

More recently, methods have been proposed based on CNNs. Eigen et al. [6] proposed multi-scale CNNs (MS-CNNs) for predicting depth maps directly from a single image. Other CNN models were later proposed for depth estimation [18], including a deep convolutional neural field (DCNF) by Liu et al. [7] that estimates depth on each superpixel while enforcing smoothness within a CRF. CNN-based methods clearly outperform conventional techniques, and we aim to elevate the performance further by accounting for intrinsic image information.

2.0.2 Intrinsic Image Decomposition

The notion of intrinsic images was first introduced in [19]. Conventional methods are largely based on Retinex theory [20, 21, 22], which attributes large image gradients to albedo changes, and smaller gradients to shading. More recent approaches have employed a variety of techniques, based on gradient distribution priors [23], dense CRFs [24], and hybrid - optimization to separate albedo and shading gradients [25]. These single-image based methods, however, are inherently limited by the fundamental ill-posedness of the problem. To partially alleviate this limitation, several approaches have utilized additional input, such as multiple images [26, 27, 28], user interaction [29, 30], and measured depth maps [1, 2, 3, 4, 5]. The use of additional data such as measured depth clearly increases performance but reduces the applicability of these methods.

Related to our work is the method of Barron and Malik [31], which estimates object shape in addition to intrinsic images. To regularize the estimation, the method utilizes statistical priors on object shape and albedo which are not generally applicable to images of full scenes.

More recently, intrinsic image decomposition has been addressed using CNNs [9, 10, 11]. Zhou et al. [10] proposed a multi-stream CNN to predict the relative reflectance ordering between image patches from large-scale human annotations. Narihira et al. [11] learned a CNN that directly predicts albedo and shading from an RGB image patch. Shelhamer et al. [9] estimated depth through a fully convolutional network and used it to constrain the intrinsic image decomposition. Unlike our approach, the depth and intrinsic images are estimated sequentially.

3 Formulation

3.1 Problem Statement and Model Architecture

Consider a color image defined over a discrete image domain, together with depth, albedo, and shading images defined over the same domain. All images are defined in the log domain. Given a training set of color, depth, albedo, and shading images, we first aim to learn a prediction model that approximates the depth, albedo, and shading from each color image in the set. This prediction model will then be used to infer reliable depth, albedo, and shading simultaneously from a single query image.

We specifically learn the joint prediction model in the gradient domain, where depth and intrinsic images generally exhibit stronger correlation than in the value domain, as exemplified in Fig. 1. This greater correlation and reduced discrepancy among the depth, albedo, and shading gradients facilitate joint learning of the two tasks by allowing them to better leverage information from each other. Here, the gradient is taken with a differential operator defined in the x- and y-directions. We therefore formulate our model to predict the depth, albedo, and shading gradient fields from the color image. Our method additionally learns the confidence of predicted gradients based on their consistency among one another in the training set.
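A minimal sketch of why the gradient domain is convenient, assuming the standard image formation model in which the image is the per-pixel product of albedo and shading (so that the log image is their sum) and a simple forward-difference gradient. The function name `grad` and the random test data are illustrative only and are not taken from the paper.

```python
import numpy as np

def grad(img):
    """Forward-difference gradient; returns (gx, gy) with the same shape as img."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, :-1] = img[:, 1:] - img[:, :-1]   # x-direction (along columns)
    gy[:-1, :] = img[1:, :] - img[:-1, :]   # y-direction (along rows)
    return gx, gy

rng = np.random.default_rng(0)
albedo = rng.uniform(0.2, 1.0, size=(4, 5))
shading = rng.uniform(0.2, 1.0, size=(4, 5))
image = albedo * shading                    # assumed linear-domain image formation

# In the log domain the product becomes a sum, and so do the gradients.
gx_I, gy_I = grad(np.log(image))
gx_A, gy_A = grad(np.log(albedo))
gx_S, gy_S = grad(np.log(shading))
```

Because the gradient is linear, the log-image gradient here equals the sum of the log-albedo and log-shading gradients, which is the kind of cross-quantity consistency the gradient-domain formulation relies on.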

We formulate this joint prediction using convolutional neural networks (CNNs) in a joint conditional random field (CRF). Our system architecture is structured as three cooperating networks, namely a depth prediction network, an intrinsic prediction network, and a gradient scale network. The depth prediction network is modeled by two feed-forward processes, one for depth and one for depth gradients, each with its own network parameters. The intrinsic prediction network is similarly modeled by two feed-forward processes, whose network parameters correspond to albedo gradients and shading gradients. The gradient scale network learns the confidence of depth, albedo and shading gradients using a feed-forward process for each, with separate network parameters for each. The three networks in our system are jointly learned in a manner where each can leverage information from the other networks.

Figure 1: For an example from the MPI-SINTEL dataset [32], its (a) color image, (b) depth, (c) albedo, (d) shading, and their corresponding gradient fields shown below. Compared to quantities in the value domain, correlations are stronger among gradient fields, such that estimates of one may help in learning others. Furthermore, the consistency among the color, depth, albedo, and shading gradients can be used to estimate the confidence of each gradient.

3.2 Joint Conditional Random Field

The networks in our model are jointly learned by minimizing the energy function of a joint CRF. The joint CRF is formulated so that each task can leverage information from the other complementary task, leading to improved prediction in comparison to separate estimation models. Our energy function is defined as the sum of unary potentials and weighted pairwise potentials for each task, with a separate weight for each of the three pairwise potentials (depth, albedo, and shading). In the training procedure, this energy function is minimized over all the training images, i.e., by minimizing the total energy summed over the training set. For testing, given a query image and the learned network parameters, the final solutions for depth, albedo, and shading are estimated by minimizing the energy function for that image.
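As a sketch of the structure just described, and not necessarily the authors' exact notation, the energy can be written with assumed symbols $I$, $D$, $A$, $S$ for the color, depth, albedo, and shading images, $E_u$ and $E_p$ for unary and pairwise potentials, and $\lambda_D$, $\lambda_A$, $\lambda_S$ for the pairwise weights:

```latex
E(D, A, S \mid I) =
    E_u^{D}(D \mid I) + E_u^{AS}(A, S \mid I)
  + \lambda_D \, E_p^{D}(\nabla D \mid I)
  + \lambda_A \, E_p^{A}(\nabla A \mid I)
  + \lambda_S \, E_p^{S}(\nabla S \mid I)
```

Under this reading, training minimizes the sum of this energy over all training images, and testing minimizes it for the single query image with the network parameters held fixed.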

3.2.1 Unary Potentials

The unary potentials consist of two energy functions, one for depth and one for intrinsic images. The depth unary function is formulated as the sum over pixels of the squared differences between the depth at each pixel and the depth predicted by the global depth network from the local neighborhood of that pixel, where the neighborhood is defined as the receptive field through the CNNs for the pixel [33]. It can be considered as a Dirichlet boundary condition for the depth pairwise potentials, which will be described shortly.

The unary function for intrinsic images is used in minimizing the reconstruction errors of the color image from albedo and shading, with a per-pixel term that depends on the luminance of the color image. It has been noted that processing of luminance balances out the influence of the unary potential across the image [1, 28], and that treating the image formation equation (i.e., the color image as the sum of albedo and shading in the log domain) as a soft constraint can bring greater stability in optimization [25], especially for dark pixels whose chromaticity can be greatly distorted by sensor noise.
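The reconstruction term can be illustrated with a short sketch. It assumes, hypothetically, that the per-pixel weight is the luminance itself, that shading is a single channel shared across RGB, and that luminance uses the Rec. 601 luma weights; none of these choices is confirmed by the text, and the names `luminance` and `intrinsic_unary` are illustrative.

```python
import numpy as np

def luminance(rgb):
    """Weighted sum of the RGB channels (Rec. 601 luma weights, an assumed choice)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def intrinsic_unary(log_I, log_A, log_S, weight):
    """Weighted squared error of reconstructing log_I as log_A + log_S."""
    residual = log_I - (log_A + log_S[..., None])   # shading broadcast over RGB
    return float(np.sum(weight[..., None] * residual ** 2))

rng = np.random.default_rng(0)
A = rng.uniform(0.2, 1.0, (4, 4, 3))     # albedo, three channels
S = rng.uniform(0.2, 1.0, (4, 4))        # shading, one channel (assumption)
I = A * S[..., None]                     # assumed image formation model
w = luminance(I)

exact = intrinsic_unary(np.log(I), np.log(A), np.log(S), w)
off = intrinsic_unary(np.log(I), np.log(A) + 0.1, np.log(S), w)
```

A consistent decomposition gives an energy of essentially zero, while a biased albedo is penalized more strongly at bright pixels than at dark ones, which is the balancing effect attributed to the luminance term.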

3.2.2 Pairwise Potentials

The pairwise potentials, one each for depth, albedo, and shading, represent differences between gradients and estimated gradients in the depth, albedo, and shading images. The pairwise potential for depth gradients is defined as the sum over pixels of the squared difference between the depth gradient and the estimated depth gradient, where the estimate is multiplied element-wise (Hadamard product) by a gradient scale. The depth gradients estimated by the depth gradient network provide a guidance gradient field for depth, similar to a Poisson equation [34, 35]. They are weighted by a confidence factor learned in the gradient scale network to reduce the impact of erroneous gradients. This gradient scale is similar to the derivative-level confidence employed in [36] for image restoration, except that our gradient scale is learned non-locally with CNNs and different types of guidance images, as later described in Sec. 3.4. The pairwise potentials for albedo gradients and shading gradients are defined in the same manner. Since the gradient scales are jointly estimated with each other task, these pairwise potentials are computed within an iterative solver, which will be described in Sec. 4.1.
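A minimal sketch of such a pairwise term, assuming forward-difference gradients and per-direction gradient scales supplied as plain arrays (in the paper they come from a learned network). The names `grad` and `pairwise_potential` and the toy data are illustrative.

```python
import numpy as np

def grad(img):
    """Forward-difference gradient; returns (gx, gy) with the same shape as img."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, :-1] = img[:, 1:] - img[:, :-1]
    gy[:-1, :] = img[1:, :] - img[:-1, :]
    return gx, gy

def pairwise_potential(depth, guide_gx, guide_gy, scale_x, scale_y):
    """Sum of squared differences between depth gradients and scaled guidance gradients."""
    gx, gy = grad(depth)
    rx = gx - scale_x * guide_gx      # '*' is the element-wise (Hadamard) product
    ry = gy - scale_y * guide_gy
    return float(np.sum(rx ** 2 + ry ** 2))

depth = np.tile(np.arange(5.0), (4, 1))          # a simple depth ramp
true_gx, true_gy = grad(depth)
guide_gx = true_gx.copy()
guide_gx[2, 2] = 5.0                             # one erroneous guidance gradient

ones = np.ones_like(depth)
e_plain = pairwise_potential(depth, guide_gx, true_gy, ones, ones)

scale_x = ones.copy()
scale_x[2, 2] = 0.2                              # a scale that corrects the bad entry
e_scaled = pairwise_potential(depth, guide_gx, true_gy, scale_x, ones)
```

With unit scales the single bad guidance value dominates the energy, while a suitable scale at that pixel removes its influence, which is the role the text assigns to the learned confidence.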

Figure 2: Network architecture of the JCNF model. It consists of a depth prediction network, an intrinsic prediction network, and a gradient scale network. These networks are learned by minimizing a joint CRF loss function.

3.3 Joint Depth and Intrinsic Prediction Network

Our joint depth and intrinsic prediction network utilizes the aforementioned energy function to predict depth, depth gradients, albedo gradients, and shading gradients from a single image. The joint network consists of a depth prediction network for depth and depth gradients, and an intrinsic prediction network for albedo gradients and shading gradients. In contrast to previous methods for single-image depth prediction [6, 37, 11], our system jointly estimates the depth, albedo, and shading gradient fields, which are used to reduce ambiguity in the solution and obtain more edge-preserved results. To allow the different estimation tasks to leverage information from one another, we design the depth and intrinsic networks to share concatenated convolutional activations, and share convolutional layers between albedo and shading networks, as illustrated in Fig. 2.

3.3.1 Depth Prediction Network

The depth prediction network consists of a global depth network and a depth gradient network. For the global depth network, we learn its parameters for predicting an overall depth map from the entire image structure. Similar to [6, 37, 11], it provides coarse, spatially-varying depth that may be lacking in fine detail. This coarse depth will later be refined using the output of the depth gradient network.

The global depth network consists of five convolutional layers, three pooling layers, six non-linear activation layers, and two fully-connected (FC) layers. For the first five layers, the pre-trained parameters from the AlexNet architecture [38] are employed and fine-tuned on the dataset. Rectified linear units (ReLUs) are used for the non-linear layers, and the pooling layers employ max pooling. The first FC layer encodes the network responses into fixed-dimensional features, and the second FC layer infers a coarse global depth map at a reduced scale of the original depth map.

The depth gradient network predicts fine-detail depth gradients for each pixel. Its parameters are learned using an end-to-end patch-level scheme inspired by [39, 35], where the network input is an image patch and the output is a depth gradient patch. For inference of depth gradients at the pixel level, the depth gradient network consists of five convolutional layers followed by ReLUs, without stride convolutions or pooling layers. The first convolutional layer is identical to the first convolutional layer in the AlexNet architecture [38]. Four additional convolutional layers are also used as shown in Fig. 2. The depth gradient patches that are output by this network will be used for depth reconstruction in Sec. 4.2. Note that in the testing procedure, the depth gradient network is applied to overlapping patches over the entire image, which are aggregated in the last convolutional layer to yield the full gradient field.

Table 1: Network architecture of the JCNF model. The global depth network comprises conv1 to conv5 followed by FC1 and FC2; the gradient scale network comprises conv1 to conv3; the depth gradient network and the intrinsic gradient network each comprise conv1 to conv5.

3.3.2 Intrinsic Prediction Network

The intrinsic prediction network has a structure similar to the depth gradient prediction network. Its network parameters are learned for predicting the albedo and shading gradients at each pixel. To jointly infer the depth and intrinsic image gradients, the second convolutional activations for each task are concatenated and passed to their third convolutional layers as shown in Fig. 2. In the training procedure, the depth and intrinsic networks are iteratively learned, which enables each task to benefit from the other's activations to provide more reliable estimates. Furthermore, similar to [11], the albedo and shading gradient networks share their first three convolutional layers, while the last two are separate. Since the albedo and shading images have related properties, these shared convolutional layers benefit their estimation. Details on kernel sizes and the number of channels for each layer are provided in Table 1 for all the networks.

3.4 Gradient Scale Network

The estimated gradients from the depth and intrinsic prediction networks might contain errors due to the ill-posed nature of their problems. To help in identifying such errors, our system additionally learns the confidence of estimated gradients, specifically, whether a gradient exists at a particular location or not. The basic idea is to learn from the training data about the consistencies that exist among the different types of gradients within a local neighborhood. From this, we can determine the confidence of a gradient (e.g., a depth gradient), based on the other estimated gradients (e.g., the albedo, shading, and image gradients). This confidence is modeled as a gradient scale that is similar to the scale map used in [36] to model derivative-level confidence for image restoration. It can be noted that in some depth and intrinsic image decomposition methods [1, 4, 7], the solutions are filtered with fixed parameters using the color image as guidance. Our system instead learns a network for defining the parameters, using not only a color image but also depth and intrinsic images as guidance.

The gradient scale network consists of three convolutional layers and one non-linear activation layer. For the case of depth gradients, the output of the gradient scale network is estimated as the convolution between the network parameters and the magnitudes of the guidance gradients, followed by a non-linear activation with a bounded output range. Thus, in the gradient scale network, the network parameters are convolved with the gradient magnitudes. With the learned parameters, the confidence of the depth gradients is estimated from the image, albedo, and shading gradients. This can alternatively be viewed as a guidance filtering weight for the depth gradients, with the image, albedo, and shading gradients as guidance. The gradient scales for albedo and shading are similarly defined.

Some properties of gradient scales are as follows. A gradient scale can be either positive or negative. A large positive value indicates high confidence in the presence of a gradient. A large negative value also indicates high confidence, but for the reversed gradient direction. In addition, when a gradient field contains extra erroneous regions, gradient scales with a value of zero can help to disregard them.
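The idea of deriving a signed, bounded confidence map from guidance gradient magnitudes can be sketched as follows. This is a one-layer stand-in with hand-set kernels and a tanh activation, both of which are assumptions made for illustration; the paper's network has three learned convolutional layers, and the names `conv2d_same` and `gradient_scale` are hypothetical.

```python
import numpy as np

def conv2d_same(x, k):
    """'Same'-size 2-D cross-correlation of a single-channel map with an odd-sized kernel."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros(x.shape, dtype=float)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += k[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def gradient_scale(mag_maps, kernels, bias=0.0):
    """Sum the per-guidance convolutions, then squash to (-1, 1) with tanh (assumed)."""
    response = sum(conv2d_same(m, k) for m, k in zip(mag_maps, kernels))
    return np.tanh(response + bias)

# Three guidance magnitude maps (e.g. image, albedo, shading) that agree on one edge.
edge = np.zeros((5, 6))
edge[:, 3] = 3.0
kernels = [np.full((3, 3), 1.0 / 9.0)] * 3       # hand-set averaging kernels
scale = gradient_scale([edge, edge, edge], kernels)
```

Where all guidance maps show an edge, the resulting scale is close to one; far from any edge it is zero, so gradients predicted there would be disregarded. Negative kernel weights or biases would produce negative scales.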

4 Unified Depth and Intrinsic Image Prediction

4.1 Training

The energy function of the joint CRF (Sec. 3.2) is used to simultaneously learn the depth and intrinsic network parameters and the gradient scale network parameters. Although the overall form of the energy is non-quadratic, it has a quadratic form with respect to each of its terms. The energy function can thus be minimized by alternating among its terms.
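Alternating minimization of an energy that is quadratic in each variable but not jointly quadratic can be illustrated on a toy problem. The energy below is chosen only for illustration and is unrelated to the actual network losses; each update is the exact minimizer over one variable with the other held fixed.

```python
def energy(a, b):
    """Toy energy: quadratic in a for fixed b, and quadratic in b for fixed a."""
    return (a * b - 2.0) ** 2 + (a - 1.0) ** 2

def alternate(a, b, iters=200):
    """Alternate exact coordinate-wise minimization, recording the energy."""
    history = [energy(a, b)]
    for _ in range(iters):
        a = (2.0 * b + 1.0) / (b * b + 1.0)   # minimizer over a with b fixed
        b = 2.0 / a                           # minimizer over b with a fixed
        history.append(energy(a, b))
    return a, b, history

a, b, history = alternate(a=3.0, b=0.5)
```

Each step can only lower the energy, so the recorded values are non-increasing, and for this toy problem the iterates approach the minimizer at a = 1, b = 2.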

4.1.1 Loss Functions

For the global depth unary potential, the global depth network parameters can be solved by minimizing the corresponding loss function over the training set, i.e., the squared differences between the training depths and the depths predicted by the global depth network. We note that the intrinsic image unary term does not contain network parameters to be learned, so it is used only in the testing procedure.

The pairwise potentials each incorporate two networks, namely the gradient prediction network and gradient scale network, so they are iteratively trained. The loss function for the depth gradient pairwise potential is defined from that potential in the same way, evaluated over the training set. The loss functions for the pairwise potentials of the albedo gradients and shading gradients are similarly defined.

These loss functions are minimized using stochastic gradient descent with standard back-propagation [40]. First, the global depth network is estimated through its unary loss. Then the gradient prediction networks and the gradient scale networks are iteratively estimated through their pairwise losses. In each iteration, the loss functions are differently defined according to the other network outputs, where the network parameters are initialized with the values obtained from the previous iteration. In this way, the networks are trained jointly and account for the improving outputs of the other networks.


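Algorithm 1: Unified Depth and Intrinsic Image Prediction

Input: training image set, query color image
Output: depth, albedo, and shading of the query image

Training Procedure
  Learn the global depth network parameters on the training set using back-propagation.
  Initialize the parameters of the depth, albedo, and shading gradient scale networks to provide constant values.
  while not converged do
    Update the depth, albedo, and shading gradient network parameters with the gradient scale network parameters fixed.
    Update the depth, albedo, and shading gradient scale network parameters with the gradient network parameters fixed.
  end while

Testing Procedure
  for each image pyramid level, from coarse to fine, do
    Estimate the global depth and depth gradients using forward-propagation.
    Estimate the albedo and shading gradients using forward-propagation.
    while not converged do
      Estimate the depth, albedo, and shading gradient scales using forward-propagation.
      Estimate depth, albedo, and shading by optimizing the depth and intrinsic energy functions.
      Compute the gradient fields from the current depth, albedo, and shading estimates.
    end while
    Interpolate the depth, albedo, and shading estimates to the size of the next level using bilinear interpolation.
  end for
  Output the depth, albedo, and shading estimated at the finest level.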


4.2 Testing

4.2.1 Iterative Joint Prediction

In the testing procedure, the depth, albedo, and shading outputs for a given input image are predicted by minimizing the joint CRF energy function with constraints from the estimates computed using the learned network parameters and forward-propagation. Similar to the training procedure, we minimize the energy with an iterative scheme due to its non-quadratic form, where the depth energy and the intrinsic energy are minimized in alternation.

For the depth prediction, the energy is defined as a data term for global depth and a pairwise term for depth gradients, both built from the network outputs, with the predicted depth gradients weighted by the gradient scale derived from the gradient scale network. We note that since the depth gradient scale is computed with the image, albedo, and shading gradients, all of the predictions need to be iteratively estimated.

For the intrinsic prediction, the energy is also defined as data and pairwise terms, with the image formation equation and the albedo and shading gradients, where the albedo and shading gradient scales are defined similarly to the depth gradient scale. This energy function can be optimized with an existing linear solver [1]. The depth and intrinsic energy functions are iteratively minimized while providing information in the form of depth, albedo, and shading gradients to each other.
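Minimizing a data term plus a gradient-guidance term reduces to a sparse linear system, much like a screened Poisson equation. The following 1-D sketch assumes a coarse depth signal, a guidance gradient that already includes any gradient scale, and a single weight `lam`; the function name and the toy signal are illustrative and do not reproduce the paper's solver.

```python
import numpy as np

def reconstruct_depth(coarse, guide_grad, lam):
    """Minimize sum((d - coarse)^2) + lam * sum((d[i+1] - d[i] - guide_grad[i])^2)."""
    n = coarse.size
    G = np.diff(np.eye(n), axis=0)            # (n-1, n) forward-difference matrix
    lhs = np.eye(n) + lam * G.T @ G           # normal equations of the quadratic energy
    rhs = coarse + lam * G.T @ guide_grad
    return np.linalg.solve(lhs, rhs)

true_depth = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])
coarse = np.array([1.0, 1.0, 1.5, 2.5, 3.0, 3.0])     # blurred across the depth edge
guide_grad = np.diff(true_depth)                       # sharp guidance gradients

refined = reconstruct_depth(coarse, guide_grad, lam=100.0)
```

The coarse signal fixes the overall depth level while the guidance gradients restore the sharp edge, which mirrors how the global depth estimate and the predicted depth gradients are described as complementing each other.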

4.2.2 Coarse-to-Fine Joint Prediction

In estimating depth and intrinsic images, enforcing a degree of global consistency can lead to performance gains [1, 5]. Although our JCNF model is solved by global energy minimization, its global consistency is limited because gradients are defined just between pixel neighbors. For greater global consistency, we apply our joint prediction model in a coarse-to-fine manner, where color images are constructed at multiple image pyramid levels, and the depth and intrinsic images at each level are predicted from the color image at that level. Coarser scale results are then used as guidance for finer levels.

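Specifically, we reformulate the depth energy at each pyramid level so that its unary function also incorporates the result from the coarser level. The intrinsic energy is reformulated for each level in the same way. These multi-scale unary functions lead to more reliable solutions and faster convergence. The high-level algorithm for the training and testing procedures is provided in Algorithm 1.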
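The coarse-to-fine control flow can be sketched with a simple image pyramid and bilinear upsampling. The 2x2 averaging pyramid, the corner-aligned bilinear resize, and the placeholder per-level update are all assumptions made for illustration; in the paper each level involves minimizing the depth and intrinsic energies.

```python
import numpy as np

def build_pyramid(img, levels):
    """Halve the resolution by 2x2 averaging; returns levels ordered coarse to fine."""
    pyramid = [img]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = prev.shape[0] // 2 * 2, prev.shape[1] // 2 * 2
        pyramid.append(prev[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyramid[::-1]

def resize_bilinear(img, out_h, out_w):
    """Bilinear resize with corner-aligned sample positions."""
    h, w = img.shape
    ys, xs = np.linspace(0, h - 1, out_h), np.linspace(0, w - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

image = np.add.outer(np.arange(8.0), np.arange(8.0))
pyramid = build_pyramid(image, levels=3)

estimate = None
for level in pyramid:                                   # coarse to fine
    init = level if estimate is None else resize_bilinear(estimate, *level.shape)
    estimate = 0.5 * (init + level)   # placeholder for the per-level energy minimization
```

The upsampled coarse estimate acts as guidance for the next level, so each finer level starts from a globally consistent solution rather than from scratch.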

Figure 3: Qualitative results on MPI SINTEL [41] for depth prediction. (a) color image, (b) DA [17], (c) DCNF-FCSP(NYU) [7], (d) JCNF, and (e) ground truth.
Figure 4: Qualitative results on MPI SINTEL [41] for intrinsic decomposition of Fig. 3. (a) Shen et al. [30], (b) SIRFS [31], (c) MSCR [11], (d) JCNF, and (e) ground truth.

5 Experimental Results

For our experiments, we implemented the JCNF model using the VLFeat MatConvNet toolbox [40]. Our code with pre-trained parameters will be made publicly available upon publication.

The inputs of the global depth network were color images and the corresponding depth images at a reduced scale. The inputs of each gradient network were randomly cropped patches from training images and the corresponding gradient maps. The gradient map patches were smaller than the color image patches, since boundary regions are not processed by convolution [39]. The gradient scale networks also took patches as input. The energy function weights were set by cross-validation. The filter weights of each network layer were initialized by drawing randomly from a zero-mean Gaussian distribution. A common learning rate was used for the networks, except for the final layer of the gradient networks, where a different rate was set.

We additionally augmented the training data by applying random transforms to it, including scalings, in-plane rotations, translations, RGB scalings, image flips, and different gammas.
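A few of these transforms can be sketched as below. The parameter ranges are hypothetical placeholders, not the values used in the experiments, and the function name `augment` is illustrative. Geometric transforms are applied to both the image and its depth map, while photometric ones affect the image only.

```python
import numpy as np

def augment(image, depth, rng):
    """Random horizontal flip, per-channel RGB scaling, and gamma change."""
    if rng.random() < 0.5:                        # flip image and depth together
        image, depth = image[:, ::-1], depth[:, ::-1]
    rgb_scale = rng.uniform(0.8, 1.2, size=3)     # hypothetical range
    gamma = rng.uniform(0.8, 1.2)                 # hypothetical range
    image = np.clip(image * rgb_scale, 0.0, 1.0) ** gamma
    return image, depth

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(6, 8, 3))
depth = rng.uniform(1.0, 10.0, size=(6, 8))
aug_image, aug_depth = augment(image, depth, rng)
```

Scalings, rotations, and translations would follow the same pattern, with the same geometric warp applied to every target map.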

In the following, we evaluated our system through comparisons to state-of-the-art depth prediction and intrinsic image decomposition methods on the MPI SINTEL [41], NYU v2 [42], and Make3D [43] benchmarks. We additionally examined the performance contributions of the joint network learning (wo/jnl), the gradient scale network (wo/gsn), and the coarse-to-fine scheme (wo/ctf).


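| Method | Error: rel | Error: log | Error: rms | Error: rms (log) | Accuracy (thr. 1) | Accuracy (thr. 2) | Accuracy (thr. 3) |
|---|---|---|---|---|---|---|---|
| Depth Transfer [16] | 0.448 | 0.193 | 9.242 | 3.121 | 0.524 | 0.712 | 0.735 |
| Depth Analogy [17] | 0.432 | 0.167 | 8.421 | 2.741 | 0.621 | 0.799 | 0.812 |
| DCNF-FCSP(NYU) [7] | 0.424 | 0.164 | 8.112 | 2.421 | 0.652 | 0.782 | 0.824 |
| JCNF(NYU) | 0.293 | 0.131 | 7.421 | 1.812 | 0.715 | 0.812 | 0.831 |
| JCNF wo/jnl | 0.292 | 0.138 | 7.471 | 1.973 | 0.714 | 0.783 | 0.839 |
| JCNF wo/gsn | 0.271 | 0.119 | 7.451 | 1.921 | 0.724 | 0.793 | 0.893 |
| JCNF wo/ctf | 0.252 | 0.101 | 7.233 | 1.622 | 0.729 | 0.812 | 0.878 |
| JCNF | 0.183 | 0.097 | 6.118 | 1.037 | 0.823 | 0.834 | 0.902 |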


Table 2: Quantitative results on MPI SINTEL [41] for depth prediction. DCNF-FCSP (NYU) [7] and JCNF(NYU) predict the depth by pre-training on NYU v2 [42].


| Method | MSE: albedo | MSE: shading | MSE: avg. | LMSE: albedo | LMSE: shading | LMSE: avg. | DSSIM: albedo | DSSIM: shading | DSSIM: avg. |
|---|---|---|---|---|---|---|---|---|---|
| Retinex [44] | 0.053 | 0.049 | 0.051 | 0.033 | 0.028 | 0.031 | 0.214 | 0.206 | 0.210 |
| Li et al. [23] | 0.042 | 0.041 | 0.037 | 0.024 | 0.031 | 0.034 | 0.242 | 0.224 | 0.194 |
| Shen et al. [30] | 0.043 | 0.039 | 0.048 | 0.028 | 0.027 | 0.032 | 0.221 | 0.210 | 0.232 |
| Zhao et al. [22] | 0.047 | 0.041 | 0.031 | 0.028 | 0.029 | 0.031 | 0.210 | 0.257 | 0.214 |
| IIW [24] | 0.041 | 0.032 | 0.041 | 0.032 | 0.031 | 0.027 | 0.281 | 0.241 | 0.284 |
| SIRFS [31] | 0.042 | 0.047 | 0.043 | 0.029 | 0.026 | 0.028 | 0.210 | 0.206 | 0.208 |
| Jeon et al. [4] | 0.042 | 0.033 | 0.032 | 0.021 | 0.021 | 0.023 | 0.204 | 0.181 | 0.193 |
| Chen et al. [1] | 0.031 | 0.028 | 0.029 | 0.019 | 0.019 | 0.019 | 0.196 | 0.165 | 0.181 |
| MSCR [11] | 0.020 | 0.017 | 0.021 | 0.016 | 0.011 | 0.011 | 0.201 | 0.150 | 0.176 |
| JCNF wo/jnl | 0.012 | 0.015 | 0.016 | 0.014 | 0.010 | 0.010 | 0.149 | 0.123 | 0.141 |
| JCNF wo/gsn | 0.008 | 0.011 | 0.011 | 0.010 | 0.009 | 0.008 | 0.146 | 0.112 | 0.132 |
| JCNF wo/ctf | 0.008 | 0.012 | 0.010 | 0.009 | 0.008 | 0.008 | 0.127 | 0.110 | 0.119 |
| JCNF | 0.007 | 0.009 | 0.007 | 0.006 | 0.007 | 0.007 | 0.092 | 0.101 | 0.097 |


Table 3: Quantitative results on MPI SINTEL [41] for intrinsic decomposition using methods based on single images, RGB-D, CNNs, and our JCNF model.

5.1 MPI SINTEL Benchmark

We evaluated our JCNF model on both depth prediction and intrinsic image decomposition on the MPI SINTEL benchmark [41], which consists of frames drawn from multiple scenes. For a fair evaluation, we followed the same experimental protocol as in [1, 11], with their two-fold cross-validation and training/testing image splits. Fig. 3 and Fig. 4 exhibit predicted depth and intrinsic images from a single image, respectively. Table 2 and Table 3 provide quantitative evaluations for the two tasks. For depth prediction, the metrics include average relative difference (rel), average log error (log), root-mean-squared error (rms), its log version (rms (log)), and accuracy with thresholds [7]. For quantitatively evaluating intrinsic image decomposition performance, we used mean-squared error (MSE), local mean-squared error (LMSE), and the dissimilarity version of the structural similarity index (DSSIM) [11].
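The depth metrics can be computed as in the sketch below, which follows commonly used definitions from the single-image depth literature. The threshold values of 1.25, 1.25², and 1.25³ and the use of the natural logarithm for the log-rms are assumptions; the function name `depth_metrics` is illustrative.

```python
import numpy as np

def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Common depth error and accuracy measures for strictly positive depths."""
    rel = float(np.mean(np.abs(pred - gt) / gt))
    log10 = float(np.mean(np.abs(np.log10(pred) - np.log10(gt))))
    rms = float(np.sqrt(np.mean((pred - gt) ** 2)))
    rms_log = float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)))
    ratio = np.maximum(pred / gt, gt / pred)
    acc = [float(np.mean(ratio < t)) for t in thresholds]   # fraction of pixels within t
    return {"rel": rel, "log10": log10, "rms": rms, "rms_log": rms_log, "acc": acc}

gt = np.array([1.0, 2.0, 4.0])
pred = np.array([1.0, 2.0, 8.0])
metrics = depth_metrics(pred, gt)
perfect = depth_metrics(gt, gt)
```

Lower is better for the four error measures, and higher is better for the threshold accuracies, which matches how the tables are read.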


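| Method | Error: rel | Error: log | Error: rms | Error: rms (log) | Accuracy (thr. 1) | Accuracy (thr. 2) | Accuracy (thr. 3) |
|---|---|---|---|---|---|---|---|
| Make3D [12] | 0.349 | - | 1.214 | 0.409 | 0.447 | 0.745 | 0.897 |
| Depth Transfer [16] | 0.350 | 0.134 | 1.1 | 0.378 | 0.460 | 0.742 | 0.893 |
| Depth Analogy [17] | 0.328 | 0.132 | 1.31 | 0.392 | 0.471 | 0.799 | 0.891 |
| MS-CNNs [6] | 0.228 | - | 0.901 | 0.293 | 0.611 | 0.873 | 0.961 |
| DCNF-FCSP [7] | 0.221 | 0.095 | 0.760 | 0.281 | 0.604 | 0.885 | 0.974 |
| JCNF(MPI) | 0.214 | 0.093 | 0.716 | 0.241 | 0.677 | 0.879 | 0.927 |
| JCNF wo/jnl | 0.216 | 0.101 | 0.753 | 0.241 | 0.625 | 0.896 | 0.925 |
| JCNF wo/gsn | 0.210 | 0.091 | 0.728 | 0.254 | 0.621 | 0.890 | 0.975 |
| JCNF wo/ctf | 0.208 | 0.106 | 0.708 | 0.237 | 0.681 | 0.901 | 0.972 |
| JCNF | 0.201 | 0.077 | 0.711 | 0.212 | 0.690 | 0.910 | 0.979 |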


Table 4: Quantitative results on the NYU v2 dataset [42] for depth prediction.
Figure 5: Qualitative results on NYU v2 [42] for depth prediction. (a) color image, (b) MS-CNNs [6], (c) DCNF-FCSP [7], (d) JCNF(MPI), (e) JCNF, and (f) ground truth.
Figure 6: Qualitative results on NYU v2 [42] for intrinsic decomposition of Fig. 5. (a) Li et al. [23], (b) IIW [24], (c) Jeon et al. [4], (d) JCNF learned using [4], (e) Chen et al. [1], and (f) JCNF learned using [1].

For the depth prediction task, data-driven approaches (DT [16] and DA [17]) provided limited performance due to their low learning capacity. CNN-based depth prediction (DCNF-FCSP [7]) using a pre-trained model from NYU v2 [42] showed better performance, but is restricted by depth ambiguity problems. Our JCNF model achieved the best results both quantitatively and qualitatively, whether pre-trained using the MPI SINTEL or NYU v2 dataset. Furthermore, omitting the gradient scale network, coarse-to-fine processing, or joint learning significantly reduced depth prediction performance.

In intrinsic image decomposition, existing single-image based methods [44, 23, 30, 22, 24] produced the lowest quality results as they do not benefit from any additional information. RGB-D based methods [1, 5, 4] performed better with measured depth as input. CNN-based intrinsic decomposition [11] surpassed RGB-D based techniques even without having depth as an input, but its results exhibit some blur, likely due to ambiguity from limited training datasets. Thanks to its gradient domain learning and its use of estimated depth information, our JCNF model provides more accurate and edge-preserved results, with the best qualitative and quantitative performance.


Methods               Error ()                        Error ()
                      rel     log     rms     rms     rel     log     rms     rms
Make3D [12]           0.412   0.165   11.1    0.451   0.407   0.155   16.1    0.486
Depth Transfer [16]   0.355   0.127   9.20    0.421   0.438   0.161   14.81   0.461
Depth Analogy [17]    0.371   0.121   8.11    0.381   0.410   0.144   14.52   0.479
DCNF-FCSP [7]         0.331   0.119   8.60    0.392   0.307   0.125   12.89   0.412
JCNF(MPI)             0.273   0.110   7.70    0.351   0.263   0.117   8.62    0.347
JCNF(NYU)             0.274   0.097   7.22    0.352   0.287   0.127   8.22    0.341
JCNF                  0.262   0.092   6.61    0.321   0.243   0.091   6.34    0.302


Table 5: Quantitative results on the Make3D dataset [43] for depth prediction.

5.2 NYU v2 RGB-D Benchmark

For further evaluation, we obtained a set of RGB, depth, and intrinsic images by applying RGB-D based intrinsic image decomposition methods [1, 4] on the NYU v2 RGB-D database [42]. Of its RGB-D images of indoor scenes, we used for training and for testing, which is the standard training/testing split for the dataset.

For depth prediction, comparisons are made to the ground truth depth in Fig. 5 and Table 4 using the same experimental settings as in [7]. The state-of-the-art CNN-based methods [6, 7] clearly outperformed other previous methods. The performance of our JCNF model was even higher, with pre-training on either MPI SINTEL or NYU v2. Our depth prediction network is similar to [6], but it additionally predicts depth gradients and leverages intrinsic image estimates to elevate performance.
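As a rough illustration of how such a comparison against ground truth depth is commonly scored, the sketch below computes the threshold accuracy used in the single-image depth literature (e.g., [6]): the fraction of pixels whose ratio max(d/d*, d*/d) falls below 1.25, 1.25^2, or 1.25^3. That the last three columns of the NYU v2 results table (Table 4) follow this convention is an assumption here, and the function name and sample values are purely illustrative.

```python
def threshold_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below `threshold`.

    pred, gt: flat sequences of depths; pixels with a non-positive
    value in either sequence are skipped as invalid.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


# Illustrative values only (not taken from any dataset).
pred = [1.0, 1.4, 1.8, 3.0]
gt = [1.0, 1.0, 1.0, 1.0]
accuracies = [threshold_accuracy(pred, gt, 1.25 ** k) for k in (1, 2, 3)]
```

Higher values are better for this measure, in contrast to the error measures, where lower is better.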

In the intrinsic image decomposition results of Fig. 6, the outputs of the RGB-D based methods of [1, 4] are used as ground truth for training. Our JCNF results resemble this assumed ground truth more closely than those of single-image based techniques [23, 24].

5.3 Make3D RGB-D Benchmark

We also evaluated our JCNF model on the Make3D dataset [43], which contains images depicting outdoor scenes (with used for training and for testing). To account for a limitation of this dataset [12, 45, 7], we calculate depth errors in two ways [45, 7]: on only regions with ground truth depth less than meters (denoted by ), and over the entire image (). From the depth prediction results in Fig. 7 and Table 5, our JCNF model is found to yield the highest accuracy, even when pretrained on MPI SINTEL [41] or NYU v2 [42] (i.e., JCNF(MPI) and JCNF(NYU)). For the intrinsic image decomposition results given in Fig. 8, JCNF also outperforms the comparison techniques.
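The two-way evaluation described above can be sketched as follows. This is a minimal example assuming the standard definitions of the relative (rel), log10, and root-mean-square (rms) errors; the function name `depth_errors`, the `max_depth` parameter, and the 50 m value in the usage lines are hypothetical placeholders and not values taken from this paper.

```python
import math


def depth_errors(pred, gt, max_depth=None):
    """Return (rel, log10, rms) errors over the scored pixels.

    pred, gt: flat sequences of depths in metres; pred must be positive.
    max_depth: if given, only pixels whose ground-truth depth is below
    this value are scored; None scores every pixel with valid ground truth.
    """
    pairs = [(p, g) for p, g in zip(pred, gt)
             if g > 0 and (max_depth is None or g < max_depth)]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)
    rel = sum(abs(p - g) / g for p, g in pairs) / n
    log10 = sum(abs(math.log10(p) - math.log10(g)) for p, g in pairs) / n
    rms = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    return rel, log10, rms


# Illustrative values only; the 50 m cutoff is a placeholder.
pred = [10.0, 20.0, 90.0]
gt = [8.0, 25.0, 100.0]
near_only = depth_errors(pred, gt, max_depth=50.0)  # thresholded regions
whole_image = depth_errors(pred, gt)                # entire image
```

Restricting the evaluation to nearer regions excludes distant pixels, whose large absolute errors would otherwise dominate measures such as rms.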

Figure 7: Qualitative results on Make3D [43] for depth prediction. (a) color image, (b) DA [17], (c) DCNF-FCSP [7], (d) JCNF(MPI), (e) JCNF(NYU), (f) JCNF, and (g) ground truth.
Figure 8: Qualitative results on Make3D [43] for intrinsic decomposition of Fig. 7. (a) Li et al. [23], (b) Zhao et al. [22], (c) IIW [24], (d) Jeon et al. [4], (e) JCNF learned using [4], (f) Chen et al. [1], and (g) JCNF learned using [1].

6 Conclusion

We presented Joint Convolutional Neural Fields (JCNF) for jointly predicting depth, albedo, and shading maps from a single input image. Its high performance can be attributed to its shared network architecture, its gradient domain inference, and the incorporation of a gradient scale network. Extensive experiments show that synergistically solving for these physical scene properties through the JCNF leads to state-of-the-art results in both single-image depth prediction and intrinsic image decomposition. In future work, JCNF can potentially benefit shape refinement and image relighting from a single image.


  • [1] Chen, Q., Koltun, V.: A simple model for intrinsic image decomposition with depth cues. ICCV (2013)
  • [2] Laffont, P.Y., Bousseau, A., Paris, S., Durand, F., Drettakis, G.: Coherent intrinsic images from photo collections. ACM TOG 31(6) (2012) 1–11
  • [3] Lee, K.J., Zhao, Q., Tong, X., Gong, M., Izadi, S., L.S.U., Tan, P., Lin, S.: Estimation of intrinsic image sequences from image + depth video. ECCV (2012)
  • [4] Jeon, J., Cho, S., Tong, X., Lee, S.: Intrinsic image decomposition using structure-texture separation and surface normals. ECCV (2014)
  • [5] Barron, J.T., Malik, J.: Intrinsic scene properties from a single RGB-D image. CVPR (2013)
  • [6] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. NIPS (2014)
  • [7] Liu, F., Shen, C., Lin, G.: Deep convolutional neural fields for depth estimation from a single image. CVPR (2015)
  • [8] Kong, N., Black, M.J.: Intrinsic depth: Improving depth transfer with intrinsic images. ICCV (2015)
  • [9] Shelhamer, E., Barron, J., Darrell, T.: Scene intrinsics and depth from a single image. ICCV workshop (2015)
  • [10] Zhou, T., Krahenbuhl, P., Efros, A.A.: Learning data-driven reflectance priors for intrinsic image decomposition. ICCV (2015)
  • [11] Narihira, T., Maire, M., Yu, S.X.: Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. ICCV (2015)
  • [12] Saxena, A., Sun, M., Ng, A.Y.: Make3D: Learning 3D scene structure from a single still image. IEEE Trans. PAMI 31(5) (2009) 824–840
  • [13] Wang, Y., Wang, R., Dai, Q.: A parametric model for describing the correlation between single color images and depth maps. IEEE SPL 21(7) (2014) 800–803
  • [14] Li, X., Qin, H., Wang, Y., Zhang, Y., Dai, Q.: DEPT: Depth estimation by parameter transfer for single still images. CVPR (2014)
  • [15] Konrad, J., Wang, M., Ishwar, P., Wu, C., Mukherjee, D.: Learning-based, automatic 2d-to-3d image and video conversion. IEEE Trans. IP 22(9) (2013) 3485–3496
  • [16] Karsch, K., Liu, C., Kang, S.B.: Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Trans. PAMI 32(11) (2014) 2144–2158
  • [17] Choi, S., Min, D., Ham, B., Kim, Y., Oh, C., Sohn, K.: Depth analogy: Data-driven approach for single image depth estimation using gradient samples. IEEE Trans. IP 24(12) (2015) 5953–5966
  • [18] Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A.: Towards unified depth and semantic prediction from a single image. CVPR (2015)
  • [19] Barrow, H.G., Tenenbaum, J.M.: Recovering intrinsic scene characteristics from images. CVS (1978)
  • [20] Land, E.H., McCann, J.J.: Lightness and retinex theory. JOSA 61(1) (1971) 1–11
  • [21] Shen, J., Tan, P., Lin, S.: Intrinsic image decomposition with non-local texture cues. CVPR (2008)
  • [22] Zhao, Q., Tan, P., Dai, Q., Shen, L., Wu, E., Lin, S.: A closed-form solution to retinex with non-local texture constraints. IEEE Trans. PAMI 34(7) (2012) 1437–1444
  • [23] Li, Y., Brown, M.S.: Single image layer separation using relative smoothness. CVPR (2004)
  • [24] Bell, S., Bala, K., Snavely, N.: Intrinsic images in the wild. ACM TOG 33(4) (2014)
  • [25] Bonneel, N., Sunkavalli, K., Tompkin, J., Sun, D., Paris, S., Pfister, H.: Interactive intrinsic video editing. ACM Trans. Graphics (SIGGRAPH ASIA) (2014)
  • [26] Weiss, Y.: Deriving intrinsic images from image sequences. ICCV (2001)
  • [27] Laffont, P.Y., Bousseau, A., Drettakis, G.: Rich intrinsic image decomposition of outdoor scenes from multiple views. IEEE TVCG 19(2) (2013) 1–11
  • [28] Kong, N., Gehler, P.V., Black, M.J.: Intrinsic video. ECCV (2014)
  • [29] Bousseau, A., Paris, S., Durand, F.: User-assisted intrinsic images. ACM TOG 28(5) (2009) 1–11
  • [30] Shen, J., Yang, X., Jia, Y.: Intrinsic image using optimization. CVPR (2011)
  • [31] Barron, J., Malik, J.: Shape, albedo, and illumination from a single image of an unknown object. CVPR (2012)
  • [32] Butler, D., Wulff, J., Stanley, G., Black, M.: A naturalistic open source movie for optical flow evaluation. ECCV (2012)
  • [33] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. PAMI 37(9) (2015) 1904–1916
  • [34] Perez, P., Gangnet, M., Blake, A.: Poisson image editing. ACM TOG 22(3) (2003)
  • [35] Xu, L., Ren, J., Yan, Q., Liao, R., Jia, J.: Deep edge-aware filters. ICML (2015)
  • [36] Shen, X., Yan, Q., Xu, L., Ma, L., Jia, J.: Multispectral joint image restoration via optimizing a scale map. IEEE Trans. PAMI 31(9) (2015) 1582–1599
  • [37] Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. ICCV (2015)
  • [38] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. NIPS (2012)
  • [39] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Trans. PAMI 37(3) (2015) 597–610
  • [40] Online.:
  • [41] Online.:
  • [42] Online.:
  • [43] Online.:
  • [44] Grosse, R., Johnson, M.K., Adelson, E.H., Freeman, W.T.: Ground truth and baseline evaluations for intrinsic image algorithms. ICCV (2009)
  • [45] Liu, M., Salzmann, M., He, X.: Discrete-continuous depth estimation from a single image. CVPR (2014)