The wide variety of images around us is the outcome of interactions between lighting, shapes and materials. In recent years, the advent of convolutional neural networks (CNNs) has led to significant advances in recovering shape from just a single image [1, 2]. In contrast, material estimation has not seen as much progress, which might be attributed to multiple causes. First, material properties can be more complex. Even discounting more complex global illumination effects, materials are represented by a spatially-varying bidirectional reflectance distribution function (SVBRDF), which is an unknown high-dimensional function that depends on the exitant and incident lighting directions [3]. Second, while large-scale synthetic and real datasets have been collected for shape estimation [4, 5], there is a lack of similar data for material estimation. Third, pixel observations in a single image contain entangled information from factors such as shape and lighting, besides material, which makes estimation ill-posed.
In this work, we present a practical material capture method that can recover an SVBRDF from a single image of a near-planar surface, acquired using the camera of an off-the-shelf consumer mobile phone, under unconstrained environment illumination. This is in contrast to conventional BRDF capture setups that usually require significant equipment and expense [6, 7]. We address this challenge by proposing a novel CNN architecture that is specifically designed to account for the physical form of BRDFs and the interaction of light with materials, which leads to a better learning objective. We also propose to use a novel dataset of SVBRDFs that has been designed for perceptual accuracy of materials. This is in contrast to prior datasets that are limited to homogeneous materials, or juxtapose material properties with other concepts such as object categories.
We introduce a novel CNN architecture that encodes the input image into a latent representation, which is decoded into components corresponding to surface normals, diffuse texture, and specular roughness. We propose a differentiable rendering layer that recombines the estimated components with a novel lighting direction. This gives us additional supervision from images of the material rendered under arbitrary lighting directions during training; only a single image is used at test time. We also observe that coarse classification of BRDFs into material meta-categories is an easier task, so we additionally include a material classifier to constrain the latent representation. The inferred BRDF parameters from the CNN are quite accurate, but we achieve further improvement using densely-connected conditional random fields (DCRFs) with novel unary and smoothness terms that reflect the properties of the underlying microfacet BRDF model. We train the entire framework in an end-to-end manner.
Our approach, using our novel architecture and SVBRDF dataset, can outperform the state of the art. We demonstrate that we can further improve these results by leveraging a form of acquisition control that is present on virtually every mobile phone: the camera flash. We turn on the flash of the mobile phone camera during acquisition; our images are thus captured under a combination of unknown environment illumination and the flash. The flash illumination helps further improve our reconstructions. First, it minimizes shadows caused by occlusions. Second, it allows better observation of high-frequency specular highlights, which allows better characterization of material type and more accurate estimation. Third, it provides a relatively simple setup for acquisition that eases the burden on estimation and allows the use of better post-processing techniques.
In contrast to recent works such as [8] and [9] that can reconstruct BRDFs with stochastic textures, we can handle a much larger class of materials. Also, our results, both with and without flash, are a significant improvement over the recent method of Li et al. [10], even though our trained model is more compact. Our experiments demonstrate advantages over several baselines and prior works in quantitative comparisons, while also achieving superior qualitative results. In particular, the generalization ability of our network trained on the synthetic BRDF dataset is demonstrated by strong performance on real images, acquired in the wild, in both indoor and outdoor environments, using multiple different phone cameras. Given the estimated BRDF parameters, we also demonstrate applications such as material editing and relighting of novel shapes.
To summarize, we propose the following contributions:
A novel lightweight SVBRDF acquisition method that produces state-of-the-art reconstruction quality.
A CNN architecture that exploits domain knowledge for joint SVBRDF reconstruction and material classification.
Novel DCRF-based post-processing that accounts for the microfacet BRDF model to refine network outputs.
An SVBRDF dataset that is large-scale and specifically attuned to estimation of spatially-varying materials.
2 Related Work
BRDF Acquisition: The Bidirectional Reflectance Distribution Function (BRDF) is a 4-D function that characterizes how a surface reflects lighting from an incident direction toward an outgoing direction [3]. Alternatively, BRDFs are represented using low-dimensional parametric models [11, 12, 13, 14]. In this work, we use the physically-based microfacet model [15] that our SVBRDF dataset uses.
Traditional methods for BRDF acquisition rely on densely sampling this 4-D space using expensive, calibrated acquisition systems [6, 7, 16]. Recent work has demonstrated that assuming BRDFs lie in a low-dimensional subspace allows them to be reconstructed from a small set of measurements [17, 18]. However, these measurements still need to be taken under controlled settings. In contrast, we assume only a single image captured under largely uncontrolled settings.
Photometric stereo-based methods recover shape and/or BRDFs from images. Some of these methods recover a homogeneous BRDF given one or both of the shape and illumination [19, 20, 21]. Chandraker et al. [22, 23, 24] utilize motion cues to jointly recover the shape and BRDF of objects from images under known directional illumination. Hui et al. [25] recover SVBRDFs and shape from multiple images under known illuminations. All of these methods require some form of calibrated acquisition; in contrast, we wish to capture SVBRDFs and normal maps “in-the-wild”.
Recent work has shown promising results for “in-the-wild” BRDF acquisition. Hui et al. [26] demonstrate that the collocated camera-light setup on mobile devices is sufficient to reconstruct SVBRDFs and normals. However, they need to capture 30+ images and calibrate them to reconstruct SVBRDFs; we aim to do this from a single image. Aittala et al. [8] propose using a flash/no-flash image pair to reconstruct stochastic SVBRDFs and normals with a slow optimization-based scheme. Our method can handle a larger class of materials and is orders of magnitude faster.
Deep learning-based Material Estimation: Inspired by the success of deep learning on a variety of vision and graphics tasks, recent work has looked at CNN-based material recognition and estimation. Bell et al. [27] train a material parsing network using crowd-sourced labeled data; however, their material recognition is driven more by object context than by appearance. Liu et al. [28] demonstrate image-based material editing using a network trained to recover homogeneous BRDFs. Other methods decompose images into their intrinsic image components, an intermediate representation for material and shape [29, 30, 31]. Rematas et al. [32] train a CNN to reconstruct the reflectance map – a convolution of the BRDF with the illumination – from a single image of a shape from a known class. In subsequent work, they disentangle the reflectance map into the BRDF and illumination. Neither of these methods handles SVBRDFs, nor do they recover fine surface normal details. Kim et al. reconstruct a homogeneous BRDF by training a network to aggregate multi-view observations of an object of known shape.
Similar to us, Aittala et al. [9] and Li et al. [10] reconstruct SVBRDFs and surface normals from a single image of a near-planar surface. Aittala et al. use a neural style transfer-based optimization approach to iteratively estimate BRDF parameters; however, they can only handle stationary textures, and there is no correspondence between the input image and the reconstructed BRDF. Li et al. use supervised learning to train a CNN to predict SVBRDFs and normals from a single image captured under environment illumination. Their training set is small, which necessitates a self-augmentation method to generate training samples from unlabeled real data. Further, they train a different set of networks for each parameter (diffuse texture, normals, specular albedo and roughness) and each material type (wood, metal, plastic). We demonstrate that by using our novel CNN architecture, supervised training on a high-quality dataset and acquisition under flash illumination, we are able to (a) reconstruct all these parameters with a single network, (b) learn a latent representation that also enables material recognition and editing, and (c) produce results that are significantly better qualitatively and quantitatively.
3 Acquisition Setup and SVBRDF Dataset
In this section, we describe the setup for single image SVBRDF acquisition and the dataset we use for learning.
Our goal is to reconstruct the spatially-varying BRDF of a near-planar surface from a single image captured by a mobile phone with the flash turned on for illumination. We assume that the z-axis of the camera is approximately perpendicular to the planar surface (we explicitly evaluate against this assumption in our experiments). For most mobile devices, the flash is located very close to the camera, which provides a univariate sampling of an isotropic BRDF [26]. We argue that imaging with a collocated camera and point light provides additional constraints that yield better BRDF reconstructions compared to acquisition under just environment illumination.
Our surface appearance is represented by a physically-based microfacet parametric BRDF model [15]. Let $d_i$, $\mathbf{n}_i$ and $r_i$ be the diffuse color, normal and roughness, respectively, at pixel $i$. Our BRDF model is defined as:
$$\rho(d_i, \mathbf{n}_i, r_i) = d_i + \frac{D(\mathbf{h}_i, r_i)\, F(\mathbf{v}_i, \mathbf{h}_i)\, G(\mathbf{l}_i, \mathbf{v}_i, \mathbf{n}_i, r_i)}{4\,(\mathbf{n}_i \cdot \mathbf{l}_i)(\mathbf{n}_i \cdot \mathbf{v}_i)}$$
where $\mathbf{v}_i$ and $\mathbf{l}_i$ are the view and light directions, $D$, $F$ and $G$ are the distribution, Fresnel and geometric terms, and $\mathbf{h}_i$ is the half-angle vector. Given an observed image, captured under unknown illumination, we wish to recover the parameters $d_i$, $\mathbf{n}_i$ and $r_i$ for each pixel in the image. Please refer to the supplementary material for more details on the BRDF model.
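As a concrete sketch, the per-pixel model above can be evaluated numerically. The GGX normal distribution, Schlick Fresnel approximation, Smith-style geometry term and the `f0` reflectance value used here are common microfacet choices assumed for illustration; the paper's exact $D$, $F$, $G$ terms are given in its supplementary material.

```python
import numpy as np

def microfacet_brdf(diffuse, normal, roughness, l, v, f0=0.05):
    """Evaluate a microfacet BRDF per pixel.

    diffuse: (..., 3) albedo; normal/l/v: (..., 3) unit vectors;
    roughness: scalar or (...,) in (0, 1]. f0 is an assumed Fresnel
    reflectance at normal incidence (illustrative).
    """
    h = l + v
    h = h / np.linalg.norm(h, axis=-1, keepdims=True)           # half-angle vector
    n_dot_l = np.clip(np.sum(normal * l, -1), 1e-6, 1.0)
    n_dot_v = np.clip(np.sum(normal * v, -1), 1e-6, 1.0)
    n_dot_h = np.clip(np.sum(normal * h, -1), 0.0, 1.0)
    v_dot_h = np.clip(np.sum(v * h, -1), 0.0, 1.0)

    a2 = roughness ** 4                                         # alpha = roughness^2
    D = a2 / (np.pi * ((n_dot_h ** 2) * (a2 - 1.0) + 1.0) ** 2)  # GGX distribution
    F = f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5                  # Schlick Fresnel
    k = (roughness ** 2) / 2.0
    G = (n_dot_l / (n_dot_l * (1 - k) + k)) * (n_dot_v / (n_dot_v * (1 - k) + k))

    spec = np.asarray(D * F * G / (4.0 * n_dot_l * n_dot_v))[..., None]
    # Some formulations divide the diffuse term by pi; we follow the form above.
    return diffuse + spec
```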
We train our network on the Adobe Stock 3D Material dataset111https://stock.adobe.com/3d-assets, which contains 688 materials with high-resolution spatially-varying BRDFs. Part of the dataset is created by artists, while the rest is captured using a scanner. We use 588 materials for training and 100 materials for testing. For data augmentation, we randomly crop 12, 8, 4, 2 and 1 image patches of size 512, 1024, 2048, 3072 and 4096, respectively. We resize the image patches for processing by our network. We flip patches along the x and y axes and rotate them in increments of 45 degrees. Thus, for each material, we have 270 image patches.222The total number of image patches for each material can be computed as (12 + 8 + 4 + 2 + 1) × 2 × 5 = 270. We randomly scale the diffuse color, normal and roughness for each image patch to prevent the network from overfitting and memorizing the materials. We manually segment the dataset into material types. The distribution is shown in Table 2, with an example visualization of each material type in Figure 2. More details on rendering the dataset are in the supplementary material.
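The augmentation bookkeeping above can be checked with a small calculation; the factorization into 2 flip states and 5 rotation variants is an assumption consistent with the footnote's patch counts.

```python
# Hypothetical bookkeeping for the data augmentation described above.
crops_per_size = {512: 12, 1024: 8, 2048: 4, 3072: 2, 4096: 1}
n_crops = sum(crops_per_size.values())   # 27 raw patches per material
n_flip_states = 2                        # original and flipped (assumed)
n_rotations = 5                          # rotation variants kept (assumed)
total_patches = n_crops * n_flip_states * n_rotations
print(total_patches)                     # 270
```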
4 Network Design for SVBRDF Estimation
In this section, we describe the components of our CNN designed for single-image SVBRDF estimation. The overall architecture is illustrated in Figure 3.
4.1 Considerations for Network Architecture
Single-image SVBRDF estimation is an ill-posed problem. Thus, we adopt a data-driven approach with a custom-designed CNN that reflects physical intuitions.
Our basic network architecture consists of a single encoder and three decoders which reconstruct the three spatially-varying BRDF parameters: diffuse color $d$, normals $\mathbf{n}$ and roughness $r$. The intuition behind using a single encoder is that the different BRDF parameters are correlated; representations learned for one should be useful for inferring the others, which allows a significant reduction in the size of the network. The input to the network is an RGB image, augmented with the pixel coordinates as a fourth channel. We add the pixel coordinates since the distribution of light intensities is closely related to pixel location; for instance, the center of the image will usually be much brighter. Since CNNs are spatially invariant, we need this extra signal to let the network behave differently for pixels at different locations. Skip links connect the encoder and decoders to preserve details of the BRDF parameters.
Another important consideration is that in order to model global effects over whole images like light intensity fall-off or large areas of specular highlights, it is necessary for the network to have a large receptive field. To this end, our encoder network has seven convolutional layers of stride 2, so that the receptive field of every output pixel covers the entire image.
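The two design points above can be sketched as follows; the radial-distance encoding of the coordinate channel and the kernel size of 4 are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def add_coord_channel(img):
    """Append a normalized distance-from-center channel to an (H, W, 3) image.

    The paper adds pixel coordinates as a fourth channel; encoding them as a
    single radial-distance channel is one plausible choice, since flash
    intensity falls off with distance from the image center.
    """
    h, w, _ = img.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    radius = np.sqrt(xs ** 2 + ys ** 2)[..., None]
    return np.concatenate([img, radius], axis=-1)

def receptive_field(num_layers=7, kernel=4, stride=2):
    """Receptive field of a stack of stride-2 convolutions (kernel size assumed 4).

    Seven such layers give a receptive field larger than a 256x256 input,
    so every output activation effectively sees the whole image.
    """
    rf = 1
    for _ in range(num_layers):
        rf = rf * stride + (kernel - stride)
    return rf
```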
4.2 Loss Functions for SVBRDF Estimation
For each BRDF parameter, we have an L2 loss for direct supervision. We now describe other losses for learning a good representation for SVBRDF estimation.
Since our eventual goal is to model the surface appearance, it is important to balance the contributions of different BRDF parameters. Therefore, we introduce a differentiable rendering layer that renders our BRDF model (Eqn. 1) under the known input lighting. We add a reconstruction loss based on the difference between renderings with the predicted parameters and renderings with the ground-truth BRDF parameters. The gradient can be backpropagated through the rendering layer to train the network. In addition to rendering the image under the input lighting, we also render images under novel lights. For each batch, we create novel lights by randomly sampling point light source directions on the upper hemisphere. This ensures that the network does not overfit to collocated illumination and is able to reproduce appearance under other lighting conditions. The final loss function for the encoder-decoder part of our network is:
$$\mathcal{L} = \lambda_d \mathcal{L}_d + \lambda_n \mathcal{L}_n + \lambda_r \mathcal{L}_r + \lambda_{rec} \mathcal{L}_{rec}$$
where $\mathcal{L}_d$, $\mathcal{L}_n$, $\mathcal{L}_r$ and $\mathcal{L}_{rec}$ are the L2 losses for the diffuse, normal, roughness and rendered image predictions, respectively, and the $\lambda$'s are positive coefficients that balance the contributions of the various terms.
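A minimal sketch of this combined loss, assuming hypothetical dictionary-based parameter maps and an abstract `render_fn` standing in for the differentiable rendering layer (names and default weights are illustrative):

```python
import numpy as np

def sample_upper_hemisphere(rng):
    """Sample a light direction on the upper hemisphere (z > 0)."""
    z = rng.uniform(0.0, 1.0)
    phi = rng.uniform(0.0, 2 * np.pi)
    r = np.sqrt(1.0 - z * z)
    return np.array([r * np.cos(phi), r * np.sin(phi), z])

def svbrdf_losses(pred, gt, render_fn, n_novel=3, weights=(1, 1, 1, 1), seed=0):
    """Per-parameter L2 losses plus a rendering loss under the input
    (collocated) light and randomly sampled novel lights.

    pred/gt: dicts with 'diffuse', 'normal', 'rough' arrays;
    render_fn(params, l) returns an image for light direction l.
    """
    rng = np.random.default_rng(seed)
    l2 = lambda a, b: float(np.mean((a - b) ** 2))
    loss_d = l2(pred['diffuse'], gt['diffuse'])
    loss_n = l2(pred['normal'], gt['normal'])
    loss_r = l2(pred['rough'], gt['rough'])
    # Collocated light first, then novel lights on the upper hemisphere.
    lights = [np.array([0.0, 0.0, 1.0])]
    lights += [sample_upper_hemisphere(rng) for _ in range(n_novel)]
    loss_rec = np.mean([l2(render_fn(pred, l), render_fn(gt, l)) for l in lights])
    wd, wn, wr, wrec = weights
    return wd * loss_d + wn * loss_n + wr * loss_r + wrec * loss_rec
```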
Since we train on near-planar surfaces, the majority of the normal directions are flat. Table 1 shows the normal distributions in our dataset. To prevent the network from over-smoothing the normals, we group the normal directions into different bins and assign each bin a different weight when computing the L2 error. This balances the various normal directions in the loss function.
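A sketch of the reweighted normal loss, with an assumed binning over the z-component of the ground-truth normal (the paper's exact binning scheme and weights may differ):

```python
import numpy as np

def binned_normal_loss(pred_n, gt_n, n_bins=5):
    """L2 loss on normals, reweighted so that rare slanted normals count as
    much as the dominant flat ones.

    pred_n, gt_n: (..., 3) unit normals. Bins are over the z-component of
    the ground-truth normal; each occupied bin contributes equally.
    """
    z = np.clip(gt_n[..., 2], -1.0, 1.0)
    bins = np.minimum((n_bins * (1.0 - z)).astype(int), n_bins - 1)
    err = np.sum((pred_n - gt_n) ** 2, axis=-1)     # per-pixel squared error
    counts = np.bincount(bins.ravel(), minlength=n_bins)
    present = counts > 0
    loss = 0.0
    for b in np.nonzero(present)[0]:
        loss += err[bins == b].mean()               # equal weight per occupied bin
    return loss / max(present.sum(), 1)
```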
The distribution of BRDF parameters is closely related to the surface material type. However, training separate networks for different material types, similar to [10], is expensive. Also, the size of the network grows linearly with the number of material types, which limits utility. Instead, we propose a split-merge network with very little computational overhead.
Given the highest-level features extracted by the encoder, we send them to a classifier to predict the material type. We then evaluate the BRDF parameters for each material type and use the classification results (the output of the softmax layer) as weights. This averages the predictions from different material types to obtain the final BRDF reconstruction. Suppose we have $c$ output channels for the BRDF parameters and $k$ material types. To output the BRDF reconstruction for each type of material, we only modify the last convolutional layer of each decoder so that the number of output channels is $c \times k$ instead of $c$, with $k$ set as shown in Table 2.
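The split-merge fusion can be sketched as follows, with the classifier's softmax output weighting the per-material-type decoder outputs:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_predictions(per_type_maps, cls_logits):
    """Merge per-material-type BRDF maps using classifier softmax weights.

    per_type_maps: (k, H, W, c) decoder outputs, one map per material type;
    cls_logits: (k,) classifier scores. Returns the weighted (H, W, c) map.
    """
    w = softmax(cls_logits)                          # (k,) classification weights
    return np.tensordot(w, per_type_maps, axes=1)    # weighted average over types
```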
The classifier is trained together with the encoder and decoders from scratch, with the weight of each label set to be inversely proportional to the number of examples in Table 2, to balance the material types in the loss function. The overall loss function of our network with the classifier is
$$\mathcal{L}_{total} = \mathcal{L} + \lambda_{cls} \mathcal{L}_{cls}$$
where $\mathcal{L}_{cls}$ is the cross-entropy loss and $\lambda_{cls}$ is set to a small value to limit the gradient magnitude.
4.3 Designing DCRFs for Refinement
The predictions of our base network are quite reasonable. However, accuracy can be further enhanced by post-processing with a DCRF, trained end-to-end.
Diffuse color refinement
For diffuse prediction: when capturing an image of a specular material, parts of the surface may be saturated by specular highlights. This can lead to artifacts in the diffuse color prediction, since the network has to hallucinate the diffuse color from nearby pixels. To remove such artifacts, we incorporate a densely connected continuous conditional random field (DCRF) to smooth the diffuse color prediction. Let $\hat{d}_i$ be the diffuse color prediction of the network at pixel $i$, $\mathbf{p}_i$ its position, and $\bar{I}_i$ the normalized diffuse RGB color of the input image; we use the normalized color to remove the influence of light intensity when measuring the similarity between two pixels. The energy function of the densely connected CRF that is minimized over $\{d_i\}$ for diffuse prediction is:
$$E(\mathbf{d}) = \sum_i \alpha_i \|d_i - \hat{d}_i\|^2 + \sum_{i \neq j} \|d_i - d_j\|^2 \left[ \beta_1 \kappa_1(\mathbf{p}_i; \mathbf{p}_j) + \beta_2 \kappa_2(\mathbf{p}_i, \bar{I}_i; \mathbf{p}_j, \bar{I}_j) \right]$$
Here, $\kappa_1$ and $\kappa_2$ are Gaussian smoothing kernels, while $\alpha_i$ and $\beta_{1,2}$ are coefficients that balance the contributions of the unary and smoothness terms. Notice that we have a spatially varying $\alpha_i$ to allow different unary weights for different pixels. The intuition is that artifacts usually occur near the center of images with specular highlights; for those pixels, we should have lower unary weights so that the CRF learns to predict their diffuse color from nearby pixels.
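A direct (brute-force) evaluation of such an energy on a small image, with illustrative kernel bandwidths, might look like the following; real DCRF inference uses efficient filtering rather than this O(N²) loop.

```python
import numpy as np

def dcrf_diffuse_energy(d, d_hat, pos, ref, alpha, beta, sigma_p=10.0, sigma_c=0.1):
    """Densely connected CRF energy for the diffuse map on a small image.

    Spatially varying unary terms pull each pixel toward the network
    prediction d_hat; Gaussian pairwise kernels on position and normalized
    color penalize differences between similar pixels. Bandwidths are
    illustrative. d, d_hat, ref: (N, 3); pos: (N, 2); alpha: (N,).
    """
    unary = np.sum(alpha * np.sum((d - d_hat) ** 2, axis=1))
    pair = 0.0
    n = len(d)
    for i in range(n):
        for j in range(i + 1, n):
            k_pos = np.exp(-np.sum((pos[i] - pos[j]) ** 2) / (2 * sigma_p ** 2))
            k_col = np.exp(-np.sum((ref[i] - ref[j]) ** 2) / (2 * sigma_c ** 2))
            pair += (k_pos + k_col) * np.sum((d[i] - d[j]) ** 2)
    return unary + beta * pair
```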
Once we have the refined diffuse color, we can use it to improve the prediction of the other BRDF parameters. To reduce noise in the normal prediction, we use a DCRF with two smoothness kernels: one based on pixel position, and a bilateral kernel based on pixel position and the gradient of the diffuse color. The intuition is that pixels with similar diffuse color gradients often have similar normal directions. Let $\hat{\mathbf{n}}_i$ be the normal predicted by the network. The energy function for normal prediction is defined as
$$E(\mathbf{n}) = \sum_i \alpha \|\mathbf{n}_i - \hat{\mathbf{n}}_i\|^2 + \sum_{i \neq j} \|\mathbf{n}_i - \mathbf{n}_j\|^2 \left[ \beta_1 \kappa_1(\mathbf{p}_i; \mathbf{p}_j) + \beta_2 \kappa_2(\mathbf{p}_i, \nabla \bar{I}_i; \mathbf{p}_j, \nabla \bar{I}_j) \right]$$
Since we use a collocated light source to illuminate the material, once we have the normal and diffuse color predictions, we can use them to estimate the roughness term, either by grid search or by a gradient-based method. However, since the microfacet BRDF model is neither convex nor monotonic with respect to the roughness term, there is no guarantee of finding a global minimum. Also, due to noise in the normal and diffuse predictions, as well as environment lighting, it is difficult to obtain an accurate roughness prediction by optimization alone, especially when the glossiness in the image is not apparent. Therefore, we combine the output of the network and the optimization method to obtain a more accurate roughness prediction. We use a DCRF with two unary terms, $\hat{r}_i$ and $\tilde{r}_i$, given by the network prediction and a coarse-to-fine grid search, respectively:
$$E(\mathbf{r}) = \sum_i \left[ \alpha_1 (r_i - \hat{r}_i)^2 + \alpha_2 (r_i - \tilde{r}_i)^2 \right] + \sum_{i \neq j} \beta\, \kappa(\mathbf{p}_i, \bar{I}_i; \mathbf{p}_j, \bar{I}_j)\, (r_i - r_j)^2$$
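A sketch of the coarse-to-fine grid search under an abstract rendering function (hypothetical names; as noted above, the objective is non-convex, so only a local minimum is guaranteed):

```python
import numpy as np

def grid_search_roughness(observed, diffuse, normal, render_fn, levels=3, n=10):
    """Coarse-to-fine grid search for a roughness value that best explains
    the observed image under collocated flash lighting, given fixed diffuse
    and normal estimates.

    render_fn(diffuse, normal, r) is assumed to render an image; this is a
    sketch, not the paper's exact optimizer.
    """
    lo, hi = 1e-3, 1.0
    best = lo
    for _ in range(levels):
        grid = np.linspace(lo, hi, n)
        errs = [np.mean((render_fn(diffuse, normal, r) - observed) ** 2) for r in grid]
        best = grid[int(np.argmin(errs))]
        step = (hi - lo) / (n - 1)
        # Refine the search window around the current best value.
        lo, hi = max(best - step, 1e-3), min(best + step, 1.0)
    return float(best)
```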
All DCRF coefficients are learned in an end-to-end manner. We use a different set of DCRF parameters for each material type to increase model capacity; during both training and testing, the classifier output is used to average the parameters from the different material types to determine the DCRF parameters. More implementation details are in the supplementary material.
5 Experiments
In this section, we demonstrate our method and compare it to baselines on a wide range of synthetic and real data.
Rendering synthetic training dataset
To create our synthetic data, we apply the SVBRDFs to planar surfaces and render them using Mitsuba with BRDF importance sampling. We choose a camera field of view that mimics typical mobile phone cameras. To better model real-world lighting conditions, we render images under a combination of a dominant point light (flash) and an environment map, applying random rotations to the environment map. We sample the light source position from a Gaussian distribution centered at the camera to make inference robust to differences across real-world mobile phones. We render linear images, clamped to mimic cameras with insufficient dynamic range. However, we still wish to reconstruct the full dynamic range of the SVBRDF parameters. To aid in this, we can render HDR images using our in-network rendering layer and compute the reconstruction error w.r.t. HDR ground truth images. In practice, this leads to unstable gradients during training; we mitigate this by applying a gamma and minor clamping when computing the image reconstruction loss. We find that this, in addition to our L2 losses on the SVBRDF parameters, allows us to hallucinate details from saturated images.
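A sketch of the gamma-compressed reconstruction loss; the gamma value of 2.2 and the optional clamp are left as parameters, since they stand in for the paper's (unspecified here) settings.

```python
import numpy as np

def tonemapped_l2(pred_hdr, gt_hdr, gamma=2.2, clamp=None):
    """Reconstruction loss on gamma-compressed HDR renderings.

    Gamma compression shrinks large radiance values so their gradients stay
    stable during training; the optional clamp mimics the minor clamping
    described above (exact value assumed unknown here).
    """
    def tm(x):
        x = np.maximum(x, 0.0) ** (1.0 / gamma)
        return np.minimum(x, clamp) if clamp is not None else x
    return float(np.mean((tm(pred_hdr) - tm(gt_hdr)) ** 2))
```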
We use the Adam optimizer to train our network, with different loss weights when training the encoder-decoders and the classifier. The initial learning rates are set separately for the encoder, the three decoders and the classifier, and are halved every two epochs. Since we find that the diffuse color and normal direction contribute much more to the final appearance, we first train their encoder-decoders for 15 epochs, then fix the encoder and train the roughness decoder separately for 8 epochs. Finally, we fix the network and train the parameters of the DCRFs, again using the Adam optimizer to update their coefficients.
5.1 Results on Synthetic Data
Figure 4 shows results of our network on our synthetic test dataset. We can observe that spatially varying surface normals, diffuse albedo and roughness are recovered at high quality, which allows relighting under novel light source directions that are very different from the input. To further demonstrate our BRDF reconstruction quality, in Figure 5, we show relighting results under different environment maps and point lights at oblique angles. Note that our relighting results closely match the ground truth even under different lighting conditions; this indicates the accuracy of our reconstructions.
We next perform quantitative ablation studies to evaluate various components of our network design and study comparisons to prior work.
Effects of material classifier and DCRF:
The ablation study summarized in Table 2 shows that adding the material classifier reduces the L2 error for SVBRDF and normal estimation, as well as the rendering error. This validates the intuition that the network can exploit the correlation between BRDF parameters and material type to produce better estimates. We also observe that training the classifier together with the BRDF reconstruction network yields significantly lower material classification error than a pure material classification network, indicating that features trained for BRDF estimation are also useful for material recognition. In our experiments, incorporating the classifier without using its output to fuse BRDF reconstruction results does not improve BRDF estimation. Figure 6 shows a sample reconstruction where the classifier and the DCRF qualitatively improve the BRDF estimation, especially for the diffuse albedo.
Effect of acquisition under point illumination
Next we evaluate the effect of using point illumination during acquisition. For this, we train and test two variants of our full network – one on images rendered under only environment illumination (cls-env) and another on images illuminated by a point light in addition to environment illumination (cls-pt). Results are in Table 6, with qualitative visualizations in Figure 6. The model of [10] in Table 6, which is trained for environment lighting, performs slightly worse than our environment lighting network cls-env. But our network trained and evaluated on point and environment lighting, cls-pt, easily outperforms both. We argue this is because a collocated point light creates more consistent illumination across training and test images, while also capturing higher-frequency information. Figure 7 illustrates this: the appearance of the same material can vary significantly under different environment lighting, and the network must be invariant to this variation, which limits reconstruction quality.
Relative effects of flash and environment light intensities
In Figure 8, we train and test on a range of relative flash intensities, showing that our network works well for each. Note that as the relative flash intensity decreases, errors increase, which justifies our use of the flash. Using flash/no-flash pairs can help remove environment lighting, but requires alignment of the two images, which limits applicability.
5.2 Results on Real Data
To verify the generalizability of our method to real data, we show results on real images captured with different mobile devices in both indoor and outdoor environments. We capture linear RAW images (with potentially clipped highlights) with the flash enabled, using the Adobe Lightroom Mobile app. The mobile phones were hand-held and the optical axis of the camera was only approximately perpendicular to the surfaces (see Figure 1).
Qualitative results with different mobile phones
Figure 9 presents SVBRDF and normal estimation results for real images captured with three different mobile devices: Huawei P9, Google Tango and iPhone 6s. We observe that even with a single image, our network successfully predicts the SVBRDF and normals, and images rendered using the predicted parameters appear very similar to the input. The exact same network generalizes well across different mobile devices, which shows that our data augmentation successfully helps the network factor out variations across devices. For some materials with specular highlights, the network can hallucinate information lost due to saturation. The network can also reconstruct reasonable normals for complex instances.
Material editing
We can edit the reconstructed SVBRDFs by transferring material properties. Figure 11 shows an example where we transfer BRDF properties across different material types and render under a novel lighting condition.
5.3 Further Comparisons with Prior Works
Comparison with the two-shot BRDF method of [8]
The two-shot method of [8] can only handle images with stationary textures, while our method can reconstruct arbitrarily varying SVBRDFs. For a meaningful comparison, in Figure 13, we compare our method with [8] on a rendered stationary texture. Even for this restrictive material type, the normal maps reconstructed by the two methods are quite similar, but the diffuse map reconstructed by our method is closer to the ground truth. While [8] takes hours to reconstruct a patch, our method requires only seconds. Moreover, the aligned flash and no-flash pair required by [8] is not trivial to acquire (especially on mobile cameras with effects like rolling shutter), making our single-image BRDF estimation more practical.
Comparison of normals with environment lighting and photometric stereo
In Figure 13, we compare our normal maps and the output of a single-image SVBRDF reconstruction method under environment lighting [10] against photometric stereo. We observe that the normals reconstructed by our method are of higher quality than those of [10], with details comparable to or sharper than photometric stereo.
The appendix provides further experiments and details, including:
Details of data augmentation, continuous DCRF and visualization of weights
Spherical renderings of estimated real spatially varying BRDFs
Visualization of SVBRDF estimation with respect to prediction error
Further qualitative results on synthetic and real data.
6 Conclusion
We have proposed a framework for acquiring spatially-varying BRDFs from a single mobile phone image. Our solution uses a convolutional neural network whose architecture is specifically designed to capture various physical insights into the problem of BRDF estimation. We also propose a dataset that is larger and better-suited to material estimation than prior ones, as well as simple acquisition settings that are nevertheless effective for SVBRDF estimation. Our network generalizes very well to real data, obtaining high-quality results in unconstrained test environments. A key goal of our work is to take accurate material estimation from expensive and controlled lab setups into the hands of non-expert users with consumer devices, thereby opening the door to new applications. Our future work will take the next step of acquiring SVBRDFs with unknown shapes, as well as studying the role of other semantic signals, such as object categories, in material estimation.
References
-  Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 2650–2658
-  Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In: European Conference on Computer Vision, Springer (2016) 628–644
-  Nicodemus, F.E.: Directional reflectance and emissivity of an opaque surface. Applied optics 4(7) (1965) 767–775
-  Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)
-  Nathan Silberman, Derek Hoiem, P.K., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV. (2012)
-  Debevec, P., Hawkins, T., Tchou, C., Duiker, H.P., Sarokin, W., Sagar, M.: Acquiring the reflectance field of a human face. In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques, ACM Press/Addison-Wesley Publishing Co. (2000) 145–156
-  Marschner, S.R., Westin, S.H., Lafortune, E.P., Torrance, K.E., Greenberg, D.P.: Image-based brdf measurement including human skin. In: Rendering Techniques’ 99. Springer (1999) 131–144
-  Aittala, M., Weyrich, T., Lehtinen, J., et al.: Two-shot svbrdf capture for stationary materials. ACM Trans. Graph. 34(4) (2015) 110–1
-  Aittala, M., Aila, T., Lehtinen, J.: Reflectance modeling by neural texture synthesis. ACM Transactions on Graphics (TOG) 35(4) (2016) 65
-  Li, X., Dong, Y., Peers, P., Tong, X.: Modeling surface appearance from a single photograph using self-augmented convolutional neural networks. ACM Trans. Graph. 36(4) (July 2017) 45:1–45:11
-  Blinn, J.F., Newell, M.E.: Texture and reflection in computer generated images. Communications of the ACM 19(10) (1976) 542–547
-  Cook, R.L., Torrance, K.E.: A reflectance model for computer graphics. ACM Transactions on Graphics (TOG) 1(1) (1982) 7–24
-  Ward, G.J.: Measuring and modeling anisotropic reflection. ACM Transactions on Graphics (TOG) 26(2) (1992) 265–272
-  Oren, M., Nayar, S.K.: Generalization of the lambertian model and implications for machine vision. International Journal on Computer Vision (IJCV) 14(3) (1995) 227–251
-  Burley, B.: Physically-based shading at disney. In: ACM SIGGRAPH 2012 Courses. (2012)
-  Matusik, W., Pfister, H., Brand, M., McMillan, L.: A data-driven reflectance model. ACM Transactions on Graphics (TOG) 22(3) (2003) 759–769
-  Nielsen, J.B., Jensen, H.W., Ramamoorthi, R.: On optimal, minimal brdf sampling for reflectance acquisition. ACM Transactions on Graphics (TOG) 34(6) (2015) 186
-  Xu, Z., Nielsen, J.B., Yu, J., Jensen, H.W., Ramamoorthi, R.: Minimal brdf sampling for two-shot near-field reflectance acquisition. ACM Transactions on Graphics (TOG) 35(6) (2016) 188
-  Romeiro, F., Vasilyev, Y., Zickler, T.: Passive reflectometry. In: European Conference on Computer Vision (ECCV). (2008)
-  Romeiro, F., Zickler, T.: Blind reflectometry. In: European Conference on Computer Vision (ECCV). (2010)
-  Oxholm, G., Nishino, K.: Shape and reflectance estimation in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(2) (2016) 376–389
-  Chandraker, M.: On shape and material recovery from motion. In: European Conference on Computer Vision, Springer (2014) 202–217
-  Chandraker, M.: What camera motion reveals about shape with unknown brdf. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 2171–2178
-  Chandraker, M.: The information available to a moving observer on shape with unknown, isotropic brdfs. IEEE transactions on pattern analysis and machine intelligence 38(7) (2016) 1283–1297
-  Hui, Z., Sankaranarayanan, A.C.: A dictionary-based approach for estimating shape and spatially-varying reflectance. In: International Conference on Computational Photography (ICCP). (2015)
-  Hui, Z., Sunkavalli, K., Lee, J.Y., Hadap, S., Wang, J., Sankaranarayanan, A.C.: Reflectance capture using univariate sampling of BRDFs. In: IEEE Intl. Conf. Computer Vision (ICCV). (2017)
-  Bell, S., Upchurch, P., Snavely, N., Bala, K.: Material recognition in the wild with the materials in context database. Computer Vision and Pattern Recognition (CVPR) (2015)
-  Liu, G., Ceylan, D., Yumer, E., Yang, J., Lien, J.M.: Material editing using a physically based rendering network. ICCV (2017)
-  Narihira, T., Maire, M., Yu, S.X.: Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 2992–2992
-  Shelhamer, E., Barron, J.T., Darrell, T.: Scene intrinsics and depth from a single image. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. (2015) 37–44
-  Shi, J., Dong, Y., Su, H., Yu, S.X.: Learning non-lambertian object intrinsics across shapenet categories. arXiv preprint arXiv:1612.08510 (2016)
-  Rematas, K., Ritschel, T., Fritz, M., Gavves, E., Tuytelaars, T.: Deep reflectance maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 4508–4516
-  Georgoulis, S., Rematas, K., Ritschel, T., Fritz, M., Van Gool, L., Tuytelaars, T.: Delight-net: Decomposing reflectance maps into specular materials and natural illumination. arXiv preprint arXiv:1603.08240 (2016)
-  Kim, K., Gu, J., Tyree, S., Molchanov, P., Nießner, M., Kautz, J.: A lightweight approach for on-the-fly reflectance estimation. arXiv preprint arXiv:1705.07162 (2017)
-  Ristovski, K., Radosavljevic, V., Vucetic, S., Obradovic, Z.: Continuous conditional random fields for efficient regression in large fully connected graphs. In: AAAI. (2013)
-  Xu, D., Ricci, E., Ouyang, W., Wang, X., Sebe, N.: Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. arXiv preprint arXiv:1704.02157 (2017)
-  Jakob, W.: Mitsuba renderer (2010) http://www.mitsuba-renderer.org.
-  Karis, B.: Real shading in unreal engine 4. In: SIGGRAPH Course: Physically Based Shading in Theory and Practice (2013)
-  Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Appendix 0.A Further Experimental Analysis
Error distribution on test set
To provide better intuition into our quantitative results, we plot the distributions of prediction errors for diffuse albedo, normals, roughness and relighting in Figure 14. Then, we sort the BRDF reconstruction results in the test set by error and illustrate the estimation and relighting quality for a random material picked from various percentiles of the above error distribution. The qualitative comparison is shown in Figure 15.
Even though our network is trained end-to-end, we observe physically meaningful trends in Figure 14. For instance, the materials that correspond to lower error percentiles tend to have flat normals, uniform diffuse color and wide specular lobes. On the other hand, materials with higher errors tend to have more complex normals, stronger local variations in diffuse color and roughness, or more prominent highlights. This demonstrates the benefits of our network design, which accounts for the underlying problem structure. We also observe that normal and diffuse color estimates remain quite accurate even at high error percentiles, which contributes to reasonable relighting results under novel lighting.
Appendix 0.B Further Results on Real Data
Comparison with photometric stereo as reference
In Figure 16, we compare the normals estimated by our method with those of prior work, using the normal map from photometric stereo as reference. In the main paper, we use a photometric stereo method from prior work; here, we instead use a simpler but more robust method. We acquire images of a material sample under different directional point light sources, discard the 5 brightest and 5 darkest observations, and use the rest for Lambertian photometric stereo. We find this method to be quite robust to shadows, as well as to the effects of complex BRDFs such as glossiness or specularity. We observe that our method is able to capture very fine details in the normal map, better than the prior method; for instance, note the detail within the grooves of the material in the first and third rows. This demonstrates the efficacy of the proposed method for normal and SVBRDF estimation.
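A minimal NumPy sketch of the trimmed Lambertian photometric stereo described above; the per-pixel trimming of extreme observations and the function name are our assumptions, not the authors' code:

```python
import numpy as np

def lambertian_photometric_stereo(images, lights, n_drop=5):
    """Per-pixel Lambertian photometric stereo with trimming.

    images: (K, H, W) grayscale observations under K directional lights.
    lights: (K, 3) unit light directions.
    For each pixel, the n_drop brightest and n_drop darkest observations
    are discarded (to suppress highlights and shadows) and the remaining
    measurements are fit by least squares to I = L @ (albedo * normal).
    """
    K, H, W = images.shape
    I = images.reshape(K, -1)
    order = np.argsort(I, axis=0)          # per-pixel brightness ranking
    keep = order[n_drop:K - n_drop]        # drop darkest and brightest
    normals = np.zeros((H * W, 3))
    for p in range(H * W):
        L = lights[keep[:, p]]
        g, *_ = np.linalg.lstsq(L, I[keep[:, p], p], rcond=None)
        norm = np.linalg.norm(g)
        normals[p] = g / norm if norm > 1e-8 else np.array([0.0, 0.0, 1.0])
    return normals.reshape(H, W, 3)
```

The vector recovered by least squares is the albedo-scaled normal, so normalizing it yields both the unit normal and (via its magnitude) the Lambertian albedo.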
Real data results in unconstrained environments
In Figure 17, we show several more examples of surface normal and BRDF estimation with real data using the proposed method. The images are acquired in unconstrained settings with the camera flash enabled, for several different material types derived from wood flooring, tiles, carpets and so on. In all rows, the mobile phone is hand-held and only approximately parallel to the surface. In each case, we observe that the recovered normals, as well as the diffuse albedo and specular components of the spatially-varying BRDF, appear qualitatively correct. In some cases, such as the second row of the first column, we observe that even very tight specular lobes are well-estimated, as evident from the lobe's compactness in the relighted image. The first row was captured with an iPhone 6s, the second row with a Huawei P9 and the last three rows with a Lenovo Phab 2. Even though we never calibrate the mobile phones, our network generalizes well across devices.
Another visualization for relighting
For another visualization of the normal and BRDF estimation on real data, we render the estimated material on a sphere illuminated under an oblique lighting direction that is very different from the input lighting. Recall that we only use an approximately planar patch of material as input. The BRDF estimation and relighted sphere are illustrated in Figure 18. We observe that the appearance of the sphere even under a novel lighting direction is quite reasonable.
Appendix 0.C Microfacet BRDF Model
We use the microfacet BRDF model proposed in prior work. Let $d_p$, $n_p$ and $r_p$ be the diffuse color, normal and roughness, respectively, at pixel $p$, and let $I(p)$ be its intensity observed by the camera. Our BRDF model is defined as
$$I(p) = d_p + \frac{D(h_p, r_p)\,F(v_p, h_p)\,G(l_p, v_p, n_p, r_p)}{4\,(n_p \cdot l_p)(n_p \cdot v_p)},$$
where $v_p$ and $l_p$ are the view and light directions, while $h_p$ is the half angle vector. Further, $D$, $F$ and $G$ are the distribution, Fresnel and geometric terms, respectively, which are defined as
$$D(h, r) = \frac{\alpha^2}{\pi\left((n \cdot h)^2(\alpha^2 - 1) + 1\right)^2}, \qquad \alpha = r^2,$$
$$F(v, h) = F_0 + (1 - F_0)\,2^{(-5.55473\,(v \cdot h) - 6.98316)\,(v \cdot h)},$$
$$G(l, v, n, r) = G_1(v, n)\,G_1(l, n), \qquad G_1(v, n) = \frac{n \cdot v}{(n \cdot v)(1 - k) + k}, \qquad k = \frac{(r + 1)^2}{8},$$
with $F_0$ the specular reflectance at normal incidence. For a dielectric material, the value of $F_0$ is determined by the index of refraction $\eta$:
$$F_0 = \left(\frac{\eta - 1}{\eta + 1}\right)^2.$$
For a conductor material, it is determined by the index of refraction $\eta$ and the absorption coefficient $\kappa$:
$$F_0 = \frac{(\eta - 1)^2 + \kappa^2}{(\eta + 1)^2 + \kappa^2}.$$
When rendering our dataset, we use the conductor value of the specular reflectance at normal incidence for metals, and the dielectric value for other kinds of materials. Figure 19 shows an example of a smooth aluminum material rendered with each of the two settings. We observe that the material rendered with the conductor value has a much larger area of specular highlight, which matches the appearance of metals in practice.
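For concreteness, here is a NumPy sketch of the microfacet terms described in this appendix (GGX-style distribution, Schlick-style Fresnel with a spherical-Gaussian exponent, Smith geometry) together with the two normal-incidence reflectance formulas; the clamping conventions and the way diffuse and specular terms are combined are our assumptions:

```python
import numpy as np

def f0_dielectric(eta):
    # Specular reflectance at normal incidence for a dielectric.
    return ((eta - 1.0) / (eta + 1.0)) ** 2

def f0_conductor(eta, kappa):
    # F0 for a conductor with index of refraction eta, absorption kappa.
    return ((eta - 1.0) ** 2 + kappa ** 2) / ((eta + 1.0) ** 2 + kappa ** 2)

def microfacet_shade(diffuse, normal, roughness, v, l, f0):
    """Evaluate diffuse + microfacet specular for one pixel.

    diffuse: (3,) albedo; normal, v, l: unit 3-vectors; roughness in (0, 1].
    """
    h = (v + l) / np.linalg.norm(v + l)
    ndh = max(float(normal @ h), 0.0)
    ndv = max(float(normal @ v), 1e-6)
    ndl = max(float(normal @ l), 1e-6)
    vdh = max(float(v @ h), 0.0)
    alpha = roughness ** 2
    # D: GGX normal distribution with alpha = roughness^2
    D = alpha ** 2 / (np.pi * ((ndh ** 2) * (alpha ** 2 - 1.0) + 1.0) ** 2)
    # F: Schlick Fresnel with a spherical-Gaussian exponent approximation
    F = f0 + (1.0 - f0) * 2.0 ** ((-5.55473 * vdh - 6.98316) * vdh)
    # G: Smith shadowing-masking with k = (roughness + 1)^2 / 8
    k = (roughness + 1.0) ** 2 / 8.0
    G = (ndv / (ndv * (1.0 - k) + k)) * (ndl / (ndl * (1.0 - k) + k))
    return diffuse + D * F * G / (4.0 * ndv * ndl)
```

Note that `f0_conductor` with a large absorption coefficient yields a much higher reflectance at normal incidence than a typical dielectric, consistent with the stronger, wider specular highlights observed for metals.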
Appendix 0.D Details of Continuous DCRFs
We use continuous densely connected conditional random fields (DCRFs) as post-processing to remove artifacts caused by saturated highlights and noise in the neural network's predictions [40, 36]. We customize the DCRFs to better suit our problem of spatially-varying BRDF reconstruction. The distinguishing factor of our DCRF construction is the design of spatially-varying weight maps, which allow incorporating domain-specific knowledge into the CRF inference. In the following, we discuss the design of and intuition behind these weight maps, as well as the details of training and inference for the DCRF.
Weight Maps of DCRFs
We first discuss the DCRF for diffuse albedo prediction. Its energy function is defined as
$$E(a) = \sum_i \alpha_i\,(a_i - \hat{a}_i)^2 + \sum_{i,j} (a_i - a_j)^2 \sum_s \beta_s\,\kappa_s(f_i, f_j),$$
where $\hat{a}_i$ is the network prediction at pixel $i$ and the $\kappa_s$ are Gaussian kernels over pixel features $f_i$.
Here, the coefficient $\alpha_i$ is spatially varying; a larger $\alpha_i$ indicates greater confidence in the prediction from the neural network. Since we use a colocated point light source for illumination, one observation is that saturation caused by the specular highlight usually occurs near the middle of the image. Another observation is that, since the flash illumination is white, saturated pixels are usually white as well, which means the minimum of their RGB values is large. Therefore, for regions near the center of the image or regions with specular highlights, we use a smaller unary weight so that the DCRF can smooth out the artifacts. Based on these two observations, we define the weight map for the unary term as a function that decreases both with proximity to the image center and with the value of $m_i$, the minimum of the three color channels at pixel $i$. Two of its parameters are learnable and are given initial values at the beginning of the training process, while the remaining parameters are held fixed throughout training. Figure 20 shows examples of the weight map for diffuse albedo prediction.
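Since the exact functional form and constants of this weight map are not recoverable here, the following is only a hypothetical sketch encoding the two stated observations (distance to the image center, and the minimum color channel as a saturation cue); the function name and all parameter values are illustrative assumptions:

```python
import numpy as np

def unary_weight_map(image, w_center=8.0, w_sat=8.0, sat_thresh=0.8):
    """Hypothetical unary weight map for the albedo DCRF.

    image: (H, W, 3) in [0, 1]. Returns (H, W) weights in (0, 1).
    Weights shrink near the image center (colocated flash highlight)
    and where the minimum RGB channel is large (near-white saturation).
    """
    H, W, _ = image.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # normalized distance from the image center
    dist = np.hypot((ys - H / 2) / H, (xs - W / 2) / W)
    m = image.min(axis=2)  # minimum of the three color channels per pixel
    center_term = 1.0 / (1.0 + np.exp(-w_center * (dist - 0.25)))
    sat_term = 1.0 / (1.0 + np.exp(w_sat * (m - sat_thresh)))
    return center_term * sat_term
```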
For normal prediction, we do not observe such a strong correlation between the prediction error and image position or intensity. Therefore, we simply set a uniform unary weight for every pixel in the image. The energy function is defined as
$$E(n) = \alpha \sum_i \|n_i - \hat{n}_i\|^2 + \beta_1 \sum_{i,j} \|n_i - n_j\|^2\,\kappa_1(p_i, p_j) + \beta_2 \sum_{i,j} \|n_i - n_j\|^2\,\kappa_2(p_i, a_i; p_j, a_j),$$
where $\alpha$, $\beta_1$ and $\beta_2$ are learnable parameters that trade off relative confidences in the unary term, a pairwise smoothness prior, and a prior on the correlation between normals and albedo boundaries.
Finally, for roughness prediction, the energy function is defined as
$$E(r) = \sum_i \left[\alpha_0\,(r_i - \hat{r}_{0,i})^2 + \alpha_{1,i}\,(r_i - \hat{r}_{1,i})^2\right] + \sum_{i,j} (r_i - r_j)^2 \sum_s \beta_s\,\kappa_s(f_i, f_j),$$
where $\hat{r}_{0}$ is the prediction from the network and $\hat{r}_{1}$ is the prediction from a grid search. We find that the prediction from grid search is usually accurate only near the glossy regions, which means these regions should have a larger $\alpha_{1,i}$. Therefore, we define the weight map $\alpha_{1,i}$ to increase in such regions, while $\alpha_0$ is constant across the whole image. Both weights can be learned through backpropagation; we set their initial values before training.
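To make the continuous DCRF objective concrete, here is a small illustrative sketch that minimizes an energy of this general type by gradient descent. The paper's inference instead relies on efficient dense Gaussian filtering; the function name and parameter values here are assumptions, and forming the dense kernel matrix explicitly only scales to toy problems:

```python
import numpy as np

def dcrf_refine(x_hat, alpha, features, beta=0.1, bandwidth=1.0,
                lr=0.1, iters=200):
    """Minimize E(x) = sum_i alpha_i (x_i - x_hat_i)^2
                     + beta * sum_{i,j} k_ij (x_i - x_j)^2
    by gradient descent, with Gaussian kernels
    k_ij = exp(-||f_i - f_j||^2 / (2 * bandwidth^2)).
    """
    f = np.asarray(features, dtype=float)
    d2 = ((f[:, None, :] - f[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))
    np.fill_diagonal(K, 0.0)
    x = x_hat.astype(float).copy()
    for _ in range(iters):
        grad_unary = 2.0 * alpha * (x - x_hat)
        # derivative of the symmetric pairwise term w.r.t. x_i
        grad_pair = 4.0 * beta * (K.sum(axis=1) * x - K @ x)
        x -= lr * (grad_unary + grad_pair)
    return x
```

With identical features the kernel couples every pair of pixels equally, so an outlier in the network prediction is pulled toward its neighbors while the unary term keeps inliers close to their predictions.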
Hyperparameters for Training and Inference
In order to increase the capacity of the DCRF model, we learn a different set of DCRF coefficients for each type of material. During both training and inference, we average the DCRF coefficients according to the output of our material classifier. To enhance the robustness of our method, we further re-parameterize the coefficients before optimization.
We clip the DCRF coefficients to always be positive. We use the Adam optimizer to optimize the coefficients; the learning rate is reduced by half after every 2000 iterations. We adopt the training method from prior work on continuous DCRFs. The batch size is set to 32. We train the DCRF for diffuse albedo prediction for 4000 iterations and the DCRFs for roughness and normal prediction for 3000 iterations. The standard deviations of the Gaussian smoothing kernels for the three DCRFs are shown in Table 3.
Table 3: Gaussian kernels of the DCRFs for the diffuse albedo, normal map and roughness map.
Appendix 0.E Details of Data Augmentation
In experiments, besides rotating and cropping the original high-resolution spatially-varying materials, another important data augmentation is to scale the BRDF parameters of each patch before rendering them into images. For the diffuse albedo, we uniformly sample scale coefficients from a fixed range. For the normal map, we sample scale coefficients in the same way, apply them componentwise, and then normalize the normal vectors to unit length. For roughness, we sample scale coefficients from a Gaussian distribution with fixed mean and standard deviation. Empirically, we observe that such data augmentation greatly improves the generalization ability of the network; for example, simply scaling the roughness parameter of each patch noticeably decreases the validation error for roughness prediction.
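Since the exact sampling ranges are not recoverable from this text, the following sketch uses placeholder ranges; only the structure (uniform albedo scaling, componentwise normal scaling followed by renormalization, Gaussian roughness scaling) follows the description above:

```python
import numpy as np

def augment_brdf(diffuse, normal, roughness, rng,
                 albedo_range=(0.8, 1.2), rough_mean=1.0, rough_std=0.2):
    """Scale the BRDF parameters of one patch before rendering.

    diffuse: (H, W, 3); normal: (H, W, 3) unit vectors; roughness: (H, W).
    The numeric ranges here are illustrative placeholders, not the
    values used in the paper.
    """
    d = np.clip(diffuse * rng.uniform(*albedo_range), 0.0, 1.0)
    n = normal * rng.uniform(*albedo_range, size=3)   # componentwise scale
    n = n / np.linalg.norm(n, axis=2, keepdims=True)  # back to unit length
    r = np.clip(roughness * rng.normal(rough_mean, rough_std), 1e-3, 1.0)
    return d, n, r
```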