Reconstructing the 3D geometry and reflectance properties of an object from 2D images has been a long-standing problem in computer vision and graphics, with applications including 3D visualization, relighting, and augmented and virtual reality. Traditionally this has been accomplished using complex acquisition systems [5, 18, 42, 46, 57] or multi-view stereo (MVS) methods [14, 41] applied to dense image sets [35, 50]. The acquisition requirements of these methods significantly limit their practicality. Recently, deep neural networks have been proposed for material estimation from a single or a few images. However, many of these methods are restricted to estimating the spatially-varying BRDF (SVBRDF) of planar samples [11, 16, 32]. Li et al. demonstrate shape and reflectance reconstruction from a single image, but their reconstruction quality is limited by their single-image input.
Our goal is to enable practical and high-quality shape and appearance acquisition. To this end, we propose using a simple capture setup: a sparse set of six cameras—placed at one vertex and the centers of the adjoining faces of a regular icosahedron, forming a cone—with collocated point lighting (Fig. 2 left). Capturing six images should allow for better reconstruction compared to single image methods. However, at such wide baselines, the captured images have few correspondences and severe occlusions, making it challenging to fuse information across viewpoints.
As illustrated in Fig. 2, we propose a two-stage approach to address this problem. First, we design multi-view geometry and reflectance estimation networks that regress the 2D depth, normals and reflectance for each input view by robustly aggregating information across all sparse viewpoints. We estimate the depth for each input view using a deep multi-view stereo network [51, 54] (Sec. 3.1). Because of our sparse capture, these depth maps contain errors and cannot be used to accurately align the images to estimate per-vertex BRDFs [35, 57]. Instead, we use these depth maps to warp the images to one viewpoint and use a novel deep multi-view reflectance estimation network to estimate per-pixel normals and reflectance (parameterized by diffuse albedo, specular albedo and roughness in a simplified Disney BRDF model) for that viewpoint (Sec. 3.2). This network extracts features from the warped images, aggregates them across viewpoints using max-pooling, and decodes the pooled features to estimate the normals and SVBRDF for that viewpoint. This way of aggregating multi-view information leads to more robust reconstruction than baseline approaches like a U-Net architecture, and we use it to recover normals and reflectance for each view.
Second, we propose a novel method to fuse these per-view estimates into a single mesh with per-vertex BRDFs using optimization in a learnt reflectance space. First, we use Poisson reconstruction to construct a mesh from the estimated per-view depth and normal maps (Sec. 3.3). Each mesh vertex has multiple reflectance parameters corresponding to each per-view reflectance map, and we fuse these estimates to reconstruct object geometry and reflectance that will accurately reproduce the input images. Instead of optimizing the per-vertex reflectance parameters, which leads to outliers and spatial discontinuities, we optimize the latent features of our multi-view reflectance estimation network (Sec. 3.4). We pass these latent features to the reflectance decoder to construct per-view SVBRDFs, fuse them using per-vertex blending weights, and render them to compute the photometric error for all views. This entire pipeline is differentiable, allowing us to backpropagate this error and iteratively update the reflectance latent features and per-vertex weights until convergence. This process refines the reconstruction to best match the specific captured images, while leveraging the priors learnt by our reflectance estimation network.
We train our networks with a large-scale synthetic dataset comprised of procedurally generated shapes with complex SVBRDFs [51, 53] and rendered using a physically-based renderer. While our method is trained with purely synthetic data, it generalizes well to real scenes. This is illustrated in Figs. 1 and 8, where we are able to reconstruct real objects with complex geometry and non-Lambertian reflectance. Previous state-of-the-art methods, when applied to sparse input images for such objects, produce incomplete, noisy geometry and erroneous reflectance estimates (Figs. 4 and 7). In contrast, our work is the first to reconstruct detailed geometry and high-quality reflectance from sparse multi-view inputs, allowing us to render photorealistic images under novel view and lighting.
2 Related Works
3D reconstruction. To reconstruct 3D geometry from image sets, traditional methods [15, 29, 41] find correspondences between two or more images utilizing specific image features. Such methods are sensitive to illumination changes, non-Lambertian reflectance and textureless surfaces. The existence of multiple points with similar matching costs also requires these methods to use a large number of images to obtain high-quality reconstructions (we refer the interested readers to  for more details). In contrast, our method reconstructs high-quality geometry for complex real scenes from an order of magnitude fewer images.
Recently, numerous learning-based methods have been proposed to reconstruct 3D shape using various geometric representations, including regular volumes [22, 37, 48], point clouds [1, 43] and depth maps [19, 54]. These methods cannot produce high-resolution 3D meshes. We extend recent learning-based MVS frameworks [51, 54] to estimate depth from sparse multi-view images of objects with complex reflectance. We combine this depth with estimated surface normals to reconstruct 3D meshes with fine details.
SVBRDF acquisition. SVBRDF acquisition is a challenging task that often requires a dense input image set [13, 35, 50]. Many methods utilize sophisticated hardware  or light patterns [18, 23, 42]. Reconstruction from sparse images has been demonstrated for planar objects [3, 32, 52], and known geometry . In contrast, we reconstruct the geometry and complex reflectance of arbitrary objects from a sparse set of six input images.
Photometric stereo methods have been proposed to reconstruct arbitrary shape and SVBRDFs [4, 17]; however, they focus on single-view reconstruction and require hundreds of images. Recent works [20, 35] utilize images captured by a collocated camera-light setup for shape and SVBRDF estimation. In particular, Nam et al.  capture more than sixty images and use multi-view reconstruction and physics-based optimization to recover geometry and reflectance. In contrast, by designing novel deep networks, we are able to reconstruct objects from only six images.
Learning-based methods have been applied for normal and SVBRDF acquisition. Deep photometric stereo methods reconstruct surface normals from tens to hundreds of images [7, 8] but they do not address reflectance or 3D geometry estimation. Most deep SVBRDF acquisition methods are designed for planar samples [2, 11, 12, 16, 31, 32]. Some recent multi-image SVBRDF estimation approaches pool latent features from multiple views  and use latent feature optimization  but they only handle planar objects. Li et al.  predict depth and SVBRDF from a single image; however, a single input does not provide enough information to accurately reconstruct geometry and reflectance. By capturing just six images, our approach generates significantly higher quality results.
Our goal is to accurately reconstruct the geometry and SVBRDF of an object with a simple acquisition setup. Recent work has utilized collocated point illumination for reflectance estimation from a sparse set of images [2, 3, 11, 32]; such lighting minimizes shadows and induces high-frequency effects like specularities, making reflectance estimation easier. Similarly, Xu et al.  demonstrate novel view synthesis from sparse multi-view images of a scene captured under a single point light.
Motivated by this, we utilize a capture system similar to that of Xu et al.: six cameras placed at one vertex of a regular icosahedron and the centers of the five faces adjoining that vertex. Unlike their use of a single point light for all images, we capture each image under a point light (nearly) collocated with the corresponding camera (see Fig. 2 left). The setup is calibrated, giving us a set of input images with the corresponding camera calibration. This wide-baseline setup, with an angle of between the center and boundary views, makes it possible to image the entire object with a small set of cameras. In the following, we describe how we reconstruct an object from these sparse input images.
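To make the camera layout concrete, the following is a minimal NumPy sketch that builds the six viewing directions from a regular icosahedron (one vertex plus the centers of its five adjoining faces) and measures the angle between the center view and the boundary views. The function names and the choice of vertex are assumptions made for this illustration; the sketch covers only the idealized geometry, not the calibrated positions of a physical rig.

```python
import itertools
import numpy as np

PHI = (1 + 5 ** 0.5) / 2  # golden ratio


def icosahedron_vertices():
    """12 vertices of a regular icosahedron with edge length 2."""
    verts = []
    for s1, s2 in itertools.product((-1, 1), repeat=2):
        verts += [(0, s1, s2 * PHI), (s1, s2 * PHI, 0), (s2 * PHI, 0, s1)]
    return np.array(verts, dtype=float)


def camera_directions(apex_index=0):
    """Unit directions: the apex vertex first, then its 5 adjoining face centers."""
    verts = icosahedron_vertices()
    apex = verts[apex_index]
    dist = np.linalg.norm(verts - apex, axis=1)
    # Neighbors of the apex are exactly one edge length (2) away.
    neighbors = [i for i in range(len(verts)) if abs(dist[i] - 2.0) < 1e-6]
    dirs = [apex / np.linalg.norm(apex)]
    for i, j in itertools.combinations(neighbors, 2):
        # Two neighbors that share an edge span a face together with the apex.
        if abs(np.linalg.norm(verts[i] - verts[j]) - 2.0) < 1e-6:
            center = (apex + verts[i] + verts[j]) / 3.0
            dirs.append(center / np.linalg.norm(center))
    return np.array(dirs)


def center_to_boundary_angles_deg(dirs):
    """Angle (degrees) between the center view and each boundary view."""
    return np.degrees(np.arccos(np.clip(dirs[1:] @ dirs[0], -1.0, 1.0)))
```

By symmetry, all five boundary views make the same angle with the center view, so `center_to_boundary_angles_deg(camera_directions())` returns five equal values.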
3.1 Multi-View Depth Prediction
Traditional MVS methods depend on hand-crafted features such as Harris descriptors to find correspondence between views. Such features are not robust to illumination changes or non-Lambertian surfaces, making them unusable for our purposes. In addition, due to the sparse inputs and large baselines, parts of the object may be visible in as few as two views. These factors cause traditional MVS methods to fail at finding accurate correspondences, and thus fail to reconstruct high-quality geometry.
Instead, we make use of a learning-based method to estimate the depth. Given the input images , we estimate the depth map for view . Similar to recent works on learning-based MVS [21, 51, 54], our network consists of two components: a feature extractor and a correspondence predictor . The feature extractor is a 2D U-Net  that extracts a -channel feature map for each image . To estimate the depth map at , we warp the feature maps of all views to view using a set of pre-defined depth levels, and build a 3D plane sweep volume by calculating the variance of feature maps over views. The 3D volume is further fed to the correspondence predictor, which is a 3D U-Net, to predict the probability of each depth level. We calculate the depth as a probability-weighted sum of all depth levels. The training loss is defined as the loss between predicted depths and ground truth depths. By learning the feature representations and correspondence, the proposed framework is more robust to illumination changes and specularities, thus producing more accurate pixel-wise depth predictions than traditional methods.
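As a rough illustration of the two operations described above (the variance-based plane sweep volume and the probability-weighted depth), here is a minimal NumPy sketch. The function names, the array layouts, and the use of a softmax to turn per-level scores into probabilities are assumptions made for this example; the feature warping and the 3D U-Net itself are omitted.

```python
import numpy as np


def variance_cost_volume(warped_features):
    """Variance of warped feature maps over views.

    warped_features: array of shape (V, D, C, H, W), i.e. the features of V
    views warped to the reference view at D depth levels (assumed layout).
    Returns an array of shape (D, C, H, W).
    """
    return np.var(warped_features, axis=0)


def expected_depth(scores, depth_levels):
    """Probability-weighted sum of depth levels for every pixel.

    scores: (D, H, W) unnormalized per-level scores (assumed network output).
    depth_levels: (D,) pre-defined depth values.
    Returns a depth map of shape (H, W).
    """
    # Numerically stable softmax over the depth axis (an assumption here).
    shifted = scores - scores.max(axis=0, keepdims=True)
    prob = np.exp(shifted)
    prob /= prob.sum(axis=0, keepdims=True)
    # Contract the depth axis: sum_d depth_levels[d] * prob[d, y, x].
    return np.tensordot(depth_levels, prob, axes=(0, 0))
```

Because the depth is an expectation rather than an arg-max over levels, it is continuous in the scores, which is what allows a regression loss on the depth to be backpropagated through the predictor.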
While such networks are able to produce reasonable depth, the recovered depth has errors in textureless regions. To further improve the accuracy, we add a guided filter module  to the network, which includes a guided map extractor as well as a guided layer . Let the initial depth prediction at view be . The guided map extractor takes image as input and learns a guidance map . The final depth map is estimated as:
The training loss is defined as the distance between predicted depths and ground truth depths. All components are trained jointly in an end-to-end manner.
3.2 Multi-View Reflectance Prediction
Estimating surface reflectance from sparse images is a highly under-constrained problem. Previous methods either assume geometry is known [2, 3, 32, 11] or can be reconstructed with specific devices  or MVS . In our case, accurate geometry cannot be reconstructed from sparse inputs with traditional MVS methods. While our learning-based MVS method produces reasonable depth maps, they too have errors, making it challenging to use them to align the images and estimate per-vertex SVBRDF. Instead, for each input image , we first estimate its corresponding normals, , and SVBRDF, represented by diffuse albedo , specular roughness and specular albedo .
To estimate the SVBRDF at view , we warp all input images to this view using predicted depths . One approach for multi-view SVBRDF estimation could be to feed this stack of warped images to a convolutional neural network like the commonly used U-Net [32, 39]. However, the inaccuracies in the depth maps lead to misalignments in the warped images, especially in occluded regions, and this architecture is not robust to these issues.
We propose a novel architecture that is robust to depth inaccuracies and occlusions. As shown in Fig. 3, our network comprises a Siamese encoder , , and a decoder, , with four branches for the four SVBRDF components. To estimate the SVBRDF at a reference view , the encoder processes pairs of inputs, each pair including image as well as the warped image , where we warp image at view to the reference view using the predicted depth . To handle potential occlusions, directly locating occluded regions in the warped images using predicted depths and masking them out is often not feasible due to inaccurate depths. Instead we keep the occluded regions in the warped images and include the depth information in the inputs, allowing the network to learn which parts are occluded.
To include the depth information, we draw inspiration from the commonly used shadow mapping technique . The depth input consists of two components: for each pixel in view , we calculate its depths in view ; we also sample its depth from the depth map by finding its projections on view . Intuitively if is larger than , then the pixel is occluded in view ; otherwise it is not occluded. In addition, for each pixel in the reference view , we also include the lighting directions of the light at view , as well as the lighting direction of the light at view , denoted as . We assume a point light model here. Since the light is collocated with the camera, by including the lighting direction we are also including the viewing direction of each pixel in the inputs. All directions are in the coordinate system of the reference view. Such cues are critical for networks to infer surface normals using photometric information. Therefore, the input for a pair of views and is:
The input contains channels in total, and there are a total of such inputs. We feed all the inputs to the encoder network and get the intermediate features . All these intermediate features are aggregated with a max-pooling layer yielding a common feature representation for view , :
is fed to the decoder to predict each SVBRDF component for view :
Compared to directly stacking all warped images together, our proposed network architecture works on pairs of input images and aggregates features across views using a max-pooling layer. The use of max-pooling makes the network more robust to occlusions and misalignments caused by depth inaccuracies and produces more accurate results (see Tab. 2). It also makes the network invariant to the number and order of the input views, a fact that could be utilized for unstructured capture setups. The training loss of the network is defined as:
where the first four terms are the losses for each SVBRDF component, and is the loss between input images and rendered images generated with our predictions.
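The permutation and view-count invariance of the max-pooling aggregation can be checked with a few lines of NumPy. This is only a sketch: the function name and the `(P, C, H, W)` feature layout are assumptions, and the encoder and decoder networks are left out.

```python
import numpy as np


def aggregate_pair_features(pair_features):
    """Max-pool per-pair encoder features over the pair axis.

    pair_features: array of shape (P, C, H, W), one feature map per input
    pair (assumed layout). Returns the pooled feature map of shape (C, H, W).
    """
    return np.max(np.asarray(pair_features), axis=0)


# The pooled result does not depend on the order of the pairs, and the same
# function accepts any number of pairs P >= 1.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4, 3, 3))
pooled = aggregate_pair_features(features)
pooled_reversed = aggregate_pair_features(features[::-1])
```

Here `pooled` and `pooled_reversed` are identical, which is the property that would make the same network usable with a different number or ordering of views.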
3.3 Geometry Reconstruction
The previous multi-view depth and SVBRDF estimation networks give us per-view depth and normal maps at full-pixel resolution. We fuse these per-view estimates to reconstruct a single 3D geometry for the object. We first build a point cloud from the depth maps, by generating 3D points from each pixel in every per-view depth map. For each point, we also get its corresponding normal from the estimated normal maps. Given this set of 3D points with surface normals, we perform a Poisson reconstruction  to reconstruct the fused 3D geometry. The initial point clouds may contain outliers due to inaccuracies in the depth maps. To get rid of undesired structures in the output geometry, we generate a coarse initial geometry by setting the depth of the spatial octree in Poisson reconstruction to —corresponding to an effective voxel resolution of . We refine this initial geometry in the subsequent stage. Compared to learning-based 3D reconstruction methods that directly generate geometry (voxel grids [24, 38], implicit functions [36, 40] or triangle meshes ) from images, this approach generalizes to arbitrary shapes and produces more detailed reconstructions.
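The point cloud construction that precedes Poisson reconstruction can be sketched as a per-pixel back-projection. The snippet below is a minimal NumPy version under several assumptions that go beyond the text: a pinhole camera with intrinsics `K`, depth stored as distance along the camera z-axis, normals expressed in camera coordinates, and a 4x4 `cam_to_world` pose. The Poisson solve itself is not shown.

```python
import numpy as np


def depth_to_oriented_points(depth, normals, K, cam_to_world):
    """Back-project a depth map into world-space points with normals.

    depth: (H, W) z-depth per pixel.
    normals: (H, W, 3) unit normals in camera coordinates (assumption).
    K: (3, 3) pinhole intrinsics; cam_to_world: (4, 4) rigid transform.
    Returns (points, normals), each of shape (H * W, 3), in row-major order.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel columns, rows
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pixels @ np.linalg.inv(K).T          # camera rays with z = 1
    points_cam = rays * depth.reshape(-1, 1)    # scale rays by z-depth
    rotation, translation = cam_to_world[:3, :3], cam_to_world[:3, 3]
    points_world = points_cam @ rotation.T + translation
    normals_world = normals.reshape(-1, 3) @ rotation.T  # rotate only
    return points_world, normals_world
```

Running this for every view and concatenating the results gives the set of oriented points that a Poisson surface reconstruction expects as input.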
3.4 SVBRDF and Geometry Refinement
Given the initial coarse geometry as well as the per-view SVBRDF predictions, we aim to construct a detailed 3D mesh with per-vertex BRDFs. For each vertex, a trivial way to get its BRDF is to blend the predicted SVBRDFs across views using pre-defined weights such as the dot product of the viewing directions and surface normals. However, this leads to blurry results (Fig. 5), due to the inconsistencies in the estimated SVBRDFs and the geometry. Also note that our SVBRDF predictions are computed from a single feed-forward network pass, and are not guaranteed to reproduce the captured input images exactly because the network has been trained to minimize the reconstruction loss on the entire training set and not this specific input sample.
We address these two issues with a novel rendering-based optimization that estimates per-vertex BRDFs that minimize the error between rendering the predicted parameters and the captured images. Because of the sparse observations, independently optimizing per-vertex BRDFs leads to artifacts such as outliers and spatial discontinuities, as shown in Fig. 5. Classic inverse rendering methods address this using hand-crafted priors. Instead, we optimize the per-view feature maps that are initially predicted from our SVBRDF encoder ( Eqn. 4). These latent features, by virtue of the training process, capture the manifold of object reflectances, and generate spatially coherent per-view SVBRDFs when passed through the decoder, (Eqn. 5). Optimizing in this feature space allows us to adapt the reconstruction to the input images, while leveraging the priors learnt by our multi-view SVBRDF estimation network.
Per-vertex BRDF and color. For each vertex , we represent its BRDF as a weighted average of the BRDF predictions from multiple views:
where is the corresponding pixel position of at view , represents the SVBRDF prediction at from view by processing via the decoder network , and are the per-vertex view blending weights. The rendered color of at view is calculated as:
where is the lighting direction and also the viewing direction of vertex at view , and is the rendering equation. We assume a point light source collocated with the camera (which allows us to ignore shadows), and only consider direct illumination in the rendering equation.
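The per-vertex blending can be sketched as follows. This is a minimal NumPy illustration, not the exact formulation: the array layouts, the clipping of negative weights, and the normalization of the weights to sum to one per vertex are assumptions, and `cosine_weights` shows only the pre-defined dot-product weighting mentioned earlier as a simple baseline or initialization.

```python
import numpy as np


def cosine_weights(normals, view_dirs):
    """Pre-defined blending weights: clamped dot product of normal and view direction.

    normals, view_dirs: arrays of shape (V, N, 3) holding unit vectors for
    N vertices seen from V views. Returns weights of shape (V, N).
    """
    return np.clip(np.sum(normals * view_dirs, axis=-1), 0.0, None)


def blend_per_vertex(per_view_params, weights, eps=1e-8):
    """Weighted average of per-view BRDF parameters for every vertex.

    per_view_params: (V, N, P) parameters sampled at each vertex's projection
    in each view (P parameters per vertex, e.g. albedo and roughness channels).
    weights: (V, N) per-vertex view blending weights.
    Returns blended parameters of shape (N, P).
    """
    w = np.clip(weights, 0.0, None)
    w = w / (w.sum(axis=0, keepdims=True) + eps)  # normalize over views
    return np.sum(w[..., None] * per_view_params, axis=0)
```

In the optimization described in this section, the weights would be free variables updated by gradient descent rather than fixed cosine terms.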
Per-view warping. Vertex can be projected onto view using the camera calibration; we refer to this projection as . However, the pixel projections onto multiple views might be inconsistent due to inaccuracies in the reconstructed geometry. Inspired by Zhou et al. , we apply a non-rigid warping to each view to better align the projections. In particular, for each input view, we use a grid with control points ( in our experiments) to construct a smooth warping field over the image plane. Let be the translation vectors of control points at view. The resulting pixel projection, , is given by:
where returns the bilinear weight for a control point at pixel location .
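A warping field of this kind can be sketched in NumPy as below. The function name, the `(x, y)` pixel convention, and the placement of the control points on a regular grid spanning the full image (corners included) are assumptions of this sketch; the grid resolution is whatever shape `control_translations` has, with at least two control points per axis.

```python
import numpy as np


def warp_projections(pixels, control_translations, image_size):
    """Offset pixel projections by a bilinearly interpolated translation field.

    pixels: (N, 2) projections as (x, y) in pixel units.
    control_translations: (rows, cols, 2) translation vector per control point.
    image_size: (width, height) of the image the grid spans.
    Returns the warped projections, shape (N, 2).
    """
    pixels = np.asarray(pixels, dtype=float)
    t = np.asarray(control_translations, dtype=float)
    rows, cols = t.shape[:2]
    width, height = image_size
    # Continuous grid coordinates of every pixel.
    gx = np.clip(pixels[:, 0] / (width - 1) * (cols - 1), 0, cols - 1)
    gy = np.clip(pixels[:, 1] / (height - 1) * (rows - 1), 0, rows - 1)
    # Top-left control point of the enclosing cell.
    x0 = np.minimum(np.floor(gx).astype(int), cols - 2)
    y0 = np.minimum(np.floor(gy).astype(int), rows - 2)
    ax = (gx - x0)[:, None]
    ay = (gy - y0)[:, None]
    # Bilinear weights of the four surrounding control points.
    offset = ((1 - ax) * (1 - ay) * t[y0, x0]
              + ax * (1 - ay) * t[y0, x0 + 1]
              + (1 - ax) * ay * t[y0 + 1, x0]
              + ax * ay * t[y0 + 1, x0 + 1])
    return pixels + offset
```

With all translations set to zero the warp is the identity, which is a natural starting point for the optimization.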
SVBRDF optimization. We optimize per-view latent features , per-vertex blending weights and per-view warping fields to reconstruct the final SVBRDFs. The photometric consistency loss between the rendered colors and ground truth colors for all vertices is given by:
We clamp the rendered colors to the range of before calculating the loss. To prevent the non-rigid warping from drifting, we also add an regularizer to penalize the norm of the translation vectors:
Therefore the final energy function for the optimization is:
We set to , and optimize the energy function with Adam optimizer  with a learning rate of .
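The structure of this energy (a photometric term on clamped renderings plus a weighted penalty on the warp translations) can be sketched as follows. Several details here are assumptions of the sketch rather than statements about the method: the clamping range is taken to be [0, 1], both terms are written as sums of squared differences, and `reg_weight` is a placeholder for the regularization weight.

```python
import numpy as np


def optimization_energy(rendered, observed, translations, reg_weight):
    """Photometric consistency loss plus a warp regularizer.

    rendered: (V, N, 3) rendered vertex colors per view.
    observed: (V, N, 3) captured colors sampled at the (warped) projections.
    translations: (V, G, 2) control-point translation vectors per view.
    reg_weight: scalar weight of the regularizer (placeholder value).
    """
    clamped = np.clip(rendered, 0.0, 1.0)            # assumed range [0, 1]
    photometric = np.sum((clamped - observed) ** 2)  # assumed squared L2
    regularizer = np.sum(translations ** 2)          # assumed squared L2
    return photometric + reg_weight * regularizer
```

In practice this scalar would be evaluated inside an automatic-differentiation framework so that its gradient with respect to the latent features, blending weights, and translations can be passed to Adam.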
Geometry optimization. We use the optimized per-vertex normal, , to update the geometry of the object by re-solving the Poisson equation (Sec. 3.3). Unlike the initial geometry reconstruction, we set the depth of the spatial octree to —corresponding to a voxel resolution of —to better capture fine-grained details of the object. We use this updated geometry in subsequent SVBRDF optimization iterations. We update the geometry once for every iterations of SVBRDF optimization, and we perform iterations for the SVBRDF optimization.
Per-vertex refinement. The bottleneck in our multi-view SVBRDF network, which we use as our reflectance representation, may cause a loss of high-frequency details in the predicted SVBRDFs. We recover these details by directly optimizing the BRDF parameters of each vertex to minimize the photometric loss in Eqn. (3.4). Note that after the previous optimization, the estimated BRDFs have already converged to good results and the rendered images are very close to the input images. Therefore, in this stage, we use a small learning rate (), and perform the optimization for a small number () of iterations.
4 Implementation and Results
Training data. We follow Xu et al.  and procedurally generate complex scenes by combining to primitive shapes such as cylinders and cubes displaced by random height maps. We generate training and testing scenes. We divide the high-quality materials from the Adobe Stock dataset (https://stock.adobe.com/search/3d-assets) into a training and testing set, and use them to texture the generated scenes separately. For each scene, following the setup discussed in Sec. 1, we render the 6 input view images with a resolution of using a custom OptiX-based global illumination renderer with 1000 samples per pixel. We also render the ground truth depth, normals, and SVBRDF components for each view.
Network architecture. For depth estimation, we use a 2D U-Net architecture  for the feature extractor, , and guidance map extractor, . Both networks have downsampling/upsampling blocks. The correspondence predictor is a 3D U-Net with downsampling/upsampling blocks. For multi-view SVBRDF estimation, both the encoder and decoder are 2D CNNs, with downsampling layers in and upsampling layers in . Note that we do not use skip connections in the SVBRDF network; this forces the latent feature to learn a meaningful reflectance space and allows us to optimize it in our refinement step. We use group normalization  in all networks. We use a differentiable rendering layer that computes local shading under point lighting without considering visibility or global illumination. This is a reasonable approximation in our collocated lighting setup. For more details, please refer to the supplementary document.
Training details. All the networks are trained with the Adam optimizer  for epochs with a learning rate of . The depth estimation networks are trained on cropped patches of with a batch size of , and the SVBRDF estimation networks are trained on cropped patches with a batch size of . Training took around four days on NVIDIA Titan 2080Ti GPUs.
Run-time. Our implementation has not been optimized for the best timing efficiency. In practice, our method takes around minutes for full reconstruction from images with a resolution of , where most of the time is spent on geometry fusion and optimization.
4.1 Evaluation on Synthetic Data
We evaluate our max-pooling-based multi-view SVBRDF estimation network on our synthetic test set. In particular, we compare it with a baseline U-Net (with 5 downsampling/upsampling blocks) that takes a stack of all the coarsely aligned images ( in Eqn. 2) as input for its encoder, and has skip connections from the encoder to the four SVBRDF decoders. This architecture has been widely used for SVBRDF estimation [11, 32, 33]. As can be seen in Tab. 2, while our diffuse albedo prediction is slightly () worse than the U-Net, we significantly outperform it in specular albedo, roughness and normal predictions, with , and lower loss respectively. This is in spite of not using skip connections in our network (to allow for optimization later in our pipeline). We also compare our results with the state-of-the-art single-image shape and SVBRDF estimation method of Li et al. . Unsurprisingly, we outperform them significantly, demonstrating the usefulness of aggregating multi-view information.
| Li et al. | 0.0227 | 0.1075 | 0.0661 | — |
4.2 Evaluation on Real Captured Data
We evaluate our method on real data captured using a gantry with a FLIR camera and a nearly collocated light to mimic our capture setup. Please refer to the supplementary material for additional results.
Evaluation of geometry reconstruction. Our framework combines our predicted depths and normals to reconstruct the initial mesh. Figure 4 shows the comparison between our reconstructed mesh and the mesh from COLMAP, a state-of-the-art multi-view stereo framework . From such sparse inputs and low-texture surfaces, COLMAP is not able to find reliable correspondence across views, which results in a noisy, incomplete 3D mesh. In contrast, our initial mesh is already more complete and detailed, as a result of our more accurate depths and normals. Our joint optimization further refines the per-vertex normals and extracts fine-scale detail in the object geometry.
Evaluation of SVBRDF optimization. We compare our SVBRDF and geometry optimization scheme (Sec. 3.4) with averaging the per-view predictions using weights based on the angle between the viewpoint and surface normal, as well as this averaging followed by per-vertex optimization. From Fig. 5 we can see that the weighted averaging produces blurry results. Optimizing the per-vertex BRDFs brings back detail but also has spurious discontinuities in appearance because of the lack of any regularization. In contrast, our latent-space optimization method recovers detailed appearance without these artifacts.
Comparisons against Nam et al.  We also compare our work with the state-of-the-art geometry and reflectance reconstruction method of Nam et al. Their work captures 60+ images of an object with a handheld camera under collocated lighting; they first use COLMAP  to reconstruct the coarse shape and use it to bootstrap a physics-based optimization process to recover per-vertex normals and BRDFs. COLMAP cannot generate complete meshes from our sparse inputs (see Fig. 4). Therefore, we provided our input images, camera calibration, and initial geometry to the authors, who processed this data. As can be seen in Fig. 6, our final reconstructed geometry has significantly more detail than their final optimized result in spite of starting from the same initialization. Since they use a different BRDF representation than ours, which makes direct SVBRDF comparisons difficult, in Fig. 7 we compare renderings of the reconstructed object under novel lighting and viewpoint. These results show that they cannot handle our sparse input and either produce noisy, erroneous reflectance (Cat scene) or are unable to recover the specular highlights of highly specular objects (Cactus scene). In comparison, our results have significantly higher visual fidelity. Please refer to the supplementary video for more renderings.
More results on real data. Figure 8 shows results from our method on additional real scenes. We can see here that our method can reconstruct detailed geometry and appearance for objects with a wide variety of complex shapes and reflectance. Comparing renderings of our estimates under novel camera and collocated lighting against ground truth captured photographs demonstrates the accuracy of our reconstructions. We can also photorealistically render these objects under novel environment illumination. Please refer to the supplementary document and video for more results.
Limitations. Our method might fail to handle highly non-convex objects, where some parts are visible in as few as a single view and there are no correspondence cues to infer correct depth. In addition, we do not consider global illumination in SVBRDF optimization. While it is a reasonable approximation in most cases, it might fail in some particular scenes with strong inter-reflections. For future work, it would be interesting to combine our method with physics-based differentiable rendering [30, 55] to handle these complex light transport effects.
We have proposed a learning-based framework to reconstruct the geometry and appearance of an arbitrary object from a sparse set of just six images. We predict per-view depth using learning-based MVS, and design a novel multi-view reflectance estimation network that robustly aggregates information from our sparse views for accurate normal and SVBRDF estimation. We further propose a novel joint optimization in latent feature space to fuse and refine our multi-view predictions. Unlike previous methods that require densely sampled images, our method produces high-quality reconstructions from a sparse set of images, and presents a step towards practical appearance capture for 3D scanning and VR/AR applications.
Acknowledgements This work was supported in part by NSF grant 1617234, ONR grants N000141712687, N000141912293, Adobe and the UC San Diego Center for Visual Computing.
[1] (2018) Learning representations and generative models for 3D point clouds. In ICML, pp. 40–49.
[2] (2016-07) Reflectance modeling by neural texture synthesis. ACM Trans. Graph. 35 (4), pp. 65:1–65:13.
[3] (2015-07) Two-shot SVBRDF capture for stationary materials. ACM Transactions on Graphics 34 (4), pp. 110:1–110:13.
[4] (2008) Photometric stereo with non-parametric and spatially-varying reflectance. In CVPR, pp. 1–8.
[5] (2018) Simultaneous acquisition of polarimetric SVBRDF and normals. ACM Transactions on Graphics 37 (6), pp. 268–1.
[6] (2012) Physically-based shading at Disney. In ACM SIGGRAPH 2012 Courses, SIGGRAPH ’12, pp. 10:1–10:7.
[7] (2018) Self-calibrating deep photometric stereo networks. In ECCV.
[8] (2018) PS-FCN: a flexible learning framework for photometric stereo. In ECCV.
[9] (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, pp. 539–546.
[10] (1996) A space-sweep approach to true multi-image matching. In CVPR, pp. 358–363.
[11] (2018) Single-image SVBRDF capture with a rendering-aware deep network. ACM Transactions on Graphics 37 (4), pp. 128.
[12] (2019-07) Flexible SVBRDF capture with a multi-image deep network. Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering) 38 (4).
[13] (2014) Appearance-from-motion: recovering spatially varying surface reflectance under unknown lighting. ACM Transactions on Graphics 33 (6), pp. 193.
[14] (2015) Multi-view stereo: a tutorial. Foundations and Trends® in Computer Graphics and Vision 9 (1-2), pp. 1–148.
[15] (2009) Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (8), pp. 1362–1376.
[16] (2019) Deep inverse rendering for high-resolution SVBRDF estimation from an arbitrary number of images. ACM Transactions on Graphics 38 (4), pp. 134.
[17] (2009) Shape and spatially-varying BRDFs from photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (6), pp. 1060–1071.
[18] (2010) A coaxial optical scanner for synchronous acquisition of 3D geometry and surface reflectance. ACM Transactions on Graphics 29 (4), pp. 99.
[19] (2018) DeepMVS: learning multi-view stereopsis. In CVPR, pp. 2821–2830.
[20] (2017) Reflectance capture using univariate sampling of BRDFs. In ICCV, pp. 5362–5370.
[21] (2019) DPSNet: end-to-end deep plane sweep stereo. ICLR.
[22] (2017) SurfaceNet: an end-to-end 3D neural network for multiview stereopsis. In ICCV, pp. 2307–2315.
[23] Efficient reflectance capture using an autoencoder. ACM Transactions on Graphics 37 (4), pp. 127–1.
[24] (2017) Learning a multi-view stereo machine. In NIPS, pp. 365–376.
[25] Real shading in Unreal Engine 4.
[26] (2006) Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, Vol. 7.
[27] (2013) Screened Poisson surface reconstruction. ACM Transactions on Graphics 32 (3), pp. 29.
[28] (2014) Adam: a method for stochastic optimization. ICLR.
[29] (2016) Shading-aware multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV).
[30] (2018) Differentiable Monte Carlo ray tracing through edge sampling. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 37 (6), pp. 222:1–222:11.
[31] (2017-07) Modeling surface appearance from a single photograph using self-augmented convolutional neural networks. ACM Trans. Graph. 36 (4), pp. 45:1–45:11.
[32] (2018) Materials for masses: SVBRDF acquisition with a single mobile phone image. In ECCV, pp. 72–87.
[33] (2018) Learning to reconstruct shape and spatially-varying reflectance from a single image. In SIGGRAPH Asia 2018, pp. 269.
[34] (2003-07) A data-driven reflectance model. ACM Transactions on Graphics 22 (3), pp. 759–769.
[35] (2018) Practical SVBRDF acquisition of 3D objects with unstructured flash photography. In SIGGRAPH Asia 2018, pp. 267.
[36] (2019) DeepSDF: learning continuous signed distance functions for shape representation. In CVPR, pp. 165–174.
[37] (2018) Matryoshka networks: predicting 3D geometry via nested shape layers. In CVPR, pp. 1936–1944.
[38] (2017) OctnetFusion: learning depth fusion from data. In 2017 International Conference on 3D Vision, pp. 57–66.
[39] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
-  (2019-10) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, Cited by: §3.3.
-  (2016) Pixelwise view selection for unstructured multi-view stereo. In ECCV, Cited by: §1, §2, §4.2, §4.2.
-  (2013) Acquiring reflectance and shape from continuous spherical harmonic illumination. ACM Transactions on graphics 32 (4), pp. 109. Cited by: §1, §2.
-  (2018) MVPNet: multi-view point regression networks for 3d object reconstruction from a single image. arXiv preprint arXiv:1811.09410. Cited by: §2.
-  (2018) Pixel2Mesh: generating 3D mesh models from single rgb images. In ECCV, pp. 52–67. Cited by: §3.3.
-  (1978) Casting curved shadows on curved surfaces. In SIGGRAPH, Vol. 12, pp. 270–274. Cited by: §3.2.
-  (2015) Simultaneous localization and appearance estimation with a consumer rgb-d camera. IEEE Transactions on visualization and computer graphics 22 (8), pp. 2012–2023. Cited by: §1.
-  (2018) Fast end-to-end trainable guided filter. In CVPR, pp. 1838–1847. Cited by: Appendix B, §3.1.
-  (2017) Marrnet: 3d shape reconstruction via 2.5 d sketches. In NIPS, pp. 540–550. Cited by: §2.
-  (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: Appendix B, §4.
-  (2016) Recovering shape and spatially-varying surface reflectance under unknown illumination. ACM Transactions on Graphics 35 (6), pp. 187. Cited by: §1, §2.
-  (2019) Deep view synthesis from sparse photometric images. ACM Transactions on Graphics 38 (4), pp. 76. Cited by: Appendix B, §1, §1, §2, §3.1, §3, §4.
-  (2016) Minimal brdf sampling for two-shot near-field reflectance acquisition. ACM Transactions on Graphics 35 (6), pp. 188. Cited by: §2.
-  (2018) Deep image-based relighting from optimal sparse samples. ACM Transactions on Graphics 37 (4), pp. 126. Cited by: §1.
-  (2018) MVSNet: depth inference for unstructured multi-view stereo. In ECCV, pp. 767–783. Cited by: §1, §2, §3.1.
-  (2019) A differential theory of radiative transfer. ACM Trans. Graph. 38 (6). Cited by: §4.2.
-  (2014) Color map optimization for 3D reconstruction with consumer depth cameras. ACM Transactions on Graphics 33 (4), pp. 155. Cited by: §3.4.
-  (2016) Sparse-as-possible SVBRDF acquisition. ACM Transactions on Graphics 35 (6), pp. 189. Cited by: §1, §1, §2.
Appendix A BRDF Model
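We use a simplified version of the Disney BRDF model proposed by Karis et al. [25]. Let $A$, $N$, $R$, $S$ be the diffuse albedo, normal, roughness and specular albedo respectively, $L$ and $V$ be the light and view directions, and $H$ be their half vector. Our BRDF model is defined as:
$$f(A, N, R, S, L, V) = \frac{A}{\pi} + \frac{D(H, R)\, F(V, H, S)\, G(L, V, N, R)}{4\,(N \cdot L)(N \cdot V)}$$
where $D$, $F$ and $G$ are the normal distribution, Fresnel and geometric terms respectively. These terms are defined as follows:
$$D(H, R) = \frac{\alpha^2}{\pi \left[ (N \cdot H)^2 (\alpha^2 - 1) + 1 \right]^2}, \qquad \alpha = R^2$$
$$F(V, H, S) = S + (1 - S)\, 2^{-\left[ 5.55473\,(V \cdot H) + 6.98316 \right] (V \cdot H)}$$
$$G(L, V, N, R) = G_1(V, N)\, G_1(L, N), \qquad G_1(X, N) = \frac{N \cdot X}{(N \cdot X)(1 - k) + k}, \qquad k = \frac{(R + 1)^2}{8}$$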
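As an illustration, the model can be evaluated per pixel and per colour channel as in the sketch below. It assumes the standard formulation of Karis: a GGX normal distribution with $\alpha = R^2$, the spherical-Gaussian approximation of Schlick's Fresnel term, and a Schlick-Smith geometric term with $k = (R+1)^2/8$. The function names and the clamping of the dot products are our own choices for the example, not part of the paper.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)


def brdf(A, N, R, S, L, V):
    """Simplified Disney BRDF for a single colour channel.

    A: diffuse albedo, R: roughness (> 0), S: specular albedo (scalars);
    N, L, V: unit normal, light and view directions.
    """
    H = normalize(tuple(l + v for l, v in zip(L, V)))  # half vector
    # Clamp the cosines to avoid division by zero at grazing angles
    # (an assumption of this sketch).
    n_l = max(dot(N, L), 1e-6)
    n_v = max(dot(N, V), 1e-6)
    n_h = max(dot(N, H), 0.0)
    v_h = max(dot(V, H), 0.0)

    alpha = R * R
    k = (R + 1.0) ** 2 / 8.0

    # Normal distribution term (GGX).
    D = alpha ** 2 / (math.pi * (n_h ** 2 * (alpha ** 2 - 1.0) + 1.0) ** 2)
    # Fresnel term (spherical-Gaussian approximation of Schlick).
    F = S + (1.0 - S) * 2.0 ** (-(5.55473 * v_h + 6.98316) * v_h)
    # Geometric term (Schlick-Smith, separable in V and L).
    G = (n_v / (n_v * (1.0 - k) + k)) * (n_l / (n_l * (1.0 - k) + k))

    return A / math.pi + D * F * G / (4.0 * n_l * n_v)
```

With collocated light and camera, as in our capture setup, `L` and `V` coincide, so the half vector equals the view direction and the Fresnel term is evaluated at $V \cdot H = 1$.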
Appendix B Network Architecture
We discussed the motivation, design and core components of our depth prediction network and SVBRDF prediction network in Sec. 3.1 and Sec. 3.2 of the paper. We now describe the network architectures in detail, as shown in Fig. 9.
Depth prediction network. As discussed in Sec. 3.1 of the paper, the depth prediction network consists of three parts: the feature extractor, the correspondence predictor and the guidance map extractor. The feature extractor and the correspondence predictor are used to predict the initial depth map; the guidance map extractor is then used to refine this initial depth with a guided filter [47] to obtain the final depth. The first row of Fig. 9 shows the details of these sub-networks.
We use the feature extractor and the correspondence predictor to regress the initial depth, similar to [51]. In particular, the feature extractor is a 2D U-Net that consists of multiple downsampling and upsampling convolutional layers with skip links, group normalization (GN) [49] layers and ReLU activation layers; it extracts per-view image feature maps with 16 channels.
To predict the depth at a reference view $i$, we uniformly sample 128 fronto-parallel depth planes at depths $d_1, \dots, d_{128}$ in front of that view, within a pre-defined depth range that covers the target object. We project the feature maps from all views onto every depth plane at view $i$ using homography-based warping to construct the plane sweep volume of view $i$. We then build a cost volume by calculating the variance of the warped feature maps over views at each plane. The correspondence predictor is a 3D U-Net that processes this cost volume; it has multiple downsampling and upsampling 3D convolutional layers with skip links, GN layers and ReLU layers. Its output is a 1-channel volume, and we apply a softmax on this volume across the depth planes to obtain the per-plane depth probability maps $P_1, \dots, P_{128}$; these maps give, for each pixel, the probability that its depth equals the depth of each plane. The initial depth map $D'_i$ is then regressed by linearly combining the per-plane depth values weighted by the per-plane depth probability maps:
$$D'_i = \sum_{k=1}^{128} P_k \, d_k$$
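The two non-learned steps of this pipeline, the variance cost and the probability-weighted depth regression, can be sketched for a single pixel as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the depth planes are assumed to be spaced uniformly in depth with both ends of the range included, and the per-plane scores stand in for the output of the 3D U-Net.

```python
import math


def variance_cost(features):
    """Per-channel variance across views.

    features: one feature vector per view, all sampled at the same pixel
    of the same depth plane after warping to the reference view.
    """
    n = len(features)
    cost = []
    for c in range(len(features[0])):
        vals = [f[c] for f in features]
        mean = sum(vals) / n
        cost.append(sum((v - mean) ** 2 for v in vals) / n)
    return cost


def softmax(scores):
    """Numerically stable softmax over the depth planes."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def regress_depth(scores, d_min, d_max):
    """Expected depth under the per-plane probabilities (soft-argmax).

    scores: one score per depth plane for this pixel.
    Assumes planes uniformly spaced in [d_min, d_max], ends included.
    """
    n = len(scores)
    depths = [d_min + (d_max - d_min) * k / (n - 1) for k in range(n)]
    probs = softmax(scores)
    return sum(p * d for p, d in zip(probs, depths))
```

Because the regression is an expectation rather than a hard argmax, the predicted depth can fall between planes and the operation remains differentiable with respect to the scores.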
We apply the guidance map extractor to refine the initial depth. It is a 2D U-Net that outputs a 1-channel feature map. We use this feature map as a guidance map to filter the initial depth and obtain the final depth.
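To make the filtering step concrete, the sketch below implements the classical (non-learned) guided filter on small images stored as nested lists. It is a stand-in for illustration only: the network uses the trainable guided filter cited above, whereas here the window radius `r`, the regularizer `eps` and all function names are assumptions of the example.

```python
def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window, truncated at the image border."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            total = sum(img[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out


def combine(f, *imgs):
    """Apply f pixel-wise to one or more equally sized images."""
    h, w = len(imgs[0]), len(imgs[0][0])
    return [[f(*(im[y][x] for im in imgs)) for x in range(w)] for y in range(h)]


def guided_filter(guide, depth, r=1, eps=1e-4):
    """Filter `depth` so that its edges follow those of `guide`."""
    mean_g = box_mean(guide, r)
    mean_d = box_mean(depth, r)
    mean_gd = box_mean(combine(lambda g, d: g * d, guide, depth), r)
    mean_gg = box_mean(combine(lambda g: g * g, guide), r)
    # Fit the local linear model depth ~ a * guide + b in each window.
    a = combine(lambda gd, g, d, gg: (gd - g * d) / (gg - g * g + eps),
                mean_gd, mean_g, mean_d, mean_gg)
    b = combine(lambda d, a_, g: d - a_ * g, mean_d, a, mean_g)
    # Average the coefficients of all windows covering each pixel.
    return combine(lambda a_, b_, g: a_ * g + b_,
                   box_mean(a, r), box_mean(b, r), guide)
```

Where the guidance map is locally flat, the filter reduces to a box blur of the depth; where the guidance map has an edge, the local linear model lets the depth change sharply along it.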
SVBRDF prediction network. We have discussed the SVBRDF prediction network in Sec. 3.2, and shown its overall architecture, inputs and outputs in Fig. 2 and Fig. 3 of the paper. We now describe the details of the encoder and the SVBRDF decoder shown in Fig. 9 (bottom row). Specifically, the encoder consists of a set of convolutional layers, followed by GN and ReLU layers; multiple strided convolutional layers are used to downsample the feature maps three times. The decoder upsamples the feature maps three times with nearest-neighbor upsampling, and applies convolutional layers, GN and ReLU layers to process the feature maps at each upsampling level. As discussed in Sec. 3.2 of the paper, we use four decoders with the same architecture, all connected to the same encoder, to regress the three BRDF components and the normal map at each input view.
Appendix C Comparison on SVBRDF Prediction
In Sec. 4.1 and Tab. 1 of the paper, we have shown quantitative comparisons on synthetic data between our network, the naïve U-Net and the single-image SVBRDF prediction network proposed by Li et al. [33]. We now show qualitative comparisons between these methods on both synthetic and real examples in Fig. 13, Fig. 14, Fig. 15 and Fig. 16. From these figures, we can see that the naïve U-Net produces noisy normals and the single-view method [33] produces normals with very few details, whereas our predicted normals are of much higher quality, especially in regions with severe occlusions (indicated by the red arrow). Moreover, as reflected by the comparison on synthetic data in Fig. 13 and Fig. 14, our predictions are more accurate and more consistent with the ground truth than those of the other methods. These results demonstrate that our novel network architecture (see Sec. 3.2 in the paper) allows for effective aggregation of multi-view information and leads to high-quality per-view SVBRDF estimation.
Appendix D Comparison on Geometry Reconstruction
In Fig. 6 of the paper, we compare our optimized geometry against the optimized result from Nam et al. [35], which uses the same initial geometry as ours. We show additional comparisons on real data in Fig. 10. As in the comparison in the paper, our optimized geometry is of much higher quality than that of Nam et al., with more fine-grained details and fewer artifacts.
Appendix E Additional Ablation Study
In this section, we demonstrate additional experiments to justify the design choices in our pipeline, including input variants of the SVBRDF estimation network, non-rigid warping and per-vertex refinement.
Network inputs. Our SVBRDF network takes as inputs the reference image, the warped images, the two light/viewing direction maps (the light and the camera are collocated), and the two depth maps (please refer to Sec. 3.2 in the paper for details of these input components). We verify the effectiveness of these inputs by training and comparing multiple networks with different subsets of the inputs. In particular, we compare our full model against a network that uses only the warped image, a network that uses both the warped image and the reference image, a network that uses the reference image, the warped image and the depth, and a network that uses the reference image, the warped image and the viewing directions. Table 2 shows the quantitative comparisons between these networks on the synthetic testing set. The network using a pair of images (reference and warped) improves the accuracy for most of the terms over the one that uses only the warped image, which reflects the benefit of involving multi-view cues in the encoder network. On top of the image inputs, the two networks that additionally take the depth information or the viewing directions both outperform the image-only versions; they benefit from visibility cues and photometric cues respectively. Our full model is able to leverage both cues from the multi-view inputs and achieves the best performance.
Per-view warping. Due to potential inaccuracies in the geometry, the pixel colors of a vertex from different views may not be consistent. Directly minimizing the difference between the rendered color and the pixel color of each view leads to ghosting artifacts, as shown in Fig. 11. To solve this problem, we apply a non-rigid warping to each view. Fig. 11 shows that non-rigid warping effectively compensates for the misalignments and leads to sharper edges.
Per-vertex refinement. As shown in Fig. 12, the image rendered using the estimated SVBRDF without per-vertex refinement loses high-frequency details, such as the tiny spots on the pumpkin, because of the bottleneck in our SVBRDF network. In contrast, the proposed per-vertex refinement successfully recovers these details and reproduces the appearance of the object more faithfully.