1 Introduction
Reconstructing the 3D geometry and reflectance properties of an object from 2D images has been a longstanding problem in computer vision and graphics, with applications including 3D visualization, relighting, and augmented and virtual reality. Traditionally this has been accomplished using complex acquisition systems
[5, 18, 42, 46, 57] or multi-view stereo (MVS) methods [14, 41] applied to dense image sets [35, 50]. The acquisition requirements of these methods significantly limit their practicality. Recently, deep neural networks have been proposed for material estimation from a single image or a few images. However, many of these methods are restricted to estimating the spatially-varying BRDF (SVBRDF) of planar samples
[11, 16, 32]. Li et al. [33] demonstrate shape and reflectance reconstruction from a single image, but their reconstruction quality is limited by their single-image input.

Our goal is to enable practical and high-quality shape and appearance acquisition. To this end, we propose using a simple capture setup: a sparse set of six cameras—placed at one vertex and the centers of the adjoining faces of a regular icosahedron, forming a cone—with collocated point lighting (Fig. 2, left). Capturing six images allows for better reconstruction than single-image methods. However, at such wide baselines, the captured images have few correspondences and severe occlusions, making it challenging to fuse information across viewpoints.
As illustrated in Fig. 2, we propose a two-stage approach to address this problem. First, we design multi-view geometry and reflectance estimation networks that regress the 2D depth, normals and reflectance for each input view by robustly aggregating information across all sparse viewpoints. We estimate the depth for each input view using a deep multi-view stereo network [51, 54] (Sec. 3.1). Because of our sparse capture, these depth maps contain errors and cannot be used to accurately align the images for estimating per-vertex BRDFs [35, 57]. Instead, we use these depth maps to warp the images to one viewpoint and use a novel deep multi-view reflectance estimation network to estimate per-pixel normals and reflectance (parameterized by diffuse albedo, specular albedo and roughness in a simplified Disney BRDF model [25]) for that viewpoint (Sec. 3.2
). This network extracts features from the warped images, aggregates them across viewpoints using max-pooling, and decodes the pooled features to estimate the normals and SVBRDF for that viewpoint. This approach to aggregating multi-view information leads to more robust reconstruction than baseline approaches like a U-Net architecture
[39], and we use it to recover normals and reflectance for each view.

Second, we propose a novel method to fuse these per-view estimates into a single mesh with per-vertex BRDFs using optimization in a learnt reflectance space. First, we use Poisson reconstruction [26] to construct a mesh from the estimated per-view depth and normal maps (Sec. 3.3). Each mesh vertex has multiple reflectance parameters, one set per per-view reflectance map, and we fuse these estimates to reconstruct object geometry and reflectance that accurately reproduce the input images
. Instead of optimizing the per-vertex reflectance parameters directly, which leads to outliers and spatial discontinuities, we optimize
the latent features of our multi-view reflectance estimation network (Sec. 3.4). We pass these latent features to the reflectance decoder to construct per-view SVBRDFs, fuse them using per-vertex blending weights, and render them to compute the photometric error for all views. The entire pipeline is differentiable, allowing us to backpropagate this error and iteratively update the reflectance latent features and per-vertex weights until convergence. This process refines the reconstruction to best match the specific captured images, while leveraging the priors learnt by our reflectance estimation network.
We train our networks with a large-scale synthetic dataset comprised of procedurally generated shapes with complex SVBRDFs [51, 53], rendered using a physically-based renderer. While our method is trained purely on synthetic data, it generalizes well to real scenes. This is illustrated in Figs. 1 and 8, where we reconstruct real objects with complex geometry and non-Lambertian reflectance. Previous state-of-the-art methods, when applied to sparse input images of such objects, produce incomplete, noisy geometry and erroneous reflectance estimates (Figs. 4 and 7). In contrast, our work is the first to reconstruct detailed geometry and high-quality reflectance from sparse multi-view inputs, allowing us to render photorealistic images under novel view and lighting.
2 Related Work
3D reconstruction. To reconstruct 3D geometry from image sets, traditional methods [15, 29, 41] find correspondences between two or more images using specific image features. Such methods are sensitive to illumination changes, non-Lambertian reflectance and textureless surfaces. The existence of multiple points with similar matching costs also requires these methods to use a large number of images to achieve high-quality reconstructions (we refer interested readers to [15] for more details). In contrast, our method reconstructs high-quality geometry for complex real scenes from an order of magnitude fewer images.
Recently, numerous learning-based methods have been proposed to reconstruct 3D shape using various geometric representations, including regular volumes [22, 37, 48], point clouds [1, 43] and depth maps [19, 54]. These methods cannot produce high-resolution 3D meshes. We extend recent learning-based MVS frameworks [51, 54] to estimate depth from sparse multi-view images of objects with complex reflectance. We combine this depth with estimated surface normals to reconstruct 3D meshes with fine details.
SVBRDF acquisition. SVBRDF acquisition is a challenging task that often requires a dense input image set [13, 35, 50]. Many methods utilize sophisticated hardware [34] or light patterns [18, 23, 42]. Reconstruction from sparse images has been demonstrated for planar objects [3, 32, 52], and known geometry [57]. In contrast, we reconstruct the geometry and complex reflectance of arbitrary objects from a sparse set of six input images.
Photometric stereo methods have been proposed to reconstruct arbitrary shape and SVBRDFs [4, 17]; however, they focus on single-view reconstruction and require hundreds of images. Recent works [20, 35] utilize images captured by a collocated camera-light setup for shape and SVBRDF estimation. In particular, Nam et al. [35] capture more than sixty images and use multi-view reconstruction and physics-based optimization to recover geometry and reflectance. In contrast, by designing novel deep networks, we are able to reconstruct objects from only six images.
Learning-based methods have also been applied to normal and SVBRDF acquisition. Deep photometric stereo methods reconstruct surface normals from tens to hundreds of images [7, 8], but they do not address reflectance or 3D geometry estimation. Most deep SVBRDF acquisition methods are designed for planar samples [2, 11, 12, 16, 31, 32]. Some recent multi-image SVBRDF estimation approaches pool latent features from multiple views [12] or use latent feature optimization [16], but they only handle planar objects. Li et al. [33] predict depth and SVBRDF from a single image; however, a single input does not provide enough information to accurately reconstruct geometry and reflectance. By capturing just six images, our approach generates significantly higher-quality results.
3 Algorithm
Our goal is to accurately reconstruct the geometry and SVBRDF of an object with a simple acquisition setup. Recent work has utilized collocated point illumination for reflectance estimation from a sparse set of images [2, 3, 11, 32]; such lighting minimizes shadows and induces high-frequency effects like specularities, making reflectance estimation easier. Similarly, Xu et al. [51] demonstrate novel view synthesis from sparse multi-view images of a scene captured under a single point light.
Motivated by this, we utilize a capture system similar to Xu et al.'s—six cameras placed at one vertex of a regular icosahedron and at the centers of the five faces adjoining that vertex. Unlike their use of a single point light for all images, we capture each image under a point light (nearly) collocated with the corresponding camera (see Fig. 2, left). The setup is calibrated, giving us a set of input images with the corresponding camera calibration. This wide-baseline setup—with a wide angle between the center and boundary views—makes it possible to image the entire object with a small set of cameras. In the following, we describe how we reconstruct an object from these sparse input images.
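Since the camera directions are fixed by the icosahedral geometry, the baseline angle between the center view and each boundary view can be computed directly. The following sketch (our own construction, not the authors' code; all variable names are ours) builds the chosen vertex and its five adjoining face centers and measures that angle:

```python
import numpy as np

# Regular icosahedron vertices from the golden ratio (standard construction).
phi = (1 + 5 ** 0.5) / 2
V = np.array([(0, s1, s2 * phi) for s1 in (-1, 1) for s2 in (-1, 1)]
             + [(s1, s2 * phi, 0) for s1 in (-1, 1) for s2 in (-1, 1)]
             + [(s1 * phi, 0, s2) for s1 in (-1, 1) for s2 in (-1, 1)], float)

center = np.array([0.0, 1.0, phi])             # chosen vertex ("center" view)
# Neighbors of `center`: vertices at edge length 2 in this construction.
nbrs = [v for v in V if np.isclose(((v - center) ** 2).sum(), 4.0)]
# Faces adjoining the vertex: (center, a, b) where a and b are also adjacent.
faces = [(a, b) for i, a in enumerate(nbrs) for b in nbrs[i + 1:]
         if np.isclose(((a - b) ** 2).sum(), 4.0)]
assert len(faces) == 5                         # five adjoining faces -> six views

def angle_deg(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

# Direction to each boundary view = centroid of the adjoining face.
angles = [angle_deg(center, (center + a + b) / 3) for a, b in faces]
```

All five center-to-boundary angles come out identical (roughly 37 degrees), confirming that the vertex-plus-faces cone gives a wide but symmetric baseline.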
3.1 Multi-View Depth Prediction
Traditional MVS methods depend on hand-crafted features such as Harris descriptors to find correspondences between views. Such features are not robust to illumination changes or non-Lambertian surfaces, making them unusable for our purposes. In addition, due to the sparse inputs and large baselines, parts of the object may be visible in as few as two views. These factors cause traditional MVS methods to fail at finding accurate correspondences, and thus to fail at reconstructing high-quality geometry.
Instead, we make use of a learning-based method to estimate depth. Given the input images, we estimate a depth map for each view. Similar to recent works on learning-based MVS [21, 51, 54], our network consists of two components: a feature extractor and a correspondence predictor. The feature extractor is a 2D U-Net [39] that extracts a feature map for each image. To estimate the depth map at a reference view, we warp the feature maps of all views to that view using a set of predefined depth levels, and build a 3D plane-sweep volume [10]
by calculating the variance of the feature maps over views. The 3D cost volume is then fed to the correspondence predictor
, a 3D U-Net that predicts the probability of each depth level. We calculate the depth as a probability-weighted sum of all depth levels. By learning both the feature representations and the correspondence, the proposed framework is more robust to illumination changes and specularities, and thus produces more accurate pixel-wise depth predictions than traditional methods.

While such networks are able to produce reasonable depth, the recovered depth has errors in textureless regions. To further improve accuracy, we add a guided filter module [47] to the network, which includes a guided map extractor as well as a guided layer. The guided map extractor takes the input image and learns a guidance map; the guided layer then filters the initial depth prediction using this guidance. The final depth map is estimated as:
D_i = g(G_i, D'_i),   (1)

where D'_i is the initial depth prediction at view i, G_i is the guidance map, and g is the guided layer.
The training loss is defined as the distance between the predicted and ground-truth depths. All components are trained jointly in an end-to-end manner.
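The soft depth regression described above (a variance-based cost volume followed by a probability-weighted sum over depth levels) can be sketched as follows. The array shapes and the use of negative variance as a matching score are stand-ins for the learned feature extractor and 3D U-Net:

```python
import numpy as np

def depth_from_cost_volume(feat_maps, depth_levels):
    """Soft depth regression over a plane-sweep volume (hypothetical shapes).

    feat_maps: (V, D, C, H, W) per-view feature maps already warped to the
               reference view at each of the D candidate depth levels.
    depth_levels: (D,) array with the predefined depth hypotheses.
    Returns an (H, W) depth map.
    """
    # Variance over views: low variance means the warped features agree,
    # i.e. the depth hypothesis is likely correct at that pixel.
    cost = feat_maps.var(axis=0).mean(axis=1)          # (D, H, W) cost volume
    # The real network maps the cost volume to per-level scores with a 3D
    # U-Net; here the negative variance serves directly as the score.
    scores = -cost
    scores -= scores.max(axis=0, keepdims=True)        # numerical stability
    prob = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    # Probability-weighted sum of the depth levels (a soft argmin).
    return (prob * depth_levels[:, None, None]).sum(axis=0)
```

When the warped features of all views agree at one depth level, the softmax concentrates there and the regressed depth approaches that level.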
3.2 Multi-View Reflectance Prediction
Estimating surface reflectance from sparse images is a highly under-constrained problem. Previous methods either assume that geometry is known [2, 3, 32, 11] or can be reconstructed with specialized devices [18] or MVS [35]. In our case, accurate geometry cannot be reconstructed from sparse inputs with traditional MVS methods. While our learning-based MVS method produces reasonable depth maps, these too contain errors, making it challenging to use them to align the images and estimate per-vertex SVBRDFs. Instead, for each input image, we first estimate its corresponding normals and SVBRDF, represented by diffuse albedo, specular roughness and specular albedo.
To estimate the SVBRDF at a view, we warp all input images to this view using the predicted depths. One approach for multi-view SVBRDF estimation could be to feed this stack of warped images to a convolutional neural network like the commonly used U-Net [32, 39]. However, inaccuracies in the depth maps lead to misalignments in the warped images, especially in occluded regions, and this architecture is not robust to these issues.

We propose a novel architecture that is robust to depth inaccuracies and occlusions. As shown in Fig. 3, our network comprises a Siamese encoder [9] and a decoder with four branches for the four SVBRDF components. To estimate the SVBRDF at a reference view i, the encoder processes pairs of inputs, each pair including the image I_i as well as a warped image I_{j→i}, obtained by warping image I_j at view j to the reference view using the predicted depth. To handle potential occlusions, directly locating occluded regions in the warped images using predicted depths and masking them out is often infeasible due to depth inaccuracies. Instead, we keep the occluded regions in the warped images and include depth information in the inputs, allowing the network to learn which regions are occluded.
To include the depth information, we draw inspiration from the commonly used shadow mapping technique [45]. The depth input consists of two components: for each pixel in the reference view i, we calculate its depth d in view j, and we also sample a depth d' from view j's depth map at the pixel's projection onto view j. Intuitively, if d is larger than d', the pixel is occluded in view j; otherwise it is not. In addition, for each pixel in the reference view i, we include the lighting direction l_i of the light at view i, as well as the lighting direction l_j of the light at view j. We assume a point light model. Since the light is collocated with the camera, including the lighting direction also includes the viewing direction of each pixel in the inputs. All directions are expressed in the coordinate system of the reference view. Such cues are critical for the network to infer surface normals from photometric information. Therefore, the input for a pair of views i and j is:

x_{i,j} = ( I_i, I_{j→i}, d, d', l_i, l_j ),   (2)

where I_{j→i} denotes image I_j warped to view i.
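The shadow-mapping-style depth cue can be illustrated with a simple pinhole-camera sketch; the function name, calling conventions, and nearest-neighbor depth sampling are our own simplifications:

```python
import numpy as np

def shadow_map_cue(depth_ref, K_ref, K_j, R, t, depth_j, eps=1e-3):
    """Depth cue for an (i, j) view pair, in the spirit of shadow mapping.

    For every pixel of the reference view i, compute
      d : its depth when expressed in view j's camera, and
      d': the depth stored in view j's depth map at its projection.
    d > d' (+eps) suggests the pixel is occluded in view j.
    """
    H, W = depth_ref.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    # Back-project reference pixels to 3D, then transform into view j.
    pts = np.linalg.inv(K_ref) @ pix * depth_ref.reshape(1, -1)
    pts_j = R @ pts + t.reshape(3, 1)
    d = pts_j[2]                                    # depth in view j's camera
    proj = K_j @ pts_j
    uj = np.clip(np.round(proj[0] / proj[2]).astype(int), 0, W - 1)
    vj = np.clip(np.round(proj[1] / proj[2]).astype(int), 0, H - 1)
    d_sampled = depth_j[vj, uj]                     # d' from view j's depth map
    occluded = d > d_sampled + eps
    return d.reshape(H, W), d_sampled.reshape(H, W), occluded.reshape(H, W)
```

For an identity relative pose and identical depth maps, d equals d' everywhere, so no pixel is flagged as occluded.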
The input thus contains several channels, and there is one such input for each source view. We feed every input x_{i,j} (Eqn. 2) to the encoder network E and obtain intermediate features f_{i,j}. These intermediate features are aggregated with a max-pooling layer, yielding a common feature representation F_i for view i:

f_{i,j} = E(x_{i,j}),   (3)

F_i = max-pool_j { f_{i,j} }.   (4)

F_i is fed to the decoder D to predict each SVBRDF component for view i:

(A_i, N_i, R_i, S_i) = D(F_i),   (5)

where A_i, N_i, R_i and S_i denote the diffuse albedo, normals, roughness and specular albedo.
Compared to directly stacking all warped images together, our proposed network architecture operates on pairs of input images and aggregates features across views using a max-pooling layer. The use of max-pooling makes the network more robust to occlusions and misalignments caused by depth inaccuracies, and produces more accurate results (see Tab. 2). It also makes the network invariant to the number and order of input views, a fact that could be exploited for unstructured capture setups. The training loss of the network is defined as:
L = L_d + L_n + L_r + L_s + L_ren,   (6)

where the first four terms are the losses for each SVBRDF component (diffuse albedo, normals, roughness and specular albedo), and L_ren is the loss between the input images and images rendered with our predictions.
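A minimal sketch of this aggregation, with max-pooling over per-pair features and simple linear maps standing in for the encoder and the four decoder branches; it also demonstrates the view-order invariance noted above:

```python
import numpy as np

def aggregate_views(pair_features):
    """Max-pool per-pair encoder features into one representation (cf. Eq. 4).

    pair_features: (V, C, H, W) array, one feature map per (reference, source)
    view pair. The max over the view axis is invariant to the order and the
    number of views.
    """
    return pair_features.max(axis=0)

def decode_svbrdf(pooled, heads):
    """Stand-in decoder: one linear 'branch' per SVBRDF component (cf. Eq. 5).

    heads maps a component name to a (C_out, C) weight matrix; the real
    decoder is a CNN with four up-sampling branches.
    """
    c, h, w = pooled.shape
    flat = pooled.reshape(c, -1)                  # C x (H*W)
    return {name: (mat @ flat).reshape(-1, h, w) for name, mat in heads.items()}
```

Because the pooled feature is an element-wise maximum, permuting the source views leaves the decoded SVBRDF unchanged.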
3.3 Geometry Reconstruction
The multi-view depth and SVBRDF estimation networks give us per-view depth and normal maps at full image resolution. We fuse these per-view estimates to reconstruct a single 3D geometry for the object. We first build a point cloud from the depth maps by generating 3D points from each pixel of every per-view depth map. For each point, we also get its corresponding normal from the estimated normal maps. Given this set of 3D points with surface normals, we perform a Poisson reconstruction [27] to obtain the fused 3D geometry. The initial point clouds may contain outliers due to inaccuracies in the depth maps. To suppress undesired structures in the output geometry, we generate a coarse initial geometry by setting the depth of the spatial octree in the Poisson reconstruction to a low value, corresponding to a coarse effective voxel resolution. We refine this initial geometry in the subsequent stage. Compared to learning-based 3D reconstruction methods that directly generate geometry (voxel grids [24, 38], implicit functions [36, 40] or triangle meshes [44]) from images, this approach generalizes to arbitrary shapes and produces more detailed reconstructions.
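The fusion step that feeds the Poisson reconstruction can be sketched as follows: back-project every pixel of every per-view depth map into world space and carry its estimated normal along. The conventions (intrinsics K, camera-to-world poses, camera-space normal maps) are assumptions for illustration:

```python
import numpy as np

def depth_maps_to_point_cloud(depths, normals, Ks, cam_to_world):
    """Fuse per-view depth/normal maps into one oriented point cloud.

    depths: list of (H, W) depth maps; normals: list of (H, W, 3) camera-space
    normal maps; Ks: per-view 3x3 intrinsics; cam_to_world: per-view 4x4 poses.
    The returned points and normals are the input to Poisson reconstruction
    (e.g. via a library implementation).
    """
    pts_all, nrm_all = [], []
    for depth, nrm, K, T in zip(depths, normals, Ks, cam_to_world):
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T    # 3 x HW
        pts_cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)         # back-project
        R, t = T[:3, :3], T[:3, 3:4]
        pts_all.append((R @ pts_cam + t).T)                             # world points
        nrm_all.append((R @ nrm.reshape(-1, 3).T).T)                    # rotate normals
    return np.concatenate(pts_all), np.concatenate(nrm_all)
```

Outlier filtering and the octree-depth choice of the Poisson solver are applied downstream of this fusion.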
3.4 SVBRDF and Geometry Refinement
Given the initial coarse geometry as well as the per-view SVBRDF predictions, we aim to construct a detailed 3D mesh with per-vertex BRDFs. For each vertex, a trivial way to get its BRDF is to blend the predicted SVBRDFs across views using predefined weights, such as the dot product of the viewing direction and surface normal. However, this leads to blurry results (Fig. 5) due to inconsistencies in the estimated SVBRDFs and geometry. Also note that our SVBRDF predictions are computed in a single feed-forward network pass, and are not guaranteed to reproduce the captured input images exactly: the network has been trained to minimize the reconstruction loss over the entire training set, not on this specific input sample.
We address these two issues with a novel rendering-based optimization that estimates per-vertex BRDFs minimizing the error between renderings of the predicted parameters and the captured images. Because of the sparse observations, independently optimizing per-vertex BRDFs leads to artifacts such as outliers and spatial discontinuities, as shown in Fig. 5. Classic inverse-rendering methods address this using hand-crafted priors. Instead, we optimize the per-view feature maps initially predicted by our SVBRDF encoder (Eqn. 4). These latent features, by virtue of the training process, capture the manifold of object reflectances and generate spatially coherent per-view SVBRDFs when passed through the decoder (Eqn. 5). Optimizing in this feature space allows us to adapt the reconstruction to the input images while leveraging the priors learnt by our multi-view SVBRDF estimation network.
Per-vertex BRDF and color. For each vertex v, we represent its BRDF b_v as a weighted average of the BRDF predictions from multiple views:

b_v = Σ_i w_{v,i} s_i(p_{v,i}),   (7)

where p_{v,i} is the corresponding pixel position of v at view i, s_i(p_{v,i}) represents the SVBRDF prediction at p_{v,i} from view i, obtained by processing the latent features via the decoder network, and w_{v,i} are the per-vertex view blending weights. The rendered color of v at view i is calculated as:

c_{v,i} = R(b_v, n_v, l_{v,i}),   (8)

where l_{v,i} is the lighting direction (and also the viewing direction) of vertex v at view i, n_v is the vertex normal, and R is the rendering equation. We assume a point light source collocated with the camera (which allows us to ignore shadows), and only consider direct illumination in the rendering equation.
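A sketch of this direct-illumination rendering under a collocated point light, where the half vector coincides with the lighting/viewing direction; the GGX-style lobe and the omitted Fresnel and geometry terms are our simplifications of the full BRDF model:

```python
import numpy as np

def render_collocated(normal, light_dir, dist, diffuse, specular, roughness):
    """Shade points under a point light collocated with the camera.

    With the light collocated with the camera, the half vector equals the
    lighting/viewing direction, so the specular lobe depends only on n.l,
    and shadows can be ignored (anything the camera sees is lit).
    """
    n_dot_l = np.clip((normal * light_dir).sum(-1, keepdims=True), 0.0, 1.0)
    a2 = np.maximum(roughness, 1e-3) ** 4              # alpha^2, alpha = r^2
    ggx = a2 / (np.pi * (n_dot_l ** 2 * (a2 - 1.0) + 1.0) ** 2)  # NDF at h = l
    brdf = diffuse / np.pi + specular * ggx
    falloff = np.maximum(dist, 1e-6)[..., None] ** 2   # inverse-square light
    return brdf * n_dot_l / falloff
```

With zero specular albedo, a frontal normal and unit distance, the result reduces to the Lambertian term diffuse/pi.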
Per-view warping. A vertex v can be projected onto view i using the camera calibration; we refer to this projection as p_{v,i}. However, the pixel projections onto multiple views might be inconsistent due to inaccuracies in the reconstructed geometry. Inspired by Zhou et al. [56], we apply a non-rigid warp to each view to better align the projections. In particular, for each input view we use a coarse grid of control points to construct a smooth warping field over the image plane. Let t_{i,k} be the translation vector of control point k at view i. The resulting pixel projection p'_{v,i} is given by:

p'_{v,i} = p_{v,i} + Σ_k B_k(p_{v,i}) t_{i,k},   (9)

where B_k(p) returns the bilinear weight of control point k at pixel location p.
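This non-rigid warp can be sketched with hat-function (bilinear) weights over a regular control grid; the names and grid conventions are ours:

```python
import numpy as np

def warped_projection(p, translations, grid_xy, cell):
    """Displace a pixel projection by a bilinear blend of control-point
    translations. `grid_xy` lists control-point locations on a regular grid
    with spacing `cell`; `translations` holds one 2-vector per control point.
    """
    offset = np.zeros(2)
    for (gx, gy), t in zip(grid_xy, translations):
        wx = max(0.0, 1.0 - abs(p[0] - gx) / cell)   # bilinear hat weight in x
        wy = max(0.0, 1.0 - abs(p[1] - gy) / cell)   # bilinear hat weight in y
        offset += wx * wy * np.asarray(t)
    return p + offset
```

Zero translations leave every projection unchanged, and a pixel sitting exactly on a control point receives that point's full translation.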
SVBRDF optimization. We optimize the per-view latent features, the per-vertex blending weights w_{v,i} and the per-view warping fields to reconstruct the final SVBRDFs. The photometric consistency loss between the rendered and ground-truth colors over all vertices is given by:

E_photo = Σ_v Σ_i || c_{v,i} − c'_{v,i} ||²,

where c_{v,i} is the rendered color of vertex v at view i (Eqn. 8) and c'_{v,i} is the corresponding observed color sampled from input image i.
We clamp the rendered colors to the valid intensity range before calculating the loss. To prevent the non-rigid warping from drifting, we also add a regularizer that penalizes the norm of the translation vectors:

E_warp = Σ_i Σ_k || t_{i,k} ||².   (10)

Therefore, the final energy function for the optimization is:

E = E_photo + λ E_warp,   (11)

where E_photo is the photometric consistency loss defined above and λ balances the two terms.
We set the regularization weight to a fixed value and optimize the energy function with the Adam optimizer [28].
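The structure of this refinement can be illustrated with a toy version in which a linear map stands in for the decoder-plus-renderer; the real pipeline instead backpropagates through the CNN decoder and the differentiable rendering layer:

```python
import numpy as np

def refine_latent(z0, decode_W, render_A, target, lr=0.1, iters=200):
    """Toy latent-space refinement: gradient descent on a latent code z so
    that render(decode(z)) matches the captured colors.

    decode_W and render_A are linear stand-ins for the decoder and renderer;
    `target` plays the role of the observed vertex colors.
    """
    z = z0.copy()
    M = render_A @ decode_W                      # composed linear "pipeline"
    for _ in range(iters):
        resid = M @ z - target                   # photometric residual
        z -= lr * (M.T @ resid)                  # gradient of 0.5 * ||resid||^2
    return z
```

Because the photometric error is driven to zero through the (stand-in) decoder, the optimized latent code reproduces the target observations.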
Geometry optimization. We use the optimized per-vertex normals to update the geometry of the object by re-solving the Poisson equation (Sec. 3.3). Unlike the initial geometry reconstruction, we set the depth of the spatial octree to a larger value, corresponding to a finer voxel resolution, to better capture the fine-grained details of the object. We use this updated geometry in subsequent SVBRDF optimization iterations, updating the geometry once every fixed number of SVBRDF optimization iterations.
Per-vertex refinement. The bottleneck in our multi-view SVBRDF network—which we use as our reflectance representation—may cause a loss of high-frequency detail in the predicted SVBRDFs. We recover these details by directly optimizing the BRDF parameters of each vertex to minimize the photometric loss of Sec. 3.4. Note that after the previous optimization, the estimated BRDFs have already converged to good results and the rendered images are very close to the input images. Therefore, in this stage, we use a small learning rate and perform only a small number of iterations.
4 Implementation and Results
Training data. We follow Xu et al. [51] and procedurally generate complex scenes by combining primitive shapes such as cylinders and cubes displaced by random height maps. We generate separate sets of training and testing scenes. We divide the high-quality materials from the Adobe Stock dataset (https://stock.adobe.com/search/3dassets) into a training and a testing set, and use them to texture the corresponding scenes. For each scene, following the setup discussed in Sec. 1, we render the 6 input-view images using a custom OptiX-based global illumination renderer with 1000 samples per pixel. We also render the ground-truth depth, normals, and SVBRDF components for each view.
Network architecture. For depth estimation, we use a 2D U-Net architecture [39] for both the feature extractor and the guidance map extractor; both networks are built from matching downsampling/upsampling blocks. The correspondence predictor is a 3D U-Net, likewise built from downsampling/upsampling blocks. For multi-view SVBRDF estimation, both the encoder and the decoder are 2D CNNs, with downsampling layers in the encoder and upsampling layers in the decoder. Note that we do not use skip connections in the SVBRDF network; this forces the latent features to learn a meaningful reflectance space and allows us to optimize them in our refinement step. We use group normalization [49] in all networks. We use a differentiable rendering layer that computes local shading under point lighting without considering visibility or global illumination; this is a reasonable approximation in our collocated lighting setup. For more details, please refer to the supplementary document.
Training details. All networks are trained with the Adam optimizer [28]. Both the depth estimation networks and the SVBRDF estimation networks are trained on cropped patches. Training took around four days on NVIDIA 2080 Ti GPUs.
Runtime. Our implementation has not been optimized for timing efficiency. In practice, our method takes on the order of minutes for a full reconstruction from the six input images, with most of the time spent on geometry fusion and optimization.
4.1 Evaluation on Synthetic Data
We evaluate our max-pooling-based multi-view SVBRDF estimation network on our synthetic test set. In particular, we compare it with a baseline U-Net (with 5 downsampling/upsampling blocks) that takes the stack of all coarsely aligned images (Eqn. 2) as input to its encoder, with skip connections from the encoder to the four SVBRDF decoders. This architecture has been widely used for SVBRDF estimation [11, 32, 33]. As can be seen in Tab. 2, while our diffuse albedo prediction is marginally (0.0001) worse than the U-Net's, we significantly outperform it in specular albedo, roughness and normal predictions, with roughly 31%, 23% and 10% lower loss, respectively. This is in spite of not using skip connections in our network (to allow for optimization later in our pipeline). We also compare our results with the state-of-the-art single-image shape and SVBRDF estimation method of Li et al. [33]. Unsurprisingly, we outperform them significantly, demonstrating the usefulness of aggregating multi-view information.
Method          Diffuse  Normal  Roughness  Specular
Naive U-Net     0.0060   0.0336  0.0359     0.0125
Ours            0.0061   0.0304  0.0275     0.0086
Li et al. [33]  0.0227   0.1075  0.0661     —
Ours ()         0.0047   0.0226  0.0257     0.0083
4.2 Evaluation on Real Captured Data
We evaluate our method on real data captured using a gantry with a FLIR camera and a nearly collocated light to mimic our capture setup. Please refer to the supplementary material for additional results.
Evaluation of geometry reconstruction. Our framework combines the predicted depths and normals to reconstruct the initial mesh. Figure 4 shows a comparison between our reconstructed mesh and the mesh from COLMAP [41], a state-of-the-art multi-view stereo framework. From such sparse inputs and low-texture surfaces, COLMAP is not able to find reliable correspondences across views, which results in a noisy, incomplete 3D mesh. In contrast, our initial mesh is already more complete and detailed, a result of our more accurate depths and normals. Our joint optimization further refines the per-vertex normals and recovers fine-scale detail in the object geometry.
Evaluation of SVBRDF optimization. We compare our SVBRDF and geometry optimization scheme (Sec. 3.4) with averaging the per-view predictions using weights based on the angle between the viewpoint and the surface normal, as well as with this averaging followed by per-vertex optimization. From Fig. 5 we can see that weighted averaging produces blurry results. Optimizing the per-vertex BRDFs brings back detail, but also introduces spurious discontinuities in appearance because of the lack of any regularization. In contrast, our latent-space optimization recovers detailed appearance without these artifacts.
Comparisons against Nam et al. [35]. We also compare our work with the state-of-the-art geometry and reflectance reconstruction method of Nam et al. Their method captures 60+ images of an object with a handheld camera under collocated lighting; they first use COLMAP [41] to reconstruct the coarse shape and use it to bootstrap a physics-based optimization that recovers per-vertex normals and BRDFs. COLMAP cannot generate complete meshes from our sparse inputs (see Fig. 4). Therefore, we provided our input images, camera calibration and initial geometry to the authors, who processed this data. As can be seen in Fig. 6, our final reconstructed geometry has significantly more detail than their final optimized result, in spite of starting from the same initialization. Since they use a different BRDF representation than ours, making direct SVBRDF comparisons difficult, in Fig. 7 we compare renderings of the reconstructed objects under novel lighting and viewpoint. These results show that they cannot handle our sparse input: they produce noisy, erroneous reflectance (Cat scene) or are unable to recover the specular highlights of highly specular objects (Cactus scene). In comparison, our results have significantly higher visual fidelity. Please refer to the supplementary video for more renderings.
More results on real data. Figure 8 shows results from our method on additional real scenes. Our method reconstructs detailed geometry and appearance for objects with a wide variety of complex shapes and reflectance. Comparing renderings of our estimates under novel camera and collocated lighting against ground-truth captured photographs demonstrates the accuracy of our reconstructions. We can also photorealistically render these objects under novel environment illumination. Please refer to the supplementary document and video for more results.
Limitations. Our method might fail on highly non-convex objects, where some parts are visible in as few as a single view and there are no correspondence cues to infer correct depth. In addition, we do not consider global illumination in the SVBRDF optimization. While this is a reasonable approximation in most cases, it might fail for scenes with strong interreflections. For future work, it would be interesting to combine our method with physics-based differentiable rendering [30, 55] to handle these complex light transport effects.
5 Conclusion
We have proposed a learning-based framework to reconstruct the geometry and appearance of an arbitrary object from a sparse set of just six images. We predict per-view depth using learning-based MVS, and design a novel multi-view reflectance estimation network that robustly aggregates information from our sparse views for accurate normal and SVBRDF estimation. We further propose a novel joint optimization in a latent feature space to fuse and refine our multi-view predictions. Unlike previous methods that require densely sampled images, our method produces high-quality reconstructions from a sparse set of images, and presents a step towards practical appearance capture for 3D scanning and VR/AR applications.
Acknowledgements This work was supported in part by NSF grant 1617234, ONR grants N000141712687, N000141912293, Adobe and the UC San Diego Center for Visual Computing.
References
 [1] (2018) Learning representations and generative models for 3D point clouds. In ICML, pp. 40–49.
 [2] (2016) Reflectance modeling by neural texture synthesis. ACM Transactions on Graphics 35 (4), pp. 65:1–65:13.
 [3] (2015) Two-shot SVBRDF capture for stationary materials. ACM Transactions on Graphics 34 (4), pp. 110:1–110:13.
 [4] (2008) Photometric stereo with non-parametric and spatially-varying reflectance. In CVPR, pp. 1–8.
 [5] (2018) Simultaneous acquisition of polarimetric SVBRDF and normals. ACM Transactions on Graphics 37 (6).
 [6] (2012) Physically-based shading at Disney. In ACM SIGGRAPH 2012 Courses, pp. 10:1–10:7.
 [7] (2018) Self-calibrating deep photometric stereo networks. In ECCV.
 [8] (2018) PS-FCN: a flexible learning framework for photometric stereo. In ECCV.
 [9] (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, pp. 539–546.
 [10] (1996) A space-sweep approach to true multi-image matching. In CVPR, pp. 358–363.
 [11] (2018) Single-image SVBRDF capture with a rendering-aware deep network. ACM Transactions on Graphics 37 (4), pp. 128.
 [12] (2019) Flexible SVBRDF capture with a multi-image deep network. Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering) 38 (4).
 [13] (2014) Appearance-from-motion: recovering spatially varying surface reflectance under unknown lighting. ACM Transactions on Graphics 33 (6), pp. 193.
 [14] (2015) Multi-view stereo: a tutorial. Foundations and Trends in Computer Graphics and Vision 9 (1–2), pp. 1–148.
 [15] (2009) Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (8), pp. 1362–1376.
 [16] (2019) Deep inverse rendering for high-resolution SVBRDF estimation from an arbitrary number of images. ACM Transactions on Graphics 38 (4), pp. 134.
 [17] (2009) Shape and spatially-varying BRDFs from photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (6), pp. 1060–1071.
 [18] (2010) A coaxial optical scanner for synchronous acquisition of 3D geometry and surface reflectance. ACM Transactions on Graphics 29 (4), pp. 99.
 [19] (2018) DeepMVS: learning multi-view stereopsis. In CVPR, pp. 2821–2830.
 [20] (2017) Reflectance capture using univariate sampling of BRDFs. In ICCV, pp. 5362–5370.
 [21] (2019) DPSNet: end-to-end deep plane sweep stereo. In ICLR.
 [22] (2017) SurfaceNet: an end-to-end 3D neural network for multiview stereopsis. In ICCV, pp. 2307–2315.
 [23] (2018) Efficient reflectance capture using an autoencoder. ACM Transactions on Graphics 37 (4), pp. 127. Cited by: §2.
 [24] (2017) Learning a multi-view stereo machine. In NIPS, pp. 365–376. Cited by: §3.3.
 [25] (2013) Real shading in Unreal Engine 4. Cited by: Appendix A, §1.
 [26] (2006) Poisson surface reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing, Vol. 7. Cited by: §1.
 [27] (2013) Screened Poisson surface reconstruction. ACM Transactions on Graphics 32 (3), pp. 29. Cited by: §3.3.
 [28] (2014) Adam: a method for stochastic optimization. In ICLR. Cited by: §3.4, §4.
 [29] (2016) Shading-aware multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: §2.
 [30] (2018) Differentiable Monte Carlo ray tracing through edge sampling. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 37 (6), pp. 222:1–222:11. Cited by: §4.2.
 [31] (2017) Modeling surface appearance from a single photograph using self-augmented convolutional neural networks. ACM Trans. Graph. 36 (4), pp. 45:1–45:11. Cited by: §2.
 [32] (2018) Materials for masses: SVBRDF acquisition with a single mobile phone image. In ECCV, pp. 72–87. Cited by: §1, §2, §3.2, §3, §4.1.
 [33] (2018) Learning to reconstruct shape and spatially-varying reflectance from a single image. In SIGGRAPH Asia 2018, pp. 269. Cited by: Appendix C, Figure 13, Figure 14, Figure 15, Figure 16, §1, §2, §4.1, Table 1.
 [34] (2003) A data-driven reflectance model. ACM Transactions on Graphics 22 (3), pp. 759–769. Cited by: §2.
 [35] (2018) Practical SVBRDF acquisition of 3D objects with unstructured flash photography. In SIGGRAPH Asia 2018, pp. 267. Cited by: Appendix D, Figure 10, §1, §2, §3.2, Figure 6, Figure 7, §4.2.
 [36] (2019) DeepSDF: learning continuous signed distance functions for shape representation. In CVPR, pp. 165–174. Cited by: §3.3.
 [37] (2018) Matryoshka networks: predicting 3D geometry via nested shape layers. In CVPR, pp. 1936–1944. Cited by: §2.
 [38] (2017) OctNetFusion: learning depth fusion from data. In 2017 International Conference on 3D Vision, pp. 57–66. Cited by: §3.3.
 [39] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §1, §3.1, §3.2, §4.
 [40] (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV. Cited by: §3.3.
 [41] (2016) Pixelwise view selection for unstructured multi-view stereo. In ECCV. Cited by: §1, §2, §4.2.
 [42] (2013) Acquiring reflectance and shape from continuous spherical harmonic illumination. ACM Transactions on Graphics 32 (4), pp. 109. Cited by: §1, §2.
 [43] (2018) MVPNet: multi-view point regression networks for 3D object reconstruction from a single image. arXiv preprint arXiv:1811.09410. Cited by: §2.
 [44] (2018) Pixel2Mesh: generating 3D mesh models from single RGB images. In ECCV, pp. 52–67. Cited by: §3.3.
 [45] (1978) Casting curved shadows on curved surfaces. In SIGGRAPH, Vol. 12, pp. 270–274. Cited by: §3.2.
 [46] (2015) Simultaneous localization and appearance estimation with a consumer RGB-D camera. IEEE Transactions on Visualization and Computer Graphics 22 (8), pp. 2012–2023. Cited by: §1.
 [47] (2018) Fast end-to-end trainable guided filter. In CVPR, pp. 1838–1847. Cited by: Appendix B, §3.1.
 [48] (2017) MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS, pp. 540–550. Cited by: §2.
 [50] (2016) Recovering shape and spatially-varying surface reflectance under unknown illumination. ACM Transactions on Graphics 35 (6), pp. 187. Cited by: §1, §2.
 [51] (2019) Deep view synthesis from sparse photometric images. ACM Transactions on Graphics 38 (4), pp. 76. Cited by: Appendix B, §1, §2, §3.1, §3, §4.
 [52] (2016) Minimal BRDF sampling for two-shot near-field reflectance acquisition. ACM Transactions on Graphics 35 (6), pp. 188. Cited by: §2.
 [53] (2018) Deep image-based relighting from optimal sparse samples. ACM Transactions on Graphics 37 (4), pp. 126. Cited by: §1.
 [54] (2018) MVSNet: depth inference for unstructured multi-view stereo. In ECCV, pp. 767–783. Cited by: §1, §2, §3.1.
 [55] (2019) A differential theory of radiative transfer. ACM Trans. Graph. 38 (6). Cited by: §4.2.
 [56] (2014) Color map optimization for 3D reconstruction with consumer depth cameras. ACM Transactions on Graphics 33 (4), pp. 155. Cited by: §3.4.
 [57] (2016) Sparse-as-possible SVBRDF acquisition. ACM Transactions on Graphics 35 (6), pp. 189. Cited by: §1, §2.
Appendix A BRDF Model
We use a simplified version of the Disney BRDF model [6] proposed by Karis et al. [25]. Let $a$, $n$, $r$ and $s$ be the diffuse albedo, normal, roughness and specular albedo respectively, let $l$ and $v$ be the light and view directions, and let $h$ be their half vector. Our BRDF model is defined as:

\[ f(a, n, r, s; l, v) = \frac{a}{\pi} + \frac{D(h, r)\, F(v, h, s)\, G(l, v, r)}{4\,(n \cdot l)(n \cdot v)} \tag{12} \]

where $D$, $F$ and $G$ are the normal distribution, Fresnel and geometric terms respectively. Following Karis et al. [25], these terms are defined as:

\[ D(h, r) = \frac{\alpha^2}{\pi \big( (n \cdot h)^2 (\alpha^2 - 1) + 1 \big)^2}, \qquad \alpha = r^2, \]
\[ F(v, h, s) = s + (1 - s)\, 2^{(-5.55473\,(v \cdot h) - 6.98316)\,(v \cdot h)}, \]
\[ G(l, v, r) = G_1(l)\, G_1(v), \qquad G_1(x) = \frac{n \cdot x}{(n \cdot x)(1 - k) + k}, \qquad k = \frac{(r + 1)^2}{8}. \]
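To make the model concrete, the evaluation can be sketched in a few lines of NumPy. This is our own illustrative implementation of the Karis/UE4 formulation, not code from the paper; the argument names (`albedo`, `roughness`, etc.) stand in for the paper's symbols, which were lost in extraction.

```python
import numpy as np

def disney_brdf(albedo, normal, roughness, specular, l, v):
    """Evaluate the simplified Disney BRDF (Karis/UE4 variant).

    All direction vectors are unit-length 3-vectors; `albedo` is an
    RGB triple, `roughness` and `specular` are scalars.
    """
    h = (l + v) / np.linalg.norm(l + v)          # half vector
    n_l = max(np.dot(normal, l), 1e-6)
    n_v = max(np.dot(normal, v), 1e-6)
    n_h = np.dot(normal, h)
    v_h = np.dot(v, h)

    # Normal distribution term D (GGX), with alpha = roughness^2.
    alpha = roughness ** 2
    D = alpha ** 2 / (np.pi * ((n_h ** 2) * (alpha ** 2 - 1) + 1) ** 2)

    # Fresnel term F (Schlick's spherical-Gaussian approximation).
    F = specular + (1 - specular) * 2 ** ((-5.55473 * v_h - 6.98316) * v_h)

    # Geometric term G (Schlick-GGX with k = (r + 1)^2 / 8).
    k = (roughness + 1) ** 2 / 8
    g1 = lambda n_x: n_x / (n_x * (1 - k) + k)
    G = g1(n_l) * g1(n_v)

    # Diffuse lobe plus microfacet specular lobe.
    return np.asarray(albedo) / np.pi + D * F * G / (4 * n_l * n_v)
```

With a collocated light and camera (as in our capture setup), `l == v`, so the half vector coincides with the viewing direction.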
Appendix B Network Architecture
We discussed the motivation, design and core components of our depth prediction network and SVBRDF prediction network in Sec. 3.1 and Sec. 3.2 of the paper. We now describe the network architectures in detail, as shown in Fig. 9.
Depth prediction network. As discussed in Sec. 3.1 of the paper, the depth prediction network consists of three parts: the feature extractor, the correspondence predictor and the guidance map extractor. The feature extractor and the correspondence predictor are used to predict the initial depth map; the guidance map extractor is applied to refine the initial depth map with a guided filter [47] and obtain the final depth. The first row of Fig. 9 shows the details of these sub-networks.
We use the feature extractor and the correspondence predictor to regress the initial depth, similar to [51]. In particular, the feature extractor is a 2D U-Net that consists of multiple downsampling and upsampling convolutional layers with skip links, group normalization (GN) [49] layers and ReLU activation layers; it extracts per-view image feature maps with 16 channels.
To predict the depth at a reference view, we uniformly sample 128 fronto-parallel depth planes in front of that view within a predefined depth range that covers the target object. We project the feature maps from all views onto every depth plane using homography-based warping to construct the plane-sweep volume of the reference view. We then build a cost volume by computing the variance of the warped feature maps over the views at each plane. The correspondence predictor is a 3D U-Net that processes this cost volume; it has multiple downsampling and upsampling 3D convolutional layers with skip links, GN layers and ReLU layers. Its output is a 1-channel volume, over which we apply a softmax across the depth planes to obtain per-plane depth probability maps; these maps give the probability that a pixel's depth equals each plane's depth. A depth map is then regressed by linearly combining the per-plane depth values weighted by the per-plane probability maps:
\[ \hat{D}(p) = \sum_{j=1}^{128} P_j(p)\, d_j \tag{13} \]

where $d_j$ is the depth of the $j$-th plane and $P_j(p)$ is the probability of pixel $p$ lying on plane $j$.
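This probability-weighted combination is a standard soft-argmin over the depth planes. A minimal NumPy sketch, assuming the per-plane matching scores have already been produced by the correspondence predictor (the function and argument names are ours):

```python
import numpy as np

def regress_depth(scores, plane_depths):
    """Soft-argmin depth regression over a plane-sweep volume.

    scores:       (D, H, W) per-plane matching scores (higher = better),
                  standing in for the 3D U-Net's 1-channel output volume.
    plane_depths: (D,) sampled depth values of the fronto-parallel planes.
    """
    # Softmax across the depth-plane axis -> per-plane probability maps.
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    prob = e / e.sum(axis=0, keepdims=True)
    # Expected depth: probability-weighted sum of plane depths, per pixel.
    return np.tensordot(plane_depths, prob, axes=1)   # (H, W) depth map
```

Because both the softmax and the weighted sum are differentiable, this regression can be trained end-to-end with the rest of the network.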
We apply the guidance map extractor to refine the initial depth. The guidance map extractor is a 2D U-Net that outputs a 1-channel feature map, which we use as a guidance map to filter the initial depth and obtain the final depth.
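For reference, the classical guided filter expresses the output as a locally linear function of the guidance map; the trainable layer of [47] used in our pipeline learns its components end-to-end, but the fixed-parameter version below (with hypothetical `radius` and `eps` values) illustrates the idea:

```python
import numpy as np

def box_mean(x, r):
    """Mean filter with a (2r+1)^2 window, edge-padded, via cumulative sums."""
    k = 2 * r + 1
    p = np.pad(x, r, mode="edge")
    s = np.vstack([np.zeros((1, p.shape[1])), p.cumsum(axis=0)])
    rows = s[k:] - s[:-k]                                  # sums over k rows
    t = np.hstack([np.zeros((rows.shape[0], 1)), rows.cumsum(axis=1)])
    return (t[:, k:] - t[:, :-k]) / (k * k)

def guided_filter(guide, depth, radius=4, eps=1e-4):
    """Classical guided filter: fit depth ~ a * guide + b in local windows."""
    mean_g = box_mean(guide, radius)
    mean_d = box_mean(depth, radius)
    var_g = box_mean(guide * guide, radius) - mean_g ** 2
    cov_gd = box_mean(guide * depth, radius) - mean_g * mean_d
    a = cov_gd / (var_g + eps)          # local linear coefficients
    b = mean_d - a * mean_g
    # Average the coefficients, then apply the linear model per pixel.
    return box_mean(a, radius) * guide + box_mean(b, radius)
```

Smoothing follows the structure of the guidance map, which is why a learned guidance map can transfer image edges into the refined depth.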
SVBRDF prediction network. We discussed the SVBRDF prediction network in Sec. 3.2 and showed its overall architecture, inputs and outputs in Fig. 2 and Fig. 3 of the paper. We now describe the details of the encoder and the SVBRDF decoders, shown in the bottom row of Fig. 9. Specifically, the encoder consists of a set of convolutional layers, each followed by GN and ReLU layers; strided convolutional layers are used to downsample the feature maps three times. The decoder upsamples the feature maps three times with nearest-neighbor upsampling and applies convolutional, GN and ReLU layers to process the feature maps at each upsampling level. As discussed in Sec. 3.2 of the paper, we apply four decoders with the same architecture, all connected to the same encoder, to regress the three BRDF components and the normal map at each input view.
Appendix C Comparison on SVBRDF Prediction
In Sec. 4.1 and Tab. 1 of the paper, we presented quantitative comparisons on synthetic data between our network, a naïve U-Net and the single-image SVBRDF prediction network of Li et al. [33]. We now show qualitative comparisons between these methods on both synthetic and real examples in Fig. 13, Fig. 14, Fig. 15 and Fig. 16. From these figures, we can see that the naïve U-Net produces noisy normals and the single-view method [33] produces normals with very few details, whereas our predicted normals are of much higher quality, especially in regions with severe occlusions (indicated by the red arrows). Moreover, as the comparisons on synthetic data in Fig. 13 and Fig. 14 show, our predictions are more accurate and more consistent with the ground truth than those of the other methods. These results demonstrate that our network architecture (see Sec. 3.2 in the paper) effectively aggregates multi-view information and leads to high-quality per-view SVBRDF estimation.
Appendix D Comparison on Geometry Reconstruction
In Fig. 6 of the paper, we compare our optimized geometry against the optimized result from Nam et al. [35], which uses the same initial geometry as ours. We show additional comparisons on real data in Fig. 10. As in the comparison in the paper, our optimized geometry is of much higher quality than that of Nam et al., with more fine-grained details and fewer artifacts.
Appendix E Additional Ablation Study
In this section, we present additional experiments that justify the design choices in our pipeline, including the input variants of the SVBRDF estimation network, the non-rigid warping and the per-vertex refinement.
Network input                    Diffuse  Normal  Roughness  Specular
Warped images only               0.0081   0.0456  0.0379     0.0098
Warped + reference image         0.0071   0.0363  0.0304     0.0109
Warped + reference + depth       0.0063   0.0321  0.0306     0.0098
Warped + reference + view dirs.  0.0061   0.0304  0.0299     0.0093
Ours full                        0.0061   0.0304  0.0275     0.0086
Network inputs. Our SVBRDF network takes as input the reference image, the warped images, the light/view direction maps (the light and view are collocated), and the depth maps (please refer to Sec. 3.2 of the paper for details of these input components). We verify the effectiveness of these inputs by training and comparing multiple networks that use different subsets of them. In particular, we compare our full model against a network that uses only the warped images, a network that uses the warped images and the reference image, a network that uses the reference image, warped images and the depth maps, and a network that uses the reference image, warped images and the viewing directions. Table 2 shows quantitative comparisons between these networks on the synthetic test set. The network using both the reference and warped images improves accuracy on most terms over the one using only the warped images, which reflects the benefit of incorporating multi-view cues in the encoder network. On top of the image inputs, the two networks that add the depth information and the viewing directions both perform better than the image-only versions, leveraging visibility cues and photometric cues respectively. Our full model exploits both cues from the multi-view inputs and achieves the best performance.
Per-view warping. Due to potential inaccuracies in the geometry, the pixel colors of a vertex seen from different views may not be consistent. Directly minimizing the difference between the rendered color and the pixel color of each view then leads to ghosting artifacts, as shown in Fig. 11. To solve this problem, we apply a non-rigid warping to each view. As Fig. 11 shows, the non-rigid warping effectively resolves the misalignments and leads to sharper edges.
Per-vertex refinement. As shown in Fig. 12, the image rendered using the estimated SVBRDF without per-vertex refinement loses high-frequency details, such as the tiny spots on the pumpkin, due to the bottleneck in our SVBRDF network. In contrast, the proposed per-vertex refinement successfully recovers these details and reproduces the appearance of the object more faithfully.