Deep Reflectance Volumes: Relightable Reconstructions from Multi-View Photometric Images

07/20/2020 ∙ by Sai Bi, et al.

We present a deep learning approach to reconstruct scene appearance from unstructured images captured under collocated point lighting. At the heart of Deep Reflectance Volumes is a novel volumetric scene representation consisting of opacity, surface normal and reflectance voxel grids. We present a novel physically-based differentiable volume ray marching framework to render these scene volumes under arbitrary viewpoint and lighting. This allows us to optimize the scene volumes to minimize the error between their rendered images and the captured images. Our method is able to reconstruct real scenes with challenging non-Lambertian reflectance and complex geometry with occlusions and shadowing. Moreover, it accurately generalizes to novel viewpoints and lighting, including non-collocated lighting, rendering photorealistic images that are significantly better than state-of-the-art mesh-based methods. We also show that our learned reflectance volumes are editable, allowing for modifying the materials of the captured scenes.


1 Introduction

Capturing a real scene and re-rendering it under novel lighting conditions and viewpoints is one of the core challenges in computer vision and graphics. This is classically done by reconstructing the 3D scene geometry, typically in the form of a mesh, and computing per-vertex colors or reflectance parameters to support arbitrary re-rendering. However, 3D reconstruction methods like multi-view stereo are prone to errors in textureless and non-Lambertian regions [37, 47], and accurate reflectance acquisition usually requires dense, calibrated capture using sophisticated devices [5, 56].

Recent works have proposed learning-based approaches to capture scene appearance. One class of methods uses surface-based representations [15, 20] but is restricted to specific scene categories and cannot synthesize photo-realistic images. Other methods bypass explicit reconstruction, instead focusing on the sub-problems of relighting [59] or view synthesis [31, 57].

Figure 1:

Given a set of images taken with a mobile phone and its built-in flashlight (sample images shown in (a)), our method learns a volume representation of the captured object by estimating an opacity volume, a normal volume (b), and reflectance volumes such as albedo (c) and roughness (d). Our volume representation enables free navigation of the object under arbitrary viewpoints and novel lighting conditions (e).

Our goal is to make high-quality scene acquisition and rendering practical with off-the-shelf devices under mildly controlled conditions. We use a set of unstructured images captured around a scene by a single mobile phone camera with flash illumination in a dark room. This practical setup acquires multi-view images under collocated viewing and lighting directions—referred to as photometric images [57]. While the high-frequency appearance variation in these images (due to sharp specular highlights and shadows) can result in low-quality mesh reconstruction from state-of-the-art methods (see Fig. 3), we show that our method can accurately model the scene and realistically reproduce complex appearance information like specularities and occlusions.

At the heart of our method is a novel, physically-based neural volume rendering framework. We train a deep neural network that simultaneously learns the geometry and reflectance of a scene as volumes. We leverage a decoder-like network architecture, where an encoding vector together with the corresponding network parameters are learned during a per-scene optimization (training) process. Our network decodes a volumetric scene representation consisting of opacity, normal, diffuse color and roughness volumes, which model the global geometry, local surface orientations and spatially-varying reflectance parameters of the scene, respectively. These volumes are supplied to a differentiable rendering module to render images with collocated light-view settings at training time, and arbitrary light-view settings at inference time (see Fig. 2).

We base our differentiable rendering module on classical volume ray marching with opacity (alpha) accumulation and compositing [24, 53]. In particular, we compute point-wise shading using local normal and reflectance properties, and accumulate the shaded colors with opacities along each marching ray of sight. Unlike the opacity used in previous view synthesis work [31, 63], which is accumulated only along view directions, we propose to learn a global scene opacity that can be accumulated from both view and light directions. As shown in Fig. 1, our scene opacity can be effectively learned and used to compute accurate hard shadows under novel lighting, even though the training process never observes shadowed images taken under non-collocated view-light setups. Moreover, unlike previous volume-based works [31, 63] that learn a single color at each voxel, we reconstruct per-voxel reflectance and handle complex materials with high glossiness. Our neural rendering framework thus enables rendering with complex view-dependent and light-dependent shading effects including specularities, occlusions and shadows. We compare against a state-of-the-art mesh-based method [37], and demonstrate that our method achieves more accurate reconstructions and renderings (see Fig. 3). We also show that our approach supports scene material editing by modifying the reconstructed reflectance volumes (see Fig. 8). To summarize, our contributions are:

  1. A practical neural rendering framework that reproduces high-quality geometry and appearance from unstructured mobile phone flash images and enables view synthesis, relighting, and scene editing.

  2. A novel scene appearance representation using opacity, normal and reflectance volumes.

  3. A physically-based differentiable volume rendering approach based on deep priors that can effectively reconstruct the volumes from input flash images.

2 Related Work

Geometry reconstruction. There is a long history of reconstructing 3D geometry from images using traditional structure-from-motion and multi-view stereo (MVS) pipelines [13, 25, 47]. Recently, deep learning techniques have also been applied to 3D reconstruction with various representations, including volumes [18, 45], point clouds [1, 42, 52], depth maps [16, 60] and implicit functions [10, 35, 40]. We aim to model scene geometry for realistic image synthesis, for which mesh-based reconstruction [23, 32, 38] is the most common approach in many applications [6, 37, 44, 62]. However, it remains challenging to reconstruct accurate meshes for scenes with textureless regions and thin structures, and it is hard to incorporate a mesh into a deep learning framework [26, 30]; the few mesh-based deep learning works [15, 20] are limited to category-specific reconstruction and cannot produce photo-realistic results. Instead, we leverage a physically-based opacity volume representation that can be easily embedded in a deep learning system to express scene geometry of arbitrary shapes.

Reflectance acquisition. Reflectance of real materials is classically measured using sophisticated devices to densely acquire light-view samples [12, 33], which is impractical for common users. Recent works have improved the practicality with fewer samples [39, 58] and more practical devices (mobile phones) [2, 3, 17, 28]; however, most of them focus on flat planar objects. A few single-view techniques based on photometric stereo [4, 14] or deep learning [29] are able to handle arbitrary shape, but they merely recover limited single-view scene content. To recover complete shape with spatially varying BRDF from multi-view inputs, previous works usually rely on a pre-reconstructed initial mesh and images captured under complex controlled setups to reconstruct per-vertex BRDFs [7, 21, 54, 56, 64]. While a recent work [37] uses a mobile phone for practical acquisition like ours, it still requires MVS-based mesh reconstruction, which is ineffective for challenging scenes with textureless, specular and thin-structure regions. In contrast, we reconstruct spatially varying volumetric reflectance via deep network based optimization; we avoid using any initial geometry and propose to jointly reconstruct geometry and reflectance in a holistic framework.

Relighting and view synthesis. Image-based techniques have been extensively explored in graphics and vision to synthesize images under novel lighting and viewpoint without explicit complete reconstruction [8, 11, 27, 43]. Recently, deep learning has been applied to view synthesis and most methods leverage either view-dependent volumes [49, 57, 63] or canonical world-space volumes [31, 48] for geometric-aware appearance inference. We extend them to a more general physically-based volumetric representation which explicitly expresses both geometry and reflectance, and enables relighting with view synthesis. On the other hand, learning-based relighting techniques have also been developed. Purely image-based methods are able to relight scenes with realistic specularities and soft shadows from sparse inputs, but unable to reproduce accurate hard shadows [19, 50, 59, 61]; some other methods [9, 44] propose geometry-aware networks and make use of pre-acquired meshes for relighting and view synthesis, and their performance is limited by the mesh reconstruction quality. A work [36] concurrent to ours models scene geometry and appearance by reconstructing a continuous radiance field for pure view synthesis. In contrast, Deep Reflectance Volumes explicitly express scene geometry and reflectance, and reproduce accurate high-frequency specularities and hard shadows. Ours is the first comprehensive neural rendering framework that enables both relighting and view synthesis with complex shading effects.

3 Rendering with Deep Reflectance Volumes

Unlike a mesh, which comprises vertices with complex connectivity, a volume is a regular 3D grid, well suited to convolutional operations; volumes have been widely used in deep learning frameworks for 3D applications [55, 60]. However, previous neural volumetric representations have only represented pixel colors; this suffices for view synthesis [31, 63] but does not support relighting or scene editing. Instead, we propose to jointly learn geometry and reflectance (i.e., material parameter) volumes to enable broader rendering applications, including view synthesis, relighting and material editing, in a comprehensive framework. Deep Reflectance Volumes are learned by a deep network and used to render images in a fully differentiable end-to-end process, as shown in Fig. 2. This is made possible by a new differentiable volume ray marching module motivated by physically-based volume rendering. In this section, we introduce our volume rendering method and volumetric scene representation; we discuss how we learn these volumes from unstructured images in Sec. 4.

3.1 Volume rendering overview

In general, volume rendering is governed by the physically-based volume rendering equation (radiative transfer equation) that describes the radiance L(c, ω_o) arriving at a camera located at c from direction ω_o [34, 41]:

L(c, ω_o) = ∫₀^∞ τ(c, x(t)) [ L_e(x(t), ω_o) + L_s(x(t), ω_o) ] dt    (1)

This equation integrates emitted, L_e, and in-scattered, L_s, light contributions along the ray starting at camera position c in the direction ω_o. Here, t represents distance along the ray, and x(t) = c + t·ω_o is the corresponding 3D point. τ(c, x(t)) is the transmittance factor that governs the loss of light along the line segment between c and x(t):

τ(c, x(t)) = exp( −∫₀^t σ_t(x(s)) ds )    (2)

where σ_t(x(s)) is the extinction coefficient at location x(s) on the segment. The in-scattered contribution is defined as:

L_s(x, ω_o) = ∫_S f_p(x, ω_o, ω_i) L_i(x, ω_i) dω_i    (3)

in which S is a unit sphere, f_p is a generalized (unnormalized) phase function that expresses how light scatters at a point in the volume, and L_i(x, ω_i) is the incoming radiance that arrives at x from direction ω_i.

In theory, fully computing L_i requires multiple-scattering computation using Monte Carlo methods [41], which is computationally expensive and unsuitable for deep learning techniques. We consider a simplified case with a single point light, single scattering and no volumetric emission. The transmittance between the scattering location and the point light is handled the same way as between the scattering location and the camera. The generalized phase function becomes a reflectance function f_r(x, ω_o, ω_i, n, R), which computes reflected radiance at x using its local surface normal n and the reflectance parameters R of a given surface reflectance model. Therefore, Eqn. 1 and Eqn. 3 can be simplified and written concisely as [24, 34]:

L(c, ω_o) = ∫₀^∞ τ(c, x) τ(l, x) f_r(x, ω_o, ω_i, n(x), R(x)) L_l(x) dt    (4)

where l is the light position, ω_i corresponds to the direction from x to l, τ(c, x) still represents the transmittance from the scattering point x to the camera c, the term τ(l, x) (implicitly involved in Eqn. 3) is the transmittance from the light to x and expresses light extinction before scattering, and L_l(x) represents the light intensity arriving at x without considering light extinction.

Figure 2: We propose Deep Reflectance Volume representation to capture scene geometry and appearance, where each voxel consists of opacity , normal and reflectance (material coefficients) . During rendering, we perform ray marching through each pixel and accumulate contributions from each point along the ray. Each contribution is calculated using the local normal, reflectance and lighting information. We accumulate opacity from both the camera and the light

to model the light transport loss in both occlusions and shadows. To predict such a volume, we start from an encoding vector, and decode it into a volume using a 3D convolutional neural network; thus the combination of the encoding vector and network weights is the unknown variable being optimized (trained). We train on images captured with collocated camera and light by enforcing a loss function between rendered images and training images.

3.2 A discretized, differentiable volume rendering module

To make volume rendering practical in a learning framework, we further approximate Eqn. 4 by turning it into a discretized version, which can be evaluated by ray marching [24, 34, 53]. This is classically expressed using opacity compositing, where a per-voxel opacity α is used to represent the transmittance over a fixed ray marching step size Δ. Points x_t are sequentially sampled along a given ray, from the camera position c, as:

x_t = c + t·Δ·ω_o,  t = 0, 1, 2, …    (5)

The radiance and opacity along this path, L_t and A_t, are recursively accumulated until x_t exits the volume as:

L_{t+1} = L_t + (1 − A_t)(1 − A^l(x_t)) α(x_t) f_r(x_t, ω_o, ω_i, n(x_t), R(x_t)) L_l    (6)
A_{t+1} = A_t + (1 − A_t) α(x_t)    (7)
L_0 = 0,  A_0 = 0    (8)

Here, f_r(x_t, ω_o, ω_i, n(x_t), R(x_t)) computes the reflected radiance from the reflectance function and the incoming light, A_t represents the accumulated opacity from the camera to point x_t, and (1 − A_t) corresponds to τ(c, x) in Eqn. 4. A^l(x_t) represents the accumulated opacity from the light l (i.e., (1 − A^l(x_t)) is τ(l, x) in Eqn. 4) and requires a separate accumulation process over samples x'_s along the light ray, similar to Eqn. 7:

x'_s = l + s·Δ·ω_l,  s = 0, 1, 2, …    (9)
A^l_{s+1} = A^l_s + (1 − A^l_s) α(x'_s),  A^l_0 = 0    (10)

In this rendering process (Eqns. 5-10), a scene is represented by an opacity volume α, a normal volume n and a BRDF volume R; together, these express the geometry and reflectance of the scene, and we refer to them as Deep Reflectance Volumes. The simplified opacity volume α is essentially one minus the transmittance (determined by the physical extinction coefficient σ_t) over a ray segment of the fixed step size Δ; this means that α is dependent on Δ.
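As a concrete illustration, the two-pass accumulation of Eqns. 5-10 can be sketched in a few lines. This is a minimal NumPy sketch, not the paper's implementation: it uses nearest-voxel lookups, a unit-cube-per-voxel grid, and a Lambertian lobe standing in for the reflectance function f_r; all function names and the grid layout are our own.

```python
import numpy as np

def march_opacity(alpha, origin, direction, step, n_steps):
    """Accumulate opacity along a ray (the recursion of Eqn. 7 / Eqn. 10).
    alpha: (D, D, D) per-voxel opacity; nearest-voxel lookup for brevity."""
    D = alpha.shape[0]
    A = 0.0
    for t in range(n_steps):
        idx = np.floor(origin + t * step * direction).astype(int)
        if np.all(idx >= 0) and np.all(idx < D):
            A = A + (1.0 - A) * alpha[tuple(idx)]
    return A

def render_pixel(alpha, normal, albedo, cam, light, direction, step, n_steps, L_l=1.0):
    """Single-scattering ray march (Eqns. 5-8) with a Lambertian lobe
    standing in for f_r; the paper's BRDF would slot in at the f_r line."""
    D = alpha.shape[0]
    L, A = 0.0, 0.0
    for t in range(n_steps):
        x = cam + t * step * direction
        idx = np.floor(x).astype(int)
        if not (np.all(idx >= 0) and np.all(idx < D)):
            continue
        a = alpha[tuple(idx)]
        if a == 0.0:
            continue  # empty space contributes neither radiance nor opacity
        to_light = light - x
        dist = np.linalg.norm(to_light)
        wi = to_light / dist  # direction from x toward the light
        # second accumulation pass, marching from the light toward x (Eqns. 9-10)
        A_l = march_opacity(alpha, light, -wi, step, int(dist / step))
        f_r = albedo[tuple(idx)] / np.pi * max(float(np.dot(normal[tuple(idx)], wi)), 0.0)
        L = L + (1.0 - A) * (1.0 - A_l) * a * f_r * L_l
        A = A + (1.0 - A) * a
    return L, A
```

Note how occlusion and shadowing both fall out of the same opacity grid: the camera-side factor (1 − A) suppresses surfaces behind opaque matter, while the light-side factor (1 − A_l) darkens points the light cannot reach.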

Our physically-based ray marching is fully differentiable, so it can be easily incorporated in a deep learning framework and backpropagated through. With this rendering module, we present a neural rendering framework that simultaneously learns scene geometry and reflectance from captured images.

We support any differentiable reflectance model and, in practice, use the simplified Disney BRDF model [22], parameterized by diffuse albedo and specular roughness (please refer to the supplementary materials for more details). Our opacity volume is a general geometry representation, accounting for both occlusions (view opacity accumulation in Eqn. 7) and shadows (light opacity accumulation in Eqn. 10). We illustrate our neural rendering with ray marching in Fig. 2. Note that, because our acquisition setup has collocated camera and lighting, the opacity accumulated from the light becomes equivalent to the opacity accumulated from the camera during training, requiring only a one-pass opacity accumulation. However, the learned opacity can still be used for re-rendering under any non-collocated lighting with two-pass opacity accumulation.
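For concreteness, a common way to realize such an albedo-plus-roughness reflectance function is a diffuse lobe plus a single GGX microfacet specular lobe. The sketch below follows that standard formulation; the exact lobe shapes and the default Fresnel reflectance f0 are conventional graphics choices, not taken from the paper's simplified Disney model.

```python
import numpy as np

def brdf(n, v, l, albedo, roughness, f0=0.04):
    """Diffuse lobe plus one GGX specular lobe, parameterized by albedo and
    roughness; n, v, l are unit normal, view and light directions."""
    h = v + l
    h = h / np.linalg.norm(h)  # half vector
    nl = max(float(np.dot(n, l)), 1e-6)
    nv = max(float(np.dot(n, v)), 1e-6)
    nh = max(float(np.dot(n, h)), 1e-6)
    vh = max(float(np.dot(v, h)), 1e-6)
    a2 = roughness ** 4  # GGX alpha = roughness^2, squared
    d = a2 / (np.pi * (nh * nh * (a2 - 1.0) + 1.0) ** 2)  # GGX distribution
    f = f0 + (1.0 - f0) * (1.0 - vh) ** 5                 # Schlick Fresnel
    k = roughness ** 2 / 2.0
    g = (nl / (nl * (1.0 - k) + k)) * (nv / (nv * (1.0 - k) + k))  # Smith-Schlick
    return albedo / np.pi + d * f * g / (4.0 * nl * nv)
```

Because the whole expression is built from differentiable primitives, gradients flow from rendered pixels back to the per-voxel albedo and roughness during optimization.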

Note that while alpha compositing-based rendering functions have been used in previous work on view synthesis, those formulations are either not physically based [31] or are simplified versions that do not model lighting [49, 63]. In contrast, our framework is physically based and models single-bounce light transport with complex reflectance, occlusions and shadows.

4 Learning Deep Reflectance Volumes

4.1 Overview

Given a set of images of a real scene captured from multiple known viewpoints with collocated lighting, we propose to use a neural network to reconstruct a Deep Reflectance Volume representation of the scene. Similar to Lombardi et al. [31], our network starts from a 512-channel deep encoding vector that encodes scene appearance; in contrast to their work, where the decoded volume only represents RGB colors, we decode the vector into opacity, normal and reflectance volumes for rendering. Moreover, our scene encoding vector is not predicted by a network encoder; instead, we jointly optimize the scene encoding vector and a scene-dependent decoder network.

Our network infers the geometry and reflectance volumes in a transformed 3D space with a learned warping function W. During training, our network learns the warping function W and the geometry and reflectance volumes α_w, n_w, R_w, where the subscript w refers to a volume in the warped space. The corresponding world-space scene representation is expressed by V(p) = V_w(W(p)), where V is α, n or R, and p is a world-space position. In particular, we use bilinear interpolation to fetch a corresponding value at an arbitrary position in the warped space from the discrete voxel values. We propose a decoder-like network, which learns to decode the warping function and the volumes from the deep scene encoding vector. We use a rendering loss between rendered and captured images as well as two regularizing terms.

4.2 Network architecture

Geometry and reflectance. To decode the geometry and reflectance volumes (α_w, n_w, R_w), we use upsampling 3D convolutional operations to 3D-upsample the deep scene encoding vector to a multi-channel volume that contains the opacity, normal and reflectance. In particular, we use multiple transposed convolutional layers with stride 2 to upsample the volume, each of which is followed by a LeakyReLU activation layer. The network regresses an 8-channel volume that includes α_w, n_w and R_w: one channel for opacity α, three channels for normal n, and four channels for reflectance R (three for albedo and one for roughness). These volumes express the scene geometry and reflectance in the transformed space, and can be warped to the world space for ray marching.
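A decoder of this shape can be sketched in PyTorch as below. The layer count and channel widths are illustrative guesses; only the 512-channel input and the 8-channel output are stated in the text.

```python
import torch
import torch.nn as nn

class VolumeDecoder(nn.Module):
    """Decode a scene encoding vector into an 8-channel volume
    (1 opacity + 3 normal + 4 reflectance channels). The number of
    layers and the channel widths here are hypothetical."""
    def __init__(self, code_dim=512):
        super().__init__()
        widths = [code_dim, 256, 128, 64, 32, 16, 8]  # assumed widths
        layers = []
        for cin, cout in zip(widths[:-1], widths[1:]):
            # each stride-2 transposed conv doubles the spatial resolution
            layers.append(nn.ConvTranspose3d(cin, cout, kernel_size=4,
                                             stride=2, padding=1))
            layers.append(nn.LeakyReLU(0.2))
        self.net = nn.Sequential(*layers[:-1])  # no activation on the raw output

    def forward(self, code):
        # treat the encoding vector as a 1x1x1 volume and upsample to 64^3
        return self.net(code.view(code.shape[0], -1, 1, 1, 1))
```

Six stride-2 layers turn the 1×1×1 code into a 64³ grid; the 8 output channels would then be split into opacity, normal and reflectance and squashed with appropriate activations (e.g., a sigmoid for opacity).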

Warping function. To increase the effective resolution of the volume, we learn an affine-based warping function similar to [31]. The warping comprises a global warping and a spatially-varying warping. The global warping is represented by an affine transformation matrix W_g. The spatially-varying warping is modeled in the inverse transformation space; it is represented by 16 basis affine matrices {W_i} and a 16-channel volume that contains the spatially-varying linear weights of the 16 basis matrices. Specifically, given a world-space position p, the complete warping function maps it into the transformed space by:

W(p) = Σ_{i=1}^{16} w_i(p) · W_i · W_g · p    (11)

where w_i(p) represents the normalized weight of the i-th warping basis at p. Here, each global or local basis affine transformation matrix is composed of rotation, translation and scale parameters, which are optimized during the training process. Our network decodes the weight volume from the deep encoding vector using a multi-layer perceptron with fully connected layers.
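Under this reading of Eqn. 11, applying the warp to a world-space point amounts to one global affine transform followed by a normalized blend of the basis affine transforms. A minimal sketch, with the matrices assumed to be 4x4 homogeneous transforms:

```python
import numpy as np

def warp(p, W_g, bases, weights):
    """Map a world-space point p into the transformed space (Eqn. 11).
    W_g: global 4x4 affine; bases: list of 4x4 basis affines;
    weights: normalized per-point blend weights w_i(p)."""
    q = W_g @ np.append(p, 1.0)  # global warp in homogeneous coordinates
    out = sum(w * (B @ q) for w, B in zip(weights, bases))
    return out[:3]
```

Because the blend weights vary per voxel, different regions of the world space can be squeezed or stretched independently, concentrating the volume's capacity where the scene actually has content.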

4.3 Loss function and training details

Loss function. Our network learns the scene volumes using a rendering loss computed with the differentiable ray marching process discussed in Sec. 3. During training, we randomly sample pixels from the captured images and perform ray marching (using the known camera calibration) to get the rendered color L(p) of each sampled pixel p; we supervise it with the ground-truth color I(p) in the captured images using an L2 loss. In addition, we also apply regularization terms from additional priors similar to [31]. We only consider opaque objects in this work and enforce the accumulated opacity A(p) along any camera ray (see Eqn. 7; here p denotes a pixel and A(p) is the accumulated opacity at the final step that exits the volume) to be either 0 or 1, corresponding to a background or foreground pixel, respectively. We also regularize the per-voxel opacity to be sparse over the space by minimizing the spatial gradients of the logarithmic opacity. Our total loss function is given by:

ℒ = Σ_p ‖L(p) − I(p)‖²₂ + β₁ Σ_p [ log A(p) + log(1 − A(p)) ] + β₂ Σ_v ‖∇ log α(v)‖₁    (12)

Here, the first part is the data term, the second regularizes the accumulated opacity A(p) toward 0 or 1, and the third regularizes the spatial sparsity of the per-voxel opacity α.
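The three terms can be sketched as follows. The epsilon stabilizer and the exact prior forms are our reading of the description above, not a verbatim implementation:

```python
import numpy as np

def total_loss(rendered, target, acc_opacity, alpha, beta1, beta2, eps=1e-5):
    """Rendering loss plus the two opacity priors of Eqn. 12 (sketch)."""
    data = np.sum((rendered - target) ** 2)  # L2 data term
    # log A + log(1 - A) peaks at A = 0.5, so minimizing it pushes each
    # ray's accumulated opacity toward 0 (background) or 1 (foreground)
    binary = beta1 * np.sum(np.log(acc_opacity + eps)
                            + np.log(1.0 - acc_opacity + eps))
    # sparse spatial gradients of the logarithmic per-voxel opacity
    grads = np.gradient(np.log(alpha + eps))
    sparsity = beta2 * sum(np.sum(np.abs(g)) for g in grads)
    return data + binary + sparsity
```

Note the binary term alone would be minimized by pure background; the data term counteracts this, keeping voxels opaque wherever the captured pixels demand it.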

Training details. We build our volume as a cube centered at the origin of the world space. During training, we randomly sample pixels from the captured images for each training batch, and perform ray marching through the volume with a fixed step size. Initially, we set the regularization weights β₁ and β₂ to small values; we increase them after a fixed number of iterations, which helps remove artifacts in the background and recover sharp boundaries.

5 Results

In this section we show our results on real captured scenes. We first introduce our acquisition setup and data pre-processing. Then we compare against the state-of-the-art mesh-based appearance acquisition method, followed by a detailed analysis of the experiments. We also demonstrate material editing results with our approach. Please refer to the supplementary materials for video results.

Data acquisition. Our approach learns the volume representation in a scene-dependent way from images with collocated view and light; this requires adequately dense input images, well distributed around the target scene, to learn its complete appearance. Such data can be practically acquired by shooting a video with a handheld cellphone; we show one result using this handheld setup in Fig. 4. For the other results, we use a robotic arm to automatically capture more uniformly distributed images around each scene, which enables convenient and thorough evaluations; in particular, it allows us to evaluate the performance of our method with different numbers of roughly uniformly distributed input images, as shown in Fig. 5. In the robotic-arm setup, we mount a Samsung Galaxy Note 8 cellphone on the arm and capture about 480 images using its camera and built-in flashlight in a dark room; we hold out a subset of 100 images for validation and use the rest for training. For Captain, we use the same phone to capture a handheld video of the object and select one frame at fixed intervals as a training image.

Data pre-processing. The captured objects are roughly centered in the images. We select one fixed rectangular region around the center that covers the object across all frames and use it to crop the images for training; the resolution of the cropped training images varies across scenes. Note that we do not use a foreground mask for the object: our method leverages the regularization terms in training (see Sec. 4.3), which automatically recover a clean background. We calibrate the captured images using structure from motion (SfM) in COLMAP [46] to obtain the camera intrinsic and extrinsic parameters. Since SfM may fail to register certain views, the actual number of training images varies from 300 to 385 across scenes. We estimate the center and bounding box of the captured object from the sparse SfM reconstruction, translate the object center to the origin, and scale the object to fit into the cube.

Implementation and timing. We implement our system (both the neural network and the differentiable volume rendering components) in PyTorch. We train our network on four NVIDIA RTX 2080 Ti GPUs for about two days (about 450,000 iterations), although training for about one day (200,000 iterations) typically already converges to good results (see Fig. 7). At inference time, we directly render the scene from the reconstructed volumes without the network. It takes about 0.8s to render an image under collocated view and light. For non-collocated view and light, rendering requires connecting each shading point to the light source with an additional light-dependent opacity accumulation, which is very expensive if done naively. To accelerate this, we perform ray marching from the light's point of view and precompute the accumulated opacity at each spatial position of the volume; during rendering, the accumulated opacity for a light ray can then be directly sampled from the precomputed volume.
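The light-opacity precomputation described above can be sketched as follows. The voxel-center sampling, step size and brute-force loops are illustrative (a real system would batch this on the GPU), and the marching scheme matches the sketch's own conventions rather than the paper's code:

```python
import numpy as np

def precompute_light_opacity(alpha, light, step=0.5):
    """Precompute the light-accumulated opacity A^l at every voxel center
    by marching once from the light per voxel. Written for clarity, not
    speed; note a voxel's own opacity can leak into its A^l (self-shadow),
    which a careful implementation would exclude."""
    D = alpha.shape[0]
    A_l = np.zeros_like(alpha)
    for idx in np.ndindex(alpha.shape):
        x = np.asarray(idx, dtype=float) + 0.5  # voxel center
        d = x - light
        dist = float(np.linalg.norm(d))
        if dist == 0.0:
            continue
        d /= dist
        A = 0.0
        for t in range(int(dist / step)):  # march from the light toward x
            j = np.floor(light + t * step * d).astype(int)
            if np.all(j >= 0) and np.all(j < D):
                A = A + (1.0 - A) * alpha[tuple(j)]
        A_l[idx] = A
    return A_l
```

At render time, each shading point then samples this precomputed grid instead of re-marching a shadow ray, turning the per-pixel cost of relighting back into a single lookup.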

Figure 3: Comparisons with mesh-based reconstruction. We show renderings of the captured objects under both collocated (columns 2, 3) and non-collocated (columns 4, 5) camera and light. We compare our volume-based neural reconstruction against a state-of-the-art method [37] that reconstructs a mesh and per-vertex BRDFs. Nam et al. [37] fail to handle these challenging cases and recover inaccurate geometry and appearance; in contrast, our method produces photo-realistic results.

Comparisons with mesh-based reconstruction. We use a practical acquisition setup in which we capture unstructured images with a mobile phone, its built-in flashlight on, in a dark room. Such a mildly controlled acquisition setup is rarely supported by previous works [7, 21, 56, 57, 59, 64]. We therefore compare with the state-of-the-art method of Nam et al. [37] for mesh-based geometry and reflectance reconstruction, which uses the same cellphone setup as ours to reconstruct a mesh with per-vertex BRDFs, and supports both relighting and view synthesis. Figure 3 shows comparisons of renderings under both collocated and non-collocated view-light conditions. The comparison results are generated from the same set of input images: we asked the authors of [37] to run their code on our data and compared against the rendered images they provided. Please refer to the supplementary materials for video comparisons.

As shown in Fig. 3, our results are significantly better than those of the mesh-based method in terms of both geometry and reflectance. Note that Nam et al. [37] leverage a state-of-the-art MVS method [47] to reconstruct an initial mesh from the captured images and perform an optimization to further refine the geometry; this still fails to recover accurate geometry in textureless, specular and thin-structured regions of these challenging scenes, which leads to seriously distorted shapes in Pony, over-smoothing and spurious structures in House, and degraded geometry in Girl. Our learning-based volumetric representation avoids these mesh-based issues and models the scene geometry accurately with many details. Moreover, it is also very difficult for the classical per-vertex BRDF optimization in [37] to recover high-frequency specularities, which leads to an over-diffuse appearance in most of the scenes; this is caused by the lack of constraints on high-frequency specular effects, which appear in very few pixels of a limited number of input views. In contrast, our optimization is driven by our neural rendering framework with deep network priors, which effectively correlates the sparse specularities across different regions through the network and recovers realistic specularities and other appearance effects.

Figure 4: Additional results on real scenes. We show renderings under novel view and lighting conditions. Our method is able to handle scenes with multiple objects (top two rows) and model the complex occlusions between them. Our method can also generate high-quality results from casual handheld video captures (third row), which demonstrates the practicality of our approach.
# training images    25      50      100     200     385
PSNR                 25.33   26.36   26.95   27.85   28.13
SSIM                 0.70    0.73    0.75    0.80    0.81
Figure 5: We evaluate the performance of our method on the House scene with different numbers of training images. Although we use all images in our final experiments, our method achieves comparable performance with as few as 200 images for this challenging scene.
                     House          Cartoon
DeepVoxels [48]      0.786/25.81    0.532/16.34
Ours                 0.896/30.44    0.911/29.14
Figure 6: We compare against DeepVoxels on synthesizing novel views under collocated lights and report the SSIM/PSNR scores. The results show that our method generates more accurate renderings. Note that we retrain our model at a matched resolution for a fair comparison.

Comparison on synthesizing novel views. We also compare on synthesizing novel views under collocated lights against DeepVoxels [48], a view synthesis method that encodes view-dependent appearance in a learned 3D-aware neural representation. Note that DeepVoxels does not support relighting. As shown in Fig. 6, our method generates renderings of higher quality with higher PSNR/SSIM scores. In contrast, DeepVoxels fails to reason about the complex geometry in our real scenes, resulting in degraded image quality. Please refer to the supplementary materials for visual comparison results.

Additional results. We show additional relighting and view synthesis results of complex real scenes in Fig. 4. Our method is able to handle scenes with multiple objects, as shown in scene Cartoon and Animals. Our volumetric representation can accurately model complex occlusions between objects and reproduce realistic cast shadows under novel lighting, which are never observed by our network during the training process. In the Captain scene, we show the result generated from handheld mobile phone captures. We select frames from the video at fixed intervals as training data. Despite the potential existence of motion blur and non-uniform coverage, our method is able to generate high-quality results, which demonstrates the robustness and practicality of our approach. Please refer to the supplementary materials for video results.

Evaluation of the number of inputs. Our method relies on an optimization over enough input images to capture the scene appearance across different view/light directions. We evaluate how our reconstruction degrades as the number of training images decreases on the House scene: we uniformly select a subset of views from the full training set, train our model on it, evaluate on the test images, and report the SSIM and PSNR in Fig. 5. As the results show, there is an obvious performance drop with fewer than 100 training images, due to insufficient constraints. On the other hand, while we use the full set of images for our final results, our method in fact achieves comparable performance with only 200 images for this scene, as reflected by the close PSNR and SSIM scores.

Figure 7: We compare our deep prior based optimization against direct optimization of the volume and warping function without using networks. Direct optimization converges significantly slower than our method, which demonstrates the effectiveness of regularization by the networks.
Figure 8: Our approach supports intuitive editing of the material properties of a captured object. In this example we decrease the roughness of the object to make it look like glossy marble instead of plastic.

Comparison with direct optimization. Our neural rendering leverages a "deep volume prior" to drive the volumetric optimization process. To justify this design, we compare with a naive method that directly optimizes the parameters of each voxel and the warping parameters using the same loss function. We show the optimization progress in Fig. 7. The naive method converges significantly slower than ours: independent voxel-wise optimization, without considering cross-voxel correlations, cannot properly disentangle the ambiguous information in the captured images, whereas our deep optimization correlates appearance information across voxels through deep convolutions, which effectively minimizes the reconstruction loss.

Material editing. Our method learns explicit volumes with physical meaning to represent the reflectance of real scenes. This enables broad image synthesis applications, such as editing the materials of captured scenes. We show one example in Fig. 8, where we make the scene glossier by decreasing the learned roughness in the volume. Note that the geometry and colors of the scene are preserved, while novel specularities are introduced that were not part of the original material appearance. This example illustrates that our network disentangles the geometry and reflectance of the scene in a reasonable way, enabling individual scene components to be edited without affecting the others.
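As a sketch, such an edit amounts to modifying the reflectance channels of the learned volume. The 8-channel layout below (opacity, normal, diffuse albedo, roughness) follows Appendix 0.B, but the specific channel ordering is our assumption:

```python
import numpy as np

# Assumed channel layout per voxel:
#   0: opacity, 1-3: normal, 4-6: diffuse albedo, 7: roughness
def edit_roughness(volume, scale):
    """Scale the roughness channel of a reflectance volume, leaving
    opacity, normals and albedo untouched; clamp back to [0, 1]."""
    edited = volume.copy()
    edited[..., 7] = np.clip(edited[..., 7] * scale, 0.0, 1.0)
    return edited
```

Decreasing `scale` below 1 makes the rendered surface glossier, since a lower roughness concentrates the specular lobe.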

Limitations. We reconstruct the deep reflectance volumes at a fixed resolution that is restricted by available GPU memory. While we apply a warping function to increase the effective utilization of the volume space, and have demonstrated that it generates compelling results on complex real scenes, it may fail to fully reproduce the geometry and appearance of scenes with highly complex surface normal variations and texture details. Increasing the volume resolution may resolve this issue. In the future, it would also be interesting to investigate how to efficiently apply sparse representations such as octrees in our framework to increase the capacity of our volume representation. The reflectance model we currently use is most appropriate for opaque surfaces; extensions to other materials like hair, fur or glass could potentially be addressed by applying other reflectance models in our neural rendering framework.

6 Conclusion

We have presented a novel approach to learn a volume representation that models both the geometry and reflectance of complex real scenes. We predict per-voxel opacity, normal, and reflectance from unstructured multi-view mobile phone captures with the flashlight enabled. We also introduce a physically-based differentiable rendering module that renders the volume under arbitrary viewing and lighting directions. Our method is practical, and supports novel view synthesis, relighting and material editing, with significant potential benefits in scenarios such as 3D visualization and VR/AR applications.

Acknowledgements. We thank Giljoo Nam for help with the comparisons. This work was supported in part by ONR grants N000141712687, N000141912293, N000142012529, NSF grant 1617234, Adobe, the Ronald L. Graham Chair and the UC San Diego Center for Visual Computing.

References

  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas (2018) Learning representations and generative models for 3D point clouds. In ICML, pp. 40–49.
  • [2] M. Aittala, T. Aila, and J. Lehtinen (2016) Reflectance modeling by neural texture synthesis. ACM Transactions on Graphics 35 (4), pp. 65:1–65:13.
  • [3] M. Aittala, T. Weyrich, and J. Lehtinen (2015) Two-shot SVBRDF capture for stationary materials. ACM Transactions on Graphics 34 (4), pp. 110:1–110:13.
  • [4] N. Alldrin, T. Zickler, and D. Kriegman (2008) Photometric stereo with non-parametric and spatially-varying reflectance. In CVPR, pp. 1–8.
  • [5] S. Baek, D. S. Jeon, X. Tong, and M. H. Kim (2018) Simultaneous acquisition of polarimetric SVBRDF and normals. ACM Transactions on Graphics 37 (6).
  • [6] S. Bi, N. K. Kalantari, and R. Ramamoorthi (2017) Patch-based optimization for image-based texture mapping. ACM Transactions on Graphics 36 (4).
  • [7] S. Bi, Z. Xu, K. Sunkavalli, D. Kriegman, and R. Ramamoorthi (2020) Deep 3D capture: geometry and reflectance from sparse multi-view images. In CVPR.
  • [8] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen (2001) Unstructured lumigraph rendering. In SIGGRAPH.
  • [9] Z. Chen, A. Chen, G. Zhang, C. Wang, Y. Ji, K. N. Kutulakos, and J. Yu (2020) A neural rendering framework for free-viewpoint relighting. In CVPR.
  • [10] Z. Chen and H. Zhang (2018) Learning implicit fields for generative shape modeling. arXiv preprint arXiv:1812.02822.
  • [11] P. Debevec, T. Hawkins, C. Tchou, H. Duiker, W. Sarokin, and M. Sagar (2000) Acquiring the reflectance field of a human face. In SIGGRAPH, pp. 145–156.
  • [12] S. C. Foo (1997) A gonioreflectometer for measuring the bidirectional reflectance of material for use in illumination computation. Ph.D. thesis, Cornell University.
  • [13] Y. Furukawa and J. Ponce (2009) Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (8), pp. 1362–1376.
  • [14] D. B. Goldman, B. Curless, A. Hertzmann, and S. M. Seitz (2009) Shape and spatially-varying BRDFs from photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (6), pp. 1060–1071.
  • [15] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2018) A papier-mâché approach to learning 3D surface generation. In CVPR, pp. 216–224.
  • [16] P. Huang, K. Matzen, J. Kopf, N. Ahuja, and J. Huang (2018) DeepMVS: learning multi-view stereopsis. In CVPR, pp. 2821–2830.
  • [17] Z. Hui, K. Sunkavalli, J. Lee, S. Hadap, J. Wang, and A. C. Sankaranarayanan (2017) Reflectance capture using univariate sampling of BRDFs. In ICCV.
  • [18] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang (2017) SurfaceNet: an end-to-end 3D neural network for multiview stereopsis. In ICCV, pp. 2307–2315.
  • [19] Y. Kanamori and Y. Endo (2018) Relighting humans: occlusion-aware inverse rendering for full-body human images. ACM Transactions on Graphics 37 (6), pp. 1–11.
  • [20] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik (2018) Learning category-specific mesh reconstruction from image collections. In ECCV, pp. 371–386.
  • [21] K. Kang, C. Xie, C. He, M. Yi, M. Gu, Z. Chen, K. Zhou, and H. Wu (2019) Learning efficient illumination multiplexing for joint capture of reflectance and shape. ACM Transactions on Graphics 38 (6).
  • [22] B. Karis (2013) Real shading in Unreal Engine 4. In SIGGRAPH Courses: Physically Based Shading in Theory and Practice. Epic Games.
  • [23] M. Kazhdan, M. Bolitho, and H. Hoppe (2006) Poisson surface reconstruction. In Proceedings of the Fourth Eurographics Symposium on Geometry Processing, Vol. 7.
  • [24] J. Kniss, S. Premoze, C. Hansen, P. Shirley, and A. McPherson (2003) A model for volume lighting and modeling. IEEE Transactions on Visualization and Computer Graphics 9 (2), pp. 150–162.
  • [25] K. N. Kutulakos and S. M. Seitz (2000) A theory of shape by space carving. International Journal of Computer Vision 38 (3).
  • [26] L. Ladicky, O. Saurer, S. Jeong, F. Maninchedda, and M. Pollefeys (2017) From point clouds to mesh using regression. In ICCV, pp. 3893–3902.
  • [27] M. Levoy and P. Hanrahan (1996) Light field rendering. In SIGGRAPH.
  • [28] Z. Li, K. Sunkavalli, and M. Chandraker (2018) Materials for masses: SVBRDF acquisition with a single mobile phone image. In ECCV.
  • [29] Z. Li, Z. Xu, R. Ramamoorthi, K. Sunkavalli, and M. Chandraker (2018) Learning to reconstruct shape and spatially-varying reflectance from a single image. In SIGGRAPH Asia 2018, pp. 269.
  • [30] Y. Liao, S. Donne, and A. Geiger (2018) Deep marching cubes: learning explicit surface representations. In CVPR, pp. 2916–2925.
  • [31] S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, and Y. Sheikh (2019) Neural volumes: learning dynamic renderable volumes from images. ACM Transactions on Graphics 38 (4), pp. 65.
  • [32] W. E. Lorensen and H. E. Cline (1987) Marching cubes: a high resolution 3D surface construction algorithm. ACM SIGGRAPH Computer Graphics 21 (4), pp. 163–169.
  • [33] W. Matusik, H. Pfister, M. Brand, and L. McMillan (2003) A data-driven reflectance model. ACM Transactions on Graphics 22 (3), pp. 759–769.
  • [34] N. Max (1995) Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics 1 (2).
  • [35] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2018) Occupancy networks: learning 3D reconstruction in function space. arXiv preprint arXiv:1812.03828.
  • [36] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) NeRF: representing scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934.
  • [37] G. Nam, J. H. Lee, D. Gutierrez, and M. H. Kim (2018) Practical SVBRDF acquisition of 3D objects with unstructured flash photography. In SIGGRAPH Asia 2018.
  • [38] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon (2011) KinectFusion: real-time dense surface mapping and tracking. In ISMAR ’11.
  • [39] J. B. Nielsen, H. W. Jensen, and R. Ramamoorthi (2015) On optimal, minimal BRDF sampling for reflectance acquisition. ACM Transactions on Graphics 34 (6), pp. 1–11.
  • [40] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2020) Differentiable volumetric rendering: learning implicit 3D representations without 3D supervision. In CVPR.
  • [41] J. Novák, I. Georgiev, J. Hanika, and W. Jarosz (2018) Monte Carlo methods for volumetric light transport simulation. In Computer Graphics Forum, Vol. 37.
  • [42] D. Paschalidou, O. Ulusoy, C. Schmitt, L. Van Gool, and A. Geiger (2018) RayNet: learning volumetric 3D reconstruction with ray potentials. In CVPR.
  • [43] P. Peers, D. K. Mahajan, B. Lamond, A. Ghosh, W. Matusik, R. Ramamoorthi, and P. Debevec (2009) Compressive light transport sensing. ACM Transactions on Graphics 28 (1), pp. 3.
  • [44] J. Philip, M. Gharbi, T. Zhou, A. A. Efros, and G. Drettakis (2019) Multi-view relighting using a geometry-aware network. ACM Transactions on Graphics 38 (4).
  • [45] S. R. Richter and S. Roth (2018) Matryoshka networks: predicting 3D geometry via nested shape layers. In CVPR, pp. 1936–1944.
  • [46] J. L. Schönberger and J. Frahm (2016) Structure-from-motion revisited. In CVPR.
  • [47] J. L. Schönberger, E. Zheng, M. Pollefeys, and J. Frahm (2016) Pixelwise view selection for unstructured multi-view stereo. In ECCV.
  • [48] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhofer (2019) DeepVoxels: learning persistent 3D feature embeddings. In CVPR.
  • [49] P. P. Srinivasan, R. Tucker, J. T. Barron, R. Ramamoorthi, R. Ng, and N. Snavely (2019) Pushing the boundaries of view extrapolation with multiplane images. In CVPR, pp. 175–184.
  • [50] T. Sun, J. T. Barron, Y. Tsai, Z. Xu, X. Yu, G. Fyffe, C. Rhemann, J. Busch, P. Debevec, and R. Ramamoorthi (2019) Single image portrait relighting. ACM Transactions on Graphics (Proceedings of SIGGRAPH).
  • [51] B. Walter, S. R. Marschner, H. Li, and K. E. Torrance (2007) Microfacet models for refraction through rough surfaces. In Rendering Techniques (Proc. EGSR).
  • [52] J. Wang, B. Sun, and Y. Lu (2018) MVPNet: multi-view point regression networks for 3D object reconstruction from a single image. arXiv preprint arXiv:1811.09410.
  • [53] C. M. Wittenbrink, T. Malzbender, and M. E. Goss (1998) Opacity-weighted color interpolation for volume sampling. In Proceedings of the 1998 IEEE Symposium on Volume Visualization, pp. 135–142.
  • [54] H. Wu, Z. Wang, and K. Zhou (2015) Simultaneous localization and appearance estimation with a consumer RGB-D camera. IEEE Transactions on Visualization and Computer Graphics 22 (8).
  • [55] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3D ShapeNets: a deep representation for volumetric shapes. In CVPR.
  • [56] R. Xia, Y. Dong, P. Peers, and X. Tong (2016) Recovering shape and spatially-varying surface reflectance under unknown illumination. ACM Transactions on Graphics 35 (6).
  • [57] Z. Xu, S. Bi, K. Sunkavalli, S. Hadap, H. Su, and R. Ramamoorthi (2019) Deep view synthesis from sparse photometric images. ACM Transactions on Graphics 38 (4), pp. 76.
  • [58] Z. Xu, J. B. Nielsen, J. Yu, H. W. Jensen, and R. Ramamoorthi (2016) Minimal BRDF sampling for two-shot near-field reflectance acquisition. ACM Transactions on Graphics 35 (6), pp. 188.
  • [59] Z. Xu, K. Sunkavalli, S. Hadap, and R. Ramamoorthi (2018) Deep image-based relighting from optimal sparse samples. ACM Transactions on Graphics 37 (4), pp. 126.
  • [60] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) MVSNet: depth inference for unstructured multi-view stereo. In ECCV, pp. 767–783.
  • [61] H. Zhou, S. Hadap, K. Sunkavalli, and D. W. Jacobs (2019) Deep single-image portrait relighting. In ICCV.
  • [62] Q. Zhou and V. Koltun (2014) Color map optimization for 3D reconstruction with consumer depth cameras. ACM Transactions on Graphics 33 (4), pp. 155.
  • [63] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely (2018) Stereo magnification: learning view synthesis using multiplane images. ACM Transactions on Graphics 37 (4), pp. 1–12.
  • [64] Z. Zhou, G. Chen, Y. Dong, D. Wipf, Y. Yu, J. Snyder, and X. Tong (2016) Sparse-as-possible SVBRDF acquisition. ACM Transactions on Graphics 35 (6).

Appendix 0.A BRDF Model

Essentially any differentiable BRDF model can be incorporated into our framework to model the appearance of real-world objects. In this paper we apply a version of the microfacet BRDF model proposed by Walter et al. [51], with simplifications introduced by Karis [22]. Let $\omega_i$, $\omega_o$ be the light and view directions, and $n$, $a$, $r$ be the surface normal, diffuse albedo and roughness. Our BRDF model is defined as:

$f_r(\omega_i, \omega_o) = \dfrac{a}{\pi} + \dfrac{D(h)\, F(\omega_o, h)\, G(\omega_i, \omega_o)}{4\,(n \cdot \omega_i)(n \cdot \omega_o)}$    (13)

where $h$ is the half vector between $\omega_i$ and $\omega_o$, and $D$, $F$ and $G$ are the normal distribution, Fresnel and geometry terms, respectively; we follow the simplified forms of these terms, and the suggested Fresnel reflectance at normal incidence, from [22]. Correspondingly, the final reflected radiance in Eqn. 8 in the paper is computed by evaluating this BRDF at each shading point $x$, where $a(x)$ and $r(x)$ are the diffuse albedo and roughness at $x$.
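For concreteness, a possible evaluation of such a shading model is sketched below, using the widely used GGX distribution, Schlick Fresnel and Smith-Schlick geometry terms. The exact term definitions and the default $F_0$ value here are our assumptions about the simplifications in [22], not the paper's verified implementation:

```python
import numpy as np

def microfacet_brdf(n, v, l, albedo, roughness, f0=0.05):
    """Evaluate a simplified microfacet BRDF (Lambertian diffuse + GGX specular).

    n, v, l: unit normal, view and light directions (3-vectors).
    The GGX/Schlick term choices and the f0 default are illustrative
    assumptions in the spirit of [51, 22].
    """
    h = (v + l) / np.linalg.norm(v + l)              # half vector
    a = roughness ** 2                               # perceptual-to-alpha remap
    ndh, ndv, ndl = n @ h, n @ v, n @ l
    D = a**2 / (np.pi * (ndh**2 * (a**2 - 1.0) + 1.0) ** 2)   # GGX distribution
    F = f0 + (1.0 - f0) * (1.0 - v @ h) ** 5                  # Schlick Fresnel
    k = a / 2.0                                               # Smith-Schlick parameter
    G = (ndv / (ndv * (1 - k) + k)) * (ndl / (ndl * (1 - k) + k))
    return albedo / np.pi + D * F * G / (4.0 * ndv * ndl)
```

Under collocated capture, `v` and `l` coincide, so the half vector equals the view direction and the specular highlight follows the camera.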

Appendix 0.B Network Architecture

Fig. 9 shows an overview of our network architecture. Our network starts from an encoding vector initialized with random samples from a normal distribution. The encoding vector first passes through two fully connected layers and is then fed to separate decoders that predict the global warping parameters, the spatially varying warping parameters, and the template volume. The global warping parameters consist of a 3-channel scaling vector, a 3-channel translation vector and a 4-channel rotation vector represented as a quaternion. The spatially varying parameters consist of a set of warping bases and a weight volume. Similar to the global warping, each warping basis is composed of a scaling, a translation and a rotation. The weight volume has one channel per warping basis and encodes the spatially varying weight of each basis. Finally, the template volume has 8 channels: 1 for opacity, 3 for normal, 3 for diffuse albedo and 1 for roughness. We also transform the albedo and roughness to the range of [0, 1] and normalize the predicted normal vectors.
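To illustrate, a global warp of this form can be applied to sampling coordinates as follows; the scale-rotate-translate composition order is our assumption for illustration:

```python
import numpy as np

def quat_to_matrix(q):
    """Rotation matrix from a (w, x, y, z) quaternion, normalized first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def warp_points(pts, scale, translation, quat):
    """Apply a scale / rotate / translate warp to an (N, 3) array of
    sample coordinates, as parameterized by the global warping."""
    return (pts * scale) @ quat_to_matrix(quat).T + translation
```

The spatially varying warp would blend several such bases per point using the weight volume.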

Figure 9: Our network architecture.
        Pony    Girl    House   Disney  Animals Captain
min     0.60    0.82    1.22    0.25    0.29    0.68
max     10.00   9.19    9.54    10.09   9.28    14.52
mean    5.35    7.66    5.83    7.25    6.49    6.92

Table 1: The minimum, maximum and average angles (in degrees) between the test views in the supplementary video and their nearest training views.

Appendix 0.C Testing Specifications

In the supplementary video, we show renderings of the captured objects under novel viewpoints and lighting. Note that our training images are captured with a collocated light and camera, and the relighting results in the video demonstrate that our volumetric representation generalizes to novel lighting conditions. In Tab. 1, we report the minimum, maximum and average angles between the test views in the video and their nearest training views. These large angular differences also show that our deep reflectance volumes generalize well to novel views.
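The per-view angles summarized in Tab. 1 can be computed from unit view directions as in this sketch; `nearest_view_angles` is our own helper, assuming directions pointing toward the object center:

```python
import numpy as np

def nearest_view_angles(test_dirs, train_dirs):
    """Angle (degrees) from each test view direction to its nearest
    training view direction; both inputs are (N, 3) unit vectors."""
    cos = np.clip(test_dirs @ train_dirs.T, -1.0, 1.0)
    return np.degrees(np.arccos(cos.max(axis=1)))
```

Reporting the min/max/mean of these angles then gives one row-triple of Tab. 1 per scene.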

Figure 10: Comparison with ground truth on relighting under environment illumination. The environment map used for rendering is shown at the bottom.
Figure 11: Comparison against Sitzmann et al. [48] on synthesizing novel views under collocated lights. Our method is able to generate high-quality results with fewer artifacts.
Figure 12: Geometry reconstructed from Nam et al. [37].

Appendix 0.D Results on Synthetic Data

In addition to the real captures, we also evaluate our method on a synthetic dataset, where we render a synthetic scene from multiple viewpoints under a collocated camera and light. We compare our view synthesis and relighting results with the ground-truth renderings; please see the supplementary video for comparisons.

By linearly combining the relit images under each light corresponding to the pixels of an environment map, our method also supports rendering the scene under novel environment illumination. In Fig. 10 we show our environment-map relighting result and compare it to ground-truth renderings from a physically-based renderer. As the figure shows, our method generates visually plausible results.
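The linear combination described above reduces to a weighted sum over the per-light renderings; `env_relight` and its weight convention (environment-map pixel intensities, assumed pre-scaled by solid angle) are illustrative assumptions:

```python
import numpy as np

def env_relight(relit_stack, env_weights):
    """Combine per-light renderings into an environment-lit image.

    relit_stack: (L, H, W, 3) renderings, one per environment-map direction.
    env_weights: (L,) weights from the environment map (assumed to
    already include solid-angle scaling).
    """
    return np.tensordot(env_weights, relit_stack, axes=1)
```

This works because light transport is linear in the illumination, so any environment map is a weighted sum of the captured point lights.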

Appendix 0.E Comparison on View Synthesis

In Fig. 11 we show a visual comparison against the method proposed by Sitzmann et al. [48] on synthesizing novel views under collocated lights. Sitzmann et al. learn a 3D-aware neural representation to encode the view-dependent appearance of captured scenes. Their method cannot model the complex geometry and appearance of our real scenes: it fails to synthesize novel views correctly and generates distorted images with undesired structures. In contrast, our method produces images of much higher quality.

Appendix 0.F Mesh-Based Appearance Acquisition

In Fig. 12 we show the optimized geometry from Nam et al. [37]. They use a state-of-the-art multi-view stereo (MVS) framework to obtain an initial geometry and further refine it through optimization; however, they still fail to recover faithful geometry for such challenging scenes with textureless and thin-structured regions, resulting in degraded quality in the reproduced appearance, as shown in the supplementary video.