Modeling a real scene from captured images and reproducing its appearance under novel conditions is a central problem in computer graphics and vision. This has traditionally been accomplished by using 3D reconstruction and inverse rendering methods to recover scene geometry and reflectance (Zhou et al., 2013; Nam et al., 2018). However, this is an extremely challenging task, and even state-of-the-art methods generate inaccurate reconstructions that produce images with significant artifacts when rendered.
More recently, many approaches have been proposed that circumvent the problem of explicit reconstruction and instead estimate “neural” scene representations that can be combined with an appropriate differentiable rendering method to generate novel images (see Tewari et al. (2020) for a recent survey). One line of work in this space combines neural scene representations with classical ray marching, a volume rendering approach that is naturally differentiable, to achieve realistic rendering without requiring any pre-acquired 3D geometry (Lombardi et al., 2019; Mildenhall et al., 2020; Sitzmann et al., 2019b). However, these methods are mostly designed for view synthesis and do not model scene appearance as a function of reflectance or lighting. As a result, they do not allow for tasks such as relighting or scene editing. While ray marching can be used with discrete volumes with explicit per-voxel BRDFs (Bi et al., 2020a) to enable both relighting and view synthesis, an explicit discretized volume representation is highly restricted by its fixed resolution, and cannot reproduce high-frequency appearance details like fine textures and sharp boundaries.
In this work, we propose a novel scene representation that we refer to as Neural Reflectance Fields. Unlike previous work that models scene color (Lombardi et al., 2019) or radiance (Mildenhall et al., 2020), neural reflectance fields account for both scene geometry and reflectance. This allows us to combine neural reflectance fields with a ray-marching framework (see Fig. 2) to render images under arbitrary view and lighting. Moreover, the whole pipeline is differentiable allowing us to pose the problem of appearance acquisition as one of optimizing for a neural reflectance field that, when rendered, will match the captured scene images. Based on this, we capture multiple images around the scene with a cellphone camera and its built-in flash, similar to the acquisition in recent work on material acquisition (Li et al., 2018a; Deschaintre et al., 2018), relighting (Xu et al., 2018) and inverse rendering (Nam et al., 2018). This practical setup yields unstructured multi-view images under collocated point illumination and captures complex high-frequency scene appearance. As we illustrate in Fig. 1, neural reflectance fields can be reconstructed from such “simple” inputs and allow for the photo-realistic rendering of complex real scenes under novel viewpoints and lighting conditions (that are arbitrarily different from the captured collocated lighting).
Neural reflectance fields are a continuous functional neural representation that implicitly models both scene geometry and reflectance. We represent them using a deep multi-layer perceptron (MLP) that regresses reflectance properties, normals, and volume density at a given 3D scene point. This representation can be combined with a differentiable ray marching framework based on classical physically based volume rendering (Kniss et al., 2003; Novák et al., 2018; Max, 1995). In particular, we march rays from the viewpoint through each pixel and sample shading points along each ray; at each shading point, we compute shading using the regressed normal and reflectance properties. This shading is modulated with transmittance (computed from the regressed volume density) and accumulated along the ray to compute radiance.
We utilize the transmittance not only along the camera ray but also along the light ray to model light effects like shadows for complex real scenes (see Fig. 1). Naively computing the light transmittance requires marching rays from all the points sampled along the camera ray to the light, making it intractable both for reconstruction and rendering. Instead, we note that the collocated nature of our input data simplifies this for us because it only requires us to evaluate transmittance along the identical camera-light ray, thus allowing us to efficiently fit neural reflectance fields to image data. To further speed up re-rendering under arbitrary light and view positions, we pre-compute a light transmittance volume at adaptively sampled points, enabling efficient shadow rendering.
Our entire rendering pipeline is general and can support any network that can map a 3D point to rendering parameters and any differentiable reflectance model. For example, in Fig. 1, 5, 7, we demonstrate that we can accurately model the appearance of a diverse set of real scenes, including scenes with intricate geometry, highly specular reflectance, furry objects, and human portraits. These results are significantly better than the state-of-the-art mesh-based reconstruction method (Nam et al., 2018) and discrete volume-based representations (Bi et al., 2020a) (see Fig. 4).
Moreover, because our representation is designed to work with a physically-based volume renderer, it can in fact be naturally incorporated into modern rendering engines, like Mitsuba (Jakob, 2010). This allows us to compose neural reflectance fields with traditional 3D models (with explicit meshes and BRDFs) and capture light transport interactions between these disparate scene elements (see Fig. 9). This is something that has not been demonstrated by previous neural reconstruction methods, and in our opinion, represents an important step towards building neural capture and rendering approaches that can be incorporated into traditional 3D design workflows.
In summary, our main contributions are:
A novel neural reflectance field representation that models both scene geometry and reflectance,
A physically-based ray marching scheme that can render neural reflectance fields under any view and lighting,
A method to reconstruct neural reflectance fields from unstructured flash images, and
Applications of this representation to tasks like view synthesis, relighting, and scene composition.
2. Related work
Neural scene representations.
Previous work has applied deep neural networks to many 3D tasks with scene geometry modeled by various representations, such as volumes (Ji et al., 2017; Richter and Roth, 2018), point clouds (Qi et al., 2017), implicit functions (Mescheder et al., 2018; Sitzmann et al., 2019b), etc. Reflectance modeling has also been explored with neural networks (Kuznetsov et al., 2019; Vicini et al., 2019; Rainer et al., 2019). We present the novel neural reflectance field that models both geometry and reflectance in a real scene.
Thies et al. (2019) apply neural textures for realistic image synthesis, but require a pre-acquired mesh as input. Many previous works aim to do view synthesis without any known geometry. Multiplane images have been used in small-baseline view synthesis (Zhou et al., 2018; Srinivasan et al., 2019); however, such a view-dependent representation only supports limited viewing range and requires special fusion techniques to extend the range (Mildenhall et al., 2019). Recent works leverage view-independent volumes, which are able to handle complex view-dependent effects (Sitzmann et al., 2019a; Lombardi et al., 2019). Our neural reflectance field models complete scene appearance; in addition to view synthesis, ours can also be used for other applications such as relighting.
Recently, ray marching has been used to train many neural scene representations for view synthesis without any ground-truth 3D representations (Mildenhall et al., 2020; Lombardi et al., 2019; Sitzmann et al., 2019b). Lombardi et al. (2019) apply ray marching in a discrete volume with a warping field for view synthesis. To make it generalizable to relighting, Bi et al. (2020a) reconstruct discrete reflectance volumes with explicit per-voxel BRDFs; however, the fixed resolution of the discrete volume limits the appearance details in the rendering. In contrast, we leverage a continuous functional neural representation and achieve much better results (see Fig. 4). Mildenhall et al. (2020) present a neural radiance field, which also represents a scene as a continuous function with an MLP. However, their representation only supports view synthesis by directly rendering radiance from a new viewpoint under fixed illumination. We leverage a novel reflectance-aware ray marching framework and learn to regress multiple decomposed shading components, which enables relighting and many other image synthesis applications.
Geometry and reflectance capture.
Classically, modeling and re-rendering a real scene requires full reconstruction of its geometry and reflectance. From captured images, scene geometry is usually reconstructed by structure-from-motion and multi-view stereo (MVS) (Kutulakos and Seitz, 2000; Esteban and Schmitt, 2004; Furukawa and Ponce, 2009; Schönberger and Frahm, 2016; Schönberger et al., 2016), which have recently been extended using deep learning techniques (Yao et al., 2018, 2019; Chen et al., 2019; Cheng et al., 2019).
Reflectance acquisition traditionally requires sophisticated devices to sample the light-view space (Foo, 1997; Matusik et al., 2003; Nielsen et al., 2015; Xu et al., 2016; Kang et al., 2018, 2019). Recently, many works use a practical device, a modern cellphone that has a camera and a built-in flash light, and capture flash images to acquire spatially varying BRDFs (Aittala et al., 2016, 2015; Hui et al., 2017; Nam et al., 2018). While such a device only acquires reflectance samples under collocated light and view, with enough samples, it is still sufficient to reconstruct many standard analytic reflectance models that are governed by the half-angle vector (Nam et al., 2018; Hui et al., 2017). More recently, deep learning methods have made BRDF acquisition with a single flash image possible (Li et al., 2018a, b; Deschaintre et al., 2018). Bi et al. (2020b) extend the single-view case to a structured multi-view configuration, and reconstruct meshes with per-vertex BRDFs from only six images.
We aim to model geometry and appearance of complex real scenes from multi-view unstructured flash images. From such inputs, Nam et al. (2018) leverage an initial mesh from MVS and reconstruct per-vertex BRDFs via traditional optimization. However, it is very difficult for traditional mesh-based methods to recover challenging thin structures and sharp specularities of complex real scenes. In this work, we address these issues by proposing a novel neural reflectance field to implicitly model the scene’s geometry and reflectance, bypassing explicit mesh reconstruction. Our approach achieves photo-realistic rendering results with high-frequency appearance details that are significantly better than previous works.
Relighting and view synthesis.
Scene acquisition and rendering can also be achieved using image-based techniques without explicit reconstruction (Levoy and Hanrahan, 1996; Debevec et al., 2000). Recently, many learning-based view synthesis methods have been presented (Zhou et al., 2018; Hedman et al., 2018; Srinivasan et al., 2017; Xu et al., 2019; Mildenhall et al., 2020). We extend the ray marching used in these view synthesis works to a more general reflectance-aware ray marching framework, which can also be used for relighting. Learning-based relighting methods have also been presented (Ren et al., 2015; Xu et al., 2018), which are able to reproduce challenging appearance effects. Many techniques regress images under novel lighting from sparse inputs without any explicit geometry reasoning (Xu et al., 2018; Zhou et al., 2019; Sun et al., 2019), but are unable to recover accurate hard shadows. Philip et al. (2019) require a mesh from MVS for shadowing computation. Our network learns to regress volume density to model detailed scene geometry. Our ray marching considers light transmittance in ray integration, which recovers challenging hard shadows.
Note that previous image-based techniques often send viewing (Mildenhall et al., 2020; Lombardi et al., 2019) or lighting (Xu et al., 2018; Sun et al., 2019) information as additional inputs to the network, and compute challenging view- or light-dependent shading effects through the network processing. In contrast, we leverage classical reflectance models to regularize the learning process; our neural reflectance field is independent of the viewing and lighting directions, and we use the regressed reflectance and normal to compute shading under any lighting and viewpoint. Our approach can properly model scene appearance and reproduce challenging view-dependent and light-dependent shading effects.
3. Reflectance-aware Ray Marching
While differentiable ray marching has been used in recent works (Lombardi et al., 2019; Mildenhall et al., 2020), these methods focus on view synthesis and only consider view-dependent effects. We utilize classical reflectance models in a more general ray marching framework (see Fig. 2) that also models lighting and enables relighting and other re-rendering applications. Our reflectance-aware rendering framework is differentiable and can be easily combined with deep learning to learn scene appearance. In this section, we first discuss the underlying rendering equation that governs our volume rendering (Sec. 3.1), and then introduce our ray marching framework that numerically computes the equation in a differentiable way (Sec. 3.2).
3.1. Rendering equation
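In general, for non-emissive and non-absorptive volumes, physically-based volume rendering is governed by the volume rendering equation (Novák et al., 2018) that computes the radiance $L$ at point $\boldsymbol{c}$ in direction $\boldsymbol{\omega}_o$:
$$L(\boldsymbol{c}, \boldsymbol{\omega}_o) = \int_0^{\infty} \tau_c(\boldsymbol{x})\, \sigma(\boldsymbol{x})\, L_s(\boldsymbol{x}, \boldsymbol{\omega}_o)\, dx, \tag{1}$$
$$\tau_c(\boldsymbol{x}) = \exp\left(-\int_0^{x} \sigma(\boldsymbol{z})\, dz\right). \tag{2}$$
Here, $x$ represents the 1D location on a ray traced in the volume, $\boldsymbol{x}$ represents the 3D point at $x$, the point $\boldsymbol{c}$ typically represents the camera location, and $L_s$ represents the scattered light at $\boldsymbol{x}$ along $\boldsymbol{\omega}_o$.
$\sigma$ is the extinction coefficient that indicates the probability density of medium particles; we refer to $\sigma$ as volume density in this paper. $\tau_c$ represents the transmittance factor which determines the loss of light along the ray from $\boldsymbol{x}$ to $\boldsymbol{c}$.
Eqn. 1 computes the radiance that arrives at $\boldsymbol{c}$ by integrating the modulated in-scattered light $L_s$ along the ray, with
$$L_s(\boldsymbol{x}, \boldsymbol{\omega}_o) = \int_{S} f_p(\boldsymbol{x}, \boldsymbol{\omega}_o, \boldsymbol{\omega}_i)\, L_i(\boldsymbol{x}, \boldsymbol{\omega}_i)\, d\boldsymbol{\omega}_i, \tag{3}$$
where $S$ is a unit sphere, $f_p$ is a phase function that governs light scattering, and $L_i$ is the incident radiance arriving at $\boldsymbol{x}$ from direction $\boldsymbol{\omega}_i$.
Note that previous work (Mildenhall et al., 2020) directly encodes $L_s$ without considering any form of Eqn. 3; this assumes fixed illumination and only works for view synthesis. In contrast, we consider single-bounce direct illumination under a single point light source to approximate $L_s$. Inspired by Max (1995), we compute $L_s$ with an explicit reflectance term that assumes the role of a phase function:
$$L_s(\boldsymbol{x}, \boldsymbol{\omega}_o) = f_r(\boldsymbol{\omega}_o, \boldsymbol{\omega}_i, \boldsymbol{n}(\boldsymbol{x}), \boldsymbol{R}(\boldsymbol{x}))\, L_i(\boldsymbol{x}, \boldsymbol{\omega}_i), \tag{4}$$
where $f_r$ represents a differentiable reflectance model with parameters $\boldsymbol{R}$, $\boldsymbol{n}$ is the local surface shading normal, and $L_i$ represents the incident radiance as in Eqn. 3. When only considering direct illumination from a point light source, $L_i$ is determined by the intensity of the light source and the loss of light due to extinction through the volume:
$$L_i(\boldsymbol{x}, \boldsymbol{\omega}_i) = \tau_l(\boldsymbol{x})\, L_l(\boldsymbol{x}), \tag{5}$$
where $\tau_l$ is the transmittance from the light to the shading point, and $L_l$ represents the light intensity with the consideration of distance attenuation. Here, $\boldsymbol{l}$ denotes the position of the point light source, and thus $\boldsymbol{\omega}_i$ corresponds to the direction of the vector from $\boldsymbol{x}$ to $\boldsymbol{l}$. Substituting Eqns. 4 and 5 into Eqn. 1 gives our final rendering equation:
$$L(\boldsymbol{c}, \boldsymbol{\omega}_o) = \int_0^{\infty} \tau_c(\boldsymbol{x})\, \tau_l(\boldsymbol{x})\, \sigma(\boldsymbol{x})\, f_r(\boldsymbol{\omega}_o, \boldsymbol{\omega}_i, \boldsymbol{n}(\boldsymbol{x}), \boldsymbol{R}(\boldsymbol{x}))\, L_l(\boldsymbol{x})\, dx. \tag{6}$$
Our equation considers complete one-bounce camera-volume-light paths in the light transport. Unlike previous work that only considers the view transmittance or opacity between the shading point and the camera (Mildenhall et al., 2020; Lombardi et al., 2019), we also explicitly express the light transmittance ($\tau_l$) from the point to the light, which allows us to render realistic shadows under different point light sources. Essentially, instead of modeling scene radiance $L_s$ directly, as is done in previous view synthesis work (Mildenhall et al., 2020), we decouple the multiple factors ($f_r$, $\tau_l$, $L_l$) that are embedded in $L_s$, and explicitly model the scene reflectance parameters in $f_r$, thus allowing for reflectance-aware rendering for both view synthesis and relighting with realistic shading and shadowing effects.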
3.2. Ray marching
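We use ray marching to numerically estimate the continuous integral in Eqn. 6 similarly to prior work on volume rendering (Max, 1995; Kniss et al., 2003); this is illustrated in Fig. 2. Specifically, we march rays from the camera center through each pixel on the image plane and sample a sequence of shading points $\boldsymbol{x}_j$ on each ray. The rendering equation can be estimated by:
$$L(\boldsymbol{c}, \boldsymbol{\omega}_o) = \sum_j \tau_c(\boldsymbol{x}_j)\, \tau_l(\boldsymbol{x}_j)\, \sigma(\boldsymbol{x}_j)\, f_r(\boldsymbol{x}_j)\, L_l(\boldsymbol{x}_j)\, \Delta x_j, \tag{7}$$
where $\Delta x_j$ represents the ray step size at point $\boldsymbol{x}_j$. Here, we omit the other parameters ($\boldsymbol{\omega}_o$, $\boldsymbol{\omega}_i$, $\boldsymbol{n}$, $\boldsymbol{R}$) in $f_r$ for brevity. The transmittance $\tau_c$ is also an integral (Eqn. 2) and can be numerically evaluated by:
$$\tau_c(\boldsymbol{x}_j) = \exp\left(-\sum_{k=0}^{j} \sigma(\boldsymbol{x}_k)\, \Delta x_k\right). \tag{8}$$
The transmittance $\tau_l$ can be similarly evaluated, but it requires sampling another sequence of points $\boldsymbol{x}'_p$ on an additional ray marched from the light source to the shading point $\boldsymbol{x}_j$:
$$\tau_l(\boldsymbol{x}_j) = \exp\left(-\sum_{p} \sigma(\boldsymbol{x}'_p)\, \Delta x'_p\right). \tag{9}$$
Naively computing Eqn. 9 for Eqn. 7 would require marching a large number of light rays for all shading points on all camera rays. Instead, we leverage a collocated light source and camera setup (where the camera and light rays are the same) to avoid this during training; this is described in Sec. 4.2. At inference time we precompute an adaptive transmittance volume to efficiently approximate Eqn. 9 under any point light; this is described in Sec. 4.3.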
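To make the numerical estimate concrete, the following is a minimal plain-Python sketch of marching a single camera ray. It is an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the shading term (reflectance times light intensity) is collapsed to one scalar per sample, and the estimator is assumed to be a Riemann sum that weights each sample by view transmittance, light transmittance, volume density, shading, and step size, with the transmittance sum including the current sample.

```python
import math


def view_transmittance(sigmas, deltas):
    """Transmittance from the ray origin to each sample.

    Assumption: the optical depth includes the current sample (sum over k = 0..j).
    """
    taus = []
    optical_depth = 0.0
    for sigma, delta in zip(sigmas, deltas):
        optical_depth += sigma * delta
        taus.append(math.exp(-optical_depth))
    return taus


def march_ray(sigmas, deltas, shading, light_taus=None):
    """Riemann-sum radiance estimate along one camera ray.

    sigmas[j]     -- volume density at sample j
    deltas[j]     -- step size at sample j (may vary per sample)
    shading[j]    -- scalar stand-in for reflectance times light intensity
    light_taus[j] -- light transmittance at sample j; None means the light is
                     collocated with the camera, so the view transmittance is reused
    """
    cam_taus = view_transmittance(sigmas, deltas)
    if light_taus is None:
        light_taus = cam_taus
    radiance = 0.0
    for tau_c, tau_l, sigma, delta, shade in zip(
        cam_taus, light_taus, sigmas, deltas, shading
    ):
        radiance += tau_c * tau_l * sigma * shade * delta
    return radiance
```

Because the step sizes enter the sum explicitly, the samples do not need to be evenly spaced, which is what permits adaptive sampling along the ray.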
Equations 7, 8, and 9 express our reflectance-aware ray marching framework. Given a camera $\boldsymbol{c}$ and a point light $\boldsymbol{l}$, the framework computes the radiance $L$ of any marching ray through a scene from the volume density $\sigma$, normal $\boldsymbol{n}$, and reflectance properties $\boldsymbol{R}$ of the points in the scene. Bi et al. (2020a) utilize a rendering equation similar to ours (Eqn. 6), but they leverage classical opacity accumulation (where the opacity is given by $1 - e^{-\sigma \Delta x}$) to numerically evaluate the integral, which only supports a fixed step size for ray marching. In contrast, we utilize volume density for numerical estimation, which is more general and allows the step sizes to vary across the shading points. This enables applying better adaptive sampling strategies to distribute the shading points along both camera and light rays (see Sec. 4.2 and Sec. 4.3). In addition, since volume density is standard in Monte Carlo based volume rendering, the model learned from our ray marching framework can also be used in standard rendering engines for general graphics applications (see Fig. 9).
Our ray marching framework supports any differentiable reflectance model $f_r$, which makes the full rendering process trivially differentiable. In this work, we demonstrate most results using a classical analytic BRDF (Karis and Games, 2013) for $f_r$, which models the reflectance of opaque surfaces with a diffuse albedo and a specular roughness. We also show results with hair/fur reflectance models (Kajiya and Kay, 1989) that model the appearance of furry objects, demonstrating the generality of this formulation. Our reflectance-aware ray marching framework can potentially be combined with any module that is able to provide the rendering properties ($\sigma$, $\boldsymbol{n}$, $\boldsymbol{R}$) of an arbitrary point in the scene. In this work, we use a neural network to regress the necessary rendering properties.
4. Neural reflectance fields
We now present our neural reflectance field representation that uses deep fully connected networks to model scene geometry and reflectance (Sec. 4.1). As shown in Fig. 2, this network can be used in conjunction with the reflectance-aware ray marching scheme described previously. We show how it can be effectively trained from cellphone flash images (Sec. 4.2). We also present an adaptive transmittance volume for light transmittance precomputation, enabling efficient rendering under any novel light and view positions with realistic shadows (Sec. 4.3).
Given a reflectance model with $m$ parameters, a neural reflectance field outputs a $(m+4)$-dimensional vector, comprising volume density (1-D), normal (3-D), and reflectance properties ($m$-D), at any 3-D position in a scene. In practice, we use a microfacet BRDF model (Walter et al., 2007) where the reflectance properties comprise diffuse albedo and specular roughness, though we also demonstrate an extension using a fur reflectance model (Kajiya and Kay, 1989). We parameterize neural reflectance fields using an MLP with 14 fully connected layers and ReLU activation layers. Please refer to the supplementary material for the detailed architecture of our network.
Inspired by (Rahaman et al., 2018; Vaswani et al., 2017; Mildenhall et al., 2020), we use a frequency-based positional encoding of a given 3D location . In particular, given each dimension of the 3D point, we map the scalar value to
where represents the highest frequency level ( in our experiments). These are input to the MLP to regress the scene properties at the encoding , and .
Unlike Mildenhall et al. (2020), who also use a positional encoding of the viewing direction, we only use the 3D location as input, inferring view- (and light-) independent scene appearance properties. This is possible because we separate out these factors and compute shading, viewing, and lighting information in our ray marching framework (Eqn. 6, 7); this allows neural reflectance fields to be directly plugged into it for high-quality rendering.
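As a rough illustration of a frequency-based positional encoding applied to the 3D location only, consider the sketch below. It is written under assumptions that may not match our exact formulation: the function names are hypothetical, the frequencies are taken as powers of two starting at $2^0$, and no factor of $\pi$ is applied.

```python
import math


def encode_scalar(value, num_levels):
    """Map one coordinate to [sin(2^0 v), cos(2^0 v), ..., sin(2^(L-1) v), cos(2^(L-1) v)].

    Assumption: frequencies double at each level and start at 1.
    """
    features = []
    for level in range(num_levels):
        freq = 2.0 ** level
        features.append(math.sin(freq * value))
        features.append(math.cos(freq * value))
    return features


def encode_point(point, num_levels):
    """Encode each coordinate of a 3D point independently and concatenate."""
    features = []
    for coord in point:
        features.extend(encode_scalar(coord, num_levels))
    return features
```

With `num_levels` frequency levels, a 3D point maps to `6 * num_levels` features; no viewing or lighting direction is encoded, consistent with the view- and light-independent design described above.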
4.2. Learning neural reflectance fields from flash images
We now describe how we can use neural reflectance fields to reconstruct the appearance of a real-world scene from images. Each neural reflectance field is fit to a specific scene via a training process. Since our whole rendering process (the representation and the ray marching) is differentiable, we train the neural reflectance field network to minimize the error between rendered images and captured images of the scene.
Collocated light and view.
In particular, we capture flash images with collocated light and view to train our networks. Such images can be easily captured by a cellphone with a camera and a flash. The collocated setting leads to $\boldsymbol{l} = \boldsymbol{c}$ in our training, i.e., the light and camera positions coincide. One key benefit of using collocated light and view is that the view transmittance and the light transmittance become equal in Eqn. 7. This avoids marching an additional ray towards the light at every shading point, which would make training intractable. This capture setup thus has the advantage of making both acquisition and training practical. However, this also means that our input images represent an extremely sparse sampling of scene appearance across the view-light space. In fact, we have no samples of the scene for lighting from any non-zero angle with respect to the camera. In spite of this, we show that we can reconstruct high-quality scene appearance and render images under arbitrary view and even non-collocated lighting.
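The simplification can be written out explicitly. Assuming the ray marching estimate takes the Riemann-sum form in which each shading point $\boldsymbol{x}_j$ is weighted by view transmittance $\tau_c$, light transmittance $\tau_l$, volume density $\sigma$, reflectance $f_r$, light intensity $L_l$, and step size $\Delta x_j$ (the notation here is an assumption), setting $\tau_l = \tau_c$ gives:

```latex
L(\boldsymbol{c}, \boldsymbol{\omega}_o)
  \approx \sum_j \tau_c(\boldsymbol{x}_j)^2 \,
          \sigma(\boldsymbol{x}_j) \,
          f_r(\boldsymbol{x}_j) \,
          L_l(\boldsymbol{x}_j) \,
          \Delta x_j
```

Only the densities already sampled along the camera ray are needed, so the cost per ray is the same as for a view-synthesis-only ray marcher.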
Adaptive sampling for camera rays.
To optimize the point sampling in ray marching, for each scene, we train two networks—a coarse and a fine neural reflectance field—and render using a coarse-to-fine adaptive sampling procedure. Inspired by (Mildenhall et al., 2020), we first sample a sparse set of points on each marching ray with stratified sampling to compute a distribution function using the coarse network, then sample a dense set of points from the distribution function to compute the final radiance value using the fine network.
In particular, we divide each full ray segment into bins and randomly sample a point from each bin to get stratified samples. From these points, we can compute the radiance from the coarse-level network for the ray using Eqn. 7. As a side product, we can also produce corresponding per-point contribution weights:
$$w(\boldsymbol{x}_j) = \tau_c(\boldsymbol{x}_j)\, \sigma(\boldsymbol{x}_j)\, \Delta x_j.$$
The weight $w(\boldsymbol{x}_j)$ essentially describes how visible the point at $\boldsymbol{x}_j$ is to the camera. We construct a piece-wise constant probability distribution by normalizing the per-point weights and then sample points from this distribution, which adaptively selects new samples according to the visibility information gathered from the coarse neural reflectance field. We then use all sampled points to compute the final radiance with Eqn. 7 using the rendering parameters from the fine reflectance field. This coarse-to-fine adaptive sampling effectively distributes more sampled points in the regions that contribute most to the rendering integral, allowing for accurate shading computation with high-frequency details.
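The two sampling passes can be sketched as follows. This is a simplified illustration rather than our actual code: the function names are hypothetical, each coarse weight is treated as the weight of its whole bin, and fine samples are drawn by inverting the cumulative distribution of the normalized weights.

```python
import bisect
import random


def stratified_samples(near, far, num_bins, rng):
    """Draw one uniformly random depth from each of num_bins equal bins."""
    width = (far - near) / num_bins
    return [near + (i + rng.random()) * width for i in range(num_bins)]


def sample_from_weights(bin_edges, weights, num_samples, rng):
    """Draw depths from the piecewise-constant PDF given by per-bin weights.

    bin_edges has len(weights) + 1 entries; weights need not be normalized.
    """
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    cdf = []
    running = 0.0
    for w in weights:
        running += w / total
        cdf.append(running)
    samples = []
    for _ in range(num_samples):
        u = rng.random() * cdf[-1]
        # First bin whose cumulative mass exceeds u; zero-weight bins are skipped.
        i = min(bisect.bisect_right(cdf, u), len(cdf) - 1)
        lower = cdf[i - 1] if i > 0 else 0.0
        span = cdf[i] - lower
        frac = (u - lower) / span if span > 0.0 else 0.5
        frac = min(max(frac, 0.0), 1.0)
        samples.append(bin_edges[i] + frac * (bin_edges[i + 1] - bin_edges[i]))
    return sorted(samples)
```

In use, the coarse pass would evaluate the weights at the stratified depths, and the fine pass would evaluate the fine network at the union of the coarse and newly drawn depths.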
4.3. Efficient rendering under novel light and view
While our neural reflectance field is learned from flash images with collocated light and view, the learned representation can be directly used to render the scene with single-scattering effects using any light and view positions with Eqn. 7. However, accurately computing the light transmittance $\tau_l$ at inference time under novel non-collocated light and view (unlike training) is extremely computationally expensive. Therefore, we propose to pre-compute an adaptive transmittance volume to effectively approximate $\tau_l$.
Adaptive sampling for light transmittance volume.
Inspired by the classical shadow map technique (Stamminger and Drettakis, 2002; Williams, 1978) in rasterization, we use our learned neural reflectance field to compute a transmittance volume similar to (Lokovic and Veach, 2000) for fast light transmittance computation. Specifically, we place a virtual image plane in front of the point light source towards the scene and march a ray through each pixel, analogously to what is classically done for ray marching from the camera. Similar to the adaptive sampling for camera rays described in Sec. 4.2, we use the two trained networks (a coarse and a fine network) to perform adaptive sampling. We first utilize the coarse representation to compute a visibility-aware distribution function using sparse points sampled from stratified bins; we then sample dense points from the distribution. We combine the samples from both passes and compute their light transmittance, resulting in a transmittance volume that adapts to the visibility information inferred from the coarse network. This adaptive transmittance volume is illustrated in Fig. 3.
We do ray marching from the viewpoint to render an image under any point light source from any viewpoint using the learned network and the pre-computed adaptive transmittance volume. At any given shading point, we locate the nearest sampled points and then linearly interpolate the transmittance volume to get the required light transmittance, similar to (Lokovic and Veach, 2000). This allows for realistic shadowing effects to be well recovered in our results when doing relighting.
We also apply coarse-to-fine sampling to the camera rays at inference time, as described for training in Sec. 4.2. In short, at inference we apply coarse-to-fine adaptive sampling when marching rays from both the light and the camera, which yields efficient light transmittance computation and effective final image synthesis.
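The lookup step can be illustrated with a one-dimensional sketch that interpolates precomputed transmittance along a single light ray. This is a deliberate simplification: the function name is hypothetical, and interpolation across neighboring light rays (pixels of the virtual image plane) is omitted.

```python
import bisect


def interpolate_transmittance(depths, taus, query_depth):
    """Linearly interpolate precomputed light transmittance at query_depth.

    depths -- strictly increasing sample depths along one light ray
    taus   -- light transmittance stored at each sample depth
    Queries outside the sampled range are clamped to the nearest stored value.
    """
    if query_depth <= depths[0]:
        return taus[0]
    if query_depth >= depths[-1]:
        return taus[-1]
    hi = bisect.bisect_right(depths, query_depth)
    lo = hi - 1
    t = (query_depth - depths[lo]) / (depths[hi] - depths[lo])
    return (1.0 - t) * taus[lo] + t * taus[hi]
```

Since the stored depths are adaptively placed, they are generally unevenly spaced, which is why the neighbors are found by binary search rather than by direct indexing.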
As noted before, during training, our network only sees images that are captured under collocated light and view and do not have any shadows. Yet, our method is able to learn a volume density that meaningfully expresses scene geometry. This allows us to synthesize high-quality relighting and view synthesis results with realistic shadows, specularities and other appearance effects under novel, non-collocated light and view, as illustrated in Figs. 1, 4, 5.
As discussed in Sec. 4.2, we reconstruct neural reflectance fields from images captured under collocated view and lighting. Such data can be practically acquired by shooting a video using a handheld cellphone with flash. We show acquisition and rendering results of one human portrait using this handheld setup in Fig. 7; in this case, we selected 150 frames from the video as input. To facilitate the data acquisition, for other results, we use a robotic arm holding a cellphone to automatically capture scenes that are composed of different static objects. We capture about 400 images using this automatic setup. We use a Samsung Galaxy Note 8 to capture all our real scenes. The camera parameters are calibrated using structure from motion in COLMAP (Schönberger and Frahm, 2016). Our method does not require accurate background masks for the input images to train the network. We simply crop the central regions around the objects in the captured images to avoid training on too many background pixels. Each network is trained in a scene-dependent way, using the input images for that single scene.
Our representation works with any differentiable reflectance model. In practice, we use a microfacet BRDF model that combines a diffuse Lambertian term with a specular term that uses the GGX distribution (Walter et al., 2007). The parameters of this model include a diffuse albedo and a specular roughness. With this model, the neural reflectance field MLP thus outputs an 8-D vector at every scene point, corresponding to the 3-D diffuse albedo, 1-D specular roughness, 3-D surface normal and 1-D volume density. We use this BRDF model for every result in the paper, except for Fig. 6 where we capture a furry object. Here we use the classical fur reflectance model (Kajiya and Kay, 1989) and replace the surface normal with a fiber tangent vector.
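For reference, a Lambertian-plus-GGX microfacet BRDF of this kind can be sketched as below. The details are assumptions made for illustration and are not specified in the text: the roughness is used directly as the GGX width parameter, the Fresnel term is Schlick's approximation with a fixed base reflectance `f0`, and the shadowing-masking term is the separable Smith form. All function names are hypothetical.

```python
import math


def ggx_distribution(n_dot_h, alpha):
    """GGX normal distribution function with width parameter alpha."""
    a2 = alpha * alpha
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)


def smith_g1(n_dot_v, alpha):
    """Smith masking term for GGX in one direction."""
    a2 = alpha * alpha
    return 2.0 * n_dot_v / (n_dot_v + math.sqrt(a2 + (1.0 - a2) * n_dot_v * n_dot_v))


def microfacet_brdf(albedo, alpha, n_dot_l, n_dot_v, n_dot_h, v_dot_h, f0=0.04):
    """Per-channel BRDF value: Lambertian diffuse plus GGX specular.

    albedo is a list of RGB values; f0 is an assumed constant base reflectance.
    All cosines are expected to be positive.
    """
    fresnel = f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5  # Schlick approximation
    d = ggx_distribution(n_dot_h, alpha)
    g = smith_g1(n_dot_l, alpha) * smith_g1(n_dot_v, alpha)
    specular = d * fresnel * g / (4.0 * n_dot_l * n_dot_v)
    return [channel / math.pi + specular for channel in albedo]
```

Under collocated light and view the half vector coincides with the view direction, so `n_dot_l`, `n_dot_v`, and `n_dot_h` are all equal and `v_dot_h` is 1.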
Training parameters and loss function.
We implement our neural reflectance field and ray marching in PyTorch. During training, we randomly sample pixel rays as a batch to train our network under collocated light as described in Sec. 4.2. We use the Adam optimizer with an initial learning rate of 0.0001. We use coarse samples and fine samples to adaptively sample light rays when building the adaptive transmittance volume and camera rays when computing the final radiance.
We supervise the regressed radiance values from both the coarse and the fine network with the ground truth radiance from the captured images using the loss. Since we consider opaque objects, we also regularize the ray transmittance (from the fine network), forcing it to be close to 0 or 1, which is helpful to get a clean background. Our total loss function is given by:
where denotes a pixel ray and is a hyper-parameter that controls the strength of the regularization term.
We use 4 NVIDIA RTX 2080Ti GPUs to train each reflectance field network for about 2 days. At inference time, the network takes about 30 seconds to render a image using our adaptive transmittance volume.
We now demonstrate our results in this section. We first evaluate our method by comparing our view synthesis and relighting results with other methods. We then show more results and applications of our method. Please refer to the supplementary video for more video results.
Comparisons with previous methods.
Most previous learning-based works focus only on the sub-problems of relighting (Xu et al., 2018; Ren et al., 2015) or view synthesis (Xu et al., 2019; Mildenhall et al., 2020; Lombardi et al., 2019; Mildenhall et al., 2019), and capture images with a fixed camera or fixed illumination, respectively. Instead, our input light and view are collocated and vary across all input images, allowing us to build a holistic scene representation that supports both view synthesis and relighting. We are aware of only a few methods that address this problem and we compare against two of them. The first is a state-of-the-art mesh-based appearance acquisition method (Nam et al., 2018) that reconstructs a 3D mesh and per-vertex BRDFs from collocated flash images. The reconstructed geometry and reflectance can then be used to achieve relighting and view synthesis. We also compare with a learning-based method (Bi et al., 2020a) that predicts a discrete volume with explicit per-voxel reflectance properties. This technique supports relighting and view synthesis via opacity accumulation-based ray marching. In Fig. 4, we show qualitative comparisons of images rendered from the respective reconstructions under novel collocated and non-collocated light-view settings. Results for all methods were generated from the same inputs by their respective authors. Please refer to the supplementary video for video comparisons.
Fig. 4 shows that our method achieves significantly better rendering results than (Nam et al., 2018). They leverage a classical multi-view stereo (MVS) method to reconstruct an initial mesh, and then recover a refined mesh and per-vertex BRDFs via traditional optimization. However, for challenging real scenes, MVS often fails to recover reasonable initial geometry in regions with little texture, high specularity, or thin structures. This leads to highly distorted and even missing geometry in their results. In addition, since specular effects typically influence very few pixels, their optimization-based reflectance estimation step is unable to recover them, leading to a mostly diffuse appearance. In contrast, our neural reflectance field bypasses mesh reconstruction and is able to accurately resolve fine geometric structure with volume densities. This leads to high-quality rendering results with realistic geometric details, high specularities and hard shadows.
Our method also outperforms the previous deep volume rendering method (Bi et al., 2020a). While that method also avoids the geometric reconstruction issues arising from Nam et al. (2018), it fails to recover high-frequency details in the results, as reflected in many of the insets shown in Fig. 4. This is because they regress a discrete volume with per-voxel BRDFs; the rendering quality is limited by the resolution of the volume, which is strictly constrained by system memory. Instead, by leveraging a continuous functional representation, our network can properly recover high-frequency appearance. Our neural reflectance field is also extremely compact, with weights consuming only 5 MB of memory. In contrast, (Bi et al., 2020a) uses a network that requires 400 MB to predict a volume that consumes several gigabytes of memory during rendering. Our approach is more efficient in terms of memory usage and has more potential to be extended to the capture of large-scale real scenes.
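To make the memory argument concrete, the following back-of-the-envelope sketch compares the storage of a dense voxel grid with that of a small fully connected network. The grid resolutions, the per-voxel channel count, and the layer widths are illustrative assumptions, not the configurations used by either method; only the orders of magnitude matter.

```python
BYTES_PER_FLOAT32 = 4

def voxel_grid_bytes(resolution, channels):
    """Memory of a dense resolution^3 grid with `channels` float32 values per voxel."""
    return resolution ** 3 * channels * BYTES_PER_FLOAT32

def mlp_bytes(layer_widths):
    """Memory of the float32 weights and biases of a fully connected network."""
    params = sum(n_in * n_out + n_out
                 for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]))
    return params * BYTES_PER_FLOAT32

# Hypothetical per-voxel payload: density (1) + normal (3) + albedo (3) + roughness (1).
CHANNELS = 8
for res in (128, 256, 512):
    print(f"{res}^3 grid: {voxel_grid_bytes(res, CHANNELS) / 2**20:.0f} MiB")

# Hypothetical MLP: 60-D encoded input, 14 hidden layers of width 256, 8 outputs.
widths = [60] + [256] * 14 + [8]
print(f"MLP weights: {mlp_bytes(widths) / 2**20:.1f} MiB")
```

Doubling the grid resolution multiplies its memory by eight, whereas the size of the network is independent of the spatial detail it is asked to represent.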
Additional results on diverse real scenes.
We now demonstrate additional view synthesis and relighting results from our method on diverse real scenes in Figs. 5, 6, and 7. Fig. 5 shows results on complex objects. Our method successfully recovers various challenging high-frequency appearance effects, such as detailed geometry, complex textures, specularities, and hard shadows. Note that the detailed thin geometry of the grass in Plane and the complex normal variation on the surfaces in Dragon and Superhero are all reproduced realistically. Our method can also handle challenging scenes that consist of multiple objects, like Shop. These lead to complex cast shadows between objects that our method accurately reproduces in spite of never having observed them in the input images. This can be attributed to the ability of our method to infer reliable geometry (in the form of a volume density) from just collocated image samples.
In Fig. 6, we acquire the appearance of a furry object. Here, we plug the classical fur reflectance model (Kajiya and Kay, 1989) into our representation, demonstrating the ability of neural reflectance fields to work with a wide range of reflectance models. While the results here are slightly blurrier than the other scenes (Fig. 5), they still look very realistic with the desired furry appearance. Our method can also be used to capture facial appearance, as shown in Fig. 7. Here, we use a handheld cellphone and simply capture a video (with flash) while walking around the person. From this video, we sample 150 images and train a neural reflectance field that allows for re-rendering under varying viewpoint and lighting. Acquiring facial appearance is an extensively studied problem and recent deep learning-based approaches have demonstrated portrait relighting from sparse inputs. However, these either require calibrated illumination (Xu et al., 2018; Meka et al., 2019) or focus on low-frequency illumination (Sun et al., 2019; Zhou et al., 2019). In contrast, our images are captured with a practical setup, and our renderings are of high quality with realistic specularities and hard shadows, in spite of our method not making any face-specific assumptions.
Since our setup only captures images under collocated view and light, we do not have ground truth captured images to evaluate renderings under non-collocated camera and light. We thus compare using a synthetic scene in Fig. 8, where we can render the ground truth under any lighting and viewpoint. As shown in Fig. 8, our method is able to accurately reproduce the high-frequency textures, specularities, and hard shadows in the rendered images, which are very close to the ground truth.
Integrating with Monte-Carlo renderers.
While neural rendering approaches have made remarkable progress in the recent past, one challenge with them is that they still require custom components that may not be consistent with standard scene representations and rendering engines. In addition, most current methods focus on the view synthesis task (Mildenhall et al., 2020; Lombardi et al., 2019) and do not model the interaction of lighting with the captured scene. While Bi et al. (2020a) do model lighting, their method is based on opacity accumulation and only supports a fixed step size, which is not valid for Monte Carlo rendering. In contrast, our neural reflectance field representation models all camera-light interactions with the scene. In addition, it is trained in conjunction with a physically-based ray marching framework. As a result, it can be easily integrated into standard graphics rendering engines by simply implementing the reflectance function as a special phase function.
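The step-size issue can be illustrated with a one-ray toy example. The sketch below assumes a homogeneous medium and contrasts density-based transmittance, which attenuates by exp(-sigma * dt) per step, with accumulation of a fixed per-sample opacity. The function names and constants are illustrative and are not taken from either method.

```python
import math

def transmittance_from_density(sigma, length, n_steps):
    """Ray-march a homogeneous medium with extinction `sigma` over `length`."""
    dt = length / n_steps
    t = 1.0
    for _ in range(n_steps):
        t *= math.exp(-sigma * dt)  # attenuation scales with the step length
    return t

def transmittance_from_opacity(alpha, n_steps):
    """Accumulate a fixed per-sample opacity `alpha`, ignoring the step length."""
    t = 1.0
    for _ in range(n_steps):
        t *= 1.0 - alpha
    return t

for n in (16, 64, 256):
    print(n,
          transmittance_from_density(2.0, 1.0, n),
          transmittance_from_opacity(0.05, n))
```

The density-based result stays at exp(-2) regardless of how many steps are taken, while the opacity-based result changes with the sample count. A Monte Carlo renderer samples path segments of arbitrary length, so it needs the former behavior.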
In particular, we use Mitsuba (Jakob, 2010) to render one of our captured neural reflectance fields under complex environment illumination, and show these results in Fig. 9. We simply compute discrete volumes from our reflectance fields and use these volumes for Monte Carlo rendering. While simple, this leads to very realistic rendering results in Fig. 9. Also note that this allows us to compose a scene that is made up of our captured object and traditional 3D models represented by meshes with BRDFs, and simulate the light transport between these different representations, including complex shadows and inter-reflections. While these results contain fewer details compared to our other results, this is caused by the limited volume resolution and could potentially be addressed by implementing our network directly in Mitsuba.
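Computing a discrete volume from a continuous field amounts to evaluating the field at voxel centers. The sketch below uses a hypothetical analytic `toy_field` (a sphere of constant density and albedo) as a stand-in for a trained network; the grid bounds, resolution, and output layout are assumptions, and exporting the result to a renderer's volume format is not shown.

```python
def toy_field(x, y, z):
    """Stand-in for a trained network: returns (density, albedo) at a 3D point."""
    inside = x * x + y * y + z * z < 0.5 ** 2
    density = 10.0 if inside else 0.0
    albedo = (0.8, 0.6, 0.4)
    return density, albedo

def bake_volume(field, resolution, lo=-1.0, hi=1.0):
    """Evaluate `field` at the centers of a resolution^3 grid over [lo, hi]^3."""
    step = (hi - lo) / resolution
    centers = [lo + (i + 0.5) * step for i in range(resolution)]
    # Indexed as volume[ix][iy][iz]; each entry is a (density, albedo) pair.
    return [[[field(x, y, z) for z in centers] for y in centers] for x in centers]

volume = bake_volume(toy_field, resolution=16)
occupied = sum(1 for plane in volume for row in plane for d, _ in row if d > 0.0)
print(f"{occupied} of {16 ** 3} voxels are non-empty")
```

Any detail finer than one voxel is lost in this step, which is consistent with the reduced detail noted for the Mitsuba renderings.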
Our method is able to produce high-frequency appearance effects with fine details in most cases. However, it may still produce slightly blurry results when there are too many details (like the results in Figs. 6 and 7). Increasing the network capacity could potentially alleviate this. While our method generally generates a clean background without requiring any masks, some minor dark floaters occasionally appear, mainly coming from background regions that are not dark enough and are seen by several views. This usually can be addressed by masking the volume density in 3D with a bounding box. Our adaptive transmittance is efficient, but it may introduce some minor flickering in videos when doing relighting, due to inconsistent adaptive samples across frames. Increasing the number of samples in the volume usually resolves this. Some of these issues are visible in the supplementary video.
We present a deep learning-based approach for appearance acquisition using a simple mobile phone setup. We present a novel neural reflectance field representation, which encodes volume rendering properties to model the geometry and reflectance of real scenes. We leverage a differentiable physically based ray marching framework to learn the neural reflectance field in a scene-dependent deep training process. We demonstrate that our neural reflectance field can be effectively estimated from cellphone flash images under collocated camera and light, allowing us to render photo-realistic images under arbitrary camera and (non-collocated) light positions. Our method is able to generate high-quality relighting and view synthesis results, reproducing challenging appearance effects, such as specularities, shadows, occlusions, and fine textures, which are significantly better than results from previous mesh-based and volume-based methods. Moreover, since our neural reflectance fields are learned in a physically based rendering framework, they can also be rendered in standard graphics rendering engines, enabling scene modeling applications. Our approach takes a step towards making neural capture and rendering more practical and compatible with standard graphics pipelines.
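Masking the volume density with a bounding box can be sketched as zeroing the density of any sample that falls outside an axis-aligned box around the object. The box extents and sample values below are hypothetical and would be chosen per scene.

```python
def mask_density(density, point, box_min, box_max):
    """Return `density` if `point` lies inside the axis-aligned box, else 0."""
    inside = all(lo <= p <= hi for p, lo, hi in zip(point, box_min, box_max))
    return density if inside else 0.0

# Hypothetical box around the object of interest.
BOX_MIN = (-0.5, -0.5, -0.5)
BOX_MAX = (0.5, 0.5, 0.5)

# (density, position) pairs along a ray; the last one is a background floater.
samples = [(3.0, (0.0, 0.1, 0.2)), (1.5, (0.4, -0.4, 0.0)), (0.7, (1.2, 0.0, 0.0))]
masked = [mask_density(d, p, BOX_MIN, BOX_MAX) for d, p in samples]
print(masked)
```

Samples outside the box then contribute neither opacity nor radiance during ray marching, which removes floaters in the background without touching the object itself.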
This work was supported in part by ONR grants N000141712687, N000141912293, N000142012529, NSF grant 1617234, Adobe, the Ronald L. Graham Chair and the UC San Diego Center for Visual Computing.
- Reflectance modeling by neural texture synthesis. ACM Trans. Graph. 35 (4), pp. 65:1–65:13. Cited by: §2.
- Two-shot svbrdf capture for stationary materials. ACM Transactions on Graphics 34 (4), pp. 110:1–110:13. Cited by: §2.
- Deep reflectance volumes: relightable reconstructions from multi-view photometric images. ECCV. Cited by: §1, §1, §2, §3.2, Figure 4, §6, §6, §6.
- Deep 3d capture: geometry and reflectance from sparse multi-view images. arXiv preprint arXiv:2003.12642. Cited by: §2.
- Point-based multi-view stereo network. In ICCV, Cited by: §2.
- Deep stereo using adaptive thin volume representation with uncertainty awareness. arXiv preprint arXiv:1911.12012. Cited by: §2.
- Acquiring the reflectance field of a human face. In SIGGRAPH, pp. 145–156. Cited by: §2.
- Single-image SVBRDF capture with a rendering-aware deep network. ACM Transactions on Graphics 37 (4), pp. 128. Cited by: §1, §2.
- Silhouette and stereo fusion for 3d object modeling. Computer Vision and Image Understanding 96 (3), pp. 367–392. Cited by: §2.
- A gonioreflectometer for measuring the bidirectional reflectance of material for use in illumination computation. Ph.D. Thesis, Citeseer. Cited by: §2.
- Accurate, dense, and robust multiview stereopsis. IEEE TPAMI 32 (8), pp. 1362–1376. Cited by: §2.
- Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–15. Cited by: §2.
- Reflectance capture using univariate sampling of brdfs. In ICCV, pp. 5362–5370. Cited by: §2.
- Mitsuba renderer. Note: http://www.mitsuba-renderer.org Cited by: §1, Figure 9, §6.
- SurfaceNet: an end-to-end 3D neural network for multiview stereopsis. In ICCV, pp. 2307–2315. Cited by: §2.
- Rendering fur with three dimensional textures. ACM Siggraph Computer Graphics 23 (3), pp. 271–280. Cited by: §3.2, §4.1, §5, Figure 6, §6.
- Efficient reflectance capture using an autoencoder. SIGGRAPH 37 (4), pp. 127–1. Cited by: §2.
- Learning efficient illumination multiplexing for joint capture of reflectance and shape. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–12. Cited by: §2.
- Real shading in unreal engine 4. SIGGRAPH 2013 Course. Cited by: §3.2.
- A model for volume lighting and modeling. IEEE transactions on visualization and computer graphics 9 (2), pp. 150–162. Cited by: §1, §3.1, §3.2.
- A theory of shape by space carving. International journal of computer vision 38 (3), pp. 199–218. Cited by: §2.
- Learning generative models for rendering specular microgeometry. ACM Transactions on Graphics (SIGGRAPH Asia 2019) 38 (6), pp. 225. Cited by: §2.
- Light field rendering. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pp. 31–42. Cited by: §2.
- Materials for masses: SVBRDF acquisition with a single mobile phone image. In ECCV, pp. 72–87. Cited by: §1, §2.
- Learning to reconstruct shape and spatially-varying reflectance from a single image. In SIGGRAPH Asia 2018, pp. 269. Cited by: §2.
- Deep shadow maps. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 385–392. Cited by: §4.3, §4.3.
- Neural volumes: learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG) 38 (4), pp. 65. Cited by: §1, §1, §2, §2, §2, §3.1, §3, §6, §6.
- A data-driven reflectance model. SIGGRAPH 22 (3), pp. 759–769. Cited by: §2.
- Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics 1 (2), pp. 99–108. Cited by: §1, §3.1, §3.1, §3.2.
- Deep reflectance fields: high-quality facial reflectance field inference from color gradient illumination. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–12. Cited by: §6.
- Occupancy networks: learning 3d reconstruction in function space. arXiv preprint arXiv:1812.03828. Cited by: §2.
- Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–14. Cited by: §2, §6.
- NeRF: representing scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934. Cited by: §1, §1, §2, §2, §2, §3.1, §3.1, §3, §4.1, §4.1, §4.2, §6, §6.
- Practical SVBRDF acquisition of 3D objects with unstructured flash photography. In SIGGRAPH Asia 2018, pp. 267. Cited by: §1, §1, §1, §2, §2, Figure 4, §6, §6, §6.
- On optimal, minimal brdf sampling for reflectance acquisition. ACM Transactions on Graphics (TOG) 34 (6), pp. 1–11. Cited by: §2.
- Monte carlo methods for volumetric light transport simulation. In Computer Graphics Forum, Vol. 37, pp. 551–576. Cited by: §1, §3.1.
- Multi-view relighting using a geometry-aware network. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–14. Cited by: §2.
- Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §2.
- On the spectral bias of neural networks. arXiv preprint arXiv:1806.08734. Cited by: §4.1.
- Neural btf compression and interpolation. In Computer Graphics Forum, Vol. 38, pp. 235–244. Cited by: §2.
- Image based relighting using neural networks. ACM Transactions on Graphics 34 (4), pp. 1–12. Cited by: §2, §6.
- Matryoshka networks: predicting 3d geometry via nested shape layers. In CVPR, pp. 1936–1944. Cited by: §2.
- Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.
- Pixelwise view selection for unstructured multi-view stereo. In ECCV, Cited by: §2.
- Deepvoxels: learning persistent 3d feature embeddings. In CVPR, pp. 2437–2446. Cited by: §2.
- Scene representation networks: continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, pp. 1119–1130. Cited by: §1, §2, §2.
- Pushing the boundaries of view extrapolation with multiplane images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 175–184. Cited by: §2.
- Learning to synthesize a 4d rgbd light field from a single image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2243–2251. Cited by: §2.
- Perspective shadow maps. In SIGGRAPH, pp. 557–562. Cited by: §4.3.
- Single image portrait relighting. SIGGRAPH. Cited by: §2, §2, §6.
- State of the Art on Neural Rendering. Computer Graphics Forum (EG STAR 2020). Cited by: §1.
- Deferred neural rendering: image synthesis using neural textures. ACM Transactions on Graphics 38 (4), pp. 1–12. Cited by: §2.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.1.
- A learned shape-adaptive subsurface scattering model. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–15. Cited by: §2.
- Microfacet models for refraction through rough surfaces. EGSR 07. Cited by: §4.1, §5.
- Casting curved shadows on curved surfaces. In SIGGRAPH, Vol. 12, pp. 270–274. Cited by: §4.3.
- Deep view synthesis from sparse photometric images. SIGGRAPH 38 (4), pp. 76. Cited by: §2, §6.
- Minimal brdf sampling for two-shot near-field reflectance acquisition. ACM Transactions on Graphics 35 (6), pp. 188. Cited by: §2.
- Deep image-based relighting from optimal sparse samples. SIGGRAPH 37 (4), pp. 126. Cited by: §1, §2, §2, §6, §6.
- MVSNet: depth inference for unstructured multi-view stereo. In ECCV, pp. 767–783. Cited by: §2.
- Recurrent mvsnet for high-resolution multi-view stereo depth inference. In CVPR, pp. 5525–5534. Cited by: §2.
- Deep single-image portrait relighting. In CVPR, pp. 7194–7202. Cited by: §2, §6.
- Stereo magnification: learning view synthesis using multiplane images. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–12. Cited by: §2, §2.
- Multi-view photometric stereo with spatially varying isotropic materials. In CVPR, pp. 1482–1489. Cited by: §1.