The glint of light off an object reveals much about its shape and composition – whether it’s wet or dry, rough or polished, round or flat. Yet, hidden in the pattern of highlights is also an image of the environment, often so distorted that we don’t even realize it’s there. Remarkably, images of the shiny bag of chips (Fig. 1) contain sufficient clues to reconstruct a detailed image of the room, including the layout of lights, windows, and even objects outside that are visible through windows.
In their visual microphone work, Davis et al. [Davis14] showed how sound and even conversations can be reconstructed from the minute vibrations visible in a bag of chips. Inspired by their work, we show that the same bag of chips can be used to reconstruct the environment. Instead of high speed video, however, we operate on RGBD video, as obtained with commodity depth sensors.
Visualizing the environment is closely connected to the problem of modeling the scene that reflects that environment. We solve both problems; beyond visualizing the room, we seek to predict how the objects and scene appear from any new viewpoint — i.e., to virtually explore the scene as if you were there. This view synthesis problem is a classical challenge in computer vision and graphics with a large literature, but several open problems remain. Chief among them are 1) specular surfaces, 2) inter-reflections, and 3) simple capture. In this paper we address all three of these problems, based on the framework of surface light fields [wood2000surface].
Our environment reconstructions, which we call specular reflectance maps (SRMs), represent the distant environment map convolved with the object’s specular BRDF. In cases where the object exhibits strong mirror-like reflections, this SRM provides sharp, detailed features like the one seen in Fig. 1. As most scenes are composed of a mixture of materials, each scene has multiple basis SRMs. We therefore reconstruct a global set of SRMs, together with a weighted material segmentation of scene surfaces. Based on the recovered SRMs, together with additional physically motivated components, we build a neural rendering network capable of faithfully approximating the true surface light field.
A major contribution of our approach is the capability of reconstructing a surface light field with the same input needed to compute shape alone [newcombe2011kinectfusion] using an RGBD camera. Additional contributions of our approach include the ability to operate on regular (low-dynamic-range) imagery, and applicability to general, non-convex, textured scenes containing multiple objects and both diffuse and specular materials. Lastly, we release an RGBD dataset capturing reflective objects to facilitate research on lighting estimation and image-based rendering.
2 Related Work
We review related work in environment lighting estimation and novel-view synthesis approaches for modeling specular surfaces.
2.1 Environment Estimation
The most straightforward way to capture an environment map (image) is via light probes (e.g., a mirrored ball [debevecHDR]) or by taking photos with a 360° camera [park2018surface]. Human eyeballs [nishino04] can even serve as light probes when they are present. For many applications, however, light probes are not available and we must rely on existing cues in the scene itself.
Other methods instead study recovering lighting from a photo of a general scene. Because this problem is severely under-constrained, these methods often rely on human inputs [karsch2011rendering, zheng2012interactive] or manually designed “intrinsic image” priors on illumination, material, and surface properties [karsch2014automatic, barron2014shape, barron2012shape, bi20151, lombardi2012reflectance].
Recent developments in deep learning facilitate data-driven approaches for single-view estimation. [gardner2017learning, gardner2019deep, song2019neural, legendre2019deeplight] learn a mapping from a perspective image to a wider-angle panoramic image. Other methods train models specifically tailored for outdoor scenes [hold2017deep, hold2019deep]. Because the single-view problem is severely ill-posed, most results are plausible but often non-veridical. Closely related to our work, Georgoulis et al. [georgoulis2017around] reconstruct perceivable environment images from specular reflections, but their method makes very limiting assumptions, such as a single, floating object with textureless, painted surfaces, known geometry, and manual specification of materials and segmentation.
Previous approaches achieve environment estimation from multi-view inputs, oftentimes as a byproduct of solving for scene appearance models.
For the special case of planar reflectors, layer separation techniques [szeliski2000layer, sinha2012image, xue2015computational, han2017reflection, guo2014robust, jachnik2012real, zhang2018single] enable high-quality reconstructions of reflected environments, e.g., from video of a glass picture frame. Inferring reflections for general, curved surfaces is dramatically harder, even for humans, as the reflected content depends strongly and nonlinearly on surface shape and spatially-varying material properties.
A number of researchers have sought to recover low-frequency lighting from multiple images of curved objects. [zollhofer2015shading, or2015rgbd, maier2017intrinsic3d] infer spherical harmonics lighting (following [ramamoorthi2001signal]) to refine the surface geometry using principles of shape-from-shading. [richter2016instant] jointly optimizes low frequency lighting and BRDFs of a reconstructed scene. While suitable for approximating light source directions, these models don’t capture detailed images of the environment.
Wu et al. [wu2015simultaneous], like us, use a hand-held RGBD sensor to recover lighting and reflectance properties, but their method can only reconstruct a single, floating, convex object, and requires a black background. Dong et al. [dong2014appearance] produce high-quality environment images from a video of a single rotating object; their method assumes a laboratory setup with a mechanical rotator, and manual registration of an accurate geometry to the video. Similarly, Xia et al. [xia2016recovering] use a robotic arm with calibration patterns to rotate an object. The authors note that highly specular surfaces cause trouble, thus limiting their real object samples to mostly rough, glossy materials. In contrast, our method operates with a hand-held camera on a wide range of multi-object scenes, and is designed to support specularity.
2.2 Novel View Synthesis
Novel view synthesis (NVS) methods synthesize realistic scene renderings from new camera viewpoints. In this section we focus on NVS methods capable of modeling specular reflections. We refer to [szeliski2010computer, thies2019deferred] for a more extensive review of the broader field.
Light field methods [gortler1996lumigraph, levoy1996light, chen2002light, wood2000surface, davis2012unstructured] enable highly realistic views of specular surfaces at the expense of laborious scene capture from densely sampled viewpoints. Chen et al. [chen2018deep] regress a surface light field with neural networks to reduce the number of required views, but the system still needs samples across the hemisphere captured with a mechanical system. Although Park et al. [park2018surface] avoid dense hemispherical view sampling by applying a parametric BRDF model to represent the specular component, they assume known lighting.
Recent work applies convolutional neural networks (CNNs) to image-based rendering [flynn2016deepstereo, neuralrendering]. Hedman et al. [hedman2018deep] replaced the traditional view-blending heuristics of IBR systems with CNN-learned blending weights. Still, novel views are composed of existing, captured pixels, so unobserved specular highlights cannot be synthesized. More recently, [aliev2019neural, thies2019deferred] enhance the traditional rendering pipeline by attaching learned features to 2D texture maps [thies2019deferred] or 3D point clouds [aliev2019neural] and achieve high-quality view synthesis results. The features are nonetheless specifically optimized to fit the input views and do not extrapolate well to novel views. Recent learning-based methods achieve impressive local (versus hemispherical) light field reconstruction from a small set of images [mildenhall2019local, srinivasan2017learning, choi2019extreme, kalantari2016learning, zhou2018stereo].
BRDF Estimation Methods
Another way to synthesize novel views is to recover intrinsic surface reflection functions, known as BRDFs [nicodemus1965directional]. In general, recovering the surface BRDFs is a difficult task, as it involves inverting the complex light transport process. Consequently, existing reflectance capture methods place limits on operating range: e.g. isolated single object [wu2015simultaneous, dong2014appearance], known or controlled lighting [park2018surface, debevec1996modeling, lensch2003image, zhou2016sparse, xu2019deep], single view surface (versus a full 3D mesh) [goldman2010shape, li2018learning], flash photography [aittala2015two, lee2018practical, nam2018practical], or spatially constant material [meka2018lime, kim2017lightweight].
Very few view synthesis techniques support interreflections. Modeling a general multi-object scene requires solving for global illumination (e.g., shadows and interreflections), which has been shown to be difficult and sensitive to imperfections in real-world inputs [azinovic2019inverse]. Similarly, Lombardi et al. [lombardi2016radiometric] model multi-bounce lighting but with noticeable artifacts, and limit their results to mostly uniformly textured objects. Zhang et al. [zhang2016emptying] require manual annotations of light types and locations.
3 Method Overview
Our system takes a video and 3D mesh of a static scene (obtained via Newcombe et al. [newcombe2011kinectfusion]) as input and automatically recovers an image of the environment along with a scene appearance model that enables novel view synthesis. Our approach excels in specular scenes, and accounts for both specular interreflection and Fresnel effects. A key advantage of our approach is the use of easy, casual data capture from a hand-held camera; we reconstruct the environment map and a surface light field with the same input data needed to reconstruct the geometry alone, e.g. using [newcombe2011kinectfusion].
In Section 4, we provide a review of the formulation of the surface light field [wood2000surface] and define the specular reflectance map (SRM). Then, in Section 5, we show that given geometry and diffuse texture as input, we can jointly recover SRMs and material segmentation through an end-to-end optimization approach. Lastly, in Section 6, we describe a scene-specific neural rendering network that combines recovered SRMs and other rendering components to synthesize realistic novel-view images, with interreflections and Fresnel effects.
4 Surface Light Field Formulation
We model scene appearance using the concept of a surface light field [wood2000surface], which defines the color radiance of a surface point in every view direction, given approximate geometry G obtained via KinectFusion [newcombe2011kinectfusion].
Formally, the surface light field, denoted L, assigns an RGB radiance value to a ray leaving surface point x in outgoing direction ω_o: L(x, ω_o). As is common in computer graphics [phong1975illumination, ward1992measuring], we decompose the surface light field into diffuse (view-independent) and specular (view-dependent) components:

    L(x, ω_o) = D(x) + S(x, ω_o).    (1)
We compute the diffuse texture D(x) for each surface point as the minimum intensity of L(x, ·) across the different input views, following [szeliski2000layer, park2018surface]. Because the diffuse component is view-independent, we can then render it from arbitrary viewpoints using the estimated geometry. However, textured 3D reconstructions typically contain errors (e.g., silhouettes are enlarged, as in Fig. 2), so we refine the rendered texture image using a neural network (Sec. 5).
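The min-composite above can be sketched as follows (a toy example; the function name is ours, and the actual system projects each mesh vertex into every input frame):

```python
import numpy as np

def diffuse_from_views(samples):
    """Estimate the diffuse (view-independent) color of one surface point.

    samples: (V, 3) array of RGB radiance observed for the same surface
             point from V different viewpoints. Specular highlights only
             add light on top of the diffuse base, so the per-channel
             minimum across views approximates the diffuse component.
    """
    return np.min(np.asarray(samples, dtype=float), axis=0)

# A point seen from 3 views; view 1 catches a bright highlight.
obs = np.array([[0.30, 0.20, 0.10],
                [0.90, 0.85, 0.80],   # highlight adds light to every channel
                [0.32, 0.21, 0.12]])
diffuse = diffuse_from_views(obs)
```

Taking the minimum rather than the mean makes the estimate robust to sparse, bright specular events, at the cost of a slight bias toward the darkest observation.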
For the specular component, we define the specular reflectance map (SRM), also known as a lumisphere [wood2000surface], denoted E, as a function that maps a reflection ray direction ω_r (defined as the vector reflection of ω_o about the surface normal [wood2000surface]) to specular reflectance (i.e., radiance): E(ω_r), where ω_r lies on a unit hemisphere around the scene center. This model assumes distant environment illumination, although we add support for specular interreflection later in Sec. 6.1. Note that this model is closely related to the prefiltered environment maps [kautz2000unified] used in the graphics community for real-time rendering of specular highlights.
Given a specular reflectance map E, we can render the specular image from a virtual camera as follows:

    S(x, ω_o) = V(x, ω_r) E(ω_r),    (2)

where V is a shadow (visibility) term that is 0 when the reflected ray from x intersects the known geometry G, and 1 otherwise.
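A minimal sketch of this reflect-and-look-up shading step (the SRM here is a stand-in callable, and the names are ours):

```python
import numpy as np

def reflect(omega_o, n):
    """Reflect the (unit) outgoing view direction about the (unit) normal."""
    return 2.0 * np.dot(n, omega_o) * n - omega_o

def render_specular(omega_o, n, srm, visible):
    """Eq.-(2)-style shading: S = V * E(omega_r).

    srm is any callable mapping a unit direction to RGB radiance;
    visible is the shadow term (0 if the reflected ray is blocked, else 1).
    """
    omega_r = reflect(omega_o, n)
    return visible * srm(omega_r)

n = np.array([0.0, 0.0, 1.0])
omega_o = np.array([0.0, 0.0, 1.0])   # looking straight down the normal
srm = lambda d: np.array([1.0, 0.5, 0.25]) * max(d[2], 0.0)
color = render_specular(omega_o, n, srm, visible=1.0)
```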
An SRM contains distant environment lighting convolved with a particular specular BRDF. As a result, a single SRM can only accurately describe one surface material. In order to generalize to multiple (and spatially varying) materials, we modify Eq. (2) by assuming the material at point x is a linear combination of M basis materials [goldman2010shape, alldrin2008photometric, zickler2005reflectance]:

    S(x, ω_o) = V(x, ω_r) Σ_m w_m(x) E_m(ω_r),    (3)

where Σ_m w_m(x) = 1, and the number of bases M is user-specified. For each surface point x, w_m(x) defines the weight of material basis m. We use a neural network to approximate these weights in image-space, as described in the next section.
5 Estimating SRMs and Material Segmentation
Given scene shape and photos from known viewpoints as input, we now describe how to recover an optimal set of SRMs and material weights.
Suppose we want to predict a view of the scene from camera c at pixels p, given known SRMs and material weights. We render the known diffuse component D_c from the diffuse texture, and a blending weight map W_m for each SRM, using standard rasterization. A reflection direction image R_c is obtained by computing per-pixel ω_r values. We then compute the specular component image S_c by looking up the reflected ray directions in each SRM, and then combining the radiance values using W_m:

    S_c(p) = V_c(p) Σ_m W_m(p) E_m(R_c(p)),    (4)

where V_c(p) is the visibility term of pixel p as used in Eq. (3). Each E_m is stored as a 2D panorama image of resolution 500 × 250 in spherical coordinates.
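The per-pixel lookup can be sketched as follows, assuming an equirectangular direction-to-pixel convention (the paper only states that each SRM is a 500 × 250 spherical-coordinate panorama, so the exact mapping below is our assumption):

```python
import numpy as np

W, H = 500, 250   # panorama resolution used for each SRM

def dir_to_pixel(d):
    """Map a unit direction to (col, row) in an equirectangular panorama.

    Longitude maps to column, polar angle from +z maps to row; this
    convention is an assumption for illustration.
    """
    x, y, z = d
    u = (np.arctan2(y, x) / (2 * np.pi) + 0.5) * (W - 1)
    v = (np.arccos(np.clip(z, -1.0, 1.0)) / np.pi) * (H - 1)
    return int(round(u)), int(round(v))

def specular_pixel(srms, weights, d, visible=1.0):
    """Eq.-(4)-style lookup: blend M basis SRMs with per-pixel weights."""
    col, row = dir_to_pixel(d)
    radiance = sum(w * srm[row, col] for w, srm in zip(weights, srms))
    return visible * radiance

# Two constant basis SRMs: a bright white one and a dim gray one.
srm_a = np.full((H, W, 3), 1.0)
srm_b = np.full((H, W, 3), 0.2)
color = specular_pixel([srm_a, srm_b], [0.75, 0.25], np.array([0.0, 0.0, 1.0]))
```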
Now, suppose that the SRMs and material weights are unknown; the optimal SRMs and combination weights minimize the energy defined as the sum of differences between the real photos I_c and the rendered composites of diffuse and specular images over all input frames c:

    argmin Σ_c ρ(I_c, D_c + S_c),    (5)

where ρ is a pixel-wise loss.
While Eq. (5) could be minimized directly to obtain the SRMs and material weights, in practice, there are several limiting factors. First, specular highlights tend to be sparse, covering a small percentage of specular scene surfaces. Points on specular surfaces that do not see a highlight are difficult to differentiate from diffuse surface points, making the problem of assigning material weights to surface points severely under-constrained. In addition, captured geometry is seldom perfect, and misalignments in the reconstructed diffuse texture can result in incorrect SRMs. In the remainder of this section, we describe our approach to overcoming each of these limiting factors.
Material weight network.
First, to address the problem of material ambiguity, we pose the material assignment problem as a statistical pattern recognition task. We compute the 2D weight maps W_m with a convolutional neural network that learns to map a diffuse texture image patch to the blending weight of the m-th material. This network learns correlations between diffuse texture and material properties (e.g., shininess), and is trained on each scene by jointly optimizing the network weights and SRMs to reproduce the input images.
Since the network predicts material weights in image-space, and therefore per view, we introduce a view-consistency regularization penalizing the pixel-wise difference in the predicted materials between a pair of views when cross-projected onto each other (i.e., one image is warped to the other using the known geometry and pose).
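The cross-projection used by this regularizer can be illustrated with a standard pinhole model (the names and the camera-to-world pose convention are our assumptions):

```python
import numpy as np

def cross_project(p, depth, K, pose_a, pose_b):
    """Warp pixel p = (u, v) from view A into view B using known depth/poses.

    K: 3x3 intrinsics (shared); pose_a, pose_b: 4x4 camera-to-world matrices.
    Returns the corresponding pixel in view B, at which the two per-view
    material (or refined-diffuse) predictions can be compared.
    """
    u, v = p
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    X_cam_a = ray * depth                        # 3D point in A's camera frame
    X_world = pose_a @ np.append(X_cam_a, 1.0)   # lift to world coordinates
    X_cam_b = np.linalg.inv(pose_b) @ X_world    # move into B's camera frame
    proj = K @ X_cam_b[:3]
    return proj[:2] / proj[2]

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
I4 = np.eye(4)
# Sanity check: identical poses map a pixel onto itself.
p_b = cross_project((100, 120), depth=2.0, K=K, pose_a=I4, pose_b=I4)
```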
Diffuse refinement network.
Small errors in geometry and calibration, as are typical in scanned models, cause misalignment and ghosting artifacts in the reconstructed texture. Therefore, we introduce a refinement network to correct these errors (Fig. 2), replacing the rendered diffuse image D_c with a refined texture image. Similar to the material weights, we penalize the inconsistency of the refined diffuse images across viewpoints with an analogous view-consistency term. Both the material weight network and the refinement network follow an encoder-decoder architecture with residual connections [johnson2016perceptual, he2016deep], with the material weight network using fewer parameters. We refer readers to the supplementary for more details.
In order to get fine details in the refined diffuse image, it is necessary to use more than a simple pixel-wise loss. Therefore, we define the image distance metric ρ as a combination of a pixel-wise L1 loss, a perceptual loss computed from feature activations of a pretrained network [chen2017photographic], and an adversarial loss [goodfellow2014generative, isola2017image]. Our total loss, for a pair of images (x, y), is:

    ρ(x, y) = λ_1 ‖x − y‖_1 + λ_p ℓ_percep(x, y) + λ_a ℓ_adv(x),    (6)

where ℓ_adv is computed with a discriminator, and λ_1, λ_p, and λ_a are balancing coefficients, set to 0.01, 1.0, and 0.05, respectively. The neural network-based perceptual and adversarial losses are effective because they are robust to image-space misalignments caused by errors in the estimated geometry and poses.
Finally, we add a sparsity term ‖S_c‖_1 on the specular image to keep the specular component from absorbing colors that belong in the diffuse texture. Combining all elements, we get the final loss function:

    Σ_c [ ρ(I_c, D_c + S_c) + λ_sp ‖S_c‖_1 + λ_vc ℓ_view(c, c′) ],    (7)

where ℓ_view collects the material and diffuse view-consistency terms, c′ is a randomly chosen frame in the same batch with c during each stochastic gradient descent step, and λ_sp and λ_vc are set to 1e-4. An overview diagram is shown in Fig. 3. Fig. 5 shows that the optimization discovers coherent material regions and a perceivable environment image.
6 Novel-View Neural Rendering
With the reconstructed SRMs and material weights, we can synthesize specular appearance from any desired viewpoint via Eq. (2). However, while the approach detailed in Sec. 5 reconstructs high-quality SRMs, the renderings often lack realism (shown in the supplementary), due to two factors. First, errors in geometry and camera pose can sometimes lead to weaker reconstructed highlights. Second, the SRMs do not model more complex light transport effects such as interreflections or Fresnel reflection. This section describes how we train a network to address these two limitations, yielding more realistic results.
Simulations only go so far, and computer renderings will never be perfect. In principle, you could train a CNN to render images as a function of viewpoint directly, training on actual photos. Indeed, several recent neural rendering methods adapt image translation [isola2017image] to learn mappings from projected point clouds [neuralrendering, pittaluga2019revealing, aliev2019neural] or a UV map image [thies2019deferred] to a photo. However, these methods struggle to extrapolate far away from the input views because their networks have to figure out the physics of specular highlights from scratch (see Sec. 8.2).
Rather than treat the rendering problem as a black box, we arm the neural renderer with knowledge of physics – in particular, diffuse, specular, interreflection, and Fresnel reflection – to use in learning how to render images. Formally, we introduce an adversarial neural network-based generator and discriminator to render realistic photos. The generator takes as input our best prediction of the diffuse and specular components for the current view (obtained from the optimization of Eq. (7)), along with interreflection and Fresnel terms (a first-bounce image, a reflection direction image, and a Fresnel coefficient image) that will be defined later in this section.
Consequently, the generator receives these rendering components as input and outputs a prediction of the view, while the discriminator scores its realism. We use the combination of the pixel-wise L1 loss, the perceptual loss [chen2017photographic], and the adversarial loss [isola2017image] described in Sec. 5:

    ℓ_render = ℓ_1 + ℓ_percep + ℓ_adv,    (8)

where ℓ_percep is the mean of the perceptual loss across all input images, and ℓ_1 and ℓ_adv are similarly defined as average losses across frames. Note that this renderer is scene-specific, trained only on images of a particular scene to extrapolate new views of that same scene, as commonly done in the neural rendering community [neuralrendering, thies2019deferred, aliev2019neural].
6.1 Modeling Interreflections and Fresnel Effects
Eq. (2) models only the direct illumination of each surface point by the environment, neglecting interreflections. While modeling full, global, diffuse + specular light transport is intractable, we can approximate first-order interreflections by ray-tracing a first-bounce image (FBI) as follows. For each pixel p in the virtual viewpoint to be rendered, cast a ray from the camera center through p. If we pretend for now that every scene surface is a perfect mirror, that ray will bounce, potentially multiple times, and intersect multiple surfaces. Let x′ be the second point of intersection of that ray with the scene. We render the pixel at p in the FBI with the diffuse color of x′, or with black if there is no second intersection (Fig. 4d).
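The FBI construction can be illustrated with a toy analytic scene of spheres (a sketch under our own naming, not the paper's mesh-based ray tracer):

```python
import numpy as np

def hit_sphere(o, d, center, radius):
    """Return the nearest positive ray parameter t, or None if missed."""
    oc = o - center
    b = np.dot(oc, d)
    disc = b * b - (np.dot(oc, oc) - radius * radius)
    if disc < 0:
        return None
    t = -b - np.sqrt(disc)
    return t if t > 1e-6 else None

def first_bounce_color(o, d, spheres):
    """FBI shading for one pixel: mirror-reflect at the first hit and
    return the diffuse color at the second intersection (black if none).
    spheres: list of (center, radius, diffuse_rgb) tuples."""
    hits = [(hit_sphere(o, d, c, r), c, rgb) for c, r, rgb in spheres]
    hits = [(t, c, rgb) for t, c, rgb in hits if t is not None]
    if not hits:
        return np.zeros(3)
    t1, c1, _ = min(hits, key=lambda h: h[0])
    x1 = o + t1 * d
    n = (x1 - c1) / np.linalg.norm(x1 - c1)
    d2 = d - 2.0 * np.dot(d, n) * n              # perfect mirror bounce
    hits2 = [(hit_sphere(x1, d2, c, r), rgb) for c, r, rgb in spheres]
    hits2 = [(t, rgb) for t, rgb in hits2 if t is not None]
    return min(hits2, key=lambda h: h[0])[1] if hits2 else np.zeros(3)

red, green = np.array([1.0, 0, 0]), np.array([0, 1.0, 0])
scene = [(np.array([0.0, 0, 5]), 1.0, red),      # sphere seen by the camera
         (np.array([0.0, 0, -3]), 1.0, green)]   # sphere behind the camera
col = first_bounce_color(np.zeros(3), np.array([0.0, 0, 1]), scene)
```

The camera ray hits the red sphere head-on, bounces straight back, and the FBI pixel picks up the green sphere's diffuse color.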
Glossy (imperfect-mirror) interreflections can be modeled by convolving the FBI with the specular BRDF. Strictly speaking, however, the interreflected image should be filtered in the angular domain [ramamoorthi2001signal], rather than in image space, i.e., as a convolution of incoming light over the specular lobe centered at the reflection ray direction ω_r. Given the reflection direction image R_c, angular-domain convolution can be approximated in image space by convolving the FBI with weights derived from R_c. However, because we do not know the specular kernel, we let the network infer the weights using R_c as a guide. We encode the ω_r for each pixel as a three-channel image (Fig. 4e).
Fresnel effects make highlights stronger at near-glancing view angles and are important for realistic rendering. Fresnel coefficients are approximated following Schlick [schlick1994inexpensive]: F = F_0 + (1 − F_0)(1 − cos θ)^5, where θ is the angle between the surface normal and the camera ray, and F_0 is a material-specific constant. We compute a Fresnel coefficient image (FCI), where each pixel contains the view-dependent factor (1 − cos θ)^5, and provide it to the network as an additional input, shown in Fig. 4(f).
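Schlick's approximation is simple to state in code (a short sketch; the function name is ours):

```python
def schlick_fresnel(cos_theta, f0):
    """Schlick's approximation: F = F0 + (1 - F0) * (1 - cos_theta)^5,
    where cos_theta is the cosine of the angle between the surface normal
    and the view ray, and F0 is the normal-incidence reflectance."""
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

# Head-on view: reflectance equals the material constant F0.
head_on = schlick_fresnel(1.0, 0.04)
# Grazing view: reflectance approaches 1 regardless of F0.
grazing = schlick_fresnel(0.0, 0.04)
```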
In total, the rendering components now comprise five images: the diffuse and specular images, the FBI, the reflection direction image R_c, and the FCI. This set is then given as input to the neural network, and our network weights are optimized as in Eq. (8). Fig. 4 shows the effectiveness of the three additional rendering components in modeling interreflections.
6.2 Implementation Details
We follow [johnson2016perceptual] for the generator network architecture, while we use the PatchGAN discriminator [isola2017image] and employ the loss of LSGAN [mao2017least]. We use ADAM [kingma2014adam] with learning rate 2e-4 to optimize the objectives. Data augmentation was carried out by applying random rotation, translation, flipping, and scaling to each input and output pair, which was essential for viewpoint generalization. We refer readers to supplementary for comprehensive implementation details.
7 Dataset
We capture ten sequences of RGBD video with a hand-held Primesense depth camera, featuring a wide range of materials, lighting, objects, environments, and camera paths. We plan to release the dataset along with the camera parameters and reconstructed textured meshes. The length of each sequence ranges from 1500 to 3000 frames, which are split into train and test frames. Some of the sequences were captured such that the test views are very far from the training views, making them ideal for benchmarking the extrapolation abilities of novel-view synthesis methods. Moreover, many of the sequences come with ground truth HDR environment maps to facilitate future research on environment estimation. Further capture and data-processing details are covered in the supplementary.
8 Experiments
We conduct experiments to test our system’s ability to estimate images of the environment and synthesize novel viewpoints. We also perform ablation studies to characterize the factors that most contribute to system performance.
We compare our approach to several state-of-the-art methods: recent single-view lighting estimation methods (DeepLight [LeGendre19], Gardner et al. [gardner2017learning]), an RGBD video-based lighting and material reconstruction method (Lombardi et al. [lombardi2016radiometric]), an IR-based BRDF estimation method (Park et al. [park2018surface]), and two leading view synthesis methods capable of handling specular highlights – DeepBlending [hedman2018deep] and Deferred Neural Rendering (DNR) [thies2019deferred]. Note that these methods show state-of-the-art performance in their respective tasks, so we omit comparisons that are already included in their reports: e.g., DeepBlending thoroughly compares with image-based rendering methods [cayon2015bayesian, penner2017soft, buehler2001unstructured, eisemann2008floating, hedman2016scalable].
8.1 Environment Estimation
Our computed SRMs demonstrate our system’s ability to infer detailed images of the environment from the pattern and motion of specular highlights on an object. For example, from Fig. 5(b), we can see the general layout of the living room, and even count the number of floors in buildings visible through the window. Note that the person capturing the video does not appear in the environment map because he is constantly moving. The shadow of the person, however, can cause artifacts – e.g., the fluorescent lighting in the first row of Fig. 5 is discontinuous.
Compared to the state-of-the-art single view estimation methods [legendre2019deeplight, gardner2017learning], our method produces a more accurate image of the environment, as shown in Fig. 6. Note our reconstruction shows a person standing near the window and autumn colors in a tree visible through the window.
We compare with the multi-view RGBD-based method of Lombardi et al. [lombardi2016radiometric] on a synthetic scene containing a blob, which we obtained from the authors. As in [lombardi2016radiometric], we estimate lighting from the known geometry (with added noise) and a rendered video of the scene. The results show that our method produces a more accurate estimate than the analytical BRDF method of Lombardi et al. [lombardi2016radiometric] (Fig. 6).
8.2 Novel-View Synthesis
We recover specular reflectance maps and train a generative network for each video sequence. The trained model is then used to generate novel views from held-out views.
In the supplementary, we show novel view generation results for different scenes, along with the intermediate rendering components and ground truth images. As view synthesis results are better shown in video form, we strongly encourage readers to watch the supplementary video.
While view extrapolation is key for many applications, it has been particularly challenging for scenes with reflections. To test the operating range of our method and other recent view synthesis methods, we study how the quality of view prediction degrades as a function of the distance to the nearest input images (measured as the difference in viewing angles) (Fig. 8). Prediction quality is measured with the neural network-based perceptual loss [zhang2018unreasonable], which is known to be more robust to shifts and misalignments, against the ground truth test image taken from the same pose. We use two video sequences, both containing highly reflective surfaces and taken with an intentionally large difference between train and test viewpoints. In order to measure the quality of extrapolation, we focus our attention on the parts of the scene that exhibit significant view-dependent effects. That is, we mask out the diffuse backgrounds and measure the loss on only the central objects of the scene. We compare our method with DeepBlending [hedman2018deep] and Thies et al. [thies2019deferred]. The quantitative (Fig. 8) and qualitative (Fig. 7) results show that our method produces more accurate images of the scene from extrapolated viewpoints.
9 Limitations and Future work
Our approach relies on the reconstructed mesh obtained from fusing depth images of consumer-level depth cameras and thus fails for surfaces out of the operating range of these cameras, e.g., thin, transparent, or mirror surfaces. Currently, the recovered environment captures the lighting filtered by the surface BRDF; separating these two factors is an interesting topic of future work, perhaps via data-driven deconvolution. Last, reconstructing a room-scale photorealistic appearance model remains a major challenge.
Appendix A Overview
In this document we provide additional experimental results and extended technical details to supplement the main submission. We first discuss how the output of the system is affected by changes in the loss functions (Sec. B), scene surface characteristics (surface roughness) (Sec. C), and the number of material bases (Sec. D). We then showcase our system’s ability to model the Fresnel effect (Sec. E), and compare our method against a recent BRDF estimation approach (Sec. F). In Sections G and H, we explain the data capture process and provide additional implementation details. Finally, we describe our supplementary video (Sec. I) and show additional novel-view synthesis results along with their intermediate rendering components (Sec. J).
Appendix B Effects of Loss Functions
In this section, we study how the choice of loss functions affects the quality of environment estimation and novel view synthesis. Specifically, we consider three loss functions between prediction and reference images, as introduced in the main paper: (i) pixel-wise loss, (ii) neural network-based perceptual loss, and (iii) adversarial loss. We run each of our algorithms (environment estimation and novel-view synthesis) for three cases: using (i) only, (i+ii) only, and all loss functions combined (i+ii+iii). For both algorithms we provide visual comparisons for each set of loss functions in Figures 11 and 12.
B.1 Environment Estimation
We run our joint optimization of SRMs and material weights to recover a visualization of the environment using the set of loss functions described above. As shown in Fig. 12, the pixel-wise L1 loss was unable to effectively penalize the view prediction error because it is very sensitive to misalignments due to noisy geometry and camera pose. While the addition of perceptual loss produces better results, one can observe muted specular highlights in the very bright regions. The adversarial loss, in addition to the two other losses, effectively deals with the input errors while simultaneously correctly capturing the light sources.
B.2 Novel-View Synthesis
We similarly train the novel-view neural rendering network of Sec. 6 using the aforementioned loss functions. The results in Fig. 11 show that while the L1 loss fails to capture specularity when significant image misalignments exist, the addition of the perceptual loss somewhat addresses the issue. As expected, using the adversarial loss, along with all other losses, allows the neural network to fully capture the intensity of specular highlights.
Appendix C Effects of Surface Roughness
As described in the main paper, our recovered specular reflectance map is the environment lighting convolved with the surface’s specular BRDF. Thus, the quality of the estimated SRM should depend on the roughness of the surface; e.g., a near-Lambertian surface would not provide significant information about its surroundings. To test this claim, we run the SRM estimation algorithm on a synthetic object with varying levels of specular roughness. Specifically, we vary the roughness parameter of the GGX shading model [walter2007microfacet] from 0.01 to 1.0, where smaller values correspond to more mirror-like surfaces. We render images of the synthetic object, and provide those rendered images, as well as the geometry (with added noise in both scale and vertex displacements, to simulate a real scanning scenario), to our algorithm. The results show that the accuracy of environment estimation decreases as the object surface gets rougher, as expected (Fig. 16). Note that although increasing surface roughness does reduce the amount of detail in our estimated environments, this is expected; the recovered SRM still faithfully reproduces the convolved lighting (Fig. 15).
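For reference, the GGX normal distribution whose roughness parameter is varied in this sweep is (a sketch; the function name is ours):

```python
import math

def ggx_ndf(cos_h, roughness):
    """GGX microfacet normal distribution [walter2007microfacet]:
    D = a^2 / (pi * ((n.h)^2 * (a^2 - 1) + 1)^2), with a = roughness.
    cos_h is the cosine between the surface normal and the half vector."""
    a2 = roughness * roughness
    denom = cos_h * cos_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)

# At normal incidence the peak is 1 / (pi * a^2): small roughness gives a
# sharp, mirror-like lobe; roughness 1.0 gives a broad, near-diffuse one.
sharp = ggx_ndf(1.0, 0.01)
broad = ggx_ndf(1.0, 1.0)
```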
Appendix D Effects of Number of Material Bases
The joint SRM and segmentation optimization of the main paper requires the user to set the number of material bases. In this section, we study how the algorithm is affected by this user-specified number. Specifically, for a scene containing two cans, we run our algorithm twice, with the number of material bases set to two and three, respectively. The results of the experiment in Figure 13 suggest that the number of material bases does not have a significant effect on the output of our system.
Appendix E Fresnel Effect Example
The Fresnel effect, whereby specular highlights become stronger at near-glancing view angles, is an important visual effect in computer graphics. We show in Fig. 14 that our neural rendering system correctly models the Fresnel effect. In the supplementary video, we show the Fresnel effect in motion, along with comparisons to the ground-truth sequences.
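For reference, the angular dependence our renderer must reproduce is commonly summarized by Schlick's approximation to the Fresnel reflectance (a standard graphics formula sketched here with numpy; our network learns the effect from data rather than evaluating this expression, and the F0 value is an illustrative assumption for a dielectric):

```python
import numpy as np

# Schlick's approximation: F(theta) = F0 + (1 - F0) * (1 - cos(theta))^5,
# where F0 is the reflectance at normal incidence (~0.04 for many dielectrics).
def fresnel_schlick(cos_theta, f0=0.04):
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

head_on  = fresnel_schlick(1.0)                       # viewing along the normal
glancing = fresnel_schlick(np.cos(np.radians(85.0)))  # near-glancing view
```

Reflectance rises from roughly 4% head-on to well over 50% near grazing, which is exactly the brightening at glancing angles visible in Fig. 14.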
Appendix F Comparison to BRDF Fitting
Recovering a parametric analytical BRDF is a popular strategy for modeling view-dependent effects. We thus compare our neural network-based novel-view synthesis approach against the recent BRDF fitting method of [park2018surface], which uses an IR laser and camera to optimize for the surface's specular BRDF parameters. As shown in Fig. 17, fitting sharp specular BRDFs is prone to failure when there are calibration errors or misalignments in the geometry.
Appendix G Data Capture Details
As described in Sec. 7 of the main paper, we capture ten videos of objects with varying materials, lighting, and compositions. We used a Primesense Carmine RGBD structured-light camera. We perform intrinsic and radiometric calibrations and correct the images for vignetting. During capture, the color and depth streams were hardware-synchronized and registered to the color camera's frame of reference. The resolution of both streams is VGA (640x480) and the frame rate was set to 30fps. Camera exposure was manually set and fixed within a scene.
We obtained camera extrinsics by running ORB-SLAM [mur2017orb] (ICP [newcombe2011kinectfusion] was used instead for feature-poor scenes). Using the estimated poses, we ran volumetric fusion [newcombe2011kinectfusion] to obtain the geometry reconstruction. Once the geometry and rough camera poses were estimated, we ran frame-to-model dense photometric alignment following [park2018surface] to obtain more accurate camera positions, which are subsequently used to fuse the diffuse texture into the geometry. Following [park2018surface], we use iteratively reweighted least squares to compute a robust minimum of intensity for each surface point across viewpoints, which provides a good approximation to the diffuse texture.
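The intuition behind the robust minimum is that specular highlights appear only in some viewpoints, so observations brighter than the current estimate should be downweighted. The sketch below realizes this with one possible asymmetric IRLS weighting (an assumption for illustration; the exact weight function of [park2018surface] may differ):

```python
import numpy as np

# Sketch of a robust low-intensity estimate via iteratively reweighted
# least squares (IRLS). Observations brighter than the running estimate
# (likely specular highlights) are progressively downweighted, while
# darker observations keep full weight. eps and iters are illustrative.
def robust_diffuse(intensities, eps=0.05, iters=20):
    d = intensities.mean()
    for _ in range(iters):
        r = intensities - d
        w = np.where(r > 0, eps / (eps + r), 1.0)
        d = np.sum(w * intensities) / np.sum(w)
    return d

# One surface point seen from five viewpoints: diffuse value ~0.2,
# plus two views that caught a specular highlight.
obs = np.array([0.20, 0.21, 0.19, 0.80, 0.95])
d = robust_diffuse(obs)  # lands near 0.2, largely ignoring the highlights
```

Unlike a plain mean (here 0.47), the reweighted estimate stays near the diffuse value, which is what makes the fused texture usable as the diffuse component.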
Appendix H Implementation Details
Our pipeline is built using PyTorch [paszke2017automatic]. For all of our experiments we used the Adam optimizer with a learning rate of 2e-4 for the neural networks and 1e-3 for the SRM pixels. For the SRM optimization described in Sec. 5 of the main text, training was run for 40 epochs (i.e., each training frame is processed 40 times), while the neural renderer was trained for 75 epochs.
We find that data augmentation plays a significant role in the view generalization of our algorithm. For training in Sec. 5, we used random rotation (up to 180°), translation (up to 100 pixels), and horizontal and vertical flips. For neural renderer training in Sec. 6, we additionally scale the input images by a random factor between 0.8 and 1.25.
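A minimal numpy stand-in for the flip and translation augmentations is sketched below (the actual pipeline operates on PyTorch tensors; rotation and the boundary handling of the real translation are omitted here, with wrap-around shifts used purely for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the Sec. 5 augmentations: random horizontal/vertical flips and
# a random translation of up to max_shift pixels. np.roll wraps around at
# the borders, a simplification of the real padding behavior.
def augment(img, max_shift=100):
    if rng.random() < 0.5:
        img = img[:, ::-1]          # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]          # vertical flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(img, (dy, dx), axis=(0, 1))

frame = rng.random((480, 640, 3))   # one VGA RGB training frame
out = augment(frame)
```

Because flips and wrap-around shifts only permute pixels, the augmented frame keeps the same shape and total intensity, which makes the transform cheap to apply on the fly during training.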
We use Blender [blender] for computing the reflection direction image and the first bounce interreflection (FBI) image described in the main text.
H.1 Network Architectures
Let C(k,ch_in,ch_out,s) be a convolution layer with kernel size k, input channel size ch_in, output channel size ch_out, and stride s. When the stride s is smaller than 1, we first apply nearest-pixel upsampling to the input feature and then process it with a regular convolution layer. We denote CNR and CR as the Convolution-InstanceNorm-ReLU layer and the Convolution-ReLU layer, respectively. A residual block R(ch) of channel size ch contains the convolution layers CNR(3,ch,ch,1)-CN(3,ch,ch,1), where the final output is the sum of the outputs of the first and second layers.
Encoder-Decoder Network Architecture
The architecture of the texture refinement network and the neural rendering network in Sec. 5 and Sec. 6 closely follows the encoder-decoder network of Johnson et al. [johnson2016perceptual]: CNR(9,ch_in,32,1)-CNR(3,32,64,2)-CNR(3,64,…)-C(3,32,3,1), where ch_in represents a variable input channel size, which is 3 for the texture refinement network and 13 for the neural rendering generator.
Material Weight Network
The architecture of the material weight estimation network in Sec. 5 is as follows: CNR(5,3,64,2)-CNR(3,64,64,2)-R(64)-R(64)-…
The discriminator networks used for the adversarial losses in Eq. 7 and Eq. 8 of the main paper both use the same architecture: CR(4,3,64,2)-CNR(4,64,128,2)-CNR(4,128,256,2)-CNR(4,256,512,2)-C(1,512,1,1). For this network, we use a LeakyReLU activation (slope 0.2) instead of the regular ReLU, so the CNR used here is a Convolution-InstanceNorm-LeakyReLU layer. Note that the spatial dimension of the discriminator output is larger than 1x1 for our image dimensions (640x480), i.e., the discriminator scores the realism of patches rather than the whole image (as in PatchGAN [isola2017image]).
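The patch size scored by each discriminator output can be derived from the standard receptive-field recurrence for a stack of convolutions (a small pure-Python sketch; padding does not affect the receptive-field size):

```python
# Receptive field of a convolution stack, computed layer by layer with
#   r <- r + (k - 1) * j,   j <- j * s
# where r is the receptive field and j the cumulative stride ("jump").
def receptive_field(layers):
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

# (kernel, stride) of CR(4,3,64,2)-CNR(4,64,128,2)-CNR(4,128,256,2)-
# CNR(4,256,512,2)-C(1,512,1,1):
disc = [(4, 2), (4, 2), (4, 2), (4, 2), (1, 1)]
rf = receptive_field(disc)  # each output score sees a 46x46 input patch
```

So each spatial position of the discriminator output judges the realism of one 46x46 patch of the 640x480 input, which is what makes the loss robust to global misalignment.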
Appendix I Supplementary Video
We strongly encourage readers to watch the supplementary video, as many of the results we present are best seen as videos. Our supplementary video contains visualizations of the input videos, environment estimations, our neural novel-view synthesis (NVS) renderings, and side-by-side comparisons against state-of-the-art NVS methods. We note that the ground-truth videos in the NVS section are cropped such that regions with missing geometry are displayed as black. The purpose of the crop is to provide fair visual comparisons between the ground truth and the rendering, so that viewers can focus on the realism of the reconstructed scene instead of the background. Since the reconstructed geometry is not always perfectly aligned with the input videos, some boundaries of the ground-truth stream may contain noticeable artifacts, such as edge-fattening. An example of this can be seen in the ‘acryl’ sequence, near the top of the object.
Appendix J Additional Results
Table 1 shows numerical comparisons on novel-view synthesis against state-of-the-art methods [hedman2018deep, thies2019deferred] for the two scenes presented in the main text (Fig. 7). We adopt two commonly used metrics, i.e., pixel-wise L1 and deep perceptual loss [johnson2016perceptual], to measure the distance between a predicted novel-view image and its corresponding ground-truth test image held out during training. As described in the main text, we focus on the systems' ability to extrapolate specular highlights; thus we only measure the errors on the object surfaces, i.e., we remove the diffuse backgrounds.
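Restricting the error to object surfaces amounts to averaging per-pixel distances under a binary mask, sketched below for the L1 metric (a minimal numpy illustration with toy values; the perceptual metric applies the same masking to feature-space distances):

```python
import numpy as np

# Masked pixel-wise L1: errors are averaged only over object-surface
# pixels (mask == True), excluding the diffuse background.
def masked_l1(pred, gt, mask):
    return np.abs(pred - gt)[mask].mean()

# Toy 2x2 example: the top-right pixel is background and is ignored.
pred = np.array([[0.5, 0.0], [1.0, 0.2]])
gt   = np.array([[0.4, 0.9], [1.0, 0.1]])
mask = np.array([[True, False], [True, True]])
err = masked_l1(pred, gt, mask)  # averages |0.1|, |0.0|, |0.1|
```

Without the mask, large but irrelevant background differences would dominate the score and hide differences in highlight reproduction.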
Fig. 18 shows that the naïve addition of the diffuse and specular components obtained from the optimization in Sec. 5 does not result in photorealistic novel-view synthesis, thus motivating a separate neural rendering step that takes as input the intermediate physically-based rendering components.
Fig. 19 shows novel-view neural rendering results, together with the estimated components (the diffuse and specular images) provided as input to the renderer. Our approach can synthesize photorealistic novel views of scenes with a wide range of materials, object compositions, and lighting conditions. Note that the featured scenes contain challenging properties such as bumpy surfaces (Fruits), rough reflecting surfaces (Macbook), and concave surfaces (Bowls). Overall, we demonstrate the robustness of our approach for various materials including fabric, metals, plastic, ceramic, fruit, wood, and glass.