Seeing the World in a Bag of Chips

01/14/2020 ∙ by Jeong Joon Park, et al. ∙ University of Washington

We address the dual problems of novel view synthesis and environment reconstruction from hand-held RGBD sensors. Our contributions include 1) modeling highly specular objects, 2) modeling inter-reflections and Fresnel effects, and 3) enabling surface light field reconstruction with the same input needed to reconstruct shape alone. In cases where the scene surface has a strong mirror-like material component, we generate highly detailed environment images, revealing room composition, objects, people, buildings, and trees visible through windows. Our approach achieves state-of-the-art view synthesis results, operates on low dynamic range imagery, and is robust to geometric and calibration errors.


1 Introduction

The glint of light off an object reveals much about its shape and composition – whether it’s wet or dry, rough or polished, round or flat. Yet, hidden in the pattern of highlights is also an image of the environment, often so distorted that we don’t even realize it’s there. Remarkably, images of the shiny bag of chips (Fig. 1) contain sufficient clues to reconstruct a detailed image of the room, including the layout of lights, windows, and even objects outside that are visible through windows.

In their visual microphone work, Davis et al. [Davis14] showed how sound and even conversations can be reconstructed from the minute vibrations visible in a bag of chips. Inspired by their work, we show that the same bag of chips can be used to reconstruct the environment. Instead of high speed video, however, we operate on RGBD video, as obtained with commodity depth sensors.

Visualizing the environment is closely connected to the problem of modeling the scene that reflects that environment. We solve both problems; beyond visualizing the room, we seek to predict how the objects and scene appear from any new viewpoint — i.e., to virtually explore the scene as if you were there. This view synthesis problem is a classical challenge in computer vision and graphics with a large literature, but several open problems remain. Chief among them are 1) specular surfaces, 2) inter-reflections, and 3) simple capture. In this paper we address all three of these problems, based on the framework of surface light fields [wood2000surface].

Our environment reconstructions, which we call specular reflectance maps (SRMs), represent the distant environment map convolved with the object’s specular BRDF. In cases where the object has strong mirror-like reflections, this SRM provides sharp, detailed features like those seen in Fig. 1. As most scenes are composed of a mixture of materials, each scene has multiple basis SRMs. We therefore reconstruct a global set of SRMs, together with a weighted material segmentation of scene surfaces. Based on the recovered SRMs, together with additional physically motivated components, we build a neural rendering network capable of faithfully approximating the true surface light field.

A major contribution of our approach is the capability of reconstructing a surface light field with the same input needed to compute shape alone [newcombe2011kinectfusion] using an RGBD camera. Additional contributions of our approach include the ability to operate on regular (low-dynamic range) imagery, and applicability to general, non-convex, textured scenes containing multiple objects and both diffuse and specular materials. Lastly, we release an RGBD dataset capturing reflective objects to facilitate research on lighting estimation and image-based rendering.

2 Related Work

We review related work in environment lighting estimation and novel-view synthesis approaches for modeling specular surfaces.

2.1 Environment Estimation

Single-View Estimation

The most straightforward way to capture an environment map (image) is via light probes (e.g., a mirrored ball [debevecHDR]) or by taking photos with a 360° camera [park2018surface]. Human eyeballs [nishino04] can even serve as light probes when they are present. For many applications, however, light probes are not available and we must rely on existing cues in the scene itself.

Other methods instead study recovering lighting from a photo of a general scene. Because this problem is severely under-constrained, these methods often rely on human inputs [karsch2011rendering, zheng2012interactive] or manually designed “intrinsic image” priors on illumination, material, and surface properties [karsch2014automatic, barron2014shape, barron2012shape, bi20151, lombardi2012reflectance].

Recent developments in deep learning techniques facilitate data-driven approaches for single view estimation. [gardner2017learning, gardner2019deep, song2019neural, legendre2019deeplight] learn a mapping from a perspective image to a wider-angle panoramic image. Other methods train models specifically tailored for outdoor scenes [hold2017deep, hold2019deep]. Because the single-view problem is severely ill-posed, most results are plausible but often non-veridical. Closely related to our work, Georgoulis et al. [georgoulis2017around] reconstruct perceivable environment images from specular reflections, but they make very limiting assumptions, such as a single, floating object with textureless and painted surfaces, known geometry, and manual specification of materials and segmentation.

Multi-View Estimation

Previous approaches achieve environment estimation from multi-view inputs, often as a byproduct of solving for scene appearance models.

For the special case of planar reflectors, layer separation techniques [szeliski2000layer, sinha2012image, xue2015computational, han2017reflection, guo2014robust, jachnik2012real, zhang2018single] enable high quality reconstructions of reflected environments, e.g., from video of a glass picture frame. Inferring reflections for general, curved surfaces is dramatically harder, even for humans, as the reflected content depends strongly and nonlinearly on surface shape and spatially-varying material properties.

A number of researchers have sought to recover low-frequency lighting from multiple images of curved objects. [zollhofer2015shading, or2015rgbd, maier2017intrinsic3d] infer spherical harmonics lighting (following [ramamoorthi2001signal]) to refine the surface geometry using principles of shape-from-shading. [richter2016instant] jointly optimizes low frequency lighting and BRDFs of a reconstructed scene. While suitable for approximating light source directions, these models don’t capture detailed images of the environment.

Wu et al. [wu2015simultaneous], like us, use a hand-held RGBD sensor to recover lighting and reflectance properties, but their method can only reconstruct a single, floating, convex object, and requires a black background. Dong et al. [dong2014appearance] produce high quality environment images from a video of a single rotating object; their method assumes a laboratory setup with a mechanical rotator, and manual registration of an accurate geometry to their video. Similarly, Xia et al. [xia2016recovering] use a robotic arm with calibration patterns to rotate an object. The authors note that highly specular surfaces cause trouble, thus limiting their real object samples to mostly rough, glossy materials. In contrast, our method operates with a hand-held camera for a wide range of multi-object scenes, and is designed to support specularity.

2.2 Novel View Synthesis

Novel view synthesis (NVS) methods synthesize realistic scene renderings from new camera viewpoints. In this section we focus on NVS methods capable of modeling specular reflections. We refer to [szeliski2010computer, thies2019deferred] for a more extensive review of the broader field.

Image-based Rendering

Light field methods [gortler1996lumigraph, levoy1996light, chen2002light, wood2000surface, davis2012unstructured] enable highly realistic views of specular surfaces at the expense of laborious scene capture from densely sampled viewpoints. Chen et al. [chen2018deep] regress the surface light field with neural networks to reduce the number of required views, but the system still needs samples across the hemisphere captured with a mechanical system. Although Park et al. [park2018surface] avoid dense hemispherical view sampling by applying a parametric BRDF model to represent the specular component, they assume known lighting.

Recent work applies convolutional neural networks (CNNs) to image-based rendering [flynn2016deepstereo, neuralrendering]. Hedman et al. [hedman2018deep] replaced the traditional view blending heuristics of IBR systems with CNN-learned blending weights. Still, novel views are composed of existing, captured pixels, so unobserved specular highlights cannot be synthesized. More recently, [aliev2019neural, thies2019deferred] enhance the traditional rendering pipeline by attaching learned features to 2D texture maps [thies2019deferred] or 3D point clouds [aliev2019neural] and achieve high quality view synthesis results. The features are nonetheless specifically optimized to fit the input views and do not extrapolate well to novel views. Recent learning-based methods achieve impressive local (versus hemispherical) light field reconstruction from a small set of images [mildenhall2019local, srinivasan2017learning, choi2019extreme, kalantari2016learning, zhou2018stereo].

BRDF Estimation Methods

Another way to synthesize novel views is to recover intrinsic surface reflection functions, known as BRDFs [nicodemus1965directional]. In general, recovering the surface BRDFs is a difficult task, as it involves inverting the complex light transport process. Consequently, existing reflectance capture methods place limits on operating range: e.g. isolated single object [wu2015simultaneous, dong2014appearance], known or controlled lighting [park2018surface, debevec1996modeling, lensch2003image, zhou2016sparse, xu2019deep], single view surface (versus a full 3D mesh) [goldman2010shape, li2018learning], flash photography [aittala2015two, lee2018practical, nam2018practical], or spatially constant material [meka2018lime, kim2017lightweight].

Interreflections

Very few view synthesis techniques support interreflections. Modeling a general multi-object scene requires solving for global illumination (e.g., shadows and interreflections), which has been shown to be difficult and sensitive to imperfections in real-world inputs [azinovic2019inverse]. Similarly, Lombardi et al. [lombardi2016radiometric] model multi-bounce lighting but with noticeable artifacts, and limit their results to mostly uniformly textured objects. Zhang et al. [zhang2016emptying] require manual annotations of light types and locations.

3 Method Overview

Our system takes a video and 3D mesh of a static scene (obtained via Newcombe et al. [newcombe2011kinectfusion]) as input and automatically recovers an image of the environment along with a scene appearance model that enables novel view synthesis. Our approach excels in specular scenes, and accounts for both specular interreflection and Fresnel effects. A key advantage of our approach is the use of easy, casual data capture from a hand-held camera; we reconstruct the environment map and a surface light field with the same input data needed to reconstruct the geometry alone, e.g. using [newcombe2011kinectfusion].

In Section 4, we provide a review of the formulation of the surface light field [wood2000surface] and define the specular reflectance map (SRM). Then, in Section 5, we show that given geometry and diffuse texture as input, we can jointly recover SRMs and material segmentation through an end-to-end optimization approach. Lastly, in Section  6, we describe a scene-specific neural rendering network that combines recovered SRMs and other rendering components to synthesize realistic novel-view images, with interreflections and Fresnel effects.

4 Surface Light Field Formulation

We model scene appearance using the concept of a surface light field [wood2000surface], which defines the color radiance of a surface point in every view direction, given approximate geometry $G$ [newcombe2011kinectfusion].

Formally, the surface light field, denoted $L$, assigns an RGB radiance value to a ray coming from surface point $\mathbf{x}$ with outgoing direction $\omega_o$: $L(\mathbf{x}, \omega_o) \in \mathbb{R}^3$. As is common in computer graphics [phong1975illumination, ward1992measuring], we decompose the surface light field into diffuse (view-independent) and specular (view-dependent) components:

$L(\mathbf{x}, \omega_o) = D(\mathbf{x}) + S(\mathbf{x}, \omega_o) \qquad (1)$

We compute the diffuse texture $D(\mathbf{x})$ for each surface point as the minimum intensity of $L(\mathbf{x}, \omega_o)$ across different input views, following [szeliski2000layer, park2018surface]. Because the diffuse component is view-independent, we can then render it from arbitrary viewpoints using the estimated geometry. However, textured 3D reconstructions typically contain errors (e.g., silhouettes are enlarged, as in Fig. 2), so we refine the rendered texture image using a neural network (Sec. 5).

For the specular component, we define the specular reflectance map (SRM), also known as a lumisphere [wood2000surface], denoted $E$, as a function that maps a reflection ray direction $\mathbf{r}$, defined as the vector reflection of $\omega_o$ about the surface normal $\mathbf{n}$ [wood2000surface], to specular reflectance (i.e., radiance): $E: \Omega \to \mathbb{R}^3$, where $\Omega$ is a unit hemisphere around the scene center. This model assumes distant environment illumination, although we add support for specular interreflection later in Sec. 6.1. Note that this model is closely related to the prefiltered environment maps [kautz2000unified] used in the graphics community for real-time rendering of specular highlights.

Given a specular reflectance map $E$, we can render the specular image from a virtual camera as follows:

$S(\mathbf{x}, \omega_o) = v(\mathbf{x}) \, E(\mathbf{r}(\mathbf{x}, \omega_o)) \qquad (2)$

where $v(\mathbf{x})$ is a shadow (visibility) term that is $0$ when the reflected ray from $\mathbf{x}$ intersects the known geometry $G$, and $1$ otherwise.

An SRM contains distant environment lighting convolved with a particular specular BRDF. As a result, a single SRM can only accurately describe one surface material. In order to generalize to multiple (and spatially varying) materials, we modify Eq. (2) by assuming the material at point $\mathbf{x}$ is a linear combination of $M$ basis materials [goldman2010shape, alldrin2008photometric, zickler2005reflectance]:

$S(\mathbf{x}, \omega_o) = v(\mathbf{x}) \sum_{m=1}^{M} w_m(\mathbf{x}) \, E_m(\mathbf{r}(\mathbf{x}, \omega_o)) \qquad (3)$

where $\sum_{m=1}^{M} w_m(\mathbf{x}) = 1$, and $M$ is user-specified. For each surface point $\mathbf{x}$, $w_m(\mathbf{x})$ defines the weight of material basis $m$. We use a neural network to approximate these weights in image-space, as described in the next section.

(a) Diffuse image
(b) Refined Diffuse image
Figure 2: The role of the diffuse refinement network in correcting geometry and texture errors of the RGBD reconstruction. The bottle geometry in image (a) is estimated larger than it actually is, and the background textures exhibit ghosting artifacts (faces). The refinement network corrects these issues (b). Best viewed digitally.

5 Estimating SRMs and Material Segmentation

Given scene shape and photos from known viewpoints as input, we now describe how to recover an optimal set of SRMs and material weights.

Suppose we want to predict a view of the scene from camera $c$ at pixels $p$, given known SRMs and material weights. We render the known diffuse component $D_c$ from the diffuse texture $D$, and a blending weight map $W^c_m$ for each SRM, using standard rasterization. A reflection direction image $R_c$ is obtained by computing per-pixel reflection vectors $\mathbf{r}$. We then compute the specular component image $S_c$ by looking up the reflected ray directions in each SRM, and then combining the radiance values using $W^c_m$:

$S_c(p) = v_c(p) \sum_{m=1}^{M} W^c_m(p) \, E_m(R_c(p)) \qquad (4)$

where $v_c(p)$ is the visibility term of pixel $p$ as used in Eq. (3). Each $E_m$ is stored as a 2D panorama image of resolution 500 x 250 in spherical coordinates.
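To make the lookup in Eq. (4) concrete, the sketch below renders the specular image for one view: it computes per-pixel reflection directions, indexes each 500 x 250 SRM panorama in spherical coordinates, blends the basis SRMs with the material weights, and masks occluded reflections. This is our own illustration; the variable names, nearest-neighbor lookup, and coordinate convention are assumptions rather than the released implementation.

```python
import numpy as np

def reflect(view_dirs, normals):
    # r = 2 (n . v) n - v, per pixel; view_dirs point from the surface toward the camera.
    dots = np.sum(view_dirs * normals, axis=-1, keepdims=True)
    return 2.0 * dots * normals - view_dirs

def srm_lookup(srm, dirs):
    # srm: (250, 500, 3) panorama in spherical coordinates; dirs: unit reflection directions.
    h, w, _ = srm.shape
    theta = np.arccos(np.clip(dirs[..., 2], -1.0, 1.0))        # polar angle in [0, pi]
    phi = np.arctan2(dirs[..., 1], dirs[..., 0])               # azimuth in [-pi, pi]
    rows = np.clip((theta / np.pi * (h - 1)).astype(int), 0, h - 1)
    cols = np.clip(((phi + np.pi) / (2 * np.pi) * (w - 1)).astype(int), 0, w - 1)
    return srm[rows, cols]

def render_specular(srms, weights, view_dirs, normals, visibility):
    # srms: list of M (250, 500, 3) panoramas; weights: (H, W, M) blending maps;
    # visibility: (H, W) mask that is 0 where the reflected ray hits known geometry.
    r = reflect(view_dirs, normals)                            # (H, W, 3)
    spec = np.zeros(view_dirs.shape[:2] + (3,))
    for m, srm in enumerate(srms):
        spec += weights[..., m:m + 1] * srm_lookup(srm, r)     # Eq. (4) blend
    return spec * visibility[..., None]
```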

Now, suppose that the SRMs and material weights are unknown; the optimal SRMs and combination weights minimize the energy defined as the sum of differences between the real photos $I_c$ and the rendered composites of diffuse and specular images over all input frames $c$:

$\{E^*_m, W^*_m\} = \arg\min_{E_m, W_m} \sum_{c} \mathcal{L}\big(D_c + S_c,\ I_c\big) \qquad (5)$

where $\mathcal{L}$ is a pixel-wise loss.

While Eq. (5) could be minimized directly to obtain values of $E_m$ and $W^c_m$, in practice, there are several limiting factors. First, specular highlights tend to be sparse and cover a small percentage of specular scene surfaces. Points on specular surfaces that don’t see a highlight are difficult to differentiate from diffuse surface points, making the problem of assigning material weights to surface points severely under-constrained. In addition, captured geometry is seldom perfect, and misalignments in the reconstructed diffuse texture can result in incorrect SRMs. In the remainder of this section, we describe our approach to overcoming each of these limiting factors.

Material weight network.

First, to address the problem of material ambiguity, we pose the material assignment problem as a statistical pattern recognition task. We compute the 2D weight maps $W^c_m$ with a convolutional neural network $f_w$ that learns to map a diffuse texture image patch to the blending weight of the $m$-th material. This network learns correlations between diffuse texture and material properties (e.g., shininess), and is trained on each scene by jointly optimizing the network weights and SRMs to reproduce the input images.

Since $f_w$ predicts material weights in image-space, and therefore per view, we introduce a view-consistency regularization term $\mathcal{L}^w_{vc}$ penalizing the pixel-wise difference in the predicted materials between a pair of views when cross-projected to each other (i.e., one image is warped to the other using the known geometry and pose).
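As an illustration, a minimal PyTorch sketch of such a view-consistency penalty is given below; it assumes the per-pixel correspondence between the two views has already been computed from the known geometry and poses and is provided as a normalized sampling grid, and the function name and weighting are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def view_consistency_loss(weights_a, weights_b, grid_b_to_a, valid_mask):
    """Penalize disagreement of predicted material weights across two views.

    weights_a, weights_b: (1, M, H, W) material weight maps predicted per view.
    grid_b_to_a:          (1, H, W, 2) normalized coords mapping pixels of view A
                          to their corresponding pixels in view B (from geometry + poses).
    valid_mask:           (1, 1, H, W) 1 where the cross-projection is valid in both views.
    """
    warped_b = F.grid_sample(weights_b, grid_b_to_a, align_corners=True)
    diff = torch.abs(weights_a - warped_b) * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1.0)
```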

Diffuse refinement network.

Small errors in geometry and calibration, as are typical in scanned models, cause misalignment and ghosting artifacts in the texture reconstruction $D_c$. Therefore, we introduce a refinement network $f_d$ to correct these errors (Fig. 2). We replace $D_c$ with the refined texture image $\hat{D}_c = f_d(D_c)$. Similar to the material weights, we penalize the inconsistency of the refined diffuse images across viewpoints using a view-consistency term $\mathcal{L}^d_{vc}$. Both networks $f_w$ and $f_d$ follow an encoder-decoder architecture with residual connections [johnson2016perceptual, he2016deep], while $f_w$ has a lower number of parameters. We refer readers to the supplementary for more details.

Robust Loss.

In order to recover fine details in the SRMs, it is necessary to use more than a simple pixel-wise loss. Therefore, we define the image distance metric as a combination of a pixel-wise $\ell_1$ loss, a perceptual loss computed from feature activations of a pretrained network [chen2017photographic], and an adversarial loss [goodfellow2014generative, isola2017image]. Our total loss, for a pair of images $(I, \hat{I})$, is:

$\mathcal{L}(I, \hat{I}) = \lambda_1 \| I - \hat{I} \|_1 + \lambda_p \, \mathcal{L}_{perc}(I, \hat{I}) + \lambda_a \, \mathcal{L}_{adv}(\hat{I}; \mathcal{D}) \qquad (6)$

where $\mathcal{D}$ is the discriminator, and $\lambda_1$, $\lambda_p$, and $\lambda_a$ are balancing coefficients, set to 0.01, 1.0, and 0.05, respectively. The neural network-based perceptual and adversarial losses are effective because they are robust to image-space misalignments caused by errors in the estimated geometry and poses.
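A minimal PyTorch sketch of such a combined loss follows; the VGG-16 feature layer, the ImageNet-style input normalization it presumes, and the least-squares generator term are our assumptions (the paper uses the perceptual network of [chen2017photographic] and may differ in these details), while the default weights follow the coefficients quoted above.

```python
import torch
import torch.nn as nn
import torchvision

class RobustImageLoss(nn.Module):
    """Pixel-wise L1 + VGG-feature perceptual loss + adversarial (generator) loss."""

    def __init__(self, discriminator, w_l1=0.01, w_perc=1.0, w_adv=0.05):
        super().__init__()
        self.disc = discriminator
        self.w_l1, self.w_perc, self.w_adv = w_l1, w_perc, w_adv
        # Frozen VGG-16 features as a stand-in perceptual network (inputs assumed
        # already normalized with ImageNet statistics).
        vgg = torchvision.models.vgg16(pretrained=True).features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg

    def forward(self, pred, target):
        l1 = torch.abs(pred - target).mean()
        perc = torch.abs(self.vgg(pred) - self.vgg(target)).mean()
        # Least-squares generator term: push discriminator scores of the prediction toward 1.
        adv = ((self.disc(pred) - 1.0) ** 2).mean()
        return self.w_l1 * l1 + self.w_perc * perc + self.w_adv * adv
```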

Finally, we add a sparsity term on the specular image $S_c$ to prevent the specular component from absorbing colors that belong in the diffuse texture.

Combining all elements, we get the final loss function:

$\mathcal{L}_{total} = \sum_{c} \Big[ \mathcal{L}\big(\hat{D}_c + S_c,\ I_c\big) + \lambda_{sp} \| S_c \|_1 + \lambda_{w} \mathcal{L}^w_{vc}(c, c') + \lambda_{d} \mathcal{L}^d_{vc}(c, c') \Big] \qquad (7)$

where $c'$ is a randomly chosen frame in the same batch as $c$ during each stochastic gradient descent step. $\lambda_{sp}$, $\lambda_{w}$, and $\lambda_{d}$ are set to 1e-4. An overview diagram is shown in Fig. 3. Fig. 5 shows that the optimization discovers coherent material regions and a perceivable environment image.

Figure 3: A diagram showing the components of our SRM estimation pipeline (optimized parameters shown in bold). We predict a view by adding the refined diffuse texture $\hat{D}_c$ (Fig. 2) and the specular image $S_c$. $S_c$ is computed, for each pixel, by looking up the basis SRMs ($E_m$'s) with the surface reflection direction and blending them with weights obtained via the network $f_w$. The loss between the predicted view and the ground truth $I_c$ is backpropagated to jointly optimize the SRM pixels and network weights.

(a) W/O Interreflections
(b) With Interreflections
(c) Ground Truth
(d) FBI
(e) Reflection Directions
(f) Fresnel
Figure 4: Modeling interreflections. The first row shows images of an unseen viewpoint rendered by a network trained with the direct model (a) and with the interreflection + Fresnel models (b), compared to ground truth (c). Note the proper interreflections on the bottom of the green bottle (b). (d), (e), and (f) show the first-bounce image (FBI), the reflection direction image ($R_c$), and the Fresnel coefficient image (FCI), respectively. Best viewed digitally.
(a) Input Video
(b) Material Weights
(c) Recovered SRM
(d) Ground Truth
(e) Recovered SRM
(f) Ground Truth
(g) Zoom-in(ours)
(h) Zoom-in(GT)
Figure 5: Sample results of recovered SRMs and material weights. Given input video frames (a), we recover global SRMs (c) and their linear combination weights (b) from the optimization of Eq. 7. The scenes presented here have two material bases, visualized with red and green channels. Estimated SRMs (c) corresponding to the shiny object surface (green channel) correctly capture the light sources of the scenes, shown in the reference panorama images (d). For both scenes the SRMs corresponding to the red channel are mostly black, and thus not shown, as the surface is mostly diffuse. The recovered SRM of (c) overemphasizes the blue channel due to oversaturation in the input images. The third row shows the estimation result from a video of the same bag of chips (first row) under different lighting. Close inspection of the recovered environment (g) reveals a great amount of detail about the scene, e.g., the number of floors in a nearby building.
(a) Input
(b) Legendre et al. [LeGendre19]
(c) Gardner et al. [gardner2017learning]
(d) Our Result
(e) Ground Truth
(f) Synthetic Scene
(g) Lombardi et al. [lombardi2016radiometric]
(h) Our Result
(i) Ground Truth
Figure 6: Comparisons with existing single-view and multi-view environment estimation methods. For single-view approaches, the results of DeepLight [LeGendre19] (b) and Gardner et al. [gardner2017learning] (c), given the input image (a), are relatively blurry and less accurate compared to the ground truth (e), while our approach reconstructs a detailed environment (d) from a video of the same scene. Additionally, from a video sequence and noisy geometry of a synthetic scene (f), our method (h) more accurately recovers the surrounding environment (i) compared to Lombardi et al. (g), which incorrectly estimates bright regions.

6 Novel-View Neural Rendering

With reconstructed SRMs and material weights, we can synthesize specular appearance from any desired viewpoint via Eq. (2). However, while the approach detailed in Sec. 5 reconstructs high quality SRMs, the renderings often lack realism (shown in supplementary), due to two factors. First, errors in geometry and camera pose can sometimes lead to weaker reconstructed highlights. Second, the SRMs do not model more complex light transport effects such as interreflections or Fresnel reflection. This section describes how we train a network to address these two limitations, yielding more realistic results.

Simulations only go so far, and computer renderings will never be perfect. In principle, you could train a CNN to render images as a function of viewpoint directly, training on actual photos. Indeed, several recent neural rendering methods adapt image translation [isola2017image] to learn mappings from projected point clouds [neuralrendering, pittaluga2019revealing, aliev2019neural] or a UV map image [thies2019deferred] to a photo. However, these methods struggle to extrapolate far away from the input views because their networks have to figure out the physics of specular highlights from scratch (see Sec. 8.2).

Rather than treat the rendering problem as a black box, we arm the neural renderer with knowledge of physics, in particular diffuse, specular, interreflection, and Fresnel reflection, to use in learning how to render images. Formally, we introduce an adversarial neural network-based generator $\mathcal{G}$ and discriminator $\mathcal{D}$ to render realistic photos. $\mathcal{G}$ takes as input our best prediction of the diffuse and specular components for the current view, $\hat{D}_c$ and $S_c$ (obtained from Eq. (7)), along with the interreflection and Fresnel terms (the first-bounce image, the reflection direction image $R_c$, and the Fresnel coefficient image) that will be defined later in this section.

Consequently, the generator receives these rendering components as input and outputs a prediction of the view, while the discriminator scores its realism. We use the combination of pixel-wise $\ell_1$, perceptual loss [chen2017photographic], and adversarial loss [isola2017image] as described in Sec. 5:

$\mathcal{L}_{render} = \lambda_1 \bar{\mathcal{L}}_{\ell_1} + \lambda_p \bar{\mathcal{L}}_{perc} + \lambda_a \bar{\mathcal{L}}_{adv} \qquad (8)$

where $\bar{\mathcal{L}}_{perc}$ is the mean of the perceptual loss across all input images, and $\bar{\mathcal{L}}_{\ell_1}$ and $\bar{\mathcal{L}}_{adv}$ are similarly defined as average losses across frames. Note that this renderer is scene specific, trained only on images of a particular scene to extrapolate new views of that same scene, as is commonly done in the neural rendering community [neuralrendering, thies2019deferred, aliev2019neural].

6.1 Modeling Interreflections and Fresnel Effects

Eq. (2) models only the direct illumination of each surface point by the environment, neglecting interreflections. While modeling full, global, diffuse + specular light transport is intractable, we can approximate first-order interreflections by ray-tracing a first-bounce image (FBI) as follows. For each pixel $p$ in the virtual viewpoint to be rendered, cast a ray from the camera center through $p$. If we pretend for now that every scene surface is a perfect mirror, that ray will bounce, potentially multiple times, and intersect multiple surfaces. Let $\mathbf{x}'$ be the second point of intersection of that ray with the scene. Render the pixel $p$ in the FBI with the diffuse color of $\mathbf{x}'$, or with black if there is no second intersection (Fig. 4d).

Glossy (imperfect mirror) interreflections can be modeled by convolving the FBI with the BRDF. Strictly speaking, however, the interreflected image should be filtered in the angular domain [ramamoorthi2001signal], rather than image space, i.e., as a convolution of incoming light over the specular lobe centered at the reflection ray direction $\mathbf{r}$. Given $R_c$, angular domain convolution can be approximated in image space by convolving the FBI with weights derived from $R_c$. However, because we do not know the specular kernel, we let the network infer the weights using $R_c$ as a guide. We encode the reflection direction $\mathbf{r}$ of each pixel as a three-channel image $R_c$ (Fig. 4e).
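As a toy illustration of what such an $R_c$-guided image-space filtering could look like, the sketch below applies a cross-bilateral blur to the FBI in which pixels with similar reflection directions contribute more; in the actual pipeline the network learns this weighting from $R_c$ rather than using a fixed kernel, and the kernel size and falloff here are arbitrary choices of ours.

```python
import numpy as np

def guided_blur_fbi(fbi, refl_dirs, radius=4, sigma=0.1):
    """Cross-bilateral blur of the first-bounce image (FBI).

    Pixels whose reflection directions are similar to the center pixel's
    contribute more, mimicking a convolution over the specular lobe.
    fbi: (H, W, 3) first-bounce image; refl_dirs: (H, W, 3) unit reflection directions.
    """
    h, w, _ = fbi.shape
    out = np.zeros_like(fbi)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            patch = fbi[y0:y1, x0:x1]
            d = refl_dirs[y0:y1, x0:x1] - refl_dirs[y, x]
            wts = np.exp(-np.sum(d * d, axis=-1) / (2 * sigma ** 2))
            out[y, x] = (patch * wts[..., None]).sum(axis=(0, 1)) / wts.sum()
    return out
```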

Fresnel effects make highlights stronger at near-glancing view angles and are important for realistic rendering. Fresnel coefficients are approximated following Schlick [schlick1994inexpensive]: $F(\theta) = F_0 + (1 - F_0)(1 - \cos\theta)^5$, where $\theta$ is the angle between the surface normal and the camera ray, and $F_0$ is a material-specific constant. We compute a Fresnel coefficient image (FCI), where each pixel contains the view-dependent factor $(1 - \cos\theta)^5$, and provide it to the network as an additional input, shown in Fig. 4(f).
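A minimal sketch of Schlick's approximation and the FCI; storing the view-dependent factor $(1 - \cos\theta)^5$ and leaving the material constant $F_0$ to the network is our reading of the text above.

```python
import numpy as np

def schlick_fresnel(cos_theta, f0):
    # F(theta) = F0 + (1 - F0) * (1 - cos(theta))^5   [schlick1994inexpensive]
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

def fresnel_coefficient_image(normals, view_dirs):
    # FCI: per-pixel (1 - cos(theta))^5, where theta is the angle between the surface
    # normal and the camera ray; the material-specific F0 is left to the network.
    cos_theta = np.clip(np.sum(normals * view_dirs, axis=-1), 0.0, 1.0)
    return (1.0 - cos_theta) ** 5
```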

In total, the rendering components now comprise five images: the diffuse and specular images, the FBI, the reflection direction image $R_c$, and the FCI. This stack is given as input to the neural network, and the network weights are optimized as in Eq. (8). Fig. 4 shows the effectiveness of the three additional rendering components in modeling interreflections.
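For concreteness, the five components can be stacked into a 13-channel tensor (3 diffuse + 3 specular + 3 FBI + 3 reflection-direction + 1 FCI), which matches the ch_in = 13 used for the neural rendering generator in the supplementary; the stacking below is our own sketch.

```python
import torch

def build_generator_input(diffuse, specular, fbi, refl_dirs, fci):
    """Stack the five rendering components into the 13-channel generator input.

    diffuse, specular, fbi, refl_dirs: (B, 3, H, W); fci: (B, 1, H, W).
    """
    x = torch.cat([diffuse, specular, fbi, refl_dirs, fci], dim=1)
    assert x.shape[1] == 13
    return x
```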

6.2 Implementation Details

We follow [johnson2016perceptual] for the generator network architecture, while we use the PatchGAN discriminator [isola2017image] and employ the loss of LSGAN [mao2017least]. We use ADAM [kingma2014adam] with learning rate 2e-4 to optimize the objectives. Data augmentation was carried out by applying random rotation, translation, flipping, and scaling to each input and output pair, which was essential for viewpoint generalization. We refer readers to supplementary for comprehensive implementation details.
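For reference, the LSGAN objective [mao2017least] replaces the cross-entropy GAN terms with least-squares ones; a minimal sketch (the 1/0 target convention is assumed here):

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    # Discriminator: push scores of real patches toward 1 and generated patches toward 0.
    return 0.5 * (((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean())

def lsgan_g_loss(d_fake):
    # Generator: push discriminator scores of generated patches toward 1.
    return 0.5 * ((d_fake - 1.0) ** 2).mean()
```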

(a) Camera Trajectory
(b) Reference Photo
(c) Ours
(d) DeepBlending [hedman2018deep]
(e) Thies et al. [thies2019deferred]
Figure 7: View extrapolation to extreme viewpoints. We evaluate novel-view generation on test views (red frusta) that are furthest from the input views (black frusta) (a). The view predictions of DeepBlending [hedman2018deep] and Thies et al. [thies2019deferred] (d,e) are notably different from the reference photographs (b), e.g., missing highlights on the back of the cat and incorrect highlights at the bottom of the cans. Thies et al. [thies2019deferred] show severe artifacts, likely because their learned UV texture features allow overfitting to input views, and thus cannot generalize to very different viewpoints. Our method (c) produces images with highlights appearing at the correct locations.

7 Dataset

We capture ten sequences of RGBD video with a hand-held Primesense depth camera, featuring a wide range of materials, lighting, objects, environments, and camera paths. We plan to release the dataset along with the camera parameters and reconstructed textured mesh. The length of each sequence ranges from 1500 to 3000 frames, which are split into train and test frames. Some of the sequences were captured such that the test views are very far from the training views, making them ideal for benchmarking the extrapolation abilities of novel-view synthesis methods. Moreover, many of the sequences come with ground truth HDR environment maps to facilitate future research on environment estimation. Further capture and data-processing details are covered in supplementary.

8 Experiments

We conduct experiments to test our system’s ability to estimate images of the environment and synthesize novel viewpoints. We also perform ablation studies to characterize the factors that most contribute to system performance.

We compare our approach to several state-of-the-art methods: recent single view lighting estimation methods (DeepLight [LeGendre19], Gardner et al. [gardner2017learning]), an RGBD video-based lighting and material reconstruction method (Lombardi et al. [lombardi2016radiometric]), an IR-based BRDF estimation method (Park et al. [park2018surface]), and two leading view synthesis methods capable of handling specular highlights – DeepBlending [hedman2018deep] and Deferred Neural Rendering (DNR) [thies2019deferred]. Note that these methods show state-of-the-art performance in their respective tasks, so we omit comparisons that are already included in their reports: e.g., DeepBlending thoroughly compares with image-based rendering methods [cayon2015bayesian, penner2017soft, buehler2001unstructured, eisemann2008floating, hedman2016scalable].

8.1 Environment Estimation

Our computed SRMs demonstrate our system’s ability to infer detailed images of the environment from the pattern and motion of specular highlights on an object. For example, from the recovered SRMs in Fig. 5, we can see the general layout of the living room, and even count the number of floors in buildings visible through the window. Note that the person capturing the video does not appear in the environment map because he is constantly moving. The shadow of the person, however, could cause artifacts – e.g., the fluorescent lighting in the first row of Fig. 5 is discontinuous.

Compared to the state-of-the-art single view estimation methods [legendre2019deeplight, gardner2017learning], our method produces a more accurate image of the environment, as shown in Fig. 6. Note our reconstruction shows a person standing near the window and autumn colors in a tree visible through the window.

We compare with the multi-view RGBD based method of Lombardi et al. [lombardi2016radiometric] on a synthetic scene containing a blob, which we obtained from the authors. As in [lombardi2016radiometric], we estimate lighting from the known geometry with added noise and a rendered video of the scene. The results show our method produces a more accurate estimate than the analytical BRDF method of Lombardi et al. [lombardi2016radiometric] (Fig. 6).

8.2 Novel-View Synthesis

We recover specular reflectance maps and train a generative network for each video sequence. The trained model is then used to generate novel views from held-out views.

In the supplementary, we show novel view generation results for different scenes, along with the intermediate rendering components and ground truth images. As view synthesis results are better shown in video form, we strongly encourage readers to watch the supplementary video.

Figure 8: Quantitative comparisons for novel-view synthesis. We plot the perceptual loss [zhang2018unreasonable] between a novel view rendering and the ground truth test image as a function of its distance to the nearest training view (measured as the angle between the view vectors). We compare our method with two leading NVS methods [hedman2018deep, thies2019deferred] on two scenes. On average, our results have the lowest error. Notice that Thies et al.'s prediction quality worsens dramatically for extrapolated views, suggesting the method overfits to the input views.
(a) Synthesized Novel-view
(b) Material Weights
Figure 9: Image (a) shows a synthesized novel view (using the neural rendering of Sec. 6) of a scene with multiple glossy materials. The spatially varying material of the wooden tabletop and the laptop is correctly discovered by our algorithm (Sec. 5). The SRM blending weights are visualized by the RGB channels of image (b).
(a) Ground Truth
(b) Synthesized Novel-view
Figure 10: Demonstration of modeling concave surfaces. The appearance of highly concave bowls is realistically reconstructed by our system. The rendered result (b) captures both occlusions and highlights of the ground truth (a).

Viewpoint Extrapolation

While view extrapolation is key for many applications, it has been particularly challenging for scenes with reflections. To test the operating range of ours and other recent view synthesis methods, we study how the quality of view prediction degrades as a function of the distance to the nearest input images (in difference of viewing angles) (Fig. 8). The prediction quality is measured with the neural network-based perceptual loss [zhang2018unreasonable], which is known to be more robust to shifts or misalignments, against the ground truth test image taken from the same pose. We used two video sequences, both containing highly reflective surfaces and taken with intentionally large differences between train and test viewpoints. In order to measure the quality of extrapolation, we focus our attention on the parts of the scene which exhibit significant view-dependent effects. That is, we mask out the diffuse backgrounds and measure the loss only on the central objects of the scene. We compare our method with DeepBlending [hedman2018deep] and Thies et al. [thies2019deferred]. The quantitative (Fig. 8) and qualitative (Fig. 7) results show that our method produces more accurate images of the scene from extrapolated viewpoints.

8.3 Robustness

Our method is robust to various scene configurations, such as scenes containing multiple objects (Fig. 7), spatially varying materials (Fig. 9), and concave surfaces (Fig. 10). In the supplementary, we study how the loss functions and surface roughness affect our results.

9 Limitations and Future work

Our approach relies on the reconstructed mesh obtained from fusing depth images of consumer-level depth cameras and thus fails for surfaces out of the operating range of these cameras, e.g., thin, transparent, or mirror surfaces. Currently, the recovered environment captures the lighting filtered by the surface BRDF; separating these two factors is an interesting topic of future work, perhaps via data-driven deconvolution. Last, reconstructing a room-scale photorealistic appearance model remains a major challenge.

Supplementary

Appendix A Overview

In this document we provide additional experimental results and extended technical details to supplement the main submission. We first discuss how the output of the system is affected by changes in the loss functions (Sec. B), scene surface characteristics (surface roughness) (Sec. C), and the number of material bases (Sec. D). We then showcase our system’s ability to model the Fresnel effect (Sec. E), and compare our method against a recent BRDF estimation approach (Sec. F). In Sections G and H, we explain the data capture process and provide additional implementation details. Finally, we describe our supplementary video (Sec. I) and show additional novel-view synthesis results along with their intermediate rendering components (Sec. J).

Appendix B Effects of Loss Functions

In this section, we study how the choice of loss functions affects the quality of environment estimation and novel view synthesis. Specifically, we consider three loss functions between prediction and reference images as introduced in the main paper: (i) pixel-wise loss, (ii) neural-network based perceptual loss, and (iii) adversarial loss. We run each of our algorithms (environment estimation and novel-view synthesis) for the three following cases: using (i) only, (i+ii) only, and all loss functions combined (i+ii+iii). For both algorithms we provide visual comparisons for each set of loss functions in Figures 11,12.

b.1 Environment Estimation

We run our joint optimization of SRMs and material weights to recover a visualization of the environment using the set of loss functions described above. As shown in Fig. 12, the pixel-wise L1 loss was unable to effectively penalize the view prediction error because it is very sensitive to misalignments due to noisy geometry and camera pose. While the addition of perceptual loss produces better results, one can observe muted specular highlights in the very bright regions. The adversarial loss, in addition to the two other losses, effectively deals with the input errors while simultaneously correctly capturing the light sources.

b.2 Novel-View Synthesis

We similarly train the novel-view neural rendering network of Sec. 6 using the aforementioned loss functions. The results in Fig. 11 show that while the L1 loss fails to capture specularity when significant image misalignments exist, the addition of the perceptual loss somewhat addresses the issue. As expected, using the adversarial loss, along with all other losses, allows the neural network to fully capture the intensity of specular highlights.

(a) GT
(b) L1 Loss
(c) L1+Percept
(d) All Losses
Figure 11: Effects of loss functions on neural rendering. The specular highlight on the forehead of the Labcat appears weaker than it actually is when using the L1 or perceptual loss, likely due to geometric and calibration errors. The highlight is best reproduced when the neural rendering pipeline of Sec. 6 is trained with the combination of L1, perceptual, and adversarial losses.
(a) Scene
(b) L1 Loss
(c) L1+Perceptual Loss
(d) L1+Perceptual+GAN Loss
Figure 12: Environment estimation using different loss functions. From input video sequences (a), we run our SRM estimation algorithm, varying the final loss function between the view predictions and input images. Because L1 loss (b) is very sensitive to misalignments caused by geometric and calibration errors, it averages out the observed specular highlights, resulting in missing detail for large portions of the environment. While the addition of perceptual loss (c) mitigates this problem, the resulting SRMs often lose the brightness or details of the specular highlights. The adoption of GAN loss produces improved results (d).

Appendix C Effects of Surface Roughness

As described in the main paper, our recovered specular reflectance map is the environment lighting convolved with the surface's specular BRDF. Thus, the quality of the estimated SRM should depend on the roughness of the surface; e.g., a near-Lambertian surface would not provide significant information about its surroundings. To test this claim, we run the SRM estimation algorithm on a synthetic object with varying levels of specular roughness. Specifically, we vary the roughness parameter of the GGX shading model [walter2007microfacet] from 0.01 to 1.0, where smaller values correspond to more mirror-like surfaces. We render images of the synthetic object, and provide those rendered images, as well as the geometry (with added noise in both scale and vertex displacements, to simulate a real scanning scenario), to our algorithm. The results show that the accuracy of environment estimation decreases as the object surface gets rougher, as expected (Fig. 16). Note that although increasing surface roughness does cause the amount of detail in our estimated environments to decrease, this is expected, as the recovered SRM still faithfully reproduces the convolved lighting (Fig. 15).

Appendix D Effects of Number of Material Bases

The joint SRM and segmentation optimization of the main paper requires a user to set the number of material bases. In this section, we study how the algorithm is affected by this user-specified number. Specifically, for a scene containing two cans, we run our algorithm twice, with the number of material bases set to two and three, respectively. The results of the experiment in Figure 13 suggest that the number of material bases does not have a significant effect on the output of our system.

(a) Input Texture
(b) Material Weights, M = 2
(c) Material Weights, M = 3
(d) Recovered SRM, M = 2
(e) Recovered SRM, M = 3
Figure 13: Sensitivity to the number of material bases M. We run our SRM estimation and material segmentation pipeline twice on the same scene but with a different number of material bases M, showing that our system is robust to the choice of M. We show the predicted combination weights of the network trained with two (b) and three (c) material bases. For both cases (b,c), the SRMs that correspond to the red and blue channels are mostly black, i.e., diffuse BRDF. Note that our algorithm consistently assigns the specular material (green channel) to the same regions of the image (the cans), and that the recovered SRMs corresponding to the green channel (d,e) are almost identical.

Appendix E Fresnel Effect Example

The Fresnel effect is a phenomenon where specular highlights tend to be stronger at near-glancing view angles, and is an important visual effect in the graphics community. We show in Fig. 14 that our neural rendering system correctly models the Fresnel effect. In the supplementary video, we show the Fresnel effect in motion, along with comparisons to the ground truth sequences.

Appendix F Comparison to BRDF Fitting

Recovering a parametric analytical BRDF is a popular strategy to model view-dependent effects. We thus compare our neural network-based novel-view synthesis approach against a recent BRDF fitting method of [park2018surface] that uses an IR laser and camera to optimize for the surface specular BRDF parameters. As shown in Fig. 17, sharp specular BRDF fitting methods are prone to failure when there are calibration errors or misalignments in geometry.

(a) View 1
(b) View 2
(c) View 3
(d) View 1
(e) View 2
(f) View 3
Figure 14: Demonstration of the Fresnel effect. The intensity of specular highlights tends to be amplified at near-glancing viewing angles. We show three different views (a,b,c) of a glossy bottle, each generated by our neural rendering pipeline and presenting a different viewing angle with respect to the bottle. Notice that the neural rendering correctly amplifies the specular highlights as the viewing angle gets closer to perpendicular with the surface normal. Images (d,e,f) show the computed Fresnel coefficient image (FCI) (see Sec. 6.1) for the corresponding views. These images are given as input to the neural renderer, which subsequently uses them to simulate the Fresnel effect. Best viewed digitally.

(a) Ground Truth Environment
(b) Input Frame
(c) Recovered SRM (GGX roughness 0.01)
(d) Input Frame
(e) Recovered SRM (GGX roughness 0.1)
(f) Input Frame
(g) Recovered SRM (GGX roughness 0.7)
Figure 15: Recovering the SRM for different surface roughness. We test the quality of the estimated SRMs (c,e,g) for various surface materials (shown in (b,d,f)). The results closely match our expectation that environment estimation through specularity is challenging for glossy (d) and diffuse (f) surfaces, compared to the mirror-like surfaces (c). Note that the inputs to our system are rendered images and noisy geometry, from which our system reliably estimates the environment.
Figure 16: Accuracy of environment estimation under different amounts of surface roughness. We see that increasing the material roughness does indeed decrease the overall quality of the reconstructed environment image, measured in pixel-wise L2 distance. Note that the roughness parameter is from the GGX [walter2007microfacet] shading model, which we use to render the synthetic models.

Appendix G Data Capture Details

As described in Sec. 7 of the main paper, we capture ten videos of objects with varying materials, lighting, and compositions. We used a Primesense Carmine RGBD structured light camera. We perform intrinsic and radiometric calibration, and correct the images for vignetting. During capture, the color and depth streams were hardware-synchronized and registered to the color camera frame of reference. The resolution of both streams is VGA (640x480) and the frame rate was set to 30fps. Camera exposure was manually set and fixed within a scene.

We obtained camera extrinsics by running ORB-SLAM [mur2017orb] (ICP [newcombe2011kinectfusion] was used instead for feature-poor scenes). Using the estimated poses, we ran volumetric fusion [newcombe2011kinectfusion] to obtain the geometry reconstruction. Once geometry and rough camera poses were estimated, we ran frame-to-model dense photometric alignment following [park2018surface] for more accurate camera positions, which are subsequently used to fuse the diffuse texture into the geometry. Following [park2018surface], we use iteratively reweighted least squares to compute a robust minimum of intensity for each surface point across viewpoints, which provides a good approximation to the diffuse texture.
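As a rough illustration of the per-point diffuse estimate, the sketch below approximates the IRLS robust minimum with a low starting percentile and iterative downweighting of brighter (likely specular) observations; the percentile, weighting, and iteration count are our own choices, not those of [park2018surface].

```python
import numpy as np

def robust_diffuse(observations, iters=5):
    """Robust per-vertex diffuse color from intensities observed across views.

    observations: (N_views, 3) RGB samples of one surface point.
    Starts from a low percentile and iteratively downweights samples brighter
    than the current estimate (likely specular), approximating a robust minimum.
    """
    est = np.percentile(observations, 10, axis=0)
    for _ in range(iters):
        residual = np.linalg.norm(observations - est, axis=1)
        brighter = observations.mean(axis=1) > est.mean()
        w = 1.0 / (1.0 + residual)
        w[brighter] *= 0.1          # penalize likely-specular (brighter) observations
        est = (w[:, None] * observations).sum(axis=0) / w.sum()
    return est
```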

Appendix H Implementation Details

Our pipeline is built using PyTorch [paszke2017automatic]. For all of our experiments we used the ADAM optimizer with learning rate 2e-4 for the neural networks and 1e-3 for the SRM pixels. For the SRM optimization described in Sec. 5 of the main text, training was run for 40 epochs (i.e., each training frame is processed 40 times), while the neural renderer was trained for 75 epochs.

We find that data augmentation plays a significant role in the view generalization of our algorithm. For the training in Sec. 5, we used random rotation (up to 180 degrees), translation (up to 100 pixels), and horizontal and vertical flips. For the neural renderer training in Sec. 6, we additionally scale the input images by a random factor between 0.8 and 1.25.
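A sketch of applying identical random transforms to an input/target pair (the parameter ranges follow the text above; the sampling itself is our own, and the scale term applies only to the renderer training):

```python
import random
import torchvision.transforms.functional as TF

def paired_augment(inp, target, max_translate=100, scale_range=(0.8, 1.25)):
    """Apply the same random rotation, translation, flip, and scale to both images."""
    angle = random.uniform(-180.0, 180.0)
    tx = random.randint(-max_translate, max_translate)
    ty = random.randint(-max_translate, max_translate)
    scale = random.uniform(*scale_range)
    inp = TF.affine(inp, angle=angle, translate=[tx, ty], scale=scale, shear=0.0)
    target = TF.affine(target, angle=angle, translate=[tx, ty], scale=scale, shear=0.0)
    if random.random() < 0.5:
        inp, target = TF.hflip(inp), TF.hflip(target)
    if random.random() < 0.5:
        inp, target = TF.vflip(inp), TF.vflip(target)
    return inp, target
```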

We use Blender [blender] for computing the reflection direction image and the first bounce interreflection (FBI) image described in the main text.

h.1 Network Architectures

Let C(k,ch_in,ch_out,s) be a convolution layer with kernel size k, input channel size ch_in, output channel size ch_out, and stride s. When the stride s is smaller than 1, we first conduct nearest-pixel upsampling on the input feature and then process it with a regular convolution layer. We denote CNR and CR to be the Convolution-InstanceNorm-ReLU layer and the Convolution-ReLU layer, respectively. A residual block R(ch) of channel size ch contains convolutional layers CNR(3,ch,ch,1)-CN(3,ch,ch,1), where the final output is the sum of the outputs of the first and the second layer.

Encoder-Decoder Network Architecture

The architectures of the texture refinement network and the neural rendering network in Sec. 5 and Sec. 6 closely follow the encoder-decoder network of Johnson et al. [johnson2016perceptual]: CNR(9,ch_in,32,1)-CNR(3,32,64,2)-CNR(3,64,128,2)-R(128)-R(128)-R(128)-R(128)-R(128)-CNR(3,128,64,1/2)-CNR(3,64,32,1/2)-C(3,32,3,1), where ch_in represents a variable input channel size, which is 3 and 13 for the texture refinement network and the neural rendering generator, respectively.
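A PyTorch transcription of the building blocks and the generator above (our own sketch; padding choices and the nearest-neighbor upsampling factor are assumptions). Note that the residual block returns the sum of the first- and second-layer outputs, as described in the notation above; a standard Johnson-style block would instead add the block input.

```python
import torch.nn as nn

def C(k, ch_in, ch_out, s):
    """Convolution; stride < 1 means nearest-neighbor upsampling followed by a stride-1 conv."""
    if s < 1:
        return nn.Sequential(nn.Upsample(scale_factor=int(round(1 / s)), mode='nearest'),
                             nn.Conv2d(ch_in, ch_out, k, stride=1, padding=k // 2))
    return nn.Conv2d(ch_in, ch_out, k, stride=s, padding=k // 2)

def CNR(k, ch_in, ch_out, s):
    """Convolution-InstanceNorm-ReLU."""
    return nn.Sequential(C(k, ch_in, ch_out, s), nn.InstanceNorm2d(ch_out), nn.ReLU())

class R(nn.Module):
    """Residual block R(ch): CNR(3,ch,ch,1)-CN(3,ch,ch,1), summing the two layer outputs."""
    def __init__(self, ch):
        super().__init__()
        self.layer1 = CNR(3, ch, ch, 1)
        self.layer2 = nn.Sequential(C(3, ch, ch, 1), nn.InstanceNorm2d(ch))  # CN, no ReLU

    def forward(self, x):
        h1 = self.layer1(x)
        return h1 + self.layer2(h1)

def make_generator(ch_in):
    """Encoder-decoder generator; ch_in = 3 (texture refinement) or 13 (neural renderer)."""
    return nn.Sequential(
        CNR(9, ch_in, 32, 1), CNR(3, 32, 64, 2), CNR(3, 64, 128, 2),
        R(128), R(128), R(128), R(128), R(128),
        CNR(3, 128, 64, 1/2), CNR(3, 64, 32, 1/2),
        C(3, 32, 3, 1),
    )
```

Under this sketch, make_generator(3) plays the role of the texture refinement network and make_generator(13) that of the neural rendering generator.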

Material Weight Network

The architecture of the material weight estimation network in Sec. 5 is as follows: CNR(5,3,64,2)-CNR(3,64,64,2)-R(64)-R(64)-CNR(3,64,32,1/2)-C(3,32,3,1/2).

Discriminator Architecture

The discriminator networks used for the adversarial losses in Eq. 7 and Eq. 8 of the main paper both use the same architecture: CR(4,3,64,2)-CNR(4,64,128,2)-CNR(4,128,256,2)-CNR(4,256,512,2)-C(1,512,1,1). For this network, we use a LeakyReLU activation (slope 0.2) instead of the regular ReLU, so the CNR used here is a Convolution-InstanceNorm-LeakyReLU layer. Note that the spatial dimension of the discriminator output is larger than 1x1 for our image dimensions (640x480), i.e., the discriminator scores the realism of patches rather than the whole image (as in PatchGAN [isola2017image]).
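A self-contained sketch of this patch discriminator (padding choices are ours); applied to a 640x480 image it outputs a spatial grid of realism scores rather than a single value.

```python
import torch.nn as nn

def conv_block(k, ch_in, ch_out, s, norm=True):
    """Conv (+ InstanceNorm) + LeakyReLU(0.2), as used in the discriminator."""
    layers = [nn.Conv2d(ch_in, ch_out, k, stride=s, padding=k // 2)]
    if norm:
        layers.append(nn.InstanceNorm2d(ch_out))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

def make_patch_discriminator():
    """CR(4,3,64,2)-CNR(4,64,128,2)-CNR(4,128,256,2)-CNR(4,256,512,2)-C(1,512,1,1)."""
    return nn.Sequential(
        conv_block(4, 3, 64, 2, norm=False),
        conv_block(4, 64, 128, 2),
        conv_block(4, 128, 256, 2),
        conv_block(4, 256, 512, 2),
        nn.Conv2d(512, 1, kernel_size=1, stride=1),
    )
```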

Appendix I Supplementary Video

We strongly encourage readers to watch the supplementary video, as many of the results we present are best seen in video form. Our supplementary video contains visualizations of input videos, environment estimations, our neural novel-view synthesis (NVS) renderings, and side-by-side comparisons against state-of-the-art NVS methods. We note that the ground truth videos of the NVS section are cropped such that regions with missing geometry are displayed as black. The purpose of the crop is to provide equal visual comparisons between the ground truth and the rendering, so that viewers are able to focus on the realism of the reconstructed scene instead of the background. Since the reconstructed geometry is not always perfectly aligned with the input videos, some boundaries of the ground truth stream may contain noticeable artifacts, such as edge-fattening. An example of this can be seen in the ‘acryl’ sequence, near the top of the object.

Appendix J Additional Results

Method                               Cans-L1    Labcat-L1   Cans-perc   Labcat-perc
DeepBlending [hedman2018deep]        9.82e-3    6.87e-3     0.186       0.137
Thies et al. [thies2019deferred]     9.88e-3    8.04e-3     0.163       0.178
Ours                                 4.51e-3    5.71e-3     0.103       0.098

Table 1: Average pixel-wise L1 error and perceptual error values (lower is better) across the different view synthesis methods on the two datasets (Cans, Labcat). The L1 metric is computed as the mean L1 distance across pixels and channels between the novel-view prediction and the ground-truth image. The perceptual error numbers correspond to the mean values of the measurements shown in Figure 8 of the main paper. As described in the main paper, we mask out the background (e.g., carpet) and focus only on the specular object surfaces.

Table 1 shows numerical comparisons on novel-view synthesis against the state-of-the-art methods [hedman2018deep, thies2019deferred] for the two scenes presented in the main text (Fig. 7). We adopt two commonly used metrics, i.e., pixel-wise L1 and deep perceptual loss [johnson2016perceptual], to measure the distance between a predicted novel-view image and its corresponding ground-truth test image held out during training. As described in the main text, we focus on the systems' ability to extrapolate specular highlights; thus we only measure the errors on the object surfaces, i.e., we remove the diffuse backgrounds.

(a) Reference
(b) Our Reconstruction
(c) Reconstruction by [park2018surface]
Figure 17: Comparison with Surface Light Field Fusion [park2018surface]. Note that the sharp specular highlight on the bottom-left of the Corncho bag is poorly reconstructed in the rendering of [park2018surface] (c). As shown in Sec. B and Fig. 19, these high frequency appearance details are only captured when using neural rendering and robust loss functions (b).

(a) Ground Truth
(b) Rendering with SRM
Figure 18: Motivation for neural rendering. While the SRM and segmentation obtained from the optimization of Sec. 5 of the main text provide a high quality environment reconstruction, the simple addition of the diffuse and specular components does not yield a photorealistic rendering (b) compared to the ground truth (a). This motivates the neural rendering network that takes the intermediate rendering components as input and generates photorealistic images (e.g., as shown in Fig. 19).

Fig. 18 shows that the naïve addition of the diffuse and specular components obtained from the optimization in Sec. 5 does not result in photorealistic novel view synthesis, thus motivating a separate neural rendering step that takes as input the intermediate physically-based rendering components.

Fig. 19 shows novel-view neural rendering results, together with the estimated components (the diffuse and specular images) provided as input to the renderer. Our approach can synthesize photorealistic novel views of scenes with a wide range of materials, object compositions, and lighting conditions. Note that the featured scenes contain challenging properties such as bumpy surfaces (Fruits), rough reflecting surfaces (Macbook), and concave surfaces (Bowls). Overall, we demonstrate the robustness of our approach for a variety of materials including fabric, metal, plastic, ceramic, fruit, wood, and glass.

(a) Ground Truth
(b) Our Rendering
(c) Specular Component
(d) Diffuse Component
Figure 19: Novel view renderings and intermediate rendering components for various scenes. From left to right: (a) reference photograph, (b) our rendering, (c) specular reflectance map image, and (d) diffuse texture image. Note that some of the ground truth reference images have black “background” pixels inserted near the top and left borders where reconstructed geometry is missing, to provide equal visual comparisons to rendered images.

References