Factorized and Controllable Neural Re-Rendering of Outdoor Scene for Photo Extrapolation

Expanding an existing tourist photo from a partially captured scene to a full scene is one of the desired experiences for photography applications. Although photo extrapolation has been well studied, it is much more challenging to extrapolate a photo (e.g., a selfie) from a narrow field of view to a wider one while maintaining a similar visual style. In this paper, we propose a factorized neural re-rendering model to produce photorealistic novel views from cluttered outdoor Internet photo collections, which enables applications including controllable scene re-rendering, photo extrapolation and even extrapolated 3D photo generation. Specifically, we first develop a novel factorized re-rendering pipeline to handle the ambiguity in the decomposition of geometry, appearance and illumination. We also propose a composited training strategy to tackle unexpected occlusions in Internet images. Moreover, to enhance photo-realism when extrapolating tourist photographs, we propose a novel realism augmentation process to complement appearance details, which automatically propagates texture details from a narrow captured photo to the extrapolated neural rendered image. The experiments and photo editing examples on outdoor scenes demonstrate the superior performance of our proposed method in both photo-realism and downstream applications.





1. Introduction

When a tourist visits a famous attraction, (s)he usually likes to take a picture that captures both the person and the whole scene. However, in many cases only part of the scene can be captured because of the camera's narrow field of view (FoV) and/or crowds, which may be frustrating to the tourist. So, similar to other image post-processing tasks (e.g., super-resolution, deblurring, etc.), it would be of great help if we could easily extrapolate the photo to obtain a wider view of the scene while maintaining a similar visual style (i.e., lighting condition or filtering effect) between the complemented image area and the original photo, as shown in the last row of Fig. 1. Moreover, it could be more attractive to expand the extrapolated photo to a full-scene dynamic 3D photo with vivid effects, as shown in Fig. 7.

The most straightforward solution relies on 2D image processing, including 2D image stitching (Brown and Lowe, 2007), 2D image extrapolation (Wang et al., 2014, 2018) or 2D image generation (Rockwell et al., 2021; Teterwak et al., 2019; Kim et al., 2021; Sabini and Rusak, 2018; Yang et al., 2019). 2D image stitching aims to combine multi-view images of small FoVs into a panorama. However, it can only generate high-quality images when the input multi-view images are captured at the same location and time, which may be impossible in crowded environments, in selfie mode, or for old photos captured earlier. Image extrapolation or generation learns to generate visually consistent content for the extrapolated regions using library images. However, such 2D methods are uncontrollable, so the extrapolated parts do not correspond to the real scene even though they look plausible, and the library images have to be carefully captured and generally occlusion-free (Wang et al., 2014, 2018). Moreover, these 2D methods cannot produce 3D photos with an immersive user experience (see Fig. 7).

Another possible solution is to first reconstruct the outdoor landmark from a collection of images and then re-render the scene with a large field of view and a lighting effect close to the original photo, a problem that has been widely studied in computer vision and graphics (Schönberger and Frahm, 2016; Xu and Tao, 2019; Kazhdan et al., 2006; Snavely et al., 2006). Traditional approaches use multi-view stereo and photometric blending to reconstruct a textured scene mesh (Kazhdan et al., 2006; Waechter et al., 2014) that supports virtual presence in the scene, but they require high-quality image sets (Waechter et al., 2014; Philip et al., 2019) and cannot adapt lighting conditions to a specific photograph. Recently, neural rendering has shown promising results on surface mesh reconstruction, novel view synthesis and scene relighting. It also makes it possible to model appearance variations of outdoor landmarks from cluttered Internet photo collections (Meshry et al., 2019; Martin-Brualla et al., 2021; Tancik et al., 2022; Chen et al., 2022), which relaxes the data requirements and offers high flexibility for lighting effect adaptation. However, existing works either disentangle lighting variations in a latent space without explicit control over illumination changes (Martin-Brualla et al., 2021; Tancik et al., 2022; Chen et al., 2022), or require carefully captured images to ensure a smooth factorization of material appearances (Srinivasan et al., 2021; Boss et al., 2021a; Zhang et al., 2021; Boss et al., 2021b). Besides, because of the smoothness inherent to neural networks, images rendered from neural implicit representations tend to average the observations and inevitably lose some appearance details (e.g., gushing springs), which largely degrades the user experience of photo extrapolation.

In this paper, we propose a novel factorized and controllable neural re-rendering pipeline that enables realistic outdoor photo extrapolation from readily accessible but cluttered Internet photo collections. Instead of modeling appearance variations as a whole in a latent space (Meshry et al., 2019; Martin-Brualla et al., 2021; Tancik et al., 2022; Chen et al., 2022), our rendering model factorizes the scene representation into several components (see Fig. 1), including base appearance, scene geometry (and normal), synthetic sky and an explicitly explainable illumination condition (with a data-driven HDR environment map, affine tone mapping and learnable shadow). Once the scene has been encoded in the rendering model, we can easily perform controllable scene re-rendering under novel views with user-selected lighting conditions, and conduct photo extrapolation or even extrapolated 3D photo generation that extends a captured tourist selfie from a narrow FoV to a wider view, while maintaining a similar lighting effect and appearance details by utilizing photo adaptation and a novel realism augmentation mechanism.

However, it is non-trivial to learn a factorized scene representation from cluttered outdoor photos and conduct realistic photo extrapolation with neural rendering models. 1) Due to the ill-posed nature of the problem, naïve solutions of NeRF-based inverse rendering (Srinivasan et al., 2021; Boss et al., 2021a; Zhang et al., 2021; Boss et al., 2021b) are no longer applicable to such cluttered images and unbounded outdoor scenes, hence we propose a novel rendering pipeline for this challenging task. Specifically, at the rendering stage, we utilize the data-driven sky HDR decoder from Hold-Geoffroy et al. (2019) to constrain the HDR map optimization to a reasonable latent space and to resolve the scale ambiguity between the recovered HDR environment map and the base appearance. Then, to model unobserved shadow casters (e.g., buildings behind the attraction), we introduce a learnable shadow branch that provides a spatial shadow value during volume rendering. Finally, to handle dramatic color distortion that is beyond physically explainable lighting (e.g., a user's filtering effect), we apply a learnable affine tone mapping (Rematas et al., 2022; Tancik et al., 2022) to the rendered pixels. 2) As the training images are crowd-sourced from the Internet, there might be unexpected occluders (e.g., tourists or birds) that affect the learning of factorized rendering even with transient modeling (Martin-Brualla et al., 2021) (see Sec. 4.6). To tackle this challenge, we employ a composited training scheme that first trains the geometry model and then trains the rendering model with distilled occlusion-free images from NeRF-W (Martin-Brualla et al., 2021). This process can be regarded as transferring the latent appearance embedding into an explicit and controllable illumination parameterization. 3) Since the neural implicit model tends to fuse appearances from multi-view observations, the re-rendered scene is somewhat blurrier than the user-captured photo and also lacks some live details such as fountain water splashes and clouds. To bridge the gap between the neural rendering and the tourist photos during photo extrapolation, we propose a novel realism augmentation that fully exploits rich textures from the given photo and propagates them into the rendered view. In this way, the extrapolated photograph can be more visually coherent with the captured one.

Our contributions can be summarized as follows. First, we propose a novel factorized neural rendering model which learns to encode unbounded outdoor scenes from cluttered Internet photo collections, and delivers the capability of controllable scene re-rendering, photo extrapolation and even extrapolated 3D photo generation. Second, to tackle the challenges of learning outdoor scene representations, our factorized rendering pipeline handles varying lighting effects and color distortions, and a composited training scheme guides the training process. Moreover, a novel realism augmentation mechanism is proposed to effectively complement details from a narrow-view real photo to a wide-view synthesized image. Finally, the experiments and photography editing examples on several outdoor attractions show the superiority of our method in scene re-rendering, photo extrapolation, and extrapolated 3D photo generation.

2. Related Works

Outdoor scene reconstruction and rendering. Traditional methods generally use SfM (Schönberger and Frahm, 2016) and MVS techniques to reconstruct a surface mesh (Xu and Tao, 2019; Kazhdan et al., 2006), and blend colored images (Waechter et al., 2014; Philip et al., 2019) to obtain a textured mesh for visualization. But they require high-quality images with consistent illumination conditions, and cannot handle Internet photo collections (Snavely et al., 2006) with varying lighting and frequent occlusions. Recently, researchers have used neural rendering techniques for outdoor scene rendering (Mildenhall et al., 2020; Li et al., 2020a; Martin-Brualla et al., 2021; Meshry et al., 2019; Rematas et al., 2022; Tancik et al., 2022; Xiangli et al., 2021; Chen et al., 2022). Li et al. (2020a) use multi-plane images (MPIs) to render outdoor attractions from photo collections, but cannot produce reasonable views from tilted side viewpoints due to the limitation of MPIs (Mildenhall et al., 2020). NRW (Meshry et al., 2019), NeRF-W (Martin-Brualla et al., 2021) and their follow-up works (Rematas et al., 2022; Tancik et al., 2022; Xiangli et al., 2021; Chen et al., 2022; Yang et al., 2022) model outdoor lighting variations with a latent appearance code, which enables novel view synthesis with customizable camera trajectories and supports appearance transitions through code interpolation. However, as these methods learn appearance variations in a standalone latent space, they do not support controllable re-rendering with user-selected lighting effects.

Scene rendering with controllable illumination. Early methods mainly rely on optical equipment to measure the geometry (Yu and Malik, 1998; Loscos et al., 1999), reflectance (Masselus et al., 2003; Troccoli and Allen, 2008) and environment lighting (Debevec, 2006; Stumpfel et al., 2006) for relightable scene rendering. In recent years, researchers have proposed to solve scene relighting (or inverse rendering) with neural networks (Li et al., 2020b; Li and Snavely, 2018; Luo et al., 2020; Yu et al., 2020; Yu and Smith, 2021; Zhu et al., 2021; Hold-Geoffroy et al., 2019), but these methods only support static photographs. Very recently, some works (Srinivasan et al., 2021; Boss et al., 2021a; Zhang et al., 2021; Kuang et al., 2022; Boss et al., 2021b; Guo et al., 2020) build a relightable implicit representation upon object-centric neural volume rendering (Mildenhall et al., 2020; Yang et al., 2021; Bangbang Yang and Chong Bao et al., 2022), which produces a "self-occlusion style" shadow effect by utilizing visibility from the learned density field. However, they cannot be extended to large-scale outdoor scenes and are not capable of handling noisy observations such as Internet photo collections. Concurrently with our work, Rudnev et al. (2021) propose to learn a neural radiance field for outdoor scene relighting, but show limited ability to represent and re-render Internet photo collections due to the coarse learned geometry (or surface normal) and simplified lighting model. Instead, as our surface normal is derived from the SDF field and we utilize a more flexible external lighting model, our approach can be applied to a broader range of outdoor scenes with cluttered photo collections.

Photo extrapolation. Photo extrapolation (a.k.a. image outpainting/expansion) extends a given image with a narrow FoV to a wide view. Early approaches build either a photo library (Wang et al., 2014) or a short video clip (Wang et al., 2018) of the surrounding scene, and perform image montage or stitching to outpaint the images. Since these approaches use reliable reference images to ensure consistent appearance in the extended area, they require laborious capturing of the scene and cannot adapt to the lighting conditions at different times of the day. Recent works attempt to conduct the extrapolation with generative neural models (Rockwell et al., 2021; Teterwak et al., 2019; Kim et al., 2021; Sabini and Rusak, 2018; Yang et al., 2019), which show plausible results for natural landscapes (e.g., mountain valley, beach) or daily scenes (e.g., cars, corridors), but might not look reasonable for tourist attractions with specific shapes and appearances (see Sec. 4.3). Besides, none of these image-based photo extrapolation methods is designed to support rendering novel views along given camera trajectories, which hinders downstream applications such as extrapolated 3D photo generation.

Figure 2. Overview. Our model learns the geometry and re-rendering of outdoor scenes from the photo collection through a composited training scheme. Specifically, scenes are rendered using external lighting with several factorized components, including geometry, basic appearance, HDR environment map, tone mapping, shadows, and synthetic sky. See the text for more details. Photo by Flickr user chiaki(c_c).6.

3. Method

We propose a novel factorized neural rendering framework that learns to encode outdoor scenes from Internet photo collections, which enables controllable scene re-rendering with user-desired lighting conditions, as well as photo extrapolation or extrapolated 3D photo generation that extends a narrow-view image to a broader field of view. We show an overview of our method in Fig. 2. Unlike previous neural implicit methods (Meshry et al., 2019; Martin-Brualla et al., 2021; Tancik et al., 2022; Chen et al., 2022) that encode all the appearance variations (e.g., lighting condition, auto exposure, white balancing and filtering effects, etc.) in one latent space, we present the first attempt to model outdoor scenes with a more controllable and explainable re-rendering pipeline (Sec. 3.1). To make training robust to noisy Internet photos, we utilize a composited training scheme (Sec. 3.2), which learns scene geometry with the transient removal strategy from Martin-Brualla et al. (2021), and supervises re-rendering with distilled occlusion-free images. Moreover, we apply a novel realism augmentation technique that propagates appearance details from tourist photos to the rendered views (Sec. 3.3), which efficiently improves the photo-realism of the rendering results. Please refer to our supplementary material for more technical background.

3.1. Factorized Outdoor Scene Re-Rendering

Factorized rendering formulation. We first introduce our factorized scene re-rendering pipeline, as shown in the middle part of Fig. 2. To allow controllable re-rendering of a neural implicit model with user-specified external lighting, previous methods generally require empirical normal regularization (Zhang et al., 2021; Srinivasan et al., 2021) or post-processing (Kuang et al., 2022) to obtain a smooth surface normal, which inevitably hurts geometry details. In contrast, we select SDF functions as the representation of scene geometry (Wang et al., 2021), since they offer an exact surface and a well-defined normal (by computing the gradient w.r.t. the query point) to facilitate the explicit relighting process. Specifically, we represent scene geometry and basic color with a geometry MLP and a base appearance MLP, and use explicit external lighting (i.e., an HDR map from the HDR decoder and affine tone mapping), a learnable shadow MLP and a standalone sky generator to re-render the scene with appearance variations. The rendering of a pixel from point samples along its ray is defined as follows:


where $\mathbf{c}_i$ is the relit scene color of the $i$-th sample (introduced later), $\mathbf{c}_{\text{sky}}$ is the generated sky color along the ray direction $\mathbf{v}$, conditioned on the environment code $\mathbf{z}^{\text{env}}$, $\Gamma$ is the tone mapping conditioned on the tone code $\mathbf{z}^{\text{tone}}$, $T_i$ is the accumulated transmittance, $\Phi_s$ is the cumulative distribution function of the logistic distribution, and $\alpha_i$ is the opacity derived from adjacent SDF values. More specifically, we define the relit scene color as follows:


where $\mathbf{a}$ is the basic color from the base appearance MLP, $s$ is the spatially varying shadow value from the shadow MLP conditioned on the shadow code $\mathbf{z}^{\text{shadow}}$, $\boldsymbol{\omega}$ indicates the incoming light direction, $\mathbf{n}$ is the surface normal derived from the gradient w.r.t. the query point $\mathbf{p}$, $L$ is the incoming HDR lighting along $\boldsymbol{\omega}$, conditioned on the environment code $\mathbf{z}^{\text{env}}$, and $\Delta\boldsymbol{\omega}$ is the solid angle of the light sample. In summary, the external lighting condition (or appearance variation) is implicitly encoded as the environment code, shadow code and tone code. Note that we adopt the Lambertian reflectance assumption as in previous works (Li et al., 2020b; Yu et al., 2020; Yu and Smith, 2021), which is generally sufficient for outdoor scenes. Besides, we also apply positional encoding (Mildenhall et al., 2020) to the query points and viewing directions. Next, we will introduce the details of each factorized component.
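For reference, the two rendering equations described in this subsection can be written out under stated assumptions. The following is only a sketch that is consistent with the verbal definitions above; it assumes the NeuS-style discrete opacity of Wang et al. (2021) and a discretized Lambertian shading sum. All symbol names, and in particular the choice to apply the tone mapping $\Gamma$ to the composited color including the sky term, are assumptions rather than the authors' exact formulation.

```latex
% Assumed notation: p_1..p_{N+1} are consecutive samples on the ray r with
% direction v, f is the SDF, a the base color, s the shadow value, n the normal,
% L the HDR lighting, and z^env, z^shadow, z^tone the per-frame latent codes.
\begin{align*}
\hat{C}(\mathbf{r}) &= \Gamma\Big(\sum_{i=1}^{N} T_i\,\alpha_i\,\mathbf{c}_i
  + T_{N+1}\,\mathbf{c}_{\text{sky}}(\mathbf{v}, \mathbf{z}^{\text{env}});\ \mathbf{z}^{\text{tone}}\Big), \\
T_i &= \prod_{j=1}^{i-1} (1 - \alpha_j), \qquad
\alpha_i = \max\!\left(\frac{\Phi_s(f(\mathbf{p}_i)) - \Phi_s(f(\mathbf{p}_{i+1}))}{\Phi_s(f(\mathbf{p}_i))},\ 0\right), \\
\mathbf{c}_i &= \mathbf{a}(\mathbf{p}_i)\, s(\mathbf{p}_i, \mathbf{z}^{\text{shadow}})
  \sum_{k} L(\boldsymbol{\omega}_k, \mathbf{z}^{\text{env}})\,
  \max\big(\mathbf{n}(\mathbf{p}_i)\cdot\boldsymbol{\omega}_k,\ 0\big)\,\Delta\boldsymbol{\omega}_k .
\end{align*}
```

Here $T_{N+1}$ is the transmittance remaining after the last sample, so the sky color only contributes where the ray has not been absorbed by the scene geometry.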

Data-driven HDR decoder. Though it is technically sound to re-render outdoor scenes with HDR maps, the problem of disentangling external environment lighting from photo collections is highly ill-posed. The reasons include the unconstrained freedom of the HDR map and the scale ambiguity between base appearance and HDR intensities. For example, one might learn an HDR map that absorbs the main colors of the scene while leaving a degenerate base appearance, or produce a reasonable relit result with a brighter base appearance and a darker HDR map. To this end, we propose to use a data-driven outdoor HDR prior, which constrains the optimization of environment maps to a pre-trained latent space. In practice, we first train a panoramic HDR sky network (Hold-Geoffroy et al., 2019) on the Laval sky dataset (Hold-Geoffroy et al., 2019), and take the sky decoder as a prior. During the re-rendering stage, we fix the weights of this decoder, feed it a per-frame latent environment code, and then downscale the decoder's output to obtain a lower-resolution environment map. In this way, our model can search for a proper HDR map while avoiding color leaking from buildings into light maps (similar observations have been made for indoor scenes (Zhang et al., 2021)).
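The scale ambiguity mentioned above can be illustrated with a small numerical sketch. The function below is a hypothetical, discretized Lambertian shading routine written for illustration only (it is not the authors' implementation, and the toy light samples are made up); it shows that multiplying the base color by a factor while dividing the lighting by the same factor leaves the relit color unchanged, which is why an external prior on the HDR map is helpful.

```python
import numpy as np

def lambertian_relight(base_color, normal, light_dirs, light_rgb, solid_angles):
    """Discretized Lambertian shading: base_color * sum_k L_k * max(n . w_k, 0) * dw_k."""
    cos_term = np.clip(light_dirs @ normal, 0.0, None)                  # (K,)
    irradiance = (light_rgb * (cos_term * solid_angles)[:, None]).sum(axis=0)  # (3,)
    return base_color * irradiance

# Hypothetical toy inputs: three unit light directions on the upper hemisphere.
normal = np.array([0.0, 0.0, 1.0])
light_dirs = np.array([[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.6, 0.8]])
light_rgb = np.array([[5.0, 4.5, 4.0], [1.0, 1.2, 1.5], [1.0, 1.2, 1.5]])
solid_angles = np.full(3, 0.1)
base_color = np.array([0.4, 0.3, 0.2])

# Brighter base color with darker lighting gives exactly the same pixel.
k = 2.0
relit_a = lambertian_relight(base_color, normal, light_dirs, light_rgb, solid_angles)
relit_b = lambertian_relight(base_color * k, normal, light_dirs, light_rgb / k, solid_angles)
```

Because `relit_a` and `relit_b` coincide, photometric supervision alone cannot separate the two explanations.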

Learnable shadow. Following the standard rendering pipeline (Akenine-Moller et al., 2019; Sloan et al., 2002), recent neural implicit rendering and relighting approaches (Srinivasan et al., 2021; Zhang et al., 2021; Guo et al., 2020) tend to learn a visibility mapping from self-occlusions of the density field, which is used to synthesize shadow effects in the rendered views but requires knowing the shape and position of all occluders in the scene. However, for outdoor photo collections, we found that shadows might come from unobserved buildings behind the capturing positions, which is beyond the visibility obtainable from self-occlusion of the reconstructed scene geometry and makes the problem much more complicated. Rather than pursuing a physically correct shadow mapping, inspired by 2D inverse rendering (Yu et al., 2020), we propose to model the shadow effect with a spatially varying shadow MLP. Specifically, the shadow MLP is conditioned on a per-frame latent shadow code, and learns a shadow value for each query point along the volume rendering rays. To alleviate undesired scaling between base appearance and shadow, we add a shadow regularization during the training stage, which encourages the shadow value to be close to 1 and is defined as follows:


In our experiments, we find that this shadow modeling successfully simulates shadow effects for cluttered outdoor photo collections without being affected by unseen shadow casters (Sec. 4.6).

Affine tone mapping. Because the output of the data-driven HDR decoder is a physically plausible sky lighting that is independent of sensor variations such as white balancing and exposure, we need an additional tone mapping to handle the wide variety of color distortions (even including extreme filtering effects applied by users) in photo collections. In practice, we adopt the strategy from Rematas et al. (2022) by learning an affine tone mapping matrix (only the upper 3 rows) for each frame, where the matrix is the output of a lightweight tone mapper with a per-frame latent tone code as input. To avoid color shifting of the base appearance due to the unconstrained freedom of the mapping matrix, we append an affine regularization to the training loss, which encourages the affine tone mapping to be "zero-mean" and is defined as follows:


where $\odot$ is the Hadamard product and $\mathbf{1}$ is the column vector whose entries are all 1.
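How a per-frame affine tone mapping acts on rendered pixels can be sketched as follows. This is a minimal illustration, assuming the matrix is the upper 3 rows of a homogeneous color transform (shape 3x4, with the last column acting as a color offset); the function name and the example "warm filter" values are hypothetical, and in the actual model the matrix would be predicted by the tone mapper from the tone code.

```python
import numpy as np

def apply_affine_tone_mapping(rgb, affine):
    """Apply a 3x4 affine color transform to RGB pixels of shape (..., 3)."""
    ones = np.ones(rgb.shape[:-1] + (1,), dtype=rgb.dtype)
    homogeneous = np.concatenate([rgb, ones], axis=-1)   # (..., 4)
    return homogeneous @ affine.T                        # (..., 3)

# Identity mapping: linear part is the identity, offset is zero.
identity = np.hstack([np.eye(3), np.zeros((3, 1))])

# Hypothetical "warm filter": boost red, attenuate blue, small offsets.
warm_filter = np.array([
    [1.1, 0.0, 0.0, 0.02],
    [0.0, 1.0, 0.0, 0.00],
    [0.0, 0.0, 0.9, -0.02],
])

pixels = np.array([[0.5, 0.5, 0.5], [0.2, 0.4, 0.6]])
mapped = apply_affine_tone_mapping(pixels, warm_filter)
```

Since such a matrix can absorb arbitrary global color shifts, a regularizer on it is needed to keep the base appearance from drifting, which is the role of the affine regularization described above.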

Neural sky generator. Since we build the scene geometry with an SDF-based model, the background sky cannot be modeled together with the buildings (Martin-Brualla et al., 2021; Wang et al., 2021; Rematas et al., 2022). Motivated by Rematas et al. (2022) and Hao et al. (2021), we use a neural sky generator to simulate the sky dome of the scene, which directly maps the viewing direction to a 3-channel sky color conditioned on the environment code. As demonstrated in Eq. (1), sky colors are blended according to the remaining transmittance, so we can jointly train the sky generator at the re-rendering stage.
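The blending of scene samples and sky color by remaining transmittance can be sketched in a few lines. The code below is an illustrative NumPy version that assumes the NeuS-style opacity of Wang et al. (2021), where opacity is computed from the logistic CDF of SDF values at adjacent samples; the function names, the sharpness value and the toy SDF profiles are assumptions for demonstration, and tone mapping is omitted.

```python
import numpy as np

def logistic_cdf(x, s):
    """Phi_s(x): CDF of the logistic distribution with sharpness s (a sigmoid)."""
    return 1.0 / (1.0 + np.exp(-s * x))

def composite_ray(sdf, colors, sky_color, s=10.0):
    """Composite N relit sample colors and a sky color along one ray.

    sdf:    (N+1,) SDF values at consecutive points along the ray.
    colors: (N, 3) relit colors associated with the first N points.
    Returns the pixel color, the per-sample weights and the remaining transmittance.
    """
    cdf = logistic_cdf(sdf, s)
    alpha = np.maximum((cdf[:-1] - cdf[1:]) / cdf[:-1], 0.0)     # (N,)
    # trans[i] = prod_{j<i} (1 - alpha_j); the last entry is what is left for the sky.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)])
    weights = trans[:-1] * alpha
    pixel = (weights[:, None] * colors).sum(axis=0) + trans[-1] * sky_color
    return pixel, weights, trans[-1]

sky = np.array([0.5, 0.7, 1.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

# Ray crossing the surface between the 2nd and 3rd points: the 2nd color dominates.
pixel_hit, w_hit, rest_hit = composite_ray(np.array([1.0, 0.5, -0.5, -1.0]), colors, sky, s=50.0)

# Ray that stays far from any surface: the pixel is pure sky.
pixel_sky, w_sky, rest_sky = composite_ray(np.array([5.0, 5.0, 5.0, 5.0]), colors, sky, s=50.0)
```

The sample weights and the remaining transmittance always sum to one, so the sky term fills exactly the part of the pixel not explained by the reconstructed geometry.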

3.2. Composited Training with Photo Collections

Compared to previous methods, our scene representation is much harder to train due to the ill-posed nature of the factorization and the noisy Internet photo collections. Therefore, we develop a composited training scheme, which learns scene geometry and re-rendering in a two-stage fashion.

Learning scene geometry from Internet photos. In the first stage, we learn scene geometry from photo collections with a geometry MLP and a radiance MLP following Wang et al. (2021). To handle occasional object occlusions and appearance variations, we adopt the appearance embedding and transient MLP from NeRF-W (Martin-Brualla et al., 2021) (only for this stage). Unlike radiance field methods (Martin-Brualla et al., 2021; Tancik et al., 2022) that can render the sky with scattered far samples, the SDF-based method is inclined to learn an exact surface, which results in a sky dome stitched to the buildings, an artifact that is undesirable for external relighting. So, we additionally apply a sky segmentation loss to encourage the sky area to be empty, which is defined as:


where the sky mask is annotated with Mask-RCNN (He et al., 2017). Now, we define the training loss of the scene geometry as:


where $\mathcal{L}_{\text{photo}}$ is the photometric loss with transient modeling following NeRF-W (Martin-Brualla et al., 2021), and $\mathcal{L}_{\text{eik}}$ is the Eikonal loss as suggested by Gropp et al. (2020). The weights of these loss terms are set as hyperparameters. Note that we omit the summation over pixels in this section for brevity. After the first training stage, we only keep the geometry MLP, while discarding the radiance and transient MLPs.
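To make the role of the Eikonal loss and the SDF-derived normal concrete, the following sketch evaluates both on an analytic sphere SDF. It is an illustration under assumptions: the gradient is approximated with central finite differences as a stand-in for the automatic differentiation a neural SDF would use, and the squared-deviation form of the Eikonal penalty follows the common formulation of Gropp et al. (2020) rather than any detail confirmed by this text.

```python
import numpy as np

def sphere_sdf(points, radius=1.0):
    """Signed distance to a sphere centered at the origin; points has shape (M, 3)."""
    return np.linalg.norm(points, axis=-1) - radius

def sdf_gradient(sdf_fn, points, eps=1e-4):
    """Central finite-difference gradient of sdf_fn at each point."""
    grads = np.zeros_like(points)
    for axis in range(3):
        offset = np.zeros(3)
        offset[axis] = eps
        grads[:, axis] = (sdf_fn(points + offset) - sdf_fn(points - offset)) / (2 * eps)
    return grads

def eikonal_loss(grads):
    """Penalize deviation of the gradient norm from 1, as a valid SDF requires."""
    return float(np.mean((np.linalg.norm(grads, axis=-1) - 1.0) ** 2))

points = np.array([[2.0, 0.0, 0.0], [0.0, 0.5, 0.0], [1.0, 1.0, 1.0]])
grads = sdf_gradient(sphere_sdf, points)

# Surface normals are the normalized SDF gradients.
normals = grads / np.linalg.norm(grads, axis=-1, keepdims=True)
```

A true distance field has unit gradient norm everywhere, so the loss is near zero for the sphere, while a wrongly scaled field (for example `2 * grads`) is penalized.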

Joint optimization of re-rendering. In the second stage, we learn factorized scene re-rendering with the geometry MLP frozen. Instead of training with raw cluttered photos or manually masking out occluders, we adopt a distillation approach that exploits occlusion-free images from NeRF-W's static branch for more stable supervision. Our experiments show that this strategy efficiently eases the learning of factorized re-rendering and improves the rendering quality both quantitatively and qualitatively (Sec. 4.6). The training loss is then defined as follows:


where $\mathcal{L}_{\text{mse}}$ is the MSE loss between the re-rendered pixels and the occlusion-free images, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the loss weights for the MSE loss, the shadow regularization (see Eq. (3)), and the affine tone mapping regularization (see Eq. (4)), respectively. The three weights are set empirically.

Figure 3. The pipeline of realism augmentation. We exploit texture details from tourist’s photos and propagate these details (e.g., water splash of the fountain) into the neural rendered large-FoV image. Photo by Flickr user MikiAnn.

3.3. Photo Adaptation & Realism Augmentation

Optimization-based photo adaptation. Once the factorized scene representation has been trained, our model can be adapted to real captured photos with novel lighting conditions, i.e., by minimizing the photometric error between rendered pixels and the captured photo through latent optimization of the shadow code, environment code and tone code. Besides, when adapting to photos with a large portion of people, such as tourist selfies in photo extrapolation applications, we mask out these parts and only optimize pixels labelled as sky and attraction (e.g., buildings and sculptures).
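The masked photometric objective used during adaptation can be sketched as below. This is a minimal illustration with a hypothetical function name and toy images; it only shows the error being restricted to pixels labelled as sky or attraction, and leaves out the renderer and the gradient-based update of the latent codes.

```python
import numpy as np

def masked_photometric_error(rendered, photo, valid_mask):
    """Mean squared error over pixels where valid_mask is True.

    rendered, photo: (H, W, 3) images; valid_mask: (H, W) boolean array that is
    False on masked-out regions such as people in the foreground.
    """
    diff = (rendered - photo)[valid_mask]          # (K, 3) for K valid pixels
    return float(np.mean(diff ** 2)) if diff.size else 0.0

# Toy example: the photo and rendering agree except at one "tourist" pixel.
photo = np.zeros((2, 2, 3))
rendered = photo.copy()
rendered[0, 0] = 1.0

full_mask = np.ones((2, 2), dtype=bool)
scene_mask = full_mask.copy()
scene_mask[0, 0] = False                            # exclude the tourist pixel

error_all = masked_photometric_error(rendered, photo, full_mask)
error_scene = masked_photometric_error(rendered, photo, scene_mask)
```

With the mask applied, the occluding person no longer contributes to the error, so the latent codes are fitted only to the scene content.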

Realism augmentation. Even though the optimization-based photo adaptation can achieve a rendering result close to the real photo, the synthetic view is still less detailed and therefore less realistic. For example, some live details such as water splashes of the fountain and shapes of clouds are missing, mainly because neural implicit rendering tends to average texture details from multi-view observations. Fortunately, for tasks like photo extrapolation, there is still an opportunity to enhance rendering details if we can fully exploit information from the given photos. To this end, we design a novel realism augmentation strategy, which propagates texture details from a narrow-view real photo to a wide-view rendered image. We show the pipeline of this strategy in Fig. 3, which illustrates the on-the-fly learning and inference fashion of this augmentation process. Specifically, we adopt the encoder-decoder network structure from the super-resolution work LIIF (Chen et al., 2021) as the realism augmentation network, since it uses an implicit representation of the image and flexibly supports arbitrary scales and aspect ratios. At the learning stage, we set the network input to a downscaled rendered image, and fine-tune the network with the aligned and masked (without occluders such as tourists) real photo as the target, so the network learns to compensate details from a blurry neural rendering toward the real photo. Then, at the propagating stage, we fix the network and simply run it on the complete rendered image, so that the learned "detail-compensating" knowledge is propagated to the full view of the rendering result.

Figure 4. We compare the scene re-rendering quality with other methods on several outdoor attractions, and also visualize the surface normal of our modeling. Note that our normal is much smoother than the previous NeRF-based method (see the supplementary materials for the comparison with NeRF-W). All images are from (Jin et al., 2021).

| Methods | Trevi Fountain | Sacre Coeur | Pantheon Exterior | Westminster Abbey | Notre Dame Front Facade |
| --- | --- | --- | --- | --- | --- |
| PixelSynth (Rockwell et al., 2021) | 14.73 / 0.587 / 0.693 | 14.19 / 0.659 / 0.566 | 12.09 / 0.603 / 0.607 | 14.00 / 0.664 / 0.618 | 12.43 / 0.589 / 0.643 |
| NeRF-W * (Martin-Brualla et al., 2021) | 21.31 / 0.764 / 0.380 | 21.23 / 0.850 / 0.283 | 24.78 / 0.875 / 0.265 | 21.35 / 0.801 / 0.364 | 20.86 / 0.735 / 0.456 |
| Ours | 23.09 / 0.792 / 0.345 | 21.51 / 0.849 / 0.162 | 24.46 / 0.867 / 0.237 | 24.66 / 0.854 / 0.238 | 22.75 / 0.833 / 0.254 |

Table 1. We compare the scene re-rendering quality with PixelSynth (Rockwell et al., 2021) and NeRF-W (Martin-Brualla et al., 2021) on five outdoor scenes of the Internet photo collections (Jin et al., 2021; Snavely et al., 2006). Each cell lists PSNR / SSIM / LPIPS. Note that we use an alternative implementation of NeRF-W (marked with *). See the text for details.

4. Experiments

In this section, we first evaluate the outdoor scene re-rendering quality of our method (Sec. 4.2), and then conduct photo extrapolation (Sec. 4.3), controllable scene re-rendering (Sec. 4.4) and extrapolated 3D photo generation (Sec. 4.5) on several outdoor attractions. Finally, we perform ablation studies to analyse the effectiveness of the training strategy and the factorized re-rendering components (Sec. 4.6).

4.1. Datasets

Following the previous work (Martin-Brualla et al., 2021), we use Internet photo collections of outdoor attractions from the Phototourism (IMC-PT) 2020 dataset (Jin et al., 2021; Snavely et al., 2006), where the image poses are recovered by COLMAP (Schönberger and Frahm, 2016). Specifically, we select 5 famous tourist attractions, including Trevi Fountain, Sacre Coeur, Westminster Abbey, Pantheon and Notre Dame. For Trevi Fountain and Sacre Coeur, we follow the split of NeRF-W for training and testing. For the other three scenes, we also take a similar pre-processing by discarding training images with a large portion of object occlusions, and only select occlusion-free images for metric evaluation.

Figure 5. We compare photo extrapolation with Auto-Stitch (Brown and Lowe, 2007), PixelSynth (Rockwell et al., 2021) and NeRF-W (Martin-Brualla et al., 2021) on four outdoor scenes (Jin et al., 2021; Snavely et al., 2006). Photos by Flickr users Hugão Cota, Legalv1, Foster’s Lightroom, and stobor.

4.2. Comparison of Scene Re-Rendering Quality

We first compare the scene re-rendering quality with the evaluation protocol from NeRF-W (Martin-Brualla et al., 2021), i.e., giving a left half image for optimization, the neural network is asked to render the full view of the image. The metrics of PSNR, SSIM and LPIPS (Zhang et al., 2018) are used to measure the rendering quality. Specifically, we adopt the baseline method NeRF-W (Martin-Brualla et al., 2021) and the SOTA image extrapolation method PixelSynth (Rockwell et al., 2021) for comparison. The other relightable neural implicit rendering methods (e.g., NeRV (Srinivasan et al., 2021), NeRD (Boss et al., 2021a), NerFactor (Zhang et al., 2021), and etc.) are not applicable here, because they are not feasible for unbounded outdoor scenes or learning from cluttered photo collections. Note that since NeRF-W has not released the official source code, we adopt an alternative implementation 111 in our experiment, thus the reported result is different from (Martin-Brualla et al., 2021). For PixelSynth, we overfit the network to each individual scene with occlusion-free training images as introduced in Sec. 3.2. We report the quantitative results in Tab. 1 and present the quantitative visualization in Fig. 4. It is obvious that even though we overfit each PixelSynth model to a specific scene, the complete rendering view is still far from satisfactory (e.g., the geometry structure differs a lot to the ground-truth), which proves that the GAN network of the PixelSynth does not ensure a consistent rendering output. NeRF-W achieves much better results for scene re-rendering, but some live details (e.g., water splash in Trevi Fountain and the sky clouds in Notre Dame) are still missing due to the smooth nature of neural implicit field (Zhang et al., 2021). Generally, a disentangled rendering pipeline is usually more challenging to render high-quality images (Zhang et al., 2021; Srinivasan et al., 2021). 
Thanks to the factorized re-rendering pipeline and realism enhancement in our method, we still achieve on-par or even better rendering quality both quantitatively and qualitatively while successfully maintaining live details close to the ground-truth. Moreover, we also exhibit our surface normal in the fourth row of Fig. 4, which demonstrates the high-quality geometry of the learned model and we believe it is the key to achieving a good re-rendering result with external lighting. Please refer to the supplementary materials for the additional comparison of surface normal with NeRF-W.
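As a reminder of how the first of the three metrics is defined, the following is a minimal sketch of PSNR on pixel values in [0, 1]. It is not the evaluation code used in the experiments; the function names and the toy pixel values are hypothetical, and SSIM and LPIPS are omitted because they require windowed statistics and a pretrained network, respectively.

```python
import math

def mse(pred, target):
    """Mean squared error between two equally sized flat lists of pixel values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    err = mse(pred, target)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

# Toy 2x2 grayscale "images", flattened row by row (hypothetical values).
ground_truth = [0.0, 0.5, 1.0, 0.25]
rendered = [0.1, 0.5, 0.9, 0.25]
print(f"PSNR: {psnr(rendered, ground_truth):.2f} dB")
```

Higher PSNR and SSIM indicate a closer match to the ground-truth, whereas for LPIPS a lower value is better.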

Figure 6. We exhibit the results of controllable re-rendering by changing tone mapping and HDR environment maps.

4.3. Comparison of Photo Extrapolation

We now compare the methods on the photo extrapolation task in Fig. 5. In this part, we also perform an extrapolation test on a 2D image-based method (Brown and Lowe, 2007), which is denoted as Auto-Stitch. Specifically, we first retrieve the 30 nearest capture positions from the photo collection based on the SfM camera poses and perform panoramic stitching with OpenPano (Brown and Lowe, 2007). Then, we warp the stitched image to the given photo view and blend the foreground tourists into the image. The extrapolated views of this pipeline are full of stitching artifacts such as human shadows, and there are empty regions near the border of the extended images due to the lack of observations near the capture position. This indicates that the 2D image-based approach is not suitable for photo extrapolation with Internet photo collections, as it might need a carefully collected data library to achieve a clean result (Wang et al., 2014; Philip et al., 2019). The photo extrapolation results of NeRF-W and PixelSynth, similar to what we have analyzed in Sec. 4.2, suffer from a lack of live texture details and from distorted 3D structure and color (e.g., in Pantheon of Fig. 5, due to the generalizability issue of the appearance embedding, NeRF-W correctly simulates the appearance of the building, but the blue sky color is severely distorted), which inevitably degrades the user experience of this functionality. Thanks to the factorized scene representation and realism augmentation, our results show better photo-realism with vivid details.
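The retrieval step of the Auto-Stitch baseline can be illustrated with a short sketch. It assumes that "nearby" means the smallest Euclidean distance between camera centers recovered by SfM; the text does not state the exact criterion (for example, whether viewing direction is also considered), and the function name and toy coordinates are hypothetical.

```python
import math

def nearest_cameras(query_center, camera_centers, k=30):
    """Return the indices of the k camera centers closest to query_center.

    Assumption: closeness is the Euclidean distance between camera centers.
    """
    order = sorted(
        range(len(camera_centers)),
        key=lambda i: math.dist(query_center, camera_centers[i]),
    )
    return order[:k]

# Toy camera centers in world coordinates (hypothetical values).
centers = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(nearest_cameras((0.0, 0.0, 0.0), centers, k=2))
```

The selected images would then be passed to the panorama stitcher; with sparse coverage around the query position, few of them overlap the area to be extrapolated, which is consistent with the empty border regions noted above.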

Figure 7. Panels: (a) original static photo; (b) extrapolated 3D photo (animated). We show two examples of extrapolated 3D photo generation, which transfers tourist photos into extrapolated and dynamic 3D photos with camera moving effect. Please use Adobe Reader or check our project webpage to see animations. Photos by Flickr users MikiAnn and Chris Devers.

4.4. Controllable Scene Re-Rendering

We show our controllable scene re-rendering capability with user-selected tone mapping and HDR environment maps in Fig. 6. Note that we cannot find proper competitors for this task, since existing methods either only support appearance changes through the latent space and lack explicit control of the lighting effect (Meshry et al., 2019; Martin-Brualla et al., 2021; Tancik et al., 2022; Chen et al., 2022), or only support inverse rendering and relighting for 2D images and cannot synthesize novel lighting effects along given camera trajectories (Yu and Smith, 2021; Yu et al., 2020). As shown in Fig. 6, with the factorized scene representation, we can freely control the lighting effect through tone mapping and even user-selected HDR maps, e.g., the re-rendered building naturally exhibits the dusk and cool-tone lighting of the given HDR maps in Fig. 6 (c) and (d).

4.5. Extrapolated 3D Photo Generation

We show the capability of extrapolated 3D photo generation in Fig. 7. By simply adapting the lighting condition to the given photo (Sec. 3.3) and enlarging the FoV of the renderer, our method naturally transfers a static tourist photo into an extrapolated and dynamic 3D photo with a vivid camera moving effect, whereas previous works (Shih et al., 2020; Wiles et al., 2020; Niklaus et al., 2019) only generate 3D photos bounded by the visible areas. Please refer to the supplementary material for the detailed implementation of our 3D photo generation.
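To make "enlarging the FoV of the renderer" concrete, the sketch below uses the standard pinhole relation between image width, focal length and horizontal field of view. It assumes the focal length in pixels is kept fixed so that the original photo keeps its scale in the center of the wider canvas; the actual implementation is described in the supplementary material and may differ, and the function names and numbers are hypothetical.

```python
import math

def focal_from_fov(width_px, fov_deg):
    """Focal length in pixels of a pinhole camera with the given horizontal FoV."""
    return 0.5 * width_px / math.tan(math.radians(fov_deg) / 2.0)

def width_for_fov(focal_px, fov_deg):
    """Image width in pixels needed to cover fov_deg at a fixed focal length."""
    return 2.0 * focal_px * math.tan(math.radians(fov_deg) / 2.0)

# Hypothetical example: an 800-pixel-wide photo with a 60-degree horizontal FoV
# is extended to a 90-degree FoV while keeping the focal length unchanged.
focal = focal_from_fov(800, 60.0)
extended_width = width_for_fov(focal, 90.0)
print(f"focal = {focal:.1f} px, extended width = {extended_width:.1f} px")
```

Rays for the added border pixels are cast from the same camera center, so the extrapolated region is rendered by the same scene model as the region covered by the original photo.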

4.6. Ablation Studies

Figure 8. We inspect the trained model w/ or w/o occlusion-free supervision for the re-rendering training stage.
Figure 9. We analyse the effectiveness of affine tone mapping and realism augmentation.
Figure 10. We show the effectiveness of our shadow modeling and visualize the learned shadow component.
Config.             Trevi Fountain
                    PSNR     SSIM     LPIPS
w/o Tone Mapping    22.95    0.787    0.356
w/o Shadow          21.77    0.772    0.378
w/o Realism Aug.    20.80    0.681    0.512
Full Model          23.09    0.792    0.345
Table 2. We perform ablation studies of the rendering component and the realism augmentation on the Trevi Fountain.

Composited training with occlusion-free supervision. We first inspect the effectiveness of the occlusion-free supervision in the re-rendering stage during composited training (Sec. 3.2). As shown in Fig. 8, when training with raw images that contain occluders, we might end up with a neural model that produces shadow-like artifacts in the lower part of the rendered image (Fig. 8 (a)), while the rendered result from the full model (Fig. 8 (b)) is free of such artifacts. This demonstrates the necessity of occlusion-free supervision for learning factorized re-rendering.

Affine tone mapping. We then analyze the impact of affine tone mapping in the factorized re-rendering. As shown in Fig. 9 and Tab. 2, the rendered scene without tone mapping shows pale lighting effects compared to the ground-truth image, which demonstrates that relying on the HDR decoder alone cannot guarantee faithful modeling of various lighting effects. By introducing affine tone mapping, we relieve the burden on the HDR decoder and achieve better photo adaptation ability (e.g., Fig. 9 (c) shows the scene lit by yellow sunlight as in the ground-truth, while Fig. 9 (a) fails to do so).
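For intuition only, the sketch below shows one simple form an affine tone mapping can take: a per-channel scale and bias applied to an HDR color, followed by clamping to the displayable range. This parameterization is an assumption made for illustration; the formulation actually used by the method is the one defined in Sec. 3, and the function name and values are hypothetical.

```python
def affine_tone_map(hdr_rgb, scale, bias):
    """Apply a per-channel affine map s * c + b and clamp the result to [0, 1].

    Assumption: a diagonal (per-channel) affine transform; a full 3x3 matrix
    would be an equally valid reading of "affine".
    """
    return tuple(
        min(1.0, max(0.0, s * c + b))
        for c, s, b in zip(hdr_rgb, scale, bias)
    )

# A neutral gray pixel pushed towards a warm, sunlit tone (hypothetical values).
warm = affine_tone_map((0.5, 0.5, 0.5), scale=(1.2, 1.0, 0.8), bias=(0.0, 0.0, 0.0))
print(warm)
```

Because the transform has only a few parameters per image, it can absorb global exposure and white-balance differences between photos, leaving the HDR decoder to model the scene lighting itself.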

Shadow modeling. We also study the effectiveness of shadow modeling in Fig. 10 and Tab. 2. With shadow modeling, our method better simulates shadows in the outdoor scene (e.g., the dimmed appearance in the highlighted area in Fig. 10), even if the shadow caster (e.g., a building behind the fountain, see the highlighted green rectangle in Fig. 10) has not been observed. Meanwhile, shadow modeling also improves the rendering quality metrics (see the second and the last rows of Tab. 2).

Realism augmentation. We finally inspect the efficacy of the realism augmentation mechanism. As shown in Fig. 9 and Fig. 3, the texture details of buildings and scenes (e.g., springs of the fountain and curves of the sculptures) are enhanced with rich patterns, and the lighting effect is also much closer to the ground-truth image. Besides, the metrics also improve by a large margin, as shown in Tab. 2. These results demonstrate the value of realism augmentation in broader applications such as photo adaptation and extrapolation, where a single neural model is required to adapt to different illumination conditions and dynamic scene details.

5. Conclusion

We propose a novel factorized neural re-rendering model, which encodes the appearance and geometry of outdoor scenes from Internet photo collections in a factorized paradigm, and delivers controllable scene re-rendering, photo extrapolation and even extrapolated 3D photo generation. Our method has three limitations. First, we adopt the Lambertian reflectance assumption for modeling, which cannot represent shiny and mirrored materials such as the glass walls of buildings. Second, our method is agnostic to unobserved shadow casters (e.g., a building behind the tourists), so the shadow effect is not controllable through external lighting. Third, we do not extrapolate the uncaptured parts of human bodies. A possible workaround is to adopt portrait image completion techniques (Wu et al., 2019) to complete the bodies, which can be directly incorporated into our pipeline in future work.

Acknowledgement. This work was partially supported by the NSFC (No. 62102356) and Zhejiang Lab (2021PE0AC01).


  • T. Akenine-Moller, E. Haines, and N. Hoffman (2019) Real-time rendering. AK Peters/CRC Press. Cited by: §3.1.
  • B. Yang, C. Bao, Y. Zhang, J. Zeng, H. Bao, Z. Cui, and G. Zhang (2022) NeuMesh: learning disentangled neural mesh-based implicit field for geometry and texture editing. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • M. Boss, R. Braun, V. Jampani, J. T. Barron, C. Liu, and H. Lensch (2021a) NeRD: neural reflectance decomposition from image collections. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12684–12694. Cited by: §1, §1, §2, §4.2.
  • M. Boss, V. Jampani, R. Braun, C. Liu, J. Barron, and H. Lensch (2021b) Neural-pil: neural pre-integrated lighting for reflectance decomposition. Advances in Neural Information Processing Systems 34. Cited by: §1, §1, §2.
  • M. Brown and D. G. Lowe (2007) Automatic panoramic image stitching using invariant features. International journal of computer vision 74 (1), pp. 59–73. Cited by: §1, Figure 5, §4.3.
  • X. Chen, Q. Zhang, X. Li, Y. Chen, Y. Feng, X. Wang, and J. Wang (2022) Hallucinated neural radiance fields in the wild. pp. 12943–12952. Cited by: §1, §1, §2, §3, §4.4.
  • Y. Chen, S. Liu, and X. Wang (2021) Learning continuous image representation with local implicit image function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8628–8638. Cited by: §3.3.
  • P. Debevec (2006) Image-based lighting. In ACM SIGGRAPH 2006 Courses, pp. 4–es. Cited by: §2.
  • A. Gropp, L. Yariv, N. Haim, M. Atzmon, and Y. Lipman (2020) Implicit geometric regularization for learning shapes. pp. 3569–3579. Cited by: §3.2.
  • M. Guo, A. Fathi, J. Wu, and T. Funkhouser (2020) Object-centric neural scene rendering. arXiv preprint arXiv:2012.08503. Cited by: §2, §3.1.
  • Z. Hao, A. Mallya, S. Belongie, and M. Liu (2021) GANcraft: unsupervised 3d neural rendering of minecraft worlds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14072–14082. Cited by: §3.1.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §3.2.
  • Y. Hold-Geoffroy, A. Athawale, and J. Lalonde (2019) Deep sky modeling for single image outdoor lighting estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6927–6935. Cited by: §1, §2, §3.1.
  • Y. Jin, D. Mishkin, A. Mishchuk, J. Matas, P. Fua, K. M. Yi, and E. Trulls (2021) Image matching across wide baselines: from paper to practice. International Journal of Computer Vision 129 (2), pp. 517–547. Cited by: Figure 1, Figure 4, Table 1, Figure 5, §4.1.
  • M. M. Kazhdan, M. Bolitho, and H. Hoppe (2006) Poisson Surface Reconstruction. In Proceedings of Eurographics Symposium on Geometry Processing, pp. 61–70. Cited by: §1, §2.
  • K. Kim, Y. Yun, K. Kang, K. Kong, S. Lee, and S. Kang (2021) Painting outside as inside: edge guided image outpainting via bidirectional rearrangement with progressive step learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2122–2130. Cited by: §1, §2.
  • Z. Kuang, K. Olszewski, M. Chai, Z. Huang, P. Achlioptas, and S. Tulyakov (2022) NeROIC: neural rendering of objects from online image collections. arXiv preprint arXiv:2201.02533. Cited by: §2, §3.1.
  • Z. Li and N. Snavely (2018) CGIntrinsics: better intrinsic image decomposition through physically-based rendering. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 371–387. Cited by: §2.
  • Z. Li, W. Xian, A. Davis, and N. Snavely (2020a) Crowdsampling the plenoptic function. In European Conference on Computer Vision, pp. 178–196. Cited by: §2.
  • Z. Li, M. Shafiei, R. Ramamoorthi, K. Sunkavalli, and M. Chandraker (2020b) Inverse rendering for complex indoor scenes: shape, spatially-varying lighting and svbrdf from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2475–2484. Cited by: §2, §3.1.
  • C. Loscos, M. Frasson, G. Drettakis, B. Walter, X. Granier, and P. Poulin (1999) Interactive virtual relighting and remodeling of real scenes. In Eurographics Workshop on Rendering Techniques, pp. 329–340. Cited by: §2.
  • J. Luo, Z. Huang, Y. Li, X. Zhou, G. Zhang, and H. Bao (2020) NIID-net: adapting surface normal knowledge for intrinsic image decomposition in indoor scenes. IEEE Transactions on Visualization and Computer Graphics 26 (12), pp. 3434–3445. Cited by: §2.
  • R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth (2021) NeRF in the wild: neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7210–7219. Cited by: §1, §1, §1, §2, §3.1, §3.2, Table 1, §3, §4.1, §4.2, §4.4.
  • V. Masselus, P. Peers, P. Dutré, and Y. D. Willems (2003) Relighting with 4d incident light fields. ACM Transactions on Graphics (TOG) 22 (3), pp. 613–620. Cited by: §2.
  • M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely, and R. Martin-Brualla (2019) Neural rerendering in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6878–6887. Cited by: §1, §1, §2, §3, §4.4.
  • B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) NeRF: representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pp. 405–421. Cited by: §2, §2, §3.1.
  • S. Niklaus, L. Mai, J. Yang, and F. Liu (2019) 3D ken burns effect from a single image. ACM Transactions on Graphics (ToG) 38 (6), pp. 1–15. Cited by: §4.5.
  • J. Philip, M. Gharbi, T. Zhou, A. A. Efros, and G. Drettakis (2019) Multi-view relighting using a geometry-aware network. ACM Trans. Graph. 38 (4), pp. 78–1. Cited by: §1, §2, §4.3.
  • K. Rematas, A. Liu, P. P. Srinivasan, J. T. Barron, A. Tagliasacchi, T. Funkhouser, and V. Ferrari (2022) Urban radiance fields. CVPR. Cited by: §1, §2, §3.1, §3.1.
  • C. Rockwell, D. F. Fouhey, and J. Johnson (2021) PixelSynth: generating a 3d-consistent experience from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14104–14113. Cited by: §1, §2, Table 1, §4.2.
  • V. Rudnev, M. Elgharib, W. Smith, L. Liu, V. Golyanik, and C. Theobalt (2021) Neural radiance fields for outdoor scene relighting. arXiv preprint arXiv:2112.05140. Cited by: §2.
  • M. Sabini and G. Rusak (2018) Painting outside the box: image outpainting with gans. arXiv preprint arXiv:1808.08483. Cited by: §1, §2.
  • J. L. Schönberger and J. Frahm (2016) Structure-from-Motion Revisited. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113. Cited by: §1, §2, §4.1.
  • M. Shih, S. Su, J. Kopf, and J. Huang (2020) 3D photography using context-aware layered depth inpainting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.5.
  • P. Sloan, J. Kautz, and J. Snyder (2002) Precomputed radiance transfer for real-time rendering in dynamic, low-frequency lighting environments. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pp. 527–536. Cited by: §3.1.
  • N. Snavely, S. M. Seitz, and R. Szeliski (2006) Photo tourism: exploring photo collections in 3d. In ACM siggraph 2006 papers, pp. 835–846. Cited by: §1, §2, Table 1, Figure 5, §4.1.
  • P. P. Srinivasan, B. Deng, X. Zhang, M. Tancik, B. Mildenhall, and J. T. Barron (2021) NeRV: neural reflectance and visibility fields for relighting and view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7495–7504. Cited by: §1, §1, §2, §3.1, §3.1, §4.2.
  • J. Stumpfel, A. Jones, A. Wenger, C. Tchou, T. Hawkins, and P. Debevec (2006) Direct hdr capture of the sun and sky. In ACM SIGGRAPH 2006 Courses, pp. 5–es. Cited by: §2.
  • M. Tancik, V. Casser, X. Yan, S. Pradhan, B. Mildenhall, P. P. Srinivasan, J. T. Barron, and H. Kretzschmar (2022) Block-nerf: scalable large scene neural view synthesis. pp. 8248–8258. Cited by: §1, §1, §1, §2, §3.2, §3, §4.4.
  • P. Teterwak, A. Sarna, D. Krishnan, A. Maschinot, D. Belanger, C. Liu, and W. T. Freeman (2019) Boundless: generative adversarial networks for image extension. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10521–10530. Cited by: §1, §2.
  • A. Troccoli and P. Allen (2008) Building illumination coherent 3d models of large-scale outdoor scenes. International Journal of Computer Vision 78 (2), pp. 261–280. Cited by: §2.
  • M. Waechter, N. Moehrle, and M. Goesele (2014) Let there be color! — Large-scale texturing of 3D reconstructions. In Proceedings of the European Conference on Computer Vision, Cited by: §1, §2.
  • M. Wang, Y. Lai, Y. Liang, R. R. Martin, and S. Hu (2014) BiggerPicture: data-driven image extrapolation using graph matching. ACM Transactions on Graphics 33 (6). Cited by: §1, §2, §4.3.
  • M. Wang, A. Shamir, G. Yang, J. Lin, G. Yang, S. Lu, and S. Hu (2018) BiggerSelfie: selfie video expansion with hand-held camera. IEEE Transactions on Image Processing 27 (12), pp. 5854–5865. Cited by: §1, §2.
  • P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang (2021) NeuS: learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS. Cited by: §3.1, §3.1, §3.2.
  • O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson (2020) Synsin: end-to-end view synthesis from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7467–7477. Cited by: §4.5.
  • X. Wu, R. Li, F. Zhang, J. Liu, J. Wang, A. Shamir, and S. Hu (2019) Deep portrait image completion and extrapolation. IEEE Transactions on Image Processing 29, pp. 2344–2355. Cited by: §5.
  • Y. Xiangli, L. Xu, X. Pan, N. Zhao, A. Rao, C. Theobalt, B. Dai, and D. Lin (2021) CityNeRF: building nerf at city scale. arXiv preprint arXiv:2112.05504. Cited by: §2.
  • Q. Xu and W. Tao (2019) Multi-Scale Geometric Consistency Guided Multi-View Stereo. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 5483–5492. Cited by: §1, §2.
  • B. Yang, Y. Zhang, Y. Li, Z. Cui, S. Fanello, H. Bao, and G. Zhang (2022) Neural rendering in a room: amodal 3d understanding and free-viewpoint rendering for the closed scene composed of pre-captured objects. ACM Trans. Graph. 41 (4), pp. 101:1–101:10. Cited by: §2.
  • B. Yang, Y. Zhang, Y. Xu, Y. Li, H. Zhou, H. Bao, G. Zhang, and Z. Cui (2021) Learning object-compositional neural radiance field for editable scene rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13779–13788. Cited by: §2.
  • Z. Yang, J. Dong, P. Liu, Y. Yang, and S. Yan (2019) Very long natural scenery image prediction by outpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10561–10570. Cited by: §1, §2.
  • Y. Yu, A. Meka, M. Elgharib, H. Seidel, C. Theobalt, and W. A. Smith (2020) Self-supervised outdoor scene relighting. In European Conference on Computer Vision, pp. 84–101. Cited by: §2, §3.1, §3.1, §4.4.
  • Y. Yu and W. A. P. Smith (2021) Outdoor inverse rendering from a single image using multiview self-supervision. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2, §3.1, §4.4.
  • Y. Yu and J. Malik (1998) Recovering photometric properties of architectural scenes from photographs. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pp. 207–217. Cited by: §2.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, Cited by: §4.2.
  • X. Zhang, P. P. Srinivasan, B. Deng, P. Debevec, W. T. Freeman, and J. T. Barron (2021) NeRFactor: neural factorization of shape and reflectance under an unknown illumination. ACM Transactions on Graphics (TOG) 40 (6), pp. 1–18. Cited by: §1, §1, §2, §3.1, §3.1, §3.1, §4.2.
  • Y. Zhu, Y. Zhang, S. Li, and B. Shi (2021) Spatially-varying outdoor lighting estimation from intrinsics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12834–12842. Cited by: §2.