Learning Neural Light Transport

06/05/2020 ∙ by Paul Sanzenbacher, et al. ∙ Max Planck Society

In recent years, deep generative models have gained significance due to their ability to synthesize natural-looking images with applications ranging from virtual reality to data augmentation for training computer vision models. While existing models are able to faithfully learn the image distribution of the training set, they often lack controllability as they operate in 2D pixel space and do not model the physical image formation process. In this work, we investigate the importance of 3D reasoning for photorealistic rendering. We present an approach for learning light transport in static and dynamic 3D scenes using a neural network with the goal of predicting photorealistic images. In contrast to existing approaches that operate in the 2D image domain, our approach reasons in both 3D and 2D space, thus enabling global illumination effects and manipulation of 3D scene geometry. Experimentally, we find that our model is able to produce photorealistic renderings of static and dynamic scenes. Moreover, it compares favorably to baselines which combine path tracing and image denoising at the same computational budget.




1 Introduction

Photorealistic rendering is a core problem in graphics and vision. Algorithms which are able to reason about direct and indirect illumination of a scene (i.e., global illumination) have become an essential building block for a wide range of applications such as gaming, virtual reality and movies. With the advent of deep learning, synthetic data generation emerged as another important application [Alhaija2018IJCV, Gaidon2016CVPR, Shrivastava2017CVPR, Dosovitskiy2017CORL, Varol2017CVPR] with the potential to satisfy the notorious data hunger of modern deep learning systems. However, as modern deep neural networks require large amounts of data, most existing approaches rely on approximate rendering techniques to accelerate training [Gaidon2016CVPR, Shrivastava2017CVPR, Dosovitskiy2017CORL, Varol2017CVPR]. Training embodied agents (i.e., using reinforcement learning) poses even stronger demands with respect to simulation time.


Figure 1: Motivation. We learn photorealistic rendering using a 3D Light Transport Layer in combination with a 2D Image Synthesis Layer. We demonstrate that our hybrid 3D-2D approach is able to synthesize realistic images with global illumination effects in real-time.

Historically, photorealistic image synthesis is achieved using sampling-based rendering techniques [Pharr2016, Veach1998] where the physics of light transport [Kajiya1986SIGGRAPH] are exploited to transform a physical description of a scene into a realistic image. However, while physically based rendering yields photorealistic results, it is also notoriously slow with rendering times of up to multiple hours for a single image.

On the other hand, recent advances in deep learning enabled the generation of highly realistic images [Goodfellow2014NIPS, Mescheder2018ICML, Karras2018ICLR, Karras2019CVPR] in milliseconds on a commercial GPU. Unfortunately, most existing approaches make use of rather abstract latent representations which do not allow for precise control over the 3D content. Moreover, the lack of a holistic scene description limits neural rendering approaches in their ability to render images that are consistent across viewpoints or time. While some recent works [Sitzmann2019CVPR, Aliev2019ARXIV, Meshry2019CVPR, Noguchi2019ARXIV] have shown that neural networks can produce consistent images for a given scene, these approaches usually do not explicitly reason about light transport. Consequently, they are not able to handle fine-grained geometric scene manipulation and do not integrate illumination into the 3D representation. Consider a moving light source that is not visible in the current view: while it clearly influences illumination in the scene, an image-based approach can hardly reason about it.

Contributions: In this work, we investigate the importance of 3D vs. 2D reasoning for efficient learning-based photorealistic rendering. Towards this goal, we present a learning-based approach (see Fig. 1 for a high-level overview) which can predict photorealistic images from a point-cloud based scene representation in real time. In contrast to existing approaches, our method performs reasoning both in 3D and 2D space which allows for learning the physical light transport in a scene. We hypothesize that this enables our method to handle scene modifications such as object translations, object removal and lighting changes. At the same time, our method allows for learning useful heuristics (e.g., shadows that are not affected by moving objects) from the training data, enabling fast rendering without sacrificing quality. We introduce two variants of our approach: (1) a PointNet-based [Qi2017CVPR] model and (2) an extension of this model using photon sampling which improves the quality of shadows and specular reflections. We demonstrate both theoretically as well as empirically that our model can be trained without bias using noisy renderings from a physically based renderer.

2 Related Work

Rendering: Physically based rendering is a well-studied field [Veach1998, Pharr2016] where much of the research in recent years focuses on optimizing different parts of the rendering pipeline [Muller2019TOG, Vicini2019TOG, Ren2013TOG, Rainer19CGF] or denoising of fast noisy renderings [Buades2005CVPR, Michael2017HPG, Chaitanya2017SIGGRAPH]. Moreover, there is a trend of making rendering algorithms differentiable in order to estimate scene properties [Azinovic2019CVPR, Loper2014ECCV, Gkioulekas2013TOG, Gkioulekas2016ECCV] or to use them for training deep neural networks [Valentin2019, NimierDavid2019SIGGRAPHASIA, Li2018TOG, Kato2018CVPR]. While recent approaches strive to achieve real-time photorealistic rendering [Schied2018PACMCGIT, Schied2017HPG], they often require additional assumptions such as temporal smoothness and are inherently limited by temporal accumulation of information in screen space. In this paper, we investigate the suitability of neural networks for learning light transport end-to-end, with the goal of rendering photorealistic images of dynamic scenes in real time.

Generative Models: Recently, deep generative models such as variational autoencoders (VAEs) [Hinton2016Science, Chan2005CPAM, Kingma2014ICLR, Huang2018NIPS] or (conditional) generative adversarial networks (GANs) [Mescheder2018ICML, Karras2018ICLR, Karras2019CVPR, Brock2019ICLR, Isola2017CVPR] have demonstrated that neural networks are capable of generating photorealistic synthetic imagery. A major limitation of these approaches is that their latent representation is typically rather abstract, making it hard to synthesize consistent images across different perspectives or to manipulate the 3D scene content. In contrast to the aforementioned approaches, our model achieves consistency across viewpoints and scene configurations by exploiting rich 3D representations to approximate light transport.

Novel View Synthesis: Alhaija et al. [Alhaija2018ACCV] and Nalbach et al. [Nalbach2017CGF] describe methods for generating renderings from multiple image buffers such as depth and materials. While this approach allows for rendering realistic images, a major limitation is that it operates in image space, making it hard to model global illumination correctly. There also exist several approaches for novel view synthesis [Park2017CVPR, Tinghui16ECCV, Dosovitskiy2014CVPR, Tatarchenko2016ECCV, Eslami2018Science, Kulkarni2015NIPS, Chen2016NIPS, Worall2017ICCV, Xinchen2016NIPS, Rhodin2018ECCV, Chen2017ICCV, Wang2018CVPRa] that make use of a latent scene representation. However, since these methods lack a geometric scene representation, it is hard to gain precise control over their output. In contrast, we learn to render images in a differentiable manner from a holistic scene representation.

Scene Representations: Aliev et al. [Aliev2019ARXIV] propose a neural approach for rendering novel views from point cloud representations [GarciaGarcia2016IJCNN, Qi2017NIPS, Zhang2019ICCV, Hua2018CVPR]. Meshry et al. [Meshry2019CVPR] model scenes using point clouds for re-rendering from novel views and under varying appearance. Hermosilla et al. [Hermosilla2019CGF] use point clouds for learning abstract features on a scene’s surface that can be used for rendering various illumination effects using direct illumination. While this approach captures complex illumination effects such as subsurface scattering, it is limited to diffuse, homogeneous materials and single objects. Recently, several alternative scene representations have been considered [Sitzmann2019NIPS, Thies2019TOG, Mescheder2019CVPR, Oechsle2019ICCV, Niemeyer2020CVPR]. Sitzmann et al. [Sitzmann2019CVPR] propose DeepVoxels, where object-centric static scenes are encoded in a voxel grid of learned features. While this method works well for rendering a sequence of coherent images, it is limited to compact static scenes due to the high memory requirements of voxel-based representations. Thies et al. [Thies2019TOG] propose a neural texture representation that can be used for novel view synthesis of objects. In contrast to these approaches our model aims to represent multiple scenes as well as scene dynamics by explicitly reasoning about light transport.

Figure 2: Model Overview. Our model samples a point cloud uniformly from the input 3D mesh and associates each point with additional properties (albedo, light spectrum). These features are processed using a Light Transport Layer which learns to approximate the light transport in the scene. The resulting features are projected into the 2D image domain and occluded points are removed. The final image is synthesized using an Image Synthesis Layer that takes the projected features as well as additional image space information as input.

3 Method

Our goal is to train a deep neural network to render a photorealistic scene specified in terms of a 3D model in real time. We first discuss our scene representation. Next, we describe our neural rendering architecture, which is able to learn complex illumination effects by exploiting both 3D and 2D information. Finally, we describe how we train our model using noisy renderings for supervision and show that under moderate assumptions our gradient estimates are unbiased. An overview of our approach is given in Fig. 2.

Scene Representation: How should a 3D scene be represented for efficient and photorealistic rendering? Traditionally, 3D geometry is often represented in the form of textured 3D meshes. However, while meshes and texture atlases are compact and encode useful geometric properties, they are inconvenient for neural networks due to their irregular structure. In contrast, voxel-based representations can be processed conveniently using 3D convolutions, yet they are limited by their cubic memory requirements. In this work, we therefore opt for a hybrid 2D-3D representation consisting of both image-space buffers such as albedo, normal and depth maps as well as 3D information. We represent 3D information in the form of an unstructured point cloud sampled from the scene’s surface with learned feature embeddings enriched by additional light and material properties.

Architecture: Our neural rendering model comprises three main parts as illustrated in Fig. 2: a Light Transport Layer, a 3D-to-2D projection step and an Image Synthesis Layer. The Light Transport Layer models global illumination effects that cannot be modeled in image space: consider for example a movable lamp which is present in the scene but not visible from the current point of view. The position, color and intensity of the lamp heavily impact the overall illumination of the scene, but an image-based method, by definition, will fail to reason about these effects. We therefore propose to reason in the 3D domain.

Our Light Transport Layer takes a set of randomly sampled 3D surface points and associated attributes for each point as input. These attributes comprise the user-defined material type, the surface albedo, and the light intensity emitted by the point if the point is located on a light-emitting surface. Our goal is to define an architecture that is able to model or approximate light transport in a scene sufficiently well such that illumination effects like reflections and shadows are predicted correctly.

Towards this goal, we first predict a feature embedding for each point using a PointNet-based architecture [Qi2017CVPR]. While we found that such a global representation is able to reason about global illumination to some extent, we additionally propose a more explicit model for light transport to model illumination effects more accurately. Inspired by photon mapping [photon-mapping], we sample additional photon points from all light sources in the scene. Photons are randomly cast into the scene and their first intersection with the scene geometry is computed. For each photon intersection, we process the position, color and direction of the initial photon point with a fully connected neural network, resulting in a feature vector at the intersection point. The photon network thus encodes information about the light color, intensity and direction, which is necessary for photorealistic shading.

Next, we remove occluded points in the 3D scene using the depth map and project the remaining point features and photon features onto the image plane using perspective projection $\pi(\mathbf{x}) = K\,T\,\mathbf{x}$ (in homogeneous coordinates), where $\mathbf{x}$ denotes the point location, $K$ is the camera matrix and $T$ the rigid world-to-view transformation matrix. The resulting 2D feature map is obtained by averaging all points projecting onto the same pixel. Formally, we obtain the feature $\mathbf{F}(\mathbf{u})$ at pixel $\mathbf{u}$ as

$$\mathbf{F}(\mathbf{u}) = \frac{1}{|\pi^{-1}(\mathbf{u})|} \sum_{\mathbf{x} \in \pi^{-1}(\mathbf{u})} \mathbf{f}(\mathbf{x})$$

where $\pi^{-1}(\mathbf{u})$ denotes the inverse projection, i.e., the set of visible points projecting onto $\mathbf{u}$, and $\mathbf{f}(\mathbf{x})$ is the feature vector of point $\mathbf{x}$. We concatenate the resulting feature map with additional image-space information $I$, i.e., a depth map, a normal map and an albedo map, which we obtain using OpenGL shaders. Note that this additional image-space information can be computed cheaply and complements the global scene representation with high-frequency albedo, normal and depth information. Additionally, we create a view ray map that encodes for each pixel a normalized vector pointing from the camera center to the pixel center in world coordinates. This information is necessary for learning specular reflection and refraction effects. The final image synthesis is performed using the Image Synthesis Layer which we implement using a conventional U-Net architecture [Ronneberger2015MICCAI].
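The view ray map described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' OpenGL implementation: it assumes a pinhole camera with a 3x3 intrinsic matrix `K`, a 4x4 rigid world-to-view matrix `T`, a camera that looks along the positive z-axis in view space, and pixel centers at half-integer coordinates. The function name `view_ray_map` is hypothetical.

```python
import numpy as np

def view_ray_map(K, T, height, width):
    """Return an (H, W, 3) map of unit ray directions in world coordinates.

    Assumptions (not from the paper): pinhole camera, +z viewing direction,
    pixel centers at (column + 0.5, row + 0.5).
    """
    # Pixel centers in homogeneous image coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3)
    # Back-project through the intrinsics to get directions in view space.
    dirs_view = pixels @ np.linalg.inv(K).T
    # Rotate into world space. For a rigid transform the inverse rotation is
    # the transpose, and right-multiplying row vectors by R applies R^T.
    R = T[:3, :3]
    dirs_world = dirs_view @ R
    # Normalize every ray to unit length.
    return dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)
```

Because only directions are stored, the translation part of `T` is not needed; the camera center is the same for all pixels.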

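Training: We train our model using a dataset which comprises pairs of 3D scene representations and noisy renderings that are obtained from a physically-based renderer which we run for only a few iterations. The input consists of a scene represented by a point cloud $P$, a view represented by a world-to-view transform $T$ and additional image-space information $I$. Our objective is to find a parameter vector $\theta^*$ which minimizes the mean squared error (MSE) between the network prediction $f_\theta(P, T, I)$ and the noisy rendering $\tilde{y}$ with respect to the model parameters $\theta$:

$$\theta^* = \arg\min_\theta \; \mathbb{E}\left[ \left\| f_\theta(P, T, I) - \tilde{y} \right\|_2^2 \right]$$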


Since obtaining clean renderings is very time-consuming, we propose the use of noisy renderings from a physically-based renderer. A similar technique has recently been used to learn image denoising [Lehtinen2018ICML, Krull2019CVPR]. Our key insight is that we can exploit the unbiasedness of rendering algorithms like bidirectional path tracing [Lafortune1993] to obtain unbiased gradient estimates.

Lemma 1.

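Let $x$ be an input representation of a scene, $f_\theta$ our rendering network and $\tilde{y}$ a noisy rendering of $x$ following a distribution $p(\tilde{y} \mid x)$ which depends on the chosen sampling-based rendering algorithm. Assume that the true (noise-free) rendering is given by $y$. Further assume that the rendering algorithm is unbiased, i.e., $\mathbb{E}[\tilde{y} \mid x] = y$. In this case, the following equality holds, i.e., the gradient estimates are unbiased:

$$\mathbb{E}_{\tilde{y} \sim p(\tilde{y} \mid x)}\left[ \nabla_\theta \left\| f_\theta(x) - \tilde{y} \right\|_2^2 \right] = \nabla_\theta \left\| f_\theta(x) - y \right\|_2^2$$

Proof. See supplementary material. ∎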
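The statement of the lemma can be checked numerically on a toy problem. The sketch below is not the authors' proof and does not use their network: it assumes a linear model `theta @ x` as a stand-in for the rendering network and Gaussian zero-mean noise as a stand-in for the renderer's sampling noise. All names (`mse_grad`, `clean`, `noisy`) are hypothetical. The key property it exercises is that the gradient of the squared error is linear in the target, so averaging gradients over unbiased noisy targets recovers the gradient for the clean target.

```python
import numpy as np

def mse_grad(theta, x, target):
    """Gradient of ||theta @ x - target||^2 with respect to theta."""
    residual = theta @ x - target            # (out_dim,)
    return 2.0 * np.outer(residual, x)       # (out_dim, in_dim)

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 5))              # toy model parameters
x = rng.normal(size=5)                       # toy scene input
clean = rng.uniform(size=3)                  # stands in for the noise-free rendering

# Zero-mean noise makes each noisy target an unbiased estimate of `clean`.
noisy = clean + rng.normal(scale=0.5, size=(20000, 3))

# Gradient with respect to the clean target.
grad_clean = mse_grad(theta, x, clean)

# Average of the per-sample gradients computed from noisy targets only.
residuals = theta @ x - noisy                # (num_samples, out_dim)
grad_noisy_mean = 2.0 * np.mean(residuals[:, :, None] * x[None, None, :], axis=0)

# The gap shrinks towards zero as the number of noisy samples grows.
max_gap = np.abs(grad_noisy_mean - grad_clean).max()
```

For a nonlinear network the same argument applies, since the Jacobian of the network does not depend on the target and can be pulled out of the expectation.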

Implementation Details: For the Light Transport Layer, we use a PointNet-based architecture [Qi2017CVPR] with ResNet blocks [He2016CVPR] of depth two. For the Image Synthesis Layer, we use a UNet [Ronneberger2015MICCAI] with four downsampling and four upsampling blocks. The network architecture used for photon feature creation is a fully connected ResNet [He2016CVPR] architecture with two residual blocks consisting of two fully connected layers each. The input, hidden and output dimensions are the same as for the PointNet architecture. For training, we use the Adam optimizer [Kingma2015ICLR] with a batch size of 128 for static scenes and 32 for dynamic scenes (see Section 4). The learning rate is decayed exponentially by multiplying it by a constant factor after every epoch. More details are provided in the supplementary material.

4 Experiments

In our experiments, we investigate the importance of 3D reasoning for learning photorealistic rendering from noisy observations. We conduct two types of experiments: In our first set of experiments, we investigate the importance of 3D information and the influence of the different components of the Image Synthesis Layer. To analyze these properties independently of light transport, we first run our approach on a static scene observed from varying viewpoints. Our second set of experiments addresses dynamic scenes (moving objects and light sources) using our complete pipeline including the Light Transport Layer.

Datasets: For our experiments on static scenes, we evaluate our approach on a simple static indoor scene containing a table, two light sources and a glass egg [Veach1998]. Our experiments on dynamic scenes are based on four realistic indoor scenes from [Bittlerli2016]. We use Mitsuba [Jakob2010] for both rendering and point sampling. Renderings are created using bidirectional path tracing, a modification of path tracing that is unbiased and converges faster [Veach1998]. For each scene, we create a training set of 100,000 images, varying the camera pose for each training sample. We sample 10,000 surface points for each scene. For our experiments on dynamic scenes, we randomly translate or remove objects in addition to varying the camera pose, and sample an additional 10,000 photons and intersections for each scene. The training data is visualized in the supplementary material.

Baselines: For our main experiment on dynamic scenes, we use three baseline methods: (1) a 2D CNN baseline which predicts images from the image-space input alone, (2) a denoising approach similar to the model of Lehtinen et al. [Lehtinen2018ICML], which learns to predict smooth renderings using noisy renderings as input, and (3) a simple feature projection approach similar to Aliev et al. [Aliev2019ARXIV] without the Light Transport Layer. For the denoising approach, we trade off accuracy against run time by adapting the number of pixels for which we run the bidirectional path tracer. We report results for different fractions of the total number of image pixels (e.g., 1/1 and 1/64 in Fig. 4), each with four samples per pixel, setting all other pixels to black. For a fair comparison, we use the same convolutional architecture for all baselines and our Image Synthesis Layer.
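The sparse input of the denoising baseline can be imitated by masking an existing rendering. The sketch below only simulates the effect: in the actual baseline the path tracer would be run for the selected pixels alone, which is where the run-time saving comes from. The text does not state how the pixels are chosen, so uniform random selection is an assumption here, and the function name `sparsify_rendering` is hypothetical.

```python
import numpy as np

def sparsify_rendering(rendering, keep_fraction, rng):
    """Keep a subset of pixels of an (H, W, 3) rendering, set the rest to black.

    Assumption (not from the paper): pixels are chosen uniformly at random
    without replacement.
    """
    height, width = rendering.shape[:2]
    num_keep = max(1, round(keep_fraction * height * width))
    keep = rng.choice(height * width, size=num_keep, replace=False)
    # Boolean mask that is True for the pixels that would be path traced.
    mask = np.zeros(height * width, dtype=bool)
    mask[keep] = True
    mask = mask.reshape(height, width)
    # Multiplying by the mask zeroes out all remaining pixels.
    return rendering * mask[..., None], mask
```

With `keep_fraction=1/64` on a 64 x 64 image, 64 pixels survive and the remaining 4032 are black.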

Metrics: For quantitative comparison, we evaluate the mean squared error (MSE) and the mean structural similarity (MSSIM) [Wang2004TIP]. MSE and MSSIM measure mostly low-level similarity. To also measure perceptual similarity, we compute the FID [Heusel2017NIPS] and a Feature-L1 distance [Oechsle2019ICCV] between predicted and ground truth images. For both the FID and the Feature-L1 distance, we use the features of the final average pooling layer of an Inception v3 network [Szegedy2015CVPR, Szegedy2016CVPR] trained on ImageNet.


4.1 Ablation Study on Static Scene

In this section, we conduct experiments on a static scene that does not contain moving objects or light sources. Our primary goal is to investigate the influence of the different elements of the Image Synthesis Layer as well as the importance of 3D information.

We compare the performance of our model without the Light Transport Layer for different input modalities. Fig. 3 shows the different configurations which are evaluated against each other. We choose a subset of 6 representative configurations to highlight the importance of each input. While configurations 1, 2 and 3 use only 3D information (but no image space information), configurations 4 and 5 rely solely on image space information. Finally, configuration 6 combines both 3D and image space information.

[Figure 3 image panels, left to right: Teaching Input, configurations 1 to 6, Ground Truth]

config  position  point features  normal map  ray direction map  MSE     MSSIM  FID
1       yes       no              no          no                 0.0106  0.81   154.6
2       no        yes             no          no                 0.0107  0.80   158.6
3       yes       yes             no          no                 0.0108  0.81   149.4
4       no        no              yes         no                 0.0161  0.78   138.1
5       no        no              yes         yes                0.0107  0.83   124.0
6       yes       yes             yes         yes                0.0084  0.88   86.1

Figure 3: Ablation Study on Static Scene. Comparing different input configurations for a static scene. The metrics are evaluated on a separate held-out validation set comprising 2048 samples. All networks were trained for 200,000 iterations with a batch size of 128. Note how the full model (6) predicts images that are significantly less noisy than the teaching input (left). Additional qualitative results are provided in the supplementary material.

Results: Configurations 1, 2, 3 and 5 show similar performance in terms of MSE, while configuration 5, which does not receive any projected point cloud information as input, clearly outperforms the other three configurations in terms of MSSIM and FID. However, surface normal information only yields good results if supplemented by viewpoint information, as becomes evident when comparing configurations 4 and 5. The most important insight is that all inputs in combination (configuration 6) outperform the other configurations on all metrics by a large margin. This result supports our initial hypothesis that reasoning in both 3D and 2D is crucial for this task. Fig. 3 (top) shows a qualitative result. While configurations 1, 2 and 3 achieve reasonable qualitative results, they also contain several artifacts (e.g., the table) which do not occur in configuration 6. Configurations 4 and 5 do not exploit 3D information, thus severely degrading visual fidelity. This highlights the importance of 3D information for learning-based rendering. Configuration 6, which uses both 2D image space and 3D information, yields the best qualitative results.

[Figure 4 image panels, left to right: Teaching Input, Denoising (1/1), Denoising (1/64), CNN only, Feature Projection, Ours w/o Photons, Ours w/ Photons, Ground Truth]

Architecture         time / frame  MSE (↓)  MSSIM (↑)  FID (↓)  Feature L1 (↓)
Denoising (1/1)      1.5059s       0.0005   0.880      26.4     0.163
Denoising (1/64)     0.0283s       0.0029   0.781      94.0     0.281
CNN only             0.0191s       0.0043   0.835      36.1     0.195
Feature Projection   0.0210s       0.0037   0.841      32.5     0.185
Ours (w/o Photons)   0.0243s       0.0044   0.841      31.4     0.184
Ours (w/ Photons)    0.0459s       0.0028   0.849      30.6     0.182

Figure 4: Dynamic Objects and Fixed Lights. Results on dynamic scenes where objects are modified but light sources kept fixed. We show the non-real-time denoising baseline “Denoising (1/1)” for reference. Additional results are provided in the supplementary material.

Figure 5: Dynamic Objects and Fixed Lights. Quantitative comparison of our approach to the denoising baseline, varying the sample density. We plot reconstruction accuracy in terms of MSSIM and FID over inference time. Numbers refer to the ratio of dropped pixels.

4.2 Results on Dynamic Scenes

To investigate the utility of 3D reasoning, we now turn our attention to dynamic scenes where objects (and light sources) are modified.

4.2.1 Dynamic Objects and Fixed Lights

We first train our network on a set of four scenes where objects are randomly removed or translated in the scene, but keep all light sources fixed.

Results: Fig. 4 shows qualitative and quantitative results for our approach and the baselines. We clearly see that our full model which uses both the Light Transport and the Image Synthesis Layers outperforms the other real-time approaches (lower section of the table), both qualitatively and quantitatively in terms of MSE, MSSIM, FID and Feature-L1 distance. While the non-real-time denoising approach “Denoising (1/1)” achieves the best results, the real-time denoising approach that uses much fewer samples performs the worst. We further analyze this behavior by plotting the MSSIM as a function of rendering time in Fig. 5. While denoising approaches are able to achieve compelling results, the proposed neural rendering approach provides a better accuracy/runtime trade-off while being fully differentiable.

As evident from Fig. 4, our simple feature projection baseline performs only slightly weaker than our variant without photon mapping. We attribute this to the fact that most of the light field in the scene can be encoded in local features and only dynamic parts like sharp shadows have to be learned. This highlights the capability of neural rendering approaches to learn useful heuristics from the training data. However, we also observe that our full architecture with photon mapping (which reasons more explicitly about light transport) achieves by far the best quantitative results.

[Figure 6 image panels, left to right: Feature Projection, Error, Ours w/o Photons, Error, Ours w/ Photons, Error, Ground Truth]
Figure 6: Dynamic Objects and Dynamic Lights. We show the output of the feature projection baseline and our network’s predictions with and without photons alongside the corresponding error maps for moving light sources. See supplementary for more results.

4.2.2 Dynamic Objects and Dynamic Lights

In the previous experiment, both the feature projection and our approach without photons were able to handle shadows and other illumination effects well. The reason for this is that the light sources were assumed static, making it possible to encode viewpoint-dependent light properties into the point features. However, by design the feature projection baseline is unable to acquire an understanding of illumination effects in the presence of movable light sources that are not present in the current view. To see this effect, we augment the dataset from the previous experiment by turning all static light sources off and replacing them with a rectangular area light at the ceiling, which we move randomly.

Results: Results from our method with photons, our approach without photons and the feature projection baseline are shown in Fig. 6. We observe that the feature projection baseline produces considerable artifacts while our approach with photons leads to much sharper shadows and more consistent global illumination. This is also evident from the error maps in Fig. 6. We provide a full quantitative evaluation in the supplementary material.

5 Conclusion

In this work, we have systematically investigated the importance of 3D vs. 2D reasoning for learning-based photorealistic rendering. Our experiments demonstrate that neural rendering benefits from joint 3D-2D reasoning, also confirming our hypothesis that reasoning in 3D is helpful in the presence of moving objects and light sources. In contrast to denoising methods which rely on outputs from a sampling-based renderer, the presented approach is fully differentiable and can be used for training deep neural networks end-to-end.

Supplementary Material

This supplementary document provides additional information on our approach and more experimental results. First, we provide detailed information on the Light Transport and Image Synthesis Layers in Section A. We then describe the data generation pipeline in more detail in Section B. Afterwards, we provide more information on the training procedure in Section C, including a proof showing that we can train our models using noisy, unbiased renderings as supervision signal. Finally, we provide additional qualitative and quantitative results in Section D.

Appendix A Architectures

A.1 Light Transport Layer

The core of the Light Transport Layer is a PointNet-based architecture [Qi2017CVPR] with fully-connected ResNet blocks [He2016CVPR], which is illustrated in Fig. 7. While the PointNet architecture can have arbitrary depth (number of ResNet blocks), we use a depth of two for all the experiments in the paper.

Since we train our model on dynamic scenes with a variable number of visible objects, the input point clouds have different sizes for different training samples. In theory, this is not a big problem, as PointNets can handle arbitrary point cloud sizes. However, since we are using mini-batches for training, having the same number of points for each training sample is desirable. Therefore, our model always operates on the maximum point cloud size, and invisible objects are masked in the architecture using per-point visibility flags.

Figure 7: Light Transport Layer. In the first stage of the Light Transport Layer, all points and supplementary point information are processed in a preprocessing layer, whose purpose is to align its output feature dimension with the input feature dimension of the PointNet block. The point features are then processed in two consecutive PointNet blocks. A PointNet block comprises a residual block, where local features are computed for each point. The fully connected layers (hidden, output and shortcut layers) consist of 32 output neurons each, where the weights within one layer are shared between the input points. The input dimension to the first fully connected layer in a residual block is aligned with the output dimension of a PointNet block (64). Therefore, a fully connected shortcut layer is required for matching the feature dimensions at the end of a residual block. Following the residual block within a PointNet block, point features are concatenated with a global feature, which is computed as the maximum over all local features. The second PointNet block outputs the final per-point features. Fully connected layers are denoted by fc.
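A single PointNet block with per-point visibility masking, as described in this section, might look as follows in NumPy. This is a sketch under assumptions rather than the authors' implementation: the weights are random instead of trained, the order of activations inside the residual block (pre-activation) is a guess, the global feature is taken as the element-wise maximum, and the names `init_block` and `pointnet_block` are hypothetical. The 64-to-32 dimensions and the shortcut layer follow the figure description.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_block(rng, in_dim=64, hidden_dim=32):
    """Random weights for one block (illustration only, not trained)."""
    def fc(n_in, n_out):
        scale = np.sqrt(2.0 / n_in)
        return rng.normal(scale=scale, size=(n_in, n_out)), np.zeros(n_out)
    return {"hidden": fc(in_dim, hidden_dim),
            "output": fc(hidden_dim, hidden_dim),
            "shortcut": fc(in_dim, hidden_dim)}

def pointnet_block(features, visible, params):
    """features: (N, 64) per-point features; visible: (N,) boolean flags.

    Assumes at least one point is visible.
    """
    w_h, b_h = params["hidden"]
    w_o, b_o = params["output"]
    w_s, b_s = params["shortcut"]
    # Residual block; the same weights are applied to every point.
    hidden = relu(features) @ w_h + b_h
    residual = relu(hidden) @ w_o + b_o
    # The shortcut layer maps 64 -> 32 so both branches can be added.
    local = residual + (features @ w_s + b_s)                 # (N, 32)
    # Global feature: maximum over visible points only, so masked points
    # cannot influence the features of the visible ones.
    masked = np.where(visible[:, None], local, -np.inf)
    global_feat = masked.max(axis=0)                          # (32,)
    # Concatenating local and global features restores dimension 64.
    tiled = np.broadcast_to(global_feat, local.shape)
    return np.concatenate([local, tiled], axis=1)             # (N, 64)
```

Because masked points are excluded from the maximum, every training sample can use the full-size point cloud while removed objects leave the visible points' features unchanged.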


A.2 3D-to-2D Projection Step

In the 3D-to-2D projection step, the 3D point features are projected to image space, where the point locations are discretized. Points that are occluded by the scene’s geometry are masked out, which is determined by performing an occlusion check using a rendered depth map. To make sure that we do not accidentally remove points on the scene’s surface, we allow for a tolerance in the occlusion check. If multiple features are projected to the same pixel, we compute the mean feature vector for all points projecting to that pixel. If a pixel has no points projecting to it, its feature vector is defined as zero.
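The projection step can be sketched with NumPy as below. This is an illustration under assumptions, not the authors' code: it assumes a pinhole camera with 3x3 intrinsics `K`, a 4x4 world-to-view matrix `T`, a depth map that stores view-space z values, and a placeholder tolerance `tol` (the value used in the paper is not reproduced here). The function name `project_features` is hypothetical.

```python
import numpy as np

def project_features(points, features, K, T, depth_map, tol=1e-2):
    """Average the features of all visible points that land on the same pixel.

    points: (N, 3) world coordinates; features: (N, C);
    depth_map: (H, W) view-space depth; tol: hypothetical tolerance value.
    """
    height, width = depth_map.shape
    # World -> view space using homogeneous coordinates.
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    view = (homo @ T.T)[:, :3]
    z = view[:, 2]
    in_front = z > 0
    # Perspective projection, then discretization to integer pixel indices.
    safe_z = np.where(in_front, z, 1.0)       # avoid division by zero
    uv = (view @ K.T)[:, :2] / safe_z[:, None]
    col = np.floor(uv[:, 0]).astype(int)
    row = np.floor(uv[:, 1]).astype(int)
    inside = in_front & (col >= 0) & (col < width) & (row >= 0) & (row < height)
    row, col, z, feats = row[inside], col[inside], z[inside], features[inside]
    # Occlusion check against the rendered depth map, with a tolerance so
    # that points lying on the visible surface are not discarded.
    visible = z <= depth_map[row, col] + tol
    row, col, feats = row[visible], col[visible], feats[visible]
    # Accumulate per-pixel sums and counts, then divide to obtain the mean.
    sums = np.zeros((height, width, features.shape[1]))
    counts = np.zeros((height, width))
    np.add.at(sums, (row, col), feats)
    np.add.at(counts, (row, col), 1)
    # Pixels without any projected point keep a zero feature vector.
    return sums / np.maximum(counts, 1)[..., None]
```

`np.add.at` is used instead of fancy-index assignment because several points may fall on the same pixel, and unbuffered accumulation is needed for the sums to be correct.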

A.3 Image Synthesis Layer

The inputs to the Image Synthesis Layer are the projected features from the projection step and additional information in image space, which can be computed cheaply using OpenGL shaders. These image space buffers contain geometry and material information observed from the current view. They include a depth map, an albedo map (diffuse reflectance), a normal map in world coordinates as well as a view ray map, which contains for each pixel the ray direction in world coordinates going from the camera center through the respective pixel center. The intention behind using these image space layers is to leverage the image formation process in multiple ways. The normal and view direction information can be used by the network to infer shading in image space. The albedo layer supports texture synthesis where point projections are sparse. In addition, by providing this information in image space, the Light Transport Layer can solely focus on the task of modeling the illumination in the scene. However, the image space layers do not contain useful information for reasoning about light transport in the scene. A detailed visualization and description of the Image Synthesis Layer is provided in Fig. 8.

Figure 8: Image Synthesis Layer. For final image synthesis we use a UNet [Ronneberger2015MICCAI] architecture where the resolution is reduced in three steps and expanded again. To this end, each level comprises a convolutional ResNet [He2016CVPR] block consisting of two convolutional layers and ReLU activations. We use the same feature dimension for the input, hidden and output layers in a convolutional ResNet block. The convolutional ResNet blocks are followed by a downsampling step, which is implemented using max-pooling layers. The feature dimension of the convolutional layers depends on the level: it starts at a base dimension at the first level and is doubled after each downsampling step. The features of the lowest level are then upsampled again using bilinear interpolation, concatenated with the convolutional ResNet block output from the respective downsampling level through a skip connection and processed in another convolutional ResNet block. After the last upsampling layer an additional convolutional layer is used to render an image with three channels. The numbers below the layers correspond to the number of feature maps in each layer. The numbers inside the layers correspond to the layer’s resolution, starting at a square resolution of

. We use in our static scene ablation study and for the other experiments.

Appendix B Datasets

B.1 Data Generation and Sampling

The datasets used in our experiments comprise a single scene for each static scene dataset, and four scenes for dynamic experiments [Bittlerli2016]. Since the data generation procedure for static scenes is a simplification of the dynamic case, we only describe the dynamic case in this section. Because our model uses learnable feature descriptors, we must ensure that there are point correspondences between different training samples of the same scene. To this end, we sample an initial, static point cloud for each scene, which is then modified according to the scene modifications in each training sample: if an object is removed from the scene, its points are removed from the initial point cloud; if an object is translated, the points sampled from its surface are translated accordingly. A positive side effect is that we only have to store the scene modification information for each sample, which saves memory.
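The idea of deriving every training sample from one base point cloud can be sketched as follows. The data layout (an object id per point, a dictionary of translations, a set of removed objects) and the function name `apply_modifications` are assumptions for illustration. Removal is expressed here as a visibility mask rather than by deleting rows, so that the number of points stays fixed and point indices keep corresponding across samples.

```python
import numpy as np


def apply_modifications(base_points, object_ids, translations, removed):
    """Derive a per-sample point cloud from the static base point cloud.

    base_points:  (N, 3) points sampled once per scene.
    object_ids:   (N,) id of the object each point was sampled from.
    translations: dict mapping object id -> 3D offset (hypothetical layout).
    removed:      set of object ids removed in this sample.
    """
    points = base_points.copy()
    visible = np.ones(len(points), dtype=bool)
    for obj, offset in translations.items():
        points[object_ids == obj] += np.asarray(offset, dtype=float)
    for obj in removed:
        visible[object_ids == obj] = False
    return points, visible


base = np.zeros((4, 3))
ids = np.array([0, 0, 1, 2])
# Only this small description has to be stored per training sample.
moved, mask = apply_modifications(base, ids, {0: (1.0, 0.0, 0.0)}, {2})
```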

B.2 View Sampling

For each scene, we would like to cover the space of possible viewing locations and directions as accurately as possible. At the same time, we want a high number of views in which many scene details are visible, so that training receives an effective supervision signal. We observe that most of the objects in a scene are arranged along the walls or the floor. Therefore, we sample a viewing location uniformly from a bounding box that is slightly smaller than the scene’s bounding box. Note that this means that a few of the sampled locations might lie inside an object. However, we found that these “outliers” do not pose a problem to our method in practice as long as we observe a sufficiently large number of views outside of objects during training. Next, we sample a viewing direction by sampling a look-at location uniformly from a bounding box that is half the size of the location bounding box. As a result, the distance between the camera and the scene objects is large enough to render views with rich image content.
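The view sampling procedure can be sketched with the standard library as below. Several details are assumptions: the shrink factor `location_scale=0.9` stands in for "slightly smaller", both boxes are assumed to share the scene's center, and "half the size" is read as half the edge length per axis. The helper names are hypothetical.

```python
import math
import random


def shrink_box(box_min, box_max, scale):
    """Scale an axis-aligned box about its center."""
    center = [(lo + hi) / 2.0 for lo, hi in zip(box_min, box_max)]
    half = [(hi - lo) / 2.0 * scale for lo, hi in zip(box_min, box_max)]
    return ([c - h for c, h in zip(center, half)],
            [c + h for c, h in zip(center, half)])


def sample_in_box(box_min, box_max, rng):
    return [rng.uniform(lo, hi) for lo, hi in zip(box_min, box_max)]


def sample_view(scene_min, scene_max, rng, location_scale=0.9):
    """Return a camera location and a unit viewing direction."""
    loc_box = shrink_box(scene_min, scene_max, location_scale)
    look_box = shrink_box(*loc_box, 0.5)
    while True:
        location = sample_in_box(*loc_box, rng)
        look_at = sample_in_box(*look_box, rng)
        offset = [t - p for p, t in zip(location, look_at)]
        norm = math.sqrt(sum(d * d for d in offset))
        if norm > 1e-9:  # resample in the degenerate case
            return location, [d / norm for d in offset]


rng = random.Random(0)
location, direction = sample_view([0.0, 0.0, 0.0], [4.0, 3.0, 5.0], rng)
```

Since the look-at box is concentrated around the center of the room, cameras near the walls tend to look inward.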

B.3 Point Cloud Sampling

We define a scene by a set of shapes $\mathcal{S} = \{S_1, \dots, S_M\}$, where each shape $S_i$ is itself a set of triangles. Each shape is assigned a sampling importance $w_i$ corresponding to its surface area, which is the sum of triangle areas for that shape. Given the sampling importances and a point cloud of size $N$, we first sample $N$ shapes according to a distribution where the probability of sampling a shape is proportional to its sampling importance. This can be achieved by using discrete inverse transform sampling, where a discrete cumulative distribution is calculated for the sequence of shapes:

$$F_i = \frac{\sum_{j=1}^{i} w_j}{\sum_{j=1}^{M} w_j}, \qquad i = 1, \dots, M.$$

Using a uniform sample $u \sim \mathcal{U}[0, 1)$, a shape index $k$ can be sampled according to

$$k = \min \{\, i \in \{1, \dots, M\} : u < F_i \,\},$$

which can be implemented efficiently using bisection.
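Discrete inverse transform sampling with bisection can be sketched using Python's `bisect` module. The function names `build_cdf` and `sample_index` are illustrative, and indices are zero-based here. `bisect_right` returns the first index whose cumulative value is strictly greater than the uniform sample.

```python
import bisect
import itertools
import random


def build_cdf(importances):
    """Normalized cumulative sums of the sampling importances."""
    total = float(sum(importances))
    return [c / total for c in itertools.accumulate(importances)]


def sample_index(cdf, u):
    """First index whose cumulative probability exceeds u, found by bisection."""
    # The clamp guards against rounding in the last CDF entry.
    return min(bisect.bisect_right(cdf, u), len(cdf) - 1)


areas = [1.0, 3.0]  # e.g. surface areas of two shapes
cdf = build_cdf(areas)
rng = random.Random(0)
draws = [sample_index(cdf, rng.random()) for _ in range(10000)]
fraction_second = draws.count(1) / len(draws)
```

The second shape has three times the area of the first, so it should receive roughly three quarters of the samples.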

For each shape sampled from the distribution, our goal is to obtain a point sampled uniformly from the shape’s surface. Since we work with a mesh scene representation, all the shapes are represented by a set of triangles. Therefore, for each point we first sample the triangle with the same technique we used for shape sampling, using the triangle area as sampling importance. Then, we sample a point location uniformly from the triangle. This way, uniformly distributed samples from the shapes’ surfaces can be obtained.
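The text does not state how a point is drawn uniformly from a triangle. One standard way, shown here as an assumption rather than the authors' method, is to draw two uniform numbers and reflect them into the lower-left half of the unit square, which gives uniformly distributed barycentric coordinates. The triangle area, used as sampling importance, follows from the cross product of two edges.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle: half the norm of the cross product of two edges."""
    e1 = [b[i] - a[i] for i in range(3)]
    e2 = [c[i] - a[i] for i in range(3)]
    cross = (e1[1] * e2[2] - e1[2] * e2[1],
             e1[2] * e2[0] - e1[0] * e2[2],
             e1[0] * e2[1] - e1[1] * e2[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_point_on_triangle(a, b, c, rng):
    """Uniform point on triangle (a, b, c) via the reflection trick."""
    u, v = rng.random(), rng.random()
    if u + v > 1.0:  # reflect samples from the other half of the square
        u, v = 1.0 - u, 1.0 - v
    return tuple(a[i] + u * (b[i] - a[i]) + v * (c[i] - a[i]) for i in range(3))


a, b, c = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
rng = random.Random(0)
samples = [sample_point_on_triangle(a, b, c, rng) for _ in range(1000)]
```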

B.4 Scene Modification Sampling

To train our model on all possible scene configurations (each object or light source could be located anywhere or not be present in the scene at all), we must cover this distribution well in the dataset. To this end, we manually define for each dynamic object an axis-aligned bounding box from which we sample a position for each training sample. The bounding boxes can also be limited to one or two dimensions, e.g. if an object can only be translated along a wall. Although we do not always get realistic object arrangements using this sampling strategy, this is not a limitation, as it makes our model more general (i.e. our model is trained for both realistic and non-realistic object arrangements). In addition to object translations, we randomly remove objects with a probability of 0.2 per object from the scene.
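A sketch of this sampling strategy is shown below. The removal probability of 0.2 is taken from the text; the dictionary layout, the object names and the function name `sample_modification` are hypothetical. A box that is limited to one or two dimensions is expressed by giving the remaining axes equal lower and upper bounds.

```python
import random


def sample_modification(boxes, rng, removal_prob=0.2):
    """Sample one scene configuration.

    boxes: dict mapping object name -> (box_min, box_max), both 3D
           (hypothetical layout). Returns name -> position, or None if removed.
    """
    modification = {}
    for name, (box_min, box_max) in boxes.items():
        if rng.random() < removal_prob:
            modification[name] = None  # object is absent in this sample
        else:
            modification[name] = tuple(
                rng.uniform(lo, hi) for lo, hi in zip(box_min, box_max))
    return modification


boxes = {
    "chair": ((0.0, 0.0, 0.0), (2.0, 0.0, 3.0)),  # moves on the floor plane
    "shelf": ((0.0, 0.0, 1.0), (4.0, 0.0, 1.0)),  # slides along one wall
}
rng = random.Random(0)
samples = [sample_modification(boxes, rng) for _ in range(5000)]
removed_fraction = sum(s["chair"] is None for s in samples) / len(samples)
```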

Figure 9: Data Sample Components. (1) noisy supervision, (2) ground truth rendering, (3) depth map, (4) normal map, (5) albedo map, (6) point cloud, (7) point cloud with occlusion masking, (8) per-point visibility, (9) per-point albedo, (10) per-point emitter spectrum, (11) photon origins, (12) photon intersections with scene.

B.5 Data Sample Components

In Fig. 9 we provide a visualization of the different components of a sample in our dataset. A sample consists of a noisy supervision rendering, which we obtain from a bidirectional path tracer [Jakob2010], using one sample per pixel (image (1)). For the evaluation of the test set, we render an additional ground truth rendering using 128 samples per pixel (image (2)). The images (3)–(5) in Fig. 9 are visualizations of the additional image-space layers as described in Section 3. Image (6) visualizes all points in the point cloud for that sample, and image (7) all points that are visible in the current view. Image (8) visualizes the per-point visibility of objects in the scene with yellow denoting invisible objects. The per-point visibility is needed as our architecture requires a fixed number of points as input, and is used for masking out invisible objects in the Light Transport Layer. Images (9) and (10) visualize additional point properties which are fed into the Light Transport Layer, such as the diffuse reflectance for each point as well as an emitter spectrum, which is non-zero for points lying on a light source. Images (11) and (12) show photon origins and their respective first intersection with the scene.

Appendix C Training

C.1 Hyperparameters

We train all models using the Adam optimizer [Kingma2015ICLR] with a learning rate that we decay by a constant factor after each epoch. These hyperparameters are the result of a hyperparameter optimization using grid search, where we tested different learning rates and decay rates for Adam and RMSprop for 100,000 iterations. For the static scene experiment in Section 4.1 we use a batch size of 128. For the dynamic scene experiments in Section 4.2 we use a batch size of 32, as more GPU memory is required for the Light Transport Layer implementation. All models were created and trained using PyTorch 1.0 (https://pytorch.org) [Paszke2017].

C.2 Supervised Learning with Noisy Renderings

Since rendering a large set of photorealistic images for training would require a lot of time, we use noisy renderings from a physically based renderer as supervision. More specifically, we use the bidirectional path tracing implementation in Mitsuba [Jakob2010]. Similar techniques have recently been used to learn image denoising [Lehtinen2018ICML, Krull2019CVPR]. Our key insight is that we can exploit the unbiasedness of state-of-the-art rendering algorithms like bidirectional path tracing [Lafortune1993] to obtain unbiased gradient estimates.

To this end, we describe the input to our network by a random variable $X$, which comprises a point cloud $P$, a view represented by a world-to-view transform $T$ and additional image-space information as described in Section 3. As supervision signal, we render a noisy image $\tilde{I}$ that is an unbiased estimate of the ground truth rendering $I$. When we train our network using the mean squared error (MSE) and stochastic gradient descent, our gradients will be unbiased when using these noisy supervision renderings from such an unbiased rendering algorithm. Formally, this can be expressed as follows:

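Lemma 2.

Let $X$ be an input representation of a scene, $f_\theta$ our rendering network and $\tilde{I}$ a noisy rendering of $X$ following a distribution $p(\tilde{I} \mid X)$ which depends on the chosen sampling-based rendering algorithm. Assume that the true (noise-free) rendering is given by $I(X)$. Further assume that the rendering algorithm is unbiased, i.e., $\mathbb{E}_{\tilde{I} \sim p(\tilde{I} \mid X)}[\tilde{I}] = I(X)$. In this case, the following equality holds, i.e., the gradient estimates are unbiased:

$$\mathbb{E}_{\tilde{I} \sim p(\tilde{I} \mid X)}\left[\nabla_\theta \lVert f_\theta(X) - \tilde{I} \rVert_2^2\right] = \nabla_\theta \lVert f_\theta(X) - I(X) \rVert_2^2 \tag{6}$$

Proof. Since the distribution over which the expectation is taken does not depend on the parameters $\theta$, the gradient can be pulled out of the expectation. The left side of Eq. (6) becomes

$$\nabla_\theta\, \mathbb{E}_{\tilde{I} \sim p(\tilde{I} \mid X)}\left[\lVert f_\theta(X) - \tilde{I} \rVert_2^2\right] \tag{7}$$

By applying the binomial theorem and the property of the estimator being unbiased, which means that $\mathbb{E}[\tilde{I}] = I(X)$, the expectation term can be further expanded to

$$\begin{aligned}
\mathbb{E}\left[\lVert f_\theta(X) - \tilde{I} \rVert_2^2\right]
&= \mathbb{E}\left[\lVert f_\theta(X) \rVert_2^2 - 2 f_\theta(X)^\top \tilde{I} + \lVert \tilde{I} \rVert_2^2\right] && (8) \\
&= \lVert f_\theta(X) \rVert_2^2 - 2 f_\theta(X)^\top \mathbb{E}[\tilde{I}] + \mathbb{E}\left[\lVert \tilde{I} \rVert_2^2\right] && (9) \\
&= \lVert f_\theta(X) \rVert_2^2 - 2 f_\theta(X)^\top I(X) + \mathbb{E}\left[\lVert \tilde{I} \rVert_2^2\right] && (10)
\end{aligned}$$

Taking the gradient with respect to $\theta$ in Eq. (10) allows for removing or adding terms that are constant with respect to $\theta$. Thus, we can replace $\mathbb{E}[\lVert \tilde{I} \rVert_2^2]$ with $\lVert I(X) \rVert_2^2$:

$$\nabla_\theta\, \mathbb{E}\left[\lVert f_\theta(X) - \tilde{I} \rVert_2^2\right] = \nabla_\theta \left( \lVert f_\theta(X) \rVert_2^2 - 2 f_\theta(X)^\top I(X) + \lVert I(X) \rVert_2^2 \right) = \nabla_\theta \lVert f_\theta(X) - I(X) \rVert_2^2 \tag{11}$$

Inserting this into Eq. (7) results in Eq. (6), concluding the proof. ∎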
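The statement of the lemma can be checked numerically on a toy problem. The sketch below is not the paper's setup: it replaces the rendering network by a hypothetical linear model with a scalar output and replaces the path tracer by zero-mean Gaussian noise added to a clean target. Averaging the squared-error gradients over many noisy targets should then approach the gradient computed with the clean target.

```python
import numpy as np


def mse_gradient(theta, x, target):
    """Gradient of (theta . x - target)^2 with respect to theta."""
    return 2.0 * (theta @ x - target) * x


rng = np.random.default_rng(0)
theta = np.array([0.5, -1.0, 2.0])  # toy model parameters
x = np.array([1.0, 2.0, 3.0])       # toy input
clean_target = 1.5                  # stands in for the noise-free rendering

# Unbiased noisy targets: the clean value plus zero-mean noise.
noisy_targets = clean_target + rng.normal(0.0, 1.0, size=200_000)
noisy_gradients = 2.0 * (theta @ x - noisy_targets)[:, None] * x

mean_gradient = noisy_gradients.mean(axis=0)
clean_gradient = mse_gradient(theta, x, clean_target)
```

The individual noisy gradients vary strongly, but their mean matches `clean_gradient` up to sampling error, because the squared-error gradient is linear in the target.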

Appendix D Additional Results

Figure 10: Ablation Study on Static Scene. Additional visual results for our static scene ablation study, extending the results in Fig. 3 in Section 4.1. Columns: Teaching Input, 1–6, Ground Truth.
Figure 11: Bathroom Scene. Results of our model on a realistic static bathroom scene. Columns: Teaching Input, Prediction, Ground Truth.
Figure 12: Kitchen Scene. Results of our model on a realistic static kitchen scene. Columns: Teaching Input, Prediction, Ground Truth.

For the static scenes ablation study in Section 4.1 we tested different input configurations for our network, showing that we achieve the best possible outcome by combining all of the inputs. Additional visual results for this experiment are shown in Fig. 10.

We also tested our model on two additional challenging static scenes, with results shown in Fig. 11 and Fig. 12, respectively. For this experiment, we used a realistic bathroom scene and a realistic kitchen scene [Bittlerli2016] at a fixed image resolution. Both scenes were trained with a batch size of 128 for 150,000 iterations. Although there is no light transport to be learned in these static scene experiments, we find that our model is able to encode realistic static scenes well, and renders novel views accurately.

For dynamic scenes we also conducted two experiments. In the first (Section 4.2.1), we compared our approach to a set of baselines on a dataset with dynamic objects and fixed lights, where we translated and removed objects randomly. In the second (Section 4.2.2), we highlight the importance of the Light Transport Layer and the additional photon architecture on a dataset with dynamic objects and dynamic lights, where we additionally translate rectangular light sources randomly along the ceiling. Table 1 shows the full quantitative evaluation for dynamic objects and fixed lights, and Table 2 for dynamic objects and dynamic lights.

Fig. 14 shows additional visual results for the baseline comparison for dynamic objects and fixed lights, complementing Fig. 4. In Fig. 15 we show examples where our method does not predict the illumination accurately. These error images also show that for the denoising approaches errors occur mostly in image regions with high-frequency components, i.e. edges and textures. For our approach, errors sometimes also occur in larger image regions when the prediction is inaccurate or sparse. This also explains why the MSE is lower for the denoising approaches, even though our approach performs best for most of the other metrics in Table 1 and Table 2 at a comparable time per frame.

In addition to the results shown in Fig. 6, we show visual results and error images for dynamic objects and dynamic lights in Fig. 16, as well as failure cases in Fig. 17.

In addition to the quantitative comparison for dynamic objects and fixed lights (Fig. 5), we show a more comprehensive quantitative comparison for dynamic objects and dynamic lights in Fig. 13. In addition to MSSIM and FID, we compare L1 feature losses from different stages of the Inception v3 network [Szegedy2015CVPR, Szegedy2016CVPR], showing that our approach clearly outperforms the denoising baselines on different levels of image abstraction.

| Architecture | Time / frame (s) | MSE | MSSIM | FID | Feature L1 |
| --- | --- | --- | --- | --- | --- |
| Denoising (1/1) | 1.5059 | 0.0005 | 0.880 | 26.4 | 0.163 |
| Denoising (1/4) | 0.3800 | 0.0007 | 0.867 | 28.1 | 0.172 |
| Denoising (1/16) | 0.0986 | 0.0012 | 0.835 | 38.7 | 0.203 |
| Denoising (1/32) | 0.0532 | 0.0018 | 0.813 | 54.1 | 0.233 |
| Denoising (1/64) | 0.0283 | 0.0029 | 0.781 | 94.0 | 0.281 |
| CNN only | 0.0191 | 0.0043 | 0.835 | 36.1 | 0.195 |
| Feature projection | 0.0210 | 0.0037 | 0.841 | 32.5 | 0.185 |
| Ours (w/o photons) | 0.0243 | 0.0044 | 0.841 | 31.4 | 0.184 |
| Ours (w/ photons) | 0.0459 | 0.0028 | 0.849 | 30.6 | 0.182 |

Table 1: Dynamic Objects and Fixed Lights. Quantitative evaluation for our experiment on dynamic objects and fixed lights.
| Architecture | Time / frame (s) | MSE | MSSIM | FID | Feature L1 |
| --- | --- | --- | --- | --- | --- |
| Denoising (1/1) | 1.5059 | 0.0002 | 0.930 | 17.1 | 0.137 |
| Denoising (1/4) | 0.3801 | 0.0002 | 0.923 | 17.6 | 0.143 |
| Denoising (1/16) | 0.0988 | 0.0005 | 0.896 | 23.5 | 0.172 |
| Denoising (1/32) | 0.0518 | 0.0008 | 0.874 | 38.6 | 0.207 |
| Denoising (1/64) | 0.0283 | 0.0016 | 0.839 | 84.1 | 0.269 |
| CNN only | 0.0190 | 0.0100 | 0.827 | 33.7 | 0.199 |
| Feature projection | 0.0208 | 0.0098 | 0.827 | 32.9 | 0.197 |
| Ours (w/o photons) | 0.0243 | 0.0029 | 0.871 | 30.0 | 0.184 |
| Ours (w/ photons) | 0.0468 | 0.0014 | 0.887 | 25.1 | 0.172 |

Table 2: Dynamic Objects and Dynamic Lights. Quantitative evaluation for our experiment on dynamic objects and dynamic lights.
Figure 13: Dynamic Objects and Dynamic Lights. This plot shows a quantitative comparison of our approach with the denoising baseline for different sample densities. We plot reconstruction accuracy over inference time for our experiment on dynamic objects and dynamic lights. The denoising labels refer to the ratio of pixels that are dropped. The layer indices (0–3) for the Feature L1 losses refer to outputs of the four major layers in the Inception v3 network.
Figure 14: Dynamic Objects and Fixed Lights. Additional results for our method as well as for the baselines for dynamic objects and fixed lights, complementing Fig. 4. Columns: Teaching Input, Denoising (1/1), Denoising (1/64), CNN only, Feature Projection, Ours w/o Photons, Ours w/ Photons, Ground Truth.
Figure 15: Dynamic Objects and Fixed Lights. Predictions and error images with respect to ground truth for different denoising approaches and our approach for dynamic objects and fixed lights. Error plots are shown below the respective prediction. Columns: Denoising (1/4), Denoising (1/16), Denoising (1/32), Denoising (1/64), Ours w/o Photons, Ours w/ Photons.
Figure 16: Dynamic Objects and Dynamic Lights. Additional visual results for dynamic objects and dynamic lights, complementing Fig. 6. Columns: Feature Projection and its error, Ours w/o Photons and its error, Ours w/ Photons and its error, Ground Truth.
Figure 17: Dynamic Objects and Dynamic Lights. Example scenarios that are challenging for our approach with dynamic objects and dynamic lights. We observe failure cases for specular materials and mirrors, when objects are close to the camera and in the presence of fine shadows. Columns: Feature Projection and its error, Ours w/o Photons and its error, Ours w/ Photons and its error, Ground Truth.