Deep Photon Mapping

by Shilin Zhu et al.
University of California, San Diego

Recently, deep learning-based denoising approaches have led to dramatic improvements in low sample-count Monte Carlo rendering. These approaches are aimed at path tracing, which is not ideal for simulating challenging light transport effects like caustics, where photon mapping is the method of choice. However, photon mapping requires very large numbers of traced photons to achieve high-quality reconstructions. In this paper, we develop the first deep learning-based method for particle-based rendering, and specifically focus on photon density estimation, the core of all particle-based methods. We train a novel deep neural network to predict a kernel function to aggregate photon contributions at shading points. Our network encodes individual photons into per-photon features, aggregates them in the neighborhood of a shading point to construct a photon local context vector, and infers a kernel function from the per-photon and photon local context features. This network is easy to incorporate in many previous photon mapping methods (by simply swapping the kernel density estimator) and can produce high-quality reconstructions of complex global illumination effects like caustics with an order of magnitude fewer photons compared to previous photon mapping methods.





1 Introduction

Computing global illumination is crucial for photorealistic image synthesis. Ray tracing-based methods have been widely used to simulate complex light transport effects with global illumination in film, animation, video games and other industrial fields. The most successful approaches are based on either Monte Carlo (MC) integration, like path tracing [kajiya1986rendering, veach1997robust], or particle density estimation, like photon mapping [jensen1996global]. Photon mapping techniques are able to efficiently simulate caustics and other challenging light transport effects that are very hard, or even practically impossible, for pure Monte Carlo-based methods to simulate.

In general, both MC-based and particle-based methods require numerous samples to render noise-free images, and are thus computationally expensive. Recently, significant progress has been made in denoising MC images rendered with low sample counts using deep learning techniques [chaitanya2017interactive, bako2017kernel]. However, there is relatively little work in particle-based methods for low-sample reconstruction and current photon mapping techniques still require a very large number of traced photons to achieve accurate, artifact-free radiance estimation.

We present the first deep learning-based approach for particle-based rendering that enables efficient, high-quality global illumination with a small number of photons. Our approach is particularly good at reconstructing diffuse-specular interactions like caustics, for which previous photon mapping methods require large photon sample counts (and which path tracing at reasonable sample counts can miss altogether). We focus on photon density estimation—a key component of all particle-based methods—and introduce a novel deep neural network that can estimate accurate photon density at any surface point in a scene given only sparsely distributed photons.

Previously, the most successful density estimation methods for photon mapping have been kernel-based methods that use traditional kernel functions (like a uniform or cone kernel) to compute the output radiance at a surface point as a weighted sum of nearby photons. While previous methods have improved the kernels by controlling the kernel bandwidths or shapes [kaplanyan2013adaptive, schjoth2008diffusion, kang2016adaptive], traditional kernel functions still require enough photons within a small enough bandwidth around every surface shading point to compute accurate photon density, which in turn requires tracing a very large number of photons. In contrast, we propose to learn to predict a kernel function at each shading point to effectively aggregate nearby photon contributions. Our predicted kernels leverage data priors and are able to compute accurate photon density estimates for complex global illumination from photon counts that are an order of magnitude lower than those required by traditional methods.

Our network considers local photons around a queried surface point within a predefined bandwidth as input. Unlike traditional methods that often treat photons individually or leverage standard statistics to aggregate photons, we leverage learned local photon statistics—encoded as a deep photon context vector inferred by the network—around a surface point for per-photon kernel weight estimation. Specifically, the network first processes individual photons to extract per-photon features and aggregates them across photons using pooling operations to obtain a deep photon context feature that represents the local photon statistics. The network then processes the individual per-photon features concatenated with the local context to compute per-photon kernel weights, which are used to perform density estimation by a weighted sum. We demonstrate that this approach of learning kernel prediction is more efficient than a baseline that directly estimates photon density from the aggregated deep context vector.

To train our network, we create diverse photon distributions by tracing photons in 500 procedurally generated scenes with complex shapes and materials. We sample surface points on diffuse surfaces, which form a $512 \times 512$ image (one pixel per point) in each scene, and we compute the ground-truth photon density of each point using progressive photon mapping [hachisuka2008progressive] with billions of photons. Note that, our network focuses on local photon distribution properties of surface points. Hence, every surface point in a scene is a training datum, allowing us to train a generalizable network without a large number of training images.

In our teaser figure, we demonstrate that, using only 15k photons, our method can synthesize high-quality images. Conversely, variations of path tracing and photon mapping fail to do so; even when combined with advanced progressive and adaptive techniques, SPPM and APPM require significantly more samples (1.5M photons) to achieve comparable results. This makes our approach an important step towards making photon mapping computationally efficient. Moreover, our experiments leverage an effective, practical hybrid approach: we use our method to reconstruct light-specular (LS) paths (the light transport paths that interact with specular surfaces before arriving at light sources), and use low sample-count path tracing with learning-based denoising for all other light transport paths. This combination leverages the advantages of both MC denoising and our efficient photon density estimation technique.

2 Related Work

Monte Carlo path integration. Kajiya kajiya1986rendering introduced the rendering equation and Monte Carlo (MC) path tracing. Since then, various methods for MC path integration have been developed, including light tracing [dutre1993monte], bidirectional path tracing (BDPT) [lafortune1993bi, veach1995optimally], and Metropolis light transport (MLT) [veach1997robust, pauly2000metropolis, cline2005energy]. These methods are able to simulate complex light transport with accurate global illumination in an unbiased way. However, pure MC-based methods typically require a very large number of samples (traced paths), especially for very low-probability paths like the classical caustic or specular-diffuse-specular (SDS) paths. We base our method on the photon mapping technique, which is efficient for caustics and SDS paths, and we aim to achieve sparse reconstruction.

Monte Carlo denoising. While there is little progress in sparse reconstruction with low sample counts in photon mapping, many approaches have been proposed to achieve MC rendering with low sample counts. A recent survey of sparse sampling and reconstruction is presented by Zwicker et al. zwicker2015recent. MC denoising methods can be categorized into a-priori methods that rely on prior theoretical knowledge [durand2005frequency, egan2009frequency, yan2015fast, wu2017multiple], and a-posteriori methods that filter out the noise in rendered images with few assumptions about the image signal [overbeck2009adaptive, rousselle2013robust, kalantari2015machine].

Recently, deep learning techniques have been introduced to achieve MC denoising [chaitanya2017interactive, bako2017kernel], and many methods utilize kernel prediction [bako2017kernel, vogels2018denoising, xu2019adversarial]. Kalantari et al. kalantari2015machine propose to predict the parameters of fixed filtering functions using fully-connected neural networks. Bako et al. bako2017kernel leverage deep convolutional neural networks to predict kernels that linearly combine the original noisy radiances of neighboring pixels. Gharbi et al. gharbi2019sample make use of individual screen-space path samples and predict a kernel for each sample that splats the radiance contributions to its neighboring pixels. In contrast, we apply deep learning to photon density estimation and leverage local photon statistics for density estimation from sparse photons. Our network considers individual scene-space photon samples around each shading point and predicts a kernel to gather per-photon contributions. Our approach is the first that introduces deep learning in photon mapping and demonstrates learning-based kernel prediction in this context.

Photon density estimation. The rendering equation [kajiya1986rendering, immel1986radiosity] can be approximated by particle density estimation [shirley1995global, jensen1996global, walter1997global]. Most particle-based methods are based on the original photon mapping framework [jensen1996global]; it first traces rays from light sources to distribute photons in a scene, and then gathers neighboring photons at individual shading points to approximate radiance estimates. Photon mapping achieves low variance in the rendered images and leads to blurred, less noticeable artifacts at the cost of introducing bias in the estimates. Photon mapping consistently converges to the correct solution as the number of photons increases towards infinity and the bandwidth is reduced towards zero.

Figure 1: Overview of our deep photon density estimation network. Given a set of photons within the bandwidth of a shading point, we pre-process these photons’ properties and input them to feature extractor MLPs that compute per-photon features. These are aggregated using max- and average-pooling to construct a deep context feature. The original per-photon features and the deep context are concatenated and processed by a kernel prediction MLP that predicts a kernel weight. Finally, these kernel weights are used to sum the photon contributions and produce the reflected radiance.

Previous work has investigated progressive methods to overcome the memory bottleneck and enable arbitrarily large photon numbers [hachisuka2008progressive, hachisuka2009stochastic, hachisuka2010progressive, knaus2011progressive], bidirectional methods to improve rendering glossy objects [vorba2011bidirectional], adaptive methods to optimize photon tracing [hachisuka2011robust], and the combination of unbiased MC methods and photon mapping [georgiev2011bidirectional, hachisuka2012path, georgiev2012light]. Many relevant works have been presented to improve kernel density estimation by utilizing standard statistics for adaptive kernel bandwidths [jc95, kaplanyan2013adaptive, kang2016adaptive] or anisotropic kernel shapes [schjoth2008diffusion]. Other works leverage ray differentials [schjoth2007photon], blue noise distributions [spencer2009into, spencer2013photon, spencer2013progressive], and Gaussian mixture fitting [jakob2011progressive] to improve the reconstruction. In contrast, we focus on accurately computing photon density with sparse photons, which has not been explored in previous work. Essentially, we replace the traditional kernel density estimation with a novel learning-based module, and keep the rest of the standard photon mapping framework unchanged. This potentially enables combining our technique with previous photon mapping techniques that focus on other components of the framework.
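For context on the progressive methods cited above, the radius-reduction step of progressive photon mapping [hachisuka2008progressive] can be sketched as follows. This is a minimal Python sketch of the standard update rule; the variable names are ours, and alpha = 0.7 is a typical choice from the PPM literature rather than a value taken from this paper.

```python
import math

def ppm_radius_update(r, n_accum, m_new, alpha=0.7):
    """One progressive photon mapping radius update (Hachisuka et al. 2008).

    r: current gather radius at a measurement point
    n_accum: photons accumulated so far at this point
    m_new: photons found within r in the current pass
    alpha in (0, 1): fraction of new photons kept, controlling how fast
    the radius shrinks (and hence how fast bias is reduced).
    Returns the shrunk radius and the updated accumulated photon count.
    """
    keep = (n_accum + alpha * m_new) / (n_accum + m_new)
    return r * math.sqrt(keep), n_accum + alpha * m_new
```

Iterating this update drives the radius towards zero as more passes are traced, which is what makes the estimator consistent.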

3 Background: Density estimation

Photon mapping techniques compute reflected radiance via density estimation. Kernel density estimation [wand1994kernel] is the most widely used density estimation method in statistics, and has been widely applied in photon mapping. Early works use the uniform kernel that treats nearby photons equally [jensen1996global, hachisuka2008progressive]; subsequent works extend photon density estimation to support arbitrary smooth kernels [hachisuka2010progressive, knaus2011progressive]. In general, the reflected radiance at a shading location is computed by:


$$L(x, \omega) \approx \frac{1}{N} \sum_{p} K(x, x_p; r)\, \Phi_p(x, \omega), \qquad (1)$$

where $N$ is the total number of photon paths that are emitted in a scene, $\omega$ is the reflected direction, $x_p$ is the location of a photon, $\Phi_p$ is the photon contribution and $K$ represents the kernel function with bandwidth $r$. In general, the photon contribution is the product of the BRDF and the photon energy. In this work, we only compute photon density on diffuse surfaces, as is done in many classical photon mapping methods. In this case, the BRDF at a shading point is $f_r = \rho/\pi$, where $\rho$ is the albedo. Correspondingly, $\Phi_p = f_r \tau_p$, where $\tau_p$ represents the accumulated path contribution divided by the sampling probability, which can also be interpreted as the energy flux carried by the photon. Therefore, the directional dependence can be removed and $f_r$ can be taken out of the summation in Eqn. 1. We therefore consider the photon energy $\tau_p$ as the photon contribution in this work.

The kernel $K$ assigns linear weights to photons, which are used to linearly combine the contributions of photons in a local window with radius $r$. Traditionally, $K$ is a uniform function ($K = 1/(\pi r^2)$ inside the window) or a function of the distance $\|x - x_p\|$ from the shading point to a photon (e.g., a cone kernel). Instead, we propose to leverage data priors to predict kernels to aggregate photon contributions.
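A minimal NumPy sketch of the traditional estimator with a uniform kernel may help make this concrete; the function and array names are illustrative, and the constant diffuse BRDF factor is omitted as discussed above.

```python
import numpy as np

def radiance_estimate(x, photon_pos, photon_energy, r, n_emitted):
    """Classical photon density estimation with a uniform kernel.

    x: (3,) shading point position
    photon_pos: (P, 3) photon positions
    photon_energy: (P,) energy flux carried by each photon
    r: kernel bandwidth (gather radius)
    n_emitted: total number of emitted photon paths (N in Eqn. 1)
    """
    d2 = np.sum((photon_pos - x) ** 2, axis=1)
    inside = d2 <= r * r                      # photons within the bandwidth
    # Uniform kernel: every photon inside the disc of radius r gets the
    # same weight 1 / (pi r^2); photons outside contribute nothing.
    k = 1.0 / (np.pi * r * r)
    return k * photon_energy[inside].sum() / n_emitted
```

With sparse photons this estimator is either noisy (small r) or blurry (large r), which is exactly the trade-off the learned kernel is meant to relax.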

4 Learning to compute photon density

In this section, we present our learning-based approach for photon density estimation. Our approach is light-weight and focuses on density estimation only; we keep the main framework of standard photon mapping and upgrade the traditional, distance-based and photon-independent kernel functions ($K$ in Eqn. 1) to novel, learned and local-context-aware kernel functions represented by a deep neural network (see Fig. 1).

In particular, given a shading point, our network considers its $k$ nearest neighbor photons, which adaptively selects the bandwidth $r$ as the distance to the $k$-th nearest photon. Multiple properties of individual photons are used as input for the network, including photon positions $x_p$, photon directions $d_p$ and photon contributions $\Phi_p$. We also supply the number of nearest photons $k$ to the network to let it better understand the local photon distribution. Our network regresses per-photon kernel weights $w_p$ to compute radiance estimates via a weighted sum similar to Eqn. 1:


$$L(x, \omega) \approx \frac{1}{N \pi r^2} \sum_{p=1}^{k} w_p\, \Phi_p, \qquad (2)$$

where $w_p$ represents the predicted kernel weight for photon $p$. Note that, our network uses information about all photons in a local neighborhood for per-photon kernel prediction; it obtains deep photon statistics and associates per-photon information with statistical context to compute kernels for photon aggregation.
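The adaptive bandwidth selection described above amounts to a k-nearest-neighbor query over the photon map. A brute-force sketch follows; this is illustrative only, since a practical implementation would use a spatial acceleration structure such as a kd-tree.

```python
import numpy as np

def knn_photons(x, photon_pos, k):
    """Return the indices of the k photons nearest to shading point x,
    together with the bandwidth r, i.e. the distance to the k-th
    nearest photon.

    x: (3,) shading point; photon_pos: (P, 3) photon positions.
    """
    d = np.linalg.norm(photon_pos - x, axis=1)
    idx = np.argsort(d)[:k]          # k nearest photons, closest first
    return idx, d[idx[-1]]           # bandwidth = distance to k-th nearest
```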

4.1 Input pre-processing

Photon distributions are highly diverse across shading points and across scenes, making it challenging to design a network that generalizes across different inputs. In addition, deep neural networks are known to benefit from normalized input data when correlating values from different domains. Therefore, we pre-process the input photon properties to allow for better generalizability and performance.

Since light intensities can have very high dynamic range (HDR), the photon contributions can vary widely in range, which is highly challenging for a network to process. We introduce a mapping function to pre-process the photon contributions,


$$\psi(\Phi) = \frac{\Phi}{\Phi + s},$$

where $s$ is an additional parameter. Essentially, $\psi$ maps HDR values from $[0, \infty)$ to $[0, 1)$. We further linearly map these values to $[-1, 1]$ and provide them as network input. We observe that such a mapping process facilitates the network learning.

For photon positions and directions , we first transform them into the local coordinate frame of the shading point; the coordinate frame is constructed using the position and normal of the shading point and two orthogonal directions that are randomly selected in the tangent plane. This transforms the network inputs into a consistent coordinate system and improves generalizability.

The bandwidth of our learned kernel is determined by the distance of the $k$-th nearest photon. This leads to a large range of bandwidth values given various photon distributions, which is highly challenging for a deep neural network to process. Motivated by the bandwidth normalization used in traditional kernels [wand1994kernel, shirley1995global], we divide the photon positions in the local coordinates by the bandwidth $r$, and scale the final density estimates by $1/(\pi r^2)$, as shown in Eqn. 2. This normalizes all input photon positions into a unit sphere and post-scales the computed photon density by the actual window area. As a result, our network is invariant to the actual bandwidths, effectively generalizes to different photon distributions, and supports different numbers of total emitted photons, which introduce different bandwidths for the same $k$.

Note that, the different terms of our network input are all normalized into the range of $[-1, 1]$, which enables our network to correlate and leverage different photon properties from various domains in an efficient way. Our input pre-processing also makes our network translation-, rotation- and scale-invariant to diverse photon distributions, leading to good generalization across different scenes and different numbers of emitted photons.
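The pre-processing steps above can be sketched as follows. This is an illustrative NumPy version under simplifying assumptions: a fixed helper vector replaces the randomly selected tangents, and the contribution mapping is a Reinhard-style curve Φ/(Φ+s), which is one curve satisfying the stated [0, ∞) → [0, 1) property; neither detail is confirmed by the text.

```python
import numpy as np

def preprocess_photons(x, n, photon_pos, photon_dir, photon_energy, s=1.0):
    """Normalize local photons for the network (illustrative sketch).

    x, n: shading point position and surface normal
    photon_pos/photon_dir/photon_energy: the k nearest photons
    Returns photon positions in the local frame scaled into the unit
    sphere, photon directions in the local frame, contributions mapped
    to [-1, 1], and the bandwidth r used for post-scaling.
    """
    # Build a local frame: normal plus two orthogonal tangents (a fixed
    # helper vector here, instead of the paper's random tangents).
    t = np.cross(n, [0.57735, 0.57735, 0.57735])
    t /= np.linalg.norm(t)
    b = np.cross(n, t)
    frame = np.stack([t, b, n])               # rows: tangent, bitangent, normal

    local_pos = (photon_pos - x) @ frame.T    # translate + rotate
    r = np.linalg.norm(local_pos, axis=1).max()
    local_pos /= r                            # bandwidth normalization -> unit sphere
    local_dir = photon_dir @ frame.T

    mapped = photon_energy / (photon_energy + s)   # HDR -> [0, 1)
    mapped = 2.0 * mapped - 1.0                    # -> [-1, 1]
    return local_pos, local_dir, mapped, r
```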

4.2 Network architecture

The inputs to our network are essentially a set of multi-feature 3D points in a unit sphere. In the set, there is no meaningful inherent point ordering and the number of points ($k$) is not fixed. We thus leverage PointNet-style [qi2017pointnet] neural networks with multi-layer perceptrons, which accept an arbitrary number of inputs and are invariant to permutations of the inputs. As shown in Fig. 1, our network consists of two sub-networks, a feature extractor and a kernel predictor; both are fully connected neural networks that process each photon individually.

The feature extractor first processes each individual photon; it considers the pre-processed photon properties (9 channels including positions, directions and contributions) as input, and extracts meaningful features using multilayer perceptrons. Specifically, we use three fully connected layers in the feature extractor, each followed by a ReLU activation layer. The feature extractor leverages linear and non-linear operations to transform the original input into a learned 32-channel feature vector. These per-photon features are then aggregated across photons by max-pooling and average-pooling operations, which output the deep photon context vector. This vector represents the local photon statistics in a learned, non-linearly transformed space. The kernel predictor then leverages the across-photon context and the per-photon features to predict a single scalar that represents the kernel weight for each photon. These per-photon kernel weights are the final output of our network and are used to linearly combine the original photon contributions as expressed in Eqn. 2. The kernel predictor is also a three-layer fully connected neural network with ReLU activation layers, similar to the feature extractor but with different channel counts at each layer.

Note that, unlike previous work that treats each photon independently, we propose to correlate per-photon information with local context information across photons. Our feature extractor transforms photon properties into learned feature vectors, which allows for collecting photon statistics in the learned neural feature space to obtain the photon context for the following kernel prediction. Our whole network is very light-weight, and involves only six fully connected layers; this ensures a highly efficient inference process. We show that such a light-weight network is able to effectively reconstruct accurate photon density from sparse photons.
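A NumPy sketch of the forward pass may clarify the set-based design. The weights below are random stand-ins for the trained MLPs, the layer sizes are illustrative, and the pooled context dimensions are assumptions; the key property demonstrated is that the deep context, and hence the predicted weights, are invariant to photon ordering.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights):
    """Apply a stack of fully connected layers with ReLU activations."""
    for w, b in weights:
        x = np.maximum(x @ w + b, 0.0)
    return x

def make_mlp(sizes):
    """Random stand-in weights for a trained MLP (illustrative only)."""
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

feat_net = make_mlp([9, 32, 32, 32])     # feature extractor: 9-channel photons
kern_net = make_mlp([96, 32, 32, 1])     # kernel predictor: feature + context

def predict_kernel_weights(photons):
    """photons: (k, 9) pre-processed positions, directions, contributions."""
    f = mlp(photons, feat_net)                        # (k, 32) per-photon features
    context = np.concatenate([f.max(0), f.mean(0)])   # (64,) deep photon context
    joint = np.concatenate([f, np.tile(context, (len(f), 1))], axis=1)
    return mlp(joint, kern_net)[:, 0]                 # (k,) kernel weights
```

Because the context is built with symmetric max/average pooling, permuting the input photons only permutes the output weights accordingly, which is exactly the set property PointNet-style networks provide.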

4.3 Training details

Data generation. Monte Carlo denoising usually requires a large number of images to train and is hard to generalize across different types of scenes. Our method focuses on local photon distributions; in other words, to learn proper data priors, we desire diversity of photon distributions in terms of individual shading points and not necessarily of entire scenes. This allows for good generalizability of our network with a relatively small number of training scenes, which can even be very different from our final testing scenes. Inspired by [xu2018deep, xu2019deep], we procedurally create shapes from primitives with random sizes and random bump maps; a random set of 1 to 16 such shapes is then placed in a box and distributed roughly on a grid. We also place multiple area lights with random locations and rotations in the scene, and randomly assign specular and diffuse materials to the scene objects.

Figure 2: Examples of our procedurally generated training scenes.

A few examples of these scenes are shown in Fig. 2; complex light transport effects with diverse photon distributions are simulated. To sample shading points in each scene, we shoot rays from a camera through an image plane with $512 \times 512$ pixels and select the first diffuse intersections as target shading points. We trace photons from light sources and keep the ones that contribute to the indirect lighting. Progressive photon mapping [hachisuka2008progressive] is then applied to compute ground-truth photon densities for each point with a total of about 1 billion photon paths. For each scene, we store 10 million photon paths and a multi-channel image that contains the ground-truth radiances and other necessary information (positions, normals and BRDFs) of the shading points. We create 500 scenes for training our neural networks and test our network on scenes that are significantly different from our training data (see our teaser figure and Fig. 7).

Loss function. We supervise our network with the ground-truth radiance estimates. The final radiances are in high dynamic range, which can easily make the training dominated by high-intensity values; we therefore tone-map the radiance estimates using the $\mu$-law as in [kalantari2017deep]. The mapping function is given by:


$$\mathcal{T}(L) = \frac{\log(1 + \mu L)}{\log(1 + \mu)},$$

and we set $\mu$ following [kalantari2017deep]. We tone-map both our estimated radiance and the ground-truth radiance, and we compute the loss on the mapped values.
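A small sketch of the tone-mapped loss, assuming the standard μ-law curve log(1 + μL)/log(1 + μ); the particular μ value and the use of an L1 metric here are illustrative assumptions, not values confirmed by the text.

```python
import numpy as np

def mu_law(radiance, mu):
    """Standard mu-law range compression: maps radiance in [0, 1] back to
    [0, 1] while expanding dark values, so the training loss is not
    dominated by a few very bright pixels."""
    return np.log1p(mu * radiance) / np.log1p(mu)

def tonemapped_loss(pred, gt, mu=16.0):
    """Mean absolute error between tone-mapped prediction and ground truth.
    (mu=16 and the L1 metric are illustrative choices.)"""
    return np.abs(mu_law(pred, mu) - mu_law(gt, mu)).mean()
```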

Training parameters. We randomly select $k$ from 100 to 800 and use $N$ from 0.3 million to 4 million photons to train our network, which makes it generalize well to various bandwidths and photon counts. We use Adam to train our network for 6000 epochs with a batch size of 2000 random shading points.

5 Experiments

We now present a comprehensive evaluation of our method.

Ablation study. We first justify the choices of our network design. In particular, we compare our network with a baseline network that estimates the final radiance without predicting kernels; this comparison network has a similar network architecture but directly outputs the final radiance from the across-photon deep context vector. Figure 3 shows the training processes of these networks; our network converges significantly faster than the baseline method. This demonstrates the effectiveness of combining kernel density estimation and deep learning, and is consistent with previous results on denoising for path tracing [bako2017kernel, vogels2018denoising, gharbi2019sample].

Figure 3: Optimization speed. We compare the optimization speed of our kernel-prediction network and a baseline direct-estimation network. Our network converges faster to a lower loss value.

Evaluation scenes and photon generation. We evaluate our method on six challenging scenes (Glass egg, Red wine, Rings, Water pool1, Water pool2, Dragon) that involve complex caustics and other diffuse-specular interactions with LS paths. In theory, LS paths can never be reconstructed by path tracing if we use a point light source; we therefore use area lights in the scenes to allow for reasonable comparisons with PT. For each scene, we shoot photons for 0.1 second, which generates about 0.8M photon paths with at most five photons per path; we only keep those photons that involve light-specular paths in the scenes. We denote the number of valid photons we consider as $n$, which is different from the total number of emitted photon paths $N$ in Eqn. 1. Because of the varying compositions of the scenes, there are 15k (Glass egg), 85k (Red wine), 77k (Rings), 50k (Water pool1), 100k (Water pool2) and 125k (Dragon) valid photons used in the six scenes respectively. We also evaluate with the number of photons that are traced in one second (corresponding to ten times the number of photons traced in 0.1 seconds) to justify the generalization of our network to different numbers of emitted photons, and compare with the other methods with photons that are traced in ten seconds to justify the quality of our sparse reconstruction.

Figure 4: Path tracing (PT) with and without light-specular (LS) paths. We show PT and denoising results using 100 spp with and without light-specular paths. The noise can be seen more clearly when zooming into the electronic PDF.

($n$)               Ours-50  Ours-L-50  PM-50   Ours-500  Ours-L-500  PM-500  PPM    APPM
Glass egg (15k)     0.013    0.006      0.085   0.013     0.006       0.165   0.085  0.080
          (150k)    0.012    0.006      0.036   0.008     0.004       0.079   0.065  0.043
          (1.5M)    0.013    0.007      0.031   0.006     0.003       0.027   0.030  0.030
Red wine (85k)      0.052    0.028      0.116   0.044     0.021       0.222   0.134  0.111
         (850k)     0.035    0.027      0.053   0.023     0.014       0.102   0.064  0.047
         (8.5M)     0.032    0.030      0.045   0.014     0.011       0.037   0.031  0.026
Rings (77k)         0.042    0.023      0.069   0.023     0.008       0.153   0.137  0.143
      (770k)        0.041    0.024      0.046   0.011     0.006       0.042   0.050  0.049
      (7.7M)        0.045    0.020      0.066   0.012     0.009       0.023   0.017  0.014
Water pool1 (50k)   0.244    0.174      0.281   0.214     0.146       0.323   0.327  0.277
            (500k)  0.214    0.173      0.221   0.135     0.115       0.244   0.249  0.193
            (5.0M)  0.237    0.186      0.259   0.107     0.105       0.124   0.206  0.125
Water pool2 (102k)  0.178    0.125      0.226   0.167     0.095       0.260   0.262  0.224
            (1.0M)  0.132    0.121      0.147   0.115     0.080       0.221   0.211  0.155
            (10.2M) 0.134    0.128      0.159   0.066     0.061       0.102   0.163  0.088
Dragon (125k)       0.066    0.054      0.073   0.052     0.043       0.089   0.126  0.102
       (1.2M)       0.056    0.054      0.061   0.034     0.033       0.044   0.083  0.054
       (12.5M)      0.059    0.059      0.078   0.028     0.027       0.031   0.059  0.035
Table 1: Quantitative RMSE evaluation. We test our networks trained with different $k$ ($k = 50$ and $k = 500$, denoted Ours-$k$) on six novel scenes with different numbers of valid photons ($n$). We also test a variant of our network architecture with four times the capacity (Ours-Large) using the same $k$. We compare RMSE against standard photon mapping (PM) [jensen1996global] under the same conditions, and also against progressive PM (PPM) [hachisuka2008progressive] and adaptive PPM (APPM) [kaplanyan2013adaptive]. We highlight the best and the second-best results in red and blue for each row; note that, all of them are our results. We also highlight the best result of the comparison methods in yellow, which is often worse than any of our network settings.

Combining MC denoising and deep photon mapping. We evaluate our deep photon density estimation by combining our method with MC denoising. Specifically, we apply our learning-based density estimation to compute only the challenging light transport effects that involve LS paths, which are extremely hard to trace in PT and likely to introduce caustics. In addition, we use path tracing with relatively low sample counts to compute the remaining light transport paths, and use modern learning-based denoising (the OptiX built-in denoiser based on [chaitanya2017interactive]) to remove the MC noise.

By removing LS paths from PT, we also make PT and MC denoising much easier. As shown in Fig. 4, PT without LS paths can be effectively denoised using modern learning-based denoising techniques with 100 spp, whereas full PT with LS paths introduces extensive noise at the same 100 spp, causing denoising to fail completely. In fact, the standard PT-plus-denoising pipeline is not able to recover the complex light transport effects with even 1000 spp (see our teaser figure and Fig. 5). In contrast, we demonstrate a practical way of combining our efficient deep photon mapping with MC denoising for photorealistic image synthesis, in which we leverage the benefits of low-sample reconstruction in both scene-space particle density estimation and screen-space MC integration.

Figure 5: We show our final results as full images (a). Our final results are computed by combining our deep photon mapping results and path tracing with denoising. We compare against pure path tracing using 1000 spp with (c) and without (d) denoising on insets. Clearly, path tracing alone, even with 1000 spp, cannot handle the LS paths.
Figure 6: We show results of our method with different numbers of input photons. We compare against PM, SPPM and APPM with the same number of total photons ($n$) on insets marked in the top-left ground-truth image. We also show the results of APPM and PPM with ten times the largest number of photons our method uses (j, k). The PSNRs and SSIMs of the insets are shown correspondingly.
Figure 7: We show our results on full images (a). We compare against PM with the same input photons (d) and SPPM with the same (f) and ten times (g) the total photon counts on insets. PSNRs and SSIMs are also calculated for all insets and listed below.

Photon tracing  Gathering (k=50)  Gathering (k=500)  Number of photons  DPM-50  DPM-L-50  DPM-500  DPM-L-500
0.1s            0.12s             0.5s               15k–125k           0.3s    1.0s      3.0s     10.0s
1.0s            1.2s              5.0s               150k–1.2M          0.3s    1.0s      3.0s     10.0s
10.0s           12s               50s                1.5M–12M           0.3s    1.0s      3.0s     10.0s
Table 2: Timing. We show the corresponding running time in seconds for each photon mapping component. Our experiments are run with photons that are traced within 0.1s, 1.0s and 10.0s in each scene. We list the corresponding gathering time to find the neighboring photons for about 260k surface shading points. The numbers of total photons are also shown, corresponding to the $n$ in Tab. 1. We list the network inference time for the same shading points for our regular network (DPM) and a large network (DPM-L) with $k = 50$ and 500. Note that, the network inference time is determined by the network capacity and $k$, and is independent of the number of total photons in the scene.
Mean DSSIM   Ours-50   Ours-Large-50   PM-50    Ours-500   Ours-Large-500   PM-500
             0.0346    0.0342          0.0337   0.0281     0.0277           0.0260

Table 3: Temporal stability. We show the mean DSSIM between pairs of adjacent frames over a sequence of 30 rendered frames. Results are averaged over all the different scenes and photon counts.

Parameters of our network and comparison methods. We observe that it is very hard for a single network to generalize across different numbers of input photons (k). We thus use a fixed k when training each network; specifically, we train two networks with k = 50 and k = 500 for the evaluation. We also compare with a variant of our network that has four times the channels at each layer to evaluate whether larger network capacity leads to higher performance. This large network generally performs better (see Tab. 1), but it requires about three times longer inference time (see Tab. 2); please see the following parts of this section for more discussion of quality and performance. In the experiments, we use DPM (deep photon mapping) to denote the network with regular capacity and DPM-L (or Ours-L) to denote the one with larger capacity.

In all experiments, we compare with classical photon mapping (PM) using the same k-NN photons as inputs. We also compare with various progressive methods that are designed to progressively reduce the bandwidth as photon counts grow. In particular, for density estimation at fixed surface points, we compare with progressive photon mapping (PPM) [hachisuka2008progressive]. Given a certain number of input photons, the quality of PPM is influenced by the initial radius and the number of photons per iteration. To make a fair comparison, we evaluate 30 different variants of these two parameters (10 radii and 3 photon counts per iteration) and choose the best settings (with lowest RMSEs) for each scene. We also compare with adaptive progressive photon mapping (APPM) [kaplanyan2013adaptive], similarly using the best radius and number of photons per iteration from 30 parameter variants. For visual comparisons, we compare with stochastic PPM (SPPM) [hachisuka2009stochastic] when there are transparent surfaces in a scene, which require sampling multiple surface points per pixel.
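For reference, the classical PM baseline estimates radiance by summing photon flux inside a disc whose radius is set by the k-th nearest photon. The following is a minimal sketch of that k-NN density estimate; the function name and the constant (uniform) kernel are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def knn_radiance_estimate(shading_pt, photon_pos, photon_flux, k=50):
    """Classical photon-mapping density estimate at one shading point.

    Bandwidth r = distance to the k-th nearest photon; the flux of the
    k nearest photons is summed and normalized by the disc area pi*r^2.
    (A uniform kernel is assumed here for brevity.)
    """
    d = np.linalg.norm(photon_pos - shading_pt, axis=1)
    idx = np.argsort(d)[:k]            # indices of the k nearest photons
    r = d[idx[-1]]                     # bandwidth = k-th NN distance
    return photon_flux[idx].sum(axis=0) / (np.pi * r * r)
```

This makes the bandwidth dilemma discussed below concrete: a small k (small r) yields noisy, non-smooth estimates, while a large k over-smooths sharp caustic details.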

Quantitative and qualitative evaluation. We now evaluate our method quantitatively and qualitatively with different total photon counts and different variations of training parameters (input photon number k and network capacity). Table 1 shows quantitative RMSE evaluation of photon density estimation on the six testing scenes; the numbers are averaged across about 260k surface shading points, sampled by tracing rays from a camera and selecting the first diffuse hit points in the scenes. Note that, across all these different scenes with different photon counts, our method with k = 500 performs consistently better than all the comparison PM methods, including standard PM [jensen1996global], PPM [hachisuka2008progressive] and APPM [kaplanyan2013adaptive], with the same number of total photons. Most of our results are better than PM's and PPM's results with ten times our photon counts. APPM leverages traditional statistical information of local photons to improve the density estimation of PPM and can achieve fairly good results; however, it requires the number of photons to be large enough to obtain good statistics. In contrast, our method leverages statistics learned by the network, which achieves significantly better results than APPM with the same number of photons; ours is in fact comparable to APPM using ten times the total number of photons. Note that the APPM and PPM results are selected for best performance from tens of variants with different hyper-parameters; yet our method still outperforms the best of these variants.

To visually illustrate the numbers in Tab. 1, we show all the rendering results of Rings and Red wine for the first two rows (the first two photon counts) in Fig. 6; we also show the visual results of APPM and PPM with ten times the photons in Fig. 6.j, k. Additionally, we show results of three testing scenes in Fig. 7, where we compare our DPM-500 with PM and SPPM. In Fig. 5, we show the result of our DPM-50 and compare with PT, PM, SPPM and APPM. In general, our method with k = 500 outperforms the comparison methods with the same number of photons, both qualitatively and quantitatively, and our results are comparable to (if not better than) the comparison methods that use ten times the number of photons in the scene. While the larger network with k = 500 (Ours-L-500) performs better than the regular network, it also requires longer inference time (see Tab. 2). Therefore, our regular network with k = 500 is generally the best choice for most cases, stably achieving high-quality results. However, when timing is not a critical issue, the large network is a better choice for higher accuracy.

In most cases, our network benefits from more nearest-neighbor photons (larger k) as input; the Ours-500 results are usually better than the Ours-50 ones. Essentially, a larger k allows for better local deep statistics in the deep context feature, which enables better kernel predictions. Note that this is not the case for standard PM using the same nearest-neighbor strategy for bandwidth selection: photon mapping either introduces obvious non-smooth artifacts with a small bandwidth (Fig. 6.i) or outputs over-smoothed results without details with a large bandwidth (Fig. 6.d). APPM tries to resolve this issue by wisely reducing the bandwidth according to the photon statistics. In contrast, our method achieves significantly better results than APPM when there are only sparsely distributed photons. Our method is able to leverage a relatively large bandwidth without introducing any obvious over-smoothing, thanks to our learning-based, context-aware kernel prediction approach. In particular, our approach allows every single photon to leverage across-photon information in the learned deep context feature to tell whether it is an outlier or an important contributor to the shading point's reflected radiance; a corresponding kernel weight is assigned to each photon based on data priors in the network. Therefore, our method is able to effectively utilize sparse photons over a large area to generate photorealistic images that are both smooth and detailed.
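The context-aware kernel prediction described above can be sketched as follows. This is a shape-level illustration only: the layer sizes, the ReLU/mean-pooling choices, and the weight matrices `W1`, `W2` are stand-ins for the trained network described in the paper, not its actual architecture or parameters.

```python
import numpy as np

def predict_kernel_weights(photon_feats, W1, W2):
    """Sketch of context-aware kernel prediction (shapes illustrative).

    photon_feats: (k, f) per-photon features at one shading point.
    A shared layer embeds each photon; mean-pooling over the k photons
    gives a local context vector; each photon's embedding, concatenated
    with the context, is mapped to a normalized per-photon kernel weight.
    """
    h = np.maximum(photon_feats @ W1, 0.0)        # per-photon embeddings (k, h)
    ctx = h.mean(axis=0, keepdims=True)           # photon local context (1, h)
    joint = np.concatenate([h, np.repeat(ctx, h.shape[0], axis=0)], axis=1)
    logits = joint @ W2                           # per-photon kernel logits (k, 1)
    return np.exp(logits) / np.exp(logits).sum()  # normalized kernel weights

def aggregate(photon_flux, weights):
    # Radiance estimate = kernel-weighted sum of photon contributions.
    return (weights * photon_flux).sum(axis=0)
```

Because each weight is conditioned on the pooled context, an outlier photon can be down-weighted even under a large gather radius, which is how a large bandwidth avoids the over-smoothing of a fixed kernel.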

Timing. We use OptiX to trace photons and to run path tracing for all the results. All experiments are run on one NVIDIA 1080 Ti GPU. Path tracing runs at about 50 spp per second in all six scenes at an image resolution of 512. It takes about 0.1, 1.0 and 10 seconds to emit the photons. We show the corresponding photon gathering time and network inference time for surface shading points in Tab. 2. In particular, we build kd-trees for the neighbor search at each shading point, and all methods take similar time to gather neighboring photons. Note that the running time of our network is linear in the number of input photons k; it is also determined by the number of shading points to be computed, and the listed timing corresponds to about 260k shading points. The total running time of our method is the sum of the photon tracing, gathering and network inference times; the total running time of the other methods is the sum of tracing and gathering. Note that, across all the experiments (Tab. 1, Fig. 6, Fig. 7), our DPM-500 results with photons traced in 1 second are comparable to the best results of the comparison methods with photons traced in 10 seconds; yet, to achieve these comparable results, our DPM-500 takes about 5.2s to 9s of total time, whereas the comparison methods require 22s to 60s to compute the same number of shading points. Our method thus takes significantly less time to achieve comparable quality.
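The batched kd-tree gather used by all methods in Tab. 2 can be sketched with a standard spatial-index library; the paper does not name a specific implementation, so the use of SciPy's `cKDTree` here is an illustrative assumption.

```python
import numpy as np
from scipy.spatial import cKDTree

def gather_neighbors(photon_pos, shading_pts, k=500):
    """k-NN photon gathering for a batch of shading points.

    photon_pos: (n, 3) photon positions; shading_pts: (m, 3) query points.
    Builds the tree once per photon pass, then queries all shading points,
    returning (distances, indices), each of shape (m, k).
    """
    tree = cKDTree(photon_pos)            # O(n log n) build, amortized over queries
    return tree.query(shading_pts, k=k)   # O(log n) per neighbor lookup
```

Building once and querying in batch is what makes the gathering cost in Tab. 2 grow with the photon count while the per-shading-point network inference stays constant.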

Network capacity and k. While our network is mainly trained with relatively sparse photons, our network with k = 500 overall generalizes well across different numbers of total photons and, in most cases, achieves better performance as the total photon count increases. However, for k = 50, there is too little information for the network to leverage, and higher performance is often not ensured with more total photons. Nonetheless, our network with k = 50 still works well and performs better than the comparison methods when there are tens of thousands of photons. We also observe that a larger network (Ours-L) with larger capacity leads to clearly better results than our regular network, though at a higher computational cost, i.e., longer inference time, as shown in Tab. 2. Yet the larger network with k = 50 can already often achieve reasonably good results, with a shorter running time than k = 500. We leave the exploration of more variants of the network capacity and k as future work.

Temporal consistency. Since our method operates on shading points in 3D space and is independent of view direction, we observe good across-frame consistency when changing the view in a scene with a fixed set of photons. We follow [vogels2018denoising] and use the mean DSSIM between consecutive frames to evaluate temporal consistency when moving the camera. Results in Tab. 3 show comparable temporal stability between our results and standard PM outputs. We leave extensions of our network to recurrent architectures, and general temporal consistency with other dynamic components in a scene, as future work.
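The mean-DSSIM metric of Tab. 3 is DSSIM = (1 - SSIM) / 2 averaged over adjacent frame pairs. A minimal sketch follows; note that the `global_ssim` helper here computes SSIM over the whole image in a single window, a simplification of the standard locally-windowed SSIM, so it is illustrative rather than the exact metric used in the paper.

```python
import numpy as np

def global_ssim(x, y, L=1.0):
    """Single-window SSIM over the whole image (simplified; the standard
    metric averages SSIM over local sliding windows). L is the dynamic range."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def mean_dssim(frames):
    """Mean structural dissimilarity, DSSIM = (1 - SSIM) / 2,
    over all pairs of adjacent frames in a rendered sequence."""
    pairs = zip(frames[:-1], frames[1:])
    return float(np.mean([(1.0 - global_ssim(a, b)) / 2.0 for a, b in pairs]))
```

A perfectly stable sequence (identical frames) gives a mean DSSIM of 0; larger values indicate more frame-to-frame flicker.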

Progressive density estimation. Our current framework requires a fixed number of input photons k for each trained network, whereas progressive photon mapping accepts different numbers of photons per iteration with a progressively reduced bandwidth. Nonetheless, we have demonstrated that our network architecture supports accurate photon density estimation for various fixed photon numbers. In other words, a progressive method could potentially be achieved by training a sequence of networks with different numbers of inputs. A universal network for any given number of input photons may require introducing recurrent networks into the framework, which is an interesting direction for future work.
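For context, the bandwidth-reduction rule of PPM [hachisuka2008progressive] that such a sequence of networks would need to emulate is the radius update r_{i+1}^2 = r_i^2 (N_i + alpha M_i) / (N_i + M_i), where N_i is the accumulated photon count, M_i the new photons in iteration i, and alpha the fraction of new photons kept. A small sketch (function and variable names are ours):

```python
def ppm_radius_update(r, n_accum, m_new, alpha=0.7):
    """One PPM iteration of the radius-reduction rule.

    r:       current gather radius
    n_accum: photons accumulated so far inside the radius
    m_new:   photons added this iteration
    alpha:   fraction of new photons kept (0 < alpha <= 1)
    Returns the shrunk radius and the updated accumulated count
    n' = n + alpha * m, keeping the 2D photon density consistent.
    """
    ratio = (n_accum + alpha * m_new) / (n_accum + m_new)
    return r * ratio ** 0.5, n_accum + alpha * m_new
```

With alpha = 1 the radius never shrinks (plain accumulation); smaller alpha trades variance for faster bandwidth reduction, which is the behavior our per-k networks would replace with learned kernels.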

6 Conclusions and Future Work

In this paper, we have presented the first deep learning-based method for density estimation in particle-based rendering. We introduce a deep neural network that learns a kernel function to aggregate photons at each shading point and renders accurate caustics with significantly fewer photons than previous approaches, with minimal overhead. Learning-based MC denoising has significantly improved path tracing results and our work extends these benefits to the popular photon mapping method.

Our method could be improved in the future with more advanced machine learning approaches, perhaps based on generative adversarial networks (GANs), as has been done for path tracing [xu2019adversarial]. More broadly, we believe this paper points toward denoisers specialized to many other approaches for realistic image synthesis, such as Metropolis Light Transport and Vertex Connection and Merging.