1 Introduction
Computing global illumination is crucial for photorealistic image synthesis. Ray tracing-based methods have been widely used to simulate complex light transport effects with global illumination in film, animation, video games, and other industries. The most successful approaches are based on either Monte Carlo (MC) integration, like path tracing [kajiya1986rendering, veach1997robust], or particle density estimation, like photon mapping [jensen1996global]. Photon mapping techniques are able to efficiently simulate caustics and other challenging light transport effects, which are very hard, or even impossible, for pure Monte Carlo-based methods to simulate.
In general, both MC-based and particle-based methods require numerous samples to render noise-free images, and are thus computationally expensive. Recently, significant progress has been made in denoising MC images rendered with low sample counts using deep learning techniques [chaitanya2017interactive, bako2017kernel]. However, there is relatively little work on low-sample reconstruction for particle-based methods, and current photon mapping techniques still require a very large number of traced photons to achieve accurate, artifact-free radiance estimation.
We present the first deep learning-based approach for particle-based rendering that enables efficient, high-quality global illumination with a small number of photons. Our approach is particularly good at reconstructing diffuse-specular interactions like caustics, for which previous photon mapping methods require large photon sample counts (and which path tracing at reasonable sample counts can miss altogether). We focus on photon density estimation, a key component of all particle-based methods, and introduce a novel deep neural network that can estimate accurate photon density at any surface point in a scene given only sparsely distributed photons.
The most successful density estimation methods for photon mapping have been kernel-based methods that use traditional kernel functions (like a uniform or cone kernel) to compute the output radiance at a surface point as a weighted sum of nearby photons. While previous methods have improved these kernels by controlling the kernel bandwidths or shapes [kaplanyan2013adaptive, schjoth2008diffusion, kang2016adaptive], traditional kernel functions still require enough photons within a small enough bandwidth around every shading point to compute accurate photon density, which in turn requires tracing a very large number of photons. In contrast, we propose to learn to predict a kernel function at each shading point that effectively aggregates nearby photon contributions. Our predicted kernels leverage data priors and can compute accurate photon density estimates for complex global illumination from an order of magnitude fewer photons than traditional methods.
Our network takes as input the local photons around a queried surface point within a predefined bandwidth. Unlike traditional methods that often treat photons individually or use standard statistics to aggregate them, we leverage learned local photon statistics, encoded as a deep photon context vector inferred by the network, around each surface point for per-photon kernel weight estimation. Specifically, the network first processes individual photons to extract per-photon features and aggregates them across photons using pooling operations to obtain a deep photon context feature that represents the local photon statistics. The network then processes the individual per-photon features concatenated with the local context to compute per-photon kernel weights, which are used to perform density estimation via a weighted sum. We demonstrate that this learned kernel prediction is more efficient than a baseline that directly estimates photon density from the aggregated deep context vector.
To train our network, we create diverse photon distributions by tracing photons in 500 procedurally generated scenes with complex shapes and materials. We sample surface points on diffuse surfaces, which form a 512×512 image (one pixel per point) in each scene, and we compute the ground-truth photon density of each point using progressive photon mapping [hachisuka2008progressive] with billions of photons. Note that our network focuses on the local photon distribution properties of surface points; hence, every surface point in a scene is a training datum, allowing us to train a generalizable network without a large number of training images.
In our teaser figure, we demonstrate that, using only 15k photons, our method can synthesize high-quality images. In contrast, variations of path tracing and photon mapping fail to do so; even when combined with advanced progressive and adaptive techniques, SPPM and APPM require significantly more samples (1.5M photons) to achieve comparable results. This makes our approach an important step towards making photon mapping computationally efficient. Moreover, our experiments demonstrate an effective, practical hybrid approach: we use our method to reconstruct light-specular (LS) paths (the light transport paths that interact with specular surfaces before arriving at light sources) and use low-sample-count path tracing with learning-based denoising for all other light transport paths. This leverages the advantages of both MC denoising and our efficient photon density estimation technique.
2 Related Work
Monte Carlo path integration. Kajiya [kajiya1986rendering] introduced the rendering equation and Monte Carlo (MC) path tracing. Since then, various methods for MC path integration have been developed, including light tracing [dutre1993monte], bidirectional path tracing (BDPT) [lafortune1993bi, veach1995optimally], and Metropolis light transport (MLT) [veach1997robust, pauly2000metropolis, cline2005energy]. These methods are able to simulate complex light transport with accurate global illumination in an unbiased way. However, pure MC-based methods typically require a very large number of samples (traced paths), especially for very-low-probability paths like classical caustic or specular-diffuse-specular (SDS) paths. We base our method on the photon mapping technique, which is efficient for caustics and SDS paths, and we aim to achieve sparse reconstruction.
Monte Carlo denoising. While there has been little progress on sparse, low-sample-count reconstruction for photon mapping, many approaches have been proposed to achieve MC rendering with low sample counts. A recent survey of sparse sampling and reconstruction is presented by Zwicker et al. [zwicker2015recent]. MC denoising methods can be categorized into a-priori methods that rely on prior theoretical knowledge [durand2005frequency, egan2009frequency, yan2015fast, wu2017multiple], and a-posteriori methods that filter out the noise in rendered images with few assumptions about the image signal [overbeck2009adaptive, rousselle2013robust, kalantari2015machine].
Recently, deep learning techniques have been introduced for MC denoising [chaitanya2017interactive, bako2017kernel], and many methods utilize kernel prediction [bako2017kernel, vogels2018denoising, xu2019adversarial]. Kalantari et al. [kalantari2015machine] propose to predict the parameters of fixed filtering functions using fully connected neural networks. Bako et al. [bako2017kernel] leverage deep convolutional neural networks to predict kernels that linearly combine the original noisy radiances of neighboring pixels. Gharbi et al. [gharbi2019sample] make use of individual screen-space path samples and predict a kernel for each sample that splats its radiance contribution to neighboring pixels. In contrast, we apply deep learning to photon density estimation and leverage local photon statistics for density estimation from sparse photons. Our network considers individual scene-space photon samples around each shading point and predicts a kernel to gather per-photon contributions. Our approach is the first to introduce deep learning into photon mapping and to demonstrate learning-based kernel prediction in this context.
Photon density estimation. The rendering equation [kajiya1986rendering, immel1986radiosity] can be approximated by particle density estimation [shirley1995global, jensen1996global, walter1997global]. Most particle-based methods are based on the original photon mapping framework [jensen1996global]; it first traces rays from light sources to distribute photons in a scene, and then gathers neighboring photons at individual shading points to approximate radiance estimates. Photon mapping achieves low variance in the rendered images and leads to blurred, less noticeable artifacts, at the cost of introducing bias in the estimates. It consistently converges to the correct solution as the number of photons increases towards infinity and the bandwidth is reduced towards zero.
Previous work has investigated progressive methods to overcome the memory bottleneck and enable arbitrarily large photon counts [hachisuka2008progressive, hachisuka2009stochastic, hachisuka2010progressive, knaus2011progressive], bidirectional methods to improve the rendering of glossy objects [vorba2011bidirectional], adaptive methods to optimize photon tracing [hachisuka2011robust], and combinations of unbiased MC methods and photon mapping [georgiev2011bidirectional, hachisuka2012path, georgiev2012light]. Many relevant works improve kernel density estimation by utilizing standard statistics for adaptive kernel bandwidths [jc95, kaplanyan2013adaptive, kang2016adaptive] or anisotropic kernel shapes [schjoth2008diffusion]. Other works leverage ray differentials [schjoth2007photon], blue-noise distributions [spencer2009into, spencer2013photon, spencer2013progressive], and Gaussian mixture fitting [jakob2011progressive] to improve the reconstruction. In contrast, we focus on accurately computing photon density from sparse photons, which has not been explored in previous work. Essentially, we replace traditional kernel density estimation with a novel learning-based module and keep the rest of the standard photon mapping framework unchanged. This potentially enables combining our technique with previous photon mapping techniques that focus on other components of the framework.
3 Background: Density estimation
Photon mapping techniques compute reflected radiance via density estimation. Kernel density estimation [wand1994kernel] is the most widely used density estimation method in statistics, and it has also been widely applied in photon mapping. Early works use the uniform kernel that treats nearby photons equally [jensen1996global, hachisuka2008progressive]; subsequent works extend photon density estimation to support arbitrary smooth kernels [hachisuka2010progressive, knaus2011progressive]. In general, the reflected radiance at a shading location $x$ is computed by:
\[ L(x, \omega) = \frac{1}{N} \sum_{p} K_r(x, x_p)\, \Phi_p(x, \omega) \quad (1) \]
where $N$ is the total number of photon paths emitted in the scene, $\omega$ is the reflected direction, $x_p$ is the location of photon $p$, $\Phi_p$ is the photon contribution, and $K_r$ represents the kernel function with bandwidth $r$. In general, the photon contribution is the product of the BRDF and the photon energy. In this work, we only compute photon density on diffuse surfaces, as is done in many classical photon mapping methods. In this case, the BRDF at a shading point is $\rho/\pi$, where $\rho$ is the albedo. Correspondingly, $\Phi_p = (\rho/\pi)\,\tau_p$, where $\tau_p$ represents the accumulated path contribution divided by the sampling probability, which can also be interpreted as the energy flux carried by the photon. Therefore, the BRDF can be removed from the per-photon terms and the constant $\rho/\pi$ can be taken out of the summation in Eqn. 1. We therefore consider the photon energy $\tau_p$ as the photon contribution in this work.
The kernel assigns linear weights to photons, which are used to linearly combine the contributions of the photons in a local window with radius $r$. Traditionally, $K_r$ is a uniform function (e.g., $K_r = 1/(\pi r^2)$ inside the window) or a function of the distance $\lVert x - x_p \rVert$ from the shading point to a photon. Instead, we propose to leverage data priors to predict kernels that aggregate the photon contributions.
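As a concrete reference point, the traditional kNN estimator can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the cone kernel, function names, and array conventions are all assumptions.

```python
import numpy as np

def knn_radiance_estimate(shading_pos, photon_pos, photon_energy,
                          k=50, n_emitted=1_000_000):
    """Classical kNN photon density estimation (cf. Eqn. 1) with a cone kernel.

    shading_pos: (3,) query point; photon_pos: (M, 3); photon_energy: (M, C).
    """
    d = np.linalg.norm(photon_pos - shading_pos, axis=1)
    idx = np.argsort(d)[:k]        # k nearest photons
    r = d[idx].max()               # bandwidth = distance to k-th nearest photon
    # Cone kernel: weight falls off linearly with distance, normalized so the
    # kernel integrates to one over the disc of radius r (hence 3 / (pi r^2)).
    w = (1.0 - d[idx] / r) * (3.0 / (np.pi * r * r))
    # Weighted sum of photon energies, divided by the emitted path count.
    return (w[:, None] * photon_energy[idx]).sum(axis=0) / n_emitted
```

Brute-force search is used here for clarity; a real renderer would query a kd-tree for the k nearest photons.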
4 Learning to compute photon density
In this section, we present our learning-based approach for photon density estimation. Our approach is lightweight and focuses on density estimation only; we keep the main framework of standard photon mapping and upgrade the traditional, distance-based and photon-independent kernel functions ($K_r$ in Eqn. 1) to novel, learned, local-context-aware kernel functions represented by a deep neural network (see Fig. 1).
In particular, given a shading point, our network considers its $k$ nearest neighbor photons, which adaptively selects the bandwidth $r$ as the distance to the farthest of these photons. Multiple properties of the individual photons are used as network input, including the photon positions $x_p$, photon directions $d_p$ and photon contributions $\tau_p$. We also supply the number of nearest photons $k$ to the network to let it better understand the local photon distribution. Our network regresses per-photon kernel weights to compute radiance estimates via a weighted sum similar to Eqn. 1:
\[ L(x, \omega) = \frac{1}{N r^2} \sum_{p=1}^{k} w_p\, \tau_p \quad (2) \]
where $w_p$ represents the predicted kernel weight for photon $p$. Note that our network uses information about all photons in a local neighborhood for per-photon kernel prediction; it obtains deep photon statistics and associates per-photon information with this statistical context to compute kernels for photon aggregation.
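A minimal sketch of how predicted weights slot into the estimator, with `predict_kernel` standing in for the trained network; all names and the exact normalization convention are illustrative assumptions, not the paper's code:

```python
import numpy as np

def render_point(shading_pos, photon_pos, photon_energy,
                 predict_kernel, k, n_emitted):
    """Density estimation with a learned kernel (cf. Eqn. 2).

    `predict_kernel` maps the k normalized local photons to k scalar weights.
    """
    d = np.linalg.norm(photon_pos - shading_pos, axis=1)
    idx = np.argsort(d)[:k]
    r = d[idx].max()                             # adaptive kNN bandwidth
    local = (photon_pos[idx] - shading_pos) / r  # positions in the unit sphere
    w = predict_kernel(local)                    # (k,) predicted kernel weights
    # Weighted sum, post-scaled by 1 / (n_emitted * r^2).
    return (w[:, None] * photon_energy[idx]).sum(axis=0) / (n_emitted * r * r)
```

With `predict_kernel` returning the constant 1/pi, this reduces to a uniform-kernel estimator, which is a useful sanity check.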
4.1 Input preprocessing
Photon distributions are highly diverse across shading points and across scenes, making it challenging to design a network that generalizes across different inputs. In addition, deep neural networks are known to benefit from normalized input data when correlating values from different domains. Therefore, we preprocess the input photon properties to improve generalizability and performance.
Since light intensities can have a very high dynamic range (HDR), the photon contributions can vary widely in magnitude, which is highly challenging for a network to process. We introduce a mapping function $m$ to preprocess the photon contributions,
\[ m(v) = \frac{v}{v + s} \quad (3) \]
where $s$ is an additional parameter. Essentially, $m$ maps HDR values from $[0, \infty)$ to $[0, 1)$. We further linearly map these values to $[-1, 1]$ and provide them as network input. We observe that this mapping facilitates the network's learning.
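One simple compressive map with these properties is v/(v+s); the sketch below uses it purely as an illustrative stand-in, and the final rescaling to [-1, 1] via 2m-1 is likewise an assumption:

```python
def map_contribution(v, s=1.0):
    """Range-compress an HDR photon contribution.

    m(v) = v / (v + s) maps [0, inf) to [0, 1); the linear map 2*m - 1 then
    takes the result to [-1, 1). Both `s` and the rescaling are illustrative.
    """
    m = v / (v + s)
    return 2.0 * m - 1.0
```

The map is monotone, so relative ordering of photon contributions is preserved while extreme values are compressed.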
For the photon positions and directions, we first transform them into the local coordinate frame of the shading point; the frame is constructed from the position and normal of the shading point and two orthogonal directions randomly selected in the tangent plane. This transforms the network inputs into a consistent coordinate system and improves generalizability.
The bandwidth $r$ of our learned kernel is determined by the distance to the $k$-th nearest photon. This leads to a large range of bandwidth values across photon distributions, which is highly challenging for a deep neural network to process. Motivated by the bandwidth normalization used in traditional kernels [wand1994kernel, shirley1995global], we divide the photon positions in local coordinates by the bandwidth $r$ and scale the final density estimate by $1/r^2$, as shown in Eqn. 2. This normalizes all input photon positions into a unit sphere and post-scales the computed photon density by the actual window area. As a result, our network is invariant to the absolute bandwidth, and it effectively generalizes to different photon distributions and supports different numbers of total emitted photons, which induce different bandwidths for the same $k$.
Note that all terms of our network input are normalized into the range $[-1, 1]$, which enables our network to correlate and leverage photon properties from various domains in an efficient way. Our input preprocessing also makes our network translation, rotation, and scale invariant with respect to diverse photon distributions, leading to good generalization across different scenes and different numbers of emitted photons.
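The geometric part of this preprocessing can be sketched as follows; the random-tangent construction and the return convention are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def preprocess_photons(positions, directions, sp_pos, sp_normal):
    """Express photons in the shading point's local frame and normalize
    positions into the unit sphere by the kNN bandwidth r (cf. Sec. 4.1)."""
    n = sp_normal / np.linalg.norm(sp_normal)
    # Random tangent: project a random vector onto the tangent plane.
    rng = np.random.default_rng()
    t = rng.normal(size=3)
    t -= n * np.dot(t, n)
    t /= np.linalg.norm(t)
    b = np.cross(n, t)                       # second tangent direction
    frame = np.stack([t, b, n])              # rows: local axes (rotation)
    local_pos = (positions - sp_pos) @ frame.T
    r = np.linalg.norm(local_pos, axis=1).max()   # kNN bandwidth
    return local_pos / r, directions @ frame.T, r
```

Because the frame is a pure rotation, photon direction vectors keep unit length, and the division by r puts every position inside the unit sphere.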
4.2 Network architecture
The inputs to our network are essentially a set of multi-feature 3D points in a unit sphere. The set has no meaningful inherent point ordering, and the number of points ($k$) is not fixed. We thus leverage PointNet-style [qi2017pointnet] neural networks with multi-layer perceptrons, which accept an arbitrary number of inputs and are invariant to input permutations. As shown in Fig. 1, our network consists of two sub-networks, a feature extractor and a kernel predictor; both are fully connected neural networks that process each photon individually. The feature extractor first processes each individual photon; it takes the preprocessed photon properties (9 channels comprising positions, directions and contributions) as input and extracts meaningful features using multi-layer perceptrons. Specifically, we use three fully connected layers in the feature extractor, each followed by a ReLU activation. The feature extractor leverages linear and non-linear operations to transform the original input into a learned 32-channel feature vector. These per-photon features are then aggregated across photons by max-pooling and average-pooling operations, which output the deep photon context vector. This vector represents the local photon statistics in a learned, non-linearly transformed space. The kernel predictor then leverages the across-photon context together with the per-photon features to predict a single scalar kernel weight for each photon. These per-photon kernel weights are the final output of our network and are used to linearly combine the original photon contributions as expressed in Eqn. 2. The kernel predictor is also a three-layer fully connected neural network with ReLU activations, similar to the feature extractor but with different channel counts at each layer.
Note that, unlike previous work that treats each photon independently, we propose to correlate per-photon information with local context information across photons. Our feature extractor transforms photon properties into learned feature vectors, which allows for collecting photon statistics in the learned neural feature space to obtain the photon context for the subsequent kernel prediction. Our whole network is very lightweight, involving only six fully connected layers; this ensures a highly efficient inference process. We show that such a lightweight network is able to effectively reconstruct accurate photon density from sparse photons.
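A minimal numpy sketch of this two-sub-network design follows. The 9-channel input and 32-channel feature size follow the text; the hidden widths of the kernel predictor, the weight initialization, and the lack of biases are placeholders (untrained weights, for shape and invariance illustration only):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class DeepKernelPredictor:
    """Sketch of the feature extractor + kernel predictor (Sec. 4.2)."""

    def __init__(self, in_ch=9, feat_ch=32, seed=0):
        rng = np.random.default_rng(seed)
        # Feature extractor: three FC layers -> 32-channel per-photon features.
        self.fe = [rng.normal(0, 0.1, (a, b))
                   for a, b in [(in_ch, 32), (32, 32), (32, feat_ch)]]
        # Kernel predictor: three FC layers -> one scalar weight per photon.
        # Its input is the per-photon feature concatenated with the
        # max-pooled and average-pooled context (3 * feat_ch channels).
        self.kp = [rng.normal(0, 0.1, (a, b))
                   for a, b in [(3 * feat_ch, 64), (64, 32), (32, 1)]]

    def __call__(self, photons):                  # photons: (k, 9)
        f = photons
        for w in self.fe:
            f = relu(f @ w)                       # (k, 32) per-photon features
        # Deep photon context: symmetric pooling across photons.
        ctx = np.concatenate([f.max(axis=0), f.mean(axis=0)])
        g = np.concatenate([f, np.tile(ctx, (len(f), 1))], axis=1)
        for w in self.kp[:-1]:
            g = relu(g @ w)
        return (g @ self.kp[-1]).ravel()          # (k,) kernel weights
```

Because pooling is symmetric, permuting the input photons permutes the output weights identically, which is the PointNet-style property the text relies on.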
4.3 Training details
Data generation. Monte Carlo denoising usually requires a large number of images to train and is hard to generalize across different types of scenes. Our method focuses on local photon distributions; in other words, to learn proper data priors, we need diversity of photon distributions at individual shading points, not necessarily of entire scenes. This allows for good generalizability with a relatively small number of training scenes, which can even be very different from our final test scenes. Inspired by [xu2018deep, xu2019deep], we procedurally create shapes from primitives with random sizes and random bump maps; a set of such shapes (randomly from 1 to 16) is then placed in a box and distributed roughly as a grid. We also place multiple area lights with random locations and rotations in the scene, and randomly assign specular and diffuse materials to the scene objects.
A few examples of these scenes are shown in Fig. 2; they exhibit complex light transport effects with diverse photon distributions. To sample shading points in each scene, we shoot rays from the camera through an image plane with 512×512 pixels and select the first diffuse intersections as target shading points. We trace photons from the light sources and keep the ones that contribute to indirect lighting. Progressive photon mapping [hachisuka2008progressive] is then applied to compute ground-truth photon densities for each point, using a total of about 1 billion photon paths. For each scene, we store 10 million photon paths and a multi-channel image that contains the ground-truth radiances and other necessary information (positions, normals and BRDFs) of the shading points. We create 500 scenes for training our neural networks and test on scenes that are significantly different from our training data (see the teaser figure and Fig. 7).
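For reference, the radius-reduction step at the core of progressive photon mapping [hachisuka2008progressive], which we use here to compute ground truth, can be sketched as below; flux accumulation is omitted, and `alpha` is the standard user parameter in (0, 1) controlling how aggressively the radius shrinks:

```python
def ppm_radius_update(r, n_acc, m_new, alpha=0.7):
    """One PPM iteration's radius update [hachisuka2008progressive].

    r:      current gather radius
    n_acc:  photons accumulated so far
    m_new:  photons gathered in this iteration
    Keeps only a fraction alpha of the new photons, so the radius shrinks
    while the accumulated count grows, giving a consistent estimator.
    """
    n_next = n_acc + alpha * m_new
    r_next = r * ((n_acc + alpha * m_new) / (n_acc + m_new)) ** 0.5
    return r_next, n_next
```

Iterating this update drives the radius toward zero while the photon count grows without bound, which is exactly the consistency argument behind using PPM with ~1 billion paths as ground truth.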
Loss function. We supervise our network with the ground-truth radiance estimates. The final radiances are in high dynamic range, which can easily make training dominated by high-intensity values; we therefore tonemap the radiance estimates using the $\mu$-law as in [kalantari2017deep]. The mapping function is given by:
\[ T(v) = \frac{\log(1 + \mu v)}{\log(1 + \mu)} \quad (4) \]
and we set $\mu$ following [kalantari2017deep]. We tonemap both our estimated radiance and the ground-truth radiance, and we compute the loss on the mapped values.
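A sketch of the tonemapped loss; the particular value of mu and the choice of an L1-style penalty here are assumptions for illustration, not the paper's exact settings:

```python
import numpy as np

def mu_law(v, mu=16.0):
    """Mu-law range compression: log(1 + mu*v) / log(1 + mu).
    Maps [0, 1] to [0, 1] with strong compression of bright values;
    the default mu is a placeholder."""
    return np.log1p(mu * np.asarray(v, dtype=float)) / np.log1p(mu)

def training_loss(pred, gt, mu=16.0):
    """Mean absolute error on tonemapped radiance (L1 is an assumption)."""
    return np.abs(mu_law(pred, mu) - mu_law(gt, mu)).mean()
```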
Training parameters. We randomly select $k$ from 100 to 800 and use total photon counts from 0.3 million to 4 million to train our network, which makes it generalize well to various bandwidths and photon counts. We use Adam to train our network for 6000 epochs with a batch size of 2000 random shading points.
5 Experiments
We now present a comprehensive evaluation of our method.
Ablation study. We first justify the choices of our network design. In particular, we compare our network with a baseline network that estimates the final radiance without predicting kernels; this baseline has a similar network architecture but directly outputs the final irradiance from the across-photon deep context vector. Figure 3 shows the training curves of these networks; our network converges significantly faster than the baseline. This demonstrates the effectiveness of combining kernel density estimation and deep learning, and it is consistent with previous results on denoising for path tracing [bako2017kernel, vogels2018denoising, gharbi2019sample].
Evaluation scenes and photon generation. We evaluate our method on six challenging scenes (Glass egg, Red wine, Rings, Water pool1, Water pool2, Dragon) that involve complex caustics and other diffuse-specular interactions with LS paths. In theory, LS paths can never be reconstructed by path tracing if we use a point light source; we therefore use area lights in the scenes to allow for reasonable comparisons with PT. For each scene, we shoot photons for 0.1 seconds, which generates about 0.8M photon paths with at most five photons per path; we only keep the photons that lie on light-specular paths in the scenes. We denote the number of valid photons we consider as $N_v$, which differs from the total number of emitted photon paths $N$ in Eqn. 1. Because of the various scene compositions, 15k (Glass egg), 85k (Red wine), 77k (Rings), 50k (Water pool1), 100k (Water pool2) and 125k (Dragon) valid photons are used in the six scenes, respectively. We also evaluate with the number of photons traced in one second (ten times the number traced in 0.1 seconds) to justify the generalization of our network to different numbers of emitted photons, and we compare with the other methods using photons traced in ten seconds to justify the quality of our sparse reconstruction.
Table 1. RMSE of photon density estimation on the six test scenes (lower is better); $N_v$ is the number of valid photons.

Scene | ($N_v$) | Ours50 | OursL50 | PM50 | Ours500 | OursL500 | PM500 | PPM | APPM
Glass egg | (15k) | 0.013 | 0.006 | 0.085 | 0.013 | 0.006 | 0.165 | 0.085 | 0.080
 | (150k) | 0.012 | 0.006 | 0.036 | 0.008 | 0.004 | 0.079 | 0.065 | 0.043
 | (1.5M) | 0.013 | 0.007 | 0.031 | 0.006 | 0.003 | 0.027 | 0.030 | 0.030
Red wine | (85k) | 0.052 | 0.028 | 0.116 | 0.044 | 0.021 | 0.222 | 0.134 | 0.111
 | (850k) | 0.035 | 0.027 | 0.053 | 0.023 | 0.014 | 0.102 | 0.064 | 0.047
 | (8.5M) | 0.032 | 0.030 | 0.045 | 0.014 | 0.011 | 0.037 | 0.031 | 0.026
Rings | (77k) | 0.042 | 0.023 | 0.069 | 0.023 | 0.008 | 0.153 | 0.137 | 0.143
 | (770k) | 0.041 | 0.024 | 0.046 | 0.011 | 0.006 | 0.042 | 0.050 | 0.049
 | (7.7M) | 0.045 | 0.020 | 0.066 | 0.012 | 0.009 | 0.023 | 0.017 | 0.014
Water pool1 | (50k) | 0.244 | 0.174 | 0.281 | 0.214 | 0.146 | 0.323 | 0.327 | 0.277
 | (500k) | 0.214 | 0.173 | 0.221 | 0.135 | 0.115 | 0.244 | 0.249 | 0.193
 | (5.0M) | 0.237 | 0.186 | 0.259 | 0.107 | 0.105 | 0.124 | 0.206 | 0.125
Water pool2 | (102k) | 0.178 | 0.125 | 0.226 | 0.167 | 0.095 | 0.260 | 0.262 | 0.224
 | (1.0M) | 0.132 | 0.121 | 0.147 | 0.115 | 0.080 | 0.221 | 0.211 | 0.155
 | (10.2M) | 0.134 | 0.128 | 0.159 | 0.066 | 0.061 | 0.102 | 0.163 | 0.088
Dragon | (125k) | 0.066 | 0.054 | 0.073 | 0.052 | 0.043 | 0.089 | 0.126 | 0.102
 | (1.2M) | 0.056 | 0.054 | 0.061 | 0.034 | 0.033 | 0.044 | 0.083 | 0.054
 | (12.5M) | 0.059 | 0.059 | 0.078 | 0.028 | 0.027 | 0.031 | 0.059 | 0.035
Combining MC denoising and deep photon mapping. We evaluate our deep photon density estimation by combining our method with MC denoising. Specifically, we apply our learning-based density estimation only to the challenging light transport effects involving LS paths, which are extremely hard to sample in PT and likely to introduce caustics. In addition, we use path tracing with relatively low sample counts to compute the remaining light transport paths, and use modern learning-based denoising (the OptiX built-in denoiser based on [chaitanya2017interactive]) to remove the MC noise.
By removing LS paths from PT, we also make PT and MC denoising much easier. As shown in Fig. 4, PT without LS paths can be effectively denoised using modern learning-based denoising techniques at 100 spp, whereas full PT with LS paths introduces extensive noise at the same 100 spp, causing denoising to fail completely. In fact, the standard PT-plus-denoising pipeline is not able to recover the complex light transport effects even with 1000 spp (see Fig. 5 and the teaser figure). In contrast, we demonstrate a practical way of combining our efficient deep photon mapping with MC denoising for photorealistic image synthesis, leveraging the benefits of low-sample reconstruction in both scene-space particle density estimation and screen-space MC integration.
Table 2. Photon tracing and gathering times, photon counts, and network inference times (DPM/DPML with $k$ = 50 and $k$ = 500).

Photon tracing | Photon gathering | Number of photons | DPM50 | DPML50 | DPM500 | DPML500
0.1s | 0.12s–0.5s | 15k–125k | 0.3s | 1.0s | 3.0s | 10.0s
1.0s | 1.2s–5.0s | 150k–1.2M | 0.3s | 1.0s | 3.0s | 10.0s
10.0s | 12s–50s | 1.5M–12M | 0.3s | 1.0s | 3.0s | 10.0s

Mean DSSIM | Ours50 | OursLarge50 | PM50 | Ours500 | OursLarge500 | PM500
 | 0.0346 | 0.0342 | 0.0337 | 0.0281 | 0.0277 | 0.0260
Parameters of our network and comparison methods. We observe that it is very hard for a single network to generalize across different numbers of input photons ($k$). We thus use a fixed $k$ when training each network; specifically, we train two networks with $k$ = 50 and $k$ = 500 for the evaluation. We also compare with a variant of our network that has four times the channels at each layer, to evaluate whether larger network capacity leads to higher performance. This larger network generally performs better (see Tab. 1), but it requires about three times longer inference time (see Tab. 2); please see the following parts of this section for more discussion of quality and performance. In the experiments, we use DPM (deep photon mapping) to denote the network with regular capacity and DPML (or OursL) to denote the one with larger capacity.
In all experiments, we compare with classical photon mapping (PM) using the same kNN photons as input. We also compare with various progressive methods that are designed to progressively reduce the bandwidth with large photon counts. In particular, for density estimation at fixed surface points, we compare with progressive photon mapping (PPM) [hachisuka2008progressive]. Given a certain number of input photons, the quality of PPM is influenced by the initial radius and the number of photons per iteration. To make a fair comparison, we evaluate 30 different variants of these two parameters (10 radii and 3 photon counts per iteration) and choose the best settings (with the lowest RMSEs) for each scene. We also compare with adaptive progressive photon mapping (APPM) [kaplanyan2013adaptive], similarly using the best radius and photon count per iteration among 30 parameter variants. For visual comparisons, we compare with stochastic PPM (SPPM) [hachisuka2009stochastic] when a scene contains transparent surfaces, which require sampling multiple surface points per pixel.
Quantitative and qualitative evaluation. We now evaluate our method quantitatively and qualitatively with different photon counts ($N_v$) and different variations of training parameters (input photon number $k$ and network capacity). Table 1 shows a quantitative RMSE evaluation of photon density estimation on the six test scenes; the numbers are averaged over about 260k surface shading points, sampled by tracing rays from the camera and selecting the first diffuse hit points in the scenes. Note that, across all these scenes and photon counts, our method with $k$ = 500 performs consistently better than all the comparison PM methods, including standard PM [jensen1996global], PPM [hachisuka2008progressive] and APPM [kaplanyan2013adaptive], with the same total number of photons. Most of our results are better than PM's and PPM's results with ten times our photon count. APPM leverages traditional statistical information about local photons to improve the density estimation of PPM, which achieves fairly good results; however, it requires a large enough number of photons to obtain good statistics. In contrast, our method leverages statistics learned by the network, achieving significantly better results than APPM with the same number of photons; ours is actually comparable to APPM using ten times the total number of photons. Note that the APPM and PPM results are the best among tens of variants with different hyper-parameters; yet, our method still outperforms the best of these variants.
To visually illustrate the numbers in Tab. 1, we show the rendering results of Rings and Red wine for the first two rows (first two $N_v$) in Fig. 6; we also show visual results of APPM and PPM with larger $N_v$ in Fig. 6.j, k. Additionally, we show results of three test scenes in Fig. 7, where we compare our DPM500 with PM and SPPM. In the teaser figure, we show the result of our DPM50 and compare with PT, PM, SPPM and APPM. In general, our method with $k$ = 500 outperforms the comparison methods with the same number of photons, both qualitatively and quantitatively, and our results are comparable to (if not better than) those of the comparison methods using ten times the number of photons in the scene. While the larger network with $k$ = 500 (OursL500) performs better than the regular network, it also requires longer inference time (see Tab. 2). Therefore, our regular network with $k$ = 500 is generally the best choice for most cases, stably achieving high-quality results. However, when timing is not a critical issue, the large network is a better choice for higher accuracy.
In most cases, our network favors more nearest neighbor photons (larger $k$) as input; the Ours500 results are usually better than the Ours50 ones. Essentially, a larger $k$ allows for better local deep statistics in the deep context feature, which enables better kernel predictions. Note that this is not the case for standard PM using the same nearest-neighbor strategy for bandwidth selection: photon mapping either introduces obvious non-smooth artifacts with a small bandwidth (Fig. 6.i) or outputs over-smooth results without details with a large bandwidth (Fig. 6.d). APPM tries to resolve this issue by wisely reducing the bandwidth according to photon statistics. In contrast, our method achieves significantly better results than APPM when photons are only sparsely distributed. Our method is able to leverage a relatively large bandwidth without introducing obvious over-smoothing, thanks to our learning-based, context-aware kernel prediction. In particular, our approach allows every single photon to leverage across-photon information in the learned deep context feature to tell whether it is an outlier or an important contributor to the shading point's reflected radiance; a corresponding kernel weight is assigned to each photon based on this decision, driven by the data priors in the network. Therefore, our method is able to effectively utilize sparse photons in a large area to generate photorealistic images that are smooth and rich in detail.
Timing. We use OptiX to trace photons and to perform path tracing for all the results. All experiments are run on one NVIDIA 1080 Ti GPU. Path tracing runs at about 50 spp per second in all six scenes at an image resolution of 512×512. It takes about 0.1, 1.0 and 10 seconds to emit the three photon counts, respectively. We show the corresponding photon gathering time and network inference time for surface shading points in Tab. 2. In particular, we build kd-trees to perform the neighbor search at each shading point, and all methods take similar time to gather neighboring photons. Note that the running time of our network is linear in the number of input photons K; it is also determined by the number of shading points that need to be computed, and the listed timing corresponds to computing all shading points of one image. The total running time for our method is the sum of the photon tracing, gathering and network inference times; the total running time for the other methods is the sum of tracing and gathering. Note that, across all the experiments (Tab. 1, Fig. 6, Fig. 7), our DPM500 results with photons traced in 1 second are comparable to the best results of the comparison methods with photons traced in 10 seconds; however, to achieve those comparable results, our DPM500 takes about 5.2s–9s of total time, whereas the comparison methods require 22s–60.0s to compute the same number of shading points. Our method thus takes significantly less time to achieve comparable quality.
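The gathering step shared by all methods can be sketched with an off-the-shelf kd-tree; the photon positions, point counts, and K below are toy values, and the actual renderer performs this query on the GPU alongside tracing.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
photons = rng.uniform(size=(20_000, 3))        # photon positions (toy data)
shading_points = rng.uniform(size=(1024, 3))   # surface points to shade

tree = cKDTree(photons)                        # built once per photon pass
dists, idx = tree.query(shading_points, k=500) # K=500 nearest photons each
print(dists.shape, idx.shape)                  # (1024, 500) (1024, 500)
```

Because every method (PM, SPPM, APPM, ours) uses the same K-nearest-neighbor gather, the timing difference in Tab. 2 comes from what happens after the query: kernel evaluation for the baselines versus network inference for ours.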
Network capacity and K. While our network is mainly trained with relatively sparse photons (small N), our network with K=500 overall generalizes well across different numbers of total photons (N) and, in most cases, achieves better performance as N increases. However, for K=50, there is too little information for the network to leverage, and higher performance is not guaranteed with a larger N. Nonetheless, our network with K=50 still works well and performs better than the comparison methods when there are tens of thousands of photons. We also observe that a larger network (OursL) with larger capacity leads to clearly better results than our regular network. Of course, a larger network incurs a higher computational cost, i.e., a longer inference time, as shown in Tab. 2. Yet the larger network with K=50 can already often achieve reasonably good results, with a shorter running time than K=500. We leave the exploration of more variants of the network capacity and K as future work.
Temporal consistency. Since our method operates on shading points in 3D space and is independent of the view direction, we have observed that it has good across-frame consistency when changing the view in a scene with a fixed set of photons. We follow [vogels2018denoising] and use the mean DSSIM between consecutive frames to evaluate temporal consistency when moving the camera. Results in Tab. 3 show comparable temporal stability between our results and standard PM outputs. We leave extensions of our network to recurrent architectures, and general temporal consistency with other dynamic components in a scene, as future work.
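The metric above is simple to reproduce: DSSIM is (1 − SSIM)/2, averaged over consecutive frame pairs. As a sketch, the snippet below computes SSIM from global image statistics rather than the standard local sliding windows, which is a simplification; constants follow the usual C1 = (0.01·L)², C2 = (0.03·L)².

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    # Simplified SSIM from global image statistics
    # (the standard metric averages over local sliding windows).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def mean_dssim(frames):
    # Mean DSSIM over consecutive frame pairs; lower = more temporally stable.
    return float(np.mean([(1.0 - ssim_global(a, b)) / 2.0
                          for a, b in zip(frames, frames[1:])]))

rng = np.random.default_rng(2)
base = rng.uniform(size=(64, 64))
noisy = [np.clip(base + 0.05 * rng.normal(size=base.shape), 0.0, 1.0)
         for _ in range(5)]
print(mean_dssim([base, base]))  # identical frames -> 0.0
print(mean_dssim(noisy))         # > 0 for flickering frames
```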
Progressive density estimation. Our current framework requires a fixed number of input photons for each trained network, whereas progressive photon mapping accepts a different number of photons per iteration with a progressively reduced bandwidth. Nonetheless, we have demonstrated that our network architecture supports accurate photon density estimation with various fixed photon counts. In other words, a progressive method could potentially be achieved by training a sequence of networks with different numbers of inputs. A universal network for an arbitrary number of input photons may require introducing recurrent networks into the framework, which is an interesting direction for future work.
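For reference, the bandwidth-reduction rule our fixed-K networks would need to emulate is the standard progressive photon mapping update (in the style of Hachisuka et al. 2008); the sketch below is that published rule, not part of our method, and the variable names are illustrative.

```python
def ppm_update(N, R, tau, M, phi_M, alpha=0.7):
    # One progressive photon mapping iteration at a hit point.
    # N: accumulated photon count, R: current gather radius,
    # tau: accumulated (unnormalized) flux, M: photons found this pass,
    # phi_M: their total flux, alpha in (0,1): fraction of new photons kept.
    N_new = N + alpha * M
    ratio = N_new / (N + M) if (N + M) > 0 else 1.0
    R_new = R * ratio ** 0.5          # radius shrinks every iteration
    tau_new = (tau + phi_M) * ratio   # flux rescaled to the smaller disc
    return N_new, R_new, tau_new

N, R, tau = ppm_update(100.0, 1.0, 5.0, M=50, phi_M=2.0)
print(N, R, tau)  # 135.0, sqrt(0.9) ~ 0.9487, 6.3
```

A sequence of our networks trained for different input counts would play the role of the shrinking-radius kernel above, one network per iteration scale.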
6 Conclusions and Future Work
In this paper, we have presented the first deep learning-based method for density estimation in particle-based rendering. We introduce a deep neural network that learns a kernel function to aggregate photons at each shading point, rendering accurate caustics with significantly fewer photons than previous approaches and with minimal overhead. Learning-based MC denoising has significantly improved path tracing results, and our work extends these benefits to the popular photon mapping method.
Our method could be improved in the future with more advanced machine learning approaches, perhaps based on generative adversarial networks (GANs), as has been done for path tracing [xu2019adversarial]. More broadly, we believe this paper points toward denoisers specialized to many other approaches for realistic image synthesis, such as Metropolis Light Transport and Vertex Connection and Merging.