Despite the progress of real-time graphics, producing interactive 3D content with truly photorealistic scenes and objects is still time-consuming and costly due to the need for optimized 3D assets and dedicated shaders. Instead, many graphics applications opt for image-based solutions. E-commerce websites often use a fixed set of views to showcase their products; VR experiences often rely on 360° video recordings to avoid the costly production of real 3D scenes; and mapping services such as Google Street View stitch images into panoramic views limited to 3-DOF.
Recent advances in neural rendering, such as Neural Volumes and neural radiance fields (NeRFs), open a promising new avenue for modeling arbitrary objects and scenes in 3D from a set of calibrated images. NeRFs in particular can faithfully render detailed scenes and appearance with non-Lambertian effects from any view, while simultaneously offering a high degree of compression in terms of storage. Partly due to these exciting properties, there has recently been an explosion of research based on NeRF.
Nevertheless, for practical applications, runtime performance remains a critical limitation of NeRFs: due to the extreme sampling requirements and costly neural network queries, rendering a NeRF is agonizingly slow. For illustration, it takes roughly 30 seconds to render an 800×800 image from a NeRF on a high-performance GPU, making it impractical for real-time interactive applications.
In this work, we propose a method for rendering a NeRF in real time, achieved by distilling the NeRF into a hierarchical 3D volumetric representation. Our approach preserves NeRF's ability to synthesize arbitrarily complex geometry and view-dependent effects from any viewpoint and requires no additional supervision. In fact, our method matches, and in many cases surpasses, the quality of the original NeRF formulation while providing significant acceleration. Our model renders an 800×800 image at 167.68 FPS on an NVIDIA V100 GPU and does not rely on a deep neural network at test time. Moreover, our representation is amenable to modern web technologies, allowing interactive rendering in a browser on consumer laptops.
Naive NeRF rendering is slow because it requires dense sampling of the scene, where every sample requires inference through a deep MLP. Because these queries depend on the viewing direction as well as the spatial position, one cannot naively cache these color values for all viewing directions.
We overcome these challenges and enable real-time rendering by pre-sampling the NeRF into a tabulated view-dependent volume which we refer to as a PlenOctree, named after the plenoptic function of Adelson and Bergen. Specifically, we use a sparse voxel-based octree where every leaf of the tree stores the appearance and density values required to model the radiance at a point in the volume. In order to account for non-Lambertian materials that exhibit view-dependent effects, we propose to represent the RGB values at a location with spherical harmonics (SH), a standard basis for functions defined on the surface of the sphere. The spherical harmonics can be evaluated at arbitrary query viewing directions to recover the view-dependent color.
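To make the SH machinery concrete, the following sketch (a toy NumPy illustration, not the paper's implementation; degree 1 only, whereas the paper uses higher degrees) evaluates the real SH basis at a viewing direction and decodes a leaf's coefficients into an RGB color:

```python
import numpy as np

# Real spherical-harmonic basis up to degree 1 (4 components), evaluated at a
# unit viewing direction d = (x, y, z). The constants are the standard SH
# normalization factors.
def sh_basis_deg1(d):
    x, y, z = d
    return np.array([
        0.282095,          # Y_0^0
        0.488603 * y,      # Y_1^{-1}
        0.488603 * z,      # Y_1^0
        0.488603 * x,      # Y_1^1
    ])

def sh_color(coeffs, d):
    """coeffs: (4, 3) SH coefficients per RGB channel; d: unit view direction.
    Returns the view-dependent RGB after a sigmoid, as in the paper."""
    raw = sh_basis_deg1(d) @ coeffs          # (3,) raw per-channel radiance
    return 1.0 / (1.0 + np.exp(-raw))        # sigmoid keeps colors in [0, 1]

# A purely diffuse leaf: only the constant Y_0^0 term is non-zero,
# so the decoded color is identical from every direction.
coeffs = np.zeros((4, 3))
coeffs[0] = 1.0 / 0.282095                   # raw value of 1.0 per channel
c1 = sh_color(coeffs, np.array([0.0, 0.0, 1.0]))
c2 = sh_color(coeffs, np.array([1.0, 0.0, 0.0]))
assert np.allclose(c1, c2)
```

Higher-degree coefficients add view dependence on top of the constant (diffuse) term, which is how the representation captures non-Lambertian effects.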
Although one could convert an existing NeRF into such a representation via projection onto the SH basis functions, we show that we can in fact modify a NeRF network to predict appearance explicitly in terms of spherical harmonics. Specifically, we train a network that produces coefficients for the SH functions instead of raw RGB values, so that the predicted values can later be stored directly within the leaves of the PlenOctree. We also introduce a sparsity prior during NeRF training to improve the memory efficiency of our octrees, consequently allowing us to render higher-quality images. Furthermore, once the structure is created, the values stored in a PlenOctree can be further optimized, because the rendering procedure remains differentiable. This enables the PlenOctree to achieve similar or better image quality compared to NeRF. Our pipeline is illustrated in Fig. 2.
Additionally, we demonstrate how our proposed pipeline can be used to accelerate NeRF model training, making our solution more practical to train than the original NeRF approach. Specifically, we can stop training the NeRF model early to convert it into a PlenOctree, which can then be trained significantly faster as it no longer involves any neural networks.
Our experiments demonstrate that our approach can accelerate NeRF-based rendering by over three orders of magnitude without loss in image quality. We compare our approach on standard benchmarks with scenes and objects captured from multiple views, and demonstrate state-of-the-art performance in both image quality and rendering speed.
Our interactive viewer can enable operations such as object insertion, visualizing radiance distributions, decomposing the SH components, and slicing the scene. We hope that these real-time operations can be useful to the community for visualizing and debugging NeRF-based representations.
To summarize, we make the following contributions:
- The first method that achieves real-time rendering of NeRFs with similar or improved quality.
- NeRF-SH: a modified NeRF that is trained to output appearance in terms of spherical basis functions.
- PlenOctree: a data structure derived from NeRFs which enables highly efficient view-dependent rendering of complex scenes.
- An accelerated NeRF training method using early training termination, followed by a direct fine-tuning process on PlenOctree values.
2 Related Work
Novel View Synthesis. The task of synthesizing novel views of a scene given a set of photographs is a well-studied problem with various approaches. All methods predict an underlying geometric or image-based 3D representation that allows rendering from novel viewpoints. Mesh-based methods represent the scene with surfaces and have been used to model both Lambertian (diffuse) and non-Lambertian scenes [57, 5, 3].
Mesh-based representations are compact and easy to render; however, optimizing a mesh to fit a complex scene of arbitrary topology is challenging. Image-based rendering methods [18, 40, 57], on the other hand, enable easy capture as well as photorealistic and fast rendering; however, they are often limited in viewing range and do not allow easy editing of the underlying scene.
Volume rendering is a classical technique with a long history of research in the graphics community. Volume-based representations such as voxel grids [39, 17, 23, 13, 52, 41] and multi-plane images (MPIs) [46, 33, 61, 45, 27] are a popular alternative to mesh representations due to their topology-free nature: gradient-based optimization is therefore straightforward, while rendering can still be real-time. However, such naive volumetric representations are often memory-bound, limiting the maximum resolution that can be captured. Volumetric octrees are a popular approach for reducing memory and compute in such cases; we refer the reader to the survey for a historical perspective on octree volume rendering. Octrees have also been used in recent work to decrease memory requirements during training for other 3D tasks [36, 11, 49, 54]. Concurrent with this work, NeX extends MPIs to encode spherical basis functions that enable view-dependent rendering effects in real time. However, unlike our representation, their approach is limited in viewing direction due to the use of MPIs. Also concurrently, Lombardi et al. propose to model data using geometric primitives, which allows for fast rendering while conserving space; however, they require a coarse mesh to initialize the primitives.
Coordinate-Based Neural Networks.
Recently, coordinate-based neural networks have emerged as a popular alternative to explicit volumetric representations, as they are not limited to a fixed voxel resolution. These methods train a multilayer perceptron (MLP) whose input is a coordinate and whose output is some property of space at that location. Such networks have been used to predict occupancy [26, 4, 32, 37, 29, 19], signed distance fields [30, 10, 58, 59], and radiance fields. Coordinate-based neural networks have been used for view synthesis in Scene Representation Networks, NeRF, and many NeRF extensions [25, 31, 38, 44]. These networks represent a continuous function that can be sampled at arbitrarily fine resolutions without increasing the memory footprint. Unfortunately, this compactness is achieved at the expense of computational efficiency, as each sample must be processed by a neural network. As a result, these representations are often slow and impractical for real-time rendering.
NeRF Accelerations. While NeRFs produce high-quality results, their computationally expensive rendering leads to slow training and inference. One way to speed up the process of fitting a NeRF to a new scene is to incorporate priors learned from a dataset of similar scenes, accomplished by conditioning on predicted image features [50, 60, 55] or by meta-learning. To improve inference speed, Neural Sparse Voxel Fields (NSVF) learns a sparse voxel grid of features that are input to a NeRF-like model; the sparse grid allows the renderer to skip over empty regions when tracing a ray, improving render time by about 10x. Decomposed Radiance Fields spatially decomposes a scene into multiple smaller networks, but focuses on forward-facing scenes. AutoInt modifies the NeRF architecture so that inference requires fewer samples, at the cost of lower-quality results. None of these approaches achieves real-time performance. The concurrent work DONeRF adds a depth classifier to NeRF in order to drastically improve sampling efficiency, but requires ground-truth depth for training. Although not based on NeRF, Takikawa et al. recently proposed a method to accelerate neural SDF rendering with an octree; note that this work does not model appearance properties. In contrast, we employ a volumetric representation that can capture photorealistic view-dependent appearance while achieving even higher framerates.
3.1 Neural Radiance Fields
Neural radiance fields (NeRF) are 3D representations that can be rendered from arbitrary novel viewpoints while capturing continuous geometry and view-dependent appearance. The radiance field is encoded into the weights of a multilayer perceptron (MLP) that can be queried at a position $\mathbf{x}$ from a viewing direction $\mathbf{d}$ to recover the corresponding density $\sigma$ and color $\mathbf{c}$. A pixel's predicted color is computed by casting a ray $\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$ into the volume and accumulating the color based on density along the ray. NeRF estimates the accumulated color by taking $N$ point samples along the ray to perform volume rendering:
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right)\mathbf{c}_i, \qquad (1)$$
$$T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big), \qquad (2)$$
where $\delta_i = t_{i+1} - t_i$ are the distances between point samples. To train the NeRF network, the predicted colors for a batch of rays corresponding to pixels in the training images are optimized using Adam to match the target pixel colors $C(\mathbf{r})$:
$$\mathcal{L}_{\mathrm{RGB}} = \sum_{\mathbf{r}} \big\lVert \hat{C}(\mathbf{r}) - C(\mathbf{r}) \big\rVert_2^2. \qquad (3)$$
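The discrete volume rendering model above can be sketched in a few lines of NumPy (a toy illustration; the function name and array layout are our own, not the paper's implementation):

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """Discrete NeRF volume rendering along one ray.
    sigmas: (N,) densities at samples; colors: (N, 3) RGB per sample;
    deltas: (N,) distances between consecutive samples."""
    alpha = 1.0 - np.exp(-sigmas * deltas)   # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance up to sample i
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * colors).sum(axis=0)

# A nearly opaque red sample followed by a green one: the first sample
# blocks the ray, so the accumulated color is essentially pure red.
sigmas = np.array([50.0, 50.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
deltas = np.array([1.0, 1.0])
c = volume_render(sigmas, colors, deltas)
assert c[0] > 0.99 and c[1] < 1e-10
```

The per-sample weights $T_i(1 - e^{-\sigma_i\delta_i})$ computed here are the same "ray weights" used later for voxel filtering during octree conversion.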
To better represent high-frequency details in the scene, positional encoding is applied to the inputs, and two stages of sampling are performed. We refer the interested reader to the NeRF paper for details.
One notable consequence of this architecture is that each sample along the ray must be fed through the MLP to obtain the corresponding $\sigma$ and $\mathbf{c}$. A total of 192 samples were taken for each ray in the examples presented in NeRF. This is inefficient, as most samples fall in free space and do not contribute to the integrated color. To render a single target image at 800×800 resolution, the network must be run on over 100 million inputs. As a result, it takes about 30 seconds to render a single frame on an NVIDIA V100 GPU, making it impractical for real-time applications. Our use of a sparse voxel octree avoids excess compute in regions without content. Additionally, we precompute the values for each voxel so that no network queries are performed during inference.
We propose a pipeline that enables real-time rendering of NeRFs. Given a trained NeRF, we can convert it into a PlenOctree, an efficient data structure that is able to represent non-Lambertian effects in a scene. Specifically, it is an octree which stores spherical harmonics (SH) coefficients at the leaves, encoding view-dependent radiance.
To make the conversion to a PlenOctree more straightforward, we also propose NeRF-SH, a variant of the NeRF network which directly outputs the SH coefficients, thus eliminating the need for a view-direction input to the network. With this change, the conversion can then be performed by evaluating the network on a uniform grid followed by thresholding. We fine-tune the octree on the training images to further improve image quality. Please see Fig. 2 for a graphical illustration of our pipeline.
The conversion process leverages the continuous nature of NeRF to dynamically obtain the spatial structure of the octree. We show that even with a partially trained NeRF, our PlenOctree is capable of producing results competitive with the fully trained NeRF.
4.1 NeRF-SH: NeRF with Spherical Harmonics
SHs have been a popular low-dimensional representation for spherical functions and have been used to model Lambertian surfaces [34, 2] or even glossy surfaces. Here we explore their use in a volumetric context. Specifically, we adapt the NeRF network to output spherical harmonics coefficients $\mathbf{k} = \big(k_\ell^m\big)_{0 \le \ell \le \ell_{\max},\; -\ell \le m \le \ell}$ rather than RGB values. Each $k_\ell^m \in \mathbb{R}^3$ is a set of 3 coefficients corresponding to the RGB components. The view-dependent color $\mathbf{c}$ at a point may then be determined by querying the SH basis functions $Y_\ell^m$ at the desired viewing direction $\mathbf{d}$:
$$\mathbf{c}(\mathbf{d};\, \mathbf{k}) = S\Big(\sum_{\ell=0}^{\ell_{\max}} \sum_{m=-\ell}^{\ell} k_\ell^m\, Y_\ell^m(\mathbf{d})\Big),$$
where $S$ is the sigmoid function for normalizing the colors. In other words, we factorize the view-dependent appearance with the SH basis, eliminating the view-direction input to the network and removing the need to sample view directions at conversion time. Please see the appendix for more technical discussion of SHs. With a single evaluation of the network, we can now efficiently query colors from arbitrary viewing angles at inference time. As seen in Fig. 7, NeRF-SH trains at a speed similar to, but slightly faster than, NeRF (by about 10%).
Note that we can also project a trained NeRF onto SHs directly: at each point, we sample NeRF at random directions and multiply by the SH basis values to form Monte Carlo estimates of the inner products. However, this sampling process takes several hours to achieve reasonable quality and imposes a quality loss of about 2 dB.¹ Nevertheless, this alternative approach offers a pathway to convert existing NeRFs into PlenOctrees.

¹ With 10,000 view-direction samples per point, taking about 2 hours, the PSNR is 29.21 vs. 31.02 for our main method prior to optimization.
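The Monte Carlo projection can be illustrated for the simplest, degree-0 case (a toy sketch; `project_to_sh0`, the sample count, and the seed are our own illustrative choices):

```python
import numpy as np

Y00 = 0.282095  # the constant real SH basis function Y_0^0

def project_to_sh0(f, n=20000, seed=0):
    """Monte Carlo estimate of the degree-0 SH coefficient of a spherical
    function f: k = integral of f(d) * Y_0^0 over the sphere, estimated as
    (4*pi / n) * sum_i f(d_i) * Y_0^0 with d_i uniform on the unit sphere."""
    rng = np.random.default_rng(seed)
    d = rng.normal(size=(n, 3))
    d /= np.linalg.norm(d, axis=1, keepdims=True)   # uniform unit directions
    vals = np.apply_along_axis(f, 1, d)
    return 4.0 * np.pi / n * np.sum(vals * Y00)

# For a constant function the estimate is exact up to the rounded constant:
# integral of 1 * Y_0^0 dOmega = 4*pi * Y_0^0 = 2*sqrt(pi) ~ 3.5449.
k = project_to_sh0(lambda d: 1.0)
assert abs(k - 2.0 * np.sqrt(np.pi)) < 1e-4
```

For a view-dependent radiance function the estimate carries Monte Carlo variance, which is why many direction samples per point are needed and why this route is slower and slightly lossier than predicting coefficients directly.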
Besides SHs, we also experiment with spherical Gaussians (SG), a learnable spherical basis which has been used to represent all-frequency lighting [51, 43, 20]. We find that SHs perform better in our use case and provide an ablation in the appendix.
Sparsity prior. Without any regularization, the model is free to generate arbitrary geometry in unobserved regions. While this does not directly worsen image quality, it would adversely impact our conversion process as the extra geometry occupies significant voxel space.
To solve this problem, we introduce an additional sparsity prior during NeRF training. Intuitively, this prior encourages NeRF to choose empty space when both empty space and solid colors are possible solutions. Formally,
$$\mathcal{L}_{\mathrm{sparsity}} = \frac{1}{K} \sum_{k=1}^{K} \big|\, 1 - e^{-\lambda \sigma_k} \big|,$$
where $\sigma_k$ are the evaluated density values at $K$ uniformly random points within the bounding box, and $\lambda$ is a hyperparameter. The final training loss is then $\mathcal{L} = \mathcal{L}_{\mathrm{RGB}} + \beta_{\mathrm{sparsity}}\, \mathcal{L}_{\mathrm{sparsity}}$, where $\beta_{\mathrm{sparsity}}$ is a hyperparameter. Fig. 3 illustrates the effect of the prior.
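The sparsity term can be sketched as follows (a minimal NumPy version; λ = 0.01 is an illustrative value, not necessarily the paper's setting):

```python
import numpy as np

def sparsity_loss(sigmas, lam=0.01):
    """Sparsity prior over K densities sampled at uniformly random points in
    the bounding box: mean over k of |1 - exp(-lam * sigma_k)|. Zero density
    gives zero loss, so the prior pushes unobserved regions toward empty
    space. lam is a hyperparameter (0.01 is illustrative)."""
    return np.mean(np.abs(1.0 - np.exp(-lam * sigmas)))

assert sparsity_loss(np.zeros(10)) == 0.0          # empty space: no penalty
assert sparsity_loss(np.full(10, 100.0)) > sparsity_loss(np.full(10, 1.0))
```

The saturating exponential keeps the penalty bounded, so genuinely solid geometry that is supported by the training images pays only a fixed cost while spurious density in unobserved regions is discouraged.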
Table 1: Results on the Synthetic NeRF dataset (best and second-best highlighted in the original).

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | FPS↑ |
| --- | --- | --- | --- | --- |
| AutoInt (8 sections) | 25.55 | 0.911 | 0.170 | 0.380 |
| PlenOctree from NeRF-SH (ours) | 31.02 | 0.951 | 0.066 | 167.68 |
| PlenOctree after fine-tuning (ours) | 31.71 | 0.958 | 0.053 | 167.68 |
Table 2: Results on the Tanks and Temples dataset (best and second-best highlighted in the original).

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | FPS↑ |
| --- | --- | --- | --- | --- |
| PlenOctree from NeRF-SH (ours) | 27.34 | 0.897 | 0.170 | 42.22 |
| PlenOctree after fine-tuning (ours) | 27.99 | 0.917 | 0.131 | 42.22 |
4.2 PlenOctree: Octree-based Radiance Fields
Once we have trained a NeRF-SH model, we can convert it into a sparse octree representation for real time rendering. A PlenOctree stores density and SH coefficients modelling view-dependent appearance at each leaf. We describe the conversion and rendering processes below.
Rendering. To render the PlenOctree, for each ray we first determine the ray-voxel intersections in the octree structure. This yields a sequence of segment lengths $\{\delta_i\}$ between voxel boundaries, within each of which the density and color are constant. NeRF's volume rendering model (1) is then applied to assign a color to the ray. Note that compared to the uniform sampling employed in Neural Volumes, this approach can skip a large voxel in a single step while also not missing small voxels.
At test time, we further accelerate this rendering process by early-stopping a ray once its accumulated transmittance falls below a small threshold.
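Putting the two ideas together, a toy CPU version of the per-ray renderer with early stopping might look like this (the segment data, names, and stop threshold are illustrative; the real renderer runs in CUDA):

```python
import numpy as np

def render_ray_segments(sigmas, colors, deltas, stop_thresh=0.01):
    """Volume rendering over piecewise-constant ray segments (one per
    intersected leaf voxel), terminating once the remaining transmittance
    drops below stop_thresh (an illustrative value)."""
    T = 1.0                                  # transmittance so far
    out = np.zeros(3)
    for sigma, c, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - np.exp(-sigma * delta)
        out += T * alpha * c
        T *= 1.0 - alpha
        if T < stop_thresh:                  # ray is effectively opaque
            break
    return out

# An opaque first voxel terminates the ray immediately: the green voxel
# behind it is never evaluated.
sigmas = np.array([100.0, 100.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
deltas = np.array([1.0, 1.0])
out = render_ray_segments(sigmas, colors, deltas)
assert out[0] > 0.99 and out[1] == 0.0
```

Because rays in typical scenes terminate quickly on opaque surfaces, early stopping skips most of the segments behind the first hit.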
Conversion from NeRF-SH. The conversion process can be divided into three steps. At a high level, we evaluate the network on a grid, retaining only density values, then filter the voxels via thresholding. Finally we sample random points within each remaining voxel and average them to obtain SH coefficients to store in the octree leaves. More details are given below:
Evaluation. We first evaluate the NeRF-SH model to obtain density values $\sigma$ on a uniformly spaced 3D grid. The grid is automatically scaled to tightly fit the scene content.²

² By pre-evaluating on a larger grid and finding the bounding box of all points whose density exceeds a threshold.
Filtering. Next, we filter this grid to obtain a sparse set of voxels, centered at the grid points, sufficient for representing the scene. Specifically, we render alpha maps for all the training views using this voxel grid, keeping track of the maximum ray weight at each voxel. We then eliminate the voxels whose maximum weight falls below a threshold. The octree is constructed to contain the remaining voxels as leaves at the deepest level while being empty elsewhere. Compared to naively thresholding the density $\sigma$ at each point, this method also eliminates non-visible voxels.
Sampling. Finally, we sample a set of random points within each remaining voxel and set the associated leaf of the octree to the mean of these values to reduce aliasing. Each leaf then stores the density $\sigma$ and a vector of spherical harmonics coefficients for each of the RGB color channels.
This full extraction process takes about 15 minutes.³

³ Note that sampling fewer points per voxel allows for substantially faster extraction, with minimal loss in quality.
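The three conversion steps can be sketched end-to-end as follows (a toy NumPy version; the grid size, threshold, and the simplified density-only filtering are our own stand-ins for the paper's weight-based filtering over training views):

```python
import numpy as np

def build_leaves(eval_fn, grid_res=8, sigma_thresh=0.01, n_samples=8, seed=0):
    """Toy NeRF-SH -> PlenOctree conversion over the unit cube.
    eval_fn(points (M, 3)) -> (sigma (M,), sh (M, C)).
    1) Evaluate densities on a uniform grid of voxel centers.
    2) Keep voxels whose density exceeds a threshold (the paper instead
       filters by the maximum ray weight over the training views).
    3) Average a few random in-voxel evaluations to reduce aliasing."""
    rng = np.random.default_rng(seed)
    lin = (np.arange(grid_res) + 0.5) / grid_res          # voxel centers
    pts = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), -1).reshape(-1, 3)
    sigma, _ = eval_fn(pts)
    keep = pts[sigma > sigma_thresh]                      # sparse voxel set
    half = 0.5 / grid_res
    leaves = []
    for c in keep:
        samples = c + rng.uniform(-half, half, size=(n_samples, 3))
        s, sh = eval_fn(samples)
        leaves.append((c, s.mean(), sh.mean(axis=0)))     # center, sigma, SH
    return leaves

# Toy field: dense (sigma = 10) in the half-space x < 0.5, empty elsewhere.
def toy_field(pts):
    sigma = np.where(pts[:, 0] < 0.5, 10.0, 0.0)
    return sigma, np.ones((len(pts), 4))                  # 4 dummy SH components

leaves = build_leaves(toy_field)
assert len(leaves) == (8 // 2) * 8 * 8                    # only the dense half kept
```

In the real pipeline the kept voxel centers become the deepest-level octree leaves, and the averaged density and SH vectors are the values stored there.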
4.3 PlenOctree Optimization
Since this volume rendering process is fully differentiable with respect to the tree values, we can directly fine-tune the resulting octree on the original training images using the NeRF loss (3) with SGD, in order to further improve image quality. Note that the tree structure is fixed to that obtained from NeRF in this process. PlenOctree optimization processes rays at a far higher rate than NeRF training, allowing us to optimize for many epochs in a relatively short time. The analytic derivatives for this process are implemented in custom CUDA kernels; we defer technical details to the appendix.
| Model | Description | Size (GB) | PSNR↑ | FPS↑ |
| --- | --- | --- | --- | --- |
| Ours-1.9G | Complete model as in Table 1 | 1.93 | 31.71 | 168 |
| Ours-0.4G | w/o auto bbox scaling | 0.44 | 30.70 | 329 |
| Ours-0.3G | Grid size 256 | 0.30 | 29.60 | 410 |
The fast octree optimization indirectly allows us to accelerate NeRF training, as seen in Fig. 7, since we can elect to stop the NeRF-SH training at an earlier time for constructing the PlenOctree, with only a slight degradation in quality.
5.1 Experimental Setup
Datasets. For our experiments, we use the NeRF-synthetic dataset and a subset of the Tanks and Temples dataset. The NeRF-synthetic dataset consists of 8 scenes, each featuring a central object imaged by 100 inward-facing cameras distributed randomly on the upper hemisphere; the 800×800 images come with ground-truth camera poses. The Tanks and Temples subset is from NSVF and contains 5 scenes of real objects captured by an inward-facing camera circling the scene; we use the foreground masks provided by the NSVF authors. Each scene contains between 152 and 384 images of size 1920×1080.
Baselines. The principal baseline for our experiments is NeRF; we report results for both the original NeRF implementation, denoted NeRF (original), and a reimplementation in JAX, denoted simply NeRF, on which our NeRF-SH code is based. Unless otherwise stated, all NeRF results and timings are from the latter implementation. We also compare to two recent papers introducing NeRF accelerations, Neural Sparse Voxel Fields (NSVF) and AutoInt, as well as two older methods, Scene Representation Networks (SRN) and Neural Volumes.
5.2 Quality Evaluation
We evaluate our approach against prior work on the synthetic and real datasets described above; the results are in Tables 1 and 2, respectively. Note that none of the baselines achieves real-time performance; nevertheless, our quality results are competitive in all cases and better on some metrics.
In Figures 4 and 6, we show qualitative examples demonstrating that our PlenOctree conversion does not perceptually worsen the rendered images compared to NeRF; rather, we observe that the PlenOctree optimization process enhances fine details such as text. Additionally, we note that our modification of NeRF to predict spherical function coefficients (NeRF-SH) does not significantly change performance.
For the SH, we set $\ell_{\max} = 3$ (16 components) on the synthetic dataset and $\ell_{\max} = 4$ (25 components) on the Tanks & Temples dataset. We use a $512^3$ grid in either case. Please refer to the appendix for training details. Inference time is measured on a Tesla V100 for all methods. Across both datasets, we find that PlenOctrees perform inference over 3000 times faster than NeRF and at least 30 times faster than all other compared methods. PlenOctree performs either best or second-best on all image quality metrics.
5.3 Speed Trade-off Analysis
5.4 Indirect Acceleration of NeRF Training
Since we can efficiently fine-tune the octree on the original training data, as briefly discussed in §4.3, we can choose to stop NeRF-SH training early before converting it to a PlenOctree. Indeed, we find that the image quality gained by fine-tuning the octree is often greater than that gained by training the NeRF-SH for an equivalent amount of additional time. It can therefore be more time-efficient to stop NeRF-SH training before convergence and transition to PlenOctree conversion and fine-tuning.
In Figure 7 we compare NeRF and NeRF-SH models trained for 2 million iterations each to a sequence of PlenOctree models extracted from NeRF-SH checkpoints. We find that given a time constraint, it is almost always preferable to stop the NeRF training and transition to PlenOctree optimization.
5.5 Real-time and In-browser Applications
Interactive demos. Within our desktop viewer, we are able to perform a variety of real-time scene operations on the PlenOctree representation. For example, it is possible to insert meshes while maintaining proper occlusion, slice the PlenOctree to visualize a cross-section, or render the depth map to verify the geometry. Other features include probing the radiance distribution at any point in space, and inspecting subsets of SH components. These examples are demonstrated in Figure 9. The ability to perform these actions in real-time is beneficial both for interactive entertainment and debugging NeRF-related applications.
Web renderer. We have implemented a web-based renderer enabling interactive viewing of converted PlenOctrees in the browser, achieved by rewriting our CUDA-based PlenOctree renderer as a WebGL-compatible fragment shader. We apply compression to make serving the octrees more manageable. Please see the appendix for more information.
We have introduced a new data representation for NeRFs using PlenOctrees, which enables real-time rendering of arbitrary objects and scenes. Not only can we accelerate the rendering performance of the original NeRF method by more than 3000 times, but we can also produce images of equal or better quality than NeRF thanks to our hierarchical data structure. As training time poses another hurdle for adopting NeRFs in practice (taking 1-2 days to fully converge), we also showed that PlenOctrees can accelerate the effective training time of NeRF-SH. Finally, we have implemented an in-browser viewer based on WebGL to demonstrate real-time, 6-DOF rendering of NeRFs on consumer laptops. In the future, our approach may enable virtual online stores in VR, where products of arbitrary complexity and materials can be visualized in real time with full 6-DOF viewing.
Limitations and Future Work.
While we achieve state-of-the-art rendering performance and frame rates, the octree representation is much larger than the compact representation of the original NeRF model and has a larger memory footprint. The average uncompressed octree size for the full model is 1.93 GB on the synthetic dataset and 3.53 GB on the Tanks and Temples dataset. For online delivery, we use lower-resolution compressed models of about 30-120 MB; please see the appendix for details. Although already possible in some form (Fig. 8), optimally applying our method to unbounded and forward-facing scenes requires further work, as the spatial distribution of content differs for unbounded scenes. Forward-facing scenes inherently do not support 6-DOF viewing, and we suggest that MPIs may be more appropriate in that case.
In the future, we plan to explore extensions of our method to enable real-time 6-DOF immersive viewing of large-scale scenes, as well as of dynamic scenes. We believe that real-time rendering of NeRFs has the potential to become a new standard for next-generation AR/VR technologies, as photorealistic 3D content can be digitized as easily as recording 2D videos.
-  (1991) The plenoptic function and the elements of early vision. Vol. 2, Vision and Modeling Group, Media Laboratory, Massachusetts Institute of Technology. Cited by: §1.
-  (2003) Lambertian reflectance and linear subspaces. IEEE transactions on pattern analysis and machine intelligence 25 (2), pp. 218–233. Cited by: §4.1.
-  (2001) Unstructured lumigraph rendering. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 425–432. Cited by: §2.
-  (2019) Learning implicit fields for generative shape modeling. In CVPR. Cited by: §2.
-  (1996) Modeling and rendering architecture from photographs: a hybrid geometry-and image-based approach. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pp. 11–20. Cited by: §2.
-  JaxNeRF: an efficient JAX implementation of NeRF External Links: Cited by: §B.4, §5.1.
-  (1988) Volume rendering. ACM Siggraph Computer Graphics 22 (4), pp. 65–74. Cited by: §2.
-  (1953) Dispersion on a sphere. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences 217 (1130), pp. 295–305. Cited by: §A.2, §B.1, §4.1.
-  Zlib External Links: Cited by: item 2.
-  (2020) Implicit geometric regularization for learning shapes. ICML. Cited by: §2.
-  (2017) Hierarchical surface prediction for 3d object reconstruction. In 2017 International Conference on 3D Vision (3DV), pp. 412–420. Cited by: §2.
-  (1982) Color image quantization for frame buffer display. ACM SIGGRAPH Proceedings. Cited by: item 1.
-  (2017) Learning a multi-view stereo machine. arXiv preprint arXiv:1708.05375. Cited by: §2.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §B.4, §3.1.
-  (2017) Tanks and temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG) 36 (4), pp. 1–13. Cited by: §5.1.
-  (2006) A survey of octree volume rendering methods. GI, the Gesellschaft für Informatik, pp. 87. Cited by: §2.
-  (2000) A theory of shape by space carving. International journal of computer vision 38 (3), pp. 199–218. Cited by: §2.
-  (1996) Light field rendering. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pp. 31–42. Cited by: §2.
-  (2020) Monocular real-time volumetric performance capture. In European Conference on Computer Vision, pp. 49–67. Cited by: §2.
-  (2020) Inverse rendering for complex indoor scenes: shape, spatially-varying lighting and svbrdf from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2475–2484. Cited by: §B.1, §4.1.
-  (2020) AutoInt: automatic integration for fast neural volume rendering. arXiv preprint arXiv:2012.01714. Cited by: §2, §5.1.
-  (2020) Neural sparse voxel fields. NeurIPS. Cited by: §A.1, §2, §5.1, §5.1.
-  (2019-07) Neural volumes: learning dynamic renderable volumes from images. ACM Trans. Graph. 38 (4), pp. 65:1–65:14. Cited by: §A.1, §1, §2, §4.2, §5.1.
-  (2021) Mixture of volumetric primitives for efficient neural rendering. Note: preprint External Links: Cited by: §2.
-  (2021) NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR, Cited by: §2.
-  (2019) Occupancy networks: learning 3d reconstruction in function space. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2019) Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–14. Cited by: §2.
-  (2020) NeRF: representing scenes as neural radiance fields for view synthesis. ECCV. Cited by: §1, Figure 2, §2, §3.1, §3.1, §5.1, §5.1.
-  (2020) Differentiable volumetric rendering: learning implicit 3d representations without 3d supervision. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2019-06) DeepSDF: learning continuous signed distance functions for shape representation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2020) Deformable neural radiance fields. arXiv preprint arXiv:2011.12948. Cited by: §2.
-  (2020) Convolutional occupancy networks. In European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2017) Soft 3d reconstruction for view synthesis. ACM Transactions on Graphics (TOG) 36 (6), pp. 1–11. Cited by: §2.
-  (2001) On the relationship between radiance and irradiance: determining the illumination from images of a convex lambertian object. JOSA A 18 (10), pp. 2448–2459. Cited by: §4.1.
-  (2020) DeRF: decomposed radiance fields. arXiv preprint arXiv:2011.12490. Cited by: §2.
-  (2017) OctNet: learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
-  (2019-10) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
-  (2020) Graf: generative radiance fields for 3d-aware image synthesis. arXiv preprint arXiv:2007.02442. Cited by: §2.
-  (1999) Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision 35 (2), pp. 151–173. Cited by: §2.
-  (1998) Layered depth images. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pp. 231–242. Cited by: §2.
-  (2019) Deepvoxels: learning persistent 3d feature embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2437–2446. Cited by: §2.
-  (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. arXiv preprint arXiv:1906.01618. Cited by: §A.1, §2, §5.1.
-  (2002) Precomputed radiance transfer for real-time rendering in dynamic, low-frequency lighting environments. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pp. 527–536. Cited by: §B.1, §4.1, §4.1.
-  (2020) NeRV: neural reflectance and visibility fields for relighting and view synthesis. arXiv preprint arXiv:2012.03927. Cited by: §2.
-  (2019) Pushing the boundaries of view extrapolation with multiplane images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 175–184. Cited by: §2.
-  (1998) Stereo matching with transparency and matting. In Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271), pp. 517–524. Cited by: §2.
-  (2021) Neural geometric level of detail: real-time rendering with implicit 3D shapes. arXiv preprint arXiv:2101.10994. Cited by: §2.
-  (2021) Learned initializations for optimizing coordinate-based neural representations. In CVPR, Cited by: §2.
-  (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2088–2096. Cited by: §2.
-  (2020) GRF: learning a general radiance field for 3d scene representation and rendering. arXiv preprint arXiv:2010.04595. Cited by: §2.
-  (2006) . ACM Transactions on Graphics (TOG) 25 (3), pp. 967–976. Cited by: §B.1, §4.1.
-  (2017) Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2626–2634. Cited by: §2.
-  (2014) Let there be color! large-scale texturing of 3d reconstructions. In ECCV, pp. 836–850. Cited by: §2.
-  (2017) O-CNN: octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–11. Cited by: §2.
-  (2021) IBRNet: learning multi-view image-based rendering. arXiv preprint arXiv:2102.13090. Cited by: §2.
-  (2021) NeX: real-time view synthesis with neural basis expansion. Cited by: §2, §6.
-  (2000) Surface light fields for 3d photography. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 287–296. Cited by: §2, §2.
-  (2019) DISN: deep implicit surface network for high-quality single-view 3d reconstruction. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 492–502. Cited by: §2.
-  (2020) Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems 33. Cited by: §2.
-  (2021) PixelNeRF: neural radiance fields from one or few images. In CVPR, Cited by: §2.
-  (2018) Stereo magnification: learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817. Cited by: §2.
Appendix A Additional Results
A.1 Detailed comparisons
Here we provide further qualitative comparisons with the baselines SRN, Neural Volumes, and NSVF in Figure 10. We show more qualitative results of our method in Figure 11 and Figure 12. We also report a per-scene breakdown of the quantitative metrics against all approaches in Tables 5, 6, 7, and 8.
A.2 Spherical Basis Function Ablation
We also provide ablation studies on the choice of spherical basis functions. We first ablate the number of spherical harmonic basis functions; we then explore the use of learnable spherical basis functions. All experiments are conducted on the NeRF-synthetic dataset, and we report the average metrics both directly after training NeRF with spherical basis functions and after converting it to a PlenOctree with fine-tuning.
Number of SH basis functions
First, we ablate the number of basis functions used for spherical harmonics. Average metrics across the NeRF-synthetic dataset are reported both for the modified NeRF model and the corresponding PlenOctree. We found that switching between SH-16 and SH-25 makes very little difference in terms of metrics or visual quality.
Furthermore, we also experimented with spherical Gaussians (SGs), another form of spherical basis function similar to spherical harmonics but with learnable Gaussian kernels. Please see §B.1 for a brief introduction to SHs and SGs. SG-25 denotes our model using 25 SG components instead of SH, all with learnable lobe axes and bandwidths. However, while this model has marginally better PSNR, the advantage disappears after PlenOctree conversion and fine-tuning.
Appendix B Technical Details
B.1 Spherical Basis Functions: SH and SG
In the main paper, we used the SH functions without defining their exact form. Here, we provide a brief technical discussion of both spherical harmonics (SH) and spherical Gaussians (SG) for completeness.
The spherical harmonics (SH) form a complete orthonormal basis for functions $f : S^2 \to \mathbb{R}$ on the sphere. For $\ell \ge 0$ and $-\ell \le m \le \ell$, the complex SH function of degree $\ell$ and order $m$ is defined as:
$$Y_\ell^m(\theta, \varphi) = \sqrt{\frac{2\ell + 1}{4\pi}\,\frac{(\ell - m)!}{(\ell + m)!}}\; P_\ell^m(\cos\theta)\, e^{i m \varphi},$$
where $P_\ell^m$ are the associated Legendre polynomials. A real basis of SH can be defined in terms of its complex analogue by setting
$$Y_{\ell m} = \begin{cases} \sqrt{2}\,(-1)^m\,\mathrm{Im}\big[Y_\ell^{|m|}\big] & m < 0 \\ Y_\ell^0 & m = 0 \\ \sqrt{2}\,(-1)^m\,\mathrm{Re}\big[Y_\ell^m\big] & m > 0. \end{cases}$$
Any real spherical function $f$ may then be expressed in the SH basis:
$$f(\theta, \varphi) = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} k_{\ell m}\, Y_{\ell m}(\theta, \varphi),$$
which in practice is truncated at a maximum degree $\ell_{\max}$, leaving $(\ell_{\max} + 1)^2$ coefficients.
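To make the truncated expansion concrete, the following NumPy sketch (illustrative; function names are ours, not the released implementation) evaluates the first 9 real SH basis functions — the SH-9 configuration used for our web viewer — and the resulting view-dependent color. The constants are the standard hard-coded real SH values.

```python
import numpy as np

# Real SH basis up to degree 2 (9 components, "SH-9"), evaluated at a
# unit view direction d = (x, y, z). Constants are the standard values
# (cf. Ramamoorthi and Hanrahan).
def sh_basis_9(d):
    x, y, z = d
    return np.array([
        0.282095,                       # l=0
        0.488603 * y,                   # l=1, m=-1
        0.488603 * z,                   # l=1, m=0
        0.488603 * x,                   # l=1, m=1
        1.092548 * x * y,               # l=2, m=-2
        1.092548 * y * z,               # l=2, m=-1
        0.315392 * (3.0 * z**2 - 1.0),  # l=2, m=0
        1.092548 * x * z,               # l=2, m=1
        0.546274 * (x**2 - y**2),       # l=2, m=2
    ])

def sh_color(k, d):
    """View-dependent color from per-channel SH coefficients k, shape (3, 9)."""
    return k @ sh_basis_9(d)  # (3,) raw RGB, before any sigmoid

d = np.array([0.0, 0.0, 1.0])  # view direction along +z
basis = sh_basis_9(d)
print(basis[0])  # the constant l=0 term, 0.282095 for any direction
```

Each PlenOctree leaf stores one such coefficient matrix `k` (plus a density), so evaluating view-dependent color reduces to this small dot product.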
Spherical Gaussians (SGs), also known as the von Mises-Fisher distribution, are another form of spherical basis function that has been widely adopted to approximate spherical functions. Unlike SHs, SGs are a learnable basis. A normalized SG is defined as:
$$G(\nu; \mu, \lambda) = \frac{\lambda}{4\pi \sinh\lambda}\, e^{\lambda\, (\mu \cdot \nu)},$$
where $\mu \in S^2$ is the lobe axis and $\lambda > 0$ is the bandwidth (sharpness) of the Gaussian kernel. Due to the varying bandwidths supported by SGs, they are suitable for representing all-frequency signals such as lighting [51, 43, 20]. A spherical function represented using SGs is formulated as:
$$f(\nu) = \sum_{i=1}^{N} k_i\, G(\nu; \mu_i, \lambda_i),$$
where $k_i$ are the RGB coefficients for each SG.
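As a quick sanity check of the normalization, the NumPy sketch below (ours, for illustration) integrates a normalized SG over the sphere numerically, exploiting rotational symmetry around the lobe axis to reduce the integral to the polar angle:

```python
import numpy as np

def sg(nu, mu, lam):
    """Normalized spherical Gaussian (von Mises-Fisher): lobe axis mu, bandwidth lam."""
    return lam / (4.0 * np.pi * np.sinh(lam)) * np.exp(lam * (nu @ mu))

lam, mu = 5.0, np.array([0.0, 0.0, 1.0])
n = 10000
dtheta = np.pi / n
theta = (np.arange(n) + 0.5) * dtheta  # midpoint rule in the polar angle
nu = np.stack([np.sin(theta), np.zeros(n), np.cos(theta)], axis=-1)
# By symmetry, integral over the sphere = 2*pi * int_0^pi G sin(theta) dtheta.
integral = 2.0 * np.pi * np.sum(sg(nu, mu, lam) * np.sin(theta)) * dtheta
print(integral)  # ≈ 1.0: the SG is normalized for any bandwidth
```

The larger `lam` is, the more sharply the lobe concentrates around `mu`, which is what makes SGs suitable for high-frequency signals.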
B.2 PlenOctree Compression
The uncompressed PlenOctree file would be unpleasantly time-consuming for users to download for in-browser rendering. Thus, to minimize the size of PlenOctrees for viewing in the browser, we use SH-9 instead of SH-16 or SH-25 and apply a tighter bounding box, which reduces the number of occupied voxels. On top of this, we compress the PlenOctrees directly in the following ways:
- We quantize the SH coefficients in the tree using the popular median-cut algorithm. More specifically, the density values are kept as-is; for each SH basis function, we quantize the RGB coefficients into a small palette of colors. Afterwards, separately for each SH basis function, we store a codebook (as float16) along with pointers from each tree leaf to a position in the codebook (as int16).
- We compress the entire tree, including pointers, using the standard DEFLATE algorithm from ZLIB.
This process reduces the file size several-fold. The tree is fully decompressed before it is displayed in the web renderer. We will also release this code.
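The steps above can be sketched in Python. This is a toy stand-in, not our implementation: the palette here is chosen randomly rather than built by median cut, and all names are ours.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical leaf RGB coefficients for one SH basis function (N leaves x 3).
coeffs = rng.normal(size=(10000, 3)).astype(np.float32)

# Stand-in for median cut: snap each leaf to the nearest of 256 palette colors.
palette = coeffs[rng.choice(len(coeffs), 256, replace=False)]
idx = np.argmin(((coeffs[:, None, :] - palette[None]) ** 2).sum(-1), axis=1)

# Store the codebook as float16 and the per-leaf pointers as int16, then DEFLATE.
blob = zlib.compress(palette.astype(np.float16).tobytes() +
                     idx.astype(np.int16).tobytes())
print(len(blob) < coeffs.nbytes)  # compressed payload is much smaller than raw floats

# The viewer decompresses and rebuilds the quantized coefficients before rendering.
raw = zlib.decompress(blob)
pal = np.frombuffer(raw[:256 * 3 * 2], np.float16).reshape(256, 3)
ptr = np.frombuffer(raw[256 * 3 * 2:], np.int16)
recon = pal[ptr].astype(np.float32)
```

Quantizing per basis function (rather than one global codebook) keeps the low-order, high-energy coefficients on a separate palette from the noisier high-order ones.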
B.3 Analytic Derivatives of PlenOctree Rendering
In this section, we derive the analytic derivatives of the NeRF piecewise constant volume rendering model for optimizing PlenOctrees directly. Throughout this section we will consider a fixed ray with a given origin and direction.
B.3.1 Preliminaries
For preciseness, we provide definitions of the quantities used in NeRF volume rendering. The NeRF rendering model considers a ray divided into $N$ consecutive segments with endpoints $\{t_i\}_{i=1}^{N+1}$, where $t_1$ and $t_{N+1}$ are the near and far bounds. The segments have constant densities $\sigma_1, \dots, \sigma_N$, where each $\sigma_i \ge 0$. If we shine a light of intensity $I_{N+1}$ at $t_{N+1}$, then at the camera position $t_1$, the light intensity is given by
$$I_1 = \exp\Big(-\sum_{i=1}^{N} \sigma_i \delta_i\Big)\, I_{N+1} = T_{N+1}\, I_{N+1},$$
where $\delta_i = t_{i+1} - t_i$ are segment lengths as in (3) of the main paper. Note that $T_i = \exp\big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\big)$ is also known as the accumulated transmittance from $t_1$ to $t_i$, and is the same as the definition in (1). It can be shown that this precisely models the absorption within each segment in the piecewise-constant setting.
Let $c_1, \dots, c_N$ be the colors associated with the segments, and $c_{\text{bg}}$ be the background light intensity; each is an RGB color. We are interested in the derivatives of the rendered color $\hat{C}$ with respect to $\sigma_i$ and $c_i$. Note that $c_{\text{bg}}$ (the background) is typically considered a hyperparameter.
B.3.2 Derivation of the Derivatives
From the original NeRF rendering equation (1), we can express the rendered ray color as:
$$\hat{C} = \sum_{i=1}^{N} w_i\, c_i + T_{N+1}\, c_{\text{bg}},$$
where $w_i = T_i \big(1 - e^{-\sigma_i \delta_i}\big)$ are segment weights, and $T_i = \exp\big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\big)$. Note that the background color was omitted in equation (1) of the main paper for simplicity.
Since the rendered color is a convex combination of the segment colors and the background (the weights $w_1, \dots, w_N$ and $T_{N+1}$ are non-negative and sum to $1$), it is immediately clear that
$$\frac{\partial \hat{C}}{\partial c_i} = w_i.$$
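As a sanity check of this rendering model, here is a small NumPy sketch (illustrative, not the actual renderer) that evaluates the transmittances and weights for one ray and verifies that they form a convex combination with the residual transmittance:

```python
import numpy as np

def render_ray(sigma, delta, colors, c_bg):
    """Piecewise-constant volume rendering of a single ray.

    sigma:  (N,) non-negative segment densities
    delta:  (N,) segment lengths t_{i+1} - t_i
    colors: (N, 3) per-segment RGB colors
    c_bg:   (3,) background color
    """
    # T_i = exp(-sum_{j<i} sigma_j delta_j); N+1 entries, with T_1 = 1.
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * delta)]))
    w = T[:-1] * (1.0 - np.exp(-sigma * delta))  # segment weights
    return w @ colors + T[-1] * c_bg, w, T

sigma = np.array([0.5, 2.0, 0.1, 1.0])
delta = np.full(4, 0.25)
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]], dtype=float)
C, w, T = render_ray(sigma, delta, colors, np.zeros(3))
print(w.sum() + T[-1])  # ≈ 1.0: the weight sum telescopes to 1 - T_{N+1}
```

The telescoping identity $\sum_i w_i = \sum_i (T_i - T_{i+1}) = 1 - T_{N+1}$ is exactly why the color derivative is simply the weight $w_i$.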
Handling spherical harmonics colors is straightforward by applying the chain rule, noting that the SH basis function values are constant across the ray.
The derivative with respect to $\sigma_i$ is slightly more tricky. We can write the derivative wrt. $\sigma_i$ as
$$\frac{\partial \hat{C}}{\partial \sigma_i} = \sum_{j=1}^{N} \frac{\partial w_j}{\partial \sigma_i}\, c_j + \frac{\partial T_{N+1}}{\partial \sigma_i}\, c_{\text{bg}}, \qquad (16)$$
where the derivative of the transmittance $T_j$ is
$$\frac{\partial T_j}{\partial \sigma_i} = -\delta_i\, T_j\, \mathbb{1}[i < j].$$
Here $\mathbb{1}[i < j]$ denotes an indicator function whose value is $1$ if $i < j$ or $0$ else. Basically, we can delete from the original expression any term that does not depend on $\sigma_i$, then multiply the rest by $-\delta_i$; for the $j = i$ term, $\partial w_i / \partial \sigma_i = \delta_i T_i e^{-\sigma_i \delta_i} = \delta_i T_{i+1}$. Therefore we can simplify (16) as follows:
$$\frac{\partial \hat{C}}{\partial \sigma_i} = \delta_i \Big( T_{i+1}\, c_i - \sum_{j=i+1}^{N} w_j\, c_j - T_{N+1}\, c_{\text{bg}} \Big).$$
Within the PlenOctree renderer, this gradient can be computed in two rendering passes; the second pass is needed due to the dependency on "future" weights and colors not yet seen by the ray marching process. The first pass stores the total $\sum_j w_j c_j$; the second pass then obtains each suffix sum $\sum_{j>i} w_j c_j$ by subtracting the accumulated prefix from this total. The overhead is still relatively small, and the auxiliary memory use is constant.
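The simplified density derivative, including the prefix-subtraction trick, can be checked against finite differences with a short NumPy sketch (again illustrative, not our CUDA implementation; names are ours):

```python
import numpy as np

def render(sigma, delta, colors, c_bg):
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * delta)]))
    w = T[:-1] * (1.0 - np.exp(-sigma * delta))
    return w @ colors + T[-1] * c_bg

def grad_sigma(sigma, delta, colors, c_bg):
    """Analytic dC/dsigma_i = delta_i (T_{i+1} c_i - sum_{j>i} w_j c_j - T_{N+1} c_bg)."""
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * delta)]))
    w = T[:-1] * (1.0 - np.exp(-sigma * delta))
    wc = w[:, None] * colors                # (N, 3) per-segment contributions
    total = wc.sum(0)                       # pass 1: store the full sum
    suffix = total - np.cumsum(wc, axis=0)  # pass 2: subtract the prefix
    return delta[:, None] * (T[1:, None] * colors - suffix - T[-1] * c_bg)

sigma = np.array([0.5, 2.0, 0.1, 1.0])
delta = np.full(4, 0.25)
colors = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]], dtype=float)
c_bg = np.array([0.2, 0.2, 0.2])

g = grad_sigma(sigma, delta, colors, c_bg)
# Compare against central finite differences.
eps = 1e-6
for i in range(4):
    sp, sm = sigma.copy(), sigma.copy()
    sp[i] += eps; sm[i] -= eps
    num = (render(sp, delta, colors, c_bg) - render(sm, delta, colors, c_bg)) / (2 * eps)
    assert np.allclose(g[i], num, atol=1e-5)
print("analytic gradient matches finite differences")
```

Note how `suffix` is computed exactly as described: the total from the first pass minus a running prefix, so no per-segment storage of future terms is needed.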
If there are multiple color channels, we simply sum the density derivatives over all of them. In practice, the network usually outputs a raw density value to which a ReLU is applied to obtain $\sigma$, so we also need to take care of setting the gradient to $0$ whenever the raw value is negative.
B.4 NeRF-SH Training Details
Our NeRF-SH model is built upon a JAX reimplementation of NeRF. In our experiments, we use a batch size of 1024 rays, each with 64 sampled points in the coarse volume and 128 additional sampled points in the fine volume. The model is optimized with the Adam optimizer using a learning rate that decays exponentially over the course of training. All of our models are trained for 2M iterations under the same protocol. Training takes around 50 hours to converge for each model on a single NVIDIA V100 GPU.
B.5 PlenOctree Optimization Details
After converting the NeRF-SH model into a PlenOctree, we further optimize the PlenOctree on the training set with SGD using the NeRF loss; note that we no longer apply the sparsity prior here since the octree is already sparse. For the NeRF-synthetic dataset, we use a constant learning rate and optimize for at most 80 epochs. For the Tanks&Temples dataset, we adjust the learning rate and optimize for at most 40 epochs. We apply early stopping by monitoring the PSNR on a validation set (for the Tanks&Temples dataset, we hold out 10% of the training set as a validation set used only for PlenOctree optimization). On average, it takes around 10 minutes to finish the PlenOctree optimization for each scene on a single NVIDIA V100 GPU. The entire optimization process is performed in float32 for stability, but afterwards we store the PlenOctree in float16 to reduce the model size.
Per-scene results on the NeRF-synthetic dataset (columns follow the standard scene order):

| Metric | Method | Chair | Drums | Ficus | Hotdog | Lego | Materials | Mic | Ship | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| PSNR↑ | PlenOctree from NeRF-SH | 33.19 | 25.01 | 30.56 | 36.15 | 32.12 | 29.56 | 33.01 | 28.58 | 31.02 |
| PSNR↑ | PlenOctree after fine-tuning | 34.66 | 25.31 | 30.79 | 36.79 | 32.95 | 29.76 | 33.97 | 29.42 | 31.71 |
| SSIM↑ | PlenOctree from NeRF-SH | 0.970 | 0.927 | 0.968 | 0.977 | 0.965 | 0.953 | 0.983 | 0.863 | 0.951 |
| SSIM↑ | PlenOctree after fine-tuning | 0.981 | 0.933 | 0.970 | 0.982 | 0.971 | 0.955 | 0.987 | 0.884 | 0.958 |
| LPIPS↓ | PlenOctree from NeRF-SH | 0.039 | 0.088 | 0.038 | 0.044 | 0.046 | 0.063 | 0.023 | 0.189 | 0.066 |
| LPIPS↓ | PlenOctree after fine-tuning | 0.022 | 0.076 | 0.038 | 0.032 | 0.034 | 0.059 | 0.017 | 0.144 | 0.053 |
Per-scene results on the Tanks&Temples dataset:

| Metric | Method | Barn | Caterpillar | Family | Ignatius | Truck | Mean |
|---|---|---|---|---|---|---|---|
| PSNR↑ | PlenOctree from NeRF-SH | 25.78 | 24.80 | 32.04 | 27.92 | 26.15 | 27.34 |
| PSNR↑ | PlenOctree after fine-tuning | 26.80 | 25.29 | 32.85 | 28.19 | 26.83 | 27.99 |
| SSIM↑ | PlenOctree from NeRF-SH | 0.820 | 0.889 | 0.948 | 0.940 | 0.889 | 0.897 |
| SSIM↑ | PlenOctree after fine-tuning | 0.856 | 0.907 | 0.962 | 0.948 | 0.914 | 0.917 |
| LPIPS↓ | PlenOctree from NeRF-SH | 0.296 | 0.188 | 0.094 | 0.092 | 0.180 | 0.170 |
| LPIPS↓ | PlenOctree after fine-tuning | 0.226 | 0.148 | 0.069 | 0.080 | 0.130 | 0.131 |