The task of view synthesis, using observed images to recover a 3D scene representation that can render the scene from novel unobserved viewpoints, has recently seen dramatic progress as a result of using neural volumetric representations. In particular, Neural Radiance Fields (NeRF) [29] are able to render photorealistic novel views with fine geometric details and realistic view-dependent appearance by representing a scene as a continuous volumetric function, parameterized by a multilayer perceptron (MLP) that maps from a continuous 3D position to the volume density and view-dependent emitted radiance at that location. Unfortunately, NeRF’s rendering procedure is quite slow: rendering a ray requires querying an MLP hundreds of times, such that rendering a single frame takes roughly a minute on a modern GPU. This prevents NeRF from being used for interactive view synthesis applications such as virtual and augmented reality, or even simply inspecting a recovered 3D model in a web browser.
In this paper, we address the problem of rendering a trained NeRF in real-time (see Figure 1). Our approach accelerates NeRF’s rendering procedure by three orders of magnitude, resulting in a rendering time of 12 milliseconds per frame on a single GPU. We precompute and store (i.e. “bake”) a trained NeRF into a sparse 3D voxel grid data structure, which we call a Sparse Neural Radiance Grid (SNeRG). Each active voxel in a SNeRG contains opacity, diffuse color, and a learned feature vector that encodes view-dependent effects. To render this representation, we first accumulate the diffuse colors and feature vectors along each ray. Next, we pass the accumulated feature vector through a lightweight MLP to produce a view-dependent residual that is added to the accumulated diffuse color.
We introduce two key modifications to NeRF that enable it to be effectively baked into this sparse voxel representation: 1) we design a “deferred” NeRF architecture that represents view-dependent effects with an MLP that only runs once per pixel (instead of once per 3D sample as in the original NeRF architecture), and 2) we regularize NeRF’s predicted opacity field during training to encourage sparsity, which improves both the storage cost and rendering time for the resulting SNeRG.
We demonstrate that our approach is able to increase the rendering speed of NeRF so that frames can be rendered in real-time, while retaining NeRF’s ability to represent fine geometric details and convincing view-dependent effects. Furthermore, our representation is compact, and requires less than 90 MB on average to represent a scene.
2 Related Work
Our work draws upon ideas from computer graphics to enable the real-time rendering of NeRFs. In this section, we review scene representations used for view synthesis with a specific focus on their ability to support real-time rendering, and discuss prior work in efficient representation and rendering of volumetric representations within the field of computer graphics.
Scene Representations for View Synthesis
The task of view synthesis, using observed images of an object or scene to render photorealistic images from novel unobserved viewpoints, has a rich history within the fields of graphics and computer vision. The majority of prior work in this space has used traditional 3D representations from computer graphics which are naturally amenable to efficient rendering. For scenarios where the scene is captured by densely-sampled images, light field rendering techniques [8, 14, 21] can be used to efficiently render novel views by interpolating between sampled rays. Unfortunately, the sampling and storage requirements of light field interpolation techniques are typically intractable for settings with significant viewpoint motion. Methods that aim to support free-viewpoint rendering from sparsely-sampled images typically reconstruct an explicit 3D representation of the scene. One popular class of view synthesis methods uses mesh-based representations, with either diffuse [27] or view-dependent [5, 9, 46] appearance. Recent methods have trained deep networks to increase the quality of mesh renderings, improving robustness to errors in the reconstructed mesh geometry [15, 42]. Mesh-based approaches are naturally amenable to real-time rendering with highly-optimized rasterization pipelines. However, gradient-based optimization of a rendering loss with a mesh representation is difficult, so these methods have difficulties reconstructing fine structures and detailed scene geometry.
Another popular class of view synthesis methods uses discretized volumetric representations such as voxel grids [25, 36, 37, 39] or multiplane images [12, 33, 40, 48]. While volumetric approaches are better suited to gradient-based optimization, discretized voxel representations are fundamentally limited by their cubic scaling. This restricts their usage to representing scenes at relatively low resolutions in the case of voxel grids, or rendering from a limited range of viewpoints in the case of multiplane images.
NeRF [29] proposes replacing these discretized volumetric representations with an MLP that represents a scene as a continuous neural volumetric function by mapping from a 3D coordinate to the volume density and view-dependent emitted radiance at that position. The NeRF representation has been remarkably successful for view synthesis, and follow-on works have extended NeRF for generative modeling [6, 35], dynamic scenes [22, 31], non-rigidly deforming objects [13, 32], and relighting [2, 38]. NeRF is able to represent detailed geometry and realistic appearance extremely efficiently (NeRF uses approximately 5 MB of network weights to represent each scene), but this comes at the cost of slow rendering. NeRF needs to query its MLP hundreds of times per ray, and requires roughly a minute to render a single frame. We specifically address this issue, and present a method that enables a trained NeRF to be rendered in real-time.
Recent works have explored a few strategies for improving the efficiency of NeRF’s neural volumetric rendering. AutoInt [23] designs a network architecture that automatically computes integrals along rays, which enables a piecewise ray-marching procedure that requires far fewer MLP evaluations. Neural Sparse Voxel Fields [24] stores a 3D voxel grid of latent codes, and sparsifies this grid during training to enable NeRF to skip free space during rendering. Decomposed Radiance Fields [34] represents a scene using a set of smaller MLPs instead of a single large MLP. However, these methods only achieve moderate speedups, and are therefore not suited for real-time rendering. In contrast to these methods, we specifically focus on accelerating the rendering of a NeRF after it has been trained, which allows us to leverage precomputation strategies that are difficult to incorporate during training.
Efficient Volume Rendering
Discretized volumetric representations have been used extensively in computer graphics to make rendering more efficient both in terms of storage and rendering speed. Our representation is inspired by this long line of prior work on efficient volume rendering, and we extend these approaches with a deferred neural rendering technique to model view-dependent effects.
Early works in volume rendering [17, 20, 45] primarily focused on fast rendering of dense voxel grids. However, as shown by Laine and Karras [18], sparse voxel grids can be an effective and efficient representation for opaque surfaces. In situations where large regions of space share the same data value or a prefiltered representation is required to combat aliasing, hierarchical representations such as sparse voxel octrees are a popular choice of data structure to represent this sparse volumetric content. However, for scenes with detailed geometry and appearance and scenarios where a variable level-of-detail is not required during rendering, octrees’ intermediate non-leaf nodes and the tree traversals required to query them can incur a significant memory and time overhead.
Alternatively, sparse voxel grids can be efficiently represented with hash tables [19, 30]. However, hashing each voxel independently can lead to incoherent memory fetches when traversing the representation during rendering. We make a deliberate trade-off to use a block-sparse representation, which improves memory coherence but slightly increases the size of our representation.
Our work aims to combine the reconstruction quality and view-dependence of NeRF with the speed of these efficient volume rendering techniques. We achieve this by extending deferred neural rendering [42] to volumetric scene representations. This allows us to visualize trained NeRF models in real-time on commodity hardware, with minimal quality degradation.
3 Method Overview
Our overall goal is to design a practical representation that enables the serving and real-time rendering of scenes reconstructed by NeRF. This implies three requirements: 1) Rendering a frame at the resolution used by NeRF should require less than 30 milliseconds on commodity hardware. 2) The representation should be compressible to 100 MB or less. 3) The uncompressed representation should fit within GPU memory (approximately 4 GB) and should not require streaming.
Rendering a standard NeRF in real-time is completely intractable on current hardware. NeRF requires about 100 trillion floating-point operations (100 TFLOPs) to render a single frame, which results in a best-case rendering time of 10 seconds per frame on an NVIDIA RTX 2080 GPU with full GPU utilization. To enable real-time rendering, we must therefore exchange some of this computation for storage. However, we do not want to precompute and store the entire 5D view-dependent representation [14, 21], as that would require a prohibitive amount of GPU memory.
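As a quick sanity check of the figures above, the following sketch works out the sustained throughput implied by the quoted workload and best-case time, and the speedup needed to meet the 30 millisecond budget. Only the three constants come from the text; the variable names are illustrative.

```python
# Back-of-the-envelope check of the best-case NeRF render time quoted above.
work_per_frame_flop = 100e12   # about 100 trillion floating-point operations per frame
best_case_seconds = 10.0       # quoted best-case time per frame at full GPU utilization

# Sustained throughput implied by those two figures, in operations per second.
implied_throughput = work_per_frame_flop / best_case_seconds

# Real-time budget from the stated requirement of under 30 ms per frame.
budget_seconds = 30e-3
required_speedup = best_case_seconds / budget_seconds

print(f"implied throughput: {implied_throughput / 1e12:.0f} TFLOP/s")
print(f"speedup needed for 30 ms frames: {required_speedup:.0f}x")
```

Even at full utilization, the gap to the real-time budget is more than two orders of magnitude, which is why some of the computation has to be exchanged for storage.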
We propose a hybrid approach that precomputes and stores some content in a sparse 3D data structure but defers the computation of view-dependent effects to rendering time. We jointly design a reformulation of NeRF (Section 4) as well as a procedure to bake this modified NeRF into a discrete volumetric representation that is suited for real-time rendering (Section 5).
4 Modifying NeRF for Real-time Rendering
We reformulate NeRF in three ways: 1) we limit the computation of view-dependent effects to a single network evaluation per ray, 2) we introduce a small bottleneck in the network architecture that can be efficiently stored as 8 bit integers, and 3) we introduce a sparsity loss during training, which concentrates the opacity field around surfaces in the scene. Here, we first review NeRF’s architecture and rendering procedure before describing our modifications.
4.1 Review of NeRF
NeRF represents a scene as a continuous volumetric function parameterized by an MLP. Concretely, the 3D position $\mathbf{r}(t)$ and viewing direction $\mathbf{d}$ along a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ are passed as inputs to an MLP with weights $\Theta$ to produce the volume density $\sigma$ of particles at that location as well as the RGB color $\mathbf{c}$ corresponding to the radiance emitted by particles at the input location along the input viewing direction:

$$\sigma(t),\ \mathbf{c}(t) = \mathrm{MLP}_\Theta\big(\mathbf{r}(t), \mathbf{d}\big).$$
A key design decision made in NeRF is to architect the MLP such that volume density is only predicted as a function of 3D position, while emitted radiance is predicted as a function of both 3D position and 2D viewing direction.
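To render the color of a pixel, NeRF queries the MLP at sampled positions $t_k$ along the corresponding ray $\mathbf{r}$ and uses the estimated volume densities $\sigma(t_k)$ and colors $\mathbf{c}(t_k)$ to approximate a volume rendering integral using numerical quadrature, as discussed by Max [26]:

$$\hat{\mathbf{C}}(\mathbf{r}) = \sum_k T(t_k)\,\alpha\big(\sigma(t_k)\,\delta_k\big)\,\mathbf{c}(t_k), \qquad T(t_k) = \exp\Big(-\sum_{k'=1}^{k-1} \sigma(t_{k'})\,\delta_{k'}\Big), \qquad \alpha(x) = 1 - \exp(-x),$$

where $\delta_k = t_{k+1} - t_k$ is the distance between two adjacent points along the ray.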
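The quadrature described above can be sketched in a few lines of Python. This is a minimal illustration rather than NeRF's actual implementation: the function name `render_ray` and the plain-list inputs are assumptions made for readability, and a real renderer would process batches of rays on the GPU.

```python
import math

def render_ray(densities, colors, deltas):
    """Approximate the volume rendering integral along one ray.

    densities: list of non-negative volume densities, one per sample
    colors:    list of (r, g, b) tuples, one per sample
    deltas:    list of distances between adjacent samples
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # nothing has been absorbed before the first sample
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha          # contribution of this sample
        for ch in range(3):
            pixel[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha            # light surviving past this sample
    return pixel, transmittance
```

Each sample needs its own density and color, so in the original model every term of this sum costs one MLP evaluation.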
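NeRF trains the MLP by minimizing the squared error between input pixels from a set of observed images (with known camera poses) and the pixel values predicted by rendering the scene as described above:

$$\mathcal{L}_r = \sum_i \big\lVert \mathbf{C}(\mathbf{r}_i) - \hat{\mathbf{C}}(\mathbf{r}_i) \big\rVert_2^2,$$

where $\mathbf{C}(\mathbf{r}_i)$ is the color of pixel $i$ in the input images and $\hat{\mathbf{C}}(\mathbf{r}_i)$ is the color rendered along the corresponding ray $\mathbf{r}_i$.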
By replacing a traditional discrete volumetric representation with an MLP, NeRF makes a strong space-time tradeoff: NeRF’s MLP requires multiple orders of magnitude less space than a dense voxel grid, but accessing the properties of the volumetric scene representation at any location requires an MLP evaluation instead of a simple memory lookup. Rendering a single ray that passes through the volume requires hundreds of these MLP queries, resulting in extremely slow rendering times. This tradeoff is beneficial during training; since we do not know where the scene geometry lies during optimization, it is crucial to use a compact representation that can represent highly-detailed geometry at arbitrary locations. However, after a NeRF has been trained, we argue that it is prudent to rethink this space-time tradeoff and bake the NeRF representation into a data structure that stores pre-computed values from the MLP to enable real-time rendering.
4.2 Deferred NeRF Architecture
NeRF’s MLP can be thought of as predicting a 256-dimensional feature vector for each input 3D location, which is then concatenated with the viewing direction and decoded into an RGB color. NeRF then accumulates these view-dependent colors into a single pixel color. However, evaluating an MLP at every sample along a ray to estimate the view-dependent color is prohibitively expensive for real-time rendering. Instead, we modify NeRF to use a strategy similar to deferred rendering [10, 42]. We restructure NeRF to output a diffuse RGB color $\mathbf{c}_d$ and a 4-dimensional feature vector $\mathbf{v}_s$ (which is constrained to $[0, 1]$ via a sigmoid so that it can be compressed, as discussed in Section 5.4) in addition to the volume density $\sigma$ at each input 3D location $\mathbf{r}(t)$:

$$\sigma(t),\ \mathbf{c}_d(t),\ \mathbf{v}_s(t) = \mathrm{MLP}_\Theta\big(\mathbf{r}(t)\big),$$

where $\mathrm{MLP}_\Theta$ denotes the per-sample MLP with weights $\Theta$.
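To render a pixel, we accumulate the diffuse colors and feature vectors along each ray and pass the accumulated feature vector $\mathbf{V}_s(\mathbf{r})$ and color $\hat{\mathbf{C}}_d(\mathbf{r})$, concatenated to the ray’s direction $\mathbf{d}$, to a very small MLP with parameters $\Phi$ (2 layers with 16 channels each) to produce a view-dependent residual that we add to the accumulated diffuse color:

$$\hat{\mathbf{C}}(\mathbf{r}) = \hat{\mathbf{C}}_d(\mathbf{r}) + \mathrm{MLP}_\Phi\big(\hat{\mathbf{C}}_d(\mathbf{r}), \mathbf{V}_s(\mathbf{r}), \mathbf{d}\big).$$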
This modification enables us to precompute and store the diffuse colors and 4-dimensional feature vectors within our sparse voxel grid representation discussed below. Critically, we only need to evaluate the small view-dependence MLP to produce view-dependent effects once per pixel, instead of once per sample in 3D space as in the standard NeRF model.
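To make the deferred evaluation order concrete, here is a minimal sketch of shading one pixel: composite the diffuse colors and features first, then run the small network a single time. The helper names, the ReLU hidden activations, and the linear output layer are assumptions made for illustration; the text only specifies the inputs and that the network has 2 layers of 16 channels.

```python
def composite(values, alphas):
    """Alpha-composite per-sample vectors front to back along one ray."""
    out = [0.0] * len(values[0])
    transmittance = 1.0
    for value, alpha in zip(values, alphas):
        weight = transmittance * alpha
        out = [o + weight * v for o, v in zip(out, value)]
        transmittance *= 1.0 - alpha
    return out

def dense(x, weights, biases):
    """One fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def deferred_pixel(diffuse, features, alphas, view_dir, layers):
    """Shade one pixel: composite first, then run the small MLP exactly once."""
    acc_diffuse = composite(diffuse, alphas)     # accumulated diffuse RGB
    acc_features = composite(features, alphas)   # accumulated 4-d feature vector
    h = acc_diffuse + acc_features + list(view_dir)  # concatenated MLP input
    for weights, biases in layers[:-1]:
        h = relu(dense(h, weights, biases))      # hidden layers (ReLU is assumed)
    weights, biases = layers[-1]
    residual = dense(h, weights, biases)         # view-dependent RGB residual
    return [d + r for d, r in zip(acc_diffuse, residual)]
```

The cost of the network call is independent of the number of samples along the ray, which is what makes the per-sample work reducible to memory lookups.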
4.3 Opacity Regularization
Both the rendering time and required storage for a volumetric representation strongly depend on the sparsity of opacity within the scene. To encourage NeRF’s opacity field to be sparse, we add a regularizer that penalizes predicted density using a Cauchy loss during training:

$$\mathcal{L}_s = \lambda_s \sum_{i,k} \log\Big(1 + \frac{\sigma\big(\mathbf{r}_i(t_k)\big)^2}{c}\Big),$$

where $i$ indexes pixels in the input (training) images, $k$ indexes samples along the corresponding rays, and hyperparameters $\lambda_s$ and $c$ control the magnitude and scale of the regularizer respectively (both are held fixed in all experiments). To ensure that this loss is not unevenly applied due to NeRF’s hierarchical sampling procedure, we only compute it for the “coarse” samples that are distributed with uniform density along each ray.
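A Cauchy-style sparsity penalty of this kind can be sketched as follows. The function name and the argument names `weight` and `scale` are illustrative stand-ins for the two hyperparameters, and the inputs are assumed to be the densities predicted at the coarse samples only.

```python
import math

def sparsity_loss(coarse_densities, weight, scale):
    """Cauchy regularizer summed over the coarse samples of a batch of rays.

    coarse_densities: list of rays, each a list of predicted densities
    weight, scale:    hyperparameters controlling magnitude and scale
    """
    total = 0.0
    for ray in coarse_densities:
        for sigma in ray:
            total += math.log1p(sigma * sigma / scale)  # log(1 + sigma^2 / scale)
    return weight * total
```

Because the logarithm grows slowly, large densities at true surfaces are penalized only mildly, while the many small densities in free space are pushed toward zero.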
5 Sparse Neural Radiance Grids
We now convert a trained deferred NeRF model, described above, into a representation suitable for real-time rendering. The core idea is to trade computation for storage, significantly reducing the time required to render frames. In other words, we are looking to replace the MLP evaluations in NeRF with fast lookups in a precomputed data structure. We achieve this by precomputing and storing, i.e. baking, the diffuse colors, volume densities, and 4-dimensional feature vectors in a voxel grid data structure.
It is crucial for us to store this volumetric grid using a sparse representation, as a dense voxel grid can easily fill up all available memory on a modern high-end GPU. By exploiting sparsity and only storing voxels that are both occupied and visible, we end up with a much more compact representation.
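A rough estimate shows how quickly a dense grid outgrows GPU memory. The sketch below assumes eight channels per voxel (three for diffuse color, four for the feature vector, one for density) stored at one byte each; the grid resolutions are hypothetical and chosen only to illustrate the cubic growth.

```python
def dense_grid_bytes(resolution, channels=8, bytes_per_channel=1):
    """Memory needed by a dense resolution^3 voxel grid."""
    return resolution ** 3 * channels * bytes_per_channel

for n in (256, 512, 1024):  # hypothetical grid resolutions
    print(n, f"{dense_grid_bytes(n) / 2**30:.2f} GiB")
```

Under these assumptions a 1024-cubed grid already needs 8 GiB, twice the roughly 4 GB budget stated in the requirements, and doubling the resolution multiplies the cost by eight.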
5.1 SNeRG Data Structure
[Figure panels: (a) Frame rendered by our method. (b) Cross-section of (a). (c) Trained without the sparsity loss. (d) No sparsity loss, no visibility culling.]
Our Sparse Neural Radiance Grid (SNeRG) data structure represents a dense voxel grid in a block-sparse format using two smaller dense arrays.
The first array is a 3D texture atlas containing densely-packed, fixed-size “macroblocks”, corresponding to the content (diffuse color, feature vectors, and opacity) that actually exists in the sparse volume. Each voxel in the 3D atlas represents the scene at the full resolution of the dense grid, but the 3D texture atlas is much smaller than the dense grid since it only contains the sparse “occupied” content. Compared to hashing-based data structures (which in effect use macroblocks of a single voxel), this approach helps keep spatially close content nearby in memory, which is beneficial for efficient rendering.
The second array is a low resolution indirection grid, which either stores a value indicating that the corresponding macroblock within the full voxel grid is empty, or stores an index that points to the high-resolution content of that macroblock within the 3D texture atlas. This structure crucially lets us skip blocks of empty space during rendering, as we describe below.
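The two-array layout can be illustrated with a toy block-sparse grid. This is a sketch under simplifying assumptions, not the authors' implementation: the class name, the `EMPTY` marker, and the use of Python lists in place of GPU textures are all illustrative.

```python
EMPTY = -1  # marker for an unoccupied macroblock

class BlockSparseGrid:
    """Toy block-sparse voxel grid: an indirection grid plus a packed atlas."""

    def __init__(self, grid_size, block_size):
        assert grid_size % block_size == 0
        self.block_size = block_size
        self.blocks_per_axis = grid_size // block_size
        # Low-resolution indirection grid (flattened): EMPTY or an atlas index.
        self.indirection = [EMPTY] * self.blocks_per_axis ** 3
        # Packed atlas: one dense block of voxel payloads per occupied macroblock.
        self.atlas = []

    def _block_index(self, x, y, z):
        b, n = self.block_size, self.blocks_per_axis
        return ((x // b) * n + (y // b)) * n + (z // b)

    def _local_index(self, x, y, z):
        b = self.block_size
        return ((x % b) * b + (y % b)) * b + (z % b)

    def set_voxel(self, x, y, z, payload):
        i = self._block_index(x, y, z)
        if self.indirection[i] == EMPTY:   # allocate the block on first write
            self.indirection[i] = len(self.atlas)
            self.atlas.append([None] * self.block_size ** 3)
        self.atlas[self.indirection[i]][self._local_index(x, y, z)] = payload

    def get_voxel(self, x, y, z):
        slot = self.indirection[self._block_index(x, y, z)]
        if slot == EMPTY:
            return None                    # whole macroblock is empty
        return self.atlas[slot][self._local_index(x, y, z)]
```

A lookup costs one read from the small indirection grid plus, for occupied blocks, one read from the atlas, and neighboring voxels of a block stay contiguous in memory.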
5.2 Rendering
We render a SNeRG using a ray-marching procedure, as done in NeRF. The critical differences that enable real-time rendering are: 1) we precompute the diffuse colors and feature vectors at each 3D location, allowing us to look them up within our data structure instead of evaluating an MLP, and 2) we only evaluate an MLP to produce view-dependent effects once per pixel, as opposed to once per 3D location.
To estimate the color of each ray, we first march the ray through the indirection grid, skipping macroblocks that are marked as empty. For macroblocks that are occupied, we step at the voxel width through the corresponding block in the 3D texture atlas, and use trilinear interpolation to fetch values at each sample location. We further accelerate rendering and conserve memory bandwidth by only fetching features where the volume density is non-zero. We use standard alpha compositing to accumulate the diffuse color and features, terminating ray-marching once the opacity has saturated. Finally, we compute the view-dependent specular color for the ray by evaluating the view-dependence MLP with the accumulated color, feature vector and the ray’s viewing direction. We then add the resulting residual color to the accumulated diffuse color, as described in Equation 7.
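The marching loop with empty-space skipping and early termination might look roughly like the sketch below. The callbacks `is_block_empty` and `sample_voxel` are hypothetical stand-ins for the indirection-grid lookup and the trilinearly interpolated atlas fetch, distances are measured in voxel units, `direction` is assumed to be a unit vector, and the saturation threshold is an illustrative value.

```python
import math

def march_ray(origin, direction, t_max, block_size, is_block_empty, sample_voxel,
              saturation=0.999):
    """March one ray, skipping empty macroblocks and stopping once opaque."""
    color = [0.0, 0.0, 0.0]
    opacity = 0.0
    t = 0.0
    while t < t_max and opacity < saturation:
        p = [o + t * d for o, d in zip(origin, direction)]
        block = [math.floor(c / block_size) for c in p]
        if is_block_empty(*block):
            # Distance along the ray to the nearest exit face of this macroblock.
            exits = []
            for c, d, b in zip(p, direction, block):
                if d > 0:
                    exits.append(((b + 1) * block_size - c) / d)
                elif d < 0:
                    exits.append((b * block_size - c) / d)
            t += min(exits) + 1e-3   # small nudge to land inside the next block
            continue
        alpha, rgb = sample_voxel(p)
        weight = (1.0 - opacity) * alpha   # front-to-back alpha compositing
        color = [acc + weight * ch for acc, ch in zip(color, rgb)]
        opacity += weight
        t += 1.0                           # step by one voxel width
    return color, opacity
```

In the full method the same loop would also accumulate the feature vector, and the accumulated values would then be passed once to the view-dependence network.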
5.3 Baking
To minimize storage cost and rendering time, our baking procedure aims to only allocate storage for voxels in the scene that are both non-empty and visible in at least one of the training views. We start by densely evaluating the NeRF network for the full voxel grid. We convert NeRF’s unbounded volume density values, $\sigma$, to traditional opacity values $\alpha = 1 - \exp(-\sigma v)$, where $v$ is the width of a voxel. Next, we sparsify this voxel grid by culling empty space, i.e. macroblocks where the maximum opacity is low (below an opacity threshold), and culling macroblocks for which the voxel visibilities are low (maximum transmittance between the voxel and all training views is below a visibility threshold). Both thresholds are held fixed in all experiments. Finally, we compute an anti-aliased estimate for the content in the remaining macroblocks by densely evaluating the trained NeRF at 16 Gaussian distributed locations within each voxel and averaging the resulting diffuse colors, feature vectors, and volume densities.
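The density-to-opacity conversion and the two culling tests can be sketched as below. The function names and the threshold parameters are illustrative, and taking the maximum visibility over the voxels of a block is an assumption about how the per-voxel criterion is aggregated per macroblock.

```python
import math

def density_to_opacity(sigma, voxel_width):
    """Convert an unbounded volume density into an opacity in [0, 1)."""
    return 1.0 - math.exp(-sigma * voxel_width)

def keep_macroblock(opacities, visibilities, min_opacity, min_visibility):
    """Decide whether a macroblock survives culling.

    opacities:    per-voxel opacity values inside the block
    visibilities: per-voxel maximum transmittance over all training views
    """
    occupied = max(opacities) >= min_opacity
    visible = max(visibilities) >= min_visibility  # aggregation is an assumption
    return occupied and visible
```

Blocks that fail either test are simply never written to the texture atlas, so they cost neither storage nor rendering time.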
5.4 Compression
We quantize all values in the baked SNeRG representation to 8 bits and separately compress the indirection grid and the 3D texture atlas. We compress each slice of the indirection grid as a lossless PNG, and we compress the 3D texture atlas as either a set of lossless PNGs, a set of JPEGs, or as a single video encoded with H264. The quality versus storage tradeoff of this choice is evaluated in Table 3. For synthetic scenes, the compression rate achieved on the texture atlas differs between PNG, JPEG, and H264. We specifically choose the macroblock size to align the 3D texture atlas macroblocks with the blocks used in image compression. This reduces the size of the compressed 3D texture atlas because additional coefficients are not needed to represent discontinuities between macroblocks.
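The 8-bit quantization step can be illustrated with a simple uniform quantizer. This sketch assumes the stored values have already been mapped to the range [0, 1], as the sigmoid-constrained feature vectors are; the function names are illustrative.

```python
def quantize_8bit(value):
    """Map a value in [0, 1] to an integer code in 0..255."""
    clamped = min(max(value, 0.0), 1.0)
    return int(round(clamped * 255.0))

def dequantize_8bit(code):
    """Map an integer code in 0..255 back to a value in [0, 1]."""
    return code / 255.0
```

The round trip introduces an error of at most half a quantization step (1/510), which is small per voxel but accumulates along a ray and through the shading network.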
5.5 Fine-tuning
While the compression and quantization procedure described above is crucial for making SNeRG compact and easy to distribute, the quality of images rendered from the baked SNeRG is lower than the quality of images rendered from the corresponding deferred NeRF. Figure 5 visualizes how quantization affects view-dependent effects by biasing renderings towards a darker, diffuse-only color.
Fortunately, we are able to recoup almost all of that lost accuracy by fine-tuning the weights of the deferred per-pixel shading MLP to improve the final rendering quality (Table 2). We optimize its parameters to minimize the squared error between the observed input images used to train the deferred NeRF and the images rendered from our SNeRG. We use the Adam optimizer [16] and optimize for 100 epochs.
6 Implementation Details
Our deferred NeRF model is based on JAXNeRF [11], an implementation of NeRF in JAX [3]. As in NeRF, we apply a positional encoding [41] to positions and view directions. We train all networks for 250k iterations with a learning rate that decays log-linearly over the course of training. To improve stability we use JAXNeRF’s “warm up” functionality to reduce the learning rate for the first 2500 iterations, and we clip gradients first by value and then by norm. We use a batch size of 8,192 for synthetic scenes and a batch size of 16,384 for real scenes.
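For readers unfamiliar with these two training details, the sketch below shows one common way to implement a log-linear learning rate decay and value-then-norm gradient clipping. It is not taken from JAXNeRF; the function names and every numeric value used with them are hypothetical.

```python
import math

def log_linear_lr(step, max_steps, lr_init, lr_final):
    """Interpolate linearly in log space between lr_init and lr_final."""
    t = min(max(step / max_steps, 0.0), 1.0)
    return math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))

def clip_gradients(grads, max_value, max_norm):
    """Clip each gradient entry by value, then rescale by the overall norm."""
    clipped = [min(max(g, -max_value), max_value) for g in grads]
    norm = math.sqrt(sum(g * g for g in clipped))
    if norm > max_norm:
        clipped = [g * max_norm / norm for g in clipped]
    return clipped
```

With a log-linear schedule the learning rate at the halfway point is the geometric mean of the initial and final rates, so most of training is spent at comparatively small step sizes.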
As our rendering time is independent of the size of the per-sample MLP, we can afford to use a larger network for our experiments. To this end, we base our method on the JAXNeRF+ model, which was trained with 576 samples per ray (192 coarse, 384 fine) and uses 512 channels per layer in that MLP.
[Figure panels: (a) JAXNeRF+ (23.00). (b) Deferred (22.75). (c) SNeRG (21.45). (d) Ground Truth.]
7 Experiments
We validate our design decisions with an extensive set of ablation studies and comparisons to recent techniques for accelerating NeRF. Our experiments primarily focus on free-viewpoint rendering of 360° scenes (scenes captured by inwards-facing cameras on the upper hemisphere). Though acceleration techniques already exist for the special case in which all cameras face the same direction (see Broxton et al. [4]), 360° scenes represent a challenging and general use-case that has not yet been addressed. In Figure 7, we show an example of a real 360° scene and present more results in our supplement, including the forward-facing scenes from Local Light Field Fusion (LLFF) [28].
We evaluate all ablations and baseline methods according to three criteria: render-time performance (measured by frames per second as well as GPU memory consumption in gigabytes), storage cost (measured by megabytes required to store the compressed representation), and rendering quality (measured using the PSNR, SSIM [44], and LPIPS image quality metrics). It is important to explicitly account for power consumption when evaluating performance: algorithms that are fast on a high performance GPU are not necessarily fast on a laptop. We therefore adopt the convention used by the high performance graphics community of measuring performance relative to power consumption, i.e. FPS per watt, or equivalently, frames per joule [1].
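The equivalence between FPS per watt and frames per joule follows from one watt being one joule per second. The short sketch below makes the unit conversion explicit; the function name is illustrative and the example inputs are round figures chosen for illustration, not measured results.

```python
def frames_per_joule(frames_per_second, power_watts):
    """FPS per watt; since 1 W = 1 J/s this equals frames per joule."""
    return frames_per_second / power_watts

# Illustrative comparison: a slower but frugal GPU can win on this metric.
laptop = frames_per_joule(frames_per_second=30.0, power_watts=85.0)
workstation = frames_per_joule(frames_per_second=60.0, power_watts=300.0)
print(f"laptop: {laptop:.3f} frames/J, workstation: {workstation:.3f} frames/J")
```

Normalizing by power in this way keeps a comparison between laptop and workstation hardware from being decided by the size of the GPU alone.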
Please refer to our video for screen captures of our technique being used for real-time rendering on a laptop.
7.1 Ablation Studies
In Table 1, we ablate combinations of three components of our method that primarily affect speed and GPU memory usage. Ablation 1 shows that removing the view-dependence MLP has a minimal effect on runtime performance. Ablation 2 shows that removing the sparsity loss greatly increases (uncompressed) memory usage. Ablation 3 shows that switching from our “deferred” rendering back to NeRF’s approach of querying an MLP at each sample along the ray results in prohibitively large render times.
Table 2 and Figure 6 show the impact on rendering quality of each of our design decisions in building a representation suitable for real-time rendering. Although our simplifications of using a deferred rendering scheme (“Deferred”) and a smaller network architecture (“Tinyview”) for view-dependent appearance do slightly reduce rendering quality, they are crucial for enabling real-time rendering, as discussed above. Note that the initial impact on quality from quantizing and compressing our representation is significant. However, after fine tuning (“FT”), the final rendering quality of our SNeRG model remains competitive with the neural model from which it was derived (“Deferred”).
In Table 3 we explore the impact of various compression schemes on disk storage space requirements. Our sparse voxel grid benefits greatly from applying compression techniques such as JPEG or H264 to its 3D texture atlas, achieving a file size considerably more compact than a naive 32-bit float array while sacrificing less than 1 dB of PSNR. Because our sparsity loss concentrates opaque voxels around surfaces (see Figure 4), ablating it significantly increases model size. Our compressed SNeRG representations are small enough to be quickly loaded in a web page.
The positive impact of training our models using the sparsity loss is visible across these ablations — it more than doubles rendering speed, halves the storage requirements of both the compressed representation on disk and the uncompressed representation in GPU memory, and minimally impacts rendering quality.
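| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| SNeRG (PNG, no sparsity loss) | 30.22 | 0.949 | 0.050 |
| SNeRG (PNG, no FT) | 26.68 | 0.930 | 0.053 |

| Method | PSNR | SSIM | LPIPS | MB |
|---|---|---|---|---|
| SNeRG (PNG, no sparsity loss) | 30.22 | 0.949 | 0.050 | 176.0 |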
7.2 Baseline Comparisons
As shown in Table 4, the quality of our method is comparable to all other methods, while our run-time performance is an order of magnitude faster than the fastest competing approach (Neural Volumes [25]) and more than a thousand times faster than the slowest (NeRF). Note that we measure the run-time rendering performance of our method on a laptop with an 85W mobile GPU, while all other methods are run on servers or workstations equipped with much more powerful GPUs (with a considerably higher power draw).
8 Conclusion
We have presented a technique for rendering Neural Radiance Fields in real-time by precomputing and storing a Sparse Neural Radiance Grid. This SNeRG uses a sparse voxel grid representation to store the precomputed scene geometry, but keeps storage requirements reasonable by maintaining a neural representation for view-dependent appearance. Rendering is accelerated by evaluating the view-dependent shading network only on the visible parts of the scene, achieving over 30 frames per second on a laptop GPU for typical NeRF scenes. We hope this ability to render neural volumetric representations such as NeRF in real time on commodity graphics hardware will help increase the adoption of these neural scene representations in vision and graphics applications.
References
- [1] Tomas Akenine-Möller and Björn Johnsson. Performance per what? Journal of Computer Graphics Techniques, 2012.
- [2] Sai Bi, Zexiang Xu, Pratul P. Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. Neural reflectance fields for appearance acquisition. arXiv:2008.03824, 2020.
- [3] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. http://github.com/google/jax.
- [4] Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew DuVall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive light field video with a layered mesh representation. ACM Transactions on Graphics, 2020.
- [5] Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. SIGGRAPH, 2001.
- [6] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. CVPR, 2021.
- [7] Cyril Crassin, Fabrice Neyret, Sylvain Lefebvre, and Elmar Eisemann. GigaVoxels: Ray-guided streaming for efficient and detailed voxel rendering. Symposium on Interactive 3D Graphics and Games, 2009.
- [8] Abe Davis, Marc Levoy, and Fredo Durand. Unstructured light fields. Computer Graphics Forum, 2012.
- [9] Paul Debevec, C. J. Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach. SIGGRAPH, 1996.
- [10] Michael Deering, Stephanie Winner, Bic Schediwy, Chris Duffy, and Neil Hunt. The triangle processor and normal vector shader: A VLSI system for high performance graphics. SIGGRAPH, 1988.
- [11] Boyang Deng, Jonathan T. Barron, and Pratul P. Srinivasan. JaxNeRF: an efficient JAX implementation of NeRF, 2020. http://github.com/google-research/google-research/tree/master/jaxnerf.
- [12] John Flynn, Michael Broxton, Paul Debevec, Matthew DuVall, Graham Fyffe, Ryan Overbeck, Noah Snavely, and Richard Tucker. DeepView: View synthesis with learned gradient descent. CVPR, 2019.
- [13] Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4D facial avatar reconstruction. CVPR, 2021.
- [14] Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. The lumigraph. SIGGRAPH, 1996.
- [15] Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. Deep blending for free-viewpoint image-based rendering. ACM Transactions on Graphics, 2018.
- [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
- [17] Philippe Lacroute and Marc Levoy. Fast volume rendering using a shear-warp factorization of the viewing transformation. SIGGRAPH, 1994.
- [18] Samuli Laine and Tero Karras. Efficient sparse voxel octrees. I3D, 2010.
- [19] Sylvain Lefebvre and Hugues Hoppe. Perfect spatial hashing. ACM Transactions on Graphics, 2006.
- [20] Marc Levoy. Efficient ray tracing of volume data. ACM Transactions on Graphics, 1990.
- [21] Marc Levoy and Pat Hanrahan. Light field rendering. SIGGRAPH, 1996.
- [22] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. CVPR, 2021.
- [23] David B. Lindell, Julien N.P. Martel, and Gordon Wetzstein. AutoInt: Automatic integration for fast neural rendering. CVPR, 2021.
- [24] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. NeurIPS, 2020.
- [25] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. SIGGRAPH, 2019.
- [26] Nelson Max. Optical models for direct volume rendering. IEEE TVCG, 1995.
- [27] Michael Waechter, Nils Moehrle, and Michael Goesele. Let there be color! Large-scale texturing of 3D reconstructions. ECCV, 2014.
- [28] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima K. Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics, 2019.
- [29] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. ECCV, 2020.
- [30] Matthias Nießner, Michael Zollhofer, Shahram Izadi, and Marc Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics, 2013.
- [31] Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural scene graphs for dynamic scenes. CVPR, 2021.
- [32] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Deformable neural radiance fields. arXiv:2011.12948, 2020.
- [33] Eric Penner and Li Zhang. Soft 3D reconstruction for view synthesis. ACM Transactions on Graphics, 2017.
- [34] Daniel Rebain, Wei Jiang, Soroosh Yazdani, Ke Li, Kwang Moo Yi, and Andrea Tagliasacchi. DeRF: Decomposed radiance fields. CVPR, 2021.
- [35] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3D-aware image synthesis. NeurIPS, 2020.
- [36] Steven M. Seitz and Charles R. Dyer. Photorealistic scene reconstruction by voxel coloring. IJCV, 1999.
- [37] Vincent Sitzmann, Michael Zollhoefer, and Gordon Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. NeurIPS, 2019.
- [38] Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, and Jonathan T. Barron. NeRV: Neural reflectance and visibility fields for relighting and view synthesis. CVPR, 2021.
- [39] Pratul P. Srinivasan, Ben Mildenhall, Matthew Tancik, Jonathan T. Barron, Richard Tucker, and Noah Snavely. Lighthouse: Predicting lighting volumes for spatially-coherent illumination. CVPR, 2020.
- [40] Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. CVPR, 2019.
- [41] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 2020.
- [42] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics, 2019.
- [43] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P. Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. CVPR, 2021.
- [44] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 2004.
- [45] Lee A Westover. Splatting: A parallel, feed-forward volume rendering algorithm. Technical report, University of North Carolina at Chapel Hill, USA, 1991.
- [46] Daniel Wood, Daniel Azuma, Wyvern Aldinger, Brian Curless, Tom Duchamp, David Salesin, and Werner Stuetzle. Surface light fields for 3D photography. SIGGRAPH, 2000.
-  Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. CVPR, 2018.
-  Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Transactions on Graphics, 2018.
Appendix A WebGL Implementation Details
Our web renderer is implemented in WebGL using the THREE.js library. To conserve memory bandwidth, we load the 3D texture atlas as three separate 8-bit 3D textures: one for alpha, one for RGB and one for features. We load the indirection grid as a low resolution 8-bit 3D texture.
During ray marching, we first query the indirection grid at the current location along the ray. If this value indicates that the macroblock is empty, we use a ray-box intersection test to skip ahead to the next macroblock along the ray.
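The skip over an empty macroblock can be illustrated with a standard slab-based ray-box test. This is a minimal sketch and not the shader code itself: the function names, the `eps` nudge, and the assumption that the current sample already lies inside the macroblock are all illustrative choices.

```python
import math


def ray_box_exit(origin, direction, box_min, box_max):
    """Return the ray parameter t at which the ray leaves an axis-aligned box.

    Slab method. Assumes `origin` lies inside (or on the boundary of) the box.
    """
    t_exit = math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            continue  # parallel to this pair of planes, so it never crosses them
        # Distance along the ray to the plane it is travelling towards.
        t_far = ((hi if d > 0.0 else lo) - o) / d
        t_exit = min(t_exit, t_far)
    return t_exit


def skip_empty_macroblock(t, origin, direction, block_min, block_max, eps=1e-4):
    """Advance the ray parameter t past the (empty) macroblock containing the sample.

    `eps` is a hypothetical small nudge so the next sample lands in the
    neighbouring macroblock rather than exactly on the shared face.
    """
    position = [o + t * d for o, d in zip(origin, direction)]
    return t + ray_box_exit(position, direction, block_min, block_max) + eps
```

In a full ray marcher, this function would be called whenever the coarse grid lookup reports an empty macroblock; otherwise the marcher would take a regular step.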
For non-empty macroblocks, we first query the alpha texture using nearest neighbor interpolation. If alpha is zero, the current voxel within the macroblock contains empty space, and we do not fetch any additional information. If alpha is non-zero, we use trilinear interpolation to fetch the high-resolution alpha, colors and features at that voxel. This reduces the bandwidth requirement from 64 bytes per sample to 1 byte per sample for rays that are traversing empty space inside each occupied macroblock.
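One way to account for the 64-byte and 1-byte figures is sketched below. It assumes that a trilinear fetch touches 8 neighbouring texels and that each texel stores 1 byte of alpha, 3 bytes of RGB, and 4 bytes of features; the feature channel count is an assumption not stated in this appendix.

```python
# Assumed per-texel storage in bytes (8-bit channels).
ALPHA_BYTES = 1
RGB_BYTES = 3
FEATURE_BYTES = 4  # assumption: four 8-bit feature channels
TRILINEAR_TAPS = 8  # a trilinear fetch reads the 8 surrounding texels


def full_fetch_bytes():
    """Bytes read when alpha, color, and features are all fetched trilinearly."""
    return TRILINEAR_TAPS * (ALPHA_BYTES + RGB_BYTES + FEATURE_BYTES)


def empty_sample_bytes():
    """Bytes read for a sample whose nearest-neighbor alpha is zero."""
    return ALPHA_BYTES


print(full_fetch_bytes(), empty_sample_bytes())
```

Under these assumptions, a sample in empty space costs a single nearest-neighbor alpha read, while an occupied sample pays for the full trilinear fetch.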
We implement the view-dependence MLP as simple nested for-loops in a GLSL shader. We load the network weights as 32-bit floating point textures and hard-code the network biases directly into the shader. Interestingly, we found that reducing precision below 32 bits did not noticeably improve rendering performance. For added efficiency, we only evaluate the view-dependence MLP for pixels that have non-zero accumulated alpha.
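The nested-loop structure can be sketched in Python as follows. This is an illustration only, not the GLSL shader: the layer sizes, the ReLU activation, and the choice of concatenating the accumulated features with the view direction as input are assumptions. Only the once-per-pixel evaluation, the skip for zero accumulated alpha, and the residual added to the diffuse color are taken from the text.

```python
def dense(inputs, weights, biases):
    """Fully connected layer written as two nested loops.

    `weights[i][j]` connects input i to output j.
    """
    outputs = []
    for j in range(len(biases)):
        acc = biases[j]
        for i in range(len(inputs)):
            acc += inputs[i] * weights[i][j]
        outputs.append(acc)
    return outputs


def shade_pixel(diffuse, features, view_dir, alpha, layers):
    """Add a view-dependent residual to the accumulated diffuse color.

    `layers` is a list of (weights, biases) pairs; the last layer must have
    three outputs. Pixels with zero accumulated alpha skip the MLP entirely.
    """
    if alpha == 0.0:
        return list(diffuse)
    x = list(features) + list(view_dir)  # assumed input layout
    for k, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases)
        if k < len(layers) - 1:
            x = [v if v > 0.0 else 0.0 for v in x]  # ReLU on hidden layers (assumed)
    return [c + r for c, r in zip(diffuse, x)]
```

Because the MLP runs once per pixel rather than once per sample, its cost is independent of the number of samples along the ray.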
Appendix B Performance Measurement
We measure performance using the Chrome browser running on a 2019 MacBook Pro laptop equipped with an 85-watt AMD Radeon Pro 5500M GPU (8 GB of GPU RAM).
For accurate performance measurements, we make sure the laptop is connected to the charger, close all other applications on the laptop, and restart our browser to disable frame-rate limiting from vertical synchronization:
--args --disable-gpu-vsync --disable-frame-rate-limit
In our results, we report the average frame time for rendering a 150-frame camera animation orbiting the scene (or rotating within the camera plane for forward-facing scenes). Our test-time renderings use the same image resolutions and camera intrinsics as the input training images:
field-of-view at for Synthetic 360 scenes,
field-of-view at for Real Forward-Facing, and
at field-of-view for Real 360 scenes.
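The reported number is a plain average over the frames of the camera animation. A minimal sketch of that computation, assuming per-frame times in milliseconds have already been recorded (the function names are illustrative):

```python
def average_frame_time_ms(frame_times_ms):
    """Mean frame time over a recorded animation, in milliseconds."""
    if not frame_times_ms:
        raise ValueError("no frames recorded")
    return sum(frame_times_ms) / len(frame_times_ms)


def frames_per_second(avg_frame_time_ms):
    """Convert an average frame time in milliseconds to frames per second."""
    return 1000.0 / avg_frame_time_ms
```

Averaging frame times, rather than averaging instantaneous frame rates, keeps slow frames from being under-weighted.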
Appendix C Additional Experiment Details
c.1 Experiments with Changing 3D Resolution
Table 5 demonstrates that our method is able to achieve even higher rendering speeds and lower storage costs by baking the 3D grids at a lower resolution, at the expense of a slight decrease in rendering quality.
c.2 Real 360 Scenes
We evaluate our method on the two real 360 scenes provided by the original NeRF paper (Flowers and Pine Cone) and two new scenes that we have captured ourselves (Toy Car and Spheres). All four datasets contain 100-200 images where the camera orbits around an object. Note that the Spheres scene contains glossy objects that are hard to model using diffuse geometry alone.
Tables 6, 7, and 8 demonstrate that our method is able to maintain rendering quality close to the trained NeRF models while rendering about 30 frames per second (Table 9). Table 12 studies the impact of using different image and video compression algorithms for these datasets, and shows that we are able to store these scenes using about 50 MB.
We train all NeRF models on this data by shifting and scaling the camera translations so that they approximately lie on a sphere around the origin, and by sampling points linearly in disparity along each camera ray, as done by Mildenhall et al. After training, we manually set a bounding box to isolate the objects of interest in the scene and ignore the unbounded peripheral content that is not sampled well enough for NeRF to recover. During baking, we only evaluate the subset of the scene that lies inside this bounding box. We adapt our quality measurements accordingly, masking all of the images (our results, baseline results, and ground truth images) using the alpha mask generated by our method. Otherwise, the results would be significantly biased by the missing background geometry outside the scene bounding box. Interestingly, we find that a diffuse-only model without any view-dependent effects is surprisingly competitive for these scenes, potentially due to the low-frequency lighting conditions during capture. Additionally, the diffuse model is able to reasonably fake view-dependent effects in some cases by hiding mirrored versions of reflected content inside the objects’ surfaces.
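The two preprocessing steps at the start of this paragraph can be sketched as follows. This is an illustration under stated assumptions: centering on the mean camera position and scaling by the mean distance is one plausible reading of "shifting and scaling", and the `near` and `far` bounds are hypothetical parameters.

```python
import math


def normalize_camera_positions(positions):
    """Shift and scale camera positions so they lie roughly on the unit sphere.

    Assumption: shift by the mean position, then divide by the mean distance
    to the origin.
    """
    n = len(positions)
    center = [sum(p[k] for p in positions) / n for k in range(3)]
    shifted = [[p[k] - center[k] for k in range(3)] for p in positions]
    mean_radius = sum(math.sqrt(sum(c * c for c in p)) for p in shifted) / n
    return [[c / mean_radius for c in p] for p in shifted]


def sample_linear_in_disparity(near, far, num_samples):
    """Depths along a ray that are evenly spaced in inverse depth (disparity)."""
    if num_samples < 2:
        raise ValueError("need at least two samples")
    depths = []
    for i in range(num_samples):
        s = i / (num_samples - 1)
        disparity = (1.0 - s) / near + s / far
        depths.append(1.0 / disparity)
    return depths
```

Sampling linearly in disparity places more samples close to the camera, which suits scenes whose content extends far into the background.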
Figure: (a) JAXNeRF+ (25.50), (b) Deferred (24.91), (c) SNeRG (24.43), (d) Ground Truth.
c.3 Real Forward-Facing Scenes
We also evaluate our approach on the real forward-facing scenes in the NeRF paper (Tables 10, 11, and 12). Since these scenes are only captured and viewed from a limited range of forward-facing viewpoints, layered representations such as multi-plane images [12, 28, 33, 40, 48] are a compelling option for real-time rendering. Note that the normalized device coordinate transformation used in NeRF for these forward-facing scenes can be interpreted as transforming NeRF into a continuous version of a multiplane image representation that supports larger viewpoint changes.
We found that our baking procedure sometimes reduces the total alpha mass in the scene, introducing small semi-transparent holes for these datasets. To overcome this, we partially un-premultiply alpha after ray marching. That is, after alpha compositing:
if . This fully saturates alpha values above 0.66, while still allowing for soft edges and a smooth fall-off.
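One partial un-premultiplication that matches the described behaviour is sketched below. The exact form is an assumption: the gain of 1.5 is inferred from the 0.66 saturation threshold, and the function name is illustrative.

```python
def partially_unpremultiply(color, alpha, gain=1.5):
    """Boost a premultiplied color and its alpha after alpha compositing.

    The new alpha is min(1, gain * alpha), so alpha values of 1 / gain
    (about 0.66 for gain = 1.5) or more become fully opaque, while smaller
    values keep a smooth fall-off. The color is scaled by the same factor
    so that it stays consistent with the new alpha.
    """
    if alpha <= 0.0:
        return list(color), alpha  # nothing accumulated along this ray
    new_alpha = min(1.0, gain * alpha)
    scale = new_alpha / alpha
    return [c * scale for c in color], new_alpha
```

For fully saturated pixels this reduces to ordinary un-premultiplication (division by alpha), which fills the small semi-transparent holes without hardening soft edges.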
Here we provide additional details for the baseline methods we use in our experiments.
NeRF  We directly use the results reported in the original paper by Mildenhall et al. Run-time was measured on a single NVIDIA V100 GPU.
JAXNeRF  is a JAX implementation of NeRF, with default settings (64 + 128 samples per ray, MLP width of 256). Run-time was measured on an NVIDIA V100 GPU.
JAXNeRF+ is a more compute-intensive version of JAXNeRF, trained with 192 + 384 samples per ray and an MLP width of 512 channels. Run-time was measured on a single NVIDIA V100 GPU. We use this architecture as a starting point for our modifications (deferred shading and baking), as using more samples per ray allows us to recover a sparser representation that better concentrates opacity near object surfaces.
JAXNeRF+ Tinyview This baseline measures the effects of using a smaller network architecture (same as ) for the view-dependent effects. It uses the same architecture for the view-dependent effects as our “Deferred” model, but evaluates view-dependent effects for every 3D sample instead of once per pixel.
JAXNeRF+ Diffuse This baseline measures the effects of modeling view-dependent appearance. It uses the same architecture as JAXNeRF+, but replaces the view dependence network with a single layer that directly outputs a color without any knowledge of the viewing direction.
AutoInt  We use the N=8 setting reported by Lindell et al., which achieves their highest ratio of quality to run-time. The authors did not mention what hardware they used, but we assume that they also ran on an NVIDIA V100 GPU since they directly compare to NeRF run-times.
Neural Volumes  We copy the rendering quality results reported in the NeRF paper and copy the rendering run-time results reported in the AutoInt paper. We assume that the run-times reported in the AutoInt paper are measured on an NVIDIA V100 GPU since the AutoInt paper directly compares these results with NeRF run-times.
NSVF  We use the average run-time of 1.537 seconds per frame reported by the authors, using early stopping. Performance was measured on an NVIDIA V100 GPU.
DeRF  We use the DeRF model with 8 heads and 96 channels per head, which achieves the highest ratio of quality to run-time according to the results in their paper. Run-times were measured on an NVIDIA V100 GPU.
IBRNet  We use the highest quality results (per-scene fine-tuned) in their paper. Run-times were estimated by scaling the NVIDIA V100 GPU NeRF run-times according to the TFLOPs in Table 3 of their paper.
LLFF  Run-times were measured using the original CUDA implementation on an RTX 2080 Ti (250W).
c.6 Per-Scene Quality and Performance Metrics
Tables 13-15 provide a per-scene breakdown of the quality metrics for the Synthetic 360 scenes. Similar breakdowns for the Real Forward-Facing scenes can be found in Tables 16-18. Table 19 shows the per-scene frame time and Table 20 shows the per-scene GPU memory consumption for our performance ablations: 1) removing the view-dependence MLP, 2) removing the sparsity loss, and 3) switching from “deferred” rendering back to querying an MLP at each sample along the ray.
|Synthetic 360||Real Forward-Facing||Real 360|