The ability to render 3D scenes from arbitrary viewpoints represents a significant step in the evolution of digital multimedia, with applications in mixed reality media, graphic effects, design, and simulation. Such renderings are often based on a number of high-resolution images of an original scene, and it is clear that to enable many applications, this data will need to be stored and transmitted efficiently over low-bandwidth channels (e.g., to a mobile phone for augmented reality).
Traditionally, the need to compress this data is viewed as separate from rendering. For example, light field images (LFI) consist of a set of images taken from multiple viewpoints. To compress the original views, standard video compression methods such as HEVC (Sullivan et al., 2012) are often repurposed (Jiang et al., 2017; Barina et al., 2019). Since the range of views is narrow, light field images can be effectively reconstructed by “blending” a smaller set of representative views (Astola & Tabus, 2018; Jiang et al., 2017; Zhao et al., 2018; Bakir et al., 2018; Jia et al., 2019). Blending-based approaches, however, may not be suitable for the more general case of arbitrary-viewpoint 3D scenes, where a very diverse set of original views increases the severity of occlusions and would thus require storing a prohibitively large number of views to be effective.
A promising avenue for representing more complete 3D scenes is through neural representation functions, which have shown a remarkable improvement in rendering quality (Mildenhall et al., 2020; Sitzmann et al., 2019; Liu et al., 2020; Schwarz et al., 2020). In such approaches, views of a scene are rendered by evaluating the representation function at sampled spatial coordinates and then applying a differentiable rendering process. Such methods are often referred to as implicit representations, since they do not explicitly specify the surface locations and properties within the scene, which would be required to apply conventional rendering techniques like rasterization (Akenine-Möller et al., 2019). However, finding the representation function for a given scene requires training a neural network. This makes this class of methods difficult to use for rendering in the existing framework, since training is computationally infeasible on a low-powered end device such as a mobile phone, which is often on the receiving side. Due to the data processing inequality, it may also be inefficient to compress the original views (the training data) rather than the trained representation itself, because the training process may discard some information that is ultimately not necessary for rendering (such as redundancy in the original views, noise, etc.).
In this work, we propose to apply neural representation functions to the scene compression problem by compressing the representation function itself. We use the NeRF model (Mildenhall et al., 2020), a method which has demonstrated the ability to produce high-quality renders of novel views, as our representation function. To reduce redundancy of information in the model, we build upon the model compression approach of Oktay et al. (2020), applying an entropy penalty to the set of discrete reparameterized neural network weights. The compressed NeRF (cNeRF) describes a radiance field, which is used in conjunction with a differentiable neural renderer to obtain novel views (see Fig. 1). To verify the proposed method, we construct a strong baseline method based on the approaches seen in the field of light field image compression. cNeRF consistently outperforms the baseline method, producing simultaneously superior renders and lower bitrates. We further show that cNeRF can be improved in the low bitrate regime when compressing multiple scenes at once. To achieve this, we introduce a novel parameterization which shares parameters across models and optimize jointly across scenes.
We define a multi-view image dataset as a set of tuples $\{(\mathbf{p}_i, I_i)\}_{i=1}^{N}$, where $\mathbf{p}_i$ is the camera pose and $I_i$ is the corresponding image from this pose. We refer to the 3D ground truth that the views capture as the scene. In what follows, we first provide a brief review of the neural rendering and model compression approaches that we build upon, while introducing the necessary notation.
Neural Radiance Fields (NeRF) The neural rendering approach of Mildenhall et al. (2020) uses a neural network to model a radiance field. The radiance field itself is a learned function $f: (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$, mapping a 3D spatial coordinate $\mathbf{x}$ and a 2D viewing direction $\mathbf{d}$ to an RGB value $\mathbf{c}$ and a corresponding density element $\sigma$. To render a view, the RGB values are sampled along the relevant rays and accumulated according to their density elements. The learned radiance field mapping
is parameterized with two multilayer perceptrons (MLPs), which Mildenhall et al. (2020) refer to as the “coarse” and “fine” networks, with parameters $\theta_c$ and $\theta_f$ respectively. The input locations to the coarse network are obtained by sampling regularly along the rays, whereas the input locations to the fine network are sampled conditioned on the radiance field of the coarse network. The networks are trained by minimizing the distance from their renderings to the ground truth image:

$L(\theta_c, \theta_f) = \sum_i \big\| \hat{I}_c(\mathbf{p}_i; \theta_c) - I_i \big\|_2^2 + \big\| \hat{I}_f(\mathbf{p}_i; \theta_c, \theta_f) - I_i \big\|_2^2$
where $\|\cdot\|_2$ is the Euclidean norm and the $\hat{I}$ are the rendered views. Note that the rendered view $\hat{I}_f$ from the fine network relies on both the camera pose and the coarse network to determine the spatial locations at which to query the radiance field. We drop the explicit dependence of $\hat{I}_f$ on $\theta_c$ in the rest of the paper to avoid cluttering the notation. During training, we render only a minibatch of pixels rather than the full image. We give a more detailed description of the NeRF model and the rendering process in Appendix Sec. A.
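As a concrete illustration, the combined coarse-and-fine photometric loss over a minibatch of rays can be sketched as follows (a simplified NumPy sketch; the function and array names are ours, not from the paper):

```python
import numpy as np

def nerf_minibatch_loss(coarse_rgb, fine_rgb, target_rgb):
    """Sum of squared errors for both renderings against the same
    ground-truth pixel colors; all arrays have shape (num_rays, 3)."""
    coarse_term = np.sum((coarse_rgb - target_rgb) ** 2)
    fine_term = np.sum((fine_rgb - target_rgb) ** 2)
    return coarse_term + fine_term

# Toy minibatch of 2 rays: a perfect fine render, an off-by-one coarse render.
target = np.zeros((2, 3))
loss = nerf_minibatch_loss(np.ones((2, 3)), np.zeros((2, 3)), target)  # -> 6.0
```

Note that both networks are penalized against the same target pixels, so the coarse network is trained to be a useful proposal distribution rather than discarded.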
Model Compression through Entropy Penalized Reparameterization The model compression work of Oktay et al. (2020) reparameterizes the model weights $\theta$ into a latent space as $\tilde{\theta}$. The latent weights $\tilde{\theta}$ are decoded by a learned function $\Psi$, i.e. $\theta = \Psi(\tilde{\theta})$. The latent weights are modeled as samples from a learned prior $q$, such that they can be entropy coded according to this prior. To minimize the rate, i.e. the length of the bit string resulting from entropy coding these latent weights, a differentiable approximation of the self-information of the latent weights, $-\log q(\tilde{\theta})$, is penalized. The continuous latent weights $\tilde{\theta}$ are quantized before being applied in the model, with the straight-through estimator (Bengio et al., 2013) used to obtain surrogate gradients of the loss function. Following Ballé et al. (2017), uniform noise is added when learning the continuous prior: $q$ is fit to $\tilde{\theta} + u$, where $u \sim \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$. This uniform noise is a stand-in for the quantization, and results in a good approximation of the self-information through the negative log-likelihood of the noised continuous latent weights. After training, the quantized weights are obtained by rounding, $\hat{\theta} = \lfloor \tilde{\theta} \rceil$, and transmitted along with discrete probability tables obtained by integrating the density $q$ over the quantization intervals. The continuous weights $\tilde{\theta}$ and any parameters in $q$ itself can then be discarded.
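To make the training-time versus coding-time distinction concrete, the following NumPy sketch contrasts the noised continuous latents used to estimate the rate with the rounded latents used for entropy coding. A toy Gaussian density stands in for the learned prior, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def log2_gaussian(x, sigma=1.0):
    """Toy stand-in for the learned prior q (log base-2 density)."""
    return -0.5 * (x / sigma) ** 2 / np.log(2) - 0.5 * np.log2(2 * np.pi * sigma ** 2)

def training_rate_bits(latents):
    """Differentiable rate surrogate: negative log2-likelihood of the
    latents after adding U(-1/2, 1/2) noise (Balle et al., 2017)."""
    noised = latents + rng.uniform(-0.5, 0.5, size=latents.shape)
    return -np.sum(log2_gaussian(noised))

def quantize(latents):
    """After training: round to the nearest integer for entropy coding."""
    return np.round(latents)
```

In the full method the rate surrogate is added to the rendering loss, while at coding time the rounded latents are compressed with an entropy coder under the discretized prior.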
To achieve a compressed representation of a scene, we propose to compress the neural scene representation function itself. In this paper we use the NeRF model as our representation function. To compress the NeRF model, we build upon the model compression approach of Oktay et al. (2020) and jointly train for rendering as well as compression in an end-to-end manner. We subsequently refer to this approach as cNeRF. The full objective that we seek to minimize is:

$\min_{\tilde{\theta}, \Phi} \; L\big(\Psi(\tilde{\theta}_c), \Psi(\tilde{\theta}_f)\big) \;+\; \lambda \big[ -\log q(\tilde{\theta}_c) - \log q(\tilde{\theta}_f) \big]$
where $\Phi$ denotes the parameters of $\Psi$ as well as any parameters in the prior distribution $q$, and we have explicitly split the latent weights into coarse and fine components such that $\tilde{\theta} = \{\tilde{\theta}_c, \tilde{\theta}_f\}$. $\lambda$ is a trade-off parameter that balances between rate and distortion. A rate–distortion (RD) plot can be traced by varying $\lambda$ to explore the performance of the compressed model at different bitrates.
Compressing a single scene When training cNeRF to render a single scene, we have to choose how to parameterize and structure $\Psi$ and the prior distribution $q$ over the network weights. Since the networks are MLPs, the model parameters for a layer $l$ consist of the kernel weights $W^{(l)}$ and biases $b^{(l)}$. We compress only the kernel weights $W^{(l)}$, leaving the biases uncompressed since they are much smaller in size. The quantized latent kernel weights $\hat{W}^{(l)}$ are mapped to the model weights by $\Psi^{(l)}$, i.e. $W^{(l)} = \Psi^{(l)}(\hat{W}^{(l)})$. $\Psi^{(l)}$ is constructed as an affine scalar transformation, which is applied elementwise to $\hat{W}^{(l)}$:

$\Psi^{(l)}(\hat{W}^{(l)}) = \alpha^{(l)} \hat{W}^{(l)} + \beta^{(l)} \qquad \text{(Eqn. 3)}$
We take the prior to be factored over the layers, such that we learn a prior $q^{(l)}$ per linear kernel. Within each kernel, we take the weights in $\hat{W}^{(l)}$ to be i.i.d. from the univariate distribution $q^{(l)}$, parameterized by a small MLP, as per the approach of Ballé et al. (2017). Note that the parameters of this MLP can be discarded after training (once the probability mass functions have been built).
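A minimal sketch of the per-layer affine decode and of measuring the code length of a quantized kernel under its discrete probability table (the function names and the toy uniform table are our choices):

```python
import numpy as np

def decode_kernel(q_kernel, scale, shift):
    """Elementwise affine decode of a quantized latent kernel:
    W = scale * q_kernel + shift."""
    return scale * q_kernel + shift

def code_length_bits(q_kernel, pmf):
    """Total entropy-code length of a quantized kernel whose integer
    symbols are modeled i.i.d. by the probability table pmf."""
    return sum(-np.log2(pmf[int(w)]) for w in q_kernel.ravel())

q_kernel = np.array([[0, 1], [1, 0]])
pmf = {0: 0.5, 1: 0.5}                   # a toy two-symbol probability table
bits = code_length_bits(q_kernel, pmf)   # 4 symbols at 1 bit each -> 4.0
```

Because the prior factors over layers and the symbols within a kernel are modeled i.i.d., the total rate of the model is just the sum of such per-kernel code lengths.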
Compressing multiple scenes While the original NeRF model is trained for a single scene, we hypothesize that better rate–distortion performance can be achieved for multiple scenes, especially if they share information, by training a joint model. For a dataset of $N$ scenes, we parameterize the kernel weights of model $n$, layer $l$ as:

$W_n^{(l)} = \alpha_n^{(l)} \hat{W}_n^{(l)} + \beta_n^{(l)} + \gamma^{(l)} \hat{S}^{(l)} \qquad \text{(Eqn. 4)}$
Compared to Eqn. 3, we have added a shift, parameterized as a scalar linear transformation $\gamma^{(l)} \hat{S}^{(l)}$ of a discrete shift $\hat{S}^{(l)}$ that is shared across all models $n$. $\hat{S}^{(l)}$ has the same dimensions as the kernel $W_n^{(l)}$, and as with the discrete latent kernels, $\hat{S}^{(l)}$ is coded by a learned probability distribution. The objective for the multi-scene model becomes:

$\min \; \sum_{n=1}^{N} \Big[ L_n\big(\Psi_n(\tilde{\theta}_n)\big) - \lambda \log q_n(\tilde{\theta}_n) \Big] \;-\; \lambda \log q_S(\tilde{S})$
where $\tilde{S}$ is the set of all discrete shift parameters, and the losses, latent weights and affine transforms are indexed by scene and model $n$. Note that this parameterization has more parameters than the total of the single scene models, which at first appears counter-intuitive, since we wish to reduce the overall model size. It is constructed as such so that the multi-scene parameterization contains the single scene parameterizations: they can be recovered by setting the shared shifts to zero. If the shifts are set to zero then their associated probability distributions can collapse to place all their mass at zero. We therefore expect that if there is little benefit to using the shared shifts they can be effectively ignored, but that if there is a benefit they can be utilized. As such, we can interpret this parameterization as inducing a soft form of parameter sharing.
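The multi-scene decode of Eqn. 4 can be sketched as follows (names are ours), illustrating that zeroing the shared shift recovers the single-scene parameterization of Eqn. 3:

```python
import numpy as np

def decode_multiscene_kernel(q_kernel_n, scale_n, shift_n,
                             shared_scale, shared_q_shift):
    """Scene-specific affine decode plus a shared, scaled discrete shift.
    shared_q_shift has the same shape as the kernel and is common to all scenes."""
    return scale_n * q_kernel_n + shift_n + shared_scale * shared_q_shift

q_kernel = np.array([[1.0, -2.0]])
shared = np.array([[3.0, 0.0]])
with_shift = decode_multiscene_kernel(q_kernel, 0.5, 0.1, 0.2, shared)
no_shift = decode_multiscene_kernel(q_kernel, 0.5, 0.1, 0.2, np.zeros_like(shared))
```

When the shared shift is all zeros, `no_shift` reduces to the single-scene affine decode, which is what allows the joint model to fall back to independent models if sharing does not help.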
Datasets To demonstrate the effectiveness of our method, we evaluate on two sets of scenes used by Mildenhall et al. (2020):
Synthetic. Consisting of 800×800 pixel views taken from either the upper hemisphere or the entire sphere around an object rendered using the Blender software package. There are 100 views in the train set and 200 in the test set.
Real. Consisting of a set of forward-facing 1008×756 pixel photos of a complex scene. The number of images varies per scene, with 1/8 of the images taken as the test images.
Since we are interested in the ability of the receiver to render novel views, all distortion results (for any choice of perceptual metric) presented are given on the test sets.
Architecture and Optimization We maintain the same architecture for the NeRF model as Mildenhall et al. (2020), consisting of 13 linear layers and ReLU activations. For cNeRF we use Adam (Kingma & Ba, 2015) to optimize the latent weights and the weights contained in the decoding functions $\Psi$. For these parameters we use the initial learning rate and learning rate decay schedule of Mildenhall et al. (2020). For the parameters of the learned probability distributions $q$, we find it beneficial to use a lower learning rate, such that the distributions do not collapse prematurely. We initialize the latent linear kernels using the scheme of Glorot & Bengio (2010), and the decoders near the identity transformation.
Baseline We follow the general methodology exhibited in light field compression and take the compressed representation of the scene to be a compressed subset of the views. The receiver then decodes these views, and renders novel views conditioned on the reconstructed subset. We use the video codec HEVC to compress the subset of views, as is done by Jiang et al. (2017). To render novel views conditioned on the reconstructed set of views, we choose the Local Light Field Fusion (LLFF) approach of Mildenhall et al. (2019). LLFF is a state-of-the-art learned approach in which a novel view is rendered by promoting nearby views to multiplane images, which are then blended. We refer to the full baseline subsequently as HEVC + LLFF.
Table 1: Test-set PSNR and compressed size (KB) per scene for the uncompressed NeRF model, cNeRF (with its size reduction factor relative to NeRF), and the HEVC + LLFF baseline at QP=30.
Single scene compression To explore the frontier of achievable rate–distortion points for cNeRF, we evaluate at a range of entropy weights $\lambda$ for four scenes – two synthetic (Lego and Ficus) and two real (Fern and Room). To explore the rate–distortion frontier for the HEVC + LLFF baseline we evaluate at a range of QP values for HEVC. We give a more thorough description of the exact specification of the HEVC + LLFF baseline and the ablations we perform to select the hyperparameter values in Appendix Sec. B. We show the results in Fig. 4. We also plot the performance of the uncompressed NeRF model, demonstrating that entropy penalization can reduce the model size substantially with a relatively small increase in distortion. For these scenes we plot renderings at varying levels of compression in Fig. 2 and Fig. 8. The visual quality of the renderings does not noticeably degrade when compressing the NeRF model down to roughly 5-6 bits per parameter (the precise bitrate depends on the scene). At roughly 1 bit per parameter, the visual quality has degraded significantly, although the renderings are still sensible and easily recognisable. We find this to be a surprisingly positive result, given that assigning a single bit per parameter is extremely restrictive for such a complex regression task as rendering. Indeed, to our knowledge no binary neural networks have been demonstrated to be effective on such tasks.
Although the decoding functions $\Psi$ (Eqn. 3) are relatively simple scalar affine transformations, we do not find any benefit in using more complex decoding functions. With the given parameterization, most of the total description length of the model lies in the coded latent weights, not in the parameters of the decoders or entropy models. We give a full breakdown in Tab. 5.
Fig. 4 shows that cNeRF clearly outperforms the HEVC + LLFF baseline, always achieving lower distortion at a (roughly) equivalent bitrate. Reconstruction quality is reported as peak signal-to-noise ratio (PSNR). The results are consistent with earlier demonstrations that NeRF produces much better renderings than the LLFF model (Mildenhall et al., 2020). However, it is still interesting to see that this difference persists even at much lower bitrates. To evaluate on the remaining scenes, we select a single $\lambda$ value for cNeRF and a single QP value for HEVC + LLFF. We pick the values to demonstrate a reasonable trade-off between rate and distortion. The results are shown in Tab. 1. For every scene, cNeRF achieves a lower distortion at a lower bitrate than the baseline. We can also see that cNeRF is consistently able to reduce the model size significantly without seriously impacting the distortion. Further, we evaluate the performance of cNeRF and HEVC + LLFF under other perceptual quality metrics in Tab. 3 and 4. Although cNeRF is trained to minimize the squared error between renderings and the true images (and therefore to maximize PSNR), cNeRF also outperforms HEVC + LLFF in both MS-SSIM (Wang et al., 2003) and LPIPS (Zhang et al., 2018). This is significant, since the results of Mildenhall et al. (2020) indicated that for SSIM and LPIPS, the LLFF model performed similarly to NeRF when applied to the real scenes. We display a comparison of renderings from cNeRF and HEVC + LLFF in Fig. 3.
Multi-scene compression For the multi-scene case we compress one pair of synthetic scenes and one pair of real scenes. We train the multi-scene cNeRF using a single shared shift per linear kernel, as per Eqn. 4. To compare the results to the single scene models, we take the two corresponding single scene cNeRFs, sum the sizes and average the distortions. We plot the resulting rate–distortion frontiers in Fig. 5. The results demonstrate that the multi-scene cNeRF improves upon the single scene cNeRFs at low bitrates, achieving higher PSNR values with a smaller model. This meets our expectation, since the multi-scene cNeRF can share parameters via the shifts (Eqn. 4) and so decrease the code length of the scene-specific parameters. At higher bitrates we see no benefit to using the multi-scene parameterization, and in fact see slightly worse performance. This indicates that in the unconstrained rate setting, there is no benefit to using the shared shifts, and that they may slightly harm optimization.
5 Related work
Scene Compression A 3D scene is typically represented as a set of images, one for each view. For a large number of views, compressing each image individually using a conventional compression method can require a large amount of space. As a result, there is a body of compression research which aims to exploit the underlying structure of the 3D scene to reduce space requirements. Much research has focused on compressing light field image (LFI) data (Astola & Tabus, 2018; Jiang et al., 2017; Bakir et al., 2018; Jia et al., 2019; Zhao et al., 2018). LFI data generally consists of multiple views separated by small angular distances. This set of views can be used to reconstruct a signal on the 4D domain of rays of the light field itself, thus permitting post-processing tasks such as novel view synthesis and refocusing. A majority of works select a representative subset of views to transmit from the scene. These are compressed and transmitted, typically using a video codec, with the receiver decoding these images and then rendering any novel view for an unobserved (during training) camera pose. Reconstruction for novel camera poses can be performed using traditional methods, such as optical flow (Jiang et al., 2017)
, or by using recent learned methods that employ convolutional neural networks (Zhao et al., 2018; Jia et al., 2019). A contrasting approach to multi-view image compression is proposed by Liu et al. (2019), in which a pair of images from two viewpoints is compressed by conditioning the coder of the second image on the coder of the first image. It is important to emphasise that we are not studying this kind of approach in this work, since we wish the receiver to have the ability to render novel views.
Neural Rendering is an emerging research area which combines learned components with rendering knowledge from computer graphics. Recent work has shown that neural rendering techniques can generate high-quality novel views of a wide range of scenes (Mildenhall et al., 2020; Sitzmann et al., 2019; Liu et al., 2020; Schwarz et al., 2020). In this work we build upon the method of Mildenhall et al. (2020), termed a Neural Radiance Field (NeRF), for single scene compression, and then extend it with a novel reparameterization for jointly compressing multiple scenes. Training neural representation networks jointly across different scenes (without compression) has been explored by Sitzmann et al. (2019) and Liu et al. (2020), who use a hypernetwork (Ha et al., 2017) to map a latent vector associated with each scene to the parameters of the representation network. Liu et al. (2020) note that the hypernetwork approach results in a significant degradation of performance when applied to the NeRF model (a loss of more than 4 dB PSNR). Our shared-reparameterization approach differs significantly from these methods.
Model Compression There is a body of research for reducing the space requirements of deep neural networks. Pruning tries to find a sparse set of weights by successively removing a subset of weights according to some criterion (Han et al., 2016; Li et al., 2017). Quantization reduces the precision used to describe the weights themselves (Courbariaux et al., 2016; Li et al., 2016). In this work we focus instead on weight coding approaches (Havasi et al., 2019; Oktay et al., 2020) that code the model parameters to yield a compressed representation.
6 Discussion and Conclusion
Our results demonstrate that cNeRF produces far better results as a compressed representation than a state-of-the-art baseline, HEVC+LLFF, which follows the paradigm of compressing the original views. In contrast, our method compresses a representation of the radiance field itself. This is important for two reasons:
Practically, compressing the views themselves bars the receiver from using more complex and better-performing rendering methods such as NeRF, because doing so would require training at the receiving side after decompression, which is computationally infeasible in many applications.
Determining the radiance field and compressing it on the sending side may have coding and/or representational benefits, because of the data processing inequality: the cNeRF parameters are a function of the original views, and as such can contain no more information than the original views (the training data). The method is thus relieved of the need to encode information in the original views that is not useful for the rendering task.
It is difficult to gather direct evidence for the latter point, as the actual entropy of both representations is difficult to measure (we can only upper bound it by the compressed size). However, the substantial performance improvement of our method compared to HEVC+LLFF suggests that the radiance field is a more economical representation for the scene.
The encoding time for cNeRF is long, given that a new model must be trained from scratch for each scene. Importantly though, the decoding time is much shorter, since decoding only requires rendering views with the decompressed NeRF model. cNeRF enables neural scene rendering methods such as NeRF to be used for scene compression, as it shifts the complexity requirements from the receiver to the sender. In many applications, it is more acceptable to incur high encoding times than high decoding times, since one compressed data point may be decompressed many times, allowing the encoding time to be amortized, and since power-constrained devices are often on the receiving side. Thus, our method represents a significant step towards enabling neural scene rendering in practical applications.
- Akenine-Möller et al. (2019) T. Akenine-Möller, E. Haines, and N. Hoffman. Real-time rendering. Crc Press, 2019.
- Astola & Tabus (2018) P. Astola and I. Tabus. Wasp: Hierarchical warping, merging, and sparse prediction for light field image compression. In 2018 7th European Workshop on Visual Information Processing (EUVIP), pp. 1–6, 2018.
- Bakir et al. (2018) N. Bakir, W. Hamidouche, O. Déforges, K. Samrouth, and M. Khalil. Light field image compression based on convolutional neural networks and linear approximation. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 1128–1132, 2018.
- Ballé et al. (2017) J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compression. In 5th Int. Conf. on Learning Representations (ICLR), 2017.
- Barina et al. (2019) D. Barina, T. Chlubna, M. Solony, D. Dlabaja, and P. Zemcik. Evaluation of 4d light field compression methods. In arXiv:1905.07432, 2019.
- Bengio et al. (2013) Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. In arXiv:1308.3432, 2013.
- Courbariaux et al. (2016) M. Courbariaux, Y. Bengio, and J.-P. David. Binaryconnect: Training deep neural networks with binary weights during propagations. In arXiv:1511.00363, 2016.
- Glorot & Bengio (2010) X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pp. 249–256, 2010.
- Ha et al. (2017) D. Ha, A. Dai, and Q. Le. Hypernetworks. In International Conference on Learning Representations, 2017.
- Han et al. (2016) S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In 4th International Conference on Learning Representations, ICLR 2016, 2016.
- Havasi et al. (2019) M. Havasi, R. Peharz, and J. M. Hernández-Lobato. Minimal random code learning: Getting bits back from compressed model parameters. In International Conference on Learning Representations, 2019.
- Jia et al. (2019) C. Jia, X. Zhang, S. Wang, S. Wang, and S. Ma. Light field image compression using generative adversarial network-based view synthesis. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 9(1):177–189, 2019.
- Jiang et al. (2017) X. Jiang, M. Le Pendu, and C. Guillemot. Light field compression using depth image based view synthesis. In 2017 IEEE International Conference on Multimedia Expo Workshops (ICMEW), pp. 19–24, 2017.
- Kingma & Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, 2015.
- Li et al. (2016) F. Li, B. Zhang, and B. Liu. Ternary weight networks. In arXiv:1605.04711, 2016.
- Li et al. (2017) H. Li, A. Kadav, I. Durdanovic, H. Samet, and H.P. Graf. Pruning filters for efficient convnets. In arXiv:1608.08710, 2017.
- Liu et al. (2019) J. Liu, S. Wang, and R. Urtasun. DSIC: Deep stereo image compression. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3136–3145, 2019.
- Liu et al. (2020) L. Liu, J. Gu, K. Z. Lin, T.-S. Chua, and C. Theobalt. Neural sparse voxel fields. In arXiv:2007.11571, 2020.
- Mildenhall et al. (2019) B. Mildenhall, P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 2019.
- Mildenhall et al. (2020) B. Mildenhall, P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
- Oktay et al. (2020) D. Oktay, J. Ballé, S. Singh, and A. Shrivastava. Scalable model compression by entropy penalized reparameterization. In International Conference on Learning Representations, 2020.
- Schwarz et al. (2020) K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger. Graf: Generative radiance fields for 3d-aware image synthesis. In arXiv:2007.02442, 2020.
- Sitzmann et al. (2019) V. Sitzmann, M. Zollhöfer, and G. Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, 2019.
- Sitzmann et al. (2020) V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein. Implicit neural representations with periodic activation functions. In arXiv:2006.09661, 2020.
- Sullivan et al. (2012) G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand. Overview of the high efficiency video coding (hevc) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012.
- Tancik et al. (2020) M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In arXiv:2006.10739, 2020.
- Wang et al. (2003) Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems Computers, 2003, volume 2, pp. 1398–1402 Vol.2, 2003.
- Zhang et al. (2018) R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In arXiv:1801.03924, 2018.
- Zhao et al. (2018) Z. Zhao, S. Wang, C. Jia, X. Zhang, S. Ma, and J. Yang. Light field image compression based on deep learning. In 2018 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6, 2018.
Appendix A Neural Radiance Fields
The neural rendering approach of Mildenhall et al. (2020) uses a neural network to model a radiance field. The radiance field itself is a learned mapping $f: (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$, where the input is a 3D spatial coordinate $\mathbf{x}$ and a 2D viewing direction $\mathbf{d}$. The NeRF model also makes use of a positional encoding into the frequency domain, applied elementwise to the spatial and directional inputs:

$\gamma(p) = \big( \sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p) \big)$
This type of encoding has been shown to be important for implicit models, which take as input low dimensional data which contains high frequency information (Tancik et al., 2020; Sitzmann et al., 2020).
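A minimal NumPy sketch of this frequency encoding (the function name and the number of frequencies are our choices):

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Elementwise sin/cos features of the input at frequencies
    2^0, 2^1, ..., 2^(L-1), concatenated along the last axis."""
    feats = []
    for l in range(num_freqs):
        feats.append(np.sin(2.0 ** l * np.pi * p))
        feats.append(np.cos(2.0 ** l * np.pi * p))
    return np.concatenate(feats, axis=-1)

# A 3D coordinate with L = 4 frequencies maps to a 24-dimensional feature.
encoded = positional_encoding(np.zeros(3), num_freqs=4)
```

The encoding lifts each scalar input into a set of sinusoids of geometrically increasing frequency, which is what lets the MLP represent high-frequency detail from low-dimensional coordinates.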
The network output is an RGB value $\mathbf{c}$ and a density element $\sigma$. To render a particular view, the RGB values are sampled along the relevant rays and accumulated according to their density elements. In particular, the color of a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, in direction $\mathbf{d}$ from the camera origin $\mathbf{o}$, is computed as

$\hat{C}(\mathbf{r}) = \sum_{k=1}^{K} T_k \big(1 - \exp(-\sigma_k \delta_k)\big) \mathbf{c}_k, \qquad T_k = \exp\Big(-\sum_{j=1}^{k-1} \sigma_j \delta_j\Big)$

where $(\mathbf{c}_k, \sigma_k)$ is the output of the mapping $f$ evaluated at $\mathbf{x}_k = \mathbf{r}(t_k)$, $t_k$ is the distance of sample $k$ from the origin along the ray, and $\delta_k = t_{k+1} - t_k$ is the distance between samples. The color $\hat{C}(\mathbf{r})$ can be interpreted as the expected color of the point along the ray that is closest to the camera, if the points in the scene are distributed along the ray according to an inhomogeneous Poisson process: in a Poisson process with density $\sigma$, the probability that there are no points in an interval of length $\delta$ is $\exp(-\sigma \delta)$. Thus $T_k$ is the probability that there are no points between the camera and $t_k$, and $1 - \exp(-\sigma_k \delta_k)$ is the probability that there is a point between $t_k$ and $t_{k+1}$. The rendered view comprises pixels whose colors are evaluated at rays emanating from the same camera origin but with slightly different directions $\mathbf{d}$, depending on the camera pose $\mathbf{p}$.
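The accumulation along a single ray can be sketched in NumPy as follows (a simplified single-ray version; names are ours):

```python
import numpy as np

def composite_ray(rgb, sigma, deltas):
    """Expected ray color from K samples. rgb is (K, 3); sigma and
    deltas are (K,). trans[k] accumulates density up to, but not
    including, sample k, so the first sample is fully visible."""
    alpha = 1.0 - np.exp(-sigma * deltas)                        # hit prob. per segment
    acc = np.concatenate([[0.0], np.cumsum(sigma[:-1] * deltas[:-1])])
    trans = np.exp(-acc)                                         # transmittance T_k
    weights = trans * alpha
    return np.sum(weights[:, None] * rgb, axis=0)

# A single, effectively opaque red sample returns (approximately) pure red.
red = composite_ray(np.array([[1.0, 0.0, 0.0]]), np.array([1e9]), np.array([1.0]))
```

Note that the weights sum to at most one, so rays that pass through empty space (zero density) composite to black rather than to an arbitrary color.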
Appendix B HEVC + LLFF specification and ablations
There are many hyperparameters to select for the HEVC + LLFF baseline. The first we consider is the number of images to compress with HEVC. If too many images are compressed with HEVC then at some point the performance of LLFF will saturate and an unnecessary amount of space will be used. On the other hand, if too few images are compressed with HEVC, then LLFF will find it difficult to blend these (de)compressed images to form high quality renders. To illustrate this effect, we run an ablation on the Fern scene where we vary the number of images we compress with HEVC, rendering a held out set of images conditioned on the reconstructions. The results are displayed in Fig. 6. We can clearly see the saturation point at around 10 images, beyond which there is no benefit to compressing extra images. Thus when picking the number of images to compress for new scenes, we do not use more than 4 per test image (which corresponds to compressing 12 images in our ablation).
The second effect we study is the order in which images are compressed with HEVC, which affects performance because HEVC is a video codec and thus sensitive to image ordering. It stands to reason that the more the sequence of images resembles a natural video, the better the coding will be. As such, we consider two orderings: firstly the “snake scan” ordering, in which images are ordered vertically by their camera pose, going alternately left to right then right to left. The second is the “lozenge” ordering (Jiang et al., 2017), in which images are ordered by camera pose in a spiral outwards from the centre. Both orderings appear sensible since they always step from a given camera pose to an adjacent pose. We compare results compressing and reconstructing a set of images using HEVC across a range of Quantization Parameter (QP) values for the Fern scene in Tab. 2. The difference between the two orderings is very small. Since snake scan is simpler to implement, we use it in all our experiments.
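The snake-scan ordering can be sketched as follows (pure Python; `grid` is a hypothetical list of rows of image indices arranged by camera pose):

```python
def snake_scan(grid):
    """Traverse rows of views top to bottom, alternating direction,
    so consecutive images in the output sequence come from adjacent
    camera poses."""
    order = []
    for i, row in enumerate(grid):
        order.extend(row if i % 2 == 0 else list(reversed(row)))
    return order

# A 2x3 grid of view indices.
sequence = snake_scan([[0, 1, 2], [3, 4, 5]])   # -> [0, 1, 2, 5, 4, 3]
```

Keeping adjacent poses adjacent in the sequence gives the video codec's inter-frame prediction small motion to model, which is the motivation for both orderings considered.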
The effect of changing QP is demonstrated in Fig. 7, and we select QP=30 for the experiments in which we choose one rate–distortion point to evaluate, since it achieves almost the same performance as QP=20 and QP=10 with considerably less space.
Appendix C Extra results
Here we present some further results from our experiments, including results on different perception metrics, a breakdown of the cNeRF model size and extra comparisons of renderings.
Table 3: MS-SSIM results per scene for cNeRF and the HEVC + LLFF baseline (QP=30).
Table 4: LPIPS results per scene for cNeRF and the HEVC + LLFF baseline (QP=30).
Table 5: Breakdown of the cNeRF description length per entropy weight $\lambda$: rate of the coded latent weights (KB) versus overhead of the decoders and entropy models (KB).