
GARF: Gaussian Activated Radiance Fields for High Fidelity Reconstruction and Pose Estimation

Despite Neural Radiance Fields (NeRF) showing compelling results in photorealistic novel view synthesis of real-world scenes, most existing approaches require accurate prior camera poses. Although approaches for jointly recovering the radiance field and camera poses exist (e.g., BARF), they rely on a cumbersome coarse-to-fine auxiliary positional embedding to ensure good performance. We present Gaussian Activated neural Radiance Fields (GARF), a new positional embedding-free neural radiance field architecture - employing Gaussian activations - that outperforms the current state-of-the-art in terms of high fidelity reconstruction and pose estimation.


1 Introduction

Recent work by Lin et al. [12] – Bundle-Adjusting Neural Radiance Fields (BARF) – revealed that an architecturally-modified Neural Radiance Field (NeRF) [18] could effectively solve the joint task of scene reconstruction and pose optimization. One crucial insight from this work is that error backpropagation to the pose parameters in traditional NeRF is hampered by large gradients arising from the high-frequency components of the positional embedding. To ameliorate this effect, the authors proposed a coarse-to-fine scheduler that gradually enables the frequency support of the positional embedding layer throughout the joint optimisation. Although achieving impressive results, this workaround requires careful tuning of the frequency scheduling process through a cumbersome multi-dimensional parameter sweep. In this paper we investigate whether this coarse-to-fine strategy can be bypassed through other means, simplifying the approach and potentially opening up new avenues for improvement.

NeRF is probably the most popular application of coordinate multi-layer perceptrons (MLPs). NeRF maps an input 5D coordinate (3D position and 2D viewing direction) to the scene properties (view-dependent emitted radiance and volume density) of the corresponding location. A crucial ingredient of most coordinate-MLPs is positional encoding. Traditional MLPs suffer from spectral bias – i.e., they are biased towards learning low-frequency functions – when used for signal reconstruction. Thus, MLPs in their rudimentary form are not ideal for encoding natural signals with fine detail, which entails modeling large fluctuations [23]. To circumvent this issue, NeRF architecturally modifies the MLP by projecting the low-dimensional coordinate inputs to a higher-dimensional space using a positional embedding layer, which allows NeRF to learn high-frequency components of the target function rapidly [35, 18].

Of late, there has been increasing advocacy for self-contained coordinate networks. Of particular note in this regard is the work of Sitzmann et al. [30], who showed that by simply replacing conventional activation functions (e.g., ReLU) with sine, one can remove the need for any type of positional embedding. Although showing promise, such sine-MLPs have been found experimentally to be sensitive to weight initialization [30, 24]. While Sitzmann et al. [30] proposed an initialization scheme that helps sine-MLPs achieve faster convergence when solving for signal reconstruction, their deployment within NeRF has been limited, with most of the community still opting for positional embedding with conventional activations.

Contributions: In this paper we draw inspiration from recent work [24] that has advocated for a broader class of effective activation functions – beyond sine – that can also circumvent the need for positional encoding. Of particular note in this regard are Gaussian activations. To our knowledge, their use in joint signal recovery and pose estimation has not been previously explored. We illustrate that these activations can preserve the first-order gradients of the target function better than conventional activations enhanced with positional embedding layers. When simultaneously solving for pose and radiance field reconstruction, as in BARF, sine-MLPs are quite susceptible to local minima (even with good initialization), but our proposed Gaussian Activated neural Radiance Fields (GARF) exhibit robust state-of-the-art performance.

In summary, we present the following contributions:

  • We present GARF, a self-contained approach for reconstructing a neural radiance field from imperfect camera poses, without cumbersome hyper-parameter tuning or careful model initialisation.

  • We establish theoretical insights into the effect of Gaussian activations on the joint optimisation of a neural radiance field and camera poses, supported by extensive empirical results.

  • We demonstrate that our proposed GARF can successfully recover scene representations from unknown camera poses, even on challenging scenes with low-textured regions, paving the way for unlocking NeRF for real-world applications.

2 Related Work

2.1 Neural Scene Representations.

Recent works have demonstrated the potential of multi-layer perceptrons (MLPs) as continuous and memory-efficient representations for 3D geometry, including shapes [5, 4], objects [16, 1, 20] and scenes [31, 8, 30]. Using 3D data such as point clouds as supervision, these approaches typically optimise signed distance functions [20, 8] or binary occupancy fields [16, 4]. To alleviate the dependency on 3D training data, several methods formulate differentiable rendering functions which enable the networks to be optimised using multi-view 2D images [31, 19, 18, 36]. Of particular interest is NeRF [18], which models the continuous radiance field of a scene using a coordinate-MLP in a volume rendering framework, optimised by minimising photometric errors. Due to its simplicity and unprecedented high-fidelity novel view synthesis, NeRF has attracted wide attention across the vision community [21, 2, 14, 37, 34, 44]. Numerous extensions have been made on many fronts, e.g., faster training and inference [2, 43, 27, 13], deformable fields [21], dynamic scene modeling [11, 40, 3], generalisation [38, 29] and pose estimation [12, 39, 42, 15, 7, 33, 32].

2.2 Positional Embedding for Pose Estimation.

Positional embedding is an integral component of coordinate-MLPs [35, 25, 46] which enables them to learn high-frequency functions in low-dimensional domains. One of the earliest roots of this approach can be traced to the work of Rahimi and Recht [23], who discovered that random Fourier features can be used to approximate an arbitrary stationary kernel function. Leveraging this insight, Mildenhall et al. [18, 35] recently demonstrated that encoding input coordinates with sinusoids allows MLPs to represent higher-frequency content, enabling high-fidelity neural scene reconstruction for novel view synthesis.

Despite the ability of positional embedding to let MLPs represent high-frequency components, choosing the right frequency scale is critical and often involves cumbersome parameter tuning: if the bandwidth of the signal is increased excessively, coordinate-MLPs tend to produce noisy signal interpolations [35, 26, 6].

More recently, there has been increasing interest in using coordinate-MLPs to tackle the joint problem of neural scene reconstruction and pose optimization [12, 39, 42, 15, 7, 33, 32, 47]. Remarkably, Lin et al. [12] demonstrated that coordinate-MLPs entail an unanticipated drawback for camera registration – i.e., large gradients due to high-frequency components in the positional encoding function can hamper error backpropagation to the pose parameters. Based on this observation, they proposed a work-around that anneals each component of the frequency function in a coarse-to-fine manner. By enabling a smoother trajectory for the optimisation problem, they show that such a strategy can lead to better pose estimation compared to full positional encoding. Unlike BARF, we take a different stance – i.e., is there a self-contained architecture which can tackle the pose estimation problem optimally while attaining high-fidelity neural scene reconstruction without positional embedding?

2.3 Embedding-free Coordinate-networks.

Sitzmann et al. [30] alternatively proposed sinusoidal activation functions which enable coordinate-MLPs to encode high-frequency functions without a positional embedding layer. Despite their potential, networks that employ sinusoidal activations are hyper-sensitive to the initialisation scheme [30, 24, 26]. Taking a step further, Ramasinghe and Lucey [24] recently broadened the understanding of the effect of different activations in MLPs. They proposed a class of novel non-periodic activations that enjoy more robust performance against random initialisation than sinusoids. Our work significantly differs from the above-mentioned works: while we also advocate for a simple and robust embedding-free coordinate network, our work focuses on the joint problem of high-fidelity neural scene reconstruction and pose estimation.

3 Method

In this section, we will provide an exposition of our problem formulation and different classes of coordinate networks, characterising the relative merits of each class for joint optimisation of neural scene reconstruction and pose estimation.

3.1 Formulation

We first present the formulation of recovering the 3D neural radiance field of NeRF [18] jointly with the camera poses. We denote $\mathcal{W}$ as the camera pose transformation and $f$ as the network in NeRF, respectively. NeRF encodes the volumetric field of a 3D scene using a coordinate-network $f$, which maps each input 3D coordinate $\mathbf{x} \in \mathbb{R}^3$ to its corresponding volume density $\sigma \in \mathbb{R}$ and directional emitted colour $\mathbf{c} \in \mathbb{R}^3$, i.e., $f: \mathbf{x} \mapsto (\mathbf{c}, \sigma)$, where $\Theta$ denotes the network weights.¹

¹ $f$ is also conditioned on the viewing direction for modeling view-dependent effects, which we omit in the derivation for simplicity.

Let $\mathbf{u} \in \mathbb{R}^2$ be the pixel coordinates and $\mathcal{I}: \mathbb{R}^2 \to \mathbb{R}^3$ be the imaging function. Given a set of $M$ images $\{\mathcal{I}_i\}_{i=1}^{M}$, we aim to solve for the volumetric radiance field of a 3D scene and the camera poses $\{\mathbf{p}_i\}_{i=1}^{M}$ by minimizing the photometric loss

$$\min_{\mathbf{p}_1, \dots, \mathbf{p}_M, \Theta} \; \sum_{i=1}^{M} \sum_{\mathbf{u}} \big\| \hat{\mathcal{I}}(\mathbf{u}; \mathbf{p}_i, \Theta) - \mathcal{I}_i(\mathbf{u}) \big\|_2^2 . \quad (1)$$

First, we assume the rendering operation of NeRF is expressed in the camera coordinate system. Expressing the pixel coordinate $\mathbf{u}$ in homogeneous coordinates as $\bar{\mathbf{u}} = [\mathbf{u}; 1]^\top$, we can define a 3D point along the camera ray sampled at depth $z$ as $\mathbf{x} = z\bar{\mathbf{u}}$. The estimated RGB colour at pixel coordinate $\mathbf{u}$ is then computed by aggregating the predicted $\mathbf{c}$ and $\sigma$ along the ray as

$$\hat{\mathcal{I}}(\mathbf{u}) = \int_{z_{\text{near}}}^{z_{\text{far}}} T(\mathbf{u}, z)\, \sigma(z\bar{\mathbf{u}})\, \mathbf{c}(z\bar{\mathbf{u}})\, dz, \quad (2)$$

where $T(\mathbf{u}, z) = \exp\!\big(-\int_{z_{\text{near}}}^{z} \sigma(s\bar{\mathbf{u}})\, ds\big)$, and $z_{\text{near}}$ and $z_{\text{far}}$ are the bounds of the depth range of interest; see [10] for more details of the volume rendering operation. In practice, the integral is commonly approximated using quadrature [18], which evaluates the network at a discrete set of $N$ points obtained through stratified sampling [18] at depths $\{z_j\}_{j=1}^{N}$. Therefore, this entails $N$ queries of the network $f$, whose outputs $\{(\mathbf{c}_j, \sigma_j)\}_{j=1}^{N}$ are composited through volume rendering. Denoting the ray compositing function as $g$, we can rewrite $\hat{\mathcal{I}}(\mathbf{u})$ as $g(\mathbf{c}_1, \sigma_1, \dots, \mathbf{c}_N, \sigma_N)$. Given a camera pose $\mathbf{p}$, we can transform a 3D point $\mathbf{x}$ in the camera coordinate system to the world coordinate system through a 3D rigid transformation $\mathcal{W}(\mathbf{x}; \mathbf{p})$ to obtain the synthesized image as

$$\hat{\mathcal{I}}(\mathbf{u}; \mathbf{p}) = g\big( f(\mathcal{W}(z_1\bar{\mathbf{u}}; \mathbf{p}); \Theta), \dots, f(\mathcal{W}(z_N\bar{\mathbf{u}}; \mathbf{p}); \Theta) \big). \quad (3)$$
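To make the quadrature approximation concrete, the following is a minimal PyTorch sketch of a ray compositing function $g$ under the stratified-sampling approximation of Eq. (2); the function and variable names are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the quadrature-based ray compositing function g,
# following the stratified-sampling approximation of Eq. (2) used in
# NeRF [18]. Names are illustrative, not from the paper.
import torch

def composite_rays(colors, densities, z_vals):
    """colors: (R, N, 3), densities: (R, N) >= 0, z_vals: (R, N) depths."""
    # Distances between adjacent samples; the last interval is unbounded,
    # so it is padded with a large value as in common NeRF implementations.
    deltas = z_vals[:, 1:] - z_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)

    # Per-segment opacity, and transmittance T (probability that the ray
    # reaches a sample without being absorbed earlier along the ray).
    alpha = 1.0 - torch.exp(-densities * deltas)                  # (R, N)
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    weights = alpha * trans                                        # (R, N)
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=1)              # (R, 3)
    return rgb, weights
```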

We solve the optimization problem (1) using gradient descent. Next, we give a brief exposition of coordinate-networks and compare them.

3.2 Coordinate-networks

Coordinate-networks are a special class of MLPs that are used to encode signals as trainable weights. An MLP $f$ with $L$ layers can be formulated as

$$f(\mathbf{x}) = \mathbf{W}_L \big( \phi_{L-1} \circ \phi_{L-2} \circ \dots \circ \phi_1 \big)(\mathbf{x}) + \mathbf{b}_L, \qquad \phi_\ell(\mathbf{x}) = \psi(\mathbf{W}_\ell \mathbf{x} + \mathbf{b}_\ell), \quad (4)$$

where $\mathbf{W}_\ell$ are the trainable weights at the $\ell$-th layer, $\mathbf{b}_\ell$ is the bias, and $\psi(\cdot)$ is a non-linear activation function. With this definition in hand, we briefly discuss several types of coordinate-networks below.

3.2.1 ReLU-MLPs:

employ the ReLU activation function $\psi(x) = \max(0, x)$. Despite being universal approximators in theory, ReLU-MLPs are biased towards learning low-frequency functions [41, 22], making them sub-optimal candidates for encoding natural signals with high signal fidelity. To circumvent this issue, various methods have been proposed in the literature, which we discuss next.

3.2.2 PE-MLPs:

are the most widely adopted class of coordinate-networks and were popularized by the seminal work of [18] through the use of positional embedding (PE). In PE-MLPs, the low-dimensional input coordinates $\mathbf{x}$ are projected to a higher-dimensional hypersphere via a positional embedding layer $\gamma$, which takes the form

$$\gamma(\mathbf{x}) = \big[ \mathbf{x}, \sin(2^0 \pi \mathbf{x}), \cos(2^0 \pi \mathbf{x}), \dots, \sin(2^{K-1} \pi \mathbf{x}), \cos(2^{K-1} \pi \mathbf{x}) \big], \quad (5)$$

where $K$ is a hyper-parameter that controls the total number of frequency bands. After computing (5), the embedded 3D input points $\gamma(\mathbf{x})$ are then passed through a conventional ReLU-MLP to obtain $(\mathbf{c}, \sigma)$.
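A minimal sketch of the embedding layer of Eq. (5) is shown below; the function name is illustrative.

```python
# Sketch of the positional embedding layer of Eq. (5): K frequency bands
# applied elementwise, with the raw coordinate also passed through, as in
# NeRF [18]. The name `positional_embedding` is illustrative.
import math
import torch

def positional_embedding(x, K):
    """x: (..., D) input coordinates -> (..., D * (2K + 1)) embedding."""
    out = [x]
    for k in range(K):
        freq = (2.0 ** k) * math.pi
        out += [torch.sin(freq * x), torch.cos(freq * x)]
    return torch.cat(out, dim=-1)
```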

3.2.3 Sine-MLPs:

are a coordinate-network type without a positional embedding, as proposed by [30]. In sine-MLPs, the activation function is a sinusoid of the form

$$\psi(\mathbf{x}) = \sin(\omega_0 \mathbf{x}), \quad (6)$$

where $\omega_0$ is a hyperparameter. A larger $\omega_0$ increases the bandwidth of the network, allowing it to encode increasingly higher-frequency functions.

3.2.4 Gaussian-MLPs:

are a recent class of positional-embedding-free coordinate-networks [24], where the activation function is defined as

$$\psi(\mathbf{x}) = \exp\!\left( \frac{-\mathbf{x}^2}{2\sigma^2} \right).$$

Here, $\sigma$ is a hyperparameter that can be used to tune the bandwidth of the network: a larger $\sigma$ corresponds to a lower bandwidth, and vice versa.
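Since the four network classes above differ only in the activation (and the presence of the embedding layer), they can be expressed with one parametric sketch. Below is a minimal, hedged PyTorch version: the layer width, depth, $\sigma = 0.1$ and $\omega_0 = 30$ (SIREN's commonly used default) are illustrative choices, not the paper's exact configuration.

```python
# Sketch of a coordinate-MLP (Eq. (4)) with the activation choices of
# Secs. 3.2.1-3.2.4. Sizes and the sigma/omega_0 values are illustrative.
import torch
import torch.nn as nn

def gaussian(x, sigma=0.1):
    # Gaussian activation of [24]: larger sigma -> lower bandwidth.
    return torch.exp(-x ** 2 / (2.0 * sigma ** 2))

class CoordinateMLP(nn.Module):
    def __init__(self, in_dim=3, hidden=256, out_dim=4, layers=6,
                 activation=gaussian):
        super().__init__()
        dims = [in_dim] + [hidden] * (layers - 1)
        self.hidden = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1))
        self.head = nn.Linear(hidden, out_dim)  # e.g. (colour, density)
        self.act = activation

    def forward(self, x):
        for layer in self.hidden:
            x = self.act(layer(x))
        return self.head(x)

# ReLU-, sine- and Gaussian-MLPs differ only in the chosen activation:
relu_mlp = CoordinateMLP(activation=torch.relu)
sine_mlp = CoordinateMLP(activation=lambda x: torch.sin(30.0 * x))
garf_mlp = CoordinateMLP(activation=gaussian)  # random init suffices
```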

3.3 GARF for Reconstruction and Pose Estimation

In this paper, we advocate the use of Gaussian-MLPs for jointly solving pose estimation and scene reconstruction, and show substantial empirical evidence that they yield better accuracy and easier optimization than the other choices. We speculate the reason for this superior performance is as follows. The pose parameters are optimized using the gradients that flow through the network. Hence, the ability to accurately represent the first-order derivatives of the encoded signal plays a key role in optimizing pose parameters. However, Sitzmann et al. [30] showed that PE-MLPs are incapable of accurately modeling the first-order derivatives of the target signal, resulting in noisy artifacts. This impacts the Fourier spectrum of the network function, which is implicitly related to the derivatives. It was shown in [26] that the Fourier transform of a shallow Gaussian-MLP takes the form

$$\hat{f}(\mathbf{k}) \;\propto\; \sum_{j} w^{(2)}_{j}\, \delta_{\mathbf{w}^{(1)}_{j}}(\mathbf{k})\, \exp\!\left( \frac{-\sigma^2 \|\mathbf{k}\|^2}{2 \|\mathbf{w}^{(1)}_{j}\|^2} \right), \quad (7)$$

where $\mathbf{k}$ is the frequency index, $\delta_{\mathbf{w}^{(1)}_{j}}$ is the Dirac delta distribution which concentrates along the line spanned by $\mathbf{w}^{(1)}_{j}$, and $\mathbf{w}^{(1)}_{j}$ and $w^{(2)}_{j}$ are the weights corresponding to the first and second layers, respectively. Note that Eq. (7) is a smooth distribution, parameterized by $\sigma$ and the $\mathbf{w}$'s. In other words, for a suitably chosen $\sigma$, the bandwidth of the network can be increased in a continuous manner by appropriately learning the weights. Furthermore, as $\sigma$ is a continuous parameter, it provides MLPs with the ability to smoothly manipulate the spectrum of the Gaussian-MLP.

In contrast, [45] demonstrated that the spectrum of a PE-MLP tends to consist of discrete spikes, placed on the integer harmonics of the positional embedding frequencies. Approximating the ReLU function via a polynomial of the form $\psi(x) = \sum_{p=0}^{P} a_p x^p$, where the $a_p$ are constants, they showed that the spectrum is concentrated on the frequency set $\{\sum_k c_k \omega_k \mid c_k \in \mathbb{Z}, \sum_k |c_k| \le P\}$, where the $\omega_k$ are the embedding frequencies of Eq. (5). Recall that in order to increase the frequency support of the positional embedding layer, one needs to increase $K$. It is evident that increasing $K$ even by one adds many harmonic spikes to the spectrum at the high-frequency end, irrespective of the network weights. Therefore, it is not possible to manipulate the spectrum of the PE-MLP continuously under a controlled setting. This can result in unnecessary high-frequency components that lead to unwanted artifacts.

On the other hand, sine-MLPs are able to construct rich spectra and represent first-order derivatives accurately [30]. A drawback, however, is that sine-MLPs are extremely sensitive to initialization. Sitzmann et al. [30] proposed an initialization scheme for sine-MLPs in signal reconstruction, under which they show strong convergence properties. However, we empirically demonstrate that when jointly optimizing for the pose parameters and scene reconstruction, this initialization yields sub-par performance, making sine-MLPs highly likely to get trapped in local minima. We also show that, in comparison, Gaussian-MLPs exhibit far superior convergence properties, indicating that they enjoy a simpler loss landscape.
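The claim about first-order derivatives can be probed directly with autograd. The sketch below, under the assumption of an already-fitted coordinate network such as the `CoordinateMLP` sketch above, computes $\partial f / \partial \mathbf{x}$ at a batch of coordinates; the comparisons of Figs. 3 and 4 visualise exactly this quantity.

```python
# Sketch: probing the first-order derivative df/dx of a fitted coordinate
# network with autograd, as visualised in Figs. 3 and 4. Assumes `model`
# maps (N, D) coordinates to per-point outputs; names are illustrative.
import torch

def first_order_derivative(model, coords):
    """coords: (N, D) coordinates; returns (N, D) gradients of the summed
    output channels with respect to the input coordinates."""
    coords = coords.clone().requires_grad_(True)
    out = model(coords).sum(dim=-1)   # scalar field value per coordinate
    (grad,) = torch.autograd.grad(out.sum(), coords)
    return grad
```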

4 Experiments

This section validates and analyses the effectiveness of our proposed GARF against other coordinate networks. We first develop the analysis on a 2D planar image alignment problem, and then demonstrate extensive results on learning NeRF from unknown camera poses.

Figure 1: A 2D planar image alignment instance. Left: Input image patches generated from the image with random homography perturbations. Right: The initial warps are initialised as identity.
Figure 2: Qualitative and quantitative results of the 2D planar image alignment problem. Left: Visualisation of the estimated poses with the error. Center: Reconstruction of each warped patches. Right: Final image reconstruction with the patch PSNR.
Figure 3: Comparison of the first-order derivatives of the encoded signal when solving an image reconstruction problem. The first-order derivative of each function is computed by differentiating the network's output with respect to the input coordinates. Note that only the groundtruth derivative is computed using a Sobel filter.
Figure 4: Comparison of the first-order derivatives of the encoded signal. The first-order derivative of each function is computed by differentiating the network's output with respect to the input coordinates. Note that only the groundtruth derivative is computed using a Sobel filter.
Figure 5: Left: Input images and the image reconstructions for SIREN and GARF, which correspond to the red and green curves, respectively. Right: Robustness to initialisation at different $\alpha$. When $\alpha = 0$, all the networks are initialised with the optimal weights; when $\alpha = 1$, all the networks are initialised with random weights. Note that for SIREN, we also investigate the case where SIREN strictly adheres to the initialisation scheme proposed by Sitzmann et al. [30] (red curve). The shaded areas correspond to two standard deviations over 10 runs.

4.1 2D Planar Image Alignment

To develop intuition, we first consider the 2D planar image alignment problem. More specifically, let $\mathbf{u} \in \mathbb{R}^2$ be the 2D pixel coordinates; we aim to optimize a neural image representation, parameterised by the weights $\Theta$ of a coordinate network $f$, while also solving for the warp parameters $\{\mathbf{p}_i\}_{i=1}^{M}$ as

$$\min_{\mathbf{p}_1, \dots, \mathbf{p}_M, \Theta} \; \sum_{i=1}^{M} \sum_{\mathbf{u}} \big\| f(\mathcal{W}(\mathbf{u}; \mathbf{p}_i); \Theta) - \mathcal{I}_i(\mathbf{u}) \big\|_2^2, \quad (8)$$

where $\mathcal{W}$ denotes the warp function parameterised by $\mathbf{p}_i$. Given $M$ patches $\{\mathcal{I}_i\}_{i=1}^{M}$ from the image, generated with random homography perturbations, we aim to jointly estimate the unknown homography warp parameters and the network weights $\Theta$. We fix the gauge freedom by anchoring the warp of the first patch to the identity; see Fig. 1 for an example.
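A minimal sketch of one step of this joint optimisation is shown below. For brevity it uses an illustrative 6-DoF affine warp rather than the full homography parameterisation of the paper, and all names (`warp`, `joint_step`) are hypothetical.

```python
# Sketch of one step of the joint optimisation of Eq. (8): the photometric
# loss backpropagates to both the network weights and the per-patch warp
# parameters. Affine warp used here for brevity (the paper uses a
# homography); names are illustrative.
import torch

def warp(u, p):
    """u: (N, 2) coordinates, p: (6,) affine parameters -> warped (N, 2)."""
    A = torch.eye(2, device=u.device) + p[:4].view(2, 2)
    return u @ A.T + p[4:]

def joint_step(model, patches, coords, warps, optimizer):
    """patches: list of (N, 3) target colours sampled at `coords`."""
    optimizer.zero_grad()
    loss = 0.0
    for target, p in zip(patches, warps):
        pred = model(warp(coords, p))     # render the patch through its warp
        loss = loss + ((pred - target) ** 2).mean()
    loss.backward()                       # gradients flow to Theta and warps
    optimizer.step()
    return loss.item()

# Usage sketch: freeze warps[0] at zero to fix the gauge, e.g.
#   warps = [torch.zeros(6)] + [torch.zeros(6, requires_grad=True)
#            for _ in range(M - 1)]
#   optimizer = torch.optim.Adam([*model.parameters(), *warps[1:]], lr=1e-3)
```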

4.1.1 Experimental settings.

We compare our proposed GARF with the following networks: a PE-MLP with a coarse-to-fine embedding annealer (BARF) [12] and a sine-MLP (SIREN) [30]. All networks are MLPs with 256-dimensional hidden units. We use the Adam optimizer to optimize both the network weights $\Theta$ and the warp parameters $\mathbf{p}$. For GARF and BARF we use separate learning rates for $\Theta$ and $\mathbf{p}$, both decaying exponentially; for SIREN we use a single exponentially decaying learning rate for both. For BARF, we use the positional embedding of Eq. (5) and linearly anneal its frequency bands over the course of the optimisation; we use the same parameters as proposed in [12]. At each optimization step, we randomly sample a subset of the pixel coordinates from each patch.

4.1.2 Initialisation.

For BARF and SIREN, we use the initialisation schemes proposed in the original papers [18, 12, 30], whereas for our proposed GARF we simply use randomly initialised weights. We initialise the warp parameters to identity for all models; see Fig. 1.

4.1.3 Results.

We show the quantitative and qualitative registration results in Fig. 2. As GARF is able to correctly estimate the warp parameters of all patches, it can reconstruct the image with high fidelity. On the other hand, BARF and SIREN struggle with the image reconstruction due to misalignment. It is important to note that the Gaussian-MLP initialisation protocol (plain random weights) puts the proposed method at a disadvantage. This further demonstrates the robustness of the Gaussian-MLP to initialisation.

4.1.4 First-order derivatives analysis.

For completeness, we first inspect the first-order derivatives of each coordinate network when solving an image reconstruction task, i.e., Eq. (8) with the warp parameters held fixed; note that we use the same notation as in Eq. (8). As discussed in Sec. 3.3, the ability to accurately represent the first-order derivatives of the encoded signal plays a crucial role in optimizing pose parameters. Fig. 3 reinforces that the first-order derivative of the signal encoded by a PE-MLP exhibits heavy noise artifacts, which results in poor error backpropagation to the pose parameters. While a properly-initialised SIREN is capable of representing the derivatives of the signal when solving for signal reconstruction alone, the initialisation strategy of the sine activation is sub-optimal when jointly optimizing for the neural image reconstruction and the warps. As a result, the resulting function derivative is no longer well-behaved; see Fig. 4. In contrast, GARF exhibits far superior convergence properties, even though the model weights are initialised randomly.

4.1.5 Robustness of initialisation scheme.

Additionally, we run a simple experiment to investigate the sensitivity of SIREN and GARF to initialisation. We denote by $\Theta^{\ast}$ the optimal model weights, obtained by solving Eq. (8) for a neural image representation with the warp parameters fixed, and by $\Theta_r$ the randomly initialised model weights, i.e., weights initialised using the PyTorch default initialisation. Our goal is to solve the joint optimisation problem of Eq. (8) when initialising with differently scaled model weights, i.e., by linearly adjusting an interpolation factor $\alpha$ between $\Theta^{\ast}$ and $\Theta_r$. As shown in Fig. 5, GARF (green curve) is marginally affected by the initialisation, while SIREN (blue curve) fails drastically (starting from $\alpha = 0.3$). When SIREN is initialised carefully using the initialisation scheme proposed by Sitzmann et al. [30] (red curve), its performance decreases as $\alpha$ gradually increases, i.e., as the perturbation to the optimal model weights increases. Note that the variance of performance for GARF is much smaller compared to SIREN.
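The exact blending protocol is not fully spelled out in the text above; a linear interpolation of the two weight sets is one natural reading of "linearly adjusting $\alpha$", and is sketched below with illustrative names.

```python
# Sketch of the initialisation-robustness probe: blend the optimal weights
# Theta* with random weights Theta_r by a factor alpha in [0, 1]. A linear
# blend is assumed here; alpha=0 gives the optimal weights, alpha=1 random.
import copy
import torch

def interpolate_weights(model_opt, model_rand, alpha):
    """Returns a model initialised with (1 - alpha)*Theta* + alpha*Theta_r."""
    model = copy.deepcopy(model_opt)
    with torch.no_grad():
        for w, w_opt, w_rand in zip(model.parameters(),
                                    model_opt.parameters(),
                                    model_rand.parameters()):
            w.copy_((1.0 - alpha) * w_opt + alpha * w_rand)
    return model
```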

4.1.6 Generalisation of coarse-to-fine scheduling

We exhaustively search through the log-space for the optimal coarse-to-fine schedulers for BARF; see the supp. material for more details. The optimal coarse-to-fine hyper-parameters for each image are data-dependent, i.e., the hyper-parameters tuned for one image may not be optimal for another. In contrast to the multi-dimensional coarse-to-fine schedulers, the Gaussian activation function involves only a one-dimensional search space, i.e., $\sigma$ in Eq. (7).

4.2 3D NeRF: Real World Scenes

This section investigates the task of jointly learning neural 3D representations with NeRF [18] on real-world scenes where the camera poses are unknown. We evaluate all the methods on the standard LLFF benchmark dataset [17], which consists of 8 real-world forward-facing scenes captured with hand-held cameras.

4.2.1 Experimental Settings.

We compare our proposed GARF with BARF and a reference NeRF (ref-NeRF). As we empirically observe that a PE-MLP with a scheduler (BARF) achieves better performance than a plain PE-MLP [39] in the joint optimisation of the neural radiance field and camera poses, we opt not to include comparisons with the plain PE-MLP here; see [12] or the supp. material for comparisons with PE-MLP. We parameterise the camera poses with the $\mathfrak{se}(3)$ Lie algebra and initialise them to identity for GARF and BARF. We assume known intrinsics.

4.3 Implementation Details.

We implement our framework following the settings from [18, 12] with some modifications. For simplicity, we train a single 6-layer MLP with 256 hidden units in each layer and without hierarchical sampling. We resize the images and randomly sample 2048 pixel rays every iteration, each ray sampled at a fixed number of depth coordinates. We use the Adam optimizer [9] and train all models for 200K iterations, with exponentially decaying learning rates for the network weights and for the poses. We use the default coarse-to-fine scheduling for BARF [12]. We use the same network size and sampling strategy for all methods throughout our evaluation. Note that for BARF and ref-NeRF, we use the implementation from BARF; all hyperparameters are configured as proposed in the paper.
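A hedged sketch of one joint training iteration under this setup is given below. `model`, `sample_rays`, `render_rays`, `se3_exp` and `num_images` are placeholders for the corresponding pieces of a NeRF codebase, not a real API, and the learning rates shown are illustrative rather than the paper's values.

```python
# Sketch of one joint optimisation step for Sec. 4.2: the camera poses live
# on the se(3) Lie algebra and receive gradients through volume rendering.
# All helper names and learning rates are illustrative placeholders.
import torch

poses = torch.zeros(num_images, 6, requires_grad=True)  # se(3) per image

optimizer = torch.optim.Adam([
    {"params": model.parameters()},
    {"params": [poses], "lr": 1e-3},   # poses typically get their own rate
], lr=5e-4)

def train_step(images):
    optimizer.zero_grad()
    i = torch.randint(len(images), (1,)).item()
    # 2048 randomly sampled rays, cast through the current pose estimate.
    rays_o, rays_d, target = sample_rays(images[i], se3_exp(poses[i]), n=2048)
    rgb = render_rays(model, rays_o, rays_d)   # quadrature of Eq. (2)
    loss = ((rgb - target) ** 2).mean()        # photometric loss of Eq. (1)
    loss.backward()                            # gradients reach both Theta
    optimizer.step()                           # and the pose parameters
    return loss.item()
```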

4.3.1 Evaluation Details.

We evaluate the performance of each method in terms of pose accuracy for registration and view synthesis quality for scene reconstruction. Following [12, 39], we evaluate the pose error by aligning the optimized poses to the groundtruth via Procrustes analysis, which computes the similarity transformation Sim(3) between them. Note that as the "groundtruth" camera poses provided for the LLFF real-world scenes are themselves estimates from COLMAP [28], the pose accuracy is only an indicator of how well the estimates agree with the classical method. We report the mean rotation and translation errors for pose, as well as PSNR, SSIM and LPIPS [18] for view synthesis in Table 1.
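For the Sim(3) alignment, one standard closed form is the Umeyama solution, sketched below in NumPy on the camera centres; this is a common construction for this evaluation protocol, not code from the paper, and the function name is illustrative. The rotation and translation errors of Table 1 would then be computed between the aligned estimates and the groundtruth.

```python
# Sketch: similarity (Sim(3)) alignment of estimated camera centres to the
# groundtruth via the Umeyama closed form, as used for pose evaluation.
import numpy as np

def umeyama_alignment(X, Y):
    """X, Y: (N, 3) estimated / groundtruth points. Returns (s, R, t)
    minimising ||Y - (s * X @ R.T + t)|| in the least-squares sense."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    # Cross-covariance and its SVD; S handles reflection degeneracies.
    U, D, Vt = np.linalg.svd(Yc.T @ Xc / len(X))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / Xc.var(axis=0).sum()
    t = mu_y - s * R @ mu_x
    return s, R, t
```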

Figure 6: Qualitative results on test-views of real world scenes [17]. While BARF and GARF can jointly optimize pose and the scene representation, GARF produces results with higher fidelity. Note that we use 6-layer MLPs for all methods in this experiment.
Scene    | Rotation (°)   | Translation    | PSNR                  | SSIM                  | LPIPS
         | BARF    Ours   | BARF    Ours   | BARF   Ours   ref    | BARF   Ours   ref    | BARF   Ours   ref
flower   | 0.47    0.46   | 0.25    0.22   | 23.58  26.40  23.20  | 0.67   0.79   0.66   | 0.27   0.11   0.27
fern     | 0.16    0.47   | 0.20    0.25   | 23.53  24.51  23.10  | 0.69   0.74   0.71   | 0.34   0.29   0.29
leaves   | 1.00    0.13   | 0.30    0.23   | 18.15  19.72  14.42  | 0.48   0.61   0.24   | 0.40   0.27   0.58
horns    | 3.50    0.03   | 1.32    0.21   | 19.73  22.54  19.93  | 0.66   0.69   0.59   | 0.35   0.33   0.45
trex     | 0.42    0.66   | 0.36    0.48   | 22.63  22.86  21.42  | 0.75   0.80   0.69   | 0.24   0.19   0.32
orchids  | 0.71    0.43   | 0.42    0.41   | 19.14  19.37  16.54  | 0.55   0.57   0.46   | 0.33   0.26   0.37
fortress | 0.17    0.03   | 0.32    0.27   | 28.48  29.09  25.62  | 0.80   0.82   0.78   | 0.16   0.15   0.19
room     | 0.27    0.42   | 0.20    0.32   | 31.43  31.90  31.65  | 0.93   0.94   0.94   | 0.11   0.13   0.09
Table 1: Quantitative comparison of GARF (Ours), BARF [12] and ref-NeRF on real-world scenes [17] given unknown camera poses. Pose accuracy is reported as mean rotation and translation errors of the aligned poses; "BARF" denotes [12] and "ref" denotes ref-NeRF.

4.3.2 Results.

Table 1 quantitatively contrasts the performance of GARF, BARF and ref-NeRF. As is evident, Gaussian activations enable GARF to recover camera poses which match the camera poses from off-the-shelf SfM methods. Moreover, even with a shallower network, Gaussian activations can successfully recover the 3D scene representation with higher fidelity in the absence of positional embedding, compared to BARF and ref-NeRF; see the qualitative results in Fig. 6.

4.4 Real-World Demo

To showcase the practicality of GARF, we take one step further and test it on images of a low-textured scene captured using an iPhone. Fig. 7 demonstrates the potential of GARF on a scene with many low-textured regions, while ref-NeRF exhibits artifacts in the novel views due to the existence of outliers in the front-end of the SfM pipeline, which results in unreliable camera pose estimates; see the supp. material for more results.

Figure 7: Novel view synthesis result on a low-textured scene captured using iPhone. Left banner: Training images. Top row: Rendered image and depth using ref-NeRF. Bottom row: Rendered image and depth using GARF.

5 Conclusions

We present Gaussian Activated neural Radiance Fields (GARF), a new positional embedding-free neural radiance field architecture that can reconstruct high fidelity neural radiance fields from imperfect camera poses, without cumbersome hyper-parameter tuning or careful model initialisation. By establishing theoretical intuition, we demonstrate that the ability of the model to preserve the first-order gradients of the target function plays an imperative role in the joint problem of optimizing for pose and radiance field reconstruction. Experimental results reinforce our theoretical intuition and demonstrate the superiority of GARF, even on challenging scenes with low-textured regions.

References

  • [1] R. Chabra, J. E. Lenssen, E. Ilg, T. Schmidt, J. Straub, S. Lovegrove, and R. Newcombe (2020) Deep local shapes: learning local SDF priors for detailed 3D reconstruction. In European Conference on Computer Vision, pp. 608–625.
  • [2] K. Deng, A. Liu, J. Zhu, and D. Ramanan (2021) Depth-supervised NeRF: fewer views and faster training for free. arXiv preprint arXiv:2107.02791.
  • [3] C. Gao, A. Saraf, J. Kopf, and J. Huang (2021) Dynamic view synthesis from dynamic monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5712–5721.
  • [4] K. Genova, F. Cole, A. Sud, A. Sarna, and T. Funkhouser (2020) Local deep implicit functions for 3D shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4857–4866.
  • [5] K. Genova, F. Cole, D. Vlasic, A. Sarna, W. T. Freeman, and T. Funkhouser (2019) Learning shape templates with structured implicit functions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7154–7164.
  • [6] A. Hertz, O. Perel, R. Giryes, O. Sorkine-Hornung, and D. Cohen-Or (2021) SAPE: spatially-adaptive progressive encoding for neural optimization. Advances in Neural Information Processing Systems 34.
  • [7] Y. Jeong, S. Ahn, C. Choy, A. Anandkumar, M. Cho, and J. Park (2021) Self-calibrating neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5846–5854.
  • [8] C. Jiang, A. Sud, A. Makadia, J. Huang, M. Nießner, T. Funkhouser, et al. (2020) Local implicit grid representations for 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6001–6010.
  • [9] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [10] M. Levoy (1990) Efficient ray tracing of volume data. ACM Transactions on Graphics (TOG) 9(3), pp. 245–261.
  • [11] Z. Li, S. Niklaus, N. Snavely, and O. Wang (2021) Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6498–6508.
  • [12] C. Lin, W. Ma, A. Torralba, and S. Lucey (2021) BARF: bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5741–5751.
  • [13] D. B. Lindell, J. N. Martel, and G. Wetzstein (2021) AutoInt: automatic integration for fast neural volume rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14556–14565.
  • [14] R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth (2021) NeRF in the wild: neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7210–7219.
  • [15] Q. Meng, A. Chen, H. Luo, M. Wu, H. Su, L. Xu, X. He, and J. Yu (2021) GNeRF: GAN-based neural radiance field without posed camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6351–6361.
  • [16] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4460–4470.
  • [17] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar (2019) Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG) 38(4), pp. 1–14.
  • [18] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) NeRF: representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pp. 405–421.
  • [19] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2020) Differentiable volumetric rendering: learning implicit 3D representations without 3D supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3504–3515.
  • [20] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) DeepSDF: learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 165–174.
  • [21] K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla (2021) Nerfies: deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5865–5874.
  • [22] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville (2019) On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310.
  • [23] A. Rahimi and B. Recht (2007) Random features for large-scale kernel machines. Advances in Neural Information Processing Systems 20.
  • [24] S. Ramasinghe and S. Lucey (2021) Beyond periodicity: towards a unifying framework for activations in coordinate-MLPs. arXiv preprint arXiv:2111.15135.
  • [25] S. Ramasinghe and S. Lucey (2021) Learning positional embeddings for coordinate-MLPs. arXiv preprint arXiv:2112.11577.
  • [26] S. Ramasinghe, L. MacDonald, and S. Lucey (2022) On regularizing coordinate-MLPs. arXiv preprint arXiv:2202.00790.
  • [27] C. Reiser, S. Peng, Y. Liao, and A. Geiger (2021) KiloNeRF: speeding up neural radiance fields with thousands of tiny MLPs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14335–14345.
  • [28] J. L. Schönberger and J. Frahm (2016) Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113.
  • [29] K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger (2020) GRAF: generative radiance fields for 3D-aware image synthesis. Advances in Neural Information Processing Systems 33, pp. 20154–20166.
  • [30] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein (2020) Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems 33, pp. 7462–7473.
  • [31] V. Sitzmann, M. Zollhöfer, and G. Wetzstein (2019) Scene representation networks: continuous 3D-structure-aware neural scene representations. Advances in Neural Information Processing Systems 32.
  • [32] S. Su, F. Yu, M. Zollhoefer, and H. Rhodin (2021) A-NeRF: surface-free human 3D pose refinement via neural rendering. arXiv preprint arXiv:2102.06199.
  • [33] E. Sucar, S. Liu, J. Ortiz, and A. J. Davison (2021) iMAP: implicit mapping and positioning in real-time. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6229–6238.
  • [34] M. Tancik, V. Casser, X. Yan, S. Pradhan, B. Mildenhall, P. P. Srinivasan, J. T. Barron, and H. Kretzschmar (2022) Block-NeRF: scalable large scene neural view synthesis. arXiv preprint arXiv:2202.05263.
  • [35] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng (2020) Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems 33, pp. 7537–7547.
  • [36] A. Tewari, J. Thies, B. Mildenhall, P. Srinivasan, E. Tretschk, Y. Wang, C. Lassner, V. Sitzmann, R. Martin-Brualla, S. Lombardi, et al. (2021) Advances in neural rendering. arXiv preprint arXiv:2111.05849.
  • [37] H. Turki, D. Ramanan, and M. Satyanarayanan (2021) Mega-NeRF: scalable construction of large-scale NeRFs for virtual fly-throughs. arXiv preprint arXiv:2112.10703.
  • [38] Q. Wang, Z. Wang, K. Genova, P. P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser (2021) IBRNet: learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699.
  • [39] Z. Wang, S. Wu, W. Xie, M. Chen, and V. A. Prisacariu (2021) NeRF–: neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064.
  • [40] W. Xian, J. Huang, J. Kopf, and C. Kim (2021) Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9421–9431.
  • [41] Z. J. Xu, Y. Zhang, and Y. Xiao (2019) Training behavior of deep neural network in frequency domain. In International Conference on Neural Information Processing, pp. 264–274.
  • [42] L. Yen-Chen, P. Florence, J. T. Barron, A. Rodriguez, P. Isola, and T. Lin (2021) iNeRF: inverting neural radiance fields for pose estimation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1323–1330.
  • [43] A. Yu, R. Li, M. Tancik, H. Li, R. Ng, and A. Kanazawa (2021) PlenOctrees for real-time rendering of neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5752–5761.
  • [44] A. Yu, V. Ye, M. Tancik, and A. Kanazawa (2021) pixelNeRF: neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4578–4587.
  • [45] G. Yüce, G. Ortiz-Jiménez, B. Besbinar, and P. Frossard (2021) A structured dictionary perspective on implicit neural representations. arXiv preprint arXiv:2112.01917.
  • [46] J. Zheng, S. Ramasinghe, and S. Lucey (2021) Rethinking positional encoding. arXiv preprint arXiv:2107.02561.
  • [47] Z. Zhu, S. Peng, V. Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, and M. Pollefeys (2021) NICE-SLAM: neural implicit scalable encoding for SLAM. arXiv preprint arXiv:2112.12130.