1 Introduction
Neural radiance fields (NeRFs; Mildenhall et al., 2020)
are remarkably good at estimating the 3D geometry of an object from 2D images of that object. A neural network (typically a modest-size multilayer perceptron) maps from 5D position-direction inputs to a 4D color-density output; this neural radiance field is plugged into a volumetric rendering equation
(Blinn, 1982) to obtain images of the field from various viewpoints, and trained to minimize the mean squared error in RGB space between the rendered images and the training images. This procedure works well when the training images are taken from enough viewpoints to fully constrain the geometry of the scene or object being modeled. But it fails when only one or two images are available; one cannot infer 3D geometry from a single 2D image without prior knowledge about what sorts of shapes are plausible. To address this issue, various extensions of NeRF have incorporated implicit and explicit priors, yielding impressive one- or few-shot novel-view reconstructions and/or unconditional samples (Yu et al., 2021; Kosiorek et al., 2021; Rebain et al., 2022a; Rematas et al., 2021; Dupont et al., 2022; Jang and Agapito, 2021; Wang et al., 2021; Trevithick and Yang, 2021; Chen et al., 2021).
But even with a good shape prior, there may still be uncertainty about the shape and appearance of unseen parts of the object. Although existing approaches can infer reasonable point estimates from a single image, they generally fail to account for this uncertainty.
We propose probabilistic NeRF (ProbNeRF), a system for learning priors on NeRF representations of 3D objects and for doing inference on those representations. At a high level, ProbNeRF is trained in the variational autoencoder (Kingma and Welling, 2014; Rezende et al., 2014) framework, using amortized variational inference to speed up training. At test time, we use Hamiltonian Monte Carlo (Neal et al., 2011) to sample from the posterior over NeRFs that are consistent with a set of input views. Several technical contributions proved necessary to achieve high-fidelity reconstruction and robust shape uncertainty with this design:

- We employ a temperature-annealing strategy in our HMC sampler to make it more robust to isolated modes that arise from the non-log-concave likelihood.

- We employ a two-stage hypernetwork-based decoder, rather than a single-network strategy such as latent concatenation. This design lets us represent each object using a relatively small NeRF, which dramatically reduces per-pixel rendering costs (and therefore the cost of iterative test-time inference).

- In addition to a low-dimensional latent code, we treat the raw weights of each object's NeRF representation as random variables to be inferred. This eliminates the latent-code bottleneck from our model, allowing high-fidelity reconstruction of novel objects.
2 Method
In this section, we describe the ProbNeRF generative process, training procedure, and testtime inference procedure, as well as the neural architectures that implement them.
2.1 Generative Process
Let $f_w$ be a function that, given some neural network weights $w$, a position $x$, and a viewing direction $d$, outputs a density $\sigma$ and an RGB color $c$. Let $R$ be a rendering function that maps from a ray $r$ and the conditional field $f_w$ to a color by querying $f_w$ at various points along the ray $r$. (The renderer we use is defined in detail in Section 2.5.)
ProbNeRF assumes that, given a set of rays $\{r_i\}$, a set of pixels $\{c_i\}$ is generated by the following process: sample an abstract object code $z$ from a standard normal distribution pushed forward through an invertible RealNVP map $g_\eta$ (Dinh et al., 2017), run it through a hypernetwork $h_\phi$ to get a set of NeRF weights $\bar{w} = h_\phi(z)$, perturb those weights with low-variance Gaussian noise, render the resulting model, and add some pixelwise Gaussian noise. More formally,

$$\epsilon \sim \mathcal{N}(0, I), \quad z = g_\eta(\epsilon), \quad w \sim \mathcal{N}(h_\phi(z), \sigma_w^2 I), \quad c_i \sim \mathcal{N}(R(r_i; f_w), \sigma_n^2 I), \qquad (1)$$

where $g_\eta$ is an invertible RealNVP (Dinh et al., 2017) function with parameters $\eta$, $z$ is a latent code that summarizes the object's shape and appearance, $h_\phi$ is a hypernetwork with parameters $\phi$ that maps from codes $z$ to NeRF weights $w$, and $\sigma_w^2$ and $\sigma_n^2$ are scalar variance parameters. The generative process is summarized in Figure 1.
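The generative process can be sketched end-to-end in numpy; `realnvp`, `hypernet`, and `render` below are toy stand-ins for the learned networks and the volumetric renderer, not the actual architectures:

```python
import numpy as np

rng = np.random.default_rng(0)

def realnvp(eps):
    # Stand-in for the learned invertible RealNVP map; here just the identity.
    return eps

def hypernet(z, out_dim=16):
    # Stand-in for the hypernetwork mapping a code z to NeRF weights w.
    proj = rng.standard_normal((out_dim, z.shape[0]))
    return proj @ z

def render(w, rays):
    # Stand-in for the renderer R(r; f_w): one scalar "color" per ray.
    return np.tanh(rays @ w[: rays.shape[1]])

def sample_pixels(rays, sigma_w=0.01, sigma_n=0.05):
    eps = rng.standard_normal(8)          # epsilon ~ N(0, I)
    z = realnvp(eps)                      # latent code
    w_bar = hypernet(z)                   # mean weights from the hypernetwork
    w = w_bar + sigma_w * rng.standard_normal(w_bar.shape)  # perturbed weights
    c = render(w, rays)                   # rendered colors
    return c + sigma_n * rng.standard_normal(c.shape)       # pixelwise noise

pixels = sample_pixels(rng.standard_normal((4, 16)))
```

The two noise scales play the roles of the weight-perturbation and observation variances in Eq. (1).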
This generative process is similar to the one assumed by Kosiorek et al. (2021) and Dupont et al. (2022): a latent code is sampled from a learned prior defined by a RealNVP, and used to index a learned family of NeRFs. There are two main differences. The first difference is architectural: we use a hypernetwork (Ha et al., 2017) to generate a full set of NeRF weights instead of concatenating the latent code to the input and activations. (Dupont et al. (2022) frame their approach in terms of FiLM-style modulations (Perez et al., 2018) rather than concatenation; we show in the supplement that their latent-shift strategy is equivalent to concatenating a latent code to the activations at each layer.) The hypernetwork approach generalizes the latent-concatenation approach, and recent theoretical results (Galanti and Wolf, 2020) argue that hypernetworks should allow us to achieve a similar level of expressivity to the latent-concatenation strategy using a smaller architecture for $f_w$; intuitively, putting many parameters into a large, expressive hypernetwork makes it easier to learn a mapping to a compact function representation. This leads to large savings at both train and test time if we need to render many rays per object, since we can amortize the cost of an expensive mapping from $z$ to $w$ over hundreds or thousands of rays, each of which requires many function evaluations to render. For comparison, the NeRF architecture employed by Dupont et al. (2022) is an MLP with 15 layers of 512 hidden units, whereas in our experiments we get competitive results using a four-hidden-layer architecture with 64 hidden units, a cost savings of more than two orders of magnitude per function evaluation. Without this reduction in rendering costs, iterative MCMC methods for test-time inference would be impractical.
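As a rough sanity check on the claimed per-evaluation savings, one can count multiply-accumulates in a single MLP forward pass (a back-of-the-envelope sketch that ignores positional-encoding features and biases; the helper name and input/output dimensions are ours):

```python
def mlp_flops(n_hidden_layers, width, d_in=3, d_out=4):
    """Rough multiply-accumulate count for one MLP forward pass."""
    dims = [d_in] + [width] * n_hidden_layers + [d_out]
    return sum(a * b for a, b in zip(dims[:-1], dims[1:]))

functa_cost = mlp_flops(15, 512)  # ~15 layers of 512 units (Dupont et al., 2022)
ours_cost = mlp_flops(4, 64)      # four hidden layers of 64 units (this work)
ratio = functa_cost / ours_cost
```

Under these simplifications the ratio comes out well over 100x, consistent with the "more than two orders of magnitude" figure in the text.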
The second main difference between the ProbNeRF generative process and previous generative models of NeRFs is that we allow for small perturbations of the weights $w$. This is essentially a measure to address misspecification (cf. e.g., Kleijn and van der Vaart, 2012); it ensures that our prior on NeRFs has positive support on the full range of functions $f_w$, rather than the much smaller manifold of functions $f_{h_\phi(z)}$. We choose the variance $\sigma_w^2$ on the weights to be small enough not to introduce noticeable artifacts, but large enough that the likelihood signal from a high-resolution image can overwhelm the prior preference to stay near the manifold defined by the mapping from $z$ to $w$. That way, even if the range of the hypernetwork does not include a parameter vector that accurately represents an object (for example, due to limited capacity or overfitting), the posterior will still concentrate around a good set of parameters with more data.

2.2 Training Procedure
We train ProbNeRF models using a variational autoencoder (Kingma and Welling, 2014; Rezende et al., 2014) strategy, with a simplified generative process that omits the perturbation from $\bar{w} = h_\phi(z)$ to $w$:

$$\epsilon \sim \mathcal{N}(0, I), \quad z = g_\eta(\epsilon), \quad c_i \sim \mathcal{N}(R(r_i; f_{h_\phi(z)}), \sigma_n^2 I). \qquad (2)$$
We omit these perturbations at training time to force the model to learn hypernet parameters $\phi$ and RealNVP parameters $\eta$ that can explain the training data well without relying on perturbations. The perturbations are intended to give the model an inference-time "last resort" for explaining factors of variation that were not in the training set; at training time we do not want to explain away variations that could be explained using $z$, since the model lacks a mechanism to learn a meaningful prior on $w$.
To compute a variational approximation to the posterior $p(z \mid \{c_i\})$, we use a convolutional neural network (CNN; LeCun and Bengio, 1998) to map from each RGB image and camera matrix to a diagonal-covariance Gaussian potential over $z$, parameterized as locations $\mu_i$ and precisions $\tau_i$ for the $i$th image; these potentials are meant to approximate the influence of the likelihood function on the posterior (Johnson et al., 2016; Sønderby et al., 2016). We combine these potentials with a learned "prior" potential parameterized by location $\mu_0$ and precisions $\tau_0$ via the Gaussian update formulas

$$\tau = \tau_0 + \sum_i \tau_i, \qquad \mu = \frac{\tau_0 \mu_0 + \sum_i \tau_i \mu_i}{\tau}, \qquad (3)$$

and set $q(z) = \mathcal{N}(\mu, \operatorname{diag}(\tau)^{-1})$.
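The Gaussian update above amounts to a precision-weighted average of the potentials; a minimal numpy sketch (the function name and toy numbers are ours):

```python
import numpy as np

def fuse_gaussian_potentials(prior_loc, prior_prec, locs, precs):
    """Combine a prior potential with per-image diagonal-Gaussian
    potentials: precisions add, and locations are precision-weighted."""
    prec = prior_prec + np.sum(precs, axis=0)
    loc = (prior_prec * prior_loc + np.sum(precs * locs, axis=0)) / prec
    return loc, prec

# Two image potentials over a 3-dimensional latent, unit precisions,
# combined with a standard-normal-like prior potential.
locs = np.array([[1.0, 0.0, 2.0],
                 [3.0, 0.0, 2.0]])
precs = np.ones((2, 3))
loc, prec = fuse_gaussian_potentials(np.zeros(3), np.ones(3), locs, precs)
```

Where the image potentials agree (and are confident), the fused location moves toward them; otherwise it stays near the prior.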
We train the encoder parameters $\psi$, the hypernet parameters $\phi$, and the RealNVP parameters $\eta$ by maximizing the evidence lower bound (ELBO) using Adam (Kingma and Ba, 2015):

$$\mathcal{L}(\psi, \phi, \eta) = \mathbb{E}_{q(z)}\big[\log p(\{c_i\} \mid z)\big] - \mathrm{KL}\big(q(z) \,\|\, p(z)\big). \qquad (4)$$
We train on minibatches of 8 objects and 10 randomly selected images per object to give the encoder enough information to infer a good latent code $z$. The encoder sees all 10 images, but to reduce rendering costs we compute an unbiased estimate of the log-likelihood from a random subsample of 1024 rays per object.

2.3 Architectures
For each object’s NeRF, we use two MLPs, each with two hidden layers of width 64. The first MLP maps from position to density; the second maps from position, view direction, and density to color. All positions and view directions are first transformed using a 10th-order sinusoidal encoding (Mildenhall et al., 2020). The number of parameters per object is 20,868, relatively few for a NeRF.
The RealNVP network that implements the mapping from $\epsilon$ to $z$ comprises two pairs of coupling layers. Each coupling layer is implemented as an MLP with one 512-unit hidden layer that shifts and rescales half of the variables conditioned on the other half; each pair of coupling layers updates a complementary set of variables. The variables are randomly permuted after each pair of coupling layers.
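A coupling pair of the kind described can be sketched as follows (toy conditioners stand in for the one-hidden-layer MLPs, and the random permutation between pairs is omitted):

```python
import numpy as np

def coupling(x, update_first_half, scale_fn, shift_fn):
    """One affine coupling layer: shift and rescale one half of x
    conditioned on the other half (invertible by construction)."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    if update_first_half:
        x1 = x1 * np.exp(scale_fn(x2)) + shift_fn(x2)
    else:
        x2 = x2 * np.exp(scale_fn(x1)) + shift_fn(x1)
    return np.concatenate([x1, x2], axis=-1)

def coupling_pair(x, scale_fn, shift_fn):
    # Each pair of coupling layers updates a complementary set of variables,
    # so every coordinate is transformed once per pair.
    x = coupling(x, True, scale_fn, shift_fn)
    return coupling(x, False, scale_fn, shift_fn)

# Toy conditioners standing in for the learned 512-unit MLPs.
scale_fn = lambda h: 0.1 * np.tanh(h)
shift_fn = lambda h: 0.5 * h
y = coupling_pair(np.ones(4), scale_fn, shift_fn)
```

Because each half is transformed by an affine map whose parameters depend only on the other half, the layer can be inverted exactly, which is what makes the flow's density tractable.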
The hypernetwork that maps from the 128-dimensional code $z$ to the 20,868 NeRF weights is a two-layer, 512-hidden-unit MLP. This mapping uses a similar number of FLOPs to rendering a few pixels (see Section 2.5).
The encoder network applies a five-layer CNN to each image and a two-layer MLP to its camera-to-world matrix, then linearly maps the concatenated image and camera activations to locations and log-scales for each image’s Gaussian potential. (Full architectural details are in the supplement.)
All networks use ReLU nonlinearities.
2.4 Test-Time Inference
The training procedure outlined above is able to learn a good RealNVP prior on codes $z$ and to reconstruct training-set examples accurately. However, we found that the trained encoder generalizes poorly to held-out examples: it is useful scaffolding for training the model, but fails to accurately reconstruct objects it was not trained on. Furthermore, variational inference is well known to underestimate posterior uncertainty (e.g., Yao et al., 2018), and one of our primary goals is to capture uncertainty about object shape and appearance.
As an alternative, at inference time we turn to Hamiltonian Monte Carlo (HMC; Neal et al., 2011), a gradient-based Markov chain Monte Carlo (MCMC) method that uses momentum to mitigate poor conditioning of the target log-density function. Rather than sample in $(z, w)$ space, we use the non-centered parameterization (Betancourt and Girolami, 2015) and sample the standardized variables $\epsilon$ and $\xi$, where $z = g_\eta(\epsilon)$ and $w = h_\phi(z) + \sigma_w \xi$, since the joint prior on $\epsilon$ and $\xi$ is a well-behaved spherical normal. (Note that only the prior's contribution to the posterior is simple; the likelihood still makes things difficult.)

HMC is a powerful MCMC algorithm, but it can still get trapped in isolated modes of the posterior. Running multiple chains in parallel can provide samples from multiple modes, but it may be that some chains find (but cannot escape from) modes that have negligible mass under the posterior. A conditioning problem also arises in inverse problems where some degrees of freedom are poorly constrained by the likelihood: as the level of observation noise decreases, it becomes necessary to use a smaller step size, but the distance in the latent space between independent samples may stay almost constant (Langmore et al., 2021).

To make our sampling procedure more robust to minor modes and poor conditioning, we use a temperature-annealing strategy (e.g., Kirkpatrick et al., 1983; Neal, 2001). Over the course of HMC iterations, we reduce the observation-noise scale logarithmically from a high initial value to a low final value (for a Gaussian likelihood, this is equivalent to annealing the "temperature" of the likelihood). That is, we start out targeting a distribution that is close to the prior, and gradually increase the influence of the likelihood until we are targeting the posterior. We also anneal the step size so that it is proportional to the current noise scale. This procedure lets the sampler explore the latent space thoroughly at higher temperatures before settling into a state that achieves low reconstruction error. In Section 3.3.2, we show that this annealing procedure yields more-consistent results than running HMC at a low fixed temperature.
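The logarithmic noise-and-step-size schedule can be sketched as follows (a minimal illustration; `np.geomspace` spacing, the symbol names, and the final step size are our choices):

```python
import numpy as np

def annealing_schedule(sigma_init, sigma_final, n_steps, step_size_final):
    """Log-spaced observation-noise scales from sigma_init down to
    sigma_final, with the HMC step size kept proportional to the
    current noise scale."""
    sigmas = np.geomspace(sigma_init, sigma_final, n_steps)
    step_sizes = step_size_final * sigmas / sigma_final
    return sigmas, step_sizes

sigmas, step_sizes = annealing_schedule(1.0, 0.01, 100, 1e-3)
```

Early iterations target a near-prior distribution with large steps; by the final iteration the sampler targets the true posterior with the small step size the low noise scale demands.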
2.5 Exact Rendering
NeRFs (Mildenhall et al., 2020) generally employ a stochastic quadrature approximation of the rendering integral. Although this procedure is deterministic at test time, we have found empirically that its gradients are not reliable enough to use in HMC (see Section 3.3.1).
While stochastic-gradient methods are robust to the noise from this procedure, standard HMC methods are not (Betancourt, 2015). Stochastic-gradient HMC methods do exist (Ma et al., 2015), but require omitting the Metropolis correction, which perturbs the stationary distribution unless one uses a small step size and/or can accurately estimate the (high-dimensional) covariance of the gradient noise.
Instead, we employ a simplified version of the renderer used in Chen et al. (2022). We assume all density is concentrated in a "foam" consisting of the surfaces of a 128×128×128 lattice of cubes. Since there is no density inside the cubes, we can render a ray by enumerating all ray-cube intersection points, computing opacities and colors at each intersection, and alpha-compositing the result. This simplification avoids the need to map the latent code to grid vertices as in Chen et al. (2022). The number of function evaluations needed to render a ray scales with the number of cube surfaces the ray crosses, which grows with the grid's side length rather than with the total number of grid cells. In Section 3.3.1 we show that this renderer works well with HMC, while HMC with the standard quadrature scheme cannot achieve high acceptance rates.
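The final alpha-compositing step over sorted ray-cube intersections can be sketched as follows (a minimal front-to-back compositor; the toy colors and opacities are ours):

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing over intersection points sorted
    by distance along the ray. colors: (n, 3), alphas: (n,)."""
    transmittance = 1.0
    out = np.zeros(3)
    for c, a in zip(colors, alphas):
        out += transmittance * a * c   # contribution of this surface
        transmittance *= 1.0 - a       # light remaining past the surface
    return out

# Two intersections: a red surface in front of a green one.
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
opaque = composite(colors, np.array([1.0, 1.0]))   # front surface fully opaque
partial = composite(colors, np.array([0.5, 1.0]))  # front surface half-transparent
```

With a fully opaque front surface only red survives; with a half-transparent front surface the result is an even red-green mix.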
3 Experiments
In this section, we qualitatively and quantitatively evaluate ProbNeRF's ability to generate realistic-looking conditional and unconditional samples, as measured by FID score (Heusel et al., 2017); accurately reconstruct the views we condition on; and, given a single image, generate diverse and plausible hypotheses about what an object looks like from other views. We also run a set of ablations to demonstrate the value of using an exact renderer when doing MCMC, using temperature annealing in our HMC procedure, and doing posterior inference over the raw NeRF weights as well as the latent code.
We evaluate ProbNeRF on two datasets: SRN Cars (Sitzmann et al., 2019), and a set of renderings from the GHUM generative model (Xu et al., 2020)
. We use the standard SRN Cars dataset with 2458 train cars and 704 test cars. GHUM is a generative model fit to a dataset of 60,000 photorealistic body scans; it comprises normally distributed latent codes corresponding to facial expression, body pose, and body shape, which are decoded into a 3D mesh that can be rendered given a camera pose. To isolate the representation of uncertainty over body poses, we construct a dataset by sampling a code for body pose while setting the facial-expression and body-shape codes to zero. We generate 2500 training examples, each containing 50 views from cameras pointing at the middle of the body, with poses sampled uniformly from a fixed-radius circle around the body, parallel to the ground plane and elevated to the middle of the body. We generate 100 test examples, each containing 50 views generated the same way except that the camera poses are equally spaced points on the circle. All images share a common resolution. All models were trained for one million iterations with a fixed observation-noise scale $\sigma_n$.

3.1 Posterior Inference
We want ProbNeRF to accurately reconstruct the view it is conditioned on and to produce realistic and diverse reconstructions of held-out views. To quantitatively assess this, we measure reconstruction quality using the PSNR metric (Wang et al., 2004), realism using the FID score (Heusel et al., 2017), and diversity using average per-pixel variance.
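The diversity metric amounts to the variance across posterior samples at each pixel, averaged over pixels and channels; a small sketch (the array shapes and function name are our assumptions):

```python
import numpy as np

def mean_pixel_variance(renders):
    """Diversity score: variance over posterior samples at each pixel,
    averaged over pixels and channels. renders: (n_samples, H, W, 3)."""
    return np.var(renders, axis=0).mean()

# Identical samples have zero diversity; random samples do not.
identical = np.ones((8, 4, 4, 3))
diverse = np.random.default_rng(0).uniform(size=(8, 4, 4, 3))
score_identical = mean_pixel_variance(identical)
score_diverse = mean_pixel_variance(diverse)
```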
We compare ProbNeRF with baseline methods along both the inference and the modeling axes. For inference, we compare ProbNeRF's HMC inference procedure with an iterative mean-field variational inference (VI) procedure fit using Adam and sticking-the-landing gradients (Roeder et al., 2017). For modeling, we compare ProbNeRF with "Functa" (Dupont et al., 2022), which proposes separately extracting a compressed representation, a functum, for each training example and doing probabilistic modeling and inference on top of the functa.
To compare HMC and VI, we use the same trained ProbNeRF model and, for each test example, pick a conditioning view and a set of held-out views. For both HMC and VI, we produce a set of samples targeting the single-view posterior, which are then used to produce renderings from the conditioning view and from the held-out views. We chose the perturbation variance $\sigma_w^2$ by rendering weights perturbed with different amounts of Gaussian noise and choosing the largest variance that did not produce significant artifacts.
In HMC, we obtain samples by running 8 independent chains with the annealing procedure described in Section 2.4 and taking the last 16 samples in each chain. In all experiments, we run annealed HMC for 100 steps with 100 leapfrog steps per HMC step, for a total of 10,000 gradient evaluations. The noise scale is annealed from a high initial value down to a low final value. In VI, we approximate the posterior using an independent normal distribution for each element of the latent variable, parameterized by a location and log-scale. We maximize the evidence lower bound with respect to these parameters using 1500 gradient steps and generate samples using the final parameters.
We find that while both HMC and VI produce high-quality (high-PSNR) and realistic (low-FID) reconstructions, HMC produces significantly more-diverse held-out reconstructions (higher mean per-pixel variance) for both SRN Cars and GHUM (Fig. 3). This is qualitatively illustrated in Fig. 2, where we show three samples from each of HMC and VI rendered from the conditioning view and two held-out views, alongside the per-pixel variance shown as a heatmap (more HMC samples can be found in the supplement). HMC samples show diversity in the pose of the left arm and the left leg in the GHUM example, and in the shape of the spoiler, bumper, and taillights in the SRN Cars example. VI samples show almost no diversity.
To compare ProbNeRF and Functa, we train a Functa model on SRN Cars using the open-sourced code to obtain a functaset of training-set modulations. We fit the RealNVP prior used for our model instead of a neural spline flow (Durkan et al., 2019) or denoising diffusion probabilistic model (Ho et al., 2020), as done in Functa, since the prior-fitting code is not released; we note that this may be a source of difference in our reproduction. For each test example, we obtain a modulation code by performing a 1000-step gradient-based MAP search given the observed view, as per equation 2 in Dupont et al. (2022). These modulation codes are then rendered for the conditioning view and the held-out views.

We find that Functa produces less accurate, less realistic, and less diverse reconstructions than ProbNeRF on both conditioned-on and held-out views (Fig. 3).
3.2 Generative Modeling
Our unconditional samples look realistic for both GHUM and SRN Cars (Fig. 4; more in the supplementary material). SRN Cars samples are on par with Functa in FID when the latter is trained using a denoising diffusion probabilistic model (DDPM) prior (Ho et al., 2020) (Table 1). Both ProbNeRF and Functa have a worse FID than the GAN of Chan et al. (2021), which focuses only on generation and cannot be trivially extended to do uncertainty-aware inference. We hypothesize that the gap between our retrained Functa model and the results reported by Dupont et al. (2022) is due to the lower expressivity of the RealNVP prior compared to a DDPM prior (although DDPMs are much more expensive to sample from and do posterior inference with).
Table 1: FID scores for unconditional samples on SRN Cars.

       ProbNeRF   Functa   Functa (DDPM)   GAN
FID      84.6     158.1        80.3        36.7
3.3 Ablations
3.3.1 Exact Renderer
To demonstrate the difficulty of doing HMC with the approximate quadrature-based volumetric renderer (Mildenhall et al., 2020) versus the exact foam renderer (Section 2.5), we first trained models using the quadrature- and foam-based renderers on SRN Cars. We evaluate HMC on each of these models using the appropriate renderer as follows. We sample a fixed pair $(z, w)$ from the prior for each model and render a single view of the resulting NeRF; conditioned on this view, $(z, w)$ is a sample from the posterior. For each of a variety of step sizes, we then run 8 HMC chains with 10 leapfrog steps initialized from this sample and targeting that posterior for 20 iterations, and report the average Metropolis acceptance rate across the chains on the last iteration. HMC sampling using the quadrature renderer suffers from low acceptance rates even with tiny step sizes, while the foam renderer yields high acceptance rates for small-enough step sizes (Fig. 5).
3.3.2 Temperature Annealing
In this section we qualitatively demonstrate the value of our annealed-HMC strategy. Conditioning on the rear view of an ambulance, we ran HMC with annealing as described in Section 2.4, and compare with HMC initialized in the same way but without annealing and with a fixed step size of 0.0005 (we found it necessary to use a lower step size for fixed-temperature HMC to ensure reasonable acceptance rates across all chains). We ran both methods with 8 parallel chains. Fig. 6 shows the last samples from each method.
The annealed-HMC procedure’s samples are both more consistent and more faithful to the ground truth. This result is consistent with the hypothesis that the annealing procedure allows HMC to avoid low-mass modes of the posterior and focus on more-plausible explanations of the data.
3.3.3 Inference Over Raw NeRF Weights
In addition to running HMC on ProbNeRF targeting both the latent code and the raw NeRF weights, we run HMC on ProbNeRF targeting only the latent code. As in Section 3.1, we obtain samples by running 8 independent chains, take the last 16 samples from each chain, and render reconstructions from the conditioning view and held-out views.
We show that performing inference over the raw NeRF weights significantly increases the quality (higher PSNR) and realism (lower FID) of the conditioned-on view reconstruction while not having negative effects on held-out-view reconstruction performance (Fig. 3). Further, when conditioning on high-information views, the quality and realism of held-out views improves relative to when we only infer the latent code. This supports our hypothesis that treating the raw NeRF weights as latent variables extends the prior's support over radiance fields, which lets our system adapt to novel views given sufficiently informative observations. Importantly, doing test-time inference in a model that differs from the model used during training does not harm performance on reconstruction of faraway held-out views or on prior sampling. For GHUM, the discrepancy between ProbNeRF and latent-only ProbNeRF in conditioned-on view reconstruction quality is not as large, which we attribute to (i) a smaller discrepancy between training and test distributions and (ii) both models being able to fit the training distribution well.
4 Related Work
Neural fields for novel-scene inference. While classic NeRFs (Mildenhall et al., 2020) are only fit on single scenes, there have been many recent extensions allowing novel-scene or novel-object inference. CodeNeRF (Jang and Agapito, 2021) and LOLNeRF (Rebain et al., 2022a) condition neural fields by concatenating inputs with per-scene latent codes, and learn priors that are able to generate coherent 3D geometry. PixelNeRF (Yu et al., 2021), IBRNet (Wang et al., 2021), GRF (Trevithick and Yang, 2021), and MVSNeRF (Chen et al., 2021) exploit the geometry of the conditioning view, an approach known as "local conditioning", to inform novel-view reconstructions. As noted by Sajjadi et al. (2022), local conditioning generalizes less well to views far from observations, and is more computationally expensive. Sajjadi et al. (2022) explore attention-based mechanisms alongside a set-based latent representation, which Rebain et al. (2022b) observe to be generally superior to concatenation- or hypernetwork-based methods. However, hypernetworks are shown to perform nearly as well as attention mechanisms, and attention is expensive, especially for iterative posterior inference (Kosiorek et al., 2021). Scene representation networks (Sitzmann et al., 2019) also use hypernetwork conditioning, although on a different neural representation. Similar to ShaRF (Rematas et al., 2021), we also find that test-time inference of NeRF weights alongside latent codes improves reconstructions, especially when input images are highly informative.

Probabilistic neural scene representations. Like ProbNeRF, these methods go beyond single point estimates and can represent multiple plausible scenes consistent with potentially low-information views. Generative Query Networks (Eslami et al., 2018; Rosenbaum et al., 2018) also train a VAE system, but use a convolutional decoder that does not enforce multi-view consistency. NeRF-VAE (Kosiorek et al., 2021) also combines VAEs and NeRFs, but relies on latent-concatenation conditioning and amortized VI, resulting in low-diversity novel-view reconstructions. Concurrently with our work, Anonymous (2023) extend NeRF-VAE to produce larger posterior diversity by using normalizing flows, attention, and set-based latent representations; we were unable to evaluate it at the time of submission. GAUDI (Bautista et al., 2022) fits a diffusion-based conditional generative model that can sample diverse and plausible large-scale real-world scenes given few observed views, although the conditioning mechanism must be trained from paired data, whereas ProbNeRF can condition on arbitrary sets of pixels and camera positions. 3DiM (Watson et al., 2022) is a diffusion-based image-to-image model that can synthesize diverse novel views of scenes, but it does not guarantee multi-view consistency.
Other NeRF generative models. Schwarz et al. (2020), Niemeyer and Geiger (2021), and Chan et al. (2021) train NeRF generative models using a discriminator loss from generative adversarial nets (GANs) (Goodfellow et al., 2014), which produces high-quality, diverse samples. However, GAN-style training often results in models that struggle at reconstruction (Wu et al., 2017). DreamFusion (Poole et al., 2022) generates impressive NeRF scenes given a text prompt, but also does not focus on inference from images.
Exact rendering. DIVeR (Wu et al., 2021)
introduces a deterministic and exact renderer based on exactly integrating trilinearly interpolated features on a voxel grid. This requires four times as many function evaluations or table lookups per intersected voxel as the MobileNeRF strategy we adapt (Chen et al., 2022).

Probabilistic programming for computer vision.
Although several probabilistic programming approaches to computer vision use testtime Monte Carlo inference
(Mansinghka et al., 2013; Le et al., 2017; Kulkarni et al., 2015; Gothoskar et al., 2021), they mainly focus on finding one probable scene interpretation per 2D image (though
Mansinghka et al. (2013) demonstrate limited domain-specific uncertainty reporting on a restricted class of road scenes). In contrast, ProbNeRF characterizes shape and appearance uncertainty for open-ended classes of 3D objects.

5 Discussion
Given a single low-information view of a novel object, ProbNeRF can not only produce reasonable point estimates of that object’s shape and appearance, but can also guess what range of shapes and appearances are consistent with the available data. Making these sorts of diverse (but coherent) guesses about unseen features of objects is a fundamental problem in vision. ProbNeRF shows that it is possible to simultaneously achieve high-fidelity reconstruction and robust characterization of uncertainty within the NeRF framework. One next step for future research could be to quantify tradeoffs between model fidelity, inference efficiency, and uncertainty characterization, to support variations on ProbNeRF suitable for real-time applications in robotics. More broadly, we hope ProbNeRF encourages more research at the interface of Bayesian inference, 3D graphics, and computer vision, enabling computer vision systems that entertain diverse hypotheses about the world.
Thanks to Katie Colins, Varun Jampani, Adam Kosiorek, Despoina Paschalidou, and Sharad Vikram for their helpful comments, and to Alex Alemi, Kevin Murphy, and Andrea Tagliasacchi for their timely and helpful feedback on the manuscript.
References
 Anonymous (2023) Anonymous (2023). Laser: Latent set representations for 3d generative modeling. In Submitted to The Eleventh International Conference on Learning Representations. under review.
 Bautista et al. (2022) Bautista, M. A., Guo, P., Abnar, S., Talbott, W., Toshev, A., Chen, Z., Dinh, L., Zhai, S., Goh, H., Ulbricht, D., et al. (2022). Gaudi: A neural architect for immersive 3d scene generation. arXiv preprint arXiv:2207.13751.

 Betancourt (2015) Betancourt, M. (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In Proceedings of the 32nd International Conference on Machine Learning, pages 533–540.
 Betancourt and Girolami (2015) Betancourt, M. and Girolami, M. (2015). Hamiltonian Monte Carlo for hierarchical models. Current Trends in Bayesian Methodology with Applications, 79(30):2–4.
 Blinn (1982) Blinn, J. F. (1982). Light reflection functions for simulation of clouds and dusty surfaces. SIGGRAPH Comput. Graph., 16(3):21–29.

 Chan et al. (2021) Chan, E. R., Monteiro, M., Kellnhofer, P., Wu, J., and Wetzstein, G. (2021). pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5799–5809.
 Chen et al. (2021) Chen, A., Xu, Z., Zhao, F., Zhang, X., Xiang, F., Yu, J., and Su, H. (2021). MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133.
 Chen et al. (2022) Chen, Z., Funkhouser, T., Hedman, P., and Tagliasacchi, A. (2022). Mobilenerf: Exploiting the polygon rasterization pipeline for efficient neural field rendering on mobile architectures. arXiv preprint arXiv:2208.00277.
 Dinh et al. (2017) Dinh, L., SohlDickstein, J., and Bengio, S. (2017). Density estimation using real NVP. In 5th International Conference on Learning Representations.
 Dupont et al. (2022) Dupont, E., Kim, H., Eslami, S. A., Rezende, D. J., and Rosenbaum, D. (2022). From data to functa: Your data point is a function and you can treat it like one. In International Conference on Machine Learning, pages 5694–5725. PMLR.
 Durkan et al. (2019) Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. (2019). Neural spline flows. Advances in neural information processing systems, 32.
 Eslami et al. (2018) Eslami, S. A., Jimenez Rezende, D., Besse, F., Viola, F., Morcos, A. S., Garnelo, M., Ruderman, A., Rusu, A. A., Danihelka, I., Gregor, K., et al. (2018). Neural scene representation and rendering. Science, 360(6394):1204–1210.
 Galanti and Wolf (2020) Galanti, T. and Wolf, L. (2020). On the modularity of hypernetworks. Advances in Neural Information Processing Systems, 33:10409–10419.
 Goodfellow et al. (2014) Goodfellow, I. J., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A. C., and Bengio, Y. (2014). Generative adversarial nets. In NeurIPS.
 Gothoskar et al. (2021) Gothoskar, N., CusumanoTowner, M., Zinberg, B., Ghavamizadeh, M., Pollok, F., Garrett, A., Tenenbaum, J., Gutfreund, D., and Mansinghka, V. (2021). 3dp3: 3d scene perception via probabilistic programming. Advances in Neural Information Processing Systems, 34:9600–9612.
 Ha et al. (2017) Ha, D., Dai, A. M., and Le, Q. V. (2017). Hypernetworks. In International Conference on Learning Representations.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two timescale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30.
 Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851.
 Jang and Agapito (2021) Jang, W. and Agapito, L. (2021). Codenerf: Disentangled neural radiance fields for object categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12949–12958.
 Johnson et al. (2016) Johnson, M. J., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. (2016). Composing graphical models with neural networks for structured representations and fast inference. Advances in neural information processing systems, 29.
 Kingma and Ba (2015) Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
 Kingma and Welling (2014) Kingma, D. P. and Welling, M. (2014). Autoencoding variational bayes. In 2nd International Conference on Learning Representations.
 Kirkpatrick et al. (1983) Kirkpatrick, S., Gelatt Jr, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing. science, 220(4598):671–680.

Kleijn and van der Vaart (2012)
Kleijn, B. J. and van der Vaart, A. W. (2012).
The bernsteinvonmises theorem under misspecification.
Electronic Journal of Statistics, 6:354–381.  Kosiorek et al. (2021) Kosiorek, A. R., Strathmann, H., Zoran, D., Moreno, P., Schneider, R., Mokrá, S., and Rezende, D. J. (2021). Nerfvae: A geometry aware 3d scene generative model. In International Conference on Machine Learning, pages 5742–5752. PMLR.
 Kulkarni et al. (2015) Kulkarni, T. D., Kohli, P., Tenenbaum, J. B., and Mansinghka, V. (2015). Picture: A probabilistic programming language for scene perception. In Proceedings of the ieee conference on computer vision and pattern recognition, pages 4390–4399.
 Langmore et al. (2021) Langmore, I., Dikovsky, M., Geraedts, S., Norgaard, P., and von Behren, R. (2021). Hamiltonian monte carlo in inverse problems; illconditioning and multimodality. arXiv preprint arXiv:2103.07515.
 Le et al. (2017) Le, T. A., Baydin, A. G., and Wood, F. (2017). Inference compilation and universal probabilistic programming. In Artificial Intelligence and Statistics, pages 1338–1348. PMLR.
 LeCun and Bengio (1998) LeCun, Y. and Bengio, Y. (1998). Convolutional networks for images, speech, and time series. In The handbook of brain theory and neural networks, pages 255–258.
 Ma et al. (2015) Ma, Y.A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient mcmc. Advances in neural information processing systems, 28.
 Mansinghka et al. (2013) Mansinghka, V. K., Kulkarni, T. D., Perov, Y. N., and Tenenbaum, J. (2013). Approximate bayesian image interpretation using generative probabilistic graphics programs. Advances in Neural Information Processing Systems, 26.
 Mildenhall et al. (2020) Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. (2020). Nerf: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pages 405–421. Springer.
 Neal (2001) Neal, R. M. (2001). Annealed importance sampling. Statistics and computing, 11(2):125–139.
 Neal et al. (2011) Neal, R. M. et al. (2011). MCMC using Hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2.
 Niemeyer and Geiger (2021) Niemeyer, M. and Geiger, A. (2021). Giraffe: Representing scenes as compositional generative neural feature fields. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
 Perez et al. (2018) Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A. C. (2018). Film: Visual reasoning with a general conditioning layer. In AAAI.
 Poole et al. (2022) Poole, B., Jain, A., Barron, J. T., and Mildenhall, B. (2022). Dreamfusion: Textto3d using 2d diffusion. arXiv preprint arXiv:2209.14988.
 Rebain et al. (2022a) Rebain, D., Matthews, M., Yi, K. M., Lagun, D., and Tagliasacchi, A. (2022a). Lolnerf: Learn from one look. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1558–1567.
 Rebain et al. (2022b) Rebain, D., Matthews, M. J., Yi, K. M., Sharma, G., Lagun, D., and Tagliasacchi, A. (2022b). Attention beats concatenation for conditioning neural fields. arXiv preprint arXiv:2209.10684.
 Rematas et al. (2021) Rematas, K., MartinBrualla, R., and Ferrari, V. (2021). Sharf: Shapeconditioned radiance fields from a single view. In International Conference on Machine Learning, pages 8948–8958. PMLR.

Rezende et al. (2014)
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014).
Stochastic backpropagation and approximate inference in deep generative models.
In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286.  Roeder et al. (2017) Roeder, G., Wu, Y., and Duvenaud, D. K. (2017). Sticking the landing: Simple, lowervariance gradient estimators for variational inference. Advances in Neural Information Processing Systems, 30.
 Rosenbaum et al. (2018) Rosenbaum, D., Besse, F., Viola, F., Rezende, D. J., and Eslami, S. M. A. (2018). Learning models for visual 3d localization with implicit mapping.
 Sajjadi et al. (2022) Sajjadi, M. S., Meyer, H., Pot, E., Bergmann, U., Greff, K., Radwan, N., Vora, S., Lučić, M., Duckworth, D., Dosovitskiy, A., et al. (2022). Scene representation transformer: Geometryfree novel view synthesis through setlatent scene representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6229–6238.
 Schwarz et al. (2020) Schwarz, K., Liao, Y., Niemeyer, M., and Geiger, A. (2020). Graf: Generative radiance fields for 3daware image synthesis. Advances in Neural Information Processing Systems, 33:20154–20166.
 Sitzmann et al. (2019) Sitzmann, V., Zollhöfer, M., and Wetzstein, G. (2019). Scene representation networks: Continuous 3dstructureaware neural scene representations. Advances in Neural Information Processing Systems, 32.
 Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. (2016). Ladder variational autoencoders. Advances in neural information processing systems, 29.
 Trevithick and Yang (2021) Trevithick, A. and Yang, B. (2021). Grf: Learning a general radiance field for 3d representation and rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15182–15192.
 Wang et al. (2021) Wang, Q., Wang, Z., Genova, K., Srinivasan, P. P., Zhou, H., Barron, J. T., MartinBrualla, R., Snavely, N., and Funkhouser, T. (2021). Ibrnet: Learning multiview imagebased rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699.
 Wang et al. (2004) Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612.
 Watson et al. (2022) Watson, D., Chan, W., MartinBrualla, R., Ho, J., Tagliasacchi, A., and Norouzi, M. (2022). Novel view synthesis with diffusion models.
 Wu et al. (2021) Wu, L., Lee, J. Y., Bhattad, A., Wang, Y., and Forsyth, D. (2021). DIVeR: Realtime and accurate neural radiance fields with deterministic integration for volume rendering.
 Wu et al. (2017) Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. (2017). On the quantitative analysis of decoderbased generative models. In International Conference on Learning Representations.
 Xu et al. (2020) Xu, H., Bazavan, E. G., Zanfir, A., Freeman, W. T., Sukthankar, R., and Sminchisescu, C. (2020). Ghum & ghuml: Generative 3d human shape and articulated pose models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6184–6193.
 Yao et al. (2018) Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). Yes, but did it work?: Evaluating variational inference. In International Conference on Machine Learning, pages 5581–5590. PMLR.
 Yu et al. (2021) Yu, A., Ye, V., Tancik, M., and Kanazawa, A. (2021). pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587.
References
 Anonymous (2023) Anonymous (2023). Laser: Latent set representations for 3d generative modeling. In Submitted to The Eleventh International Conference on Learning Representations. under review.
 Bautista et al. (2022) Bautista, M. A., Guo, P., Abnar, S., Talbott, W., Toshev, A., Chen, Z., Dinh, L., Zhai, S., Goh, H., Ulbricht, D., et al. (2022). Gaudi: A neural architect for immersive 3d scene generation. arXiv preprint arXiv:2207.13751.

 Betancourt (2015) Betancourt, M. (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In Proceedings of the 32nd International Conference on Machine Learning, pages 533–540.  Betancourt and Girolami (2015) Betancourt, M. and Girolami, M. (2015). Hamiltonian Monte Carlo for hierarchical models. Current trends in Bayesian methodology with applications, 79(30):2–4.
 Blinn (1982) Blinn, J. F. (1982). Light reflection functions for simulation of clouds and dusty surfaces. SIGGRAPH Comput. Graph., 16(3):21–29.

 Chan et al. (2021) Chan, E. R., Monteiro, M., Kellnhofer, P., Wu, J., and Wetzstein, G. (2021). pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5799–5809.  Chen et al. (2021) Chen, A., Xu, Z., Zhao, F., Zhang, X., Xiang, F., Yu, J., and Su, H. (2021). MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133.
 Chen et al. (2022) Chen, Z., Funkhouser, T., Hedman, P., and Tagliasacchi, A. (2022). Mobilenerf: Exploiting the polygon rasterization pipeline for efficient neural field rendering on mobile architectures. arXiv preprint arXiv:2208.00277.
 Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using real NVP. In 5th International Conference on Learning Representations.
 Dupont et al. (2022) Dupont, E., Kim, H., Eslami, S. A., Rezende, D. J., and Rosenbaum, D. (2022). From data to functa: Your data point is a function and you can treat it like one. In International Conference on Machine Learning, pages 5694–5725. PMLR.
 Durkan et al. (2019) Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. (2019). Neural spline flows. Advances in neural information processing systems, 32.
 Eslami et al. (2018) Eslami, S. A., Jimenez Rezende, D., Besse, F., Viola, F., Morcos, A. S., Garnelo, M., Ruderman, A., Rusu, A. A., Danihelka, I., Gregor, K., et al. (2018). Neural scene representation and rendering. Science, 360(6394):1204–1210.
 Galanti and Wolf (2020) Galanti, T. and Wolf, L. (2020). On the modularity of hypernetworks. Advances in Neural Information Processing Systems, 33:10409–10419.
 Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y. (2014). Generative adversarial nets. In NeurIPS.
 Gothoskar et al. (2021) Gothoskar, N., Cusumano-Towner, M., Zinberg, B., Ghavamizadeh, M., Pollok, F., Garrett, A., Tenenbaum, J., Gutfreund, D., and Mansinghka, V. (2021). 3DP3: 3D scene perception via probabilistic programming. Advances in Neural Information Processing Systems, 34:9600–9612.
 Ha et al. (2017) Ha, D., Dai, A. M., and Le, Q. V. (2017). Hypernetworks. In International Conference on Learning Representations.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). GANs trained by a two timescale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30.
 Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851.
 Jang and Agapito (2021) Jang, W. and Agapito, L. (2021). CodeNeRF: Disentangled neural radiance fields for object categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12949–12958.
 Johnson et al. (2016) Johnson, M. J., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. (2016). Composing graphical models with neural networks for structured representations and fast inference. Advances in neural information processing systems, 29.
 Kingma and Ba (2015) Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
 Kingma and Welling (2014) Kingma, D. P. and Welling, M. (2014). Autoencoding variational bayes. In 2nd International Conference on Learning Representations.
 Kirkpatrick et al. (1983) Kirkpatrick, S., Gelatt Jr, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing. science, 220(4598):671–680.

 Kleijn and van der Vaart (2012) Kleijn, B. J. and van der Vaart, A. W. (2012). The Bernstein-Von-Mises theorem under misspecification. Electronic Journal of Statistics, 6:354–381.  Kosiorek et al. (2021) Kosiorek, A. R., Strathmann, H., Zoran, D., Moreno, P., Schneider, R., Mokrá, S., and Rezende, D. J. (2021). NeRF-VAE: A geometry aware 3D scene generative model. In International Conference on Machine Learning, pages 5742–5752. PMLR.
 Kulkarni et al. (2015) Kulkarni, T. D., Kohli, P., Tenenbaum, J. B., and Mansinghka, V. (2015). Picture: A probabilistic programming language for scene perception. In Proceedings of the ieee conference on computer vision and pattern recognition, pages 4390–4399.
 Langmore et al. (2021) Langmore, I., Dikovsky, M., Geraedts, S., Norgaard, P., and von Behren, R. (2021). Hamiltonian Monte Carlo in inverse problems; ill-conditioning and multimodality. arXiv preprint arXiv:2103.07515.
 Le et al. (2017) Le, T. A., Baydin, A. G., and Wood, F. (2017). Inference compilation and universal probabilistic programming. In Artificial Intelligence and Statistics, pages 1338–1348. PMLR.
 LeCun and Bengio (1998) LeCun, Y. and Bengio, Y. (1998). Convolutional networks for images, speech, and time series. In The handbook of brain theory and neural networks, pages 255–258.
 Ma et al. (2015) Ma, Y.A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient mcmc. Advances in neural information processing systems, 28.
 Mansinghka et al. (2013) Mansinghka, V. K., Kulkarni, T. D., Perov, Y. N., and Tenenbaum, J. (2013). Approximate bayesian image interpretation using generative probabilistic graphics programs. Advances in Neural Information Processing Systems, 26.
 Mildenhall et al. (2020) Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. (2020). NeRF: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pages 405–421. Springer.
 Neal (2001) Neal, R. M. (2001). Annealed importance sampling. Statistics and computing, 11(2):125–139.
 Neal et al. (2011) Neal, R. M. et al. (2011). MCMC using Hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2.
 Niemeyer and Geiger (2021) Niemeyer, M. and Geiger, A. (2021). Giraffe: Representing scenes as compositional generative neural feature fields. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
 Perez et al. (2018) Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A. C. (2018). FiLM: Visual reasoning with a general conditioning layer. In AAAI.
 Poole et al. (2022) Poole, B., Jain, A., Barron, J. T., and Mildenhall, B. (2022). DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988.
 Rebain et al. (2022a) Rebain, D., Matthews, M., Yi, K. M., Lagun, D., and Tagliasacchi, A. (2022a). LOLNeRF: Learn from one look. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1558–1567.
 Rebain et al. (2022b) Rebain, D., Matthews, M. J., Yi, K. M., Sharma, G., Lagun, D., and Tagliasacchi, A. (2022b). Attention beats concatenation for conditioning neural fields. arXiv preprint arXiv:2209.10684.
 Rematas et al. (2021) Rematas, K., Martin-Brualla, R., and Ferrari, V. (2021). ShaRF: Shape-conditioned radiance fields from a single view. In International Conference on Machine Learning, pages 8948–8958. PMLR.

 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286.  Roeder et al. (2017) Roeder, G., Wu, Y., and Duvenaud, D. K. (2017). Sticking the landing: Simple, lower-variance gradient estimators for variational inference. Advances in Neural Information Processing Systems, 30.
 Rosenbaum et al. (2018) Rosenbaum, D., Besse, F., Viola, F., Rezende, D. J., and Eslami, S. M. A. (2018). Learning models for visual 3d localization with implicit mapping.
 Sajjadi et al. (2022) Sajjadi, M. S., Meyer, H., Pot, E., Bergmann, U., Greff, K., Radwan, N., Vora, S., Lučić, M., Duckworth, D., Dosovitskiy, A., et al. (2022). Scene representation transformer: Geometry-free novel view synthesis through set-latent scene representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6229–6238.
 Schwarz et al. (2020) Schwarz, K., Liao, Y., Niemeyer, M., and Geiger, A. (2020). GRAF: Generative radiance fields for 3D-aware image synthesis. Advances in Neural Information Processing Systems, 33:20154–20166.
 Sitzmann et al. (2019) Sitzmann, V., Zollhöfer, M., and Wetzstein, G. (2019). Scene representation networks: Continuous 3D-structure-aware neural scene representations. Advances in Neural Information Processing Systems, 32.
 Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. (2016). Ladder variational autoencoders. Advances in neural information processing systems, 29.
 Trevithick and Yang (2021) Trevithick, A. and Yang, B. (2021). Grf: Learning a general radiance field for 3d representation and rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15182–15192.
 Wang et al. (2021) Wang, Q., Wang, Z., Genova, K., Srinivasan, P. P., Zhou, H., Barron, J. T., Martin-Brualla, R., Snavely, N., and Funkhouser, T. (2021). IBRNet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699.
 Wang et al. (2004) Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612.
 Watson et al. (2022) Watson, D., Chan, W., Martin-Brualla, R., Ho, J., Tagliasacchi, A., and Norouzi, M. (2022). Novel view synthesis with diffusion models.
 Wu et al. (2021) Wu, L., Lee, J. Y., Bhattad, A., Wang, Y., and Forsyth, D. (2021). DIVeR: Real-time and accurate neural radiance fields with deterministic integration for volume rendering.
 Wu et al. (2017) Wu, Y., Burda, Y., Salakhutdinov, R., and Grosse, R. (2017). On the quantitative analysis of decoder-based generative models. In International Conference on Learning Representations.
 Xu et al. (2020) Xu, H., Bazavan, E. G., Zanfir, A., Freeman, W. T., Sukthankar, R., and Sminchisescu, C. (2020). GHUM & GHUML: Generative 3D human shape and articulated pose models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6184–6193.
 Yao et al. (2018) Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). Yes, but did it work?: Evaluating variational inference. In International Conference on Machine Learning, pages 5581–5590. PMLR.
 Yu et al. (2021) Yu, A., Ye, V., Tancik, M., and Kanazawa, A. (2021). pixelNeRF: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587.
Appendix A Additional figures
Figures 7, 8, 9, and 10 show eight uncurated samples from HMC and VI for GHUM and SRN Cars (cf. Fig. 2). Figures 11, 12, 13, and 14 show eight uncurated HMC and VI samples generated by conditioning on uncropped singleview images of two SRN cars. Both HMC and VI produce models that are realistic and consistent with the observed data, but samples from the variational distribution obtained by VI are all essentially the same, while HMC produces samples with noticeable diversity in pose (for GHUM) and shape and color (for SRN Cars).
Appendix B Architecture details
In this section we briefly describe the architectural details of ProbNeRF's neural networks. Fig. 18 shows the architectures used in this work.
HyperNet is a simple MLP with two shared hidden layers, followed by a learnable linear projection and reshape operations to produce the parameters of the two NeRF networks.
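The shared-trunk-plus-projection structure can be sketched as follows; the latent dimension, hidden width, target layer shapes, and initialization scales below are illustrative assumptions, not the paper's actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

latent_dim, hidden_dim = 8, 32                      # assumed toy sizes
# Assumed parameter shapes (W1, b1, W2, b2) of a tiny target NeRF MLP.
target_shapes = [(21, 16), (16,), (16, 4), (4,)]
n_params = sum(int(np.prod(s)) for s in target_shapes)

# Two shared hidden layers followed by a learnable linear projection.
W1 = rng.normal(size=(hidden_dim, latent_dim)) * 0.1
W2 = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
proj = rng.normal(size=(n_params, hidden_dim)) * 0.01

def hypernet(z):
    """Map a latent code z to the parameters of the target network."""
    h = np.tanh(W2 @ np.tanh(W1 @ z))               # shared MLP trunk
    flat = proj @ h                                 # linear projection
    params, i = [], 0
    for shape in target_shapes:                     # reshape per layer
        n = int(np.prod(shape))
        params.append(flat[i:i + n].reshape(shape))
        i += n
    return params

nerf_params = hypernet(rng.normal(size=latent_dim))
```

In the real system there are two such reshape targets, one per NeRF subnetwork; the sketch shows a single set of target shapes for brevity.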
RealNVP consists of four RealNVP blocks, each acting on a latent vector split into two halves (the two parts labeled in the diagram). The sense of the split is reversed between successive blocks.
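A RealNVP block is an affine coupling transform; the following is a minimal NumPy sketch of such a block, with simple stand-in functions in place of the real learned scale/shift networks (all names and sizes here are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

def coupling(x, s_net, t_net, flip=False):
    """One RealNVP affine coupling step on a vector split into two halves.

    One half passes through unchanged and parameterizes an invertible
    affine transform of the other half; `flip` reverses the sense of the
    split, as is done between successive blocks.
    """
    x1, x2 = np.split(x, 2)
    if flip:
        x1, x2 = x2, x1
    y2 = x2 * np.exp(s_net(x1)) + t_net(x1)   # affine transform of one half
    y1 = x1                                   # identity on the other half
    if flip:
        y1, y2 = y2, y1                       # restore original ordering
    return np.concatenate([y1, y2])

# Stand-in conditioners; a real model would use small neural networks.
s_net = lambda h: 0.1 * h
t_net = lambda h: h - 1.0

z = np.arange(4, dtype=float)
out = z
for block in range(4):                        # 4 blocks, alternating splits
    out = coupling(out, s_net, t_net, flip=(block % 2 == 1))
```

Reversing `flip` between blocks ensures every latent dimension is eventually transformed, since each individual coupling leaves one half untouched.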
NeRF The NeRF is split into two subnetworks, one for density and one for color. The input position and ray direction are encoded using a 10th-order sinusoidal positional encoding: for a scalar component $x$ of the input vector we produce the feature

$$\gamma(x) = \big[\sin(2^{k}\pi x),\ \cos(2^{k}\pi x)\big]_{k=0}^{9}. \qquad (5)$$

We flatten and concatenate this array with the original input value to produce a 21-element feature vector for each component. To convert the network's raw output to the final density, we squash it through a nonlinearity scaled by the grid size of 128.
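The per-component encoding can be sketched as follows; the function name and the exact frequency convention (including the $\pi$ factor from the original NeRF encoding) are assumptions:

```python
import numpy as np

def encode_component(x, num_freqs=10):
    """10th-order sinusoidal positional encoding of one scalar input.

    Builds sin/cos features at frequencies 2^0 ... 2^9, flattens them,
    and appends the raw value, giving 2 * num_freqs + 1 = 21 features.
    """
    freqs = 2.0 ** np.arange(num_freqs)                    # (10,)
    angles = np.pi * freqs * x                             # (10,)
    features = np.stack([np.sin(angles), np.cos(angles)])  # (2, 10)
    return np.concatenate([features.ravel(), [x]])         # (21,)
```

Applying this to each of the 5 position/direction components and concatenating the results yields the full encoded input to the NeRF subnetworks.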
Encoder Each potential of the variational posterior is modeled as a diagonal covariance Gaussian with mean and scale computed via a CNN.
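The log-density of one such diagonal-covariance Gaussian potential can be written directly from the mean and scale the CNN outputs; the toy vectors below are placeholders for actual encoder outputs:

```python
import numpy as np

def diag_gaussian_logpdf(z, mean, scale):
    """Log-density of a diagonal-covariance Gaussian potential.

    `mean` and `scale` would be produced by the encoder CNN; here they
    are placeholder arrays.
    """
    var = scale ** 2
    return np.sum(-0.5 * np.log(2.0 * np.pi * var)
                  - 0.5 * (z - mean) ** 2 / var)

mean = np.zeros(3)     # placeholder encoder outputs
scale = np.ones(3)
lp = diag_gaussian_logpdf(np.zeros(3), mean, scale)
```

The diagonal parameterization keeps the potential cheap to evaluate and differentiate, at the cost of ignoring correlations between latent dimensions.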
Appendix C Equivalence of linear latentshift modulations and latent concatenation
The linear shift-only modulations used by Dupont et al. (2022) work as follows for an MLP: given a latent vector $z$, for each layer $l$'s pre-nonlinearity output activations $h_l$ (treating the input as an activation vector $a_0$), add a shift vector $s_l$ that is a linear function of $z$ to get $h_l + s_l$, and propagate $h_l + s_l$ forward through the network instead of $h_l$.
The same effect can be achieved by concatenating $z$ to the activations at each layer of the network. The resulting computation is

$$a_{l+1} = f\!\left(W_l \begin{bmatrix} a_l \\ z \end{bmatrix} + b_l\right) = f\!\left(W^{(a)}_l a_l + W^{(z)}_l z + b_l\right), \qquad (6)$$

where $W_l$ denotes the weight matrix at layer $l$, $b_l$ denotes the biases at layer $l$, $f$ denotes a nonlinear activation function, and we define $W^{(a)}_l$ and $W^{(z)}_l$ to be the submatrices of $W_l$ that are multiplied by the previous layer's activations and the concatenated latents respectively.
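The equivalence can be checked numerically for a single layer; the dimensions and random weights below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_z, d_out = 4, 3, 5                  # arbitrary toy dimensions

W_a = rng.normal(size=(d_out, d_in))        # acts on previous activations
W_z = rng.normal(size=(d_out, d_z))         # linear map from latent to shift
b = rng.normal(size=d_out)
f = np.tanh                                 # any elementwise nonlinearity

a = rng.normal(size=d_in)                   # previous layer's activations
z = rng.normal(size=d_z)                    # latent vector

# Shift-only modulation: add the latent-dependent shift pre-nonlinearity.
shift_out = f(W_a @ a + W_z @ z + b)

# Concatenation: one weight matrix [W_a | W_z] applied to [a; z].
W = np.concatenate([W_a, W_z], axis=1)
concat_out = f(W @ np.concatenate([a, z]) + b)

assert np.allclose(shift_out, concat_out)   # identical layer outputs
```

The assertion holds for any choice of weights, since block matrix multiplication distributes the concatenated input exactly as the sum of the two terms.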