
FiG-NeRF: Figure-Ground Neural Radiance Fields for 3D Object Category Modelling

by   Christopher Xie, et al.

We investigate the use of Neural Radiance Fields (NeRF) to learn high quality 3D object category models from collections of input images. In contrast to previous work, we are able to do this whilst simultaneously separating foreground objects from their varying backgrounds. We achieve this via a 2-component NeRF model, FiG-NeRF, that prefers explanation of the scene as a geometrically constant background and a deformable foreground that represents the object category. We show that this method can learn accurate 3D object category models using only photometric supervision and casually captured images of the objects. Additionally, our 2-part decomposition allows the model to perform accurate and crisp amodal segmentation. We quantitatively evaluate our method with view synthesis and image fidelity metrics, using synthetic, lab-captured, and in-the-wild data. Our results demonstrate convincing 3D object category modelling that exceeds the performance of existing methods.





1 Introduction

Learning high quality 3D object category models from visual data has a variety of applications, such as content creation and robotics. For example, convincing category models might allow us to generate realistic new object instances for graphics applications, or allow a robot to understand the 3-dimensional structure of a novel object instance if it had seen objects of a similar type before [1, 41]. Reasoning about objects in 3D could also enable improved performance in general perception tasks. For example, most work in object detection and instance segmentation is limited to learning object categories in 2D. Using 3D category models for such tasks could enable enhanced reasoning, such as amodal segmentation [48] that takes into account occlusions of multiple objects, or fusion of information over multiple views taken from different viewpoints.

Figure 1:

Overview of our system. We take as input a collection of RGB captures of scenes with objects of a category. Our method jointly learns to decompose the scenes into foreground and background (without supervision) and a 3D object category model that enables applications such as instance interpolation, view synthesis, and segmentation.

The majority of existing work in 3D category modelling from images has used supervision in the form of 3D models [5], segmentations [43, 14] or semantic keypoints [17]. Recently, some authors have attempted to learn 3D category models using images alone [23, 31]; however, these methods assume simple backgrounds or known silhouettes. An open problem is to learn 3D object category models using casually captured photography with unconstrained backgrounds and minimal supervision. In this work, we pursue this objective, using the recently released Objectron dataset [2] as a target for in-the-wild 3D object category modelling. In this setting, backgrounds are different for each instance, so our method must separate the object from its background as well as understand its particular shape and appearance.

We build on top of Neural Radiance Fields (NeRF) [22], which has shown excellent results for image-based view synthesis. We propose Figure-Ground Neural Radiance Fields (FiG-NeRF), which uses two NeRF models to model the objects and background, respectively. To enable separation of object (figure) from background (ground, as in the Gestalt principle of figure-ground perception), we adopt a 2-component model comprised of a deformable foreground model [28] and a background model with fixed geometry and variable appearance. We find that fixing the geometry of the background is often appropriate for the object categories we study. For example, cups typically rest on tables, and eyeglasses on faces. We show that our 2-component approach together with sparsity priors is sufficient for the model to successfully separate modelling of a foreground object and background. Since our model infers this separation between object and background, a secondary benefit, in addition to the 3D object category model, is a crisp amodal object/background segmentation. In our evaluation we show that our segmentations outperform both Mask R-CNN [12] and a bespoke matting algorithm [20] that uses additional images.

The key contributions of our work are:

  1. We jointly estimate object category models whilst separating objects from their backgrounds, using a novel 2-component, Deformable NeRF formulation.

  2. Our results in novel-view synthesis and instance interpolation outperform baselines such as non-deformable NeRF variants and SRNs [34] on both synthetic and real image datasets.

  3. We demonstrate learning of object category models in-the-wild and with variable backgrounds, using the Objectron dataset [2].

To our knowledge, our method is the first to jointly estimate detailed object category models with backgrounds of variable and view-dependent appearance, without recourse to silhouettes or 2D segmentation predictions.

2 Related Work

Early work in 3D object category modelling reconstructed objects from the PASCAL VOC dataset [40]. Rigid SfM was used to find a mean shape, followed by intersection of a carefully selected subset of silhouettes. Kar et al. [18] also use silhouette supervision, employing non-rigid SfM, followed by fitting a shape with mean and deformation modes.

More recent investigations adapt classical 3D modelling representations into a deep learning framework. Pixel2Mesh [42] uses an iterative mesh refinement strategy to predict a 3D triangle mesh from the input image. The method uses ground truth mesh supervision, deforming an initial 156-vertex ellipsoid via a series of graph convolutions and coarse-to-fine mesh refinement. Category Specific Mesh Reconstruction [17] also works by deformation of a fixed initial mesh, but without requiring 3D ground truth. It uses silhouette and keypoint based loss terms, with a differentiable mesh renderer [16] employed to calculate gradients.

The above mesh methods are limited to a spherical topology. Mesh R-CNN [10] removes this limitation and provides more flexible shape modelling by initially predicting a voxel grid in a manner similar to [6], before adapting this towards a mesh model. Voxel-based 3D reconstruction has limited ability to model fine detail due to the high memory cost of storing a large voxel array. This is partly addressed in [37], which uses octrees to enable higher output resolutions.

More recent work fuses classical computer graphics techniques with deep generative models [33, 38, 39, 27]. A promising recent trend involves the use of coordinate regression approaches, also known as Compositional Pattern Producing Networks (CPPNs) [35]. These are fully connected networks operating on 3D coordinates to produce an output that represents the scene. Examples include DeepSDF [26], which predicts a signed distance function at each 3D coordinate. Scene Representation Networks (SRNs) [34] represent an embedding vector at each 3D position, with a learned recurrent renderer mimicking the role of sphere tracing. NeRF [22] also uses CPPNs, but within a volumetric rendering scheme. A novel positional encoding is key to the success of this approach [32, 36].

While much prior work makes use of supervision in the form of 3D ground truth, keypoints or silhouettes, [25] learns 3D shape using only image inputs, estimating viewpoint and depth, but is limited to a point cloud reconstruction inferred from the depth map. [44] uses a symmetry prior and factored representation to generate improved results. Generative adversarial nets have also been used to model objects from image collections alone, using 3D convolutions [9, 13], and 3D+2D convolutions in [23], the latter giving more realistic results at the expense of true 3D view consistency. Shelf-supervised mesh prediction [45] combines image segmentation inputs with an adversarial loss, generating impressive results over a broad range of image categories.

Close to our approach, Generative Radiance Fields [31] use a latent-variable conditioned NeRF to model object categories, trained via an adversarial loss. They are, however, limited to plain backgrounds. GIRAFFE [24] extends this with a composition of NeRF-like models that generate a "feature field", rendered via a 2D neural renderer to enable controllable image synthesis. They also show results in separating object from background, though since the feature images are only 16x16 (upsampled via a 2D neural renderer), the outputs are blurry and lack the fine geometric detail produced by our method. STaR [46] also performs foreground and background separation with NeRFs, but only considers a single rigidly-moving object.

3 Method

In this section, we introduce our model, Figure-Ground NeRF (FiG-NeRF), which is named after the Gestalt principle of figure-ground separation [29].

Figure 2: Example setups for Glasses (top) and Cups (bottom) datasets. For the lab-captured Glasses dataset [20], the background (left) is a mannequin, and each scene (right) is a different pair of glasses placed on the mannequin. For Cups, we build this from the Objectron [2] dataset of crowdsourced casual cellphone video captures, where we captured a planar surface with textures (colored papers) for the background.

3.1 Setup

We consider a setting with a collection of scenes, where each scene contains a single object instance. We assume that the scene backgrounds have identical geometry but not necessarily color. For example, different tabletops/planar surfaces, or the same scene under different lighting, satisfy this assumption. Each instance i is captured from multiple viewpoints, resulting in a dataset D_i for each instance. Additionally, we have images of a background scene with no object instance present, and designate this as D_bg. We denote the entire dataset as D = {D_1, ..., D_N, D_bg}, where N is the number of instances. We also assume that we have camera poses and intrinsics for each image. These can be obtained by standard structure-from-motion (SfM) software such as [30], or by running visual odometry during capture [2]. See Figure 2 for an example of our assumed setup.

3.2 Preliminaries

A Neural Radiance Field (NeRF) [22] is a function F, comprised of multi-layer perceptrons (MLPs), that maps a 3D position x and a 2D viewing direction d to an RGB color c and a volume density σ. F is comprised of a trunk, followed by separate color and density branches. Point coordinates are passed through a positional encoding layer before being input to the MLP. The radiance field is rendered using volume rendering, and generates highly realistic results when trained on many views.
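As a concrete sketch of the positional encoding step, one might write the following. This is a simplified illustration: the published encoding also scales inputs by π and is applied to both positions and view directions.

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    # Map each coordinate to [sin(2^k x), cos(2^k x)] for k = 0..num_freqs-1.
    # x: array of shape (..., d); returns shape (..., 2 * num_freqs * d).
    x = np.asarray(x, dtype=np.float64)
    freqs = 2.0 ** np.arange(num_freqs)        # (num_freqs,)
    angles = x[..., None] * freqs              # (..., d, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)

# A 3D point becomes a 2 * 10 * 3 = 60-dimensional feature.
feat = positional_encoding(np.array([0.1, -0.5, 0.3]))
```

The exponentially spaced frequencies let the MLP represent high-frequency detail that raw coordinates alone cannot.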

A drawback to NeRF is that it only learns to model a single scene. To extend NeRF to handle multiple scenes, previous works have introduced conditioning input latent codes to model object categories [31] and the different appearances of a scene [21]. They concatenate the inputs of the NeRF MLPs with additional latent codes that are sampled using a GAN [11] or optimized during training [4]. We dub such models conditional NeRFs. Our approach builds upon conditional NeRF models to handle object categories.

3.3 Object and Background Separation

Figure 3: FiG-NeRF consists of foreground and background models. The foreground model, which includes the deformation field and template NeRF, is shown in blue, while the background NeRF is shown in red. The ⊕ symbol denotes concatenation.

Many previous works on category modelling rely on ground truth segmentations that define the extent of the objects. These can be easily extracted for synthetic datasets, but accurate segmentations are difficult to obtain for real datasets. Additionally, this limits the potential categories to classes that the object detector was trained on. Instead, we propose a model that learns, in an unsupervised manner, to segment each scene into a foreground component containing the object of interest and a background component.

The key to our approach is to decompose the neural radiance field into two components: a foreground component F_fg and a background component F_bg, each modeled by a conditional NeRF. The foreground latent code z_fg and the background latent code z_bg condition the foreground and background components, respectively.

We observe that many objects are often captured against backgrounds of approximately the same geometry, such as cups on planar surfaces, or eyeglasses on faces. Our formulation exploits this observation, and assumes that all the object instances in D are captured against backgrounds that share the same geometry, while their appearance may change due to texture or lighting variations. Such an assumption is not overly restrictive, and allows for distributed data capture, i.e., each scene can be captured by a different user (for example, tabletops in different homes), as is the case for the Objectron dataset [2]. We incorporate this inductive bias into the network by feeding z_bg only to the branch of F_bg that affects appearance, not density (geometry). Note that with this formulation, the background model can capture background surface appearance variations induced by the object, such as shadows.

Model      |            Cars                |           Glasses              |            Cups
           | FID↓  PSNR↑ SSIM↑ LPIPS↓ IoU↑  | FID↓  PSNR↑ SSIM↑ LPIPS↓ IoU↑  | FID↓  PSNR↑ SSIM↑ LPIPS↓ IoU↑
SRNs [34]  | 34.96 36.42 .9860 .0142  --    | --    --    --    --     --    | 213.8 17.20 .6168 .6580  --
NeRF+L     | 34.86 37.75 .9884 .0112  --    | 43.37 36.17 .9390 .1020  --    | 164.0 24.40 .9402 .0758  --
NeRF+L+S   | 35.48 37.78 .9882 .0107  .9555 | 39.17 36.24 .9397 .0986  .5572 | 126.0 25.10 .9437 .0666  .8023
NeRF+L+S+D | 26.02 38.02 .9889 .0097  .9590 | 39.56 36.26 .9402 .0968  .5796 | 106.4 25.05 .9430 .0651  .8535
Table 1: Quantitative results on all datasets. We show instance interpolation metrics (FID) and heldout view metrics (PSNR/SSIM/LPIPS), as well as segmentation IoU for the models that perform separation. Our proposed method outperforms the baselines on virtually every metric.

More formally, our model is composed of the two conditional NeRF models

  F_fg(x, d, z_fg) -> (c_fg, σ_fg),    F_bg(x, d, z_bg) -> (c_bg, σ_bg),

where we learn the foreground and background latent codes z_fg and z_bg using generative latent optimization (GLO) [4].

To render images using the 2-component model, we follow the composite volumetric rendering scheme of [21]. That is, for a ray r(t) = o + t d, we compute the color as

  C(r) = ∫_{t_n}^{t_f} T(t) ( σ_fg(t) c_fg(t) + σ_bg(t) c_bg(t) ) dt,
  T(t) = exp( - ∫_{t_n}^{t} ( σ_fg(s) + σ_bg(s) ) ds ),

where t_n and t_f are the near and far integration bounds, respectively, and σ_fg(t), c_fg(t), σ_bg(t), c_bg(t) are the density and color values at r(t) for the foreground and background models, respectively.
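The composite rendering integral can be approximated with the usual NeRF quadrature. The sketch below is an illustration rather than the paper's implementation; in particular, splitting each sample's color contribution between the two models by density ratio is our reading of the composite scheme:

```python
import numpy as np

def composite_render(sigma_fg, color_fg, sigma_bg, color_bg, deltas):
    # sigma_*: (N,) densities at N samples along a ray, ordered near to far.
    # color_*: (N, 3) colors at those samples; deltas: (N,) sample spacings.
    sigma = sigma_fg + sigma_bg
    alpha = 1.0 - np.exp(-sigma * deltas)                    # per-sample opacity
    # Transmittance up to (but not including) each sample.
    trans = np.concatenate([[1.0], np.cumprod(np.exp(-sigma * deltas))[:-1]])
    weights = trans * alpha                                  # (N,)
    # Share each sample's contribution between fg and bg by density ratio.
    ratio_fg = np.where(sigma > 0, sigma_fg / np.maximum(sigma, 1e-10), 0.0)
    rgb = (weights[:, None] * (ratio_fg[:, None] * color_fg
                               + (1.0 - ratio_fg)[:, None] * color_bg)).sum(axis=0)
    acc_fg = (weights * ratio_fg).sum()   # accumulated foreground opacity
    return rgb, acc_fg
```

With a foreground of zero density, the ray color comes entirely from the background model, and the accumulated foreground opacity is exactly zero.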

With this separation, we can compute a foreground segmentation mask by rendering the depths of the foreground and background models separately and selecting the pixels in which the foreground is closer to the camera than the background. We show in Section 4.5.5 that our learned model produces accurate and crisp amodal segmentation masks.
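The depth-comparison rule can be sketched as follows. The accumulated-opacity check and its threshold are assumptions we add to handle pixels where the foreground model is empty (and its depth therefore ill-defined); they are not values from the paper:

```python
import numpy as np

def foreground_mask(depth_fg, depth_bg, acc_fg, acc_thresh=0.5):
    # Label a pixel foreground when the foreground model is present along the
    # ray (enough accumulated opacity) and its rendered depth is closer to the
    # camera than the background's. All inputs are (H, W) maps rendered
    # separately from the two models; `acc_thresh` is a hypothetical cutoff.
    return (np.asarray(acc_fg) > acc_thresh) & \
           (np.asarray(depth_fg) < np.asarray(depth_bg))
```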

3.4 Objects as Deformed Template NeRFs

We would like our category models to (1) allow for smooth interpolation between objects, and (2) be robust to partial observations of specific instances. With these goals in mind, we observe that instances of an object category are typically structurally very similar. For example, it is possible to think of a generic cup, shoe, or camera without specifying a particular instance. This observation has motivated methods such as morphable models [3], which deform a canonical mesh to fit objects with small intra-class variations, e.g., faces. Inspired by the success of such methods, we propose to model object instances as deformations of a canonical radiance field. While morphable models assume a shared topology and vertex structure, we are not bound by these limitations due to our use of a continuous, volumetric representation, i.e., NeRF.

We incorporate canonical object templates into our model by adapting the concept of Deformable NeRFs, or "nerfies" [28], to modelling object categories. Deformable NeRFs are comprised of two parts: a canonical template NeRF F_t, which is a standard 5D NeRF model, and a deformation field T, which warps a point x in observation-space coordinates to a point in template-space coordinates. The deformation field is a function conditioned on a deformation code ω_t defined for each time step t. We represent it as a residual translation field, T(x, ω) = x + Δ(x, ω), where Δ is a coordinate-based MLP that uses a positional encoding layer.

Instead of associating deformation fields with time steps, our model associates a deformation field with each object instance, represented by a per-instance shape deformation code ω_i. Because all objects share the template NeRF model yet may have different textures, we condition F_t with a per-instance appearance code ψ_i that, similarly to z_bg, only affects the color branch of the model. We define the concatenation of the shape and appearance latent codes as the object instance code z_fg defined in the previous section.

Our resulting foreground object model is thus a conditional NeRF that takes the following form:

  F_fg(x, d, z_fg) = F_t( T(x, ω_i), d, ψ_i ),    z_fg = [ω_i, ψ_i].
We visualize our complete network architecture in Fig. 3.
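A minimal sketch of this foreground query, with `deformation_mlp` and `template_nerf` standing in for the trained networks (both are assumptions of this sketch):

```python
import numpy as np

def query_foreground(x, d, shape_code, appearance_code,
                     deformation_mlp, template_nerf):
    # Warp the observation-space point into template space with the
    # per-instance residual translation, then evaluate the shared template
    # NeRF, whose color branch is conditioned on the appearance code.
    x_template = x + deformation_mlp(x, shape_code)   # T(x, w) = x + delta
    color, density = template_nerf(x_template, d, appearance_code)
    return color, density
```

Because only the warp and the appearance code differ per instance, all instances share one canonical template, which is what enables smooth cross-instance interpolation.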

3.5 Loss Functions

Photometric Loss

We apply the standard photometric L2 loss

  L_pho = Σ_r || C(r) - C_gt(r) ||²,

where C(r) is the composited color and C_gt(r) is the ground truth RGB value for ray r. Additionally, to ensure our background model learns the appropriate background geometry, we apply the same loss to the background model alone on the background images D_bg, and denote it L_pho^bg.

Separation Regularization

The structure of our model (Figure 3) allows us to disentangle foreground and background when the dataset satisfies the assumption of identical background geometry across different scenes. However, this separation does not appear naturally during optimization, so we apply a regularization loss to encourage it. We consider the accumulated foreground density along a ray:

  acc_fg(r) = 1 - exp( - ∫_{t_n}^{t_f} σ_fg(t) dt ),

and impose an L1 sparsity penalty L_sp = |acc_fg(r)|. This encourages F_bg to represent as much of the scene as possible. However, since the background model does not vary its density, F_fg is forced to represent any varying geometry of the scenes, which includes the objects. Thus, the separation is not supervised, and occurs due to this sparsity prior and the model structure.
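Under the usual quadrature, the accumulated foreground density and its L1 penalty might be computed as follows; this is a sketch, and the exact discretization in the paper may differ:

```python
import numpy as np

def accumulated_fg(sigma_fg, deltas):
    # Accumulated foreground opacity along a ray: 1 - exp(-sum_i sigma_i * delta_i).
    return 1.0 - np.exp(-np.sum(sigma_fg * deltas))

def sparsity_loss(sigma_fg, deltas):
    # L1 penalty; opacity is non-negative, so this simply equals its value.
    return abs(accumulated_fg(sigma_fg, deltas))
```

Minimizing this over all rays pushes the foreground model to stay empty wherever the geometry-fixed background model can already explain the pixels.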

While L_sp helps to separate foreground and background, it tends to encourage F_fg to pick up artifacts induced by the object, such as shadows and lighting changes. These artifacts typically manifest as haze, so we suppress them with a beta prior loss L_beta on the accumulated foreground opacity, inspired by the opacity prior of Neural Volumes. In our experiments, we set its hyperparameter so that the prior biases acc_fg(r) towards 0, which is in line with our sparsity loss.

Deformation Regularization

Finally, we apply a simple L2 loss on the residual translation to penalize arbitrarily large deformations:

  L_def = E_x[ || Δ(x, ω_i) ||² ],

where the expectation is approximated with Monte Carlo sampling at the points sampled during the volume rendering procedure [22].
where the expectation is approximated with Monte Carlo sampling at the sampled points in the volume rendering procedure [22].

Our full loss function is L = L_pho + L_pho^bg + λ_sp L_sp + λ_β L_beta + λ_def L_def, where the λ coefficients weight the regularization terms.
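Combining the terms is then a weighted sum. The λ values below are placeholders for illustration, not the paper's settings (which are omitted in this text):

```python
def total_loss(l_pho, l_pho_bg, l_sp, l_beta, l_def,
               lam_sp=0.01, lam_beta=0.01, lam_def=0.001):
    # Weighted sum of the photometric, background photometric, sparsity,
    # beta prior, and deformation regularization terms. Lambda values here
    # are hypothetical defaults, not the published hyperparameters.
    return l_pho + l_pho_bg + lam_sp * l_sp + lam_beta * l_beta + lam_def * l_def
```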

4 Experiments

4.1 Implementation Details

Our architecture deviates slightly from the original NeRF paper. We add a density branch, and set the backbones of the template and background models to 2 hidden layers with 256 units. The color and density branches have 8 layers of 128 hidden units each, with a skip connection at layer 5. Our deformation field MLP has 6 hidden layers with 128 units each and a skip connection at layer 4. Following [28], we use coarse and fine background and template models, but only one deformation field. We jointly train all models together for 500k iterations using the same schedule as the original NeRF [22], with a batch size of 4096 rays on 4 V100 GPUs, which takes approximately 2 days. We use the same loss weights for all experiments except the ablations. We apply a top-k schedule to the sparsity loss so that it focuses more on hard negatives, updating the schedule every 50k iterations. We additionally apply the coarse-to-fine regularization scheme of [28] to all models for 50k iterations to prevent overfitting to high frequencies. The latent code dimension is set to 64 for all experiments. Finally, we use the same number of coarse and fine samples as NeRF, except for the Glasses dataset, which uses 96 coarse samples to better capture thin structures. More details can be found in the appendix.
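The top-k schedule on the sparsity loss can be sketched as averaging only the hardest rays in a batch; the value of `k` and its decay schedule are assumptions of this sketch, not values from the paper:

```python
import numpy as np

def topk_mean_loss(per_ray_loss, k):
    # Average only the k largest per-ray losses, so the penalty concentrates
    # on the rays where the foreground model is most wrongly active.
    vals = np.sort(np.asarray(per_ray_loss, dtype=np.float64))
    return vals[-k:].mean()
```

Shrinking k over training would progressively focus the penalty on the remaining hard negatives.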

4.2 Datasets

We use 3 datasets: Cars (synthetic), Glasses (real, controlled lab capture), Cups (real, hand-held capture) as described below:


We render 100 cars from ShapeNet [5] using Blender [7] at 128x128 resolution. For each car, we randomly sample an elevation, then sample viewpoints at linearly spaced azimuths. The cars are rendered against a gray canvas; the background can be modelled as an infinite gray plane without any appearance variation.


We use 60 glasses from the dataset of [20], a lab-capture setup in which eyeglasses are placed on a mannequin. The images exhibit real-world phenomena such as lighting variations and shadows, making them more challenging to learn from than synthetic data. Additionally, glasses are difficult to model due to the thin geometric structures they possess. We use images cropped to 512x512. The mannequin setup has a backlight, which was used to capture additional images in order to extract foreground alpha mattes [20]; we arbitrarily threshold these at 0.5 to obtain segmentation masks. However, we show that the backlit-captured segmentation masks can be quite noisy, and our learned segmentations are cleaner and more accurate without needing the backlight or additional images. Note that we use these masks purely for evaluation, not for training; we do not use the backlit images either.

Dataset | Model      | FID↓  | PSNR↑ | SSIM↑ | LPIPS↓ | IoU↑
Glasses | w/o L_sp   | 40.00 | 36.98 | .9420 | .0970  | .0410
Glasses | full model | 39.56 | 36.26 | .9402 | .0968  | .5796
Cups    | w/o L_sp   | 102.8 | 25.23 | .9441 | .0634  | .1403
Cups    | full model | 101.8 | 25.20 | .9440 | .0630  | .8560
Table 2: Loss function ablation with our full model. For qualitative results of the effect of L_beta, see Figure 4.
Figure 4: Foreground renderings of ablated models. Without the sparsity loss (left), the model fails to separate the background from the foreground. Lingering artifacts, such as faint halos around the mannequin silhouette (middle, circled in red), are further suppressed with the beta prior loss (right). Best viewed zoomed in.

We build this dataset from the cups class of Objectron [2]. Objectron consists of casually-captured in-the-wild video clips accompanied by camera poses and manually annotated bounding boxes. In each video, the camera moves around the object, capturing it from different angles. While the cups are filmed resting on planar surfaces, the scenes typically contain other, varying background geometry, which violates our assumption. Thus, we manually selected 100 videos in which the 2D projection of the bounding box annotation covers a planar region. We use only pixels within this region for both training and evaluation. We crop and resize the frames to 512x512, and use coordinate frames centered at the object, giving us alignment between the instances. To obtain segmentation masks for evaluation, we use Mask R-CNN [12] with a SpineNet-96 backbone [8]. To ground the background to a planar geometry, we manually capture a hardwood floor with textured (colored) papers as the background dataset (see Figure 2).

All datasets use 50 images per object instance (including background) for training, and 20 images per instance for heldout validation. The near and far bounds are selected for each dataset manually, while for Cups we use per-instance values since the camera can be at a different distance from the object in each video.

Figure 5: Comparison of methods under limited viewpoint ranges. The x-axis is the limit of the azimuth range per-instance during training. See text for how this is set.

4.3 Baselines

We establish a baseline that simply adds a latent code to NeRF (NeRF+L). We implement this by using only the background model for the composite rendering, conditioning its density branch on the latent code as well (see Figure 3). Additionally, we train Scene Representation Networks (SRNs) [34], a recent state-of-the-art method for object category modelling, as a baseline. We trained SRNs for 300k iterations with a batch size of 8, which took approximately 3 days. The latent embedding dimension is set to 64.

Since SRNs and NeRF+L cannot separate foreground and background, we also compare against a third baseline that uses our architecture without the deformation field. For this model (NeRF+L+S, S for separation), we take NeRF+L and add the background model. Comparisons with this baseline allow us to evaluate the efficacy of the deformation field in modelling object categories. Note that NeRF+L+S+D (D for deformation) = FiG-NeRF.

4.4 Metrics

To evaluate how well a model learns an object category, we test its ability to synthesize novel viewpoints and its ability to interpolate instances. We report PSNR, SSIM, and LPIPS [47] for view synthesis and Frechet Inception Distance (FID) [15] for instance interpolation. Note that we sample random instance interpolation points in the latent space and fix these for all models for a fairer comparison. Lastly, we report Intersection over Union (IoU) with respect to the obtained segmentation masks on heldout viewpoints to evaluate the foreground/background separation. Section 4.2 describes the segmentation reference methods.
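Two of these metrics are simple enough to sketch directly; SSIM, LPIPS, and FID require their respective reference implementations:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    # Peak signal-to-noise ratio in dB for images in [0, max_val].
    mse = np.mean((np.asarray(pred) - np.asarray(gt)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def iou(mask_a, mask_b):
    # Intersection over union of two boolean segmentation masks.
    a, b = np.asarray(mask_a, bool), np.asarray(mask_b, bool)
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()
```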

4.5 Results

Figure 6: Optimizing camera positions jointly with FiG-NeRF leads to better geometry on cups. Columns 1 and 2 show the foreground rendering and segmentation mask from FiG-NeRF without camera optimization, and columns 3 and 4 show FiG-NeRF with camera optimization.

4.5.1 Model Comparison

We compare our proposed method with the baselines on all datasets in Table 1. On the clean synthetic Cars dataset, all models and baselines perform well. The NeRF-based models (NeRF+*) outperform SRNs on the heldout metrics including PSNR/SSIM/LPIPS. While the NeRF-based models show similar performance on the heldout metrics, NeRF+L+S+D (FiG-NeRF) significantly outperforms all other baselines on FID, demonstrating its ability to interpolate between instances and model the object category.

On the Glasses dataset, the region of the image occupied by the glasses themselves is fairly small. We believe this is why the results are very similar across the NeRF-based models. However, the models with separation clearly outperform NeRF+L on FID. While this baseline has only half the parameters of our model, it cannot decompose the scenes into foreground and background, and these results suggest that having a dedicated foreground model aids performance when interpolating instances. Additionally, NeRF+L+S+D (FiG-NeRF) shows a slight performance increase over the non-deformable baseline on the heldout metrics. Please see the project website for qualitative differences between NeRF+L+S and NeRF+L+S+D (FiG-NeRF). Lastly, note that SRNs failed on this dataset since they cannot handle varying camera intrinsics.

For Cups, SRNs perform poorly on this complex real-world dataset, as indicated by a 30% decrease in PSNR relative to the NeRF-based models. NeRF+L+S performs very similarly to NeRF+L+S+D (FiG-NeRF) on PSNR/SSIM/LPIPS, but is significantly outperformed on FID and IoU. As this dataset is the most challenging, this suggests that the deformation field notably aids in learning both instance interpolation and geometry. NeRF+L has much more trouble interpolating instances, as shown by its FID. These results suggest that our proposed model is adept not just at synthesizing new views, but also at interpolating between instances.

Figure 7: Segmentation Comparison. The segmentations in the top row are produced by our model; the bottom row shows the backlit matte extraction [20] and Mask R-CNN [12]. Our method learns more accurate and crisp segmentation masks without supervision.
Figure 8: Shape and color interpolations on Glasses and Cups. We show cropped training images on the left (green box) and right (red box) columns. In the middle, we show interpolations rendered from the foreground model. Best viewed zoomed in.

4.5.2 Loss Function Ablation

In this section, we demonstrate that the network structure alone is not enough for decomposition, and that L_sp is required to induce the separation. We perform these experiments with our full model on Glasses and Cups, and show results in Table 2. On both datasets, the IoU makes it clear that L_sp is needed to learn the separation, while leaving the composite rendering relatively unaffected, as evidenced by FID, PSNR, SSIM, and LPIPS. Additionally, L_beta is also useful in learning the separation: it suppresses artifacts such as shadows and lighting variation from appearing in the foreground model, which results in a clearer delineation between foreground and background. These effects are too subtle to influence the quantitative metrics, so we omit these numbers from Table 2; we demonstrate the effects qualitatively in Figure 4.

4.5.3 Robustness to Limited Viewpoints

Having a large range of viewpoints for each object helps in learning object geometries. However, in many real-world settings this may not hold. For example, most of the videos in Cups only show half of the cup. We show that our deformable formulation fares better in low viewpoint range regimes. We test this with Cars and Cups by creating variants of each dataset with 4 levels of restricted azimuth range. For each instance in Cars, we render from viewpoints constrained to lie within a fixed azimuth range. For Cups, we do not have such control over the data; instead, for each video capture we sort all viewpoints by their azimuth angle and keep only the first 25%, 50%, 75%, and 100% of the viewpoints, which yields a different mean and standard deviation of azimuth range for each resulting dataset.

We evaluate instance interpolation and geometry reasoning in Figure 5 with FID and IoU (as a proxy for geometry). As expected, performance degrades for every model as the azimuth range decreases. However, on the synthetic Cars dataset, our proposed model fares better on both metrics in the low viewpoint range settings. Additionally, on Cups, our model still performs well as the average azimuth range drops to 1.5: it retains an IoU of 0.835 while NeRF+L+S drops to 0.595. These results show that our formulation better captures the nature of object categories, and is preferable in viewpoint-constrained settings.

4.5.4 Optimizing Camera Poses

Camera offsets | FID↓  | IoU↑
without        | 106.4 | 0.8543
with           | 85.67 | 0.8967
Table 3: Due to inaccurate camera poses in Objectron [2], jointly optimizing FiG-NeRF and camera extrinsics leads to better object modelling on Cups. Note that IoU is evaluated on the training set.

Camera poses from Objectron [2] are obtained via visual-inertial odometry, which is prone to drift and can produce inaccurate camera estimates. For the instances in Cups, this causes significant jitter in the camera poses between consecutive frames of the same video, so the instance appears to move slightly within the 3D volume. As a result, the model learns to put density in an envelope that contains all of these offsets, as seen in Figure 6. We address this issue by optimizing the camera extrinsics during training. In particular, we leave the camera parameters fixed for the first 50k iterations of training to let FiG-NeRF separate foreground from background, and then learn offsets to each image's camera extrinsics for the rest of training. Additionally, because object scale becomes ambiguous once cameras are optimized, we keep the corresponding scale parameter fixed. In Table 3, we see that training the Cups model with camera optimization yields much better FID and IoU, indicating better interpolations and geometry. Note that because we optimized camera poses, IoU is computed on the training set, since we no longer have ground truth cameras for the heldout views.
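A minimal sketch of the per-image extrinsic offsets: for illustration we restrict the learned offset to a translation, whereas the method optimizes full camera extrinsics, and the offset values themselves would be gradient-optimized after the first 50k iterations:

```python
import numpy as np

def apply_camera_offset(c2w, translation_offset):
    # Apply a learned per-image translation offset to a 4x4 camera-to-world
    # matrix, leaving the rotation untouched. Returns a corrected copy.
    adjusted = np.array(c2w, dtype=np.float64)
    adjusted[:3, 3] += translation_offset
    return adjusted
```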

4.5.5 Segmentation Results

Since our reference segmentations are noisy, we qualitatively compare them with our learned segmentations in Figure 7. We show that our model can effectively capture thin structures in the foreground component on Glasses. Additionally, on Cups our model frequently learns to correctly label the free space in the handle of the cup as background. Unfortunately, we do not have access to segmentations that are accurate enough to reflect this quantitatively. These separations are learned by leveraging the model structure and sparsity priors, without any supervision for the separation.

Because our foreground component learns a representation of the object from multiple views, in settings where the background occludes the foreground (e.g., the mannequin head in Glasses can occlude the temples of the glasses from certain viewpoints), we can easily compute an amodal segmentation by thresholding the accumulated foreground density. We show examples of this on the project website.
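Computing the amodal mask then amounts to a single threshold on the rendered foreground alpha map. A minimal sketch; the 0.5 threshold is an assumption, as the paper does not state the value:

```python
import numpy as np

def amodal_mask(fg_alpha, threshold=0.5):
    # fg_alpha: per-pixel accumulated foreground density in [0, 1],
    # obtained by volume-rendering only the foreground component.
    # Pixels where the foreground is dense are labelled object, even
    # where the background occludes it -- hence "amodal".
    return fg_alpha > threshold
```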

4.5.6 Interpolations

Figure 9: We show examples of rendering Glasses and Cups while keeping either the shape or color fixed. The left column and top row shows training images, while the middle shows foreground renderings. Best viewed in color and zoomed in.

We visualize the ability of our model to interpolate between instances in Figure 8 for Glasses and Cups. In the rendered foreground, we demonstrate smooth interpolations between the instances. The midpoints of the interpolations give plausible objects. For example, in the glasses interpolation, the midpoint shows a generated pair of glasses that exhibits the shape of the left pair (in the green box) while the frame has the thinness of the right pair (in the red box). In the cup interpolation, we see the cup size smoothly increasing from left to right. Additionally, the midpoint shows a curved handle which is a feature of the cup on the right, but the handle size reflects the cup on the left.

Since color and shape are disentangled by our network structure, we can fix one while interpolating the other. In Figure 9, we demonstrate this on Glasses and Cups. Our model is able to adapt a texture to the other shapes. In particular, row 3 of Glasses shows multi-colored glasses, where the color boundaries are clearly visible on the thicker glasses. Additionally, the texture of the black glasses with the beads in the corner is successfully transferred to all of the other glasses shapes. On Cups, we successfully transfer the metal and purple textures to the other cups; note that the cup geometry has no handle. Our model struggles to transfer the texture of the yellow cup to the other geometries; however, the outline of the smiley face remains visible.
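Both the interpolations and the shape/color swaps reduce to simple operations on the latent codes. A sketch; `render` and the code names are hypothetical:

```python
import numpy as np

def interpolate_codes(z_a, z_b, num_steps):
    # Linear interpolation in latent space; intermediate codes are rendered
    # by the network to produce in-between instances.
    alphas = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - a) * np.asarray(z_a) + a * np.asarray(z_b) for a in alphas]

# Texture transfer is just pairing one instance's shape code with another's
# color code, e.g. render(shape_code=z_shape_a, color_code=z_color_b).
```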

4.5.7 Failure Cases

Figure 10: We show 2 failure modes. See text for discussion.

We discuss two failure cases in Figure 10, showing foreground renderings and segmentation masks. First, Figure 10 (left) shows that a background with a complicated texture, such as a quilt pattern, can cause some of the background to leak into the foreground. Second, if the object has a similar color to the background, the model can push it into the background component to achieve a lower loss; in Figure 10 (right), the foreground did not capture the top bridge of the glasses.

5 Conclusion and Future Work

We have demonstrated a 2-component deformable Neural Radiance Field model, FiG-NeRF, for jointly modelling object categories in 3D whilst separating them from backgrounds of variable appearance, and we applied it to modelling object categories from handheld captures. Our results suggest that modelling 3D objects "in-the-wild" is a promising direction for vision research. Future work might improve on our method by handling a larger diversity of backgrounds and by increasing robustness to geometric errors in capture.

Appendix A Implementation Details

A.1 Schedule

We apply a top-k schedule to this loss so that it focuses on hard negatives, using the schedule [(50k, 0.0), (50k, 0.5), (50k, 0.25), (50k, 0.1), (∞, 0.05)], where an item (N, p) means we apply the loss only to the top fraction p of pixels with the highest loss for N iterations. Additionally, since the Beta distribution density function has gradients approaching infinite magnitude near 0 and 1, we clip and contract (rescale) the input to this loss so that there are no gradients when the input is very close to 0 or 1, and so that the maximum gradient magnitude is not too large.
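The top-k schedule can be sketched as follows. We assume the listed (iterations, fraction) pairs are consumed in order and that a fraction of 0.0 disables the loss:

```python
import numpy as np

# Schedule from the text: (iterations, top-k fraction); None = until the end.
SCHEDULE = [(50_000, 0.0), (50_000, 0.5), (50_000, 0.25),
            (50_000, 0.1), (None, 0.05)]

def topk_fraction_for_step(step, schedule=None):
    # Walk the schedule, subtracting each phase's iteration budget
    # until the current step falls inside a phase.
    schedule = SCHEDULE if schedule is None else schedule
    for n_iters, frac in schedule:
        if n_iters is None or step < n_iters:
            return frac
        step -= n_iters
    return schedule[-1][1]

def topk_loss(per_pixel_loss, frac):
    # Average the loss over only the top `frac` fraction of hardest pixels;
    # a fraction of 0.0 disables the loss entirely (our reading).
    if frac <= 0.0:
        return 0.0
    k = max(1, int(frac * per_pixel_loss.size))
    return float(np.sort(per_pixel_loss.ravel())[-k:].mean())
```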

A.2 Random Density Perturbation

Additionally, we add randomness to the initial portion of the training procedure in order to help encourage the separation. We perturb σ^fg and σ^bg (the foreground and background volume densities, respectively) with a random variable for the first 50k iterations of training. This helps the training process avoid local minima in which one component learns the entire scene and the other learns nothing.
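A minimal sketch of the perturbation; the noise distribution and its scale are assumptions, as the paper does not state them:

```python
import numpy as np

def perturb_density(sigma, step, rng, freeze_until=50_000, scale=1.0):
    # Perturb raw volume densities only during the first 50k iterations,
    # then train on the clean densities. Zero-mean Gaussian noise is an
    # assumed choice; the same call applies to both sigma^fg and sigma^bg.
    if step >= freeze_until:
        return sigma
    return sigma + scale * rng.standard_normal(sigma.shape)
```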

A.3 Integral Approximations

We follow [22, 21] and approximate the integrals with numerical quadrature. Let t_1 ≤ t_2 ≤ … ≤ t_K be samples along the ray between the near/far bounds t_n and t_f, and define δ_i = t_{i+1} − t_i. We approximate the rendered color (Eq. (2) in the main paper) as

C(r) ≈ Σ_{i=1}^{K} T_i (1 − exp(−σ_i δ_i)) c_i,

where T_i = exp(−Σ_{j=1}^{i−1} σ_j δ_j) and σ_i, c_i are the density and color at sample t_i.

Additionally, to apply the sparsity loss, we approximate the accumulated foreground density (Eq. (6) in the main paper) as

α^fg(r) ≈ Σ_{i=1}^{K} T_i^fg (1 − exp(−σ_i^fg δ_i)),

where T_i^fg is computed as above using the foreground densities σ^fg.
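The quadrature can be written compactly in code. A NumPy sketch following the standard NeRF formulation, returning both the composited color and the accumulated density used by the sparsity loss:

```python
import numpy as np

def render_ray(sigmas, colors, ts):
    # sigmas: (K,) densities, colors: (K, 3) colors, ts: (K,) sample depths.
    deltas = np.diff(ts)                          # delta_i = t_{i+1} - t_i
    sig = sigmas[:-1]                             # one density per interval
    alpha = 1.0 - np.exp(-sig * deltas)           # per-interval opacity
    trans = np.exp(-sig * deltas)
    T = np.concatenate([[1.0], np.cumprod(trans)[:-1]])   # transmittance T_i
    weights = T * alpha
    color = (weights[:, None] * colors[:-1]).sum(axis=0)  # composited color
    acc = float(weights.sum())                    # accumulated density in [0, 1]
    return color, acc
```

A fully dense ray yields `acc` near 1 and the color of the first sample; an empty ray yields `acc` of 0.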

A.4 Positional Encodings

For the background model and foreground template, we use a positional encoding with 10 frequencies, as in the original NeRF [22]. For the deformation field, we use 10 frequencies for the spatial encoding on Cars and Glasses, since they can exhibit high-frequency geometry such as spoilers, side mirrors, and thin frames, while we use 4 frequencies for Cups, since cups typically do not exhibit high-frequency geometry. Our viewing directions use a positional encoding with 4 frequencies.
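A sketch of the encoding, following the original NeRF formulation γ(p) = (sin(2⁰πp), cos(2⁰πp), …, sin(2^{L−1}πp), cos(2^{L−1}πp)):

```python
import numpy as np

def positional_encoding(x, num_freqs):
    # Maps each coordinate to sin/cos features at geometrically spaced
    # frequencies; output dimension = 2 * num_freqs * dim(x).
    x = np.asarray(x)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = x[..., None] * freqs                           # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)                   # (..., 2*L*D)
```

With 10 frequencies, a 3D point maps to a 60-dimensional feature vector (plus the raw coordinates, if concatenated as in NeRF).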

Appendix B Dataset Details

B.1 ShapeNet

For this dataset, we removed the latent vector that controls background appearance from the NeRF-based models (NeRF+L, NeRF+L+S, NeRF+L+S+D), since the pure gray background does not change across instances.

B.2 Cups

We detail the exact cups videos that we manually selected from Objectron [2] to build our cups dataset in Table 4.

batch 1 3 5 6 7 8 9 13 15 16 18 21 23 24 25
video number 22 1 3 15 24 20 13 4 13 21 2 21 2 29 47
30 14 16 28 36 23 9 15 28 4 28 38
39 25 40 45 15 17 39 8 38
26 46 48 39 49 9
34 48 16
42 18
batch 27 28 30 31 32 33 34 36 38 39 41 42 45 46 48
video number 42 36 11 12 13 20 38 29 17 2 16 0 7 5 23
22 38 16 20 18 2 37 14 38
29 44 21 34 44 10 46 39 41
36 31 12 47 49
40 38 17
46 40 20
44 21
Table 4: Videos used from Objectron.

Appendix C More Results

Please see the project website for more results. We show instance interpolation, viewpoint interpolation and extrapolation, separation, and amodal segmentation results.


  • [1] William Agnew, Christopher Xie, Aaron Walsman, Octavian Murad, Caelen Wang, Pedro Domingos, and Siddhartha Srinivasa. Amodal 3d reconstruction for robotic manipulation via stability and connectivity. In Conference on Robot Learning (CoRL), 2020.
  • [2] Adel Ahmadyan, Liangkai Zhang, Jianing Wei, Artsiom Ablavatski, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [3] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194, 1999.
  • [4] Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks. In International Conference on Machine Learning (ICML), 2018.
  • [5] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012, 2015.
  • [6] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. In European Conference on Computer Vision (ECCV), 2016.
  • [7] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.
  • [8] Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V Le, and Xiaodan Song. Spinenet: Learning scale-permuted backbone for recognition and localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [9] Matheus Gadelha, Subhransu Maji, and Rui Wang. 3D shape induction from 2D views of multiple objects. In International Conference on 3D Vision (3DV), 2017.
  • [10] Georgia Gkioxari, Jitendra Malik, and Justin Johnson. Mesh R-CNN. In IEEE International Conference on Computer Vision (ICCV), 2019.
  • [11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
  • [12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [13] Philipp Henzler, Niloy J Mitra, and Tobias Ritschel. Escaping plato’s cave: 3d shape from adversarial rendering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [14] Philipp Henzler, Jeremy Reizenstein, Patrick Labatut, Roman Shapovalov, Tobias Ritschel, Andrea Vedaldi, and David Novotny. Unsupervised learning of 3d object categories from videos in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in neural information processing systems (NeurIPS), 2017.
  • [16] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D Mesh Renderer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [17] Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Learning Category-Specific Mesh Reconstruction from Image Collections. In European Conference on Computer Vision (ECCV), 2018.
  • [18] Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-specific object reconstruction from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [19] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. ACM Transactions on Graphics (TOG), 38(4), 2019.
  • [20] Ricardo Martin-Brualla, Rohit Pandey, Sofien Bouaziz, Matthew Brown, and Dan B Goldman. GeLaTO: Generative Latent Textured Objects. In European Conference on Computer Vision, 2020.
  • [21] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [22] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In European Conference on Computer Vision (ECCV), 2020.
  • [23] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised Learning of 3D Representations From Natural Images. In IEEE International Conference on Computer Vision (ICCV), 2019.
  • [24] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [25] David Novotny, Diane Larlus, and Andrea Vedaldi. Learning 3D object categories by looking around them. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [26] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [27] Keunhong Park, Arsalan Mousavian, Yu Xiang, and Dieter Fox. LatentFusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [28] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Deformable neural radiance fields. arXiv preprint arXiv:2011.12948, 2020.
  • [29] Xiaofeng Ren, Charless C Fowlkes, and Jitendra Malik. Figure/ground assignment in natural images. In European Conference on Computer Vision (ECCV), 2006.
  • [30] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [31] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [32] Vincent Sitzmann, Julien N.P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit Neural Representations with Periodic Activation Functions. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [33] Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. DeepVoxels: Learning Persistent 3D Feature Embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [34] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • [35] Kenneth O Stanley. Compositional pattern producing networks: A novel abstraction of development. Genetic programming and evolvable machines, 8(2):131–162, 2007.
  • [36] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [37] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [38] A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi, K. Sunkavalli, R. Martin-Brualla, T. Simon, J. Saragih, M. Nießner, R. Pandey, S. Fanello, G. Wetzstein, J.-Y. Zhu, C. Theobalt, M. Agrawala, E. Shechtman, D. B Goldman, and M. Zollhöfer. State of the Art on Neural Rendering. Computer Graphics Forum (EG STAR), 2020.
  • [39] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 38(4):1–12, 2019.
  • [40] Sara Vicente, Joao Carreira, Lourdes Agapito, and Jorge Batista. Reconstructing PASCAL VOC. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [41] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J. Guibas. Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [42] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images. In European Conference on Computer Vision (ECCV), 2018.
  • [43] Olivia Wiles and Andrew Zisserman. Silnet : Single- and multi-view reconstruction by learning from silhouettes. In British Machine Vision Conference (BMVC), 2017.
  • [44] Shangzhe Wu, Christian Rupprecht, and Andrea Vedaldi. Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [45] Yufei Ye, Shubham Tulsiani, and Abhinav Gupta. Shelf-supervised mesh prediction in the wild. arXiv preprint arXiv:2102.06195, 2021.
  • [46] Wentao Yuan, Zhaoyang Lv, Tanner Schmidt, and Steven Lovegrove. Star: Self-supervised tracking and reconstruction of rigid objects in motion with neural rendering. arXiv preprint arXiv:2101.01602, 2021.
  • [47] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [48] Yan Zhu, Yuandong Tian, Dimitris Metaxas, and Piotr Dollár. Semantic amodal segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.