Unsupervised Discovery of Object Radiance Fields
We study the problem of inferring an object-centric scene representation from a single image, aiming to derive a representation that explains the image formation process, captures the scene's 3D nature, and is learned without supervision. Most existing methods on scene decomposition lack one or more of these characteristics, due to the fundamental challenge in integrating the complex 3D-to-2D image formation process into powerful inference schemes like deep networks. In this paper, we propose unsupervised discovery of Object Radiance Fields (uORF), integrating recent progress in neural 3D scene representations and rendering with deep inference networks for unsupervised 3D scene decomposition. Trained on multi-view RGB images without annotations, uORF learns to decompose complex scenes with diverse, textured backgrounds from a single image. We show that uORF performs well on unsupervised 3D scene segmentation, novel view synthesis, and scene editing on three datasets.
Building factorized, object-centric scene representations is a fundamental ability in human vision and a constant topic of interest in computer vision and machine learning. We identify three characteristics that such representations should have: they should be learned without supervision or prior knowledge about object categories, and therefore be applicable to environments where object categories are unknown; they should explain the image formation process, addressing questions like ‘what if the object is not there?’; and they should be 3D-aware, capturing geometric and physical object properties for navigation, interaction, and manipulation.
For decades, researchers have attempted to solve this problem from various angles. Inspiring as they are, these methods each lack one or more of the three characteristics. Computer vision research on unsupervised object discovery has achieved great success in deriving object segments from real images, but it does not capture the image formation process, nor is it 3D-aware rubinstein2013unsupervised ; zhu2012unsupervised . Recent work on deep probabilistic inference for visual scene decomposition is unsupervised and generative burgess2019monet ; engelcke2019genesis ; greff2019multi ; lin2020space ; locatello2020object , though most of it still formulates the problem as 2D segmentation and works on simple scenes of geometric primitives, ignoring the complex 3D nature of realistic visual scenes. A few recent papers on ‘scene de-rendering’ have attempted to reconstruct 3D, object-centric representations by leveraging the forward rendering procedure yao20183d ; ost2020neuralscenegraphs ; they are, however, supervised, relying on annotations of specific object and scene categories, such as cars and road scenes.
The fundamental challenge that prevents these systems from acquiring all three desired characteristics is that the image formation process from 3D to 2D is complex and non-differentiable (e.g., due to occlusion); thus, for a long time, it has been unclear how it may be integrated with powerful inference schemes, such as deep neural networks. Most recently, however, progress in differentiable and neural rendering tewari2020state ; kato2020differentiable has demonstrated that continuous scene representations work well with gradient-based inference models, capturing high-fidelity 3D scenes. In particular, Neural Radiance Fields (NeRFs) mildenhall2020nerf recover a 3D scene from a set of RGB images via differentiable volume rendering. Such encouraging advances in generative modeling suggest a promising route for inferring 3D, generative, and object-centric scene representations without supervision.
In this paper, we propose unsupervised discovery of Object Radiance Fields (uORF), integrating conditional NeRFs as 3D object representations with deep inference networks for unsupervised 3D scene decomposition. uORF infers a set of object-centric latent codes through a slot-based encoder from a single image locatello2020object . Each latent code is decoded into an object radiance field; thus, uORF represents a 3D scene as a composition of object radiance fields (Figure 1). During training, such radiance fields are neurally rendered in multiple views, with reconstruction losses in pixel space as training supervision; during testing, uORF infers the set of object radiance fields from a single image. Again, learning uORF does not require explicit supervision of 3D geometry or object segmentation, but only multi-view RGB images of training scenes.
The integration of NeRFs allows us to work with more realistic scenes with complex, diverse background environments, beyond simple scenes with the same textureless clean background, such as those in CLEVR johnson2017clevr and multi-dSprites greff2019multi , as considered by most current unsupervised scene decomposition methods. We further make two innovations to improve uORF’s performance. First, as background geometry and appearance can be quite different from those of foreground objects, we design uORF with explicit modeling of both components. This background-aware design not only facilitates learning on complex scenes, but also allows single-image scene manipulation, including moving individual objects and changing the background. Second, as volume rendering requires a massive number of queries to render a single pixel of the recomposed scene, a practical challenge in learning uORF is its computational inefficiency. We tackle this issue by proposing a novel progressive coarse-to-fine training scheme, which improves representation quality while keeping the computational cost affordable.
We evaluate uORF on both scene representation learning (e.g., 3D segmentation) and scene generation (e.g., novel view synthesis, scene manipulation). Our evaluation is on three photo-realistic datasets of gradually increasing complexity: first, CLEVR-like scenes with primitive foreground shapes; second, room scenes with complex chair shapes and textured backgrounds; third, more diverse room scenes with various foreground shapes and backgrounds. Our results show that uORF learns factorized representations that can segment 3D scenes into objects with fine shape details (e.g., thin chair legs) and backgrounds with well-recovered appearance details (e.g., irregular textures of a wooden floor). We also show that the learned representations allow 3D scene manipulation, including moving objects and changing background appearances. We will release all code and data.
Our work is closely related to traditional computer vision methods on object discovery, which aim to locate (visually similar) objects in a collection of images. These methods typically model objects as visual words and adopt methods from topic modeling to localize objects russell2006using ; sivic2005discovering ; sivic2008unsupervised , or cluster and group image patches grauman2006unsupervised ; joulin2010discriminative ; rubio2012unsupervised ; vicente2011object ; rubinstein2013unsupervised ; cho2015unsupervised . Recent works have integrated the clustering-based strategy with deep learning li2019group ; vo2020toward . Nevertheless, they do not explain the image formation process, nor are they 3D-aware.
Our method is also closely related to recent work on deep probabilistic inference for scene decomposition. Most works formulate the problem as compositional generative models, in which a visual scene is represented by a set of latent codes that correspond either to localized object-centric patches eslami2016attend ; crawford2019spatially ; kosiorek2018sequential ; lin2020space ; jiang2019scalable or to scene mixture components burgess2019monet ; greff2019multi ; greff2016tagger ; greff2017neural ; engelcke2019genesis . The scene mixture models generate full-sized images for each latent code and blend them via attentional masks burgess2019monet in iterative variational inference frameworks. Recently, Locatello et al. locatello2020object proposed the Slot Attention module to simplify the inference with a slot-based encoder. We adopt a similar slot-based encoder architecture locatello2020object , but ours explicitly models the background environment to deal with complex scenes. Besides these inference models, Monnier et al. formulated scene decomposition as layered image decomposition and demonstrated it on real images monnier2021unsupervised . However, these methods do not account for the 3D nature of scenes.
Very recently, a few works have also focused on unsupervised 3D scene decomposition. Elich et al. elich2020semi infer object shapes park2019deepsdf from a single scene image, but they require pretraining on groundtruth shapes. Chen et al. chen2020learning extend the Generative Query Network eslami2018neural to decompose 3D scenes, but they require multi-view images during inference. The closest to our work is a concurrent work by Stelzner et al. stelzner2021decomposing , which also utilizes a slot-based encoder and NeRFs as 3D representations. However, stelzner2021decomposing relies on groundtruth multi-view dense depth in addition to images in training. Moreover, we explicitly model the separation of objects and background to address various complex shapes and textured backgrounds, while they only demonstrate scenes with a single textureless background.
A few recent works have reconstructed 3D object-centric representations by incorporating the forward image rendering process wu2017neural ; yao20183d ; kundu20183d ; ost2020neuralscenegraphs . Yao et al. yao20183d de-render an image into semantic segments and geometric object attributes, which enable 3D scene manipulation. Most recently, Ost et al. propose Neural Scene Graph to represent a dynamic scene as a scene graph in which each node encodes object-centric information. However, these methods rely on manual annotations of specific objects (such as cars) and scene categories (such as street scenes).
Our method is related to recent progress in neural continuous scene representations park2019deepsdf ; mescheder2019occupancy ; sitzmann2019scene and neural rendering tewari2020state . Neural scene representations parameterize 3D scenes with a deep network sitzmann2019scene . Combined with differentiable neural rendering techniques kato2020differentiable ; tewari2020state , they can be learned from only 2D images niemeyer2020differentiable ; sitzmann2019scene . In particular, Neural Radiance Fields (NeRFs) mildenhall2020nerf have shown impressive novel view synthesis from a set of densely captured images. Related follow-up works include those that infer NeRFs from a single image yu2020pixelnerf ; kosiorek2021nerf ; rematas2021sharf and those that incorporate NeRFs into generative models schwarz2020graf ; niemeyer2021campari ; chan2020pi . Different from these works, which cope with single objects or holistic scenes, we learn object NeRFs by decomposing a multi-object scene without segmentation annotations. Compositional generative modeling of 3D scenes is also related to our work niemeyer2020giraffe ; guo2020object . GIRAFFE niemeyer2020giraffe adversarially generates latent codes to condition object NeRFs and thus compose 3D scenes. While they target compositional scene synthesis, we instead focus on object discovery (inference).
Our goal is to infer from a single image a set of object-centric 3D representations to recover the underlying scene. We show an illustration in Figure 2. We assume that an underlying 3D scene is composed of a background and a set of foreground objects, each represented by a neural radiance field mildenhall2020nerf conditioned on a latent code. The latent codes are inferred from an RGB image by a slot-based encoder (Figure 2-I). After being decoded, all foreground objects and the background can then be recomposed and re-rendered from arbitrary camera views (Figure 2-II). We train our model by comparing the re-rendered images to reference RGB images (Figure 2-III) without needing 3D geometry or segmentation annotations. We describe each of our model components in the following subsections, and leave implementation details to the supplement.
Our encoder infers latent object-centric representations from a single image. As shown in Figure 3, it consists of a convolutional net to extract features and a background-aware slot attention module to produce latent codes from the feature maps.
The convolutional net extracts features from the input image for the slot attention module. Because we want the model to generalize to decomposing unseen images, it is natural to represent the position and pose of foreground objects in the viewer coordinate system. As identified in previous studies tatarchenko2019single , this facilitates the learning of 3D object position and helps generalization. In order for the object-centric representations to include such information in the viewer coordinate system, we inform the encoder of position information by feeding pixel coordinates and viewer-space ray directions as additional input channels.
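As a rough illustration of the extra input channels described above, the sketch below builds, for every pixel, its normalized image coordinates and a unit ray direction in viewer space. The pinhole camera model, the `focal` parameter (in pixels), the camera looking down the negative z axis, and the function name `pixel_ray_channels` are all assumptions made for this example; the source does not specify them.

```python
import math

def pixel_ray_channels(height, width, focal):
    """Return a height x width grid of [u, v, dx, dy, dz] per pixel.

    Hypothetical helper: (u, v) are pixel coordinates normalized to [-1, 1],
    (dx, dy, dz) is the unit viewing ray in the camera (viewer) frame.
    """
    grid = []
    for row in range(height):
        line = []
        for col in range(width):
            # Normalized pixel-center coordinates.
            u = 2.0 * (col + 0.5) / width - 1.0
            v = 2.0 * (row + 0.5) / height - 1.0
            # Ray through the pixel center; assumed convention: camera looks down -z.
            x = (col + 0.5 - width / 2.0) / focal
            y = -(row + 0.5 - height / 2.0) / focal
            norm = math.sqrt(x * x + y * y + 1.0)
            line.append([u, v, x / norm, y / norm, -1.0 / norm])
        grid.append(line)
    return grid
```

These five values would be concatenated with the RGB channels before the first convolution, so that the encoder sees where each pixel lies in the viewer's frame.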
Given the feature maps extracted from a convolutional network, we adopt the Slot Attention module locatello2020object to produce a set of permutation-invariant latent codes in the same representational space. Each latent code binds to a specific group of the convolutional features to explain an object. However, in 3D scenes, the geometry and appearance of the background are usually highly different from those of foreground objects. Modeling them indistinguishably often leads to object representations entangled with blurry background segments burgess2019monet ; locatello2020object , which impedes applications such as scene manipulation and re-composition. Therefore, we propose modeling the separation of foreground objects and background explicitly.
To do this, we extend the slot-based encoder to allow a single slot to lie in a different latent space from the other slots, so that it specializes in the background features. We show pseudo-code of our background-aware slot attention in the supplementary material. We also refer readers to Locatello et al. locatello2020object for more details and insight into the slot attention module. In the following, we describe a single iteration of our background-aware slot attention module.
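We flatten the convolutional feature maps into a set of $N$ input feature vectors, $\mathrm{feat} \in \mathbb{R}^{N \times D}$. The latent representations (i.e., slots) are initialized by sampling from two learnable Gaussians, i.e., $\mathrm{slot}^b \sim \mathcal{N}^b(\mu^b, \mathrm{diag}(\sigma^b)) \in \mathbb{R}^{1 \times D}$ and $\mathrm{slots}^f \sim \mathcal{N}^f(\mu^f, \mathrm{diag}(\sigma^f)) \in \mathbb{R}^{K \times D}$ for background and foreground objects, respectively. All slots then compete to explain the inputs via a dot-product softmax-based attention bahdanau2014neural ; luong2015effective ; vaswani2017attention :
$$\mathrm{attn}^b_i = \frac{\exp(M^b_i)}{\exp(M^b_i) + \sum_{l=1}^{K} \exp(M^f_{i,l})}, \qquad \mathrm{attn}^f_{i,j} = \frac{\exp(M^f_{i,j})}{\exp(M^b_i) + \sum_{l=1}^{K} \exp(M^f_{i,l})},$$
$$M^b = \frac{1}{\sqrt{D}}\, k(\mathrm{feat}) \cdot q^b(\mathrm{slot}^b)^\top, \qquad M^f = \frac{1}{\sqrt{D}}\, k(\mathrm{feat}) \cdot q^f(\mathrm{slots}^f)^\top.$$
Here $k$ and $q^b$/$q^f$ are learnable linear mappings for computing dot-product similarity luong2015effective , and $\sqrt{D}$ is a fixed softmax temperature vaswani2017attention . The softmax normalization introduces competition among all slots for explaining the feature vectors by enforcing the attention coefficients for each input feature vector to add up to one. The background slot is expected to capture the modality of background features and explain all of them, allowing foreground slots to focus only on the objects without explaining background segments (Figure 3). With the attention coefficients, input values $v(\mathrm{feat})$ are aggregated via a weighted mean pooling $\mathrm{updates}^b = (W^b)^\top \cdot v(\mathrm{feat})$, where $W^b_i = \mathrm{attn}^b_i / \sum_{l=1}^{N} \mathrm{attn}^b_l$, and $\mathrm{updates}^f = (W^f)^\top \cdot v(\mathrm{feat})$, where $W^f_{i,j} = \mathrm{attn}^f_{i,j} / \sum_{l=1}^{N} \mathrm{attn}^f_{l,j}$.
All slots are then updated using the aggregated values via a learnable updating rule parameterized by a Gated Recurrent Unit (GRU) cho2014learning . Notice that the updating is applied independently for each slot with shared parameters (except for the background slot due to its different feature modality). The final latent codes are the slots after being updated for $T$ iterations.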
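The attention and pooling part of one iteration can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: linear maps are plain matrices given as lists of rows, a single value map `v` is shared by all slots, the GRU update that follows the pooling is omitted, and the names `linear` and `attention_step` are hypothetical.

```python
import math

def linear(vectors, weight):
    """Apply a linear map, given as a list of output rows, to each vector."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight] for vec in vectors]

def attention_step(feats, slot_bg, slots_fg, k, q_bg, q_fg, v):
    """Return (attn, updates); slot index 0 is the background slot."""
    dim = len(feats[0])
    keys, values = linear(feats, k), linear(feats, v)
    # Background and foreground slots use separate query maps.
    queries = linear([slot_bg], q_bg) + linear(slots_fg, q_fg)
    attn = []
    for key in keys:
        logits = [sum(a * b for a, b in zip(key, query)) / math.sqrt(dim)
                  for query in queries]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        attn.append([e / total for e in exps])  # softmax over slots, per feature
    updates = []
    for j in range(len(queries)):
        # Weighted mean over features: normalize slot j's attention column.
        column_sum = sum(row[j] for row in attn)
        updates.append([
            sum(attn[i][j] * values[i][d] for i in range(len(feats))) / column_sum
            for d in range(len(values[0]))
        ])
    return attn, updates
```

Because the softmax runs over slots for each feature vector, every row of `attn` sums to one, which is what makes the slots compete; the column normalization then turns each slot's attention into mean-pooling weights.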
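We use the latent codes to condition neural radiance fields (NeRFs) mildenhall2020nerf to represent the 3D objects. A NeRF is a continuous mapping from spatial location $\mathbf{x}$ and viewing direction $\mathbf{d}$ to emitted color $\mathbf{c}$ and volume density $\sigma$ used for volume rendering max1995optical . This mapping is parameterized by an MLP network. We use a conditional NeRF that acts like an implicit decoder for each object. Specifically, we represent the background by $g^b: (\mathbf{x}, \mathbf{d} \mid z^b) \mapsto (\mathbf{c}_0, \sigma_0)$ and the $K$ foreground objects by another conditional NeRF network $g^f: (\mathbf{x}, \mathbf{d} \mid z^f_i) \mapsto (\mathbf{c}_i, \sigma_i)$, $i = 1, \dots, K$, where $z^b$ is the background latent code and $z^f_i$ is the latent code of the $i$-th foreground object.
To compose individual objects and background into the holistic scene, we consider a scene mixture model and use density-weighted mean to combine all components: $\bar{\sigma} = \sum_{i=0}^{K} w_i \sigma_i$, $\bar{\mathbf{c}} = \sum_{i=0}^{K} w_i \mathbf{c}_i$, where $w_i = \sigma_i / \sum_{j=0}^{K} \sigma_j$. Here $\bar{\sigma}$ and $\bar{\mathbf{c}}$ are the combined density and color, respectively. The color of a camera ray $\mathbf{r}$ is then estimated via numerical integration of volume rendering, using $S$ discrete combined samples along a ray max1995optical : $\hat{C}(\mathbf{r}) = \sum_{m=1}^{S} T_m \left(1 - \exp(-\bar{\sigma}_m \delta_m)\right) \bar{\mathbf{c}}_m$, where $T_m = \exp\left(-\sum_{n=1}^{m-1} \bar{\sigma}_n \delta_n\right)$. Here $\delta_m$ is the distance between adjacent samples along a ray.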
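The composition and rendering steps can be illustrated with a small sketch: a density-weighted mean over components at each sample point, followed by the standard volume-rendering quadrature along a ray. The function names and the zero-density guard are assumptions for this example; in a real system these operations would be batched tensor code.

```python
import math

def compose(densities, colors):
    """Density-weighted mean of per-component (sigma, rgb) at one 3D point."""
    total = sum(densities)
    if total == 0.0:
        # Assumed guard for empty space; avoids dividing by zero.
        return 0.0, [0.0, 0.0, 0.0]
    weights = [s / total for s in densities]
    sigma = sum(w * s for w, s in zip(weights, densities))
    rgb = [sum(w * c[ch] for w, c in zip(weights, colors)) for ch in range(3)]
    return sigma, rgb

def render_ray(samples, deltas):
    """Accumulate color over combined (sigma, rgb) samples ordered near to far."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for (sigma, rgb), delta in zip(samples, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        for ch in range(3):
            pixel[ch] += transmittance * alpha * rgb[ch]
        transmittance *= math.exp(-sigma * delta)
    return pixel
```

Since the weights are proportional to density, a component with zero density at a point contributes nothing to the combined color there, which is what lets each object field stay silent outside its own extent.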
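During training, we render multiple views from the recomposed scene NeRF for supervision. Our training loss comprises a reconstruction loss, a perceptual loss, and an adversarial loss: $\mathcal{L} = \mathcal{L}_{\text{recon}} + \lambda_{\text{percept}} \mathcal{L}_{\text{percept}} + \lambda_{\text{adv}} \mathcal{L}_{\text{adv}}$, where $\lambda_{\text{percept}}$ and $\lambda_{\text{adv}}$ are weights. The reconstruction loss is $\mathcal{L}_{\text{recon}} = \|I - \hat{I}\|^2$, where $I$ and $\hat{I}$ denote the groundtruth image and rendered image, respectively.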
Since we estimate 3D radiance fields from a single view, there can be uncertainties about the appearance from other views (e.g., the back view). For example, regarding visual appearance of objects, inaccurate global lighting estimation leads to uncertainties in brightness and shadows from occluded views even if the object shapes can be well estimated. To address this, we incorporate a perceptual loss johnson2016perceptual which is tolerant to mild appearance changes. The perceptual loss is defined by $\mathcal{L}_{\text{percept}} = \|p(I) - p(\hat{I})\|^2$, where $p$ is a deep feature extractor applied to the groundtruth image $I$ and the rendered image $\hat{I}$.
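In addition to appearance, there can be even higher uncertainties in estimating object shapes from a single view, which is a multi-modal distribution. In this case, the unimodal reconstruction loss leads to blurry results (“mean shape”). We mitigate this issue by adding an adversarial loss which can deal with multi-modal distributions: $\mathcal{L}_{\text{adv}} = \mathbb{E}[\zeta(-D(\hat{I}))]$, with the discriminator $D$ trained by minimizing $\mathcal{L}_D = \mathbb{E}[\zeta(-D(I)) + \zeta(D(\hat{I}))] + \lambda_R\, \mathbb{E}[\|\nabla_I D(I)\|^2]$, where $\zeta(t) = \log(1 + \exp(t))$, and $I$ and $\hat{I}$ are the groundtruth and rendered images.
Here we adopt the non-saturating loss with R1 regularization mescheder2018training .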
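For readers unfamiliar with the non-saturating GAN objective, the sketch below writes it out on raw discriminator logits, under the common convention that a higher logit means "more real". It is an illustration of the standard formulation, not the authors' code; the R1 term is omitted because it needs the gradient of the discriminator with respect to real images, which requires an autodiff framework. The function names are hypothetical.

```python
import math

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def generator_loss(fake_logits):
    """Non-saturating generator loss: push logits of rendered images up."""
    return sum(softplus(-d) for d in fake_logits) / len(fake_logits)

def discriminator_loss(real_logits, fake_logits):
    """Discriminator loss without the R1 gradient penalty."""
    real_term = sum(softplus(-d) for d in real_logits) / len(real_logits)
    fake_term = sum(softplus(d) for d in fake_logits) / len(fake_logits)
    return real_term + fake_term
```

The generator term stays informative even when the discriminator confidently rejects a rendering, which is the reason the non-saturating form is usually preferred over the original minimax loss.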
A practical challenge in training compositional NeRFs lies in the computational cost of neural volume rendering, as it requires a massive number of queries to render a single pixel. While there have been attempts at fast inference Liu20neurips_sparse_nerf ; Rebain20arxiv_derf ; neff2021donerf ; Garbin21arxiv_FastNeRF ; reiser2021kilonerf ; yu2021plenoctrees , high space complexity in training remains a challenge. Further, because our perceptual and adversarial losses depend on image patches, the system has to render a large enough patch (instead of a single pixel) at the same time, which further increases its memory demand.
To allow training at a higher resolution, we propose a coarse-to-fine progressive training scheme. In the coarse training stage, we bilinearly downsample the supervision images to a base resolution, and train uORF on these downsampled images. Although the coarsely trained model can already decompose the 3D scenes and recover rough object radiance fields, fine details (e.g., thin legs of chairs) might be missing. Thus, in the subsequent fine training stage, we refine our model by training on patches randomly cropped from images of the higher target resolution. Specifically, the fine training stage can be easily implemented by replacing the holistic downsampled images with patches of the same base resolution. We include more details in the supplementary material.
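The switch between the two stages can be sketched as a change in how the supervision target is produced, with the rendered size staying fixed at the base resolution. The sketch below assumes square single-channel images stored as nested lists, and the names `bilinear_resize`, `random_crop`, and `supervision_target` are hypothetical; a real pipeline would also restrict the rendered rays to the pixels of the chosen crop.

```python
import random

def _taps(index, in_size, out_size):
    """Neighbouring input indices and blend weight for one output index."""
    pos = (index + 0.5) * in_size / out_size - 0.5
    pos = min(max(pos, 0.0), in_size - 1.0)
    low = int(pos)
    high = min(low + 1, in_size - 1)
    return low, high, pos - low

def bilinear_resize(image, out_h, out_w):
    """Resize a 2D list of floats by bilinear interpolation at pixel centers."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for r in range(out_h):
        y0, y1, fy = _taps(r, in_h, out_h)
        row = []
        for c in range(out_w):
            x0, x1, fx = _taps(c, in_w, out_w)
            top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
            bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
            row.append((1 - fy) * top + fy * bottom)
        out.append(row)
    return out

def random_crop(image, size, rng):
    """Cut a size x size patch at a random location."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

def supervision_target(image, base, stage, rng):
    """Coarse stage: whole image downsampled; fine stage: full-resolution patch."""
    if stage == "coarse":
        return bilinear_resize(image, base, base)
    return random_crop(image, base, rng)
```

Either way the target has `base` x `base` pixels, so the number of rays rendered per step, and hence the memory footprint, is the same in both stages.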
We evaluate uORF on both scene representation (via 3D segmentation) and scene generation (via novel view synthesis and scene manipulation) on three photo-realistic datasets.
We build three photo-realistic synthetic datasets with gradually increasing complexity. For each scene in the dataset, we point the camera to the scene center and render four images with a randomly chosen azimuth angle and a fixed elevation angle.
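A camera setup of this kind can be sketched as sampling positions on a circle of fixed elevation around the scene center, with every camera looking at the origin. The elevation of 30 degrees, the radius of 10 units, the z-up convention, and the choice to draw an independent azimuth for each of the four views are placeholder assumptions; the source does not give these values.

```python
import math
import random

def camera_position(azimuth_deg, elevation_deg, radius):
    """Position on a sphere around the origin; z is assumed to point up."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (radius * math.cos(el) * math.cos(az),
            radius * math.cos(el) * math.sin(az),
            radius * math.sin(el))

def sample_views(rng, num_views=4, elevation_deg=30.0, radius=10.0):
    """Draw camera positions with random azimuth and fixed elevation."""
    return [camera_position(rng.uniform(0.0, 360.0), elevation_deg, radius)
            for _ in range(num_views)]
```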
The first dataset (CLEVR-567) includes scenes of 5–7 CLEVR objects johnson2017clevr , each with a random position and orientation, on a clean background. Foreground object shapes include three geometric primitives (i.e., cubes, spheres and cylinders). Since there is intrinsic ambiguity in estimating specularity from a single image, we use only the largely diffuse “Rubber” material. There are 1,000 scenes for training and 500 for testing.
The second dataset (Room-Chair) includes scenes of 3 to 4 chairs of the same shape in a room with three different textured backgrounds. There are 1,000 scenes for training and 500 for testing.
The third dataset (Room-Diverse) includes scenes of diverse foreground object shapes and background appearances. Each scene includes 4 different chairs, whose shapes are randomly sampled from 1,200 ShapeNet chair shapes chang2015shapenet , and the background is sampled from 50 floor textures from the web. There are 5,000 scenes for training and 500 for testing.
We first evaluate uORF’s factorized 3D scene representations via 3D scene segmentation.
Because there is no previous work focusing on the same setting as uORF, we compare to a state-of-the-art 2D scene decomposition model, Slot Attention locatello2020object , for unsupervised scene segmentation wherever possible. In addition, we compare to two ablated versions of uORF. First, we remove our background-aware modeling but keep the same number of slots. Second, we ablate our progressive training such that the training procedure only contains the coarse training stage. We refer to the ablated models as “uORF (w/o background)” and “uORF (w/o prog. train.)”, respectively.
We adopt the widely used Adjusted Rand Index (ARI) as our metric. To evaluate 3D scene segmentation, we consider three kinds of ARI: (1) For direct comparison to 2D methods, we compute ARI on reconstructed images. (2) To reflect the 3D nature, we also compute ARI on synthesized novel views, denoted as “NV-ARI”. Note that each scene includes 4 views; only one is used as input, and the other three are treated as novel views for this metric. (3) In line with previous 2D methods, we also report foreground ARI (Fg-ARI), computed only on foreground regions indicated by groundtruth masks. Yet, we note that Fg-ARI does not reflect segmentation quality as accurately as ARI, because background segments assigned to foreground slots are treated as correct.
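To make the difference between ARI and Fg-ARI concrete, the sketch below computes both from flattened per-pixel label lists using the standard pair-counting formula. The function names and the convention that label 0 marks background in the groundtruth are assumptions for this example.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_ids, pred_ids):
    """Adjusted Rand Index between two labelings of the same pixels."""
    n = len(true_ids)
    if n < 2:
        return 1.0
    cells = Counter(zip(true_ids, pred_ids))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(true_ids).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_ids).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # both labelings are trivial (e.g., one cluster each)
    return (sum_cells - expected) / (max_index - expected)

def foreground_ari(true_ids, pred_ids, background_id=0):
    """ARI restricted to pixels whose groundtruth label is not background."""
    keep = [i for i, t in enumerate(true_ids) if t != background_id]
    return adjusted_rand_index([true_ids[i] for i in keep],
                               [pred_ids[i] for i in keep])

# Background pixels (label 0) are wrongly absorbed by the slot of object 1.
truth = [0, 0, 1, 1, 2, 2]
prediction = [1, 1, 1, 1, 2, 2]
full_score = adjusted_rand_index(truth, prediction)
fg_score = foreground_ari(truth, prediction)
```

In this toy case `fg_score` is perfect while `full_score` is clearly below one, which is the caveat noted above: Fg-ARI cannot see background pixels that a foreground slot has swallowed.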
For all segmentation metrics, we show the mean and standard deviation over three runs. uORF outperforms all methods in terms of ARI and NV-ARI. From Figure 4, it is clear that uORF is able to discover the 3D objects from a single image, and that uORF better depicts the object outlines while entangling fewer background segments. These results validate that uORF can learn well-factorized 3D object-centric scene representations.
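|Model||CLEVR-567 ARI||CLEVR-567 NV-ARI||CLEVR-567 Fg-ARI||Room-Chair ARI||Room-Chair NV-ARI||Room-Chair Fg-ARI||Room-Diverse ARI||Room-Diverse NV-ARI||Room-Diverse Fg-ARI|
|Slot Attention locatello2020object||3.5±0.7||-||93.2±1.5||38.4±18.4||-||40.2±4.5||17.4±11.3||-||43.8±11.7|
|uORF (w/o background)||11.7±4.6||10.5±3.6||86.4±2.8||42.3±10.6||40.4±9.2||93.3±1.9||24.0±9.9||21.0±8.1||78.9±3.1|
|uORF (w/o prog. train.)||83.7±0.8||81.1±0.7||84.2±0.5||65.4±2.6||62.3±2.5||81.0±3.0||63.7±1.7||53.8±1.4||66.9±4.1|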
We then show that uORF is 3D-aware and generative via evaluation on novel view synthesis.
For each test scene, we randomly pick one image as input and use the remaining three images as groundtruth for novel view synthesis. As Slot Attention is purely 2D and does not support novel view synthesis, we compare to a variant of NeRF mildenhall2020nerf equipped with an encoder similar to that of uORF, termed “NeRF-AE”. For fair comparison, we increase the latent bottleneck dimension for NeRF-AE to guarantee approximately the same computational cost, and we use the same training strategy and losses as for uORF. Thus, NeRF-AE can also be seen as a monolithic alternative to uORF. We also compare with the same ablated models, “uORF (w/o background)” and “uORF (w/o prog. train.)”, as in scene segmentation. We use the perceptual metric LPIPS zhang2018unreasonable , together with SSIM wang2004image and PSNR, as our evaluation metrics.
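Of the three metrics, PSNR is simple enough to write out directly; LPIPS and SSIM are more involved and are normally taken from existing libraries. The sketch below assumes images flattened to lists of intensities in [0, 1]; the function name and the `max_value` parameter are choices made for this example.

```python
import math

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    squared = [(a - b) ** 2 for a, b in zip(reference, rendered)]
    mse = sum(squared) / len(squared)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR and SSIM indicate a closer match to the groundtruth view, whereas LPIPS is a distance, so lower is better.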
Quantitative results are in Table 2 and qualitative results are in Figure 5 (more qualitative results can be found in the supplementary material). Quantitatively, uORF outperforms all compared methods on all metrics. From the qualitative comparison in Figure 5, we highlight three advantages of uORF. First, compared with NeRF-AE, which has a monolithic latent structure for the entire scene, uORF better preserves the features of each object: for example, see how NeRF-AE fuses object colors in the first two rows, while uORF does not. This shows the advantage of factorized scene representations in structurally describing a visual scene. Second, compared with uORF (w/o background), one can clearly see how our background-aware modeling helps recover background appearances: uORF accurately recovers the background appearance of the Room-Chair example, while uORF (w/o background) does not. It also facilitates learning on complex scenes with diverse, textured backgrounds: uORF can learn to roughly recover object shapes in the Room-Diverse example. Third, compared with uORF (w/o prog. train.), we highlight that the fine training on image patches indeed improves both visual quality and representation quality: the full uORF tries to recover the sharp edges of cubes, while uORF (w/o prog. train.) cannot distinguish a cube from a sphere.
Overall, the novel view synthesis results suggest that uORF can learn to represent 3D scenes with reasonable fidelity, even in the presence of complex foreground object shapes, such as chairs, and different textured backgrounds.
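|Model||CLEVR-567 LPIPS||CLEVR-567 SSIM||CLEVR-567 PSNR||Room-Chair LPIPS||Room-Chair SSIM||Room-Chair PSNR||Room-Diverse LPIPS||Room-Diverse SSIM||Room-Diverse PSNR|
|uORF (w/o background)||0.0919||0.8924||28.93||0.1671||0.7852||27.86||0.2231||0.6924||25.90|
|uORF (w/o prog. train.)||0.1044||0.8894||28.84||0.1573||0.8287||28.33||0.2123||0.6760||25.19|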
Being object-centric and 3D-aware, uORF is able to edit 3D scene radiance fields inferred from a single view, and generate novel scene images.
We test uORF’s ability to edit scenes and synthesize novel images on the Room-Chair dataset. We consider both moving foreground objects and changing the background appearance. For object moving, we randomly pick one object in a test scene and move it to a random position. We render 4 images for each of the 500 test scenes. For background changing, we replace the current background texture with a different one and also render 4 images for evaluation. To indicate the new background, we re-sample and re-place the foreground objects so that the resulting background indicator image differs from the groundtruth image.
For uORF and Slot Attention locatello2020object , we use groundtruth masks of the input view to determine which slot should be manipulated by picking the one with largest mask IoU. For NeRF-AE mildenhall2020nerf , we back-project the masks to frustums to determine the 3D regions to be moved/replaced. As in view synthesis, we use LPIPS, SSIM, and PSNR as our metrics.
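The slot selection step can be sketched as an argmax over mask IoU. The example assumes binary masks flattened to lists of 0/1 values, one per slot, and the function names are hypothetical.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two flattened binary masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return intersection / union if union else 0.0

def pick_slot(target_mask, slot_masks):
    """Index of the slot whose predicted mask best overlaps the target mask."""
    return max(range(len(slot_masks)),
               key=lambda j: mask_iou(target_mask, slot_masks[j]))

# Groundtruth mask of the object to edit, and three predicted slot masks.
target = [1, 1, 0, 0]
slots = [[0, 0, 1, 1], [1, 1, 1, 0], [1, 0, 0, 0]]
chosen = pick_slot(target, slots)
```

Here the IoUs are 0, 2/3, and 1/2, so slot 1 is selected as the one to move or replace.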
Finally, we explore the generalization ability of uORF. We consider generalization to unseen, challenging spatial arrangements of objects, as well as generalization to unseen object appearances.
We build a new test dataset, packed-CLEVR-11, where each scene has 11 objects that are closely packed into a cluster. Therefore, each scene bears an unseen number of objects in an unseen, challenging arrangement. We test models trained on CLEVR-567, and report results in Table 6 and the supplementary material. Although uORF never sees such object arrangements during training, it still achieves reasonable performance and outperforms the baselines.
For unseen object appearances, we consider generalization in a systematic way, such that the model must deal with unseen combinations of object color and shape. Thus, we build a new training set similar to CLEVR-567, but we remove red cylinders and blue spheres from the object candidate pool. Then we test the trained models on another dataset with only red cylinders and blue spheres in the candidate pool. We show results in Table 6 and examples in the supplementary material. We see that although uORF has never seen any of the test set objects, it achieves results similar to those of the model trained on the normal CLEVR-567 dataset (denoted as “uORF (oracle)”). This suggests uORF’s ability to generalize systematically to unseen combinations of object color and shape.
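The held-out split can be illustrated by constructing the two candidate pools explicitly. The color palette below is a shortened, illustrative list and the helper names are hypothetical; only the held-out pairs (red cylinders and blue spheres), the three shapes, and the 5 to 7 objects per scene come from the text.

```python
from itertools import product
import random

SHAPES = ("cube", "sphere", "cylinder")
COLORS = ("red", "blue", "green", "yellow")  # illustrative palette, not the full set
HELD_OUT = {("red", "cylinder"), ("blue", "sphere")}

def candidate_pools():
    """Split all (color, shape) pairs into training and test candidate pools."""
    all_pairs = set(product(COLORS, SHAPES))
    return sorted(all_pairs - HELD_OUT), sorted(HELD_OUT)

def sample_scene(pool, rng, min_objects=5, max_objects=7):
    """Draw the object list for one scene from a candidate pool."""
    count = rng.randint(min_objects, max_objects)
    return [rng.choice(pool) for _ in range(count)]
```

Every color and every shape in the test pool still appears somewhere in the training pool, just never together, which is what makes this a test of systematic (compositional) generalization rather than of entirely novel attributes.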
uORF uses perceptual and adversarial losses to combat intrinsic uncertainties in single-image inference of 3D representations. We show ablation results on novel view synthesis in Table 6 and the supplementary material. Both losses significantly improve image quality.
In this work, we propose unsupervised discovery of Object Radiance Fields (uORF), which learns to infer object-centric 3D radiance fields from a single image of complex multi-object scenes. We demonstrate uORF’s ability on 3D scene segmentation and scene generation. Our positive results suggest a promising direction: integrating neural rendering into deep probabilistic inference schemes, which enables learning factorized 3D object-centric scene representations from only RGB images.
Learning object-centric scene representations is a long-standing topic in vision, and it finds various applications in downstream tasks. We represent a 3D scene as a composition of simple radiance fields, which only model object appearances and entangle, in a non-interpretable way, physical properties that may be crucial to downstream tasks. However, we envision that careful designs of more structured 3D object representations for specific downstream applications could help improve transparency and human interpretability in model prediction and behavior, allowing both better performance and secure, fair usage. In our code release, we will explicitly specify allowable uses of our system with appropriate licenses. We will use techniques such as watermarking to identify and label visual content generated by our system.
This work is supported by an Amazon Research Award (ARA), the Samsung Global Research Outreach (GRO) Program, Toyota Research Institute, Autodesk, a Qualcomm Innovation Fellowship (QIF), and Stanford Institute for Human-Centered AI (HAI).
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1201–1210, 2015.
Spatially invariant unsupervised object detection with convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, 2016.
Group-wise deep object co-segmentation with co-attention recurrent neural network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8519–8528, 2019.