1 Introduction
Reinforcement learning agents that need to deal with 3D scenes often have to grapple with enormous observation spaces that result from the combinatorial interactions between objects and scene layouts. This phenomenon renders naïve representations all but infeasible except on the simplest of 3D scenes. Motivated in part by cognitive psychology studies [ObjectFile] that suggest human brains organize observations at an object level, recent advances in reinforcement learning research [OOMDP; SchemaNetworks] and physical prediction [OOPhysics1] have demonstrated superior robustness with environments modeled in an object-oriented manner. To exploit these techniques for 3D scenes, reinforcement learning agents need fast and robust ways to segment objects from 3D observations to achieve an object-centric representation of the scene. This is the high-level research problem that motivates the algorithm presented in this paper.
There is a good body of existing literature on supervised object segmentation techniques for 2D [MaskRCNN; YOLO] and 3D [meshrcnn; PointGroup] scenes. To get around the downside of having to rely heavily on human annotations, unsupervised object-segmentation methods from RGB images have received an increasing amount of attention [IODINE; MONET; SPACE; PSG] in recent times. In particular, generative models based on Variational Autoencoders (VAEs) [VAE] have been employed to model the pixel intensities of an image with spatial Gaussian mixture models [IODINE; SPACE; SPAIR; AIR]. The encoder-decoder structure of a VAE forms an information bottleneck by compressing regions of highly correlated appearance, and the notion of objectness (i.e. the property of being an object) is then identified with image regions that exhibit strong appearance correlation. Unfortunately, extending existing image-based generative models [IODINE; MONET; SPACE; PSG] to achieve unsupervised 3D point cloud segmentation is not straightforward. It is known that 2D image-based VAEs mainly rely on coordinate-dependent intensity modeling and reconstruction [BCVAE]. In contrast, finding correspondences between input point clouds and predicted ones for 3D point-cloud-based generative models is far from trivial. Existing works have shown that learning object-centric representations is crucial for unsupervised 2D object segmentation [IODINE; MONET; SPAIR]. SPAIR [SPAIR], in particular, makes use of an object-specification scheme which allows the method to scale well to scenes with a large number of objects. Inspired by SPAIR [SPAIR] and existing works on 3D point cloud object generation requiring centred points [shapenet], we propose in this paper a VAE-based model called Spatially Invariant Attend, Infer, Repeat in 3D (SPAIR3D), a model that generates spatial mixture distributions on point clouds to discover and segment 3D objects from a point cloud of static scenes.
Here, in summary, are the key contributions of this paper:

We propose, to the best of our knowledge, the first generative-model-based pipeline, named SPAIR3D, capable of performing unsupervised 3D point cloud segmentation on static scenes.

We also propose a new Chamfer Likelihood function tailored for learning mixture probability distributions over point cloud data, together with a novel graph neural network called the Point Graph Flow network (PGF) that can model and generate a variable number of 3D points for segmentation across diverse scenes.

We provide qualitative and quantitative results that show SPAIR3D can perform object segmentation on point clouds in an unsupervised manner using only spatial structure information across diverse scenes with an arbitrary number of objects.
2 Related Work
Unsupervised Generative Model-based 2D Object Segmentation. Unsupervised object segmentation approaches have attracted increasing attention recently. A major focus of these methods is on joint object representation learning and segmentation from images and videos via generative models [MONET; IODINE; GENESIS; SPACE; SPAIR]. In particular, spatial Gaussian mixture models are commonly adopted to model pixel colors, and the unsupervised object segmentation problem is then framed as a generative latent-variable modelling task. For example, IODINE [IODINE] employs amortized inference that iteratively updates the latent representation of each object and refines the reconstruction. GENESIS [GENESIS] and MONET [MONET] sequentially decode each object, while Neural Expectation Maximization (NEM) [NEM] realizes Expectation Maximization algorithms with RNNs. Such iterative processes allow IODINE and NEM to handle dynamic scenes where viewpoints or objects may change slowly. Instead of treating each component of the mixture model as a full-scale observation in images, Attend, Infer, Repeat (AIR) [AIR] confines the extent of each object to a local region. To improve the scalability of AIR, SPAIR [SPAIR] employs a grid spatial attention mechanism to propose objects locally, which has proven effective in object-tracking tasks [SPAIROT]. To achieve a complete scene segmentation, SPACE [SPACE] includes MONET in its framework for background modeling.
While the works discussed above achieved promising results for object representation and segmentation from 2D images, they cannot be directly applied to 3D point clouds due to the lack of input-prediction correspondences. A spatial attention model has been employed to reconstruct 3D scenes, in the form of meshes or voxels, in an object-centric fashion from a sequence of RGB frames [ObjectCentricVideoGeneration]. This method similarly relies heavily on appearance and motion cues. In contrast, our work aims to segment 3D objects from point clouds of static scenes. Inspired by the aforementioned works employing the VAE [VAE; betaVAE] structure to define objectness by compressing image regions of high correlation, we explore in this paper correlations beyond appearance using point clouds with VAEs.
Graph Neural Networks for Point Cloud Generation. Generative models such as VAEs [gadelha2018multiresolution] and generative adversarial networks (GANs) [achlioptas2018learning] have been successfully used for point-cloud generation. However, these generative models are constrained to generate a predefined fixed number of points, limiting their applications. Yang et al. [PointFlow] proposed a VAE and normalizing-flow-based approach that models object shapes as continuous distributions. While the proposed approach allows the generation of a variable number of points, it could not be integrated into our framework naturally due to its requirement of solving an ODE. We therefore propose to build a graph neural network specialised for our VAE-based framework as a decoder to generate an arbitrary number of points at run time.
3 SPAIR3D
SPAIR3D is a VAE-based generative model for 3D object segmentation via object-centric point cloud generation. It takes a point cloud as input and generates a structured latent representation for foreground objects and the scene layout. In the following, we first describe the object-centric latent representation. We then leverage variational inference to jointly learn the generative model (§3.2) and inference model (§3.4). We also discuss the particular challenges arising for generative models in handling a varying number of points, which we address with a novel Chamfer Likelihood (§3.3) and Point Graph Flow (§3.4).
3.1 Object-centric Latent Representation
As shown in Fig. 1(a), SPAIR3D first decomposes the 3D scene uniformly into a voxel grid named the spatial attention voxel grid. Due to the irregular distribution of the point cloud in 3D, there can be empty voxel cells containing no points. After discarding those empty cells, we associate an object proposal with each non-empty spatial attention voxel cell, specified in the form of a bounding box. The set of input points captured by a bounding box is termed an object glimpse. Besides object glimpses, SPAIR3D also defines a scene glimpse covering every point in the input scene. Later, we show that we encode and decode (reconstruct) the points covered by each glimpse and generate a mixing weight for each point to form a probability mixture model over the scene. The mixing weights for each point naturally define the segmentation of the point cloud.
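The grid construction above can be sketched as follows; `voxelize` and `voxel_size` are illustrative names rather than the paper's implementation, but the logic (bucket points by cell, keep only non-empty cells) matches the description:

```python
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float):
    """Assign each 3D point to a voxel cell; return only non-empty cells.

    points: (N, 3) array. Returns a dict mapping integer cell
    coordinates -> indices of the points inside the cell. Empty cells
    never appear in the dict, mirroring the paper's discarding of
    empty spatial attention voxel cells.
    """
    cell_ids = np.floor(points / voxel_size).astype(int)
    cells = {}
    for idx, cid in enumerate(map(tuple, cell_ids)):
        cells.setdefault(cid, []).append(idx)
    return cells

# Toy scene: two clusters far apart yield two non-empty cells;
# all other cells of the grid are simply absent.
pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.1, 0.1], [5.1, 5.2, 5.0]])
cells = voxelize(pts, voxel_size=1.0)
assert (0, 0, 0) in cells and (5, 5, 5) in cells
assert len(cells[(0, 0, 0)]) == 2
```

Each non-empty cell would then spawn one object proposal (bounding box) in the full model.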
3.2 Generative Model
Similar to SPAIR, each grid cell generates posterior distributions over a set of latent variables (z^pos, z^apothem), where z^pos encodes the relative position of the center of the object proposal with respect to the center of the cell, and z^apothem encodes the apothem of the bounding box. Thus, each (z^pos, z^apothem) induces one object glimpse associated with the cell. Each object proposal is then associated with posterior distributions over a set of latent variables (z^what, z^mask, z^pres), where z^what encodes the structure information of the corresponding object glimpse, z^mask encodes the mask for each point in the glimpse, and z^pres is a binary variable indicating whether the proposed object should exist (z^pres = 1) or not (z^pres = 0). Different from the object glimpses, the scene glimpse is defined by only one latent variable z^scene. The mixing weight of the scene glimpse, together with those of the object glimpses, completes the mixture distribution of the point cloud. We assume z^pres follows a Bernoulli distribution. The posteriors and priors of all other latent variables are set to isotropic Gaussian distributions (see the supplementary material Sec. A for details).
Given the latent representations of objects and the scene, the complete likelihood for a point cloud is formulated by marginalising the conditional likelihood over all latent variables, with the conditional likelihood given by the Chamfer Likelihood defined below. As maximising this objective directly is intractable, we resort to variational inference and maximise its evidence lower bound (ELBO).
3.3 Chamfer Likelihood
To maximize the ELBO, we maximize the reconstruction accuracy of the point cloud. Unlike generative model-based unsupervised 2D segmentation methods that reconstruct the pixel-wise appearance conditioned on spatial coordinates, the reconstruction of a point cloud loses its point-wise correspondence to the original point cloud. To measure reconstruction quality, the Chamfer distance is commonly adopted to quantify the discrepancy between a generated point cloud $Y$ and the input point cloud $X$. Formally, the Chamfer distance is defined by
$\mathrm{CD}(X, Y) = \sum_{x \in X} \min_{y \in Y} \|x - y\|_2^2 + \sum_{y \in Y} \min_{x \in X} \|x - y\|_2^2$.
We refer to the first and the second term on the r.h.s. as the forward loss and the backward loss respectively.
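The Chamfer distance is a standard quantity and can be computed directly; a minimal NumPy sketch:

```python
import numpy as np

def chamfer_distance(X: np.ndarray, Y: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets X (N,3) and Y (M,3).

    forward term:  sum over x in X of min_y ||x - y||^2  (input -> prediction)
    backward term: sum over y in Y of min_x ||x - y||^2  (prediction -> input)
    """
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise squared distances
    forward = d2.min(axis=1).sum()
    backward = d2.min(axis=0).sum()
    return float(forward + backward)

X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
assert chamfer_distance(X, X) == 0.0   # identical clouds
assert chamfer_distance(X, X + 0.1) > 0.0
```

Note the O(NM) pairwise matrix; practical implementations use nearest-neighbour structures for large clouds.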
Unfortunately, the Chamfer distance does not fit into the variationalinference framework. To get around that, we propose a Chamfer Likelihood function tailored for training probability mixture models defined on point clouds. The Chamfer Likelihood has a forward likelihood and a backward likelihood corresponding to the forward and backward loss respectively, which we describe next.
Denote the i-th glimpse as G_i and its reconstruction as Ĝ_i. Specifically, we treat the scene glimpse G_0 as the glimpse that contains all input points. Note that one input point can be a member of multiple glimpses. Below we use N(x; y, σ²) to denote the probability density of point x evaluated under a Gaussian distribution with mean y and variance σ². For each input point x in the glimpse G_i, the glimpse-wise forward likelihood of that point is defined as the normalized sum of N(x; y, σ²) over the predicted points y in Ĝ_i, where the normalizer makes the sum a proper density and σ is a hyperparameter. For each glimpse G_i, a mixing weight π_i(x) is defined for each point x in the glimpse, with the weights summing to one over glimpses. In particular, for an object glimpse, π_i(x) is determined by z^pres and the mask latent z^mask through the mask decoder network (see §3.4) followed by the Sigmoid function. The mixing weights of the scene layout points complete the distribution. Thus, the final mixture model for an input point x is the π-weighted sum of the glimpse-wise forward likelihoods, and the total forward likelihood of the scene is the product of this mixture over all input points. For each predicted point y in Ĝ_i, the point-wise backward likelihood is defined as the glimpse-wise likelihood of y with respect to the input points of G_i, weighted exponentially by the corresponding mixing weights. This exponential weighting is crucial: as each predicted point belongs to one and only one glimpse, it is difficult to impose a mixture-model interpretation on the backward likelihood. The exponential weighting instead encourages the generated points of an object glimpse to be close to the input points that have a high probability of belonging to that glimpse. The backward likelihood provides the necessary regularization to enforce a reconstruction of high quality. Combining the forward and backward likelihoods together defines the Chamfer Likelihood. After inference, the segmentation label for each point is naturally obtained as the glimpse with the largest mixing weight for that point.
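A heavily simplified sketch of the forward term may help fix ideas. It assumes the glimpse-wise likelihood is the normalized average of Gaussian kernels over predicted points; the paper's exact normalizer and weighting differ, and `SIGMA` plus all names here are illustrative:

```python
import numpy as np

SIGMA = 0.05  # hypothetical bandwidth hyperparameter

def gaussian_pdf(x, mu, sigma=SIGMA):
    """Isotropic Gaussian density of point x evaluated at mean(s) mu."""
    d = mu.shape[-1]
    sq = ((x - mu) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2)) / ((2 * np.pi) ** (d / 2) * sigma ** d)

def forward_likelihood(x, glimpse_preds, mix_weights):
    """Mixture forward likelihood of one input point x.

    glimpse_preds: list of (M_i, 3) predicted clouds, one per glimpse.
    mix_weights: per-glimpse mixing weights pi_i(x), summing to 1.
    Each glimpse contributes the average Gaussian kernel over its
    predicted points (a stand-in for the paper's normalized sum).
    """
    per_glimpse = [gaussian_pdf(x, Y).mean() for Y in glimpse_preds]
    return float(np.dot(mix_weights, per_glimpse))
```

As expected, the likelihood of a point is higher when some glimpse reconstructs points near it.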
The evidence lower bound is
$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x \mid z)] - D_{\mathrm{KL}}(q(z \mid x) \,\|\, p(z))$,
where $D_{\mathrm{KL}}$ is the KL divergence between the posterior and the prior of the latent variables (supp. material Sec. A for details).
In general, the normalizer of the glimpse-wise forward likelihood does not have a closed-form solution. Thus, instead of optimising it directly, we optimise a lower bound of it. To improve the stability of the early training process, where this bound exhibits high variance due to inaccurate object proposals, we absorb the normalizer term into the forward likelihood and compute the combined quantity instead.
3.4 Model Structure
We next introduce the encoder and decoder network structures of SPAIR3D. The building blocks are based on graph neural networks and point convolution operations (see supp. Sec. C for details).
Encoder network. We design an encoder network to obtain the latent representations from a point cloud: z^pos and z^apothem encode information from points in grid cells, while z^what, z^mask, and z^pres encode information for points from object glimpses. To achieve the spatially invariant property, we group one PointConv [PointConv] layer and one PointGNN [PointGNN] layer into pairs for message passing and information aggregation among points and between cells. We now provide details on how we use the encoder for voxel grids and glimpses to learn latent representations.
(a) Voxel Grid Encoding. The voxel-grid encoder takes a point cloud as input and generates for each spatial attention voxel cell two latent variables, z^pos and z^apothem, to propose a glimpse potentially occupied by an object.
To better capture the point cloud information within each cell, we build a voxel pyramid with the bottom level corresponding to the finest voxel grid. We aggregate information hierarchically using PointConv-PointGNN pairs from bottom to top through each level of the pyramid for each cell. For each layer of the pyramid, we aggregate the features of all points within a voxel cell and assign them to the centre of mass of those points. PointGNN is then employed to perform message passing on the radius graph built over all new points. The output of the final aggregation block produces z^pos and z^apothem via the reparameterization trick [VAE].
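The reparameterization trick used here is the standard VAE one; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu: np.ndarray, log_sigma: np.ndarray) -> np.ndarray:
    """Draw z ~ N(mu, sigma^2) as mu + sigma * eps with eps ~ N(0, I),
    so gradients can flow through mu and log_sigma during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

z = reparameterize(np.zeros(8), np.full(8, -20.0))  # near-zero std
assert np.allclose(z, 0.0, atol=1e-6)               # samples collapse to the mean
```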
We obtain the offset of a glimpse center from its corresponding grid cell center by squashing and scaling the corresponding latent, with the scale set by the maximum offset distance. The apothem of the glimpse in each direction is likewise given by passing its latent through the sigmoid function and scaling the result into a predefined apothem range.
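Since the exact squashing formulas were elided above, the following sketch assumes a tanh squashing for the offset and the stated sigmoid for the apothem; `MAX_OFFSET` and `APOTHEM_RANGE` are hypothetical constants:

```python
import numpy as np

MAX_OFFSET = 0.5            # hypothetical maximum offset distance
APOTHEM_RANGE = (0.2, 1.0)  # hypothetical apothem range

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_where(z_offset, z_apothem):
    """Map unconstrained latents to a bounded glimpse box.

    Squashing keeps the center offset within MAX_OFFSET of the cell
    center and the apothem inside APOTHEM_RANGE, so every proposal is
    a valid local bounding box.
    """
    offset = MAX_OFFSET * np.tanh(z_offset)        # in (-MAX_OFFSET, MAX_OFFSET)
    lo, hi = APOTHEM_RANGE
    apothem = lo + (hi - lo) * sigmoid(z_apothem)  # in (lo, hi)
    return offset, apothem

off, apo = decode_where(np.array([10.0, -10.0, 0.0]), np.zeros(3))
assert np.all(np.abs(off) <= MAX_OFFSET)
assert np.all((apo > 0.2) & (apo < 1.0))
```

Bounding the box parameters this way keeps every proposal local to its cell, which is what makes the attention mechanism spatially invariant.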
(b) Glimpse Encoding. Given the predicted glimpse-centre offset and the apothems, we can associate one glimpse with each spatial attention voxel cell. We adopt the same encoder structure to encode each glimpse into one feature point (c, f), where c is the glimpse center coordinate and f is the glimpse feature vector. We then generate z^what and z^mask from f via the reparameterization trick. The generation of z^pres determines the glimpse rejection process and is crucial to the final segmentation quality. Unlike previous work [SPAIR; SPACE], SPAIR3D generates z^pres from glimpse features instead of cell features, based on our observation that message passing across glimpses benefits the glimpse-rejection process. To this end, a radius graph is first built over the glimpse centers to connect nearby glimpses, followed by multiple PointGNN layers with decreasing output channels performing local message passing. The z^pres of each glimpse is then obtained via the reparameterization trick. Information exchange between nearby glimpses helps avoid the over-segmentation that would otherwise occur because of the high dimensionality of point cloud data.
(c) Global Encoding. The global encoding module adopts the same encoder as the glimpse encoding to encode all points in the scene, which are treated as a single scene glimpse. The learned latent representation is z^scene.
Decoder network. We now introduce the decoders that are used for pointcloud reconstruction and assignment of mask value to each input point.
(a) Point Graph Flow. Given the z^what of each glimpse, the decoder is used for point cloud reconstruction as well as segmentation-mask generation. Most existing decoders and point cloud generation frameworks can only generate a predefined, fixed number of points, which can lead to under- or over-segmentation because the number of generated points has a direct effect on the relative magnitudes of the forward and backward terms in the Chamfer Likelihood.
To balance the forward and backward likelihoods, the number of predicted points for each glimpse must be approximately the same as the number of input points. Inspired by PointFlow [PointFlow], which allows the sampling of an arbitrary number of points forming a point cloud within a normalizing-flow framework, we propose a Point Graph Flow (PGF) network that generates a variable number of points at run time. The input to the PGF is a set of 3D points with coordinates sampled from a zero-centered Gaussian distribution, with the number of samples determined by the number of points in the current glimpse. The features of the input points are all set to the latent variable z^what. The PGF is composed of several PointGNN layers, each of which is preceded by a radius graph operation. The output of each PointGNN layer is split into updated features and, in its last 3 dimensions, the updated 3D coordinates of the estimated points. Since we only focus on point coordinate prediction, the last PointGNN layer outputs coordinates only. Unlike PointFlow, the PGF has a simple structure and does not require an external ODE solver.
(b) Mask Decoder. The Mask Decoder decodes z^mask to the mask value of each point within a glimpse. The decoding process follows the exact inverse of the pyramid structure of the Glimpse Encoder. More precisely, the mask decoder can access the spatial coordinates of the intermediate aggregation points of the Glimpse Encoder as well as the input point coordinates. During decoding, PointConv is used as a deconvolution operation.
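The PGF decode loop can be sketched as follows; the fixed mean-of-neighbors update here stands in for the learned PointGNN layers and is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def point_graph_flow(z_what, n_points, n_layers=3, radius=0.5):
    """Sketch of a Point Graph Flow style decoder.

    Starts from n_points coordinates sampled from a zero-centered
    Gaussian, with every point's feature set to z_what, then applies
    n_layers of radius-graph message passing; each layer produces
    updated features plus a 3D coordinate update. The real model uses
    learned PointGNN layers; a mean-of-neighbors update stands in here.
    """
    coords = rng.standard_normal((n_points, 3)) * 0.1
    feats = np.tile(z_what, (n_points, 1))
    for _ in range(n_layers):
        d2 = ((coords[:, None] - coords[None, :]) ** 2).sum(-1)
        adj = d2 < radius ** 2                 # radius graph (self-loops included)
        deg = adj.sum(1, keepdims=True)
        feats = adj @ feats / deg              # aggregate neighbor features
        coords = coords + 0.1 * (adj @ coords / deg - coords)  # coordinate update
    return coords  # the last layer keeps only coordinates

pts = point_graph_flow(np.zeros(16), n_points=200)
assert pts.shape == (200, 3)  # arbitrary point count chosen at run time
```

The key property illustrated is that `n_points` is a free run-time argument, so the same decoder can match the size of any glimpse.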
Glimpse VAE and Global VAE. The complete Glimpse VAE structure is presented in Fig. 1(b). The Glimpse VAE is composed of a Glimpse Encoder, a Point Graph Flow, a Mask Decoder, and a multi-layer PointGNN network. The Glimpse Encoder takes all glimpses as input and encodes each glimpse individually and in parallel into feature points. Via the reparameterization trick, z^what and z^mask are then obtained for each glimpse. From there, we use the Point Graph Flow to decode z^what to reconstruct the input points, and we use the Mask Decoder to decode z^mask to assign a mask value to each input point within the glimpse. Finally, z^pres is generated using message passing among neighbouring glimpses. The processing of all glimpses happens in parallel. The Global VAE, consisting of the Global Encoder and a PGF-based decoder, outputs the reconstructed scene layout.
3.5 Soft Boundary
The prior of z^apothem is set to encourage the apothem to shrink so that glimpses do not grow overly large. However, if points are excluded from a glimpse, the gradient from the likelihood of the excluded points no longer influences the size and location of that glimpse, which can lead to over-segmentation. To solve this problem, we introduce a soft boundary weight which decreases as a point moves away from the bounding box of the glimpse. Taking this weight into the computation of the mixing weights yields updated mixing weights. With this boundary weighting, the gradual exclusion of points from glimpses is reflected in the gradients, countering over-segmentation. Details can be found in supp. Sec. B.
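One plausible form of such a weight uses an exponential decay in the distance by which a point exceeds the box; the paper's exact formula is in supp. Sec. B, and `TAU` is a hypothetical temperature:

```python
import numpy as np

TAU = 0.1  # hypothetical decay temperature

def soft_boundary_weight(point, box_min, box_max, tau=TAU):
    """Weight equal to 1 inside the glimpse box, decaying smoothly with
    the distance by which the point exceeds the box, so excluded points
    still send gradients to the glimpse location and size."""
    outside = np.maximum(np.maximum(box_min - point, point - box_max), 0.0)
    return float(np.exp(-np.linalg.norm(outside) / tau))

lo, hi = np.zeros(3), np.ones(3)
assert soft_boundary_weight(np.array([0.5, 0.5, 0.5]), lo, hi) == 1.0  # inside
assert soft_boundary_weight(np.array([1.5, 0.5, 0.5]), lo, hi) < 0.01  # well outside
```

Multiplying the mixing weight by this factor keeps the likelihood of just-excluded points differentiable with respect to the box parameters.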
4 Experiments
To evaluate the performance of SPAIR3D, we first introduce two new point cloud datasets, Unity Object Room and Unity Object Table, built on the Unity platform [Unity].
4.1 Datasets
The Unity Object Room (UOR) dataset is built as a benchmark for the evaluation of unsupervised 3D object-segmentation models. Specifically, each scene in the UOR dataset consists of multiple objects sitting on a square floor. Objects are randomly sampled from a list of regular geometries. The Unity Object Table (UOT) dataset approximates a robotic object-grasping scenario where multiple objects are placed on a round table. Instead of using objects of simple geometries, we create each scene of the UOT dataset with objects selected from a pool of irregular objects such as a toy car or a teddy bear. For both datasets, the number of objects placed in each scene varies uniformly within a fixed range. During scene generation, the size and orientation of the objects are varied randomly within a predefined range. We also randomly assign each object mesh a color.
We capture depth, RGB, and normal frames, as well as pixel-wise semantic and instance labels, for each scene from 10 different viewpoints for both datasets. This setup approximates the scenario where a robot equipped with depth and RGB sensors navigates around target objects and captures data. The point cloud of each scene is then constructed by merging these 10 depth maps. For each dataset, we collect separate training, validation, and testing scenes. In-depth dataset specifications and analysis can be found in supplementary material Sec. D.
4.2 Unsupervised Segmentation
Baseline. Due to the sparse literature on unsupervised 3D point cloud segmentation, we could not find a generative baseline to compare with. Thus, we compare SPAIR3D with PointGroup [PointGroup], a recent supervised 3D point cloud segmentation model. PointGroup performs semantic and instance predictions from point cloud and RGB data using a model trained with ground-truth semantic and instance labels. To ensure a fair comparison, we assign each point the same color (white) so that appearance information does not play a role. The PointGroup network is fine-tuned on the validation set to achieve its best performance.
Performance Metrics. We use the Adjusted Rand Index (ARI) [ARI] to measure segmentation performance against the ground-truth instance labels. Since ARI does not penalize models for object over-segmentation [GENESIS], we also employ foreground Segmentation Covering (SC) and foreground unweighted mean Segmentation Covering (mSC) [GENESIS].
Table 1. Segmentation performance of PG [PointGroup] and Ours on the UOR and UOT datasets, reported as ARI, SC, mSC, and forward/backward Chamfer distance (CD-F/CD-B; N/A for PG).
Evaluation. Table 1 shows that SPAIR3D achieves performance comparable to the baseline, a supervised method, on both the UOT and UOR datasets. As demonstrated in Fig. 2, each foreground object is proposed by one and only one glimpse. The scene layout is separated from the objects and accurately modelled by the Global VAE. It is worth noting that segmentation errors mainly happen at the bottoms of objects. Without appearance information, points at the bottom of an object can also be correlated with the ground. Since the prior of z^apothem encourages glimpses to shrink, those points are likely segmented as part of the scene layout instead of the objects.
In Fig. 3, we show the performance distributions on the test set. To draw the distribution curve, we sort the test data in ascending order of performance. As expected, the supervised baseline (Orange) performs better but SPAIR3D manages to achieve highquality segmentation (SC score ) on around of the scenes without any supervision.
More segmentation results including failure cases can be found in the supp. E.
4.3 Voxel Size Robustness and Scalability
In the literature [SPACE; SPAIR], the voxel cell size, an important hyperparameter, is chosen to match the object size in the scene. To evaluate the robustness of our method w.r.t. voxel size, we train our model on the UOR dataset with two different voxel sizes. SPAIR3D achieves on average SC: 0.853/0.857 and mSC: 0.850/0.861 with the two voxel sizes respectively, demonstrating the robustness of our method.
The spatial invariance property of SPAIR3D allows the generalization of our method to scenes with an arbitrary number of objects, provided that object structures are captured intact. To demonstrate this, we evaluate our pretrained model directly on scenes containing randomly selected objects. SPAIR3D achieves on average ARI:, SC: , and mSC: across the UOR/UOT datasets.
It is worth noting that the UOR and UOT datasets include a substantial number of objects, both in scenes of 2 to 5 objects and in scenes of 6 to 12 objects, that are close to their neighbouring ones with touching or almost-touching surfaces. The reported quantitative results (Table 1) and qualitative results (Fig. 4(a)-(d)) demonstrate that our method achieves stable performance on those challenging scenes.
We also evaluate our approach on scenes termed Object Matrix, which consist of objects arranged in a matrix. We fix the positions of all objects but change their sizes and rotations randomly. For each dataset, SPAIR3D is evaluated on Object Matrix scenes, where it achieves average ARI, SC, and mSC scores on par with its performance on the standard test scenes. Note that our model is trained on scenes containing less than one-third of the number of objects in the Object Matrix scenes. Fig. 4(e)-(f) is illustrative of the results.
4.4 Ablation Study of Multilayer PointGNN
To evaluate the importance of the multi-layer PointGNN in z^pres generation (right branch in Fig. 1(b)), we remove the multi-layer PointGNN and generate z^pres directly from the glimpse features. The ablated model on the UOR dataset achieves ARI, SC, and mSC scores significantly worse than the full SPAIR3D model. The performance distribution of the ablated SPAIR3D (Fig. 3, first row) indicates that removing the multi-layer PointGNN has a broad negative influence across the entire dataset. Fig. 5 shows that the multi-layer PointGNN is crucial to preventing over-segmentation.
4.5 Empirical Evaluation of PGF
3D objects of the same category can be modeled by a varying number of points. The generation quality of the point cloud largely depends on the robustness of our model to the number of points representing each object. To demonstrate that the PGF can reconstruct each object with a dynamic number of points, we train the Global VAE on the ShapeNet dataset [shapenet] and reconstruct each object with a varying number of points: for a reference input point cloud of a given size, we force the PGF to reconstruct point clouds of several different sizes. The reconstruction results are shown in Fig. 6. While the reconstructions have less detail than the input point cloud, the PGF captures the overall object structure in all settings.
5 Conclusion and Future Work
We propose, to the best of our knowledge, the first generative model for unsupervised 3D point cloud object segmentation, called SPAIR3D, which models a scene with a 3D spatial mixture model that exploits spatial structure correlations within each object. SPAIR3D achieves unsupervised point cloud segmentation by generation. Our model is evaluated on the two newly proposed UOR and UOT datasets. We further demonstrate that SPAIR3D generalises well to previously unseen scenes with a large number of objects without performance degradation. Our model may tend to over-segment objects of a significantly different scale from those in the training dataset; addressing this is left as future work. Moreover, the spatial mixture interpretation of SPAIR3D allows the integration of memory mechanisms [VMA] or iterative refinement [IODINE] to handle object segmentation in video sequences, which are key directions for our future work.