MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion
Robots and other smart devices need efficient object-based scene representations from their on-board vision systems to reason about contact, physics and occlusion. Recognized precise object models will play an important role alongside non-parametric reconstructions of unrecognized structures. We present a system which can estimate the accurate poses of multiple known objects in contact and occlusion from real-time, embodied multi-view vision. Our approach makes 3D object pose proposals from single RGB-D views, accumulates pose estimates and non-parametric occupancy information from multiple views as the camera moves, and performs joint optimization to estimate consistent, non-intersecting poses for multiple objects in contact. We verify the accuracy and robustness of our approach experimentally on two object datasets: YCB-Video and our own challenging Cluttered YCB-Video. We demonstrate a real-time robotics application in which a robot arm precisely and in an ordered manner disassembles complicated piles of objects, using only on-board RGB-D vision.
Robots and other smart devices that aim to perform complex tasks such as precise manipulation in cluttered environments need to capture information from their cameras that enables reasoning about contact, physics and occlusion among objects. While it has been shown that some short-term tasks can be accomplished using end-to-end learned models that connect sensing to action, we believe that extended and multi-stage tasks can greatly benefit from persistent 3D scene representations.
Even when the object elements of a scene have known models, inferring the configurations of many objects that are mutually occluding and in contact is challenging even with state-of-the-art detectors. In this paper we present a vision system that tackles this problem, producing a persistent 3D multi-object representation in real-time from the multi-view images of a single moving RGB-D camera. Our system has four main components, as highlighted in Figure 2: 1) 2D object detections are fed to object-level fusion to build a volumetric occupancy map for each object. 2) A pose prediction network that uses RGB-D data and the surrounding occupancy grid makes 3D object pose estimates. 3) Collision-based pose refinement jointly optimizes the poses of multiple objects with differentiable collision checking. 4) The intermediate volumetric representations of objects are replaced with information-rich CAD models.
Our system takes full advantage of depth information and multiple views to estimate mutually consistent object poses. The initial rough volumetric reconstruction is upgraded to a precise CAD model fit once a model can be confidently aligned without intersecting other objects. This visual capability to infer the poses of multiple objects under occlusion and contact enables robotic planning for pick-and-place in a cluttered scene, e.g. removing obstacle objects to pick the target red box in Figure 1.
In summary, the main contributions of this paper are:
Pose prediction with surrounding spatial awareness, in which the prediction network receives the surrounding occupancy grid as impenetrable space around the object;
Joint optimization of multi-object poses, in which the scene configuration with multiple objects is evaluated and updated with differentiable collision checking;
Full integration of fusion and 6D pose as a real-time system, in which the object-level volumetric map is exploited for incremental and accurate pose estimation.
Template-based methods [15, 26, 12, 11, 13, 24] are one of the earliest approaches to pose estimation. Traditionally, these methods involve generating templates by collecting images of the object (or a 3D model) from varying viewpoints in an offline training stage, and then scanning the template across an image to find the best match using a distance measure. These methods are sensitive to clutter, occlusions, and lighting conditions, leading to a number of false positives, which in turn requires more post-processing. Sparse feature-based methods have been a popular alternative to template-based methods for a number of years [16, 21, 22]. These methods extract scale-invariant points of interest from images, describe them with local descriptors such as SIFT or SURF, and store them in a database to be matched at test time to obtain a pose estimate using a method such as RANSAC. This processing pipeline can be seen in manipulation tasks, such as MOPED. With the increase in affordable RGB-D cameras, dense methods have become increasingly popular for object and pose recognition [7, 25, 3]. These methods involve constructing a 3D point cloud of a target object, and then matching it with a stored model using popular algorithms such as Iterative Closest Point (ICP).
The use of deep neural networks is now prevalent in the field of 6D pose estimation. PoseCNN was one of the first works to train an end-to-end system to predict initial 6D object poses directly from RGB images, which are then refined using depth-based ICP. Recent RGB-D-based systems are PointFusion and DenseFusion, which individually process the two sensor modalities (CNNs for RGB, PointNet for point clouds) and then fuse them to extract pixel-wise dense feature embeddings. Our work is most closely related to these RGB-D and learning-based approaches with deep neural networks. In contrast to the point-cloud-based and target-object-focused approach of the prior work, we process the geometry using a more structured volumetric representation that includes the geometry surrounding the target object.
Our system estimates the 6D pose of a set of known objects given RGB-D images of a cluttered scene. We represent a 6D pose as a homogeneous transformation matrix $T \in SE(3)$, and denote it as $T = (R, t)$, where $R \in SO(3)$ is the rotation and $t \in \mathbb{R}^3$ is the translation.
Our system, summarized in Figure 2, can be divided into four key stages. (1) An object-level volumetric fusion stage, which combines the object instance masks produced by object detection with depth measurements and a camera tracking component to produce a volumetric map of known and unknown objects. (2) A volumetric pose prediction stage, which uses the surrounding information from the volumetric map along with the masked RGB-D data to produce an initial pose prediction for each object. (3) A collision-based pose refinement stage, which jointly optimizes the poses of multiple objects via gradient descent, using differentiable collision checking between object CAD models and occupied space from the volumetric map. (4) A CAD alignment stage, which replaces the intermediate representation of each object with a CAD model containing compact and rich information. In the following sections, we expand further on each of these stages.
Building a volumetric map is the first stage of our pose estimation system; it allows the system to gradually accumulate knowledge about the scene until it is confident about the object poses. For this object-level volumetric fusion stage, we build a pipeline similar to [18, 27, 30], combining RGB-D camera tracking, object detection, and volumetric mapping of detected objects.
Given that the camera is mounted on the end of a robotic arm, we are able to retrieve the accurate pose of the cameras using forward kinematics and a well-calibrated camera. However, to also allow this to be used with a hand-held camera, we adopt the sparse SLAM framework ORB-SLAM2  for camera tracking. Unlike its monocular predecessor , ORB-SLAM2 tracks camera pose in metric space, which is crucial for the volumetric mapping.
We use octree-based occupancy mapping, OctoMap, for the volumetric mapping. The octree structure lets OctoMap quickly retrieve the voxel for a queried point, which is critical both for updating occupancy values from depth measurements and for checking occupancy values when used in the later (pose prediction and refinement) components of the pipeline.
We build this volumetric map for each detected object, including unknown (background) objects. To track objects that are already initialized, we use the intersection over union of the detected mask in the current frame and the rendered mask of the current reconstruction, following prior work [18, 30]. For objects that are already initialized, we fuse new depth measurements into the volumetric map of that object, and a new volumetric map is initialized whenever a new object is found as the camera moves. This object-level reconstruction enables the use of the volumetric representation of objects as an intermediate representation before model alignment by pose estimation.
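The mask association described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function names and the IoU acceptance threshold of 0.4 are assumptions, and masks are represented as sets of pixel coordinates for simplicity.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as sets of pixels."""
    inter = len(mask_a & mask_b)
    union = len(mask_a | mask_b)
    return inter / union if union else 0.0

def associate_detections(detections, rendered, iou_threshold=0.4):
    """Match each detected mask to an existing object's rendered mask.

    detections: list of pixel sets from the detector in the current frame.
    rendered:   dict mapping object id -> pixel set rendered from the
                current volumetric reconstruction.
    Returns (matches, new_objects): matches maps detection index to an
    existing object id; unmatched detections spawn new volumetric maps.
    """
    matches, new_objects = {}, []
    for d_idx, d_mask in enumerate(detections):
        best_id, best_iou = None, iou_threshold
        for obj_id, r_mask in rendered.items():
            iou = mask_iou(d_mask, r_mask)
            if iou > best_iou:
                best_id, best_iou = obj_id, iou
        if best_id is None:
            new_objects.append(d_idx)
        else:
            matches[d_idx] = best_id
    return matches, new_objects
```

Depth measurements for matched detections are then fused into the corresponding object's map, while unmatched detections initialize new maps.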
Our system retrieves surrounding information from the volumetric map to incorporate spatial awareness of the area surrounding a target object into pose prediction. In this section, we describe how this surrounding information is represented and used in pose prediction.
Each target object for pose prediction carries its own volumetric occupancy grid. The voxels that make up this grid can be in one of the following states: (1) space occupied by the object itself, from the target-object reconstruction; (2) space occupied by other objects, from the reconstruction of the surrounding objects; (3) free space, identified by depth measurements; (4) unknown space, unobserved by mapping because of occlusion and sensor range limits (Figure 3).
Ideally, the bounding box of surrounding information should cover the whole area of the target object even if it is occluded. This means the bounding box size should change depending on the target object size. Since we need to use a fixed voxel dimension for network prediction (32 in our implementation), we use a different voxel size for each object, computed from the object model size (the diagonal of the bounding box divided by the voxel dimension).
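The per-object voxel size computation above amounts to a one-line formula; a minimal sketch, with the function name chosen here for illustration:

```python
import math

VOXEL_DIM = 32  # fixed grid dimension expected by the network

def voxel_size_for_object(bbox_extents):
    """Per-object voxel size: bounding-box diagonal / grid dimension.

    bbox_extents: (dx, dy, dz) extents of the object model's bounding box
    in metres. A larger object thus gets a coarser grid, so every object
    fits the same fixed-dimension grid.
    """
    diagonal = math.sqrt(sum(e * e for e in bbox_extents))
    return diagonal / VOXEL_DIM
```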
The pose prediction network takes as input the surrounding occupancy grid and masked RGB-D images. The architecture can be categorized into 4 core components: (1) 2D feature extraction from RGB with a ResNet; (2) point-wise encoding of RGB features and the point cloud; (3) voxelization of the point-wise features followed by 3D CNNs; (4) point-wise pose prediction from both 2D and 3D features.
Even when depth measurements are available, RGB images still carry vital sensor information for precise pose prediction. Because of the color and texture detail in RGB images, they can be an especially strong signal for pose prediction of highly-textured and asymmetric objects.
Following [28, 31], we use ResNet18 with succeeding upsampling layers to extract RGB features from masked images. Though both prior methods [28, 31] used images of objects cropped with a bounding box, we use masked images, which makes the network invariant to changes in background appearance and also encourages it to retrieve surrounding information from the occupancy grid.
Similarly to , both the RGB features and extracted point-cloud points (using the target object mask) are encoded via several fully connected layers to produce point-wise features, which are then concatenated.
From these point-wise features we build a feature grid (with the same dimensions as the occupancy grid), which will be combined with the occupancy grid extracted from the volumetric fusion. The concatenated voxel grid is processed by 3D-CNNs to extract hierarchical 3D features reducing voxel dimension and increasing the channel size. We process the original grid (voxel dimension: 32) with 2-strided convolutions to have hierarchical features (voxel dimension: 16, 8).
An important design choice in this pipeline is to perform 2D feature extraction before voxelization, instead of directly applying 3D feature extraction to a voxel grid of raw RGB pixel values. Though 3D CNNs and 2D CNNs behave similarly when processing RGB-D input, it is hard to apply a 3D CNN to a grid as high-resolution as a 2D image, and the voxelized grid can have more missing points than an RGB image because of sensor noise in the depth image.
To combine the 2D and 3D features for pose prediction, we extract points from the 3D feature grid that correspond to the point-wise 2D features using trilinear interpolation. These 3D and 2D features are concatenated as point-wise feature vectors for the pose prediction, from which we predict both the pose and a confidence score.
For point-wise pose prediction, we follow DenseFusion for the training loss, which is an extended version of the model alignment loss from PoseCNN. For each pixel-wise prediction, this loss computes the average distance between corresponding points of the object model transformed with the ground-truth and predicted poses (pose loss).
Let $T$ be the ground-truth pose, $\hat{T}_i$ be the $i$-th point-wise prediction of the pose, and $x_j$ be the $j$-th of $M$ points sampled from the object model. This pose loss is formulated as:
$$ L_i = \frac{1}{M} \sum_{j} \left\| T x_j - \hat{T}_i x_j \right\| $$
For symmetric objects, which have ambiguous correspondences on the object model, the nearest neighbour of the transformed point is used as the correspondence (symmetric pose loss):
$$ L_i = \frac{1}{M} \sum_{j} \min_{k} \left\| T x_j - \hat{T}_i x_k \right\| $$
The confidence of the pose prediction is trained with these pose losses in an unsupervised way. Let $N$ be the number of pixel-wise predictions and $c_i$ be the $i$-th predicted confidence. The final training loss is formulated as:
$$ L = \frac{1}{N} \sum_{i} \left( L_i c_i - \lambda \log(c_i) \right) $$
where $\lambda$ is the regularization scaling factor, set following DenseFusion.
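The confidence-weighted loss can be sketched as below. This is an illustrative reference implementation: the function name is chosen here, and the default $\lambda$ is an assumed value, not one stated in this paper.

```python
import math

def confidence_weighted_loss(pose_losses, confidences, lam=0.015):
    """Mean over point-wise predictions of L_i * c_i - lam * log(c_i).

    A low confidence discounts the pose loss of that prediction but pays
    a -lam*log(c) penalty, so confidences are learned without direct
    supervision. lam is an assumed default for illustration.
    """
    n = len(pose_losses)
    return sum(l * c - lam * math.log(c)
               for l, c in zip(pose_losses, confidences)) / n
```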
Though the symmetric pose loss is designed to handle symmetric objects using nearest-neighbour search, we found that it is more prone to getting stuck in local minima than the standard pose loss, which uses 1-to-1 ground-truth correspondences on the object model. Figure (b) shows examples where the symmetric pose loss falls into local minima with a non-convex-shaped object.
To address this issue, we introduce a warm-up stage with the standard pose loss (e.g. 1 epoch) during training before switching to the symmetric pose loss. This warm-up strategy first lets the network be optimized for pose prediction without the local-minima problem, albeit ignoring symmetries, and then be optimized considering symmetries, which gives much better results for pose estimation of complex-shaped symmetric objects (Figure (c)).
In the previous section, we showed how we combine image-based object detections, RGB-D data and volumetric estimates of the shapes of nearby objects to make per-object pose predictions from a network forward pass. This can often give good initial pose estimates, but not necessarily a mutually consistent set of estimates for objects which are in close contact with each other. In this section we therefore introduce a test-time pose refinement module that can jointly optimize the poses of multiple objects.
For joint optimization, we introduce differentiable collision checking, composed of an occupancy voxelization of the object CAD model and an intersection loss between occupancy grids. As both are differentiable, this allows us to optimize object poses using gradient descent with batched operations on a GPU.
The average voxelization of feature vectors mentioned in §3.2.2 averages feature vectors over the points and is differentiable with respect to the feature vectors. In contrast, the occupancy voxelization needs to be differentiable with respect to the points. This means the value of each voxel in the occupancy grid must be a function of the points, which have been transformed by the estimated object pose.
Let $p_j$ be a point, $s$ be the voxel size, and $o$ be the origin of the voxel grid (i.e. the bottom-left corner). We can transform the point into voxel coordinates with:
$$ \tilde{p}_j = \frac{p_j - o}{s} $$
For each voxel $i$ with center $c_i$, we compute the distance to the point:
$$ d_{ij} = \left\| \tilde{p}_j - c_i \right\| $$
We set the occupancy value in proportion to the distance from the nearest point, resulting in the occupancy value of the $i$-th voxel being computed as:
$$ g_i = \max_j \, \max\!\left(0,\; 1 - \frac{d_{ij}}{\delta}\right) $$
where $\delta$ is the distance threshold.
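The formulas above can be sketched as a plain reference implementation. This sketch computes the occupancy values exactly as written but is not differentiable as code; in the actual system this would be a batched, autograd-friendly GPU operation (e.g. in PyTorch). Names and the grid layout here are illustrative assumptions.

```python
import math

def occupancy_voxelize(points, origin, voxel_size, dim, delta=1.0):
    """Occupancy grid from points.

    Each voxel's value falls off linearly with the distance (in voxel
    units) from its center to the nearest transformed point, reaching
    zero at the threshold delta.
    """
    grid = [[[0.0] * dim for _ in range(dim)] for _ in range(dim)]
    # transform points into voxel coordinates: (p - o) / s
    vpoints = [tuple((p[k] - origin[k]) / voxel_size for k in range(3))
               for p in points]
    for i in range(dim):
        for j in range(dim):
            for k in range(dim):
                center = (i + 0.5, j + 0.5, k + 0.5)
                d = min(math.dist(center, q) for q in vpoints)
                grid[i][j][k] = max(0.0, 1.0 - d / delta)
    return grid
```

Because each voxel value is a continuous function of the point coordinates, gradients of a loss over the grid can flow back to the hypothesized object pose.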
This differentiable occupancy voxelization gives occupancy grids from an object model and a hypothesized object pose. For a target object, points sampled uniformly from its CAD model (including the internal part) are transformed with the hypothesized pose, from which the occupancy values are computed, giving a hypothesized occupancy grid of the target object. Similarly, we perform this voxelization for the surrounding objects. Unlike the target-object voxelization, surrounding objects are voxelized in the voxel coordinates of the target, defined by the occupancy-grid origin of the target object and its voxel size. This gives the hypothesized occupancy grids of the surrounding objects of the target object.
The occupancy voxelization gives the hypothesized occupied space of the target (the $i$-th object in the scene) $\hat{g}_i$ and of the surrounding objects. The occupancy grids of the surrounding objects are built in the voxel coordinates (center, voxel size) of the target object and aggregated with an element-wise max into a single impenetrable occupancy grid $g^{imp}_i$, against which intersection by the target object pose is penalized. In addition to this impenetrable grid from the pose hypotheses of surrounding objects, we also use the occupancy information from volumetric fusion: occupied space including background objects and free space (Figure 3) are added to the impenetrable area with a further element-wise max. The collision penalty loss is computed as the intersection between the hypothesized occupied space of the target and the impenetrable surrounding grid:
$$ L^{col}_i = \sum \left( \hat{g}_i \odot g^{imp}_i \right) $$
where $\odot$ is element-wise multiplication and the sum runs over voxels.
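The aggregation and collision penalty can be sketched on flattened grids; a minimal illustration with assumed function names, operating on plain lists rather than GPU tensors:

```python
def aggregate_impenetrable(grids):
    """Element-wise max over flattened occupancy grids (same length)."""
    return [max(vals) for vals in zip(*grids)]

def collision_loss(target, impenetrable):
    """Sum of the element-wise product: penalizes overlap between the
    target's hypothesized occupancy and the impenetrable space."""
    return sum(t * g for t, g in zip(target, impenetrable))
```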
Though this loss correctly penalizes collisions between the target and surrounding objects, optimizing it alone is not enough, as it does not take into account the visible-surface constraint on the target object. The other term in the loss is the negated intersection between the hypothesized occupied space of the target $\hat{g}_i$ and the target's own reconstructed occupancy $g^{self}_i$, encouraging surface intersection between the object pose hypothesis and the volumetric reconstruction:
$$ L^{surf}_i = -\sum \left( \hat{g}_i \odot g^{self}_i \right) $$
We compute these collision and surface-alignment losses for all $N$ objects with batched operations on the GPU, and sum them as the total loss:
$$ L = \sum_{i=1}^{N} \left( L^{col}_i + L^{surf}_i \right) $$
This loss is minimized with gradient descent, allowing us to jointly optimize the pose hypotheses of multiple objects.
After performing pose estimation and refinement, we spawn object CAD models into the map once there is enough agreement among the poses estimated in different views. To compare object poses estimated in different camera coordinates, we first transform the poses into world coordinates using the tracked camera pose from the camera tracking module (§3.1). The transformed object poses are compared using the pose loss that we also use for training the pose prediction network (§3.2.3). For the recent pose hypotheses, we compute the pose loss for each pair. We count how many of these pairwise pose losses are under a threshold, and when the count reaches a threshold, we initialize the object with the agreed pose.
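The agreement check can be sketched as follows. The function names, and passing the pairwise pose loss in as a callable, are illustrative choices, not the authors' interface:

```python
from itertools import combinations

def poses_agree(pose_loss_fn, poses, loss_threshold, count_threshold):
    """Check agreement among recent pose hypotheses (one per view).

    Computes the pairwise pose loss for every pair of hypotheses and
    counts how many pairs fall under loss_threshold; the object is
    initialized once the count reaches count_threshold.
    """
    count = sum(1 for a, b in combinations(poses, 2)
                if pose_loss_fn(a, b) < loss_threshold)
    return count >= count_threshold
```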
In this section, we first evaluate how well the pose prediction (§4.2) and refinement (§4.3) perform on 6D pose estimation datasets. We then demonstrate the system running on a robotic pick-and-place task (§4.4).
We evaluate our pose estimation components using the 21 classes of YCB objects used in the YCB-Video dataset. The YCB-Video dataset has been commonly used for evaluation of 6D pose estimation in prior work; however, since all of its scenes are table-top, it is limited in terms of the variety of object orientations and occlusions.
To make evaluation possible with heavy occlusions and arbitrary orientations, we built our own synthetic dataset: Cluttered YCB (Figure (a)). We used a physics simulator to place object models in feasible configurations starting from random poses. This dataset has 1200 scenes, with 15 camera frames for each.
We used the same metrics as prior work [29, 28], which evaluate the average distance of corresponding points: ADD and ADD-S. ADD uses ground-truth correspondences and ADD-S uses nearest neighbours as correspondences, after transforming the model with the ground-truth and estimated poses. These distances are computed for each object pose in the dataset, and plotted with the error threshold on the x-axis and the accuracy on the y-axis. The metric is the area under the curve (AUC), using 10 cm as the maximum threshold on the x-axis.
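The ADD, ADD-S and AUC computations can be sketched as below. This is a plain-Python illustration (poses as a 3×3 rotation matrix plus a translation, lists instead of arrays); the function names are chosen here, not taken from any reference implementation.

```python
import math

def transform(pose, p):
    """Apply a pose (R, t): 3x3 rotation (list of rows) and translation."""
    R, t = pose
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i]
                 for i in range(3))

def add(points, pose_gt, pose_est):
    """ADD: average distance between corresponding model points."""
    return sum(math.dist(transform(pose_gt, p), transform(pose_est, p))
               for p in points) / len(points)

def add_s(points, pose_gt, pose_est):
    """ADD-S: average nearest-neighbour distance, for symmetric objects."""
    est = [transform(pose_est, p) for p in points]
    return sum(min(math.dist(transform(pose_gt, p), q) for q in est)
               for p in points) / len(points)

def auc(distances, max_threshold=0.10, steps=1000):
    """Area under the accuracy-vs-threshold curve, normalized to [0, 1]."""
    acc = [sum(d < max_threshold * (s + 1) / steps for d in distances)
           / len(distances) for s in range(steps)]
    return sum(acc) / steps
```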
We used DenseFusion  as a baseline model. For fair comparison with our proposed model, we reimplemented DenseFusion and trained with the same settings (e.g. data augmentation, normalization, loss).
Table 1 shows the pose prediction results on the YCB-Video dataset using externally provided detection masks, where DenseFusion (official) is the official GitHub implementation (https://github.com/j96w/DenseFusion) and DenseFusion (ours) is our version, which adds the warm-up loss (§3.2.3) and the centralization of the input point cloud (analogous to the voxelization step in our model). We find that these two added components lead to substantial performance improvements. In the following evaluations, we use our DenseFusion as the baseline model.
We compared the proposed model (MoreFusion) with the baseline model (DenseFusion). For a fair comparison, both models predict object poses from a single view, where MoreFusion only uses occupancy information from the single-view depth observation. We trained the models on the combined Cluttered-YCB and YCB-Video datasets and tested on each separately with ground-truth masks. The results (Table 2, Figure (a)) show that MoreFusion consistently predicts better poses thanks to the volumetric CNN and surrounding occupancy information. The improvement is larger on heavily occluded objects (visibility < 30%).
To evaluate the effect of surrounding occupancy as input, we tested the trained model (MoreFusion) with different levels of occupancy information: discarding the occupancy information from the single-view observation (−occ); adding the full reconstruction of non-target objects; and adding the full reconstruction of background objects (+bg). Table 3 shows that the model gives better predictions as more occupancy information is given, which is the common case in our incremental, multi-view object mapping system. This comparison also shows that even without occupancy information (−occ) our model performs better than DenseFusion, purely because of the 3D-CNN architecture.
We evaluate our pose refinement, Iterative Collision Check (ICC), against point-to-point Iterative Closest Point (ICP). Since ICP only uses the masked point cloud of the target object without any reasoning about surrounding objects, comparing ICC with ICP lets us evaluate how well, and in which cases, the surrounding-object geometry used in ICC helps pose refinement.
Figure 7 shows a typical example where the pose prediction has object-to-object intersections because of the low visibility of an object (e.g., the yellow box). ICC refines object poses to better configurations than ICP by using constraints from nearby objects and the free-space reconstruction.
For quantitative evaluation, we used the Cluttered YCB-Video dataset with pose estimates refined from the initial pose predictions of MoreFusion in Table 2. Figure 9 shows how the metric varies with object visibility, in which the combination of the two methods (+ICC+ICP) gives consistently better poses than either alone. With small occlusions (high visibility), ICC does not perform as well as ICP because of the discretization introduced by the voxelization (we use a 32-dimensional voxel grid). However, results are at their best with the combination of the two optimizations, where ICC resolves collisions in the discretized space and then ICP aligns surfaces more precisely.
We demonstrate the capability of our full system, MoreFusion, with two demonstrations: scene reconstruction, in which the system detects each known object in the scene and aligns the pre-built object model (shown in Figure 10); and a robotic pick-and-place task, in which the robot is requested to pick a target object from a cluttered scene, intelligently removing distractor objects to access the target (shown in Figure 8).
We have shown consistent and accurate pose estimation of objects that may be heavily occluded by, and/or in tight contact with, other objects in cluttered scenes. Our real-time, incremental pose estimation system builds an object-level map that describes the full geometry of objects in the scene, which enables a robot to manipulate objects in complicated piles, intelligently disassembling occluding objects and performing oriented placement. We believe that there is still a long way to go in using known object models to make persistent models of difficult, cluttered scenes. One key future direction is to introduce physics reasoning into our optimization framework.
Research presented in this paper has been supported by Dyson Technology Ltd.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 652–660, 2017.
PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes.In Proceedings of Robotics: Science and Systems (RSS), 2018.