Symmetry is omnipresent in both nature and the synthetic world. Symmetry detection is therefore a long-standing problem that has attracted substantial attention in both computer graphics and vision (liu2010computational; mitra2013symmetry). Symmetry is at heart a purely geometric concept, with a rigorous definition on the basis of transformation invariance and group theory. It might therefore be supposed that symmetry detection can always be solved by a purely geometric approach. For example, reflectional symmetry in 2D or 3D can be easily parameterized in the transformation space. As such, detection methods such as the Hough transform have historically been utilized to accumulate local cues of symmetry transformations based on detected symmetry point correspondences (yip2000hough; podolak2006planar; mitra2006partial).
If we consider the problem of symmetry detection in the presence of significant missing data, however, it becomes appropriate to abandon purely geometric approaches and infer what symmetries might be present. A common application scenario is estimating symmetries of 3D shapes based on a single-view RGB-D image. Single-view symmetry detection finds various potential applications ranging from object/scene completion, camera tracking, and relocalization, to object pose estimation. Due to partial observations and object occlusion, it also poses special challenges that are beyond the reach of geometric detection. For example, it is difficult, if not impossible, to find local symmetry correspondences and transformations supporting global symmetry analysis. In this situation, symmetry analysis should rely not only on geometric detection but also on statistical inference. The latter necessitates data-driven learning.
In this work, we propose an end-to-end learning approach for symmetry prediction based on a single RGB-D image using deep neural networks. As shown in Figure 1, given an RGB-D image as input, the network is trained to detect two types of 3D symmetries present in the scene, namely (planar) reflectional and (cylindrical) rotational symmetries, and outputs the corresponding symmetry planes and axes, respectively. Directly training a deep model for symmetry prediction, however, can quickly run into the issue of overfitting. This is due to the fact that the network is able to easily “memorize” the symmetry axes of a class of objects in training and will simply perform object recognition at test time. Such an overfitted model cannot generalize well to large shape variation or changes in the symmetries that are present. In fact, symmetry is not a global shape property but rather is supported by local geometric cues: symmetry transformation invariance is defined by local shape correspondences. Straightforward training of symmetry prediction cannot help the network to truly understand the local-to-global support.
To this end, we adopt a multi-task learning approach. Aside from symmetry axis prediction, our network is also trained to predict symmetry correspondences. In particular, given the 3D points present in the RGB-D image, our network outputs for each 3D point its symmetric counterpart corresponding to a specific predicted symmetry. Given any point, its counterpart may lie in the 3D point cloud or be missing due to occlusion or limited field of view. To accommodate both cases, we output the counterpart in two modalities. First, we output a heat map for each point in the point cloud indicating the probability of each other point being the symmetric counterpart of this point. Second, we directly regress the location of this counterpart. To avoid overfitting, we correlate the two tasks by supervising their learning with a unified supervision signal, i.e., the ground-truth local position of the counterparts. We devise a loss that encourages a point with high counterpart probability to be spatially close to its corresponding ground-truth location.
Through proper parameterization, our network is able to handle reflectional symmetry, as well as continuous and discrete rotational symmetry. Since the number of symmetries present in a 3D shape may vary, a network with a single output is not suitable for the symmetry prediction task. We therefore design our network to produce multiple symmetry outputs. When training the network, however, one needs to know how to match the outputs to different ground-truth symmetries in order to compute a proper prediction error for gradient propagation. This is achieved by an optimal assignment process, which keeps the entire network end-to-end trainable.
Through extensive evaluation on three symmetry prediction datasets, we demonstrate the strong generalization ability of our method. It attains high accuracy not only for symmetry axis prediction, but also for counterpart estimation. Therefore, our method is robust in handling unseen object instances with large variation in shape, multi-symmetry composition, as well as novel object categories. In summary, we make the following contributions:
We propose the problem of reflective and rotational symmetry detection from single-view RGB-D images, and introduce a robust solution based on deep learning.
We use a series of dedicated tasks (losses) to guide the deep network to learn not only parametrized symmetry axes but also the local symmetry correspondences that support them.
We realize end-to-end learning of multi-symmetry detection by devising an optimal assignment process for multi-output network training.
We propose a benchmark for single-view symmetry detection, encompassing a moderately-sized dataset containing both real and synthetic scenes, as well as evaluation metrics.
2. Related work
Symmetry detection has a large body of literature, which has been comprehensively reviewed by the two excellent surveys of Liu et al. (liu2010computational) and Mitra et al. (mitra2013symmetry). Here we focus only on the work that is most related to our specific designs and techniques.
2D symmetry detection.
2D symmetry detection has long been a major topic of interest in computer vision. Many approaches have been proposed, including direct methods (kuehnle1991symmetry), voting-based methods (ogawa1991symmetry), and moment-based methods (marola1989detection). Different types of primitive symmetries such as rotation, translation, and reflection, as well as symmetry groups (liu2000computational), have been studied. Among all these directions, the detection of bilateral reflectional symmetry and its skewed version (liu2001skewed) from 2D images has received the most attention from the community. Our work is relatively closely related to the detection of skewed bilateral symmetry, since the latter is inherently inferring reflectional symmetry of 3D objects from their 2D projections. In contrast to our work, these works do not output symmetries in 3D space and rely on the presence of a large portion of the symmetric regions.
3D symmetry detection.
Since the two seminal works of Mitra et al. (mitra2006partial) and Podolak et al. (podolak2006planar), symmetry detection of 3D geometry has attracted much attention in the field of geometry processing. Existing works can be categorized according to different problem settings targeted, such as exact vs. approximate symmetry, local vs. global symmetry, and extrinsic vs. intrinsic symmetry. Different combinations of the settings lead to different problems and approaches, such as the detection of extrinsic global symmetries (podolak2006planar; martinet2006accurate), extrinsic partial symmetries (mitra2006partial; bokeloh2009symmetry; lipman2010symmetry), intrinsic global symmetries (raviv2007; ovsjanikov2008global), and intrinsic partial symmetries (xu2009partial; raviv2010full). Common to these works is the reliance on 3D shape correspondence (van2011survey), which is regarded as a primary building block of 3D symmetry detection. However, in cases of significant missing data, such as single-view scans, shape correspondence becomes extremely challenging.
Learning-based symmetry detection.
Early methods for symmetry detection using statistical learning include the use of feed-forward networks to detect and enhance edges that are symmetric in terms of edge orientation (zielke1992intensity). Tsogkas and Kokkinos (tsogkas2012learning) employ hand-crafted features and multiple instance learning to detect ribbon-like structures in natural images, which was later extended to detect more general reflectional symmetry (shen2016multiple). Teo et al. (teo2015detection) utilize structured random forests to detect curved reflectional symmetries. Most recently, deep learning has been adopted for the task of 2D symmetry detection (shen2016object; ke2017srn), typically detecting reflectional symmetries as 2D skeletons instead of symmetries in 3D space.
Gao et al. (gao2019prs) propose PRS-Net, the first deep learning based symmetry detection method for 3D models. They develop a loss function to measure symmetry correspondence that requires the counterpart of any point to lie on the shape surface. This limits their use in handling single-view scans: the reflective counterpart of a point may be far away from the surface due to missing data, which may lead to high loss and slow convergence.
Learning-based shape correspondence.
Deep learning has also been applied to shape correspondence. Existing works mostly focus on learning-based shape descriptors (huang2017learning), which have proven more robust than hand-crafted ones. Wei et al. (wei2016dense) learn feature descriptors for each pixel in a depth scan of a human for establishing dense correspondences. Zeng et al. (zeng20173dmatch) learn a local patch descriptor for volumetric data, which can be used for aligning depth data for RGBD reconstruction. Although data-driven local shape descriptors can be used for symmetry detection, it is unclear how to harness them to realize an end-to-end learned symmetry detector. Moreover, severe data incompleteness renders shape correspondence inapplicable.
Learning-based object pose estimation.
Our work is also related to 6D object pose estimation based on single-view RGB(D) input, since pose and symmetry usually imply each other. Most existing methods are instance-level (choi20123d; hodavn2015detection; konishi2018real; georgakis2018matching; avetisyan2019scan2cad; peng2019pvnet; wang2019densefusion) and require a template 3D model, which is unavailable for our single-view symmetry detection task. Recently, Wang et al. (wang2019normalized) achieve category-level 6D pose estimation based on a Normalized Object Coordinate Space (NOCS), a shared canonical representation of object instances within a category. They train a neural network to directly infer the pixel-wise correspondence between an RGB image and the NOCS. The 6D object pose is then estimated using shape matching. This method has difficulty generalizing across different shape categories. Our network, on the other hand, has good cross-category generality.
The symmetry of a 3D object is easily measurable when its geometry is fully known. Conventional symmetry detection pipelines for 3D objects normally establish symmetric correspondences within the observed geometrical elements (e.g. points or parts) before aggregating them into meaningful symmetries. However, single-view observations of real-world objects are usually incomplete due to occlusion and limited field of view. Symmetry detection on incomplete geometry is an ill-posed problem which is difficult to solve with existing approaches.
When inferring the underlying symmetries of an incompletely observed object, humans usually resolve ambiguities based on whether the object is familiar. For an object commonly encountered in daily life, a person recognizes its category, estimates its pose, and determines the symmetries, all based on prior knowledge. For a novel, rarely-encountered object, however, she may look for local evidence of symmetry (i.e., establish symmetry correspondences) over the observed geometry and/or imagined unseen parts. Clearly, symmetry inference for novel objects is much harder since it involves simultaneous shape matching and shape completion. In this work, we propose a unified solution to single-view symmetry detection for both known and novel objects through coupling the predictions of symmetries and symmetry correspondences.
Our solution is to train an end-to-end network for symmetry prediction; see Figure 2. The network consists of three major components. The first module takes an RGB image and a depth image as input and extracts point-wise appearance and geometric features, respectively. The second module uses these features for point-wise symmetry prediction. Finally, the third module performs symmetry aggregation and verification during inference.
3.1. Problem definition
Given an RGB-D image of a 3D object, our goal is to detect its extrinsic reflectional and/or rotational symmetries, if any. In particular, we detect at most reflectional symmetries , which is parameterized as with being a point in the reflection plane and the plane normal. We also detect at most rotational symmetries , parameterized as where is a point lying on the rotation axis and defines the axis orientation. All symmetries are represented in the camera reference frame.
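As a concrete reading of this parameterization, the following minimal sketch applies a reflection given by a plane point and a plane normal, and a rotation given by an axis point, an axis direction, and an angle, to a set of 3D points. The function and argument names are illustrative assumptions and are not taken from the paper's implementation.

```python
import numpy as np

def reflect_points(points, plane_point, plane_normal):
    """Reflect an (N, 3) array of points across the plane through
    plane_point with normal plane_normal."""
    points = np.asarray(points, dtype=float)
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)
    # Signed distance of every point to the plane.
    d = (points - plane_point) @ n
    return points - 2.0 * d[:, None] * n

def rotate_points(points, axis_point, axis_dir, angle):
    """Rotate (N, 3) points by angle (radians) about the line through
    axis_point with direction axis_dir, using Rodrigues' formula."""
    points = np.asarray(points, dtype=float)
    k = np.asarray(axis_dir, dtype=float)
    k = k / np.linalg.norm(k)
    v = points - axis_point
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    rotated = (v * cos_a
               + np.cross(k, v) * sin_a
               + np.outer(v @ k, k) * (1.0 - cos_a))
    return rotated + axis_point
```

Note that flipping the sign of the plane normal, or moving the plane point within the plane, leaves the reflection unchanged, so the parameters themselves are not unique even though the transformation is.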
3.2. Symmetry prediction network
Let us first introduce how the network predicts one symmetry, and then extend it to output multiple symmetries.
Dense-point symmetry prediction.
We first extract features for both RGB and depth images and then fuse their feature maps. Following (wang2019densefusion), we extract point-wise appearance and geometric features using a fully-convolutional network (wang2019densefusion) and a PointNet (qi2017pointnet), respectively. The two features are then concatenated and used for point-wise prediction tasks. Our network makes individual predictions for each point before aggregating all the predictions to form the final one. The overall prediction loss is , where is the prediction loss of point .
The point-wise prediction takes both point-wise and global features as input. To compute global features, a straightforward way would be to perform average- or max-pooling over all point features. However, average-pooling over all points is redundant for symmetry detection, which can be determined by features of sparse points (mitra2006partial). On the other hand, max-pooling may lose too much information. We instead opt for spatially weighted pooling (hu2017deep). This method measures the significance of each point by learning a weighted mask for every feature map. We insert a spatially weighted pooling layer after the appearance and geometric feature extraction layers. The resulting global feature is then concatenated with the point-wise features for symmetry prediction.
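The pooling step can be illustrated with a small NumPy sketch. It assumes that the mask is given as per-point, per-channel scores and that the scores are normalized with a softmax over the points; both the normalization and the name `spatially_weighted_pool` are assumptions made for illustration, and the layer in (hu2017deep) may differ in detail.

```python
import numpy as np

def spatially_weighted_pool(features, mask_logits):
    """Pool (N, C) point features into a (C,) global feature.

    mask_logits is an (N, C) array of unnormalized per-point, per-channel
    scores; in a trained network a learned layer would produce them.
    """
    features = np.asarray(features, dtype=float)
    logits = np.asarray(mask_logits, dtype=float)
    # Softmax over the point dimension, so each channel's weights sum to 1.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * features).sum(axis=0)
```

With uniform scores this reduces to average-pooling, and with sharply peaked scores it approaches selecting a single point per channel, so the learned mask interpolates between the two extremes discussed above.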
To improve the prediction accuracy and generality, we train the point-wise symmetry prediction network with a multi-task learning scheme. In particular, the tasks include 1) a classification of symmetry type (null if there is no symmetry), 2) a regression predicting the symmetry parameters, 3) a regression estimating the location of the symmetric counterpart of a given point for the corresponding symmetry, and 4) a classification indicating whether an input point is the symmetric counterpart of the current point. To make the point-wise prediction easy to train, all predicted coordinates are represented in a local reference frame centered at the current point, with the same orientation as the camera reference frame.
Although the extra tasks of symmetric counterpart prediction make the point-wise symmetry detection over-constrained, they allow the network to learn the essence of symmetry (i.e., symmetry correspondence) via reinforcing the relation between symmetry parameters and symmetric counterparts.
Given a point , its symmetry prediction loss is defined as
where is the cross-entropy loss for symmetry type classification (null (0) for no symmetry, 1 for reflectional symmetry, 2 for rotational symmetry). is the loss for symmetry parameters and symmetric counterparts calculated based on the ground-truth symmetry type:
For reflectional symmetry (Figure 3(a)), the network outputs , the projection of onto the predicted symmetry plane, and measures the distance between this point and its ground-truth location :
where is Euclidean distance. In the local reference frame of , the predicted normal of the symmetry plane is .
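The geometry behind this output can be sketched as follows, assuming that the plane normal is taken as the normalized offset from a point to its foot point on the plane, and that all local coordinates are offsets relative to the current point. The helper names are hypothetical.

```python
import numpy as np

def foot_point_on_plane(point, plane_point, plane_normal):
    """Orthogonal projection of point onto the plane (camera frame)."""
    point = np.asarray(point, dtype=float)
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)
    return point - np.dot(point - plane_point, n) * n

def plane_from_local_foot(point, local_foot):
    """Recover (point on plane, unit normal) in the camera frame from the
    foot point expressed as an offset relative to point."""
    local_foot = np.asarray(local_foot, dtype=float)
    dist = np.linalg.norm(local_foot)
    if dist == 0.0:
        raise ValueError("point lies on the plane; normal is undetermined")
    return np.asarray(point, dtype=float) + local_foot, local_foot / dist

def reflection_counterpart(point, local_foot):
    """The reflected counterpart lies twice the foot offset away."""
    return np.asarray(point, dtype=float) + 2.0 * np.asarray(local_foot, dtype=float)
```

A single predicted foot point therefore determines both the symmetry plane and the counterpart location of the point, which is why the two predictions can be supervised consistently.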
The counterpart loss for reflectional symmetry is:
where is the predicted probability that point is the counterpart of point and is the ground-truth label (0 for negative and 1 for positive). is the cross-entropy loss. is the predicted symmetric reflection (counterpart) of , and is its corresponding ground-truth. The counterpart loss penalizes when a point with high counterpart probability is spatially distant from the corresponding ground-truth counterpart location. The weight tunes the importance of counterpart prediction.
For rotational symmetry (Figure 3(b)), the network predicts point as the projection of onto the predicted symmetry axis . We define the rotational symmetry prediction loss as:
where and are the corresponding ground-truths.
Unlike reflectional symmetry, in which each point has only one symmetric counterpart, rotational symmetry induces more than one counterpart on the rotational orbit. For discrete rotational symmetry, the number of counterparts is equal to the order of rotational symmetry. For continuous rotational symmetry, on the other hand, the number is infinite. Learning to regress all points on the rotation orbit is extremely difficult if not impossible. We therefore opt to predict, for a given input point, the probability that it lies on the rotation orbit. In addition, we predict the order of the rotational symmetry ( for continuous rotational symmetry and for discrete rotational symmetry) using an MLP and a softmax layer for -way classification. is the maximal order of discrete rotational symmetry in the dataset. We set in our experiment. Note that with this formulation we unify the prediction of continuous and discrete symmetry, leading to reduced model parameters.
The counterpart loss for rotational symmetry is:
where is the predicted probability that a point lies in the ground-truth orbit of , is the predicted order, and are the corresponding ground-truths, and is a trade-off weight.
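To make the notion of a rotation orbit concrete, the sketch below generates the orbit of a point under a discrete rotational symmetry of a given order, and labels which candidate points lie on the orbit of a continuous rotational symmetry. The function names and the tolerance `tol` are assumptions for illustration; how the ground-truth orbit labels are actually produced is not specified here.

```python
import numpy as np

def rotation_orbit(point, axis_point, axis_dir, order):
    """Return the (order, 3) orbit of point under rotations by
    2*pi*k/order, k = 0..order-1 (the first entry is point itself)."""
    k = np.asarray(axis_dir, dtype=float)
    k = k / np.linalg.norm(k)
    v = np.asarray(point, dtype=float) - axis_point
    angles = 2.0 * np.pi * np.arange(order) / order
    cos_a = np.cos(angles)[:, None]
    sin_a = np.sin(angles)[:, None]
    orbit = v * cos_a + np.cross(k, v) * sin_a + k * np.dot(k, v) * (1.0 - cos_a)
    return orbit + axis_point

def on_continuous_orbit(point, candidates, axis_point, axis_dir, tol=1e-3):
    """Label candidates that lie on the circle swept by point around the
    axis: same height along the axis and same distance to it."""
    k = np.asarray(axis_dir, dtype=float)
    k = k / np.linalg.norm(k)

    def height_and_radius(x):
        v = np.asarray(x, dtype=float) - axis_point
        h = v @ k
        radial = v - np.multiply.outer(h, k)
        return h, np.linalg.norm(radial, axis=-1)

    h0, r0 = height_and_radius(point)
    h, r = height_and_radius(candidates)
    return (np.abs(h - h0) < tol) & (np.abs(r - r0) < tol)
```

For a discrete symmetry the orbit is a finite set that could in principle be regressed, whereas the continuous case yields a whole circle, which is why a per-point orbit membership probability is the more practical target.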
Handling arbitrary number of symmetries.
To accommodate multiple symmetries in our network, one option is to train a recurrent neural network which is able to output an arbitrary number of symmetries sequentially. However, training such a sequential prediction requires a prescribed consistent order for the symmetries, which is obviously infeasible. A more straightforward option is to have an-way output with being the maximum number of symmetries per object, which we adopt in our approach. However, training a network with an -way output still requires predefining the order of different outputs. To circumvent this order dependency, we propose an optimal assignment based approach to train the network for order-independent multi-way output.
In particular, each point produces a maximum of outputs for reflectional symmetry or for rotational symmetry. A classifier is used to determine the presence or absence of each symmetry. For those symmetries verified by the classifier, an optimization is applied to find the maximally beneficial matching to the ground-truth symmetries. To be specific, we solve the following optimization during training:
is a permutation matrix with indicating whether the -th ground-truth symmetry matches the -th predicted symmetry. is the total number of the output symmetries, and the total number of the ground-truth symmetries. is a benefit matrix in which represents the benefit of matching the -th ground-truth symmetry to the -th predicted symmetry. A higher similarity between two symmetries leads to a larger benefit.
We compute the benefit as follows. For reflectional symmetries, given two symmetries and , and their corresponding reflectional transformations and , the benefit of matching and is computed from the Euclidean distance between points that are transformed by the two reflectional transformations respectively, with a smaller distance giving a larger benefit:
where is a small value used to prevent dividing by zero.
For rotational symmetries, the benefit of matching two symmetries and is likewise computed from the Euclidean distance between points that are transformed by the two rotational transformations respectively:
where is the rotational transformation of with a rotation angle of . The set of rotation angles is . Note that the transformations with different rotation angles are used only for comparing two rotational symmetries; they have nothing to do with the order of the rotational symmetries.
Solving this optimization amounts to finding the assignment that maximizes the total benefit of matching between the predicted and ground-truth symmetries. We use the Hungarian algorithm (kuhn1955hungarian) to solve the optimization. Figure 4 shows an illustration of the entire process of outputting multiple symmetries per point and finding the optimal assignment.
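The matching step can be sketched as follows for reflectional symmetries. The sketch assumes that the benefit is the inverse of the summed point displacement plus a small constant, that rows of the benefit matrix index ground-truth symmetries and columns index predictions, and it replaces the Hungarian algorithm with an exhaustive search over assignments, which returns the same optimum but is only practical for a handful of symmetries. All names are illustrative.

```python
import itertools
import numpy as np

def reflect(points, plane_point, plane_normal):
    """Reflect (N, 3) points across a plane given by a point and a normal."""
    n = np.asarray(plane_normal, dtype=float)
    n = n / np.linalg.norm(n)
    d = (points - plane_point) @ n
    return points - 2.0 * d[:, None] * n

def reflection_benefit(points, sym_a, sym_b, eps=1e-6):
    """Inverse of the summed distance between the points reflected by
    sym_a and by sym_b; eps guards against division by zero."""
    diff = reflect(points, *sym_a) - reflect(points, *sym_b)
    return 1.0 / (np.linalg.norm(diff, axis=1).sum() + eps)

def best_assignment(benefit):
    """Exhaustive search for the assignment with maximal total benefit.

    benefit[m, k] is the benefit of matching ground truth m to
    prediction k, with no more rows than columns. Returns a tuple a
    with a[m] = k. Stand-in for the Hungarian algorithm.
    """
    n_gt, n_pred = benefit.shape
    best_perm, best_total = None, -np.inf
    for perm in itertools.permutations(range(n_pred), n_gt):
        total = sum(benefit[m, k] for m, k in enumerate(perm))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm
```

Comparing transformed points rather than raw parameters makes the benefit insensitive to equivalent parameterizations, such as a flipped plane normal. For larger problems, `scipy.optimize.linear_sum_assignment` with `maximize=True` solves the same assignment problem efficiently.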
3.3. Symmetry inference
During inference, we start by extracting point-wise features and making point-wise symmetry predictions. We then aggregate these individual predictions to generate the ultimate prediction. A straightforward method of aggregation is to perform a clustering over the predicted symmetries and select the final predictions as the cluster centers, similar in spirit to (mitra2006partial) and (podolak2006planar). When clustering, we need to account for the importance of the predictions since they are not equally accurate due to the influence of occlusion and non-uniform lighting. To this end, we introduce a confidence value for each symmetry prediction of each point. In particular, the confidence is evaluated as the probability output by the softmax layer in predicting symmetry type (the in Eq. (1)).
After testing the performance of various clustering algorithms, we found that Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (ester1996density) works the best for our task. The dissimilarity between two symmetries is defined in Eq. (8) and Eq. (9). In addition, we use the confidence value of each predicted symmetry as its density weight, thus encouraging the selection of more confident predictions.
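A minimal sketch of confidence-weighted density clustering is given below. It assumes a precomputed pairwise dissimilarity matrix between the per-point symmetry predictions, and it treats a prediction as a core sample when the summed confidence of its neighbours within `eps` reaches `min_weight`. The function name and this particular weighting rule are assumptions for illustration rather than the exact procedure used in our experiments.

```python
import numpy as np

def weighted_dbscan(dist, weights, eps, min_weight):
    """Cluster samples from an (N, N) dissimilarity matrix.

    A sample is a core sample if the total weight of the samples within
    eps of it (itself included) is at least min_weight. Returns integer
    labels, with -1 marking noise.
    """
    weights = np.asarray(weights, dtype=float)
    n = len(weights)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = np.array([weights[nb].sum() >= min_weight for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a new cluster from this unvisited core sample.
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            if not core[j]:
                continue  # border samples join but do not expand
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels
```

With this rule, a few highly confident predictions can form a cluster on their own, while many low-confidence predictions are needed to do the same. scikit-learn's `DBSCAN.fit` accepts a `sample_weight` argument that plays a comparable role.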
Symmetry prediction on complete 3D objects can be conveniently verified by computing the matching error between the original model and the model transformed by the predicted symmetry. Large matching error implies inaccurate/incorrect symmetry prediction. This verification, however, is infeasible when the observation is incomplete because even a correct symmetry may have a large matching error due to data incompleteness. We therefore propose a visibility-based verification approach which is suited to our data, i.e., single-view RGB-D images.
As depicted in Figure 5, we first compute a volumetric representation of the space observed by the depth image. Based on visibility w.r.t. the camera pose, the volumetric map contains three types of voxels: observed, free, and unknown. Unknown voxels represent those which are either occluded or outside the FOV. The verification then computes the matching error as the overlap between the transformed surface points and the known free regions. A large overlap means a large confirmed mismatch. We then filter out the predicted symmetries with a large mismatch.
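For a single depth image, the free-space test can be approximated without building an explicit voxel grid: a transformed point lies in known free space if it projects into the image and is closer to the camera than the observed surface at that pixel. The sketch below follows this shortcut. The pinhole intrinsics `fx`, `fy`, `cx`, `cy`, the `margin`, and the function name are assumptions for illustration, and the rejection threshold on the returned ratio is left to the caller.

```python
import numpy as np

def free_space_ratio(points, depth, fx, fy, cx, cy, margin=0.01):
    """Fraction of (N, 3) camera-frame points lying in known free space.

    depth is an (H, W) depth image, with 0 marking invalid pixels.
    Points behind the observed surface, outside the field of view, or
    on invalid pixels are unknown and are not counted as mismatches.
    """
    points = np.asarray(points, dtype=float)
    if len(points) == 0:
        return 0.0
    height, width = depth.shape
    z = points[:, 2]
    free = np.zeros(len(points), dtype=bool)
    idx = np.flatnonzero(z > 0)  # points in front of the camera
    u = np.round(fx * points[idx, 0] / z[idx] + cx).astype(int)
    v = np.round(fy * points[idx, 1] / z[idx] + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    idx, u, v = idx[inside], u[inside], v[inside]
    observed = depth[v, u]
    # Free space: valid depth, and the point is in front of the surface.
    free[idx] = (observed > 0) & (z[idx] < observed - margin)
    return float(free.mean())
```

Applying a predicted symmetry to the observed points and passing the result to this function gives the confirmed mismatch; counterparts that land in occluded or out-of-view regions do not penalize the prediction.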
We have also tested using this visibility-based verification as an extra constraint (loss) in the network training. However, we found that it leads to slow convergence while resulting in little improvement in prediction accuracy.
4. Implementation details
To extract point-wise color features, we use a fully-convolutional network consisting of five convolutional layers, each of which is followed by a Batch Normalization (BN) layer. It encodes an input RGB image of size into a feature space. We use PointNet to extract geometric features. The architecture of our PointNet implementation is the same as the point cloud segmentation network described in (qi2017pointnet). It encodes a point cloud with points into a feature matrix. The size of the global feature is . After concatenating local and global features, the size of the per-point feature is . The symmetry predictor is a three-layer Multi-Layer Perceptron (MLP) which takes the per-point features as input and outputs symmetries. Each MLP is followed by a BN layer. The weights and are both set to . and in the multi-way prediction of reflectional and rotational symmetries are and , respectively.
Training and inference.
We implement the prediction network in PyTorch (paszke2019pytorch). The Adam optimizer (kingma2014adam) is used with a base learning rate of . We use the default hyper-parameters of , , and a weight decay of . The batch size is . For DBSCAN, the maximum neighborhood distance eps is . The minimal number of neighbours for a point to be considered as a core point is . We filter the symmetry predictions whose confidence value is less than . For the visibility-based verification, we filter out the predictions with more than counterpart points located in the known-empty region.
5. Results and applications
In order to train and evaluate the proposed network, we have constructed a 3D symmetry detection benchmark for single-view RGB-D images. The benchmark is built upon ShapeNet (chang2015shapenet), YCB (calli2015ycb), and ScanNet (dai2017scannet). For each of the three datasets, we automatically compute symmetries on the 3D models using existing methods. The symmetry labels are then meticulously verified by experienced workers. Finally, we transfer the symmetries of the 3D models to each RGB-D image, transforming by each object’s pose. The details of collecting symmetry detection annotations for these datasets are as follows:
ShapeNet consists of 3D CAD models with category labels. We first use an optimization-based symmetry detection method to find the ground-truth symmetries in each model, then perform RGB-D virtual scans and transfer the ground-truth symmetry annotations to the local camera coordinates of each RGB-D image. We split this dataset into four subsets: rendered RGB-D images of training models (Train), rendered RGB-D images from novel views of training models (Holdout view), rendered RGB-D images of testing models (Holdout instance), and rendered RGB-D images of models in untrained categories (Holdout category). The details of the train and holdout categories are provided in the supplemental material.
YCB is a dataset originally built for robotic manipulation and 6D pose estimation. It contains RGB-D videos of table-top objects with different sizes, shapes, and textures. High quality 3D reconstructions are provided for each object. We manually annotate ground-truth symmetries for these reconstructed 3D models, and transfer them to the local camera coordinates of each RGB-D image by using the ground-truth 6D pose of each object. We follow the original train/test split established in (calli2015ycb).
ScanNet is a dataset containing RGB-D videos of indoor scenes, annotated with 3D camera poses, surface reconstructions, and semantic segmentations. The recent work Scan2CAD (avetisyan2019scan2cad) provides individual alignment between 3D CAD models and the objects present in the reconstructed surfaces. To obtain the ground-truth symmetries, we first perform an optimization-based symmetry detection on the 3D CAD models, then transfer the detected symmetries to each RGB-D frame. We divide the original train/test split into three subsets: RGB-D images of the training scenes (Train), holdout RGB-D images of the training scenes (Holdout view), and RGB-D images of the testing scenes (Holdout scene).
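Transferring a symmetry annotation from a 3D model to an RGB-D image amounts to applying the object pose to the symmetry parameters. The sketch below assumes that the pose maps object coordinates to camera coordinates as a rotation followed by a translation; the function and argument names are hypothetical.

```python
import numpy as np

def transform_symmetry(point, direction, rotation, translation):
    """Map a symmetry from the object frame to the camera frame.

    point is a point on the reflection plane or rotation axis,
    direction is the plane normal or axis direction, and the pose acts
    as x_cam = rotation @ x_obj + translation.
    """
    R = np.asarray(rotation, dtype=float)
    new_point = R @ np.asarray(point, dtype=float) + np.asarray(translation, dtype=float)
    # Directions are rotated but not translated.
    new_dir = R @ np.asarray(direction, dtype=float)
    return new_point, new_dir / np.linalg.norm(new_dir)
```

The same operation applies when the pose is estimated rather than annotated, in which case errors in the pose carry over directly to the transferred symmetry.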
The statistics of the benchmark datasets are reported in Table 1.
5.2. Evaluation metric
To evaluate and compare the proposed method, we show precision-recall curves (funk20172017) produced by altering the threshold of the confidence value of the prediction. To determine whether a predicted symmetry is a true positive or a false positive, we compute a dense symmetry error from the difference between the predicted symmetry and the ground-truth symmetry. Specifically, for a reflectional symmetry, the dense symmetry error of the predicted symmetry and the ground-truth symmetry of an object with points is computed as:
where and are the symmetric transformations of and , respectively, and is the max distance from the points in to the symmetric plane of .
For rotational symmetries, the dense symmetry error between a predicted symmetry and the ground-truth symmetry is:
where is the rotational transformation of with a rotation angle of . The set of rotation angles is , and is the max distance from the points in to the rotational axis of .
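The two error measures can be sketched as follows, assuming that the error is the mean displacement between the points transformed by the predicted and the ground-truth symmetry, divided by the maximal distance from the points to the ground-truth plane or axis. The exact averaging, the set of rotation angles (passed in by the caller), and all names are assumptions for illustration.

```python
import numpy as np

def _unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def _reflect(points, plane_point, normal):
    n = _unit(normal)
    return points - 2.0 * ((points - plane_point) @ n)[:, None] * n

def _rotate(points, axis_point, axis_dir, angle):
    k = _unit(axis_dir)
    v = points - axis_point
    return (v * np.cos(angle) + np.cross(k, v) * np.sin(angle)
            + np.outer(v @ k, k) * (1.0 - np.cos(angle))) + axis_point

def reflection_error(points, pred, gt):
    """pred and gt are (plane_point, plane_normal) pairs."""
    points = np.asarray(points, dtype=float)
    gap = np.linalg.norm(_reflect(points, *pred) - _reflect(points, *gt), axis=1)
    # Normalizer: largest distance from the points to the ground-truth plane.
    rho = np.abs((points - gt[0]) @ _unit(gt[1])).max()
    return gap.mean() / rho

def rotation_error(points, pred, gt, angles):
    """pred and gt are (axis_point, axis_dir) pairs; angles in radians."""
    points = np.asarray(points, dtype=float)
    gaps = [np.linalg.norm(_rotate(points, *pred, a) - _rotate(points, *gt, a),
                           axis=1).mean() for a in angles]
    # Normalizer: largest distance from the points to the ground-truth axis.
    k = _unit(gt[1])
    v = points - gt[0]
    rho = np.linalg.norm(v - np.outer(v @ k, k), axis=1).max()
    return float(np.mean(gaps)) / rho
```

Dividing by the object extent relative to the ground-truth plane or axis makes the error comparable across objects of different sizes.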
In all experiments, we set the dense symmetry error threshold to be for both reflectional and rotational symmetries. The confidence value of a predicted symmetry is computed by the number of input points and the number of samples (symmetries predicted by each point) belonging to the same cluster as symmetry .
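The precision-recall curve is obtained by sweeping the confidence threshold. A minimal sketch is given below, assuming that each prediction has already been marked correct or incorrect by thresholding its dense symmetry error, and that each correct prediction corresponds to a distinct ground-truth symmetry. The function name and the convention of reporting a precision of 1 when no prediction is kept are illustrative choices.

```python
def precision_recall_curve(confidences, is_correct, num_ground_truth, thresholds):
    """Return a list of (threshold, precision, recall) tuples.

    confidences: confidence value of each predicted symmetry.
    is_correct: whether each prediction is a true positive.
    num_ground_truth: total number of ground-truth symmetries.
    """
    curve = []
    for t in thresholds:
        # Keep only the predictions at or above the current threshold.
        kept = [ok for c, ok in zip(confidences, is_correct) if c >= t]
        true_positives = sum(kept)
        precision = true_positives / len(kept) if kept else 1.0
        recall = true_positives / num_ground_truth if num_ground_truth else 0.0
        curve.append((t, precision, recall))
    return curve
```

Raising the threshold typically trades recall for precision, which is the behaviour traced out by the curves reported in the following sections.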
5.3. Ablation studies
To study the importance of each component of our method, we compare our full method against several variants. A specific part of the pipeline is taken out for each variant, as follows:
No RGB Input: without the input RGB image channels (see the first component in Figure 2). The network can only learn knowledge about symmetries based on geometry.
No Counterpart Predictions: without multi-task learning in the form of counterpart predictions or during training (see the second component in Figure 2).
No Verification: without the visibility-based filtering of false positives during inference (see the third component in Figure 2).
Figure 6 shows the results of our ablation studies, for reflectional (left column) and rotational (right column) symmetry detection. The full method outperforms the simpler variants in almost all cases. Omitting counterpart prediction degrades the results the most, especially for reflectional symmetry detection. This demonstrates that the multi-task learning scheme is crucial to our approach. An interesting observation is that the baseline without RGB input achieves comparable or even better results on subsets containing novel objects (see Figure 6 d, e, and f). This demonstrates that generalization to unknown objects requires geometry, and confirms the intuition presented in Section 3.
5.4. Comparison to baselines
We evaluate our method against four baseline symmetry detection methods for objects based on RGB-D images:
Geometric Fitting (ecins2018seeing): a state-of-the-art symmetry detection approach for point clouds. It first generates a set of symmetry candidates, and then performs symmetry refinement based on geometric constraints. Since their focus is to detect reflectional symmetries, we only compare with it on the reflectional symmetry prediction task.
RGB-D Retrieval: an intuitive approach for symmetry detection that finds, for each object in an RGB-D image, the most similar object present in the training data. The precomputed symmetries are then transferred from the training data to be the symmetry predictions. To achieve this, we train a FoldingNet (yang2018foldingnet) to extract the feature vectors of all objects in the training data. During testing, distance in the feature space is used to retrieve the most similar RGB-D image.
Shape Completion: a two-step approach which first performs shape completion (liu2019morphing) on the input point cloud and then detects symmetries on the completed shape with a geometric symmetry detection method (li2014efficient). We compare to it on the reflectional symmetry prediction task on ShapeNet.
DenseFusion (wang2019densefusion): a cutting-edge approach to estimate the 6D pose of the known objects. We transform the ground-truth symmetries of the known objects by using the predicted 6D pose produced by DenseFusion, thus obtaining the predicted symmetries for each object in the RGB-D images. Since DenseFusion only works on scenarios where the geometries of the target objects are known, we compare to it on the YCB dataset.
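As a rough illustration of how the RGB-D Retrieval and DenseFusion baselines transfer known symmetries to a test object, the following Python sketch retrieves the nearest training feature and maps a symmetry plane or axis through a rigid pose. The function names, the (normal, point) and (direction, point) parameterizations, and the toy values are assumptions made for illustration; they are not taken from the baseline implementations.

```python
import numpy as np

def retrieve_nearest(query_feat, train_feats):
    """Index of the training feature closest (Euclidean) to the query."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    return int(np.argmin(dists))

def transform_plane(normal, point, R, t):
    """Map a reflection plane (normal, point on plane) by x -> R x + t."""
    # Directions are only rotated; points are rotated and translated.
    return R @ normal, R @ point + t

def transform_axis(direction, point, R, t):
    """Map a rotation axis (direction, point on axis) by x -> R x + t."""
    return R @ direction, R @ point + t

# Toy usage: a 90-degree rotation about z followed by a shift along x.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([1.0, 0.0, 0.0])
plane_normal, plane_point = transform_plane(
    np.array([1.0, 0.0, 0.0]), np.zeros(3), R, t)
```

In this sketch the quality of the transferred symmetry depends entirely on the retrieved object or the predicted pose, which is consistent with both baselines being tied to objects seen during training.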
The comparisons are plotted in Figure 7 (reflectional symmetry) and Figure 8 (rotational symmetry). They show that our method outperforms the baselines by a large margin, for both the reflectional and rotational symmetry detection tasks, and over all the data subsets. Crucially, our method achieves relatively high performance on subsets (ShapeNet holdout category and ScanNet holdout scene) that include very different objects from those present in the training data. The Shape Completion baseline is inferior to our method, especially on the ShapeNet holdout category subset, because it generalizes poorly to untrained categories. Even though the Geometric Fitting baseline has high precision for the symmetries it detects, it fails to detect most of the symmetries since it is easily affected by incomplete observations. The RGB-D Retrieval baseline shows worse performance than our method, especially on the ShapeNet holdout category subset and the ScanNet dataset, due to its weaker generalization ability. The DenseFusion baseline achieves a relatively high precision and recall on the YCB dataset. However, it cannot be extended to datasets containing objects that have not been seen during training.
5.5. Qualitative results
Figure 9 visualizes the symmetry prediction results for both synthetic and real data. Our approach detects both reflectional and rotational symmetries in challenging cases, such as novel objects, objects with multiple symmetries, and objects with heavy occlusion. We also show the qualitative comparison with baselines in Figure 14. We see that partial observations and occlusion interfere with the ability of Geometric Fitting and PRS-Net to establish correspondences on the observed points. RGB-D Retrieval provides poor features for untrained objects, leading to an inability to predict accurate symmetries. Shape Completion can generate reasonably good geometry for common objects, but it is less capable in cases where the objects are novel or heavily occluded. Our method, in contrast, successfully predicts symmetries with high accuracy for all of these examples.
5.6. Sensitivity to occlusion
To evaluate the capability of our approach on occluded objects, we create a dataset based on ShapeNet by grouping the data according to the occlusion ratio. In order to generate data with mutual occlusion, we randomly add a foreground mask to the rendered RGB-D images. The occlusion ratio of an object is computed by dividing the area of the occluded region by the area of the whole surface. Examples of the occluded data are provided in the supplemental material. Note that both self-occlusion and mutual occlusion are included.
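The occlusion ratio can be sketched in Python as follows, assuming the object surface is represented by sample points with a per-sample visibility flag and, optionally, a per-sample area. The function names, the sampling-based approximation of area, and the bin edges are hypothetical and serve only to illustrate the computation.

```python
import numpy as np

def occlusion_ratio(visible, areas=None):
    """Occluded surface area divided by total surface area.

    visible: boolean flag per surface sample (False = occluded, whether
             by the object itself or by a foreground mask).
    areas:   optional area per sample; uniform sampling is assumed if omitted.
    """
    visible = np.asarray(visible, dtype=bool)
    if areas is None:
        areas = np.ones(visible.shape, dtype=float)
    areas = np.asarray(areas, dtype=float)
    return float(areas[~visible].sum() / areas.sum())

def group_by_ratio(ratios, edges):
    """Assign each object to an occlusion bin given ascending bin edges."""
    return np.digitize(ratios, edges)

# Toy usage with hypothetical bin edges at 25%, 50%, and 75% occlusion.
ratio = occlusion_ratio([True, False, False, True])
bins = group_by_ratio(np.array([0.1, 0.35, 0.8]), np.array([0.25, 0.5, 0.75]))
```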
Figure 10 compares our approach with the Shape Completion, Geometric Fitting, and RGB-D Retrieval baselines, all three of which are capable of finding symmetries for occluded objects. Our method performs better than the baselines in all the experiments. Although the performance of every method drops as the occlusion ratio increases, ours degrades more slowly in all cases. Qualitative results on real data with occlusions are shown in Figure 1, Figure 9 and Figure 14.
5.7. Evaluation of counterpart prediction
To evaluate the quality of the predicted counterparts, we compute and plot the distribution of the Euclidean distance of each predicted counterpart to its ground-truth counterpart, similar to (kim2011blended). Figure 11 shows the plots on the three ShapeNet subsets. The x-axis of the plots represents a varying Euclidean distance (error) threshold. The y-axis shows the percentage of counterpart correspondences whose Euclidean distance is within the threshold. A larger AUC (Area Under the Curve) represents better performance. Compared to the baselines, our method is more accurate on the counterpart prediction task over all ShapeNet subsets.
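A minimal Python sketch of this evaluation is given below. It assumes the predicted and ground-truth counterparts are stored as N x 3 arrays in the same order; the function names and the normalization of the AUC by the threshold range are our own choices for illustration.

```python
import numpy as np

def correspondence_curve(pred, gt, thresholds):
    """Fraction of counterparts whose error is within each threshold."""
    errors = np.linalg.norm(pred - gt, axis=1)  # per-point Euclidean error
    return np.array([(errors <= th).mean() for th in thresholds])

def normalized_auc(thresholds, fractions):
    """Trapezoidal area under the curve, divided by the threshold range."""
    th = np.asarray(thresholds, dtype=float)
    fr = np.asarray(fractions, dtype=float)
    area = np.sum(0.5 * (fr[1:] + fr[:-1]) * np.diff(th))
    return float(area / (th[-1] - th[0]))

# Toy usage: four counterparts with errors 0, 0.25, 0.5, and 1.0.
gt = np.zeros((4, 3))
pred = np.array([[0.0, 0.0, 0.0],
                 [0.25, 0.0, 0.0],
                 [0.0, 0.5, 0.0],
                 [0.0, 0.0, 1.0]])
thresholds = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
fractions = correspondence_curve(pred, gt, thresholds)
auc = normalized_auc(thresholds, fractions)
```

With this normalization the AUC lies in [0, 1], so curves computed over the same threshold range can be compared directly.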
5.8. Runtime analysis
Table 2 reports the timing of each component in our approach on a server with an Intel® Xeon® CPU E5-2678 v3 @ 2.50GHz × 48, 128GB RAM, and an Nvidia TITAN V graphics card. Note that our method is dramatically faster compared to the state-of-the-art method in (ecins2018seeing) which takes about seconds to detect symmetries for an object.
| Dataset | Network train | Network inference | Aggregation | Verification |
5.9. Failure cases
Figure 12 shows two typical failure cases found in our experiments. First, our method is unable to deal with symmetry types other than those it was trained for. In the example, it detects two reflectional symmetries for an object with spherical symmetry. Second, when a cuboid object is viewed orthogonally so that only one face is visible, our method fails to infer the location of the reflection plane along the depth direction, since the shape information is completely missing along that direction.
Various applications can potentially benefit from symmetry prediction. A straightforward application is symmetry-based object completion as commonly demonstrated in many symmetry detection works (e.g., (bokeloh2009symmetry)). Here, we focus on a less explored application, i.e., how to apply our predicted symmetries to assist 6D object pose estimation from single-view RGB-D images.
To demonstrate the effectiveness of the predicted symmetries in improving 6D pose estimation, we combine the symmetry information predicted by our approach with the state-of-the-art 6D pose estimation approach DenseFusion (wang2019densefusion). To be specific, we feed the parameters of the predicted symmetries to DenseFusion as extra features. These features are then processed by an MLP and concatenated to the point cloud feature in DenseFusion. We train the network using the same data as described in (wang2019densefusion). Figure 13 demonstrates how the predicted symmetries boost the performance of DenseFusion.
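The feature fusion step might look roughly like the NumPy sketch below. All specifics here are assumptions for illustration: the 6-dimensional symmetry parameter vector, the layer sizes, the random weights, and the choice to tile one global symmetry embedding across all points before concatenation are not taken from DenseFusion or from our implementation.

```python
import numpy as np

def mlp(x, weights, biases):
    """Plain MLP with ReLU on all but the last layer."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def fuse(point_feats, sym_params, weights, biases):
    """Embed symmetry parameters and append them to every point feature."""
    sym_embed = mlp(sym_params, weights, biases)              # shape (d,)
    n_points = point_feats.shape[0]
    tiled = np.broadcast_to(sym_embed, (n_points, sym_embed.shape[0]))
    return np.concatenate([point_feats, tiled], axis=1)      # (N, C + d)

rng = np.random.default_rng(0)
point_feats = rng.normal(size=(500, 128))                # hypothetical per-point features
sym_params = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.5])    # hypothetical: plane normal + point
weights = [rng.normal(size=(6, 32)) * 0.1, rng.normal(size=(32, 16)) * 0.1]
biases = [np.zeros(32), np.zeros(16)]
fused = fuse(point_feats, sym_params, weights, biases)
```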
6. Discussion and conclusions
We have proposed a novel problem of detecting 3D symmetries from single-view RGB-D images. Due to partial observation and object occlusion, the problem is challenging, to the point of being beyond the reach of purely geometric detection methods. Instead, we have proposed an end-to-end deep neural network that predicts both reflectional and rotational symmetries for 3D objects based on a single RGB-D image. Several dedicated designs make our method general and robust. First, our network is trained on multiple tightly coupled tasks to achieve outstanding generalizability for both types of symmetries. Second, we devise an optimal assignment module in our network to enable it to output an arbitrary number of symmetries.
Our current method has certain limitations, which we believe will inspire future research:
Our method relies on a good object-level segmentation. If the segmentation mask of an object of interest contains other objects, the symmetry detection will be affected. Although there have been many powerful deep models trained for RGB-D segmentation, it would still be interesting to integrate object detection/segmentation and symmetry detection into one unified deep learning model.
Our current network can only deal with reflectional and rotational symmetries. Extending it to other types of symmetry should not be difficult, although it may make the network harder to train. In general, finding a suitable parameterization/representation of symmetry for end-to-end learning is a fundamental and interesting future direction to pursue.
Our method cannot handle hierarchical (nested) symmetries such as those considered in (wang2011symmetry). We expect that recursive neural networks (RvNN) could be utilized for this case, following the series of works on using RvNNs for 3D structure encoding/decoding (li2017grass; yu2019partnet).
Our network relies on strong supervision. Annotating symmetries for RGB-D data is a non-trivial endeavor. Therefore, it would be interesting to look into unsupervised or self-supervised approaches to symmetry detection, through exploiting rich geometric constraints.