1 Introduction
In the real world, many objects are dynamic and articulated, and they can be operated directly through their moving parts. For example, we may need to open a refrigerator door to store or retrieve an item. If autonomous agents are expected to interact correctly with such objects, they must be able to identify which parts of an object can move. Hence, learning the part mobility of 3D objects benefits 3D computer vision
[Li_CGF2016] and robotics [Hermans_2013], and is closely related to the understanding of object affordances [Myers_ICRA2015], functionality [Hu_CGF2018], and interaction through object recognition [Liu_2019_CVM] and human motion capture data [Roberts_2019_CVM]. Recently, with the emergence of large labeled 3D datasets and deep learning techniques, several studies have made considerable progress in supervised semantic part segmentation, such as PointNet++ [Pointnet++] and PointCNN [PointCNN]. However, these segments are often defined by subjective personal experience, which can easily lead to ambiguity. Research on learning movable part segmentation of 3D objects is therefore of greater practical significance, and it helps agents understand the essential features of dynamic objects. Moreover, current research on parsing 3D shape representations into semantic parts faces challenges in processing novel object categories and discovering new functional parts, which hinders agents from exploring unfamiliar environments. Ideally, intelligent agents should be able to parse 3D shapes into previously unseen functional parts from observations of continuous part motion. Mobility-based shape parsing brings a novel perspective to the part determination problem and provides an unambiguous decomposition. Additionally, a mobility-based part segmentation algorithm enables intelligent agents to make better use of man-made objects designed to function or interact with other objects (including humans).
In this study, we are interested in discovering the part mobility of 3D objects by observing their continuous part motion. We define part mobility as motion part segmentation, motion axis prediction, and motion range estimation. In previous studies, motion parts have typically been extracted from one or two static snapshots of dynamic objects. By contrast, we aim to deduce the motion part structure from observations of continuous articulation states of an object, because inferring motion parts and confirming motion vectors from only a few motion states can easily cause confusion. In this study, we adopt point cloud sequences to represent dynamic objects. Point cloud sequences reveal sufficient spatiotemporal information, and they are easier to obtain than large well-annotated datasets, owing to recent advances in real-time 3D acquisition techniques such as commercial RGB-D cameras.
Automatic motion part induction from point cloud sequences is challenging for several reasons. First, objects differ significantly in geometry and pose, and the motion directions and motion ranges of their moving parts vary noticeably. Second, a point cloud is unordered: there is no tight point correspondence between adjacent frames of a point cloud sequence. Third, we must account for the acquisition quality of scan data, including noise and missing regions.
We propose a novel self-supervised deep neural network-based method to address the above-mentioned problems, inspired by the traditional method of Yan and Pollefeys [Yan_ECCV2006], as illustrated in Fig. 1. Points on the same part produce similar motion trajectories, from which a rigid motion hypothesis can be computed. Following this observation, we treat motion part segmentation as a trajectory clustering problem. Moreover, directly handling point cloud sequences in a network is difficult without point correspondence, so our method transforms a point cloud sequence into a set of trajectories. We design a neural network, PointRNN, to process these trajectories; it learns latent trajectory feature representations and generates candidate rigid motion hypotheses represented by motion axes and motion ranges. Finally, we merge trajectories belonging to similar rigid motions to achieve motion part segmentation. The experimental results show that our deep learning method tolerates noise well compared with the traditional method.
Our method is verified through qualitative and quantitative evaluations on both synthetic and real datasets. Additionally, we conduct ablation experiments to confirm the influence of different loss designs, hyperparameter settings, etc. The comparison experiments demonstrate that our method achieves better performance and requires less time than the traditional method
[Yuan_CGF2016] and deep learning methods [Wang_CVPR2019], [Yi_SIGa2018], [Yan_SIGa19]. Moreover, our algorithm can generalize to novel categories. In summary, we propose a new self-supervised deep learning method that parses 3D shapes into moving parts, motion axes, and motion ranges from point cloud sequences without labeling or any prior knowledge. Specifically, our method makes three key contributions. First, we introduce a self-supervised method for learning motion part segmentation, motion axis prediction, and motion range estimation. Second, we propose a novel neural network called PointRNN that can process trajectories and extract a feature representation. Third, we demonstrate that the performance of our network is superior to state-of-the-art methods and that it generalizes to novel object categories.
2 Related Work
Many 3D shape segmentation approaches have been proposed to extract moving parts from an RGB-D sequence, a single point cloud, a point cloud sequence, or a mesh model. Given an RGB-D sequence containing a dynamic object, previous studies have attempted to recover the 3D scene flow and thereby discover the moving parts of 3D objects. Jaimez et al. [Jaimez_ICRA2015] proposed a primal–dual algorithm that computes RGB-D flow to estimate heterogeneous and non-rigid motion at a high frame rate. Vogel et al. [Vogel_ECCV2014] introduced a method to recover dense 3D scene flow from multiple consecutive frames in a sliding temporal window. To reconstruct the articulated structure, motion part segmentation can be extracted from point correspondences, as shown by Fayad et al. [Fayad_ICCV2011]. However, common defects of these approaches are their reliance on RGB color to compute scene flow and their inability to handle complex structures or large motions.
In the case of a raw 3D point cloud sequence, many studies have aimed to establish a pointwise correspondence between consecutive point clouds of an articulated shape [Chang_and_Zwicker_CGF2008], [Papazov_and_Burschka_CGF2011]. Yan and Pollefeys [Yan_ECCV2006] cast the motion segmentation of feature trajectories as a linear manifold finding problem and proposed a general framework for motion segmentation under affine projections. In addition, Kim et al. [Kim_IROS2016] showed that trajectories can be clustered to separate different motion parts. Yuan et al. [Yuan_CGF2016] proposed a local-to-global approach to co-segment point cloud sequences of articulated objects into near-rigid moving parts. Most of the aforementioned methods require a large amount of computation to achieve good performance, and they demand considerable per-case effort in threshold setting.
Several approaches parse mesh models into moving parts. Mitra et al. [Mitra_2013] inferred the motion of individual parts and the interactions among parts based on their geometry and a few user-specified constraints, and utilized the results to illustrate the motion of mechanical assemblies. Hu et al. [Hu_SIGa2017] introduced a data-driven approach for learning part mobility based on a defined motion pattern from a single static state of 3D objects. However, these methods rely on a well-defined motion pattern and well-segmented 3D objects.
Recent supervised learning approaches to 3D shape segmentation employ deep network architectures to train a classifier on labeled data represented by point clouds
[Pointnet++], volumetric grids [Maturana_and_Scherer_ICRA2015], or spatial data structures [Klokov_ICCV2017]. Subsequent attempts have been made to extract part rigid motions using new deep learning techniques. Wang et al. [Wang_CVPR2019] proposed the shape2motion network architecture, which takes a single point cloud as input. Their network simultaneously segments motion parts and predicts motion axes based on a large, well-annotated dataset. However, their approach has difficulty discovering new motion parts and generalizing to novel categories, and it requires a considerable amount of time for supervised learning on a large dataset. Another type of method, such as that proposed by Yi et al. [Yi_SIGa2018], infers part motion flow and estimates part correspondence by comparing two different motion states. Unfortunately, they did not perform a specific analysis of motion attributes such as motion axes and motion ranges. Behl et al. [Behl_CVPR2019] and Liu et al. [Liu_CVPR2019] extended this class of algorithms to identify motion flow in scene data, but these methods still rely on supervised learning. Yan et al. [Yan_SIGa19] introduced RPM-Net to infer the movable parts of a single point cloud. They adopted a recurrent neural network (RNN) to segment movable parts by forecasting a temporal motion sequence of 3D objects. In contrast, our network takes point cloud sequences as input instead of a single point cloud. Furthermore, we use an RNN to process each trajectory locally rather than the entire point cloud directly and globally.
Deformable shape registration [ARAP] is a basic problem in computational geometry and is widely applied in fields such as computer vision. Various deformable registration algorithms exist for point clouds, many of which are extensions of the classic ICP algorithm [BM92]. Papazov and Burschka [Papazov_and_Burschka_CGF2011] proposed an algorithm that computes shape transitions based on local similarity transforms, allowing it to model not only as-rigid-as-possible deformations but also shapes at local and global scales.
3 Method
Our method comprises three steps for learning part mobility, as shown in Fig. 1. First, we adopt a deformable registration algorithm using point correspondence to generate trajectories (Sec. 3.1). Second, we propose a network architecture called PointRNN to produce candidate motion hypotheses extracted from trajectories (Sec. 3.2). Third, we design an iterative algorithm for removing redundant motion hypotheses and segmenting a point cloud into several motion parts while considering the match degree between trajectories and motion hypotheses (Sec. 3.3). The second step of our method occurs within the learning pipeline, and the other two steps occur outside it. In this section, we discuss the modules of our approach in detail.
3.1 Trajectory generation
The trajectories are the input to our network, PointRNN (Fig. 2). To ensure a fair comparison with Yuan et al. [Yuan_CGF2016], we use the same trajectory generation method: the approach of Papazov and Burschka [Papazov_and_Burschka_CGF2011], which is based on local similarity transforms. Trajectory generation is a replaceable module in our method, so other generation algorithms could be substituted. All trajectories are cropped to an equal length to facilitate deep neural network processing.
Given two adjacent frames, and , of a point cloud sequence, the registration algorithm iterates between a correspondence step and a deformation step. In the correspondence step, we find the closest point in for each point . We optimize by searching over its nearest neighbors to smooth the match error. Formally, we compute a mapping, , by minimizing the Laplacian smoothness energy of the residual field :
(1) 
where the residual Laplacian, , is defined as
(2) 
Here, is the nearest neighbors of . In the deformation step, we minimize a fitting error between the local neighborhood of and its matched counterpart to estimate a similarity transformation, denoted as , by using singular value decomposition [Scott_SIGGRAPH2006]. (3)
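As a concrete illustration of the correspondence step, a brute-force closest-point search can be sketched as follows (our own NumPy code, not the paper's implementation; the actual method additionally smooths the matches with the Laplacian energy above and would use a spatial index for speed):

```python
import numpy as np

def closest_point_correspondence(src, dst):
    """For each point in src, return the index of its closest point in dst.

    src: (N, 3) array, dst: (M, 3) array. This is the raw nearest-neighbor
    matching; the registration described above refines these matches with
    a Laplacian smoothness term.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
dst = np.array([[0.9, 0.0, 0.0], [0.1, 0.0, 0.0]])
idx = closest_point_correspondence(src, dst)
```

In the full registration loop, this matching alternates with the deformation step until convergence.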
3.2 Part rigid motion hypothesis extraction
PointRNN network architecture. We propose a novel neural network called PointRNN which has the capacity to process trajectories and extract the feature representation, as illustrated in Fig. 2.
Given a set of trajectories, ( in our implementation), has a point set, ( by default), with and a motion direction set, ( by default), with
. There is a one-to-one relationship between these two sets. For each trajectory, we employ a Long Short-Term Memory (LSTM) network
[LSTM] to encode temporal information, because LSTMs have demonstrated high accuracy in tasks involving sequential and time-series data. Yan et al. [Yan_SIGa19] used the PointNet++ encoder to create a global feature for the input point cloud and then fed this feature vector to an LSTM network. Compared to their approach, ours pays more attention to local and detailed features for each trajectory. We assume that similar trajectories should belong to one motion part, as inspired by the traditional clustering method [Yan_ECCV2006]. Thus, we design a Sample&Group module to learn spatial features together with their underlying motion. Specifically, we employ an LSTM network to encode temporal information for each trajectory. To encode spatial information, we use iterative farthest point sampling to obtain a subset of trajectories, , such that is the most distant trajectory (in metric distance) from set . The metric distance function between two trajectories, , is as follows:
(4) 
We adopt the Euclidean distance rather than the Hausdorff or Fréchet distance, because the latter two metrics require significantly more computation without yielding improved performance. Next, we search a neighborhood trajectory set, , for each sampled trajectory, , using the nearest-neighbor method, where the metric distance between and () does not exceed a certain threshold . We thus obtain a number of overlapping partitions of trajectories. Finally, we aggregate the features of each trajectory set as regional spatial features and express them as the features of . In our method, and are tunable parameters in different layers.
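Farthest point sampling over trajectories with the Euclidean trajectory distance can be sketched as follows (our own illustrative NumPy code; the distance is taken as the mean pointwise Euclidean distance between equal-length trajectories, an assumption consistent with the cropping described above):

```python
import numpy as np

def traj_dist(a, b):
    # Mean pointwise Euclidean distance between two equal-length trajectories.
    return np.linalg.norm(a - b, axis=1).mean()

def farthest_trajectory_sampling(trajs, k):
    """Pick k trajectory indices by iterative farthest-point sampling:
    each new pick maximizes its distance to the already-chosen set."""
    chosen = [0]
    min_d = np.array([traj_dist(trajs[0], t) for t in trajs])
    for _ in range(k - 1):
        nxt = int(min_d.argmax())
        chosen.append(nxt)
        d = np.array([traj_dist(trajs[nxt], t) for t in trajs])
        min_d = np.minimum(min_d, d)
    return chosen

# Three toy trajectories of 4 points each, at offsets 0, 10, and 4.
trajs = [np.zeros((4, 3)), np.full((4, 3), 10.0), np.full((4, 3), 4.0)]
chosen = farthest_trajectory_sampling(trajs, 2)
```

The grouping step then collects, for each sampled trajectory, the neighbors within the distance threshold.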
We design an Interpolate module to propagate global spatial features to single-trajectory features using distance-based interpolation and skip connections. We achieve feature propagation by interpolating features from the nearest trajectory set (), , to each trajectory, . We use an inverse-distance-weighted average based on the nearest neighbors in the interpolation. Finally, we feed these interpolated trajectory features into a multilayer perceptron (MLP) network to encode single-trajectory features.
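The inverse-distance-weighted interpolation can be sketched as follows (our own illustrative code; the epsilon guard and function name are assumptions, in the spirit of PointNet++-style feature propagation):

```python
import numpy as np

def idw_interpolate(query_d, neighbor_feats, eps=1e-8):
    """Inverse-distance-weighted average of neighbor features.

    query_d: (k,) distances from a query trajectory to its k nearest
    sampled trajectories; neighbor_feats: (k, C) their feature vectors.
    Closer neighbors contribute proportionally more to the result.
    """
    w = 1.0 / (query_d + eps)   # eps avoids division by zero
    w = w / w.sum()             # normalize weights to sum to 1
    return (w[:, None] * neighbor_feats).sum(axis=0)

# A neighbor at distance 1 contributes three times the weight of one at 3.
d = np.array([1.0, 3.0])
feats = np.array([[0.0], [4.0]])
out = idw_interpolate(d, feats)
```

The interpolated features are then concatenated with skip-connected features and passed through the MLP.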
Observing most objects in the real world, we found that the number of motion parts is finite and small. Therefore, based on our dataset, we assume that the number of motion parts is no more than 10. We design a Sample&Predict module to generate part rigid motion hypotheses. Similar to the Sample&Group module, we obtain () trajectory sets using farthest point sampling and the nearest-neighbor method (). We consider that these sets can cover all the motion parts of 3D shapes. We use pooling to obtain aggregated spatial features for each trajectory set. Then, we obtain temporal features using an LSTM network. Finally, we feed the temporal and spatial features to an MLP network. The network outputs a motion axis, including a start point () and direction (), and motion ranges, including a shift distance () and a rotation angle () (Fig. 3).
Self-supervised loss function design.
To reconstruct the motion sequences, we compute the motion matrix, , from the network output. The motion matrix is composed of a rotation matrix and a translation matrix (). is easy to generate using and . Solving can be reduced to transforming the point set rotating around the given axis (including and ) into the point set rotating around the z-axis of the standard coordinate system, . The basic idea of deriving this matrix is to divide the problem into a few simple known steps. First, we align the start of the given axis with the coordinate origin and move the point set accordingly. Next, we rotate the given axis and the point set such that the axis lies in one of the coordinate planes. Then, we rotate the given axis and the point set such that the axis is aligned with the z-axis. We use one of the fundamental rotation matrices to rotate the point sets by . Finally, we undo the first three steps in reverse order. We can then rebuild the last frame () from the first frame () by premultiplication ( on the left) in our neural network. To apply the backpropagation algorithm, we implement the partial derivative for each network output. Thus, we define the first loss function, called the rebuild loss function, as follows.
(5) 
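The translate/align/rotate/undo construction above composes into a single rigid transform. Here is a minimal NumPy sketch (our own illustration, not the paper's code) that builds the combined rotation-plus-shift transform from the axis parameters and applies it to rebuild a point; c, u, theta, and d stand for the axis start point, direction, rotation angle, and shift distance, and the rotation is written in the equivalent closed axis-angle form:

```python
import numpy as np

def motion_transform(c, u, theta, d):
    """4x4 homogeneous motion: rotate by theta about the axis through
    point c with unit direction u, then shift by d along u."""
    u = u / np.linalg.norm(u)
    # Cross-product (skew-symmetric) matrix of u.
    K = np.array([[0, -u[2], u[1]],
                  [u[2], 0, -u[0]],
                  [-u[1], u[0], 0]])
    # Axis-angle rotation matrix.
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    M = np.eye(4)
    M[:3, :3] = R
    # Rotating about an axis through c: p' = R(p - c) + c, plus shift d*u.
    M[:3, 3] = c - R @ c + d * u
    return M

# Rebuild a point: rotate 90 degrees about the z-axis through (1, 0, 0).
M = motion_transform(np.array([1.0, 0, 0]), np.array([0.0, 0, 1]), np.pi / 2, 0.0)
p = M @ np.array([2.0, 0, 0, 1])
```

Applying this matrix frame by frame reproduces the rebuild used by the rebuild loss.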
For each trajectory set in the Sample&Predict module, we solve an absolute orientation problem to obtain an approximate solution of the rigid motion (). Our solution to the absolute orientation problem is based on the method of Myronenko and Song [Myronenko_and_Song_arXiv2009]. In addition, we ensure that the determinant of is 1 so that is a rotation matrix rather than a reflection matrix. Then, we extract the rotation matrix from . Similarly, we can obtain from
. We define the relative pose estimation loss function, which was used to measure the angular distance between
and in the study by Suwajanakorn et al. [Supasorn_NIPS2018]. (6)
We found that the network converges slowly and easily falls into a local optimum when using only and . The reason is that the same can be generated by two different parameter sets of , , , and . To solve this problem, we compute approximate solutions of , , , and , denoted as , , , and , respectively. In general, can be written concisely using Rodrigues' rotation formula [RodriguesRotationFormula] as follows:
(7) 
where is the cross-product matrix of , is the outer product, and I is the identity matrix. We can easily obtain and from Equation (7). (8)
where is the trace of .
(9) 
We extract from , where is the vector from the centroid of to that of ; is the length of projected onto .
(10) 
Finally, we solve a system of linear equations to obtain , where I is the identity matrix.
(11) 
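Assuming a rotation angle strictly between 0 and π, the parameter extraction described above can be sketched as follows (illustrative NumPy code with our own variable names: the trace formula gives the angle, the skew-symmetric part of the rotation gives the axis direction, and a least-squares solve stands in for the linear system for the axis start point):

```python
import numpy as np

def decompose_motion(R, t):
    """Recover (theta, u, d, s) from a rigid motion x -> R x + t that is a
    rotation about an axis (unit direction u through point s) plus a
    shift d along u. Assumes 0 < theta < pi."""
    # Angle from the trace: trace(R) = 1 + 2 cos(theta).
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    # Direction from the skew-symmetric part: R - R^T = 2 sin(theta) [u]_x.
    u = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    u = u / (2 * np.sin(theta))
    # Shift distance: component of the translation along the axis.
    d = float(u @ t)
    # (I - R) s = t - d u is singular along u; take a least-squares solution.
    s, *_ = np.linalg.lstsq(np.eye(3) - R, t - d * u, rcond=None)
    return theta, u, d, s

# 90-degree rotation about the z-axis through the origin, shift 0.3 along z.
R = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])
t = np.array([0.0, 0, 0.3])
theta, u, d, s = decompose_motion(R, t)
```

These recovered values play the role of the approximate supervision targets described above.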
In the general situation, in which the start of the axis is not at the coordinate origin, we obtain the same result. We then define the axis loss function as follows:
(12) 
where is the cosine distance between and . Given that is difficult to compute, the network output should satisfy rather than directly regressing by . All formula derivations and practical calculation procedures are presented in the appendix.
In our implementation, by default. In addition, we set a threshold for each loss function to tolerate error and improve robustness, because the approximate solutions of , , , and usually deviate from the ground truth, especially in real scan data.
3.3 Motion part segmentation and optimization
We obtain part rigid motion hypotheses from PointRNN. These hypotheses are redundant, so we must merge similar ones. Considering noisy data and computation errors, we define a hypothesis as no movement if and do not exceed (0.1 radian by default) and (0.05 length units by default), respectively. Further, we define a rotation by and , and a translation by the opposite conditions. A combination of rotation and translation satisfies and . We merge the remaining motion axes by comparing the diversity of , , and .
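These threshold tests can be written compactly (a sketch with the default thresholds stated above; the function name and exact comparison form are our own):

```python
def motion_type(theta, d, eps_theta=0.1, eps_d=0.05):
    """Classify a motion hypothesis from its rotation angle theta
    (radians) and shift distance d, using the default thresholds
    of 0.1 radian and 0.05 length units."""
    rotating = abs(theta) > eps_theta
    shifting = abs(d) > eps_d
    if rotating and shifting:
        return "combined"
    if rotating:
        return "rotation"
    if shifting:
        return "translation"
    return "none"   # below both thresholds: treated as no movement
```

For example, a hypothesis with a 0.5-radian angle and negligible shift is classified as a pure rotation.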
We define the metric distance between a trajectory and a motion hypothesis as the match degree. We use the motion hypothesis to rebuild . The metric distance then has two terms: the rebuild loss and the direction cosine distance.
(14) 
in our implementation. We then obtain refined motion axes and motion parts using an iterative algorithm inspired by non-maximum suppression. First, we set the refined motion axis set, , to empty and assign the trajectories to candidate axes. Then, we choose the candidate axis with the most votes. If is empty, we add this candidate axis to . Otherwise, we check whether the trajectories belonging to this candidate axis can also belong to an axis in ; if not, we add the candidate axis to . We iterate these steps until all trajectories are assigned to a motion axis. Finally, we optimize the coarse segments by examining whether the label of each trajectory agrees with the majority label of its nearest neighbors.
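The refinement loop can be sketched as follows (a simplified Python illustration with hypothetical names; `compatible` stands in for the match-degree test between a trajectory and an axis already in the refined set):

```python
def refine_axes(assignments, compatible):
    """Greedy, non-maximum-suppression-style selection of motion axes.

    assignments: dict axis_id -> set of trajectory ids voting for it.
    compatible(traj, axis): whether a trajectory also fits an axis.
    Returns the refined axis list, most-voted first.
    """
    refined = []
    remaining = dict(assignments)
    covered = set()
    all_trajs = set().union(*assignments.values())
    while covered != all_trajs and remaining:
        # Pick the candidate axis with the most not-yet-covered votes.
        best = max(remaining, key=lambda a: len(remaining[a] - covered))
        trajs = remaining.pop(best)
        # Keep the axis only if some of its trajectories are not already
        # explained by an axis in the refined set.
        new = {t for t in trajs - covered
               if not any(compatible(t, a) for a in refined)}
        if new or not refined:
            refined.append(best)
            covered |= trajs
    return refined

# Two candidate axes with no cross-compatibility: both are kept.
refined = refine_axes({"a0": {1, 2, 3}, "a1": {3, 4}}, lambda t, a: False)
```

The subsequent label-smoothing pass over nearest neighbors then cleans up the coarse segments.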
4 Experiments
In this section, we first introduce the benchmark datasets used in the experiments (Sec. 4.1). We then compare our approach with previous state-of-the-art methods on the tasks of motion part segmentation, motion attribute estimation, and 3D flow determination (Sec. 4.2). We also replace the PointRNN backbone with two different network designs to evaluate the effect of PointRNN (Sec. 4.3). Finally, we conduct ablation experiments on the loss function design, transfer ability, hyperparameters, etc. (Sec. 4.4). Further analyses and visualizations are provided in the appendix.
4.1 Dataset settings
We evaluate our method on three different datasets: two synthetic sets (fdata, sdata) and one real set (rdata). The fake dataset (fdata) is generated randomly and automatically; it enables the neural network to learn the common motion patterns of random trajectories. We leverage two annotated datasets, the Motion dataset [Wang_CVPR2019] and PartNet [Mo_2019_CVPR], to construct the second synthetic dataset (sdata). Xiang et al. [Xiang_2020_SAPIEN] enriched the PartNet dataset with motion attributes. We choose 31 categories, including cabinet, lamp, and window, to generate motion sequences. For the real data (rdata), we select suitable motion sequences from the RBO dataset [RBO_dataset]. Moreover, Yi et al. [Yi_SIGa2018] provided a small real scan dataset. Additionally, we scan some real data sequences in-house. More details on these datasets are provided in the appendix.
4.2 Comparison with stateoftheart methods
We test our method against four alternatives, including both non-learning and learning approaches. Specifically, we compare our method with the traditional method of Yuan et al. [Yuan_CGF2016] and three network-based approaches: Yi et al. [Yi_SIGa2018], Wang et al. [Wang_CVPR2019], and Yan et al. [Yan_SIGa19]. To the best of our knowledge, we are the first to propose a deep learning approach for extracting part mobility from a point cloud sequence. To ensure a fair comparison, we implement a space-time co-segmentation baseline following Yuan et al. [Yuan_CGF2016]. Because their method also requires trajectories as input, we employ the same trajectory generation algorithm [Papazov_and_Burschka_CGF2011]. Yi et al. [Yi_SIGa2018] proposed a neural network architecture with three modules that propose correspondences, estimate 3D deformation flows, and perform segmentation. Their method takes as input a pair of point clouds representing two different articulation states to segment motion parts and estimate 3D flows. To provide a fair comparison between their approach and ours, we set in our network settings; that is, we also take a pair of point clouds as input to train and test our network. For the approach of Wang et al. [Wang_CVPR2019], we train and evaluate their network and ours on the same training and testing data. However, their network requires a single point cloud as input, so the data for their network are randomly sampled frames from the point cloud sequences. For the comparison with Yan et al. [Yan_SIGa19], the data for their approach are the first frame of each point cloud sequence, because their approach segments motion parts by predicting a temporal sequence from a single point cloud.
We use seven metrics to evaluate motion part segmentation and motion axis prediction. We use to evaluate motion part segmentation. We measure the Minimum Distance (MD), Orientation Error (OE), and Type Accuracy (TA), as introduced in [Wang_CVPR2019], which measure the distance, angle, and motion type accuracy, respectively, between the predicted motion axis and the ground truth. Moreover, we employ the Rand Index (RI) and End-Point Error (EPE) to evaluate motion part segmentation and 3D flows in the comparison with Yi et al. [Yi_SIGa2018]. To compare with Yan et al. [Yan_SIGa19], we also use the mean Average Precision (mAP), as defined in their paper, to measure segmentation accuracy.
| Method |  | MD | OE | TA |
| --- | --- | --- | --- | --- |
| Yuan et al. [Yuan_CGF2016] | 0.71 | – | – | – |
| Wang et al. [Wang_CVPR2019] | 0.67 | 0.051 | 0.055 | 0.92 |
| Ours | 0.87 | 0.032 | 0.027 | 0.99 |
Yuan et al. [Yuan_CGF2016] adopted a clustering method to obtain local segments and propagated these segments to neighboring frames. Finally, they merged all frames using a space–time segment grouping technique to obtain the final motion part segmentation. However, their method tends to degrade when dealing with tiny motion parts, and it has difficulty processing 3D shapes with many diverse large motions. Table 1 shows that our approach outperforms theirs in terms of . They did not exploit the fact that motion parts have local characteristics; instead, they generated motion hypotheses by randomly choosing trajectory triplets, which creates fragments that are difficult to group, especially for tiny parts. Additionally, the parameters of their approach had to be tuned case by case.
Wang et al. [Wang_CVPR2019] parsed 3D shapes into part mobility by observing a single static point cloud, based on a large, well-labeled dataset. Their method can only predict the motion parameters of 3D shapes seen in the training data, because of its supervised learning on a large, well-annotated dataset. Extracting motion parts from a single snapshot is also easily confounded: for example, it is very difficult to determine whether a door opens to the right or left when it is closed. More importantly, their network is likely to conclude that a closed door is not a motion part at all. They used a similarity matrix to segment motion parts; however, this type of method is likewise not conducive to handling tiny parts. In contrast to their method, which directly predicts an axis as a regression problem, we minimize the cosine distance for and solve linear algebraic equations to compute , which makes our results more robust and stable than theirs (Table 1).
Yi et al. [Yi_SIGa2018] aimed to discover the motion parts of objects and estimate 3D flows by analyzing the underlying articulation states and geometry of shapes. Their network architecture alternates between correspondence, deformation flow, and segmentation prediction iteratively in an ICP-like fashion. Theirs is a supervised learning method that first estimates 3D pointwise flows and then segments the motion parts. In contrast, we compute the motion axes and motion ranges first and then segment the motion parts using these motion parameters. The results demonstrate that our method achieves higher and RI, and lower EPE, than theirs.
| Method |  | RI | EPE |
| --- | --- | --- | --- |
| Yi et al. [Yi_SIGa2018] | 0.71 | 0.80 | 0.029 |
| Ours | 0.83 | 0.87 | 0.024 |
Yan et al. [Yan_SIGa19] predicted a temporal sequence of pointwise displacements from the input shape using an LSTM. RPM-Net then uses these displacements to learn all the movable points. Next, they obtained a pointwise distance matrix by computing the Euclidean distance between the learned point features. Finally, they applied a clustering method to separate the points into motion parts according to the distance matrix. They attempted to study part mobility from a point cloud sequence using an LSTM, but they still used a single point cloud as input. Table 3 shows that the results of the two models are close. Although their method handles motion segmentation well, it has the same defect as that of Wang et al. [Wang_CVPR2019] owing to supervised learning: it is difficult to imagine a motion sequence from a single point cloud.
| Method | mAP |  |
| --- | --- | --- |
| Yan et al. [Yan_SIGa19] | 0.86 | 0.76 |
| Ours | 0.87 | 0.77 |
4.3 Analysis of the effects of different network designs
In addition to the metrics used in the comparison experiments, three novel metrics, , , and , are presented in Sec. 4.3 and Sec. 4.4. Our approach also estimates motion ranges, so and are the errors between the network outputs and the ground truth. is the error of rebuilding the last frame from the first frame, which is used for a comprehensive evaluation. Moreover, we report the model size to compare different models and the timing, including training and testing time, to evaluate the efficiency of the algorithm.
To verify the effectiveness of PointRNN on the tasks of motion part segmentation and motion attribute estimation, we design two baselines that replace the network backbone. We adopt a slightly modified version of the network of Liu et al. [Liu_2019_ICCV] to build baseline1: we combine the point clouds into a single point cloud, feed it to PointNet++ to extract features, and minimize the loss function to estimate the motion axis and motion range. We design baseline2 by referring to Yan et al. [Yan_SIGa19]: we extract global features per frame using a shared PointNet++ and then adopt an LSTM to encode the temporal information. Similarly, we use as the loss function. For more details about these baselines, please refer to the appendix. The results show that PointRNN achieves better performance, because our network extracts local and detailed spatiotemporal features.
| Method |  | MD | OE |  |  |  | TA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| baseline1 | 0.86 | 0.029 | 0.054 | 0.054 | 0.042 | 0.025 | 0.97 |
| baseline2 | 0.84 | 0.034 | 0.062 | 0.064 | 0.052 | 0.029 | 0.96 |
| Ours | 0.88 | 0.027 | 0.028 | 0.046 | 0.028 | 0.023 | 0.98 |
4.4 Ablation experiments
Effect of different loss function designs. To analyze the consequences of different loss function designs, we experiment with ablated versions of our network using four combinations of the loss functions. Different loss designs mainly influence the network performance with respect to axis generation; thus, we train and test our network on fdata to focus on the quality of the predicted motion axes. Table 5 lists the results of including , , , and . The results demonstrate that our method achieves optimal performance for motion axis estimation when using , , and together. The reason is that the same can be generated by two different parameter sets of , , and ; the loss functions thus constrain each other.
| Loss Function | MD | OE |  |  |  | TA |
| --- | --- | --- | --- | --- | --- | --- |
|  | 0.035 | 0.017 | 0.058 | 0.026 | 0.061 | 0.96 |
|  | 0.031 | 0.017 | 0.051 | 0.036 | 0.086 | 0.90 |
|  | 0.032 | 0.016 | 0.053 | 0.022 | 0.045 | 0.94 |
|  | 0.027 | 0.014 | 0.047 | 0.023 | 0.020 | 0.97 |
Comparison between our method and that of Myronenko and Song [Myronenko_and_Song_arXiv2009]. Our method is self-supervised because our data are not manually annotated, either in training or in testing; the supervision signal is generated automatically from the characteristic distribution of the data. We solve an absolute orientation problem to generate a motion matrix () using the method of Myronenko and Song [Myronenko_and_Song_arXiv2009]. Then, we parse the supervision signal (, , , ) from . We adopt the neural network approach rather than the traditional method of Myronenko and Song [Myronenko_and_Song_arXiv2009] because the traditional method is easily influenced by noise, which degrades the accuracy of the motion matrix. The neural network approach has better stability and robustness in complex situations (e.g., when an object has many motion parts and the data are noisy). The loss function design experiment also indirectly shows that the network output is more robust than using only the supervision parsed from Myronenko and Song [Myronenko_and_Song_arXiv2009], because the result of is better than that of . In addition, we design an experiment to compare our method with that of Myronenko and Song [Myronenko_and_Song_arXiv2009]. The results in Table 6 show that directly using the parsed supervision as output is better than our method on fdata but worse on sdata, because sdata is more complex and noisy than fdata.
Method \ Dataset | fdata | sdata
Myronenko and Song [Myronenko_and_Song_arXiv2009] | 0.018 | 0.027
Ours | 0.021 | 0.023
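The absolute orientation step described above (recovering a rigid motion matrix that aligns two corresponding point sets) can be sketched with the classical SVD-based least-squares solution. This is an illustrative stand-in under our own assumptions (the function name, array shapes, and use of the Kabsch-style solver), not the exact formulation of Myronenko and Song [Myronenko_and_Song_arXiv2009]:

```python
import numpy as np

def absolute_orientation(P, Q):
    """Solve the absolute orientation problem: find the rigid motion
    (R, t) minimizing ||R P + t - Q|| over corresponding point sets.

    P, Q: (N, 3) arrays of corresponding points.
    Returns a 4x4 homogeneous motion matrix.
    """
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)    # centroids
    H = (P - cP).T @ (Q - cQ)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T    # proper rotation
    t = cQ - R @ cP                            # translation
    M = np.eye(4)
    M[:3, :3], M[:3, 3] = R, t
    return M
```

Applying the recovered matrix between consecutive frames of a trajectory yields the per-frame motion from which the supervised information can then be parsed.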
Performance on three datasets. We train and test our network on three different datasets; the performance is reported in Table 7. Each dataset has unique features. The difficulty of fdata stems from its randomness: each element of fdata has entirely different motion parameters, including motion axes and motion ranges. sdata covers various categories, and most 3D shapes in sdata have two or more motion parts. rdata generally consists of partial, single-view 3D shapes with more noise. The results demonstrate that our approach not only learns motion parameters from synthetic data but can also be applied to noisy real data.
Dataset |  | MD | OE | TA |  |  |
fdata |  | 0.027 | 0.014 | 0.047 | 0.023 | 0.020 | 0.97
sdata | 0.87 | 0.032 | 0.027 | 0.046 | 0.027 | 0.023 | 0.99
rdata | 0.79 | 0.058 | 0.036 | 0.079 | 0.037 | 0.058 | 0.99
Analysis on the length of the point cloud sequence. Because our motivation is to learn part mobility from a point cloud sequence, the number of frames () is an important hyperparameter in our approach. Thus, we must determine how many frames to sample from one motion sequence. To this end, we conduct an ablation experiment on to assess the impact of the sequence length, reducing the number of frames from 17 to 3 for sufficient verification. The results are depicted in Fig. 7. As the number of frames increases, the performance gain is consistent but increasingly small. Hence, we adopt 11 frames as input to balance performance and efficiency.
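The frame sampling described above can be sketched as follows. The function name `sample_frames` and the choice of uniform index spacing (keeping the first and last frames so the full motion range is covered) are our own illustrative assumptions:

```python
import numpy as np

def sample_frames(sequence, T=11):
    """Uniformly sample T frames from a point cloud sequence.

    sequence: list of (N, 3) point clouds, one per captured frame.
    Returns T frames, always including the first and last frame
    so the full motion range is covered.
    """
    idx = np.linspace(0, len(sequence) - 1, T).round().astype(int)
    return [sequence[i] for i in idx]
```

With T = 11 (the value adopted in our experiments), a longer capture is thinned to eleven evenly spaced articulation states before trajectory generation.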
Analysis on transfer ability. Here, we aim to verify that our network estimates motion axes by learning feature representations of trajectories rather than only geometric characteristics. Our approach is a self-supervised deep learning algorithm that abstracts motion patterns from trajectories, which enhances its generalization ability. Therefore, we design an experiment in which we train our network on fdata and then test it on sdata. Table 8 presents results for three categories with different levels of complexity. The results demonstrate that the network can produce correct motion axes even though it did not see any real objects during training. This implies that our network learns motion patterns from trajectories and that our algorithm indeed generalizes.
Category |  | MD | OE | TA |  |
Fan | 0.89 | 0.014 | 0.014 | 0.032 | 0.028 | 1
Laptop | 0.95 | 0.018 | 0.013 | 0.035 | 0.021 | 1
Scissor | 0.79 | 0.022 | 0.017 | 0.087 | 0.031 | 1
Effect of trajectory aggregation. The aggregation of trajectories is of great importance in PointRNN, as it describes a motion part as a group of similar motion trajectories. Thus, it is important to analyze how different aggregation designs influence performance. We report five different region sizes with two types of aggregation, max and average, as shown in Fig. 8. As the aggregation radius () increases, PointRNN improves until it peaks at approximately 32. The results demonstrate that features are difficult to learn when too few trajectories are clustered in a local region. In contrast, a larger region usually contains more than one motion part and thus decreases performance. Moreover, max pooling achieves better results than average pooling when aggregating the features in local regions.
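The radius-based grouping with max or average pooling discussed above can be sketched as follows. This is a minimal illustration, not PointRNN's actual layer: the function name `aggregate_trajectories`, the use of first-frame points as grouping keys, and the default radius are our own assumptions:

```python
import numpy as np

def aggregate_trajectories(anchors, keys, features, radius, mode="max"):
    """Aggregate per-trajectory features in local regions.

    For each anchor, gather the features of all trajectories whose
    key point lies within `radius` of the anchor, then pool them.

    anchors: (M, 3) anchor positions.
    keys:    (N, 3) one representative point per trajectory.
    features:(N, C) per-trajectory feature vectors.
    Returns an (M, C) array of pooled features.
    """
    out = np.zeros((len(anchors), features.shape[1]))
    for i, a in enumerate(anchors):
        mask = np.linalg.norm(keys - a, axis=1) <= radius
        group = features[mask]
        if group.size == 0:
            continue  # empty neighborhood: leave zeros
        out[i] = group.max(axis=0) if mode == "max" else group.mean(axis=0)
    return out
```

The radius trades off the two failure modes observed in Fig. 8: too small a region clusters too few trajectories to learn from, while too large a region mixes trajectories from more than one motion part.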
Analysis on the quality of trajectories. We also discuss the impact of trajectory quality. Given a point cloud sequence, we first generate a set of trajectories and then add different levels of random noise to them. Fig. 9 shows the relationship between and the noise level. Our method maintains more than 0.7 () after adding 0.04 random noise. As the noise level increases further, the performance declines sharply. The results demonstrate that our method can tolerate noise to a certain extent but fails under excessive noise.
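The perturbation used in this robustness study can be sketched as below. The function `perturb_trajectories` and the zero-mean Gaussian noise model are illustrative assumptions; the experiment's exact noise distribution may differ:

```python
import numpy as np

def perturb_trajectories(trajs, sigma, rng=None):
    """Add zero-mean Gaussian noise of standard deviation `sigma`
    to every point of every trajectory (to probe robustness).

    trajs: (N, T, 3) array of N trajectories over T frames.
    Returns a noisy copy of the same shape.
    """
    rng = rng or np.random.default_rng(0)
    return trajs + rng.normal(0.0, sigma, size=trajs.shape)
```

Sweeping `sigma` over increasing values and re-evaluating the segmentation quality reproduces the kind of degradation curve reported in Fig. 9.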
Method | Model size | Timing (training / inference)
Yuan et al. [Yuan_CGF2016] |  | / 180 s
Wang et al. [Wang_CVPR2019] | 86.7 MB | 38 h / 7.32 s
Yi et al. [Yi_SIGa2018] | 78 MB | 35.5 h / 4.88 s
Yan et al. [Yan_SIGa19] | 153.3 MB | 13.2 h / 0.55 s
Ours | 25.2 MB | 11.6 h / 0.43 s
Model size and speed. Our proposed model is highly efficient, as it exploits the sparsity of point clouds through clustering and does not require multiple stages like the methods of Wang et al. [Wang_CVPR2019] and Yi et al. [Yi_SIGa2018]. Compared with previous methods (Table 9), our model is more than smaller in size and more than times faster than those of Wang et al. [Wang_CVPR2019] and Yi et al. [Yi_SIGa2018]. Although the method of Yuan et al. [Yuan_CGF2016] is a non-learning method, it takes approximately 3 min to segment a 3D shape and does not provide motion parameters. Furthermore, our model is more than smaller in size than that of Yan et al. [Yan_SIGa19].
4.5 Limitations and future work
There are several avenues for future research. First, our method is based on trajectories, so if the quality of the trajectories is extremely poor, with many fragments, performance decreases. Second, if two parts have exactly the same motion, e.g., two drawers in a cabinet, our method considers them a single motion part. Third, regarding hierarchical mobility extraction, we can segment a 3D object into different motion parts, but the predicted motion axis describes a compound movement; for example, the motion of a bulb holder is a composite of its own movement and that of the lamp post. In the future, it would be interesting to infer parts and motions and discover common articulation patterns from various sensor data. Moreover, our method relies on pre-registration of point clouds using an ICP-like optimization; many deep learning methods for point cloud registration also exist, such as [Pais20203DRegNetAD]. Designing a network that encompasses both trajectory generation and motion attribute estimation in an end-to-end fashion is worthwhile future work. To facilitate future research and make our method easier to reproduce, the source code and dataset are available on GitHub: https://github.com/AGithubRepository/PartMobility.
5 Conclusion
In this paper, we introduce a self-supervised method for parsing 3D shapes into motion parts with motion parameters from point cloud sequences. We first transform point cloud sequences into trajectories. Then, we propose a novel neural network architecture called PointRNN to extract feature representations from trajectories and produce motion hypotheses. Finally, we propose an iterative algorithm for segmenting motion parts. In contrast to previous studies, our approach first predicts part motion parameters, including motion axes and motion ranges, and then segments motion parts guided by these motion hypotheses. Experiments demonstrate that our approach yields significantly better results than both traditional and network-based methods.
Acknowledgement
We thank the anonymous reviewers for their valuable comments. This work was supported in part by the National Key Research and Development Program of China (2018YFC0831003 and 2019YFF0302902), the National Natural Science Foundation of China (61902014, U1736217, and 61932003), and the Pre-research Project of the Manned Space Flight (060601).