Self-Supervised Learning of Part Mobility from Point Cloud Sequence

by Yahao Shi, et al.
Beihang University

Part mobility analysis is essential to a functional understanding of 3D objects, and it is natural to obtain part mobility from the continuous part motion of such objects. In this study, we introduce a self-supervised method for segmenting motion parts and predicting their motion attributes from a point cloud sequence representing a dynamic object. To fully utilize the spatiotemporal information in the point cloud sequence, we generate trajectories by using correlations among successive frames of the sequence instead of directly processing the point clouds. We propose a novel neural network architecture called PointRNN to learn feature representations of trajectories along with their part rigid motions. We evaluate our method on various tasks including motion part segmentation, motion axis prediction, and motion range estimation. The results demonstrate that our method outperforms previous techniques on both synthetic and real datasets. Moreover, our method can generalize to new and unseen objects. Importantly, it requires no prior knowledge of shape structure, shape category, or shape orientation. To the best of our knowledge, this is the first deep learning study to extract part mobility from a point cloud sequence of a dynamic object.




1 Introduction

In the real world, there often exist a number of dynamic and articulated objects, which can be directly operated using their moving parts. For example, we may need to open a refrigerator door to keep or remove an object. If we expect autonomous agents to correctly interact with such objects, they must have the ability to analyze which part of an object is a moving part. Hence, learning part mobility of 3D objects is beneficial for 3D computer vision [Li_CGF2016] and robotics [Hermans_2013] and is closely related to the understanding of object affordances [Myers_ICRA2015], functionality [Hu_CGF2018], and interaction using object recognition [Liu_2019_CVM] and human motion capture data [Roberts_2019_CVM].

Recently, with the emergence of large labeled 3D datasets and deep learning techniques, considerable progress has been made in supervised semantic part segmentation, for example with PointNet++ [Pointnet++] and PointCNN [PointCNN]. However, these segments are often defined by subjective personal experience, which can easily lead to ambiguity. Thus, research on learning movable part segmentation of 3D objects is of more practical significance, and it helps agents understand the essential features of dynamic objects. Moreover, current research on parsing 3D shape representations into semantic parts faces challenges in processing novel object categories and discovering new functional parts, which limits the ability of agents to explore unfamiliar environments. Ideally, intelligent agents should be able to parse 3D shapes into previously unseen functional parts from observations of continuous part motion. Mobility-based shape parsing brings a novel perspective to the part determination problem and provides an unambiguous decomposition. Additionally, a mobility-based part segmentation algorithm enables intelligent agents to make better use of man-made objects designed to function or interact with other objects (including humans).

Figure 1: We first convert a raw point cloud sequence into trajectories. Then, we adopt PointRNN to extract varieties of part rigid motion hypotheses by taking these trajectories as input. Finally, we merge these hypotheses and segment motion parts using an iterative algorithm. The results can propagate from the last frame to other frames.

In this study, we are interested in discovering the part mobility of 3D objects by observing their continuous part motion. We define part mobility as motion part segmentation, motion axis prediction and motion range estimation. In previous studies, motion parts have been typically extracted from one or two single static snapshots of dynamic objects. By contrast, we aim to deduce motion part structure from observations of continuous articulation states of an object. The reason is that speculating the motion parts and confirming motion vectors by observing only a few motion states can easily cause confusion. In this study, we adopt point cloud sequences to represent dynamic objects. Point cloud sequences reveal sufficient spatiotemporal information, and they are easier to obtain than large well-annotated datasets, owing to recent advances in the techniques of real-time 3D acquisition such as commercial RGB-D cameras.

Automatic motion part induction from point cloud sequences is challenging for several reasons. First, objects differ significantly because of their geometry and pose. Motion directions and motion ranges of motion parts have noticeable differences. Second, a point cloud is unordered. There is no tight point correspondence between adjacent frames of point cloud sequences. Third, we must consider the influence factors of acquisition quality of scan data, including noisy and missing data.

We propose a novel self-supervised deep neural network-based method to address the above-mentioned problems. The method, illustrated in Fig. 1, is inspired by the traditional approach of Yan and Pollefeys [Yan_ECCV2006]. Points on the same part produce similar motion trajectories, and a rigid motion hypothesis can be calculated from these similar trajectories. Following these observations, we treat motion part segmentation as a trajectory clustering problem. Moreover, directly handling point cloud sequences in a network is difficult without point correspondence, so our method first transforms a point cloud sequence into a set of trajectories. We design a neural network, PointRNN, to process trajectories; it learns latent trajectory feature representations and generates candidate rigid motion hypotheses represented by motion axes and motion ranges. Finally, we merge the trajectories belonging to a similar rigid motion to achieve motion part segmentation. The experimental results show that our deep learning method tolerates noise better than the traditional method.

Our method is verified through qualitative and quantitative evaluations on both synthetic and real datasets. Additionally, we conduct ablation experiments to confirm the influence of different loss designs, hyperparameter settings, etc. The results of the comparison experiments demonstrate that our method achieves better performance and requires less time than the traditional method [Yuan_CGF2016] and deep learning methods [Wang_CVPR2019][Yi_SIGa2018][Yan_SIGa19]. Moreover, our algorithm has the ability to generalize to novel categories.

In summary, a new self-supervised deep learning method is proposed to parse 3D shapes into moving parts, motion axes, and motion ranges from point cloud sequences without labeling or any prior knowledge. Specifically, our method makes three key contributions. First, we introduce a self-supervised method for learning motion part segmentation, motion axis prediction, and motion range estimation. Second, we propose a novel neural network called PointRNN, which can process trajectories and extract a feature representation. Third, we demonstrate that the performance of our network is superior to that of state-of-the-art methods and that it can generalize to novel object categories.

2 Related Work

Many 3D shape segmentation approaches have been proposed in previous studies to extract moving parts from an RGB-D sequence, a single point cloud, point cloud sequence, and mesh model. Given a specified RGB-D sequence that contains a dynamic object, attempts have been made in previous studies to recover the 3D scene flow, thus discovering moving parts of 3D objects. Jaimez et al. [Jaimez_ICRA2015] proposed a primal–dual algorithm to compute RGB-D flow for estimating heterogeneous and non-rigid motion at a high frame rate. Vogel et al. [Vogel_ECCV2014] introduced a method to recover dense 3D scene flow from multiple consecutive frames in a sliding temporal window. To reconstruct the articulated structure, motion part segmentation can be extracted by point correspondence, as shown by Fayad et al. [Fayad_ICCV2011]. However, common defects in these approaches are their reliance on RGB color to compute scene flow and inability to handle complex structures or large motions.

In the case of a raw 3D point cloud sequence, many studies aimed to establish a point-wise correspondence between consecutive point clouds of an articulated shape [Chang_and_Zwicker_CGF2008][Papazov_and_Burschka_CGF2011]. Yan and Pollefeys [Yan_ECCV2006] cast the problem of motion segmentation of feature trajectories as a linear manifold finding problem and proposed a general framework for motion segmentation under affine projections. In addition, Kim et al. [Kim_IROS2016] considered that trajectories can be grouped by clustering to separate different motion parts. Yuan et al. [Yuan_CGF2016] proposed a local-to-global approach to co-segment point cloud sequences of articulated objects into near-rigid moving parts. Most of the aforementioned methods require a large amount of computation to achieve better performance. These approaches also require considerable effort in setting thresholds from case to case.

Several approaches parse mesh models into moving parts. Mitra et al. [Mitra_2013] inferred the motion of individual parts and the interactions among parts based on their geometry and a few user-specified constraints. They utilized the results to illustrate the motion of mechanical assemblies. Hu et al. [Hu_SIGa2017] introduced a data-driven approach for learning part mobility based on a defined motion pattern from a single static state of 3D objects. However, these methods are based on a well-defined motion pattern and well-segmented 3D objects.

Recent supervised learning approaches in 3D shape segmentation employ deep network architectures to train a classifier on labeled data represented by a point cloud [Pointnet++], volumetric grids [Maturana_and_Scherer_ICRA2015], or spatial data structures [Klokov_ICCV2017]. Subsequent attempts have been made to extract part rigid motions using deep learning. Wang et al. [Wang_CVPR2019] proposed the Shape2Motion network architecture, which takes a single point cloud as input. Their network aims to simultaneously segment motion parts and predict motion axes based on a large, well-annotated dataset. However, their approach has difficulty in discovering new motion parts and generalizing to novel categories. Their network requires a considerable amount of time for supervised learning using a large dataset. Another type of method, such as that proposed by Yi et al. [Yi_SIGa2018], infers part motion flow and estimates part correspondence by comparing two different motion states. Unfortunately, they did not perform a specific analysis of motion attributes such as motion axes and motion ranges. Behl et al. [Behl_CVPR2019] and Liu et al. [Liu_CVPR2019] extended this class of algorithms to identify motion flow in scene data. However, these methods still rely on supervised learning. Yan et al. [Yan_SIGa19] introduced RPM-Net to infer movable parts of a single point cloud. They adopted a recurrent neural network (RNN) to segment movable parts by forecasting a temporal motion sequence of 3D objects. In contrast, our network takes point cloud sequences as input instead of a single point cloud. Furthermore, we use an RNN to process each trajectory locally rather than the entire point cloud directly and globally.

Deformable shape registration [ARAP] is a basic problem in computational geometry and widely applied to various fields such as computer vision. There are various deformable registration algorithms for point clouds. Many methods are extensions of the classic ICP algorithm [BM92]. Papazov and Burschka [Papazov_and_Burschka_CGF2011] proposed an algorithm that computes shape transitions based on local similarity transforms, thereby allowing it to model not only as-rigid-as-possible deformations but also shapes on local and global scales.

3 Method

Our method comprises three steps for learning part mobility, as shown in Fig. 1. First, we adopt a deformable registration algorithm using point correspondence to generate trajectories (Sec. 3.1). Second, we propose a network architecture called PointRNN to produce candidate motion hypotheses extracted from trajectories (Sec. 3.2). Third, we design an iterative algorithm for removing redundant motion hypotheses and segmenting a point cloud into several motion parts while considering the match degree between trajectories and motion hypotheses (Sec. 3.3). The second step of our method occurs within the learning pipeline, and the other two steps occur outside it. In this section, we discuss the modules of our approach in detail.

3.1 Trajectory generation

The trajectories are input to our network, PointRNN (Fig. 2). To ensure a fair comparison with Yuan et al. [Yuan_CGF2016], we use the same trajectory generation method, namely that of Papazov and Burschka [Papazov_and_Burschka_CGF2011], which is based on local similarity transforms. Trajectory generation is a replaceable module in our method. All the trajectories are cropped to an equal length to facilitate deep neural network processing.

Given two adjacent frames, and , of a point cloud sequence, the registration algorithm includes iterative steps of correspondence and deformation. In terms of the correspondence step, we find the closest point in for each point . We optimize by searching over its -nearest neighbors to make the match error smooth. Formally, we compute a mapping, , by minimizing the Laplacian smoothness energy of the residual field :


where the residual Laplacian, , is defined as


Here, is the -nearest neighbors of . In terms of the deformation step, we minimize a fitting error between the local neighborhood of and its matched counterpart to estimate a similarity transformation, denoted as , by using the singular value decomposition.



3.2 Part rigid motion hypothesis extraction

PointRNN network architecture. We propose a novel neural network called PointRNN which has the capacity to process trajectories and extract the feature representation, as illustrated in Fig. 2.

Given a set of trajectories, ( in our implementation), has a point set, ( by default), with and a motion direction set, ( by default), with . There is a one-to-one relationship between these two sets. For each trajectory, we employ a Long Short-Term Memory (LSTM) network [LSTM] to encode time information, because LSTM has demonstrated high accuracy in tasks related to the processing of sequential and time-series data. Yan et al. [Yan_SIGa19] used the PointNet++ encoder to create a global feature for the input point cloud. Then, they fed this feature vector to the LSTM network. Compared to their approach, ours pays more attention to local and detailed features for each trajectory.

Figure 2: The PointRNN network architecture consists of three parts. The Sample&Group module is employed to learn temporal and spatial features from trajectories. We adopt the Interpolate module to recover original resolution features from global features. We feed these features to the Sample&Predict module to estimate motion parameters. Our supervised information is automatically obtained from the input.

We assume that similar trajectories should belong to one motion part, as inspired by the traditional clustering method [Yan_ECCV2006]. Thus, we design a Sample&Group module to learn the spatial features with their underlying motion. Specifically, we first employ an LSTM network to encode time information for each trajectory. To encode spatial information, we then use iterative farthest point sampling to obtain a subset of trajectories, , such that is the most distant trajectory (in metric distance) from set . The metric distance function between two trajectories, , is as follows:


We adopt Euclidean distance rather than Hausdorff distance or Fréchet distance, because the latter two metrics require significantly more computation without yielding improved performance. Next, we search a neighborhood trajectory set, , for each sampled trajectory, , using the -nearest method, where the metric distance between and () does not exceed a certain threshold . Then, we obtain a number of overlapping partitions of trajectories. Finally, we aggregate the features of the trajectory set as regional spatial features and express them as features of . In our method, and are tunable parameters in different layers.
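The sampling and grouping steps can be illustrated with a minimal NumPy sketch. It is not the network module itself: the trajectory distance is assumed here to be the mean per-frame Euclidean distance between corresponding points (the exact formula is not reproduced in this text), and the function names and the parameters `m`, `k`, and `radius` are hypothetical.

```python
import numpy as np

def trajectory_distance(a, b):
    """Assumed metric: mean per-frame Euclidean distance, trajectories of shape (T, 3)."""
    return float(np.linalg.norm(a - b, axis=1).mean())

def pairwise_trajectory_distances(trajs):
    """All-pairs version of the assumed metric for trajs of shape (N, T, 3)."""
    diff = trajs[:, None, :, :] - trajs[None, :, :, :]
    return np.linalg.norm(diff, axis=-1).mean(axis=-1)

def farthest_trajectory_sampling(dist, m, start=0):
    """Iteratively pick the trajectory farthest from the already chosen set."""
    chosen = [start]
    min_dist = dist[start].copy()          # distance of each trajectory to the set
    for _ in range(m - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return np.array(chosen)

def group_neighbors(dist, centers, k, radius):
    """For each sampled trajectory, keep at most k nearest ones within radius."""
    groups = []
    for c in centers:
        order = np.argsort(dist[c])[:k]
        groups.append(order[dist[c][order] <= radius])
    return groups
```

Because neighborhoods are built independently per sampled trajectory, the resulting groups may overlap, which matches the overlapping partitions described above.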

We design an Interpolate module to propagate global spatial features to single trajectory features with distance-based interpolation and skip-connection. We achieve feature propagation by interpolating features from the -nearest trajectory set (), , to each trajectory, . We use the inverse distance-weighted average based on the -nearest neighbors in interpolation. Finally, we feed these interpolated trajectory features into a Multilayer Perceptron (MLP) network to encode single trajectory features.
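The interpolation step can be sketched as follows. This is an illustrative NumPy version of inverse distance-weighted averaging only; the skip-connection and the MLP are omitted, and the function name, the default `k=3`, and the stabilizer `eps` are assumptions rather than values taken from the paper.

```python
import numpy as np

def interpolate_features(dist_to_samples, sample_feats, k=3, eps=1e-8):
    """Inverse distance-weighted average of the k nearest sampled features.

    dist_to_samples: (N, M) distances from every trajectory to every sampled one.
    sample_feats:    (M, C) features attached to the sampled trajectories.
    Returns an (N, C) array of propagated features.
    """
    idx = np.argsort(dist_to_samples, axis=1)[:, :k]       # (N, k) nearest samples
    d = np.take_along_axis(dist_to_samples, idx, axis=1)   # (N, k) their distances
    w = 1.0 / (d + eps)                                    # eps avoids division by zero
    w = w / w.sum(axis=1, keepdims=True)                   # weights sum to one per row
    return (w[:, :, None] * sample_feats[idx]).sum(axis=1)
```

A trajectory that coincides with a sampled one receives essentially that sample's feature, while a trajectory equidistant from its neighbors receives their plain average.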

By observing most objects in the real world, we found that the number of motion parts is finite and small. Therefore, we assume that the number of motion parts is no more than 10 based on our dataset. We design a Sample&Predict module to generate part rigid motion hypotheses. Similar to the Sample&Group module, we obtain () trajectory sets using farthest point sampling and -nearest method (). We consider that these sets can cover all the motion parts of 3D shapes. We use pooling to obtain aggregated spatial features for each trajectory set. Then, we obtain temporal features using an LSTM network. Finally, we feed the temporal and spatial features to an MLP network. The network outputs a motion axis including a start point () and direction (), and motion ranges including a shifted distance () and a rotation angle () (Fig. 3).

Self-supervised loss function design.

To reconstruct the motion sequences, we compute the motion matrix, , from the network output. The motion matrix is composed of a rotation matrix and a translation matrix (). is easy to generate by using and . The problem of solving can be reduced to transforming the point set rotating around the given axis (including and ) into the point set rotating around the z-axis of the standard coordinate system, . The basic idea of deriving this matrix is to divide the problem into a few known simple steps. First, we align the start of the given axis with the coordinate origin and move the point set. Next, we rotate the given axis and the point set such that the axis lies in one of the coordinate planes. Then, we rotate the given axis and the point set such that the axis is aligned with the z-axis. We use one of the fundamental rotation matrices to rotate the point sets using . Finally, we undo the first three steps in reverse order.
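The step-by-step construction can be sketched with homogeneous 4x4 matrices. This is a minimal NumPy illustration under stated assumptions: the axis is first rotated about z into the xz-plane and then about y onto the z-axis (the text does not say which coordinate plane is used), the shift along the axis is applied last, and the names `motion_matrix`, `apply_motion`, `p`, `d`, `theta`, and `shift` are hypothetical.

```python
import numpy as np

def _translate(v):
    m = np.eye(4)
    m[:3, 3] = v
    return m

def _rot_z(a):
    c, s = np.cos(a), np.sin(a)
    m = np.eye(4)
    m[:2, :2] = [[c, -s], [s, c]]
    return m

def _rot_y(a):
    c, s = np.cos(a), np.sin(a)
    m = np.eye(4)
    m[0, 0], m[0, 2], m[2, 0], m[2, 2] = c, s, -s, c
    return m

def motion_matrix(p, d, theta, shift=0.0):
    """Rotate by theta about the axis through p with direction d, then shift along d."""
    p = np.asarray(p, float)
    d = np.asarray(d, float)
    d = d / np.linalg.norm(d)
    alpha = np.arctan2(d[1], d[0])                  # angle that brings d into the xz-plane
    beta = np.arctan2(np.hypot(d[0], d[1]), d[2])   # angle between d and the z-axis
    # Steps 1-3: move p to the origin, put the axis in the xz-plane, align it with z.
    to_z = _rot_y(-beta) @ _rot_z(-alpha) @ _translate(-p)
    # Undo steps 3 to 1.
    from_z = _translate(p) @ _rot_z(alpha) @ _rot_y(beta)
    return _translate(shift * d) @ from_z @ _rot_z(theta) @ to_z

def apply_motion(m, points):
    """Apply a 4x4 motion matrix to points of shape (N, 3) by pre-multiplication."""
    homo = np.hstack([points, np.ones((len(points), 1))])
    return (homo @ m.T)[:, :3]
```

For example, `apply_motion(motion_matrix(p, d, theta, shift), first_frame)` would rebuild a later frame from the first one for points lying on a single rigid part.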

We can rebuild the last frame () from the first frame () by pre-multiplication ( on the left) in our neural network. To apply the backpropagation algorithm, we implement the partial derivative for each network output in our neural network. Thus, we define the first loss function, called the rebuild loss function, as follows.


For each trajectory set in the Sample&Predict module, we solve an absolute orientation problem to obtain an approximate solution of rigid motion (). Our solution for the absolute orientation problem is based on the method of Myronenko and Song [Myronenko_and_Song_arXiv2009]. In addition, we guarantee that the determinant of is 1 to ensure that has a rotation matrix rather than a reflection matrix. Then, we extract rotation matrix from . Similarly, we can obtain from . We define the relative pose estimation loss function, which was used to measure the angular distance between and in the study by Suwajanakorn et al. [Supasorn_NIPS2018].
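A closed-form solution to the absolute orientation problem with the determinant constraint can be sketched as below. This is a standard SVD-based least-squares fit written in NumPy, offered as an illustration and not as the exact procedure of the cited method; the angular distance is written in its trace form, and the function names are hypothetical.

```python
import numpy as np

def fit_rigid_motion(src, dst):
    """Least-squares rotation r and translation t with dst ~ src @ r.T + t.

    src, dst: (N, 3) corresponding points, e.g. first and last trajectory points.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    h = (src - src_c).T @ (dst - dst_c)           # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    sign = np.sign(np.linalg.det(vt.T @ u.T))
    fix = np.diag([1.0, 1.0, sign])               # forces det(r) = +1 (no reflection)
    r = vt.T @ fix @ u.T
    t = dst_c - r @ src_c
    return r, t

def rotation_angle_between(r_a, r_b):
    """Angular (geodesic) distance in radians between two rotation matrices."""
    cos = (np.trace(r_a.T @ r_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```

The angular distance is zero when the two rotations coincide and grows up to pi as they diverge, which is the quantity a relative pose loss would penalize.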


We found that the network has a slow convergence and easily falls into a local optimum solution using only and . The reason for this is that it is possible to generate the same by using two different parameter sets of , , , and . To solve this problem, we compute an approximate solution of , , , and denoted as , , , and , respectively. In general, can be written more concisely as Rodrigues’ rotation formula [RodriguesRotationFormula] as follows:


where is the cross product matrix of , is the outer product, and I is the identity matrix. We can easily obtain and from equation (7).


where is the trace of .
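Rodrigues' rotation formula and its inversion can be checked numerically with a short NumPy sketch. The symbol names `d` (unit axis direction) and `theta` (rotation angle) are assumptions introduced for illustration, and the inversion below is only valid when the angle is strictly between 0 and pi.

```python
import numpy as np

def cross_matrix(d):
    """Skew-symmetric cross product matrix K with K @ x == np.cross(d, x)."""
    return np.array([[0.0, -d[2], d[1]],
                     [d[2], 0.0, -d[0]],
                     [-d[1], d[0], 0.0]])

def rodrigues(d, theta):
    """R = cos(theta) I + sin(theta) [d]x + (1 - cos(theta)) d d^T for unit d."""
    d = np.asarray(d, float) / np.linalg.norm(d)
    return (np.cos(theta) * np.eye(3)
            + np.sin(theta) * cross_matrix(d)
            + (1.0 - np.cos(theta)) * np.outer(d, d))

def axis_angle_from_rotation(r):
    """Recover (d, theta) from R; the angle comes from the trace of R."""
    theta = np.arccos(np.clip((np.trace(r) - 1.0) / 2.0, -1.0, 1.0))
    # R - R^T = 2 sin(theta) [d]x, so the axis is read off the antisymmetric part.
    v = np.array([r[2, 1] - r[1, 2], r[0, 2] - r[2, 0], r[1, 0] - r[0, 1]])
    return v / (2.0 * np.sin(theta)), float(theta)
```

Note that `rodrigues(d, theta)` and `rodrigues(-d, -theta)` give the same matrix, which is one way two different parameter sets can produce an identical rotation.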


We extract from , where is a vector from the centroid of to that of . is the length of projected to .


Finally, we solve linear algebraic equations to obtain , where I is the identity matrix.
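One way such a linear system can arise is sketched below, under assumptions that go beyond what survives in this text. For a rigid motion x' = R x + t with unit axis direction d, the shift along the axis is the projection of t onto d, and a point p on the axis satisfies (I - R) p = t - (t . d) d. Since I - R is singular along d, the sketch uses a least-squares solve, which returns the axis point closest to the origin. The names `r`, `t`, `d`, and `axis_point_and_shift` are hypothetical.

```python
import numpy as np

def axis_point_and_shift(r, t, d):
    """Split x' = r @ x + t into a shift along d and a point p on the rotation axis.

    Assumes d is the rotation axis of r (so r @ d == d) and the angle is nonzero.
    """
    d = np.asarray(d, float) / np.linalg.norm(d)
    shift = float(t @ d)                      # length of t projected onto d
    rhs = t - shift * d                       # translation component normal to d
    # (I - r) has rank 2, so take the minimum-norm least-squares solution.
    p, *_ = np.linalg.lstsq(np.eye(3) - r, rhs, rcond=None)
    return p, shift
```

Any other point on the same axis differs from the returned `p` only by a multiple of `d`, so comparisons between axes should account for that freedom.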


In the case of the general situation in which the start of an axis is not at the coordinate origin, we can obtain the same result. Then, we define the axis loss function as follows:


where is the cosine distance between and . Given that is difficult to compute, network output should satisfy rather than directly regressing by . All formula deductions and practical calculation procedures are presented in the appendix.

The final loss function is the weighted sum of (5), (6), and (12).


In our implementation, by default. In addition, we set a threshold for each loss function to tolerate error and improve robustness, because approximate solutions of , , , and usually have errors with respect to the ground truth, especially in real scan data.

Figure 3: Motion parameters (, , , and ). is the start of the motion axis. The motion direction () is a unit vector. is the translation distance, and is the rotation angle.

3.3 Motion part segmentation and optimization

We obtain part rigid motion hypotheses from PointRNN. These hypotheses are redundant, so we must merge similar ones. We consider that there is no movement if and are no more than (0.1 radian by default) and (0.05 length by default), respectively, considering noisy data and computation errors. Further, we define rotation as and and translation by contrast. The combination of rotation and translation satisfies the conditions that and . We merge the remaining motion axes by comparing the differences in , , and .
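The thresholding rule can be sketched as a small classifier. The default tolerances 0.1 (radians) and 0.05 (length) come from the text; the exact inequalities are assumed, since the original conditions are not reproduced here, and the function and label names are hypothetical.

```python
def classify_motion(theta, shift, angle_tol=0.1, shift_tol=0.05):
    """Label a motion hypothesis from its rotation angle and shifted distance."""
    rotating = abs(theta) > angle_tol    # angle exceeds the rotation tolerance
    shifting = abs(shift) > shift_tol    # distance exceeds the translation tolerance
    if rotating and shifting:
        return "rotation+translation"
    if rotating:
        return "rotation"
    if shifting:
        return "translation"
    return "static"                      # both within tolerance: no movement
```

Hypotheses labeled "static" would be dropped before merging, and only hypotheses of the same type would need their axes compared.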

We define the metric distance between trajectory and motion hypotheses as the match degree. We use the motion hypotheses to rebuild . Then, the metric distance has two terms: rebuild loss and direction cosine distance.


in our implementation. Then, we obtain refined motion axes and motion parts using an iteration algorithm. Inspired by Non-Maximum Suppression, we first set the refined motion axis set, , as empty. We assign the trajectories to candidate axes. Then, we choose the candidate axis that has the most votes. If is empty, we add this candidate axis to . If is not empty, we compute whether trajectories belonging to this candidate axis can also belong to an axis in . If not, we add this candidate axis to . We iterate these steps until all trajectories are assigned to a motion axis. We optimize coarse segments by examining whether the label of a trajectory is the same as most of the labels of its -nearest neighbors.
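The voting and suppression loop can be sketched as follows. This is a simplified NumPy illustration and not the exact procedure: `cost` is assumed to hold the match degree between each trajectory and each hypothesis (lower is better), while `tol` and `min_fraction`, which decide when an already kept axis explains a candidate's trajectories, are hypothetical parameters.

```python
import numpy as np

def select_hypotheses(cost, tol, min_fraction=0.5):
    """Greedy, NMS-like selection of motion hypotheses.

    cost: (N, H) match degree between N trajectories and H hypotheses.
    Returns the kept hypothesis indices and one label per trajectory.
    """
    best = cost.argmin(axis=1)                        # each trajectory votes once
    votes = np.bincount(best, minlength=cost.shape[1])
    kept = []
    for cand in np.argsort(-votes, kind="stable"):    # most votes first
        if votes[cand] == 0:
            break
        members = np.flatnonzero(best == cand)
        if kept:
            # Fraction of this candidate's trajectories already explained by a kept axis.
            explained = (cost[members][:, kept].min(axis=1) <= tol).mean()
            if explained >= min_fraction:
                continue
        kept.append(int(cand))
    labels = np.array(kept)[cost[:, kept].argmin(axis=1)]
    return kept, labels

def smooth_labels(labels, neighbors):
    """Replace each label by the majority label of its nearest neighbors (N, k)."""
    out = labels.copy()
    for i, nb in enumerate(neighbors):
        vals, counts = np.unique(labels[nb], return_counts=True)
        out[i] = vals[counts.argmax()]
    return out
```

Near-duplicate hypotheses lose to the one with more votes, and the final neighbor vote removes isolated mislabeled trajectories from the coarse segments.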

4 Experiments

In this section, we first introduce the benchmark datasets used in the experiments (Sec. 4.1). We then compare our approach with previous state-of-the-art methods on the tasks of motion part segmentation, motion attribute estimation, and 3D flow determination (Sec. 4.2). We also replace the PointRNN backbone with two different network designs to evaluate the effect of PointRNN (Sec. 4.3). Finally, we conduct ablation experiments on the loss function design, transfer ability, hyperparameters, etc. (Sec. 4.4). Further analyses and visualizations are provided in the appendix.

4.1 Dataset settings

We evaluate our methods on three different datasets, including two synthetic sets (f-data, s-data) and one real set (r-data). A fake dataset (f-data) is generated randomly and automatically. f-data enables the neural network to learn the common motion patterns of random trajectories. We leverage two annotated datasets including the Motion dataset [Wang_CVPR2019] and PartNet [Mo_2019_CVPR] to construct the second synthetic dataset (s-data). Xiang et al. [Xiang_2020_SAPIEN] enriched the PartNet dataset with motion attributes. We choose 31 categories including cabinet, lamp, and window to generate motion sequences. For the real data (r-data), we select certain reasonable motion sequences in the RBO dataset [RBO_dataset]. Moreover, Yi et al. [Yi_SIGa2018] provided a small real scan dataset. Additionally, we scan some real data sequences in-house. More details on these datasets are provided in the appendix.

Figure 4: Three frames of point cloud sequences in each of the three datasets used in our experiments.

4.2 Comparison with state-of-the-art methods

We test our method against four alternatives, including both non-learning and learning approaches. Specifically, we compare our method with the traditional method of Yuan et al. [Yuan_CGF2016] and three network-based approaches including Yi et al. [Yi_SIGa2018], Wang et al. [Wang_CVPR2019], and Yan et al. [Yan_SIGa19]. To the best of our knowledge, we are the first to propose an approach using deep learning to extract part mobility from a point cloud sequence. To ensure a fair comparison, we implement a space-time co-segmentation baseline following that of Yuan et al. [Yuan_CGF2016]. Because their method also needs trajectories as input, we employ the same trajectory generation algorithm [Papazov_and_Burschka_CGF2011]. Yi et al. [Yi_SIGa2018] proposed a neural network architecture with three modules that propose correspondences, estimate 3D deformation flows, and perform segmentation. Their method takes as input a pair of point clouds representing two different articulation states to segment motion parts and estimate 3D flows. To better provide a contrasting experiment between their approach and ours, we set in our network settings. That is, we also take a pair of point clouds as input to train and test our network. In terms of the approach of Wang et al. [Wang_CVPR2019], we train and evaluate their network and ours using the same training and testing data. However, it must be mentioned that their network requires a single point cloud as input. Considering this difference, the data for their network are randomly sampled frames from point cloud sequences. For the comparison between Yan et al. [Yan_SIGa19] and our method, the data for their approach are the first frame of the point cloud sequence, because their approach segments motion parts by predicting a temporal sequence from a single point cloud.

Figure 5: Results of our approach compared with the ground truth on both synthetic and real datasets. Rows 1–5 show the results on the synthetic dataset and row 6 shows the results on the real dataset. We show three frames (first, middle, and last) of the point cloud sequence. All results are shown in the first frame.

We use seven metrics to evaluate motion part segmentation and motion axis prediction. We use to evaluate motion part segmentation. We measure the Minimum Distance (MD), Orientation Error (OE) and Type Accuracy (TA), as introduced in [Wang_CVPR2019], which are used for measuring the distance, angle, and motion type accuracy, respectively, between the predicted motion axis line and the ground truth. Moreover, we employ Rand Index (RI) and End-Point-Error (EPE) to evaluate motion part segmentation and 3D flows in comparison with Yi et al. [Yi_SIGa2018]. To compare the method of Yan et al. [Yan_SIGa19] with our method, we also use the mean Average Precision (mAP) as defined in their paper to measure the segmentation accuracy.
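Several of these metrics have standard definitions that can be sketched in NumPy. The versions below are assumptions for illustration: the Rand Index is the fraction of point pairs on which two labelings agree, EPE is the mean Euclidean norm of the flow error, the orientation error treats an axis as undirected, and the minimum distance is the distance between two infinite lines. The evaluation code used in the cited papers may differ in details.

```python
import numpy as np

def rand_index(a, b):
    """Fraction of point pairs grouped consistently by labelings a and b."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)          # each unordered pair once
    return float((same_a[iu] == same_b[iu]).mean())

def end_point_error(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and true 3D flow vectors."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=1).mean())

def orientation_error(d_pred, d_gt):
    """Angle in radians between two axis directions, ignoring their sign."""
    cos = abs(np.dot(d_pred, d_gt)) / (np.linalg.norm(d_pred) * np.linalg.norm(d_gt))
    return float(np.arccos(np.clip(cos, 0.0, 1.0)))

def line_distance(p1, d1, p2, d2):
    """Minimum distance between the lines p1 + s*d1 and p2 + u*d2."""
    n = np.cross(d1, d2)
    if np.linalg.norm(n) < 1e-9:               # parallel lines
        return float(np.linalg.norm(np.cross(p2 - p1, d1)) / np.linalg.norm(d1))
    return float(abs(np.dot(p2 - p1, n)) / np.linalg.norm(n))
```

The Rand Index is invariant to how part labels are numbered, which makes it suitable for comparing segmentations whose part identifiers are arbitrary.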

Method                                MD     OE     TA
Yuan et al. [Yuan_CGF2016]     0.71   -      -      -
Wang et al. [Wang_CVPR2019]    0.67   0.051  0.055  0.92
Ours                           0.87   0.032  0.027  0.99
Table 1: Comparison between our method and those of Yuan et al. [Yuan_CGF2016] and Wang et al. [Wang_CVPR2019] in terms of , MD, OE, and TA. A higher value indicates more complete motion part segmentation. Lower values of MD and OE and higher TA values indicate more accurate axis prediction.

Yuan et al. [Yuan_CGF2016] adopted a clustering method to obtain local segments and propagated these segments to the neighboring frames. Finally, they merged all frames using a space–time segment grouping technique to obtain the final motion part segmentation. However, their method tends to have degraded performance when dealing with tiny motion parts. Moreover, their approach has difficulty in processing 3D shapes with many diverse large motions. Table 1 shows that our approach outperforms theirs in terms of . They did not exploit the fact that motion parts have local characteristics. They generated a motion hypothesis by randomly choosing a number of trajectory triplets. This resulted in the creation of several fragments that were difficult to group, especially in tiny parts. Additionally, the parameters of their approaches were different depending on the case.

Wang et al. [Wang_CVPR2019] parsed 3D shapes into part mobility by observing a single static point cloud, relying on a large, well-labeled dataset. Because of this supervised learning, their method can only predict motion parameters of 3D shapes that resemble those seen in the training data. It is also easy to get confused when extracting motion parts from a single snapshot. For example, it is very difficult to discover whether a door opens on the right or left when it is closed. More importantly, their network is likely to consider that a door is not a motion part when it is closed. They used a similarity matrix to segment motion parts. However, this type of method is also not conducive to dealing with tiny parts. In contrast to their method, which directly predicts an axis as a regression problem, we minimize the cosine distance for and solve linear algebraic equations to compute , which makes our result more robust and stable than theirs (Table 1).

Figure 6: Our results are compared with those of Yuan et al. [Yuan_CGF2016] and Wang et al. [Wang_CVPR2019].

Yi et al. [Yi_SIGa2018] aimed to discover motion parts of objects and estimated 3D flows by analyzing the underlying articulation states and geometry of shapes. Their network architecture alternates between correspondence, deformation flow, and segmentation prediction iteratively in an ICP-like fashion. Theirs is a supervised learning method, which estimates 3D point-wise flows at first and then segments the motion parts. In contrast to their method, we compute the motion axes and motion ranges, and then segment motion parts using these motion parameters. The results demonstrate that our method achieves higher and RI, and lower EPE than theirs.

Method RI EPE
Yi et al. [Yi_SIGa2018] 0.71 0.80 0.029
Ours 0.83 0.87 0.024
Table 2: Comparison between our method and that of Yi et al. [Yi_SIGa2018] in terms of , RI, and EPE. Higher and RI values indicate more complete motion part segmentation. A lower EPE indicates more accurate flow prediction.
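For readers unfamiliar with two of the metrics in Table 2, the sketch below shows the standard definitions of the end-point error (EPE) and the Rand index (RI) in Python with NumPy. It assumes per-point flows of shape (N, 3) and integer part labels of length N; the function names are illustrative, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def end_point_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth per-point flow."""
    pred_flow = np.asarray(pred_flow, dtype=float)
    gt_flow = np.asarray(gt_flow, dtype=float)
    return float(np.linalg.norm(pred_flow - gt_flow, axis=1).mean())

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two segmentations agree.

    A pair agrees if both segmentations put the two points in the same
    part, or both put them in different parts. Needs at least two points.
    """
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(a.shape[0], k=1)  # each unordered pair once
    return float((same_a[upper] == same_b[upper]).mean())
```

The Rand index is invariant to how the part labels are numbered, which is why it suits segmentation results whose part identifiers are arbitrary. The pairwise comparison above uses O(N^2) memory, so it is meant for small illustrative inputs.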

Yan et al. [Yan_SIGa19] predicted a temporal sequence of pointwise displacements from the input shape using an LSTM. Their RPM-Net then used these displacements to learn all the movable points. Next, they obtained a pointwise distance matrix by computing the Euclidean distance between the learned point features. Finally, they applied a clustering method to separate the points into motion parts according to this distance matrix. Although they use an LSTM to model part mobility as a point cloud sequence, their input is still a single point cloud. Table 3 shows that the results obtained from the two methods are close to each other. Although their method deals with motion segmentation well, it has the same defect as that of Wang et al. [Wang_CVPR2019], which is due to supervised learning. It is difficult to infer a motion sequence from a single point cloud.

Method mAP
Yan et al. [Yan_SIGa19] 0.86 0.76
Ours 0.87 0.77
Table 3: Comparison between our method and that of Yan et al. [Yan_SIGa19] in terms of and mAP. Higher and mAP values indicate more complete motion part segmentation.

4.3 Analysis of the effects of different network designs.

In addition to the metrics used in the comparison experiments, three additional metrics, , , and , are presented in Sec. 4.3 and Sec. 4.4. Our approach also estimates motion ranges such that and are errors between the network outputs and ground truth. is the error of rebuilding the last frame from the first frame, which is used for a comprehensive evaluation. Moreover, we report the model size and the timing, which includes training time and testing time, to evaluate the efficiency of the algorithm.

To verify the effectiveness of PointRNN on the tasks of motion part segmentation and motion attribute estimation, we design two baselines that replace the network backbone. For baseline1, we adopt a slightly modified version of the method of Liu et al. [Liu_2019_ICCV]. We combine point clouds to obtain a single point cloud. Next, we feed this single point cloud to PointNet++ to extract features. Finally, we minimize the loss function to estimate motion axis and motion range. For baseline2, we follow Yan et al. [Yan_SIGa19]. We extract global features per frame using a shared PointNet++. Then, we adopt an LSTM to encode the temporal information. Similarly, we use as a loss function. For more details about these baselines, please refer to the appendix. The results show that we can achieve better performance by using PointRNN, because our network can extract local and detailed spatiotemporal features.

Method MD OE TA
baseline1 0.86 0.029 0.054 0.054 0.042 0.025 0.97
baseline2 0.84 0.034 0.062 0.064 0.052 0.029 0.96
Ours 0.88 0.027 0.028 0.046 0.028 0.023 0.98
Table 4: Comparison of our network with two other network designs in terms of , MD, OE, , , , and TA. A higher indicates more complete motion part segmentation. Lower MD, OE, , , and higher TA indicate more accurate axis prediction. is a comprehensive evaluation of motion part segmentation and axis prediction, where a lower value is better.

4.4 Ablation experiments

Effect of different loss function designs. To analyze the consequences of different loss function designs, we experiment with ablated versions of our network trained with four combinations of the loss functions. Different loss designs mainly influence the network performance with respect to axis generation. Thus, we train and test our network on f-data to focus on examining the quality of the predicted motion axes. Table 5 lists the results of including , , , and . The results demonstrate that our method achieves the best performance for motion axis estimation when using , , and together. The reason for this is that it is possible to generate the same using two different parameter sets of , , and . There is an interactive constraint between two loss functions.

Loss Function MD OE TA
0.035 0.017 0.058 0.026 0.061 0.96
0.031 0.017 0.051 0.036 0.086 0.90
0.032 0.016 0.053 0.022 0.045 0.94
0.027 0.014 0.047 0.023 0.020 0.97
Table 5: Effectiveness of different loss function designs. of f-data is not needed. Lower MD, OE, , , , and higher TA indicate more accurate axis prediction.

Comparison between our method and that of Myronenko and Song [Myronenko_and_Song_arXiv2009]. Our method is self-supervised, because our data are not manually annotated, either in training or testing. Instead, the supervised information is generated automatically from the characteristic distribution of the data. We solve an absolute orientation problem to generate a motion matrix () using the method of Myronenko and Song [Myronenko_and_Song_arXiv2009]. Then, we parse the supervised information (, , , ) from . We adopt the neural network approach rather than relying only on the traditional method of Myronenko and Song [Myronenko_and_Song_arXiv2009] because the traditional method is easily influenced by noise, which reduces the accuracy of the motion matrix. The neural network approach has better stability and robustness in complex situations (e.g., when an object has many motion parts and the data are noisy). The loss function experiment indirectly shows that the network output is more robust than using only the supervised information parsed from Myronenko and Song [Myronenko_and_Song_arXiv2009], because the result of is better than that of . In addition, we design an experiment to compare our method with that of Myronenko and Song [Myronenko_and_Song_arXiv2009]. The results presented in Table 6 show that directly using the supervised information as an output is better than ours on f-data but worse on s-data, because s-data is more complex and noisy than f-data.
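As an illustration of the absolute orientation step described above, the following sketch fits a rigid motion between two corresponding point sets with the well-known SVD-based closed-form solution and then reads a rotation axis and angle off the rotation matrix. It is a generic textbook version written in Python with NumPy, not the authors' implementation; the function names, and the choice of axis and angle as the parsed quantities, are assumptions made for this example.

```python
import numpy as np

def fit_rigid_motion(src, dst):
    """Least-squares R (3x3) and t (3,) such that dst ~= src @ R.T + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last singular direction if needed so that det(R) = +1
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

def rotation_axis_angle(R, tol=1e-8):
    """Axis (unit vector or None) and angle in radians of a rotation matrix.

    The axis is returned as None when it is numerically undefined
    (angle close to 0 or to pi), e.g. for a pure translation.
    """
    angle = float(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)))
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    norm = np.linalg.norm(w)
    if norm < tol:
        return None, angle
    return w / norm, angle
```

Because the fit is a least-squares solution over all correspondences, a few bad correspondences shift the estimated matrix directly, which is consistent with the noise sensitivity discussed above.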

Method f-data s-data
Myronenko and Song [Myronenko_and_Song_arXiv2009] 0.018 0.027
Ours 0.021 0.023
Table 6: Comparison between our method and that of Myronenko and Song [Myronenko_and_Song_arXiv2009] in terms of the performance of EPE, where a lower EPE value is better.

Performance on three datasets. We train and test our network using three different datasets. The performance on these datasets is reported in Table 7. Each dataset has its own characteristics. The difficulty of f-data stems from its randomness: each element of f-data has totally different motion parameters, including motion axes and motion ranges. s-data covers various categories, and most 3D shapes in s-data have two or more motion parts. r-data generally contains partial, single-view scans of 3D shapes with more noise. The results demonstrate that our approach can not only learn motion parameters from synthetic data but can also be applied to noisy real data.

Dataset MD OE TA
f-data - 0.027 0.014 0.047 0.023 0.020 0.97
s-data 0.87 0.032 0.027 0.046 0.027 0.023 0.99
r-data 0.79 0.058 0.036 0.079 0.037 0.058 0.99
Table 7: Performance of our method on three datasets in terms of , MD, OE, , , , and TA. A higher value indicates more complete motion part segmentation. Lower MD, OE, , , and higher TA indicate more accurate axis prediction. is a comprehensive evaluation of motion part segmentation and axis prediction, where a lower value is better.
Figure 7: Impact of the length of a point cloud sequence, where a higher value is better. The x-axis corresponds to the length of the point cloud sequence ().

Analysis on the length of point cloud sequence. Considering that our motivation is to learn part mobility from a point cloud sequence, the number of frames () is an important hyperparameter in our approach. Thus, we must determine how many frames are sampled from one motion sequence. To this end, we conduct an ablation experiment on to study the impact of the length of the point cloud sequence. We reduce the number of frames from 17 to 3 for a sufficient verification. The results are depicted in Fig. 7. As the number of frames increases, the performance improves consistently, but the gain becomes progressively smaller. Hence, we adopt 11 frames as input to balance performance and efficiency.
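One simple way to draw a fixed number of frames from a longer motion sequence is to pick evenly spaced indices that always include the first and last frames. The text does not state how frames are sampled, so the sketch below is only one plausible scheme; the function name `sample_frames` is hypothetical.

```python
import numpy as np

def sample_frames(sequence, n_frames):
    """Pick n_frames evenly spaced items, keeping the first and last frame."""
    if not 2 <= n_frames <= len(sequence):
        raise ValueError("n_frames must be between 2 and len(sequence)")
    idx = np.linspace(0, len(sequence) - 1, n_frames).round().astype(int)
    return [sequence[i] for i in idx]

# Hypothetical usage: reduce a 17-frame sequence to 11 frames
frames = sample_frames(list(range(17)), 11)
```

Keeping both end frames preserves the full motion range of the sequence, whatever the number of intermediate frames.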

Analysis on transfer ability. Here, we aim to verify that our network estimates the motion axes by learning the feature representation of trajectories rather than only by learning geometrical characteristics. Our approach is a self-supervised deep learning algorithm. It can abstract motion patterns from trajectories, which enhances its ability to generalize. Therefore, we design an experiment in which we train our network on f-data and then test it on s-data. We present the results for three categories with different levels of complexity in Table 8. The results demonstrate that the network can produce correct motion axes, even though it did not see any real objects during training. This implies that our network can learn motion patterns from trajectories and that our algorithm indeed has the capability of generalization.

category MD OE TA
Fan 0.89 0.014 0.014 0.032 0.028 1
Laptop 0.95 0.018 0.013 0.035 0.021 1
Scissor 0.79 0.022 0.017 0.087 0.031 1
Table 8: Experiment on the generalization ability. Our method can handle unseen categories, because motion patterns and type are learned by the network.

Effect of trajectory aggregation. The aggregation of trajectories is of great importance in PointRNN, as it describes a motion part consisting of a group of similar motion trajectories. Thus, it is important to analyze how different aggregation designs influence performance. We report five different sizes with two types of aggregation, max and average, as shown in Fig. 8. As the aggregation radius () increases, the performance of PointRNN improves until it peaks at approximately 32. The results demonstrate that it is difficult to learn the features when we cluster too few trajectories in a local region. In contrast, a larger region usually contains more than one motion part and results in decreased performance. Moreover, max pooling achieves better results than average pooling when aggregating the features in local regions.
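The two aggregation types compared above can be sketched as follows in Python with NumPy. The example groups each trajectory with its k nearest trajectories and pools their features. Treating the aggregation size as a neighbor count `k`, representing each trajectory by a single 3D position, and the function name itself are assumptions for illustration; PointRNN's actual grouping may differ.

```python
import numpy as np

def aggregate_trajectory_features(positions, features, k, mode="max"):
    """Pool features over each trajectory's k nearest neighbors.

    positions: (N, 3) one representative point per trajectory (assumed)
    features:  (N, C) per-trajectory feature vectors
    returns:   (N, C) aggregated features
    """
    positions = np.asarray(positions, dtype=float)
    features = np.asarray(features, dtype=float)
    diff = positions[:, None, :] - positions[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)               # (N, N) squared distances
    idx = np.argsort(dist2, axis=1)[:, :k]         # k nearest, self included
    grouped = features[idx]                        # (N, k, C)
    if mode == "max":
        return grouped.max(axis=1)
    if mode == "average":
        return grouped.mean(axis=1)
    raise ValueError("mode must be 'max' or 'average'")
```

With a small k, each group holds little evidence about the local motion. With a large k, a group can span several motion parts, and pooling then mixes their features, which matches the trend reported above.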

Figure 8: Trajectory aggregation analysis, where a higher value is better. The x-axis corresponds to different hyperparameters of trajectory aggregation size ().

Analysis on the quality of trajectory. We also study the impact of trajectory quality. Given a point cloud sequence, we first generate a set of trajectories. Then, we add different levels of random noise to these trajectories. Fig. 9 shows the relationship between and noise. Our method maintains more than 0.7 () after adding 0.04 random noise. As the noise level increases further, the performance declines sharply. The results demonstrate that our method can tolerate noise to a certain extent, but it fails when excessive noise is added.
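A perturbation of this kind can be reproduced with a few lines of Python and NumPy. The text only says "random noise", so the zero-mean Gaussian distribution, the interpretation of the noise level as a standard deviation, and the (T, N, 3) array layout used below are all assumptions.

```python
import numpy as np

def add_trajectory_noise(trajectories, level, seed=0):
    """Return a noisy copy of trajectories with shape (T, N, 3).

    level is used as the standard deviation of zero-mean Gaussian
    noise added independently to every coordinate (assumed model).
    """
    rng = np.random.default_rng(seed)
    trajectories = np.asarray(trajectories, dtype=float)
    return trajectories + rng.normal(0.0, level, size=trajectories.shape)

# Hypothetical usage: 11 frames, 1000 trajectories, noise level 0.04
clean = np.zeros((11, 1000, 3))
noisy = add_trajectory_noise(clean, level=0.04)
```

Fixing the seed makes the perturbation repeatable, so different noise levels can be compared on identical inputs.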

Method Model size Timing (training/testing)
Yuan et al. [Yuan_CGF2016] - -/180s
Wang et al. [Wang_CVPR2019] 86.7MB 38h/7.32s
Yi et al. [Yi_SIGa2018] 78MB 35.5h/4.88s
Yan et al. [Yan_SIGa19] 153.3MB 13.2h/0.55s
Ours 25.2MB 11.6h/0.43s
Table 9: Model size and processing time. Our model is smaller and faster.

Model size and speed. Our proposed model is highly efficient, as it leverages sparsity in point clouds by using clusters and does not require many stages like those of Wang et al. [Wang_CVPR2019] and Yi et al. [Yi_SIGa2018]. Compared to previous methods (Table 9), our model is more than smaller in size and more than times faster than those of Wang et al. [Wang_CVPR2019] and Yi et al. [Yi_SIGa2018]. Although the method of Yuan et al. [Yuan_CGF2016] is a non-learning method, it takes approximately 3 min to segment a 3D shape and does not output motion parameters. Furthermore, our model is more than smaller in size than that of Yan et al. [Yan_SIGa19].
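The relative size and speed factors can be recomputed from the values listed in Table 9. The sketch below is plain arithmetic on those numbers in Python; the dictionary layout and variable names are illustrative, and testing time is used as the speed measure.

```python
# (model size in MB, testing time in seconds), copied from Table 9
table9 = {
    "Wang et al.": (86.7, 7.32),
    "Yi et al.": (78.0, 4.88),
    "Yan et al.": (153.3, 0.55),
}
ours_size_mb, ours_test_s = 25.2, 0.43

# How many times larger and slower each method is than ours
ratios = {
    name: (size_mb / ours_size_mb, test_s / ours_test_s)
    for name, (size_mb, test_s) in table9.items()
}

for name, (size_ratio, time_ratio) in ratios.items():
    print(f"{name}: {size_ratio:.2f}x larger, {time_ratio:.2f}x slower at test time")
```

Yuan et al. [Yuan_CGF2016] is omitted because Table 9 reports no model size for this non-learning method.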

Figure 9: Effect of different qualities of trajectory. The result illustrates that the performance decreases as the noise increases.

4.5 Limitations and future work

There are several avenues for future research. First, our method is based on trajectories, so if the quality of the trajectories is extremely poor, with many fragments, the performance decreases. Second, if two parts have exactly the same motion, e.g., two drawers in a cabinet, our method treats them as a single motion part. Third, for hierarchical mobility extraction, we can segment the 3D object into different motion parts, but the predicted motion axis describes a compound movement. For example, the motion of a bulb holder is a composite movement with that of a lamp post. In the future, it would be interesting to infer parts and motions and discover common articulation patterns from various kinds of sensor data. Moreover, our method relies on a pre-registration of point clouds using an ICP-like optimization. There are also many deep learning methods for point cloud registration, such as [Pais20203DRegNetAD]. It would be worthwhile to design a network encompassing both trajectory generation and motion attribute estimation in an end-to-end fashion in future work. To facilitate future research and make our method easier to reproduce, the source code and dataset are available on GitHub:

5 Conclusion

In this paper, we introduce a self-supervised method for parsing 3D shapes into several motion parts with motion parameters from point cloud sequences. We first transform point cloud sequences into trajectories. Then, we propose a novel neural network architecture called PointRNN to extract the feature representations from trajectories and produce motion hypotheses. Finally, we propose an iterative algorithm for segmenting motion parts. In contrast to previous studies, our approach first predicts part motion parameters including motion axes and motion ranges, and then segments motion parts guided by these motion hypotheses. We experimentally demonstrated that our approach yields significantly better results compared with both the traditional and network-based methods.


We thank the anonymous reviewers for their valuable comments. This work was supported in part by the National Key Research and Development Program of China (2018YFC0831003 and 2019YFF0302902), the National Natural Science Foundation of China (61902014, U1736217, and 61932003), and the Pre-research Project of the Manned Space Flight (060601).