The 3D point cloud is the raw output of most 3D sensors and multiview stereo pipelines [6] and a widely used representation for 3D structures in applications such as autonomous driving [7] and augmented reality [11]. Because of its efficiency and flexibility, there is growing interest in using point clouds directly for high-level tasks such as object recognition, skipping the need for meshing or other post-processing. These tasks require an understanding of the semantic concept represented by the points. On other modalities like images, deep neural networks [8, 12] have proven to be a powerful model for extracting semantic information from raw sensor data, and have gradually replaced hand-crafted features. A similar trend is happening on point clouds. With the introduction of deep learning architectures like PointNet [16], it is possible to train powerful feature extractors that outperform traditional geometric descriptors on tasks such as shape classification and object part segmentation.
However, existing benchmark datasets [23, 25] that are used to evaluate performance on these tasks make two simplifying assumptions: first, the point clouds are sampled from complete shapes; second, the shapes are aligned in a canonical coordinate system (see Figure 2; in ModelNet [23], shapes are allowed to have rotations, but only along the vertical axis). These assumptions are rarely met in real world scenarios. First, due to occlusions and sensor limitations, real world 3D scans usually contain holes and missing regions. Second, point clouds are often obtained in the sensor’s coordinates, which do not align with the canonical coordinates of the object model. In other words, real 3D point cloud data are partial and unaligned.
In this work, we tackle the problem of learning from partial, unaligned point cloud data. To this end, we build a dataset consisting of partial point clouds generated from virtual scans of CAD models in ModelNet and ShapeNet. Our dataset contains challenging inputs with arbitrary 3D rotation, translation and realistic self-occlusion patterns.
A key challenge in learning from such data is how to learn features that are invariant or equivariant with respect to geometric transformations. For tasks like classification, we want the output to remain the same if the input is transformed. This is called invariance. For tasks like pose estimation, we want the output to vary according to the transformation applied on the input. This is called equivariance.
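The two properties can be stated a little more formally. The block below is only a sketch in notation introduced here for illustration: $f$ stands for a network acting on a point cloud $X$, and $T$ for a rigid transformation applied to every point.

```latex
% f: network, X: input point cloud, T: rigid transformation (illustrative symbols)
\text{invariance:}\quad f(T X) = f(X),
\qquad
\text{equivariance:}\quad f(T X) = T \cdot f(X).
```

Under this reading, a classifier should satisfy the first identity, while a pose estimator should satisfy the second, with $T \cdot f(X)$ denoting the predicted pose updated by $T$.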
This property can be achieved via a transformer network [9], which predicts a transformation that is applied to the input before feature extraction. This allows explicit geometric manipulation of data within networks, so the networks can learn to align the inputs into a canonical space that makes subsequent tasks easier. T-Net is a transformer network based on PointNet that operates on 3D point clouds. However, T-Net outputs an unconstrained affine transformation. This can introduce undesirable shearing and scaling, which cause the object to lose its shape (see Figure 3). Moreover, T-Net is evaluated on inputs with 2D rotations only.
Therefore, we propose a novel transformer network on 3D point clouds, named Iterative Transformer Network (IT-Net). It has two major differences from T-Net. First, it outputs a rigid transformation instead of an affine transformation. This preserves the shape of the inputs and leads to better performance on subsequent tasks such as shape classification and part segmentation. Further, this allows us to use the outputs of IT-Net directly as estimates of object poses. Second, instead of predicting the transformation in a single step, IT-Net takes advantage of an iterative scheme which decomposes a large transformation into smaller ones that are easier to predict. We note that similar ideas of iterative pose refinement have been employed by classical image and point cloud alignment algorithms [3, 15].
We evaluate the performance of IT-Net on 3 different tasks – pose estimation, shape classification and object part segmentation (Figure 1) – using inputs with 3D rotation and translation from our partial point cloud dataset. Our experiments demonstrate that models trained with IT-Net can learn transformation-invariant or equivariant features from challenging partial inputs. Further, IT-Net can be readily integrated into various state-of-the-art networks and improve their performance on different tasks.
The key contributions of our work are as follows:
We propose a new transformer network on 3D point clouds that uses iterative refinement to predict rigid transformations;
We show that our transformer can be used independently as a pose estimator or trained jointly with classification and segmentation networks and increases performance over baselines in both sets of tasks;
We introduce a benchmark dataset for shape classification and object part segmentation consisting of partial, unaligned point clouds.
2 Related Work
Feature Learning on Point Clouds
Traditional point cloud features [1, 19, 21] rely on statistical properties of points such as local curvatures. They do not encode semantic information and it is non-trivial to find the combination of features that is optimal for specific tasks.
PointNet [16] proposes a way to extract semantic and task-specific features from point clouds using a deep neural network. The key idea of PointNet is to use a symmetric function (e.g. max-pooling) to aggregate pointwise features so that the global feature is invariant to permutations of the points. A drawback of PointNet is that it does not account for local interactions among points. Thus, several extensions [13, 22] which augment the input with information from local neighborhoods of points have been proposed.
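The permutation invariance obtained from a symmetric aggregation function can be checked numerically. The following is a minimal sketch, not the PointNet implementation: it assumes a single shared linear layer with ReLU in place of the full per-point MLP, and the names `global_feature`, `weights` and `bias` are hypothetical.

```python
import numpy as np

def global_feature(points, weights, bias):
    """Shared per-point linear layer + ReLU, then max-pool over the points."""
    pointwise = np.maximum(points @ weights + bias, 0.0)  # shape (N, F)
    return pointwise.max(axis=0)                          # shape (F,)

rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))   # toy point cloud
weights = rng.normal(size=(3, 16))   # illustrative random parameters
bias = rng.normal(size=16)

# Reordering the points leaves the max-pooled feature unchanged.
shuffled = points[rng.permutation(len(points))]
same = np.allclose(global_feature(points, weights, bias),
                   global_feature(shuffled, weights, bias))
```

Because the same function is applied to every point and the maximum ignores ordering, `same` evaluates to true for any permutation of the rows.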
Most datasets [23, 25] used to evaluate feature learning on point clouds consist of complete point clouds. A few works [16, 26] have investigated feature learning from partial point clouds. However, they all assume that the point clouds are aligned in a canonical coordinate system. In this work, we show how to remove this assumption using a transformer network.
Spatial Transformer Network
Spatial Transformer Network (STN) [9] is a network module that performs explicit geometric transformations on the input data. STN can be thought of as a geometry predictor which models the complicated non-linear relationship between the appearance of the image and geometric transformations. It can be trained jointly with classification networks and has the benefit of introducing invariance to geometric transformations.
3 Iterative Transformer Network
Iterative Transformer Network (IT-Net) takes a 3D point cloud and produces a transformation that can be used directly as a pose estimate or applied to the input before feature extraction for subsequent tasks. Alternatively, we can think of its output as the transformation between the input’s coordinates and the canonical coordinates of the semantic concept represented by input, which can be defined either explicitly as in Figure 2 or implicitly as any coordinate system that makes subsequent tasks (e.g. classification) easier.
IT-Net has two key features that differentiate it from existing transformer networks on 3D point clouds: first, it predicts a 3D rigid transformation; second, the final output is composed of multiple transformations produced in an iterative fashion. We will introduce these features one by one followed by additional implementation details.
3.1 Rigid Transformation Prediction
A 3D rigid transformation $T$ is an element of the Special Euclidean group $SE(3)$. It consists of a rotation $R$ and a translation $t$, where $R$ is a $3 \times 3$ matrix satisfying $R R^\top = I$, $\det(R) = 1$ and $t$ is a $3 \times 1$ vector. Due to the constraints on $R$, it is inconvenient to represent the rotation as a matrix during optimization. Thus, many classical as well as modern deep learning methods [10, 24] parametrize 3D rotations with unit quaternions. This parametrization allows us to map an arbitrary 4D vector to a valid rotation.
Similar to PointNet [16] used for classification, a single iteration IT-Net consists of a shared multi-layer perceptron for each point and a max-pooling aggregation function followed by fully connected layers, but instead of predicting class scores, it outputs 7 numbers – the first 4 are normalized into a unit quaternion $q$ and the last 3 are treated as a 3D translation vector $t$. Then, $q$ and $t$ are assembled into a $4 \times 4$ matrix $T = \begin{bmatrix} R(q) & t \\ 0 & 1 \end{bmatrix}$, where $R(q)$ is the rotation matrix corresponding to $q$. Representing the output transformation as matrices turns the composition of two rigid transformations into a matrix multiplication, which is useful for the multi-iteration IT-Net introduced in Section 3.2.
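The mapping from the 7 raw outputs to a homogeneous transformation matrix can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the quaternion is assumed to be ordered scalar-first as $(w, x, y, z)$, and the helper names `quat_to_rotmat` and `to_rigid_matrix` are hypothetical.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix for quaternion q = (w, x, y, z); q is normalized first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def to_rigid_matrix(output7):
    """Assemble a 4x4 homogeneous transform from 7 raw network outputs."""
    output7 = np.asarray(output7, dtype=float)
    T = np.eye(4)
    T[:3, :3] = quat_to_rotmat(output7[:4])  # first 4 numbers: rotation
    T[:3, 3] = output7[4:]                   # last 3 numbers: translation
    return T

# An unnormalized 4D vector still yields a valid rotation (here 90 degrees about z).
T = to_rigid_matrix([2.0, 0.0, 0.0, 2.0, 0.1, 0.2, 0.3])
R = T[:3, :3]
```

Normalizing inside `quat_to_rotmat` is what lets an arbitrary non-zero 4D vector map to a valid rotation, so no orthogonality constraint has to be enforced during optimization.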
In contrast to the affine transformation produced by T-Net, the rigid transformation predicted by IT-Net can be directly interpreted as a 6D pose, making it possible to use IT-Net independently for pose estimation. More importantly, rigid transformations preserve scales and angles. As a result, the appearance of a point cloud will not vary drastically if it is transformed by the output of IT-Net. This makes it possible to apply the same network iteratively (see Figure 4) to obtain a more accurate estimation of the transformation.
We note that it is possible to add a regularization term $\| A A^\top - I \|_F^2$ that forces an affine matrix $A$ to be orthogonal in order to achieve similar effects of predicting a rigid transformation. (In PointNet [16], this regularization is added to the feature transformation, but not to the input transformation.) We explore this possibility (named T-Net reg) in our experiments and show that the results are not as good as the network that directly predicts a rigid transformation.
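To illustrate what such an orthogonality regularizer penalizes, here is a minimal sketch assuming the squared Frobenius norm of $A A^\top - I$ as the penalty. The function name `orthogonality_penalty` and the two example matrices are hypothetical.

```python
import numpy as np

def orthogonality_penalty(A):
    """Squared Frobenius norm of (A A^T - I); zero when A is orthogonal."""
    A = np.asarray(A, dtype=float)
    return float(np.sum((A @ A.T - np.eye(A.shape[0])) ** 2))

# A pure rotation incurs no penalty, while a shear does.
rotation = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
shear = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
penalty_rotation = orthogonality_penalty(rotation)
penalty_shear = orthogonality_penalty(shear)
```

The penalty only discourages shearing and scaling softly, so the predicted matrix is not guaranteed to be a rotation, unlike the quaternion parametrization.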
3.2 Iterative Alignment
The idea of using an iterative scheme for predicting geometric transformations goes back to the classical Lucas-Kanade (LK) algorithm [15] for estimating dense alignment between images. The key insight of LK is that the complex non-linear mapping from image appearance to geometric transformations can be estimated iteratively using simple linear predictors. Specifically, at each iteration, a warp transformation $\Delta p$ is predicted with a linear function that takes a source and a target image as inputs. Then, the source image is warped by $\Delta p$ and the process is repeated. The final transformation is a composition of the $\Delta p$ predicted at each step. Later work shows that the parameters used to predict $\Delta p$ can remain constant across iterations while achieving the same effect as non-constant predictors.
The same idea is employed in the Iterative Closest Point (ICP) algorithm [3] for the alignment of 3D point clouds. At each iteration of ICP, a corresponding set is identified and a rigid transformation $\Delta T$ is produced to align the corresponding points. Then, the source point cloud is transformed by $\Delta T$ and the process is repeated. Again, the final output is a composition of the $\Delta T$ produced at each step. The effectiveness of ICP shows that the iterative refinement framework applies not only to images, but also to 3D point clouds.
The multi-iteration IT-Net (Figure 4) can be viewed as an instantiation of this iterative framework. Specifically, the prediction of the transformation is unfolded into multiple iterations. At the $i$-th iteration, an update transformation $T_i$ is predicted using the network introduced in Section 3.1. Then, the input is transformed by $T_i$ and the process is repeated. The final output after $n$ iterations is a composition of the transformations predicted at each iteration, which can be written as a simple matrix product $T = T_n T_{n-1} \cdots T_1$. Following the constant-parameter variant of LK described above, we use a fixed predictor (i.e. share the network’s parameters) across iterations.
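The unfolded prediction can be sketched as a short loop. This is an illustrative sketch only: `toy_predictor` is a hypothetical stand-in for the trained network (it merely moves the centroid halfway to the origin), and points are assumed to be stored as rows, so each update is applied as `points @ R.T + t`.

```python
import numpy as np

def iterative_align(points, predict_update, num_iters):
    """Apply one fixed predictor repeatedly and compose its 4x4 updates."""
    total = np.eye(4)
    current = points
    for _ in range(num_iters):
        update = predict_update(current)                      # 4x4 rigid transform
        current = current @ update[:3, :3].T + update[:3, 3]  # transform the input
        total = update @ total                                # later updates multiply on the left
    return total, current

def toy_predictor(points):
    """Stand-in for the network: translate the centroid halfway to the origin."""
    update = np.eye(4)
    update[:3, 3] = -0.5 * points.mean(axis=0)
    return update

cloud = np.random.default_rng(0).normal(size=(64, 3)) + np.array([4.0, 0.0, 0.0])
total, aligned = iterative_align(cloud, toy_predictor, num_iters=3)
```

Applying the composed matrix `total` to the original cloud gives the same result as the step-by-step updates, and the same predictor can be run for more iterations without any change to its parameters.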
The iterative refinement scheme can be interpreted as a form of gradient descent. In LK, each update is a gradient step that brings the source closer to the target. In ICP, the update is no longer a gradient step, but it has similar effects of bringing the source closer to the target. Consequently, the magnitudes of the update transformations diminish as the algorithms converge, just like the gradient approaches 0 near a local optimum. In the case of IT-Net, the input is a single point cloud instead of a pair of images or point clouds and the transformation is predicted by a neural network. Despite these differences, we observe that the transformations predicted by IT-Net also have diminishing magnitudes (Figure 5). This behavior shows that in a sense, IT-Net is learning the “gradient transformation” that will bring the input shape closer to its appearance in the canonical frame.
3.3 Implementation Details
In addition to the key ingredients above, there are a couple of details that are important for the training of IT-Net.
First, we initialize the network to predict the identity transformation, i.e. $q = [1, 0, 0, 0]$, $t = [0, 0, 0]$. In this way, the default behavior of each iteration is to preserve the transformation predicted by previous iterations, similar to residual networks [8]. This initialization is especially important when the number of iterations becomes large.
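One common way to obtain an identity prediction at initialization is to zero the weights of the last layer and set its bias to the identity pose. The text does not say how the initialization is implemented, so the sketch below is an assumption; `init_last_layer` and the feature size 256 are hypothetical, and the quaternion is assumed to be scalar-first.

```python
import numpy as np

def init_last_layer(in_features):
    """Zero weights and an identity-pose bias for the 7-output layer."""
    weights = np.zeros((in_features, 7))
    bias = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # q = (1, 0, 0, 0), t = 0
    return weights, bias

weights, bias = init_last_layer(256)
features = np.random.default_rng(0).normal(size=256)  # arbitrary pooled feature
output = features @ weights + bias
quat = output[:4] / np.linalg.norm(output[:4])
trans = output[4:]
```

With this choice, the output is the identity pose regardless of the input features, so every iteration initially passes its input through unchanged.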
Second, we stop the gradients flowing through the input transformations, indicated by red edges in Figure 4. This removes the dependence of the input at each iteration from the predictions of previous iterations. The reason behind this design choice can be perceived in terms of the analogy between IT-Net and gradient descent. If we interpret the prediction of IT-Net at each iteration as the gradient direction at a point, then it should not depend on the path that leads to the point.
4 Experiments
In this section, we evaluate the efficacy of IT-Net on various tasks. First, we demonstrate its use as a plug-in module that improves the performance of state-of-the-art classification and segmentation networks. Next, we show that it can also be used independently for pose estimation.
4.1 Partial 3D Shape Classification
In this section, we evaluate IT-Net on the classification of partial, unaligned shapes and show that it can improve the performance of classification networks by introducing invariance to geometric transformations.
Existing shape classification benchmarks such as ModelNet40 fail to capture the incomplete and unaligned nature of real world 3D data. To test the performance of classifiers under a more realistic setting, we build the partial ModelNet40 dataset. To the best of our knowledge, this is the first controlled dataset for partial 3D shape classification.
The dataset consists of 81,212 point clouds from 40 categories, split into 78,744 for training and 2,468 for testing. Each point cloud is generated by fusing a sequence of depth scans of models in ModelNet40 into a point cloud. Figure 6 includes some qualitative examples. It can be seen that the point clouds are in a variety of poses with realistic self-occlusion patterns. Please refer to the supplementary for more details on the data generation.
The network used for the partial shape classification task consists of two parts – the transformer and the classifier. The transformer takes a point cloud and produces a transformation . The classifier takes the point cloud transformed by and outputs a score for each class. The entire network is trained with cross-entropy loss on the class scores and no explicit supervision is applied on .
We compare classifiers trained with three different transformers (IT-Net, T-Net and regularized T-Net, denoted T-Net reg) against the baseline classifier that does not use a transformer. All three transformers have the same architecture except for the last layer. Specifically, the architecture consists of three parts. The first part is a multi-layer perceptron that is applied to each point independently. It takes the point coordinate matrix and produces a per-point feature matrix. The second part is a max-pooling function which aggregates the per-point features into a single vector. The third part is another multi-layer perceptron which outputs the transformation parameters. For IT-Net, the last layer outputs 7 numbers for rotation (quaternion) and translation. For T-Net and T-Net reg, the last layer outputs 9 numbers to form a $3 \times 3$ affine transformation matrix $A$. For T-Net reg, a regularization term $\| A A^\top - I \|_F^2$ is added to the loss with weight 0.001. Batch normalization is applied to all layers except the last layer. Further, we apply the iterative scheme described in Section 3.2 to each transformer for 2 iterations.
The networks are trained for 50 epochs with batch size 32. We use the Adam optimizer with an initial learning rate of 0.001, decayed by 0.7 every 6250 steps. The initial decay rate for batch normalization is 0.5 and gradually increased to 0.99. We clip the gradient norm to 20.
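The learning rate schedule can be written as a one-line function. The sketch assumes a staircase decay, in which the rate is multiplied by 0.7 once every 6250 steps rather than decayed continuously; the text does not specify which variant is used, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, base_lr=0.001, decay_rate=0.7, decay_steps=6250):
    """Staircase exponential decay: multiply by decay_rate every decay_steps."""
    return base_lr * decay_rate ** (step // decay_steps)

lr_start = learning_rate(0)           # initial rate
lr_after_one = learning_rate(6250)    # after the first decay
lr_after_two = learning_rate(12500)   # after the second decay
```

Under this assumption the rate stays at 0.001 for the first 6249 steps, then drops to 0.0007 and later to 0.00049.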
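|Transformer||None||T-Net||T-Net||T-Net reg||T-Net reg||IT-Net (ours)||IT-Net (ours)|
|# Iterations||-||1||2||1||2||1||2|
|Accuracy (class avg)||53.07||60.57||30.57||60.38||62.19||63.42||65.05|

|Transformer||None||T-Net||T-Net||T-Net reg||T-Net reg||IT-Net (ours)||IT-Net (ours)|
|# Iterations||-||1||2||1||2||1||2|
|Accuracy (class avg)||58.34||62.66||13.73||65.49||66.40||65.16||68.34|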
The overall accuracy and the average of accuracy per class are reported in Table 1. For both classifiers, IT-Net trained with 2 iterations brings the largest amount of improvement in accuracy over baselines that do not use any transformer network. This is evidence that the advantage of IT-Net over other transformer networks is agnostic to the architecture of the classifier.
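The two metrics differ in how they weight categories, which matters when the classes are imbalanced. The following is a minimal sketch with made-up labels; the function name `overall_and_class_avg_accuracy` is hypothetical.

```python
import numpy as np

def overall_and_class_avg_accuracy(labels, predictions):
    """Return (overall accuracy, mean of per-class accuracies)."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    correct = labels == predictions
    overall = correct.mean()
    # Average the accuracy computed separately within each ground-truth class.
    per_class = [correct[labels == c].mean() for c in np.unique(labels)]
    return float(overall), float(np.mean(per_class))

# Toy example: class 0 has four samples, class 1 has two.
labels = [0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 0, 1, 0]
overall, class_avg = overall_and_class_avg_accuracy(labels, predictions)
```

In this toy case the overall accuracy is 5/6, while the class average is 0.75 because the single error falls in the smaller class.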
Figure 6 shows some examples of how the inputs are transformed by IT-Net. It can be seen that without explicit supervision, IT-Net learns to transform the inputs into a canonical frame similar to the manually defined one shown in Figure 2. Inputs in various initial poses and even different categories are aligned in this canonical frame. This removes the variations introduced by geometric transformations of the input and simplifies the classification problem. Moreover, unlike T-Net, IT-Net does not introduce any scaling or shearing and preserves the shape of the input.
We note that in this experiment, we do not need to iterate as many times as in the case of pose estimation shown later in Section 4.3. The reason is that the classifiers themselves should be robust against small transformations of the input. Thus, instead of producing precise alignments, the transformer’s job is to roughly align the inputs so that they can be recognized by the classifiers. Nevertheless, there is still a critical difference between using and not using the iterative scheme, as shown by the difference between IT-Net trained with 2 iterations and 1 iteration.
We also note that the performance of T-Net drops significantly if we try to apply the same iterative scheme on its output. This can be explained by the fact that the output of T-Net is on a very different scale than the original input (see Figure 6). As a result, the network sees vastly different inputs from different iterations. This will cause the training to diverge. This issue is resolved with the regularized T-Net. However, its performance is still worse than IT-Net.
|mean||table||chair||air plane||lamp||car||guitar||laptop||knife||pistol||motor cycle||mug||skate board||bag||ear phone||rocket||cap|
4.2 Partial Object Part Segmentation
To show that the invariance to geometric transformations learnt by IT-Net is not specific to the classification task, we demonstrate IT-Net’s capability to improve the performance of part segmentation networks.
Similar to ModelNet40, the ShapeNet part dataset [25] commonly used for the evaluation of object part segmentation also consists of complete and aligned point clouds. Thus, similar to partial ModelNet40, we build a more realistic version of the ShapeNet part dataset with partial and unaligned point clouds to evaluate the performance of part segmentation models. Since the ShapeNet part dataset does not come with meshed models, we use an approximate rendering procedure that mimics an orthographic depth camera to create realistic-looking partial point clouds. Please refer to the supplementary for more details.
The dataset contains 16,881 shapes from 16 categories, annotated with 50 parts in total and 2-6 parts per category. We use the same training and testing split as in prior work on this dataset.
The network used for part segmentation simply replaces the classifier in the joint transformer-classifier model from Section 4.1 with a segmentation network. We use DGCNN [22] as the baseline segmentation network and compare the performance gain of adding T-Net and IT-Net. The architectures of the transformers are identical to the ones in Section 4.1. Following [16, 22], we treat part segmentation as a per-point classification problem and train the network with per-point cross entropy loss. Similar to Section 4.1, no explicit supervision is applied on the transformations.
The networks are trained for 200 epochs with batch size 32. Other hyperparameters are the same as the ones used in partial shape classification.
We use the mean Intersection-over-Union (mIoU) on points as the evaluation metric following [16, 22]. The results are summarized in Table 2. Again, IT-Net with 2 iterations leads to the largest improvement in mean mIoU and in mIoU for most categories. As in the case of classification, IT-Net removes variations in the inputs caused by geometric transformations (see Figure 7). This demonstrates the potential of IT-Net as a plug-in for any task that requires invariance to geometric transformations without task-specific adjustments to the model architecture.
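For reference, the IoU of a single shape can be computed as below. This sketch reflects one common convention and is an assumption here: the IoU is averaged over all parts of the shape's category, and a part that appears in neither the ground truth nor the prediction counts as IoU 1. The function name `shape_iou` and the toy labels are hypothetical.

```python
import numpy as np

def shape_iou(gt, pred, part_ids):
    """Mean IoU over the parts of one shape's category."""
    gt, pred = np.asarray(gt), np.asarray(pred)
    ious = []
    for part in part_ids:
        union = np.sum((gt == part) | (pred == part))
        inter = np.sum((gt == part) & (pred == part))
        # Assumed convention: a part absent from both labelings counts as IoU 1.
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

# Toy shape with six points; part 2 belongs to the category but is absent here.
gt = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
iou = shape_iou(gt, pred, part_ids=[0, 1, 2])
```

The per-category mIoU is then the average of `shape_iou` over all shapes in that category.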
4.3 Pose Estimation
We have shown that IT-Net outperforms existing transformers on the tasks of partial shape classification and partial object part segmentation. Next, we perform an analysis to try to better understand the efficacy of the iterative refinement scheme on predicting accurate geometric transformations. To this end, we choose the task of estimating the canonical pose of an object from a partial observation.
Specifically, given a partial point cloud, we use IT-Net to estimate the transformation that aligns it to a canonical frame defined across all models in the same category, as illustrated in Figure 2. Unlike most existing works on pose estimation, we do not assume knowledge of the object’s model and we train a single network that generalizes to different objects in the same category.
Our dataset for pose estimation consists of 12,000 partial point clouds from 2,400 car models in ShapeNet [5]. The point clouds are generated by back-projecting depth scans of models from random viewpoints into the camera’s coordinates. Each point cloud is labeled with the transformation that aligns it to the model’s coordinates. The data are split into training, validation and testing with a 10:1:1 ratio. Note that the test set and the training set are created using different car models.
For the loss function, we use a variant of PLoss proposed in PoseCNN [24]. It measures the average distance between the same set of points under the estimated pose and the ground truth pose. Compared to the L2 loss used in earlier works [10], this loss has the advantage of automatically handling the tradeoff between small rotations and small translations. The loss can be written as
$$L\big((R, t), (\tilde{R}, \tilde{t})\big) = \frac{1}{|X|} \sum_{x \in X} \big\| (R x + t) - (\tilde{R} x + \tilde{t}) \big\|_2^2$$
where $(R, t)$ are the ground truth pose and $(\tilde{R}, \tilde{t})$ are the estimated pose. In the original formulation, $X$ is the set of 3D model points. Since we do not assume knowledge of the model, we let $X$ be the set of input points instead.
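A point-based pose loss of this kind can be sketched in a few lines. The sketch assumes the distance is the squared Euclidean norm and that points are stored as rows; the function name `pose_loss` and the toy poses are hypothetical.

```python
import numpy as np

def pose_loss(points, R_gt, t_gt, R_est, t_est):
    """Mean squared distance between the points under the two poses."""
    under_gt = points @ R_gt.T + t_gt      # points under the ground truth pose
    under_est = points @ R_est.T + t_est   # same points under the estimated pose
    return float(np.mean(np.sum((under_gt - under_est) ** 2, axis=1)))

points = np.random.default_rng(0).normal(size=(100, 3))
identity = np.eye(3)
zero = np.zeros(3)
shifted = np.array([0.3, 0.0, 0.4])  # translation error of length 0.5

loss_same = pose_loss(points, identity, zero, identity, zero)
loss_shift = pose_loss(points, identity, zero, identity, shifted)
```

Because rotation and translation errors are both measured as displacements of the same points, no hand-tuned weight between the two terms is needed.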
The networks are trained for 20000 steps with batch size 100. We use Adam optimizer with an initial learning rate of 0.001, decayed by 0.7 every 2000 steps. The initial decay rate for batch normalization is 0.5 and gradually increased to 0.99.
The results are summarized in Figure 8. From Figure 8(b), we can observe that for all but the single iteration network, the pose accuracy keeps increasing even if we apply the network for more iterations than it is trained for. This is further evidence that IT-Net is learning the “gradient transformation” noted in Section 3.2. Moreover, this property allows us to use IT-Net as an any-time pose estimator. In particular, we can train IT-Net for a fixed number of iterations. During inference, we can keep applying the trained IT-Net to obtain increasingly accurate pose estimates until the time budget runs out.
We note that another way to interpret the iterative scheme is to treat it as a form of the Dataset Aggregation (DAGGER) method [18] used in imitation learning. From Figure 8(c), we can see that there is almost no example with rotation error less than 15 degrees in the initial inputs (iteration 0), but there are many such examples in the inputs to later iterations. In some sense, the network is generating the data it needs to learn more accurate pose estimates. It is even more convenient than DAGGER since there is no need to explicitly add examples to the dataset, as the aggregation happens automatically during training.
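A rotation error in degrees can be measured as the angle of the relative rotation between the estimated and ground truth orientations. The text does not define the error metric, so the sketch below is an assumption; the names `rotation_error_deg` and `rot_z` are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Angle of the relative rotation R_est^T R_gt, in degrees."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def rot_z(deg):
    """Rotation about the z-axis by the given angle in degrees."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

error = rotation_error_deg(rot_z(10.0), rot_z(25.0))
```

Here the two rotations differ by 15 degrees about the same axis, which is the threshold mentioned above.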
5 Conclusion
In this work, we propose a new transformer network on 3D point clouds. IT-Net predicts a rigid transformation by iteratively outputting update transformations that are used to transform the input for the next iteration. The advantage of using iterative transformers in various tasks shows that the classical idea of iterative pose refinement still applies in the context of deep learning.
Our transformer can be easily integrated with existing deep learning architectures for shape classification and segmentation, and improve the performance on these tasks with partial, unaligned inputs by introducing invariance to geometric transformations. This opens up many avenues for future research on using neural networks to extract semantic information from real world point cloud data.
Acknowledgements
This project is supported by Carnegie Mellon University’s Mobility21 National University Transportation Center, which is sponsored by the US Department of Transportation. We would like to thank Brian Okorn and Chen-Hsuan Lin for their helpful comments and suggestions.
References
[1] M. Aubry, U. Schlickewei, and D. Cremers. The wave kernel signature: A quantum mechanical approach to shape analysis. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 1626–1633. IEEE, 2011.
[2] S. Baker and I. Matthews. Lucas-kanade 20 years on: A unifying framework. International journal of computer vision, 56(3):221–255, 2004.
[3] P. J. Besl and N. D. McKay. Method for registration of 3-d shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–607. International Society for Optics and Photonics, 1992.
[4] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam, 2018.
[5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[6] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2010.
[7] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354–3361. IEEE, 2012.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[9] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
[10] A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015.
[11] G. Klein and D. Murray. Parallel tracking and mapping for small ar workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pages 225–234. IEEE, 2007.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[13] Y. Li, R. Bu, M. Sun, and B. Chen. Pointcnn. arXiv preprint arXiv:1801.07791, 2018.
[14] C.-H. Lin and S. Lucey. Inverse compositional spatial transformer networks. arXiv preprint arXiv:1612.03897, 2016.
[15] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. 1981.
[16] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
[17] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
[18] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.
[19] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (fpfh) for 3d registration. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 3212–3217, 2009.
[20] N. Sedaghat, M. Zolfaghari, E. Amiri, and T. Brox. Orientation-boosted voxel nets for 3d object recognition. arXiv preprint arXiv:1604.03351, 2016.
[21] J. Sun, M. Ovsjanikov, and L. Guibas. A concise and provably informative multi-scale signature based on heat diffusion. In Computer graphics forum, volume 28, pages 1383–1392. Wiley Online Library, 2009.
[22] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
[23] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
[24] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
[25] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.
[26] W. Yuan, T. Khot, D. Held, C. Mertz, and M. Hebert. Pcn: Point completion network. In 2018 International Conference on 3D Vision (3DV), pages 728–737. IEEE, 2018.
A Overview
In this document we provide technical details, ablation studies and visualizations in support of our paper. Here is a summary of the contents:
In Section B, we describe the generation of partial point clouds in our dataset. Section C contains additional results on part segmentation and Section D includes more architecture details. Section E provides ablation studies on the network design. Finally, we show visualizations of pose clusters learned by IT-Net in Section F.
B Data Generation
In this section, we cover details of the generation of partial, unaligned point clouds used in our experiments.
Each partial point cloud is generated by fusing a sequence of depth scans in the initial camera’s coordinates. The initial camera’s orientation is random and the distance between the camera center and the object center is varied. Subsequent camera positions are generated by rotating the camera around the object center by up to 30 degrees. Compared to the datasets used in [16, 22], which consist of point clouds uniformly sampled from mesh models, our dataset contains much more challenging inputs with various poses, densities and levels of incompleteness (see Figure 9). More statistics and detailed parameters are summarized in Table 3.
|Model source||ModelNet40 ||ShapeNet part ||ShapeNet |
|Camera distance to object center||2-4||1-2|
We use Blender [4] to render depth scans of mesh models in ModelNet and ShapeNet. The ShapeNet part dataset, however, only provides point clouds, so for it we use an approximate rendering procedure that mimics an orthographic depth camera. Specifically, we randomly rotate the complete point cloud and project the points onto a virtual image with pixel size 0.02 in the $xy$-plane. For each pixel of the virtual image, we keep the point with the smallest $z$ value and discard all other points that project onto the same pixel, as they are considered occluded by the selected point.
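The approximate orthographic rendering can be sketched as a per-pixel depth test. The sketch assumes the image plane is the $xy$-plane with depth along $z$, and it omits the random rotation applied beforehand; the function name `orthographic_partial` and the toy cloud are hypothetical.

```python
import numpy as np

def orthographic_partial(points, pixel_size=0.02):
    """Keep, for each pixel of the xy-grid, the point with the smallest z."""
    pixels = np.floor(points[:, :2] / pixel_size).astype(np.int64)
    order = np.argsort(points[:, 2], kind="stable")  # nearest points first
    # For each distinct pixel, take its first occurrence in depth order.
    _, first = np.unique(pixels[order], axis=0, return_index=True)
    return points[np.sort(order[first])]

cloud = np.array([
    [0.005, 0.005, 0.9],  # shares a pixel with the next point but is farther away
    [0.006, 0.004, 0.1],
    [0.105, 0.005, 0.5],  # alone in its pixel
])
visible = orthographic_partial(cloud)
```

In the toy cloud, the first point is discarded as occluded and the other two are kept.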
We translate the partial point clouds so that their centroids are at the origin. This reduces the range of translation and makes training more stable.
C Additional Part Segmentation Results
Table 4 shows part segmentation results using PointNet [16] as the base segmentation model. The results are similar to those using DGCNN [22] as the base segmentation model. This is evidence that the advantage of IT-Net is agnostic to the architecture of the segmentation network.
|mean||table||chair||air plane||lamp||car||guitar||laptop||knife||pistol||motor cycle||mug||skate board||bag||ear phone||rocket||cap|
|Transformer||IT-Net (no init)||IT-Net (no init)||IT-Net (no stop)||IT-Net (no stop)||IT-Net||IT-Net|
|# Iterations||1||2||1||2||1||2|
|Accuracy (class avg)||36.68||2.55||61.05||61.38||63.42||65.05|
D Architecture Details
Figure 10 shows the detailed architecture of the transformer networks used in all the experiments. For the classification and segmentation networks, we use publicly available implementations of PointNet [16] and DGCNN [22]. The detailed architectures can be found in Section C of the supplementary for [16] and Sections 5.1 and 5.4 of [22].
E Ablation Studies
In this section, we provide ablation studies for the design decisions mentioned in Section 3.3 of our paper. In particular, we evaluate the effect of initializing the network’s prediction with the identity transformation and stopping the gradient during input transformations.
Table 5 and Table 6 show the performance of ablated models on classification and pose estimation respectively. As expected, the performance degrades, especially for IT-Net trained with a larger number of iterations, which indicates that both identity initialization and gradient stopping are crucial for the iterative refinement scheme to achieve the desired behavior.
|No init||No stop||-|
F Pose Cluster Visualizations
In this section, we visualize the transformations learned by IT-Net. Specifically, we take the IT-Net trained with DGCNN on the classification task and calculate the difference between the canonical orientation of the input shape and the orientation of the transformed input at different iterations. Then, we convert the orientation difference into axis-angle representation, which is a 3D vector, and plot these vectors for all test examples in a particular category.
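The conversion from an orientation difference to an axis-angle vector can be sketched as follows. This is an illustration under stated assumptions: the difference is taken as $R_{\text{canonical}}^\top R_{\text{transformed}}$, the formula is only valid away from rotations by exactly 180 degrees, and the function names are hypothetical.

```python
import numpy as np

def axis_angle(R):
    """Axis-angle vector (unit axis times angle in radians) of rotation R."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(angle, 0.0):
        return np.zeros(3)
    # Valid away from angle = pi, where the skew-symmetric part vanishes.
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return angle * axis / (2.0 * np.sin(angle))

def orientation_difference(R_canonical, R_transformed):
    """Axis-angle vector of the rotation taking one orientation to the other."""
    return axis_angle(R_canonical.T @ R_transformed)

a = np.radians(90.0)
R_z90 = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a), np.cos(a), 0.0],
                  [0.0, 0.0, 1.0]])
vec = orientation_difference(np.eye(3), R_z90)
```

Each test example then contributes one 3D vector such as `vec`, and plotting these vectors for a category reveals whether the poses form clusters.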
We observe that although the object poses are uniformly distributed initially, clusters of poses emerge after applying the transformations predicted by IT-Net (Figures 11(a), 12(a)). This is evidence that IT-Net discovers a canonical space in which to align the inputs with no explicit supervision. Interestingly, there is usually more than one cluster, and the shapes of the clusters are related to the symmetries of the object (Figures 11(b), 12(b)). Further, we note that sometimes even objects across different categories are aligned after being transformed by IT-Net (Figures 11(d), 12(d)). Nevertheless, the pose clusters are less apparent for certain categories where the shapes are nearly spherical, e.g. flower pots.