Learning to Orient Surfaces by Self-supervised Spherical CNNs

Defining and reliably finding a canonical orientation for 3D surfaces is key to many Computer Vision and Robotics applications. This task is commonly addressed by handcrafted algorithms exploiting geometric cues deemed as distinctive and robust by the designer. Yet, one might conjecture that humans learn the notion of the inherent orientation of 3D objects from experience and that machines may do so alike. In this work, we show the feasibility of learning a robust canonical orientation for surfaces represented as point clouds. Based on the observation that the quintessential property of a canonical orientation is equivariance to 3D rotations, we propose to employ Spherical CNNs, a recently introduced machinery that can learn equivariant representations defined on the Special Orthogonal group SO(3). Specifically, spherical correlations compute feature maps whose elements define 3D rotations. Our method learns such feature maps from raw data by a self-supervised training procedure and robustly selects a rotation to transform the input point cloud into a learned canonical orientation. Thereby, we realize the first end-to-end learning approach to define and extract the canonical orientation of 3D shapes, which we aptly dub Compass. Experiments on several public datasets prove its effectiveness at orienting local surface patches as well as whole objects.


1 Introduction

Humans naturally develop the ability to mentally portray and reason about objects in what we perceive as their neutral, canonical orientation, and this ability is key for correctly recognizing and manipulating objects as well as reasoning about the environment. Indeed, mental rotation abilities have been extensively studied and linked with motor and spatial visualization abilities since the 70s in the experimental psychology literature Shepard and Metzler (1971); Vandenberg and Kuse (1978); Jansen and Kellner (2015).

Robotic and computer vision systems similarly require neutralizing variations w.r.t. rotations when processing 3D data and images in many important applications such as grasping, navigation, surface matching, augmented reality, shape classification and detection, among others. In these domains, two main approaches have been pursued to define rotation-invariant methods to process 3D data: rotation-invariant operators and canonical orientation estimation. Pioneering works applying deep learning to point clouds, such as PointNet Qi et al. (2017a, b), achieved invariance to rotation by means of a transformation network used to predict a canonical orientation to apply directly to the coordinates of the input point cloud. Despite being trained by sampling the range of all possible rotations through data augmentation, this approach does not generalize to rotations not seen during training. Hence, invariant operators like rotation-invariant convolutions were introduced, which allow training on a reduced set of rotations (ideally one, the unmodified data) and testing on the full spectrum of rotations Masci et al. (2015); Esteves et al. (2018); You et al. (2018); Zhang et al. (2019b); Rao et al. (2019); Zhang et al. (2019a). Canonical orientation estimation, instead, follows more closely the human path to invariance and exploits the geometry of the surface to estimate an intrinsic 3D reference frame which rotates with the surface.

Transforming the input data by the inverse of the 3D orientation of such reference frame brings the surface in an orientation-neutral, canonical coordinate system wherein rotation invariant processing and reasoning can happen. While humans have a preference for a canonical orientation matching one of the usual orientations in which they encounter an object in everyday life, in machines this paradigm does not need to favour any actual reference orientation over others: as illustrated in Figure 1, an arbitrary one is fine as long as it can be repeatably estimated from the input data.

Despite mental rotation tasks being solved by a set of unconscious abilities that humans learn through experience, and despite the huge successes achieved by deep neural networks in addressing analogous unconscious tasks in vision and robotics, the problem of estimating a canonical orientation is still solved solely by handcrafted proposals Salti et al. (2014); Petrelli and Di Stefano (2011); Melzi et al. (2019); Guo et al. (2013); Yang et al. (2017); Gojcic et al. (2019); Aldoma et al. (2012). This may be due to the reliance of convnets, the standard architectures for vision applications, on the convolution operator in Euclidean domains, which possesses only the property of equivariance to translations of the input signal. However, the essential property of a canonical orientation estimation algorithm is equivariance with respect to 3D rotations because, upon a 3D rotation, the 3D reference frame which establishes the canonical orientation of an object should undergo the same rotation as the object. We also point out that, although, in principle, estimation of a canonical reference frame is suitable to pursue orientation neutralization for whole shapes, in past literature it has been studied mainly to achieve rotation-invariant description of local surface patches.

In this work, we explore the feasibility of using deep neural networks to learn to pursue rotation-invariance by estimating the canonical orientation of a 3D surface, be it either a whole shape or a local patch. Purposely, we propose to leverage Spherical CNNs Cohen et al. (2018); Esteves et al. (2018), a recently introduced variant of convnets which possesses the property of equivariance w.r.t. 3D rotations by design, in order to build Compass, a self-supervised methodology that learns to orient 3D shapes. As the proposed method computes feature maps living in SO(3), i.e. feature map coordinates define 3D rotations, and does so by rotation-equivariant operators, any salient element in a feature map, e.g. its arg max, may readily be used to bring the input point cloud into a canonical reference frame. However, due to discretization artifacts, Spherical CNNs turn out to be not perfectly rotation-equivariant Cohen et al. (2018). Moreover, the input data may be noisy and, in case of 2.5D views sensed from 3D scenes, affected by self-occlusions and missing parts. To overcome these issues, we propose a robust end-to-end training pipeline which mimics sensor nuisances by data augmentation and allows the calculation of gradients with respect to feature map coordinates. The effectiveness and general applicability of Compass are established by achieving state-of-the-art results in two challenging applications: robust local reference frame estimation for local surface patches and rotation-invariant global shape classification.

Figure 1: Canonical orientations in humans and machines. Randomly rotated mugs are depicted in (a). To achieve rotation-invariant processing, e.g. to check if they are the same mug, humans mentally neutralize rotation variations preferring an upright canonical orientation, as illustrated in (b). A machine may instead use any canonical reference orientation, even unnatural to humans, e.g. like in (c).

2 Related Work

The definition of a canonical orientation of a point cloud has been studied mainly in the field of local feature descriptors Salti et al. (2014); Guo et al. (2013); Johnson and Hebert (1999); Rusu et al. (2009) used to establish correspondences between sets of distinctive points, i.e. keypoints. Indeed, the definition of a robust local reference frame (LRF), with respect to which the local neighborhood of a keypoint p is encoded, is crucial to create rotation-invariant features. Several works define the axes of the local canonical system as the eigenvectors of the 3D covariance matrix of the points within a spherical region of radius r centered at p. As the signs of the eigenvectors are not repeatable, some works focus on the disambiguation of the axes Mian et al. (2010); Salti et al. (2014); Guo et al. (2013). Alternatively, another family of methods leverages the normal to the surface at p to fix the z axis, and then exploits geometric attributes of the shape to identify a reference direction on the tangent plane that defines the x axis Chua and Jarvis (1997); Petrelli and Di Stefano (2012); Melzi et al. (2019). Compass differs sharply from previous methods because it learns the cues necessary to canonically orient a surface without making a priori assumptions on which details of the underlying geometry may be effective to define a repeatable canonical orientation.

On the other hand, PointNets Qi et al. (2017a, b) employ a transformation network to predict an affine transformation to apply to the input point clouds in order to correctly classify global shapes under rigid transformations. In Esteves et al. (2018), Esteves et al. prove the limited generalization of PointNet to unseen rotations and define spherical convolutions to learn an invariant embedding for mesh classification. In parallel, Cohen et al. Cohen et al. (2018) use the spherical correlation to map spherical inputs to SO(3) feature maps, then processed with a series of convolutions on SO(3). Similarly, PRIN You et al. (2018) proposes a network based on spherical correlations to operate on spherically voxelized point clouds. SFCNN Rao et al. (2019) re-defines the convolution operator on a discretized sphere approximated by a regular icosahedral lattice. Differently, in Zhang et al. (2019b), Zhang et al. adopt low-level rotation-invariant geometric features (angles and distances) to design a convolution operator for point cloud processing. Deviating from this line of work on invariant convolutions and operators, we show how rotation-invariant processing can be effectively realized by preliminarily transforming the shape into a canonical orientation learned by Compass.

Finally, it is noteworthy that several recent works Novotny et al. (2019); Wang et al. (2019); Sridhar et al. (2019) rely on the notion of canonical orientation to perform category-specific 3D reconstruction from a single or multiple views.

3 Proposed Method

In this section, we provide a brief overview on Spherical CNNs to make the paper self-contained, followed by a detailed description of our method. For more details, we point readers to Cohen et al. (2018); Esteves et al. (2018).

3.1 Background

The base intuition for Spherical CNNs can be lifted from the classical planar correlation used in CNNs, where the value of the output map at location x is given by the inner product between the input feature map and the learned filter translated by x. We can likewise define the value of an output map, computed by a spherical or SO(3) correlation, at rotation R as the inner product between the input feature map and the learned filter rotated by R. Below we provide formal definitions of the main operations carried out in a Spherical CNN, then we summarize the standard flow used to process point clouds with them.

The Unit Sphere: S^2 is a two-dimensional manifold defined as the set of points x ∈ R^3 with unit norm, parametrized by spherical coordinates α ∈ [0, 2π] (azimuth) and β ∈ [0, π] (inclination).

Spherical Signal: a K-valued function defined on S^2, f : S^2 → R^K, where K is the number of channels.

3D Rotations: 3D rotations live in a three-dimensional manifold, the SO(3) group, which can be parameterized by ZYZ Euler angles as in Cohen et al. (2018). Given a triplet of Euler angles (α, β, γ), the corresponding 3D rotation matrix R(α, β, γ) is given by the product of two rotations about the Z axis, R_Z(α) and R_Z(γ), and one about the Y axis, R_Y(β), i.e. R(α, β, γ) = R_Z(α) R_Y(β) R_Z(γ). Points x ∈ S^2, represented as 3D unit vectors, can be rotated by using the matrix-vector product Rx.
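As a concrete illustration of the ZYZ parametrization above, the following minimal NumPy sketch composes the rotation matrix R(α, β, γ) = R_Z(α) R_Y(β) R_Z(γ) and applies it to a unit vector; the helper names are ours.

```python
import numpy as np

def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def zyz_rotation(alpha, beta, gamma):
    """R(alpha, beta, gamma) = R_Z(alpha) @ R_Y(beta) @ R_Z(gamma)."""
    return rot_z(alpha) @ rot_y(beta) @ rot_z(gamma)

# Rotate a point on the unit sphere with the matrix-vector product R x.
R = zyz_rotation(np.pi / 4, np.pi / 3, np.pi / 6)
x = np.array([0.0, 0.0, 1.0])
print(R @ x, np.linalg.norm(R @ x))  # the norm stays 1: rotations preserve length
```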

Spherical correlation: recalling the inner product definition from Cohen et al. (2018), the correlation between a K-valued spherical signal f and a filter ψ, f, ψ : S^2 → R^K, can be formulated as:

$$[\psi \star f](R) = \langle L_R \psi, f \rangle = \int_{S^2} \sum_{k=1}^{K} \psi_k(R^{-1}x)\, f_k(x)\, dx \qquad (1)$$

where the operator L_R rotates the function by R by composing its input with R^{-1}, i.e. [L_R f](x) = f(R^{-1}x), with x ∈ S^2. Although both the input and the filter live on S^2, the spherical correlation produces an output signal defined on SO(3) Cohen et al. (2018).
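To make (1) concrete, the sketch below approximates a single-channel spherical correlation with a plain Riemann sum over roughly uniform samples on the sphere. The sampling scheme, the function handles and the grid of candidate rotations are our own illustrative choices; actual Spherical CNNs evaluate (1) with band-limited generalized FFTs rather than this naive quadrature.

```python
import numpy as np

def fibonacci_sphere(n):
    """Roughly uniform samples on S^2 (Fibonacci lattice), shape (n, 3)."""
    i = np.arange(n) + 0.5
    beta = np.arccos(1.0 - 2.0 * i / n)          # inclination
    alpha = np.pi * (1.0 + 5.0 ** 0.5) * i       # azimuth
    return np.stack([np.sin(beta) * np.cos(alpha),
                     np.sin(beta) * np.sin(alpha),
                     np.cos(beta)], axis=1)

def spherical_correlation(f, psi, rotations, n_samples=4096):
    """Naive Riemann-sum version of Eq. (1), single channel:
    out[R] ~= (4*pi/n) * sum_j psi(R^{-1} x_j) * f(x_j).
    f and psi are callables mapping (n, 3) unit vectors to (n,) values."""
    xs = fibonacci_sphere(n_samples)
    fx = f(xs)                                   # sample the input signal once
    cell = 4.0 * np.pi / n_samples               # approximate area per sample
    # xs @ R has rows (R^T x_j)^T = (R^{-1} x_j)^T, i.e. the rotated filter argument.
    return np.array([cell * np.sum(psi(xs @ R) * fx) for R in rotations])

# Example: a bump around +z correlated with itself peaks at the identity rotation.
bump = lambda x: np.exp(8.0 * (x[:, 2] - 1.0))
print(spherical_correlation(bump, bump, [np.eye(3)]))
```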

SO(3) correlation: similarly, by extending L_R to operate on SO(3) signals, i.e. [L_R f](Q) = f(R^{-1}Q) with Q ∈ SO(3), where R^{-1}Q denotes the composition of rotations, we can define the correlation between a signal f and a filter ψ on the rotation group, f, ψ : SO(3) → R^K:

$$[\psi \star f](R) = \langle L_R \psi, f \rangle = \int_{SO(3)} \sum_{k=1}^{K} \psi_k(R^{-1}Q)\, f_k(Q)\, dQ \qquad (2)$$

Spherical and SO(3) correlation equivariance w.r.t. rotations: it can be shown that both correlations in (1) and (2) are equivariant with respect to rotations of the input signal. The feature map obtained by correlation of a filter ψ with an input signal rotated by Q, L_Q f, can be equivalently computed by rotating by the same rotation the feature map obtained by correlation of ψ with the original input signal f, i.e.:

$$[\psi \star [L_Q f]](R) = [L_Q [\psi \star f]](R) \qquad (3)$$

Signal Flow: in Spherical CNNs, the input signal, e.g. an image or, as in our setting, a point cloud, is first transformed into a K-valued spherical signal. Then, the first network layer (S^2 layer) computes feature maps by spherical correlations (1). As the computed feature maps are SO(3) signals, the successive layers (SO(3) layers) compute deeper feature maps by SO(3) correlations (2).

3.2 Methodology

Our problem can be formalized as follows. Given the set of 3D point clouds 𝒫 and two point clouds P, P̃ ∈ 𝒫, with P̃ = R·P and R ∈ SO(3), we indicate by R·P the application of the 3D rotation matrix R to all the points of P. We then aim at learning a function g : 𝒫 → SO(3) such that:

$$P^{C} = g(P)^{-1} \cdot P \qquad (4)$$
$$g(R \cdot P) = R \cdot g(P) \qquad (5)$$

We define the rotated cloud P^C in (4) to be the canonical, rotation-neutral version of P, i.e. the function outputs the inverse of the 3D rotation matrix that brings the points in P into their canonical reference frame. (5) states the equivariance property of g: if the input cloud is rotated, the output of the function should undergo the same rotation. As a result, two rotated versions of the same cloud are brought into the same canonical reference frame by (4).

Due to the equivariance property of Spherical CNN layers, upon a rotation of the input signal each feature map rotates accordingly. Moreover, the domain of the feature maps in Spherical CNNs is SO(3), i.e. each value of the feature map is naturally associated with a rotation. This means that one could just track any distinctive feature map value to establish a canonical orientation satisfying (4) and (5). Indeed, defining Φ as the composition of the S^2 and SO(3) correlation layers in our network, if the last layer produces the feature map Φ(f_P) when processing the spherical signal f_P for the cloud P, the same network will compute the feature map L_R Φ(f_P) when processing the rotated cloud R·P, whose spherical signal is L_R f_P. Hence, if for instance we select the maximum value of the feature map as the distinctive value to track, and the location of the maximum is at rotation R_max in Φ(f_P), the maximum will be found at R·R_max in the rotated feature map. Then, by letting g(P) = R_max, we get g(R·P) = R·R_max = R·g(P), which satisfies (4) and (5). Therefore, we realize the function g by a Spherical CNN and we utilize the arg max operator on the feature map computed by the last correlation layer to define its output. In principle, equivariance alone would guarantee that (4) and (5) are satisfied. Unfortunately, while for continuous functions the network is exactly equivariant, this does not hold for its discretized version, mainly due to feature map rotation, which is exact only for bandlimited functions Cohen et al. (2018). Moreover, equivariance to rotations does not hold for altered versions of the same cloud, e.g. when a part of it is occluded due to view-point changes. We tackle these issues using a self-supervised loss computed on the rotations extracted when aligning a pair of point clouds to guide the learning, and an ad-hoc augmentation to increase robustness to occlusions. Through the use of a soft-argmax layer, we can back-propagate the loss gradient from the estimated rotations to the positions of the maxima extracted from the feature maps and to the filters, which overall lets the network learn a robust function g.
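As an illustration of this selection step, the sketch below takes an SO(3) feature map discretized over a ZYZ Euler-angle grid, reads g(P) at its arg max and moves the cloud into the learned canonical frame following the convention of (4) above; the grid layout and function names are our own.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def canonicalize(cloud, feature_map, alphas, betas, gammas):
    """cloud: (N, 3) points; feature_map: (A, B, G) values over the ZYZ grid
    spanned by alphas (A,), betas (B,), gammas (G,). Returns g(P)^{-1} * P."""
    ia, ib, ig = np.unravel_index(np.argmax(feature_map), feature_map.shape)
    g = Rotation.from_euler("ZYZ", [alphas[ia], betas[ib], gammas[ig]]).as_matrix()
    # Applying g^{-1} = g^T to every row p^T is equivalent to cloud @ g.
    return cloud @ g
```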

From point clouds to spherical signals: Spherical CNNs require spherical signals as input. A known approach to compute them for point cloud data consists in transforming point coordinates from the input Euclidean reference system into a spherical one and then constructing a quantization grid within this new coordinate system You et al. (2018); Spezialetti et al. (2019). The i-th cell of the grid is indexed by three spherical coordinates (α_i, β_i, d_i), where α_i and β_i represent the azimuth and inclination angles of its center and d_i is the radial distance from the center. Then, the K cells along the radial dimension with constant azimuth and inclination are seen as the channels of a K-valued signal at location (α, β) on the unit sphere S^2. The resulting K-valued spherical signal measures the density of the points within each cell at distance d_i.
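A minimal NumPy sketch of this conversion is given below: points are counted in (azimuth, inclination, radial-shell) bins and normalized to a density. The grid resolutions and the equi-angular binning are illustrative assumptions, not necessarily the exact parametrization used in the paper.

```python
import numpy as np

def cloud_to_spherical_signal(points, radius, n_alpha=64, n_beta=64, n_radial=4):
    """Build a K-channel spherical signal (K = n_radial) from a point cloud
    centered at the origin: each (azimuth, inclination) cell stores the fraction
    of points falling in each radial shell. Returns shape (n_radial, n_beta, n_alpha)."""
    r = np.linalg.norm(points, axis=1)
    keep = (r > 1e-9) & (r <= radius)
    p, r = points[keep], r[keep]
    alpha = np.arctan2(p[:, 1], p[:, 0]) % (2 * np.pi)      # azimuth in [0, 2*pi)
    beta = np.arccos(np.clip(p[:, 2] / r, -1.0, 1.0))       # inclination in [0, pi]
    ia = np.minimum((alpha / (2 * np.pi) * n_alpha).astype(int), n_alpha - 1)
    ib = np.minimum((beta / np.pi * n_beta).astype(int), n_beta - 1)
    ir = np.minimum((r / radius * n_radial).astype(int), n_radial - 1)
    signal = np.zeros((n_radial, n_beta, n_alpha))
    np.add.at(signal, (ir, ib, ia), 1.0)                    # histogram of point counts
    return signal / max(len(p), 1)                          # normalize to a density
```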

Figure 2: Training pipeline. We illustrate the pipeline for local patches, but the same applies to point clouds representing full shapes. During training we apply the network to a randomly extracted 3D patch P and to its augmented version P̃ in order to extract the aligning rotations g(P) and g(P̃), respectively. At test time only one branch is involved. The numbers below the spherical signal indicate the bandwidths along the azimuth, inclination and radial dimensions, while the triplets under the layers indicate input bandwidth, output bandwidth and number of channels.

Training pipeline: an illustration of the Compass training pipeline is shown in Figure 2. During training, our objective is to strengthen the equivariance property of the Spherical CNN, such that the locations selected on the feature maps by the arg max vary consistently between rotated versions of the same point cloud. To this end, we train our network with two streams in a Siamese fashion Chopra et al. (2005). In particular, given P and P̃ = R·P, with P ∈ 𝒫 and R a known random rotation matrix, the first branch of the network computes the aligning rotation matrix for P, g(P), while the second branch computes the aligning rotation matrix for P̃, g(P̃). Should the feature maps on which the two maxima are extracted be perfectly equivariant, it would follow that g(P̃) = R·g(P). For that reason, the degree of misalignment of the maxima locations can be assessed by comparing the actual rotation matrix predicted by the second branch, g(P̃), to the ideal rotation matrix that should be predicted, R·g(P). We can thus cast our learning objective as the minimization of a loss measuring the distance between these two rotations. A natural geodesic metric on the SO(3) manifold is given by the angular distance between two rotations Hartley et al. (2013). Indeed, any element in SO(3) can be parametrized as a rotation by an angle around an axis. The angular distance between two rotations parametrized as rotation matrices R_1 and R_2 is defined as the angle that parametrizes the rotation R_1 R_2^T and corresponds to the length of the shortest path from R_1 to R_2 on the SO(3) manifold Hartley et al. (2013); Huynh (2009); Mahendran et al. (2017); Zhou et al. (2019). Thus, our loss is given by the angular distance between g(P̃) and R·g(P):

$$\mathcal{L} = \arccos\left( \frac{ \mathrm{tr}\left( g(\tilde{P})\, \big(R \cdot g(P)\big)^{T} \right) - 1 }{2} \right) \qquad (6)$$

As our network has to predict a single canonicalizing rotation, we apply the loss once, i.e. only to the output of the last layer of the network.
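A sketch of the geodesic distance in (6) is given below in PyTorch; in the Compass pipeline, R_pred would correspond to g(P̃) and R_target to the ideal rotation R·g(P), per the formulation above.

```python
import torch

def angular_distance_loss(R_pred, R_target, eps=1e-7):
    """Geodesic loss of Eq. (6): the angle of the relative rotation, computed as
    arccos((trace(R_pred^T R_target) - 1) / 2), averaged over the batch.
    R_pred, R_target: (..., 3, 3) rotation matrices; differentiable in R_pred."""
    rel = torch.matmul(R_pred.transpose(-1, -2), R_target)
    trace = rel.diagonal(dim1=-2, dim2=-1).sum(-1)
    cos = ((trace - 1.0) / 2.0).clamp(-1.0 + eps, 1.0 - eps)  # clamp for numerical safety
    return torch.acos(cos).mean()
```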

Soft-argmax: the arg max operation on a discrete feature map returns the location, along the three dimensions corresponding to the ZYZ Euler angles, where the maximum correlation value occurs. To optimize the loss in (6), the gradients w.r.t. the locations of the feature map where the maxima are detected have to be computed. To render the operation differentiable we add a soft-argmax operator Honari et al. (2018); Chapelle and Wu (2010) following the last layer of the network. Let us denote as Φ(f_P) the last feature map computed by the network for a given input point cloud P. A straightforward implementation of a soft-argmax layer to get the coordinates of the maximum in Φ(f_P) is given by

$$(\hat{\alpha}, \hat{\beta}, \hat{\gamma}) = \sum_{i} \mathrm{softmax}\big(\tau\, \Phi(f_P)\big)_i \, (\alpha_i, \beta_i, \gamma_i) \qquad (7)$$

where softmax is a 3D spatial softmax, the parameter τ controls the temperature of the resulting probability map, and the indices i iterate over the coordinates of the feature map. In other words, a soft-argmax operator computes the location as a weighted sum of all the coordinates, where the weights are given by a softmax of the map. Experimentally, this proved not effective. As a more robust solution, we scale the output of the softmax according to the distance of each bin from the feature map arg max. To let the bins near the arg max contribute more to the final result, we smooth the distances by a Parzen function Parzen (1962), yielding a maximum value in the bin corresponding to the arg max and decreasing monotonically to zero.
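The plain soft-argmax of (7) can be sketched as follows in PyTorch; the Parzen-window re-weighting around the hard arg max that Compass adds on top is omitted here, and the coordinate grid layout is an illustrative assumption.

```python
import torch

def soft_argmax_3d(feature_map, coords, temperature=10.0):
    """Differentiable surrogate for arg max over a 3D SO(3) grid (Eq. (7)):
    a spatial softmax turns the map into a probability volume and the output is
    the probability-weighted average of the grid coordinates.
    feature_map: (A, B, G); coords: (A, B, G, 3) Euler angles of each bin."""
    probs = torch.softmax(temperature * feature_map.reshape(-1), dim=0)
    return (probs.unsqueeze(-1) * coords.reshape(-1, 3)).sum(dim=0)  # expected (alpha, beta, gamma)
```

Note that such a probability-weighted average mixes contributions from distant bins and across the angular wrap-around, which is consistent with the observation above that the plain version proved not effective and motivates concentrating the weights around the arg max.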

Figure 3: Local support of a keypoint depicting the corner of a table, divided in 3 shells. Randomly selected point in black; removed points in red.

Learning to handle occlusions: in real-world settings, rotation of an object or scene (i.e. a viewpoint change) naturally produces occlusions to the viewer. Recalling that the second branch of the network operates on P̃, a randomly rotated version of P, it is possible to improve the robustness of the network to real-world occlusions and missing parts by augmenting P̃. A simple way to do so is to randomly select a point from P̃ and delete some of its surrounding points. In our implementation, this augmentation happens with an assigned probability. P̃ is divided into concentric spherical shells, with the probability for the random point to be selected in a shell increasing with its distance from the center of P̃. Additionally, the number of removed points around the selected point is a bounded random percentage of the total points in the cloud. An example can be seen in Figure 3.
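A sketch of this augmentation is given below; the shell count, removal fractions and linear shell weighting are illustrative assumptions.

```python
import numpy as np

def occlusion_augmentation(points, p_apply=0.5, n_shells=3,
                           min_frac=0.1, max_frac=0.3, rng=np.random):
    """Simulate self-occlusions: with probability p_apply, pick a seed point,
    favouring outer spherical shells, and delete a bounded random fraction of
    its nearest neighbours. points: (N, 3) array; returns the reduced cloud."""
    if rng.rand() > p_apply or len(points) < 10:
        return points
    r = np.linalg.norm(points - points.mean(axis=0), axis=1)
    shell = np.minimum((r / (r.max() + 1e-9) * n_shells).astype(int), n_shells - 1)
    weights = (shell + 1).astype(float)                 # outer shells are more likely
    seed = points[rng.choice(len(points), p=weights / weights.sum())]
    n_remove = int(len(points) * rng.uniform(min_frac, max_frac))
    nearest = np.argsort(np.linalg.norm(points - seed, axis=1))[:n_remove]
    return np.delete(points, nearest, axis=0)
```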

Network Architecture: the network comprises one S^2 layer followed by three SO(3) layers, whose numbers of output channels are set to 40, 20, 10 and 1, with the bandwidths reported in Figure 2. The input spherical signal is computed with multiple channels along the radial dimension.

4 Applications of Compass

We evaluate Compass on two challenging tasks. The first one is the estimation of a canonical orientation of local surface patches, a key step in creating rotation-invariant local 3D descriptors Salti et al. (2014); Gojcic et al. (2019); Yang et al. (2017). In the second task, the canonical orientation provided by Compass is instead used to perform highly effective rotation-invariant shape classification by leveraging a simple PointNet classifier. The source code for training and testing Compass is available at https://github.com/CVLAB-Unibo/compass.

4.1 Canonical orientation of local surface patches

Problem formulation: on local surface patches, we evaluate Compass through the repeatability Petrelli and Di Stefano (2011); Melzi et al. (2019) of the local reference frame (LRF) it estimates at corresponding keypoints in different views of the same scene. All the datasets provide several 2.5D scans, i.e. fragments, representing the same model, i.e. an object or a scene depending on the dataset, acquired from different viewpoints. All fragments belonging to a test model can be grouped into pairs (F_i, F_j) that share an area of overlap. A set of correspondences C can be computed for each pair by applying the known rigid ground-truth transformation T_ij, which aligns F_j to F_i in a common reference frame; C is obtained by uniformly sampling points in the overlapping area between F_i and F_j. Finally, the percentage of repeatable LRFs, Rep, for (F_i, F_j) can be calculated as follows:

$$Rep = \frac{1}{|C|} \sum_{(p, q) \in C} \mathbb{1}\Big( \langle \mathbf{a}_p,\, R_{ij}\, \mathbf{a}_q \rangle \geq \cos\theta \;\;\; \forall\, \mathbf{a} \in \{x, y, z\} \Big) \qquad (8)$$

where 1(·) is an indicator function, ⟨·, ·⟩ denotes the dot product between two vectors, a_p and a_q denote corresponding axes of the LRFs estimated at p ∈ F_i and q ∈ F_j, R_ij is the rotation component of T_ij, and θ is a threshold on the angle between corresponding axes. Rep measures the percentage of reference frames which are aligned, i.e. differ only by a small angle along all axes, between the two views. The final value of Rep for a given model is computed by averaging over all the pairs.
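In the spirit of (8), a repeatability check over corresponding LRFs could be sketched as follows; the axes-as-columns convention and the threshold value are our assumptions.

```python
import numpy as np

def lrf_repeatability(lrfs_i, lrfs_j, R_gt, theta_deg=10.0):
    """Fraction of corresponding LRFs whose axes agree after alignment.
    lrfs_i, lrfs_j: (N, 3, 3) LRFs (axes as columns) estimated at corresponding
    keypoints in views i and j; R_gt: rotation aligning view j to view i."""
    cos_thr = np.cos(np.deg2rad(theta_deg))
    aligned = np.einsum("ab,nbc->nac", R_gt, lrfs_j)   # rotate each LRF of view j
    dots = np.einsum("nab,nab->nb", lrfs_i, aligned)   # per-axis dot products
    return np.mean(np.all(dots >= cos_thr, axis=1))
```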

Test-time adaptation: due to the self-supervised nature of Compass, it is possible to use the test set to train the network without incurring in data snooping, since no external ground-truth information is involved. This test-time training can be carried out very quickly, right before the test, to adapt the network to unseen data and increase its performance, especially in transfer learning scenarios. This is common practice with self-supervised approaches Luo et al. (2020).

Datasets: we conduct experiments on three heterogeneous publicly available datasets: 3DMatch Zeng et al. (2017), ETH Pomerleau et al. (2012); Gojcic et al. (2019), and Stanford Views Curless and Levoy (1996). 3DMatch is the reference benchmark to assess the performance of learned local 3D descriptors in registration applications Zeng et al. (2017); Deng et al. (2018); Spezialetti et al. (2019); Choy et al. (2019); Gojcic et al. (2019). It is a large ensemble of existing indoor datasets, where each fragment is created by fusing 50 consecutive depth frames of an RGB-D sensor. It contains 62 scenes, split into 54 for training and 8 for testing. ETH is a collection of outdoor landscapes acquired in different seasons with a laser scanner sensor. Finally, Stanford Views contains real scans of 4 objects from the Stanford 3D Scanning Repository Curless and Levoy (1996), acquired with a laser scanner.

Experimental setup: we train Compass on 3DMatch following the standard procedure of the benchmark, with 48 scenes for training and 6 for validation. From each point cloud, we uniformly pick keypoints at a fixed spacing, and the points within a fixed radius around each keypoint are used as the local surface patch fed to the network. Once trained, the network is tested on the test split of 3DMatch. The network learned on 3DMatch is also tested on ETH and Stanford Views, using different support radii to account for the different sizes of the models in these datasets. We also apply test-time adaptation on ETH and Stanford Views: the test set is used for a quick 2-epoch training with a 20% validation split, right before being used to assess the performance of the network. We use Adam Kingma and Ba (2014) as optimizer, with a learning rate of 0.001 when training on 3DMatch and for test-time adaptation on Stanford Views, and of 0.0005 for adaptation on ETH. We compare our method with recent and established LRF proposals: GFrames Melzi et al. (2019), TOLDI Yang et al. (2017), a variant of TOLDI recently proposed in Gojcic et al. (2019) that we refer to here as 3DSN, FLARE Petrelli and Di Stefano (2012), and SHOT Salti et al. (2014). For all methods we use the publicly available implementations. However, the implementation provided for GFrames could not process the large point clouds of 3DMatch and ETH due to memory limits, so we can report results for GFrames only on Stanford Views.

LRF Repeatability (Rep)

Dataset          SHOT   FLARE  TOLDI  3DSN   GFrames  Compass  Compass (Adapted)
3DMatch          0.212  0.360  0.215  0.220  n.a.     0.375    n.a.
ETH              0.273  0.264  0.185  0.202  n.a.     0.308    0.317
Stanford Views   0.132  0.241  0.197  0.173  0.256    0.361    0.388

Table 1: LRF repeatability on the datasets: SHOT Tombari et al. (2010), FLARE Petrelli and Di Stefano (2012), TOLDI Yang et al. (2017), 3DSN Gojcic et al. (2019), GFrames Melzi et al. (2019), Compass. Best result for each dataset (row) in bold.

Results: the first row of Table 1 reports Rep on the 3DMatch test set. Compass outperforms the most competitive baseline, FLARE, with larger gains over the other baselines. The results reported in the second row for ETH and in the third row for Stanford Views confirm the advantage of a data-driven model like Compass over hand-crafted proposals: while the relative rank of the baselines changes according to how well the assumptions behind their design fit the traits of the dataset under test, with SHOT taking the lead on ETH and the recently introduced GFrames on Stanford Views, Compass consistently outperforms them. Remarkably, this already happens when using pure transfer learning for Compass, i.e. the network trained on 3DMatch: in spite of the large differences in acquisition modalities and shapes of the models between training and test time, Compass has learned a robust and general notion of canonical orientation for a local patch. This is also confirmed by the slight improvement achieved with test-time adaptation, which nonetheless sets a new state of the art on these datasets. Finally, we point out that Compass extracts the canonical orientation of a patch in 17.85 ms.

4.2 Rotation-invariant Shape Classification

Problem formulation: Object classification is a central task in computer vision applications, and the main nuisance that methods processing 3D point clouds have to withstand is rotation. To show the general applicability of our proposal and further assess its performance, we wrap Compass in a shape classification pipeline. Hence, in this experiment, Compass is used to orient full shapes rather than local patches. To stress the importance of correct rotation neutralization, as shape classifier we rely on a simple PointNet Qi et al. (2017a), and Compass is employed at train and test time to canonically orient shapes before sending them through the network.

Datasets: We test our model on the ModelNet40 Zhirong Wu et al. (2015) shape classification benchmark. This dataset has 12,311 CAD models from 40 man-made object categories, split into 9,843 for training and 2,468 for testing. In our trials, we actually use the point clouds sampled from the original CAD models provided by the authors of PointNet. We also performed a qualitative evaluation of the transfer learning performance of Compass by orienting clouds from the ShapeNet Chang et al. (2015) dataset.

Experimental setup: we train Compass on ModelNet40 using 8,192 samples for training and 1,648 for validation. Once Compass is trained, we train PointNet following the settings in Qi et al. (2017a), disabling the T-Nets, and rotating the input point clouds to the canonical orientation learned by Compass. We follow the protocol described in You et al. (2018) to assess the rotation-invariance of the selected methods: we do not augment the dataset with rotated versions of the input clouds when training PointNet; we then test it with the original test clouds, i.e. in the canonical orientation provided by the dataset, and by arbitrarily rotating them. We use Adam Kingma and Ba (2014) as optimizer, with 0.001 as the learning rate.

Classification Accuracy (Acc. %)

Method                              NR      AR
PointNet Qi et al. (2017a)          88.45   12.47
PointNet++ Qi et al. (2017b)        89.82   21.35
Point2Seq Liu et al. (2019)         92.60   10.53
Spherical CNN Cohen et al. (2018)   81.73   55.62
LDGCNN Zhang et al. (2019a)         92.91   17.82
SO-Net Li et al. (2018)             94.44   9.64
PRIN You et al. (2018)              80.13   70.35
Compass + PointNet                  80.51   72.20

Table 2: Classification accuracy on the ModelNet40 dataset when training without augmentation. The NR column reports the accuracy attained when testing on the clouds in the canonical orientation provided by the dataset, the AR column when testing under arbitrary rotations. Best result for each column in bold.

Results: results are reported in Table 2; the figures for all the baselines come from You et al. (2018). PointNet fails when trained without augmenting the training data with random rotations and tested with shapes under arbitrary rotations. Similarly, in these conditions most of the state-of-the-art methods cannot generalize to unseen rotations. If, however, we first neutralize the orientation by Compass and then run PointNet, it gains almost 60 points and achieves 72.20% accuracy, outperforming the state of the art on the arbitrarily rotated test set. This shows the feasibility and the effectiveness of pursuing rotation-invariant processing by canonical orientation estimation. It is also worth observing how, in the simplified scenario where the input data is always under the same orientation (NR), a plain PointNet performs better than Compass+PointNet. Indeed, as the T-Net is trained end-to-end with PointNet, it can learn that the best orientation in this simplified scenario is the identity matrix. Conversely, Compass performs an unneeded canonicalization step that may only hinder performance due to its errors.

In Figure 6, we present some models from ModelNet40, randomly rotated and then oriented by Compass. The model estimates a very consistent canonical orientation for each object class, despite the large shape variations within the classes.

(a) ModelNet40
(b) ShapeNet
Figure 6: Qualitative results on ModelNet40 and ShapeNet in transfer learning. Top row: randomly rotated input cloud. Bottom row: cloud oriented by Compass.

Finally, to assess the generalization abilities of Compass on full shapes as well, we performed qualitative transfer learning tests on the ShapeNet dataset, reported in Figure 6. Even though the geometries differ, the model trained on ModelNet40 is able to generalize to an unseen dataset and recovers a similar canonical orientation for the same object.

5 Conclusions

We have presented Compass, a novel self-supervised framework to canonically orient 3D shapes that leverages the equivariance property of Spherical CNNs. Avoiding explicit supervision, we let the network learn to predict the orientation best suited to the underlying surface geometry. Our approach robustly handles occlusions thanks to an effective data augmentation. Experimental results demonstrate the benefits of our approach for the tasks of canonical orientation estimation for local surface patches and rotation-invariant shape classification. Compass demonstrates the effectiveness of learning a canonical orientation in order to pursue rotation-invariant shape processing, and we hope it will raise interest and stimulate further studies of this approach.

While in this work we evaluated invariance to global rotations according to the protocol of You et al. (2018) to perform a fair comparison with the state of the art, it would also be interesting to investigate the behavior of Compass and the competitors when trained on the full spectrum of SO(3) rotations, as done in Esteves et al. (2018). This is left as future work.

6 Broader Impact

In this work we presented a general framework to canonically orient 3D shapes based on deep-learning. The proposed methodology can be especially valuable for the broad spectrum of vision applications that entail reasoning about surfaces. We live in a three-dimensional world: cognitive understanding of 3D structures is pivotal for acting and planning.

Acknowledgments

We would like to thank Injenia srl and UTFPR for partly supporting this research work.

References

  • A. Aldoma, F. Tombari, R. B. Rusu, and M. Vincze (2012) OUR-CVFH: oriented, unique and repeatable clustered viewpoint feature histogram for object recognition and 6DOF pose estimation. In Joint DAGM (German Association for Pattern Recognition) and OAGM Symposium, pp. 113–122. Cited by: §1.
  • A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §4.2.
  • O. Chapelle and M. Wu (2010) Gradient descent optimization of smoothed information retrieval metrics. Information retrieval 13 (3), pp. 216–235. Cited by: §3.2.
  • S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1, pp. 539–546. Cited by: §3.2.
  • C. Choy, J. Park, and V. Koltun (2019) Fully convolutional geometric features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8958–8966. Cited by: §4.1.
  • C. S. Chua and R. Jarvis (1997) Point signatures: a new representation for 3d object recognition. International Journal of Computer Vision 25 (1), pp. 63–85. Cited by: §2.
  • T. S. Cohen, M. Geiger, J. Köhler, and M. Welling (2018) Spherical cnns. arXiv preprint arXiv:1801.10130. Cited by: §1, §2, §3.1, §3.1, §3.2, §3, Table 2.
  • B. Curless and M. Levoy (1996) A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pp. 303–312. Cited by: §4.1.
  • H. Deng, T. Birdal, and S. Ilic (2018) PPF-FoldNet: unsupervised learning of rotation invariant 3d local descriptors. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 602–618. Cited by: §4.1.
  • C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Daniilidis (2018) Learning so (3) equivariant representations with spherical cnns. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 52–68. Cited by: §1, §1, §2, §3, §5.
  • Z. Gojcic, C. Zhou, J. D. Wegner, and A. Wieser (2019) The perfect match: 3d point cloud matching with smoothed densities. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5545–5554. Cited by: §1, §4.1, §4.1, Table 1, §4.
  • Y. Guo, F. A. Sohel, M. Bennamoun, J. Wan, and M. Lu (2013) RoPS: a local feature descriptor for 3d rigid objects based on rotational projection statistics. In 2013 1st International Conference on Communications, Signal Processing, and their Applications (ICCSPA), pp. 1–6. Cited by: §1, §2.
  • R. Hartley, J. Trumpf, Y. Dai, and H. Li (2013) Rotation averaging. International journal of computer vision 103 (3), pp. 267–305. Cited by: §3.2.
  • S. Honari, P. Molchanov, S. Tyree, P. Vincent, C. Pal, and J. Kautz (2018) Improving landmark localization with semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1546–1555. Cited by: §3.2.
  • D. Q. Huynh (2009) Metrics for 3d rotations: comparison and analysis. Journal of Mathematical Imaging and Vision 35 (2), pp. 155–164. Cited by: §3.2.
  • P. Jansen and J. Kellner (2015) The role of rotational hand movements and general motor ability in children’s mental rotation performance. Frontiers in Psychology 6, pp. 984. Cited by: §1.
  • A. E. Johnson and M. Hebert (1999) Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Transactions on pattern analysis and machine intelligence 21 (5), pp. 433–449. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1, §4.2.
  • J. Li, B. M. Chen, and G. Hee Lee (2018) So-net: self-organizing network for point cloud analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9397–9406. Cited by: Table 2.
  • X. Liu, Z. Han, Y. Liu, and M. Zwicker (2019) Point2Sequence: learning the shape representation of 3d point clouds with an attention-based sequence to sequence network. In Thirty-Third AAAI Conference on Artificial Intelligence. Cited by: Table 2.
  • X. Luo, J. Huang, R. Szeliski, K. Matzen, and J. Kopf (2020) Consistent video depth estimation. 39 (4). Cited by: §4.1.
  • S. Mahendran, H. Ali, and R. Vidal (2017) 3d pose regression using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 2174–2182. Cited by: §3.2.
  • J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst (2015) Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE international conference on computer vision workshops, pp. 37–45. Cited by: §1.
  • S. Melzi, R. Spezialetti, F. Tombari, M. M. Bronstein, L. Di Stefano, and E. Rodola (2019) Gframes: gradient-based local reference frame for 3d shape matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4629–4638. Cited by: §1, §2, §4.1, §4.1, Table 1.
  • A. Mian, M. Bennamoun, and R. Owens (2010) On the repeatability and quality of keypoints for local feature-based 3d object retrieval from cluttered scenes. International Journal of Computer Vision 89 (2-3), pp. 348–361. Cited by: §2.
  • D. Novotny, N. Ravi, B. Graham, N. Neverova, and A. Vedaldi (2019) C3dpo: canonical 3d pose networks for non-rigid structure from motion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7688–7697. Cited by: §2.
  • E. Parzen (1962) On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33 (3), pp. 1065–1076. Cited by: §3.2.
  • A. Petrelli and L. Di Stefano (2011) On the repeatability of the local reference frame for partial shape matching. In 2011 International Conference on Computer Vision, pp. 2244–2251. Cited by: §1, §4.1.
  • A. Petrelli and L. Di Stefano (2012) A repeatable and efficient canonical reference for surface matching. In 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, pp. 403–410. Cited by: §2, §4.1, Table 1.
  • F. Pomerleau, M. Liu, F. Colas, and R. Siegwart (2012) Challenging data sets for point cloud registration algorithms. The International Journal of Robotics Research 31 (14), pp. 1705–1711. Cited by: §4.1.
  • C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017a) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §1, §2, §4.2, §4.2, Table 2.
  • C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017b) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, pp. 5105–5114. External Links: ISBN 9781510860964 Cited by: §1, §2, Table 2.
  • Y. Rao, J. Lu, and J. Zhou (2019) Spherical fractal convolutional neural networks for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 452–460. Cited by: §1, §2.
  • R. B. Rusu, N. Blodow, and M. Beetz (2009) Fast point feature histograms (fpfh) for 3d registration. In 2009 IEEE international conference on robotics and automation, pp. 3212–3217. Cited by: §2.
  • S. Salti, F. Tombari, and L. Di Stefano (2014) SHOT: unique signatures of histograms for surface and texture description. Computer Vision and Image Understanding 125, pp. 251–264. Cited by: §1, §2, §4.1, §4.
  • R. N. Shepard and J. Metzler (1971) Mental rotation of three-dimensional objects. Science 171 (3972), pp. 701–703. Cited by: §1.
  • R. Spezialetti, S. Salti, and L. D. Stefano (2019) Learning an effective equivariant 3d descriptor without supervision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6401–6410. Cited by: §3.2, §4.1.
  • S. Sridhar, D. Rempe, J. Valentin, B. Sofien, and L. J. Guibas (2019) Multiview aggregation for learning category-specific shape reconstruction. In Advances in Neural Information Processing Systems, pp. 2351–2362. Cited by: §2.
  • F. Tombari, S. Salti, and L. Di Stefano (2010) Unique signatures of histograms for local surface description. In European conference on computer vision, pp. 356–369. Cited by: Table 1.
  • S. G. Vandenberg and A. R. Kuse (1978) Mental rotations, a group test of three-dimensional spatial visualization. Perceptual and motor skills 47 (2), pp. 599–604. Cited by: §1.
  • H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas (2019) Normalized object coordinate space for category-level 6d object pose and size estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2642–2651. Cited by: §2.
  • J. Yang, Q. Zhang, Y. Xiao, and Z. Cao (2017) TOLDI: an effective and robust approach for 3d local shape description. Pattern Recognition 65, pp. 175–187. Cited by: §1, §4.1, Table 1, §4.
  • Y. You, Y. Lou, Q. Liu, Y. Tai, W. Wang, L. Ma, and C. Lu (2018) Prin: pointwise rotation-invariant network. arXiv preprint arXiv:1811.09361. Cited by: §1, §2, §3.2, §4.2, §4.2, Table 2, §5.
  • A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser (2017) 3dmatch: learning local geometric descriptors from rgb-d reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1802–1811. Cited by: §4.1.
  • K. Zhang, M. Hao, J. Wang, C. W. de Silva, and C. Fu (2019a) Linked dynamic graph cnn: learning on point cloud via linking hierarchical features. arXiv preprint arXiv:1904.10014. Cited by: §1, Table 2.
  • Z. Zhang, B. Hua, D. W. Rosen, and S. Yeung (2019b) Rotation invariant convolutions for 3d point clouds deep learning. In 2019 International Conference on 3D Vision (3DV), pp. 204–213. Cited by: §1, §2.
  • Zhirong Wu, S. Song, A. Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and J. Xiao (2015) 3D shapenets: a deep representation for volumetric shapes. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 1912–1920. Cited by: §4.2.
  • Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li (2019) On the continuity of rotation representations in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5745–5753. Cited by: §3.2.