### RASF

[ICLR'22] paper Representation-Agnostic Shape Fields


3D shape analysis has been widely explored in the era of deep learning. Numerous models have been developed for various 3D data representation formats, e.g., MeshCNN for meshes, PointNet for point clouds and VoxNet for voxels. In this study, we present Representation-Agnostic Shape Fields (RASF), a generalizable and computation-efficient shape embedding module for 3D deep learning. RASF is implemented as a learnable 3D grid with multiple channels to store local geometry. Based on RASF, shape embeddings for various 3D shape representations (point clouds, meshes and voxels) are retrieved by coordinate indexing. While there are multiple ways to optimize the learnable parameters of RASF, we provide two effective schemes in this paper for RASF pre-training: shape reconstruction and normal estimation. Once trained, RASF becomes a plug-and-play performance booster with negligible cost. Extensive experiments on diverse 3D representation formats, networks and applications validate the universal effectiveness of the proposed RASF. Code and pre-trained models are publicly available at https://github.com/seanywang0408/RASF


3D shape analysis is the foundation for understanding the physical world. It has a wide range of applications in real life, including robotics (liu2020keypose; liang2018deep), autopilot (shi2020spsequencenet; song2019apollocar3d; qi2018frustum; zhou2018voxelnet), medical imaging (yang2021reinventing; li2018h; yang2021ribseg) and movie animation (xu2019predicting; aberman2020skeleton; hertz2020deep). In recent years, research on this topic has prevailed and shown promising results in various tasks, such as object classification, part segmentation and scene segmentation.

Shapes can be represented in different data formats, among which meshes, point clouds and voxels are the most commonly used. In the deep learning era, most studies on these three representations feed coordinates or coordinate-like features into the backbone network. For point clouds, the direct input to the network is the point coordinates. For meshes, the node features of the graph are the vertex coordinates. For volumetric data, the 3D shape is denoted by whether a voxel at a particular position is occupied or not. Using coordinates to characterize a shape is simple and straightforward. However, the major problem is that coordinates lack contextual geometric information, so the capacity of the backbone network may be restricted. Even though various operators and backbones have been proposed to extract high-level features from coordinates by aggregating local geometry, the effect of the input feature itself is not yet clear.

The practice in Natural Language Processing (NLP) and Data Mining (DM) might shed some light on this issue. Word embeddings (mikolov2013linguistic; pennington2014glove) were proposed early in NLP to map words into an embedding space using embedding layers, which work as lookup tables indexed by the one-hot encoding of words. The word embeddings retrieved from embedding layers have similar values for words with similar meanings. This technique is widely adopted across NLP and notably boosts overall performance, regardless of how the text data is distributed and which language model is used (NEURIPS2020_1457c0d6; devlin2019bert; radford2019language; vaswani2017attention). Studies in data mining likewise learn continuous embedding representations for graph nodes (grover2016node2vec; perozzi2014deepwalk). Embedding learning in NLP and DM indicates that the effect of the input feature and the capacity of the backbone network are somewhat orthogonal to each other. This raises the question of whether there is a better way to denote a shape than vanilla coordinates, so as to facilitate the backbone network in downstream tasks, regardless of which backbone model is used.

In this work, we introduce Representation-Agnostic Shape Fields (RASF), a shape embedding layer that maps coordinates to shape embeddings with rich geometric information. RASF is implemented as a learnable multi-channel 3D grid. Similar to the lookup table in a word embedding layer, coordinates within a local shape index into this 3D grid and retrieve shape embeddings. With simple operations on the data, we make RASF compatible with the major 3D representations, including point clouds, meshes and voxels. To obtain the weights of RASF, we investigate several pre-training schemes, among which we find that self-supervised schemes, i.e., reconstruction and normal estimation, yield the best performance and generalizability. Once trained, RASF can be seamlessly plugged into any 3D deep learning pipeline to improve performance across diverse downstream tasks and datasets, with little code modification and computation cost (see Fig. 1). We empirically show that RASF consistently brings significant improvements under diverse backbones and applications, including object classification, part segmentation and scene segmentation.

In this work, we introduce a generalizable (i.e., usable with different 3D representations, backbones and downstream tasks) and computation-efficient shape embedding layer for 3D deep learning, named Representation-Agnostic Shape Fields (RASF). It applies a learnable multi-channel 3D grid to store local geometry. Shape embeddings for various 3D shape representations (point clouds, meshes and voxels) are retrieved by coordinate indexing. While there are multiple ways to obtain the parameters of RASF, we provide two effective schemes in this paper for RASF pre-training, namely shape reconstruction and normal estimation. Extensive experiments across different representations, backbones and downstream tasks are conducted to validate the generalization and efficiency of our proposed RASF.

Point cloud data can be obtained directly from 3D LiDAR sensors. Due to its compactness, representing only the surface of objects, and its aligned format (an N×3 matrix) that suits common deep learning frameworks, point clouds are the most extensively discussed representation in the field of 3D shape analysis. PointNet (qi2017pointnet) is the pioneering deep learning network for point clouds. It learns a feature for each point with a shared MLP and aggregates all points by global max-pooling. PointNet++ (qi2017pointnet++) supports hierarchical point aggregation to extract geometric information at different scales. DGCNN (wang2019dynamic) builds a dynamic graph in each layer of the network to incorporate neighbor nodes via its proposed EdgeConv module. PointCNN (li2018pointcnn), RSCNN (liu2019relation), DPAM (liu2019dynamic), ShellNet (zhang2019shellnet) and KPConv (thomas2019kpconv) extend the 3D convolution operation to irregular point cloud data in different ways, while PAT (yang2019modeling) and PCT (guo2021pct) leverage transformers to process point clouds. Shape Self-Correction (chen2021shape) proposes a self-supervised method for point cloud analysis.

Meshes are mostly used in computer graphics (kato2018neural; liu2019soft; pfaff2020learning; smirnov2021hodgenet). GWCNN (ezuz2017gwcnn) maps unstructured geometric data to a regular domain for non-rigid shape analysis. MeshCNN (hanocka2019meshcnn) adopts specialized convolution and pooling operations on mesh edges. The convolution is conducted on an edge and the four adjacent edges of its incident triangles, while the pooling operation generates new geometry via adaptive edge collapse. Besides, MeshCNN designs several hand-crafted features that characterize the edge geometry, namely the dihedral angle, two inner angles and two edge-length ratios for each face. This 5-dimensional vector is fed into the MeshCNN network as input. Alternatively, MeshNet (feng2019meshnet) regards faces as units. It introduces face-unit and feature-splitting to learn on meshes directly. A more recent work, HodgeNet (smirnov2021hodgenet), operates on vertex and edge features simultaneously. There are also works on single-image reconstruction that consider mesh data as graphs (wang2018pixel2mesh; gkioxari2019mesh; pan2019deep; wen2019pixel2mesh++). In this case, graph convolutions on nodes (vertexes) are adopted to aggregate local geometry.

Volumetric data provides regular grids to represent 3D shapes, which can be processed by methods analogous to those for 2D grids. 3DShapeNets (wu20153d) proposed to represent 3D shapes with volumetric grids and introduced 3D Convolutional Neural Networks (3DCNN) for voxel classification. VoxNet (maturana2015voxnet) utilized a 3DCNN for robust 3D object recognition. Voxception-ResNet (VRN) (brock2016generative) introduced popular 2D network blocks into volumetric networks. The drawback of volumetric representations lies in that 3DCNN is computationally expensive. More recently, LP-3DCNN (kumawat2019lp) was proposed to alleviate the computation issue by applying the 3D Short-Term Fourier Transform (STFT) to replace 3D convolution layers.

Surface feature descriptors or shape descriptors for non-rigid objects (generally in mesh format) are another line of work in 3D shape analysis, often applied in shape correspondence, retrieval and segmentation. Shape descriptors aim at representing the local geometry around a point on a non-rigid shape. Classic surface descriptors describe a local shape patch based on diffusion and spectral geometry to achieve isometry-invariance, for example, the heat kernel signature (HKS) (sun2009concise), wave kernel signature (WKS) (aubry2011wave) and optimal spectral descriptors (OSD) (litman2013learning). Masci et al. (masci2015geodesic) proposed to automatically learn shape descriptors for Riemannian manifolds using a geodesic convolutional neural network.

Embedding learning studies in NLP map the one-hot encoding of words to real-valued vectors via a learnable embedding layer. An embedding layer is a lookup table, using words as indices to retrieve word embeddings. The embedding layer can be trained in an unsupervised manner for general purposes, or in a supervised manner for a specific task, e.g., document classification.
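As a concrete illustration of this lookup-table view (the module sizes below are illustrative), a word embedding layer in PyTorch retrieves a learned vector per discrete index; RASF extends this idea from discrete word indices to continuous 3D coordinates:

```python
import torch
import torch.nn as nn

# An embedding layer is a lookup table: each row stores a learnable vector,
# and a word's integer id selects its row.
vocab_size, dim = 1000, 64          # illustrative sizes
emb = nn.Embedding(vocab_size, dim)

ids = torch.tensor([3, 17, 3])      # token ids; word 3 appears twice
vecs = emb(ids)                     # (3, 64)
# The same word always retrieves the same embedding.
assert torch.equal(vecs[0], vecs[2])
```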

We propose a shape embedding layer named Representation-Agnostic Shape Fields (RASF) to facilitate 3D shape analysis across various representations. As shown in Fig. 2 (a), RASF is implemented as a trainable 3D grid of shape R×R×R×C, where R is the resolution of the grid and C is the channel dimension. Common 3D shape representations are based on point coordinates. Suppose the shape is denoted as a point set P. For each point p_i in the shape, we extract its K-nearest-neighbor points from the whole point set. Regarding p_i as the central point, we normalize the coordinates of the local point set to the range of the grid. In other words, p_i is placed at the center of the grid, while the farthest neighbor of p_i is placed at the border of the grid. Similar to the word embedding layer that is indexed by the one-hot encoding of words, the normalized local point set serves as indexes to retrieve shape features from the grid. The difference is that one-hot encoding is discrete while coordinates are continuous, so the indexing is accomplished via trilinear interpolation. This operation, inspired by Spatial Transformer Networks (STNs) (jaderberg2015spatial), can be efficiently implemented in modern deep learning frameworks (e.g., grid_sample in PyTorch (paszke2017automatic)), which enables continuous indexing in batches to retrieve features from discrete regular grid data. The trilinear interpolation returns a (K+1)×C matrix for the K+1 points (the local point set includes the point itself), which is then reduced to a C-dimensional vector by max-pooling on the first dimension. This vector, named the shape embedding (SE), encodes the shape in the local area around the central point p_i. The whole process to obtain the shape embedding of point p_i can be formulated as follows:

    SE(p_i) = MaxPool( G( RASF, Norm( KNN(p_i) ) ) ),    (1)

where KNN(p_i) denotes the K-nearest-neighbor point set of p_i, G denotes the grid_sample function and Norm denotes the normalization. Note that the central point always corresponds to the central feature of RASF due to the normalization. Since RASF is a cubic grid, we use the L1 distance in the KNN search for parameter efficiency. After processing each point in this way, we obtain shape embeddings for all the points. The computation in RASF is negligible compared to the backbone networks, as analyzed in Sec. 5.

A 3D shape in the point cloud representation is denoted by an N×3 matrix, where N is the number of points, so applying RASF to point clouds is natural. During the downstream tasks, the shape-embedding matrix is fed into the backbone network together with the coordinates as auxiliary features.

Suppose a mesh is denoted by three elements: vertexes , edges and faces . Since mesh-based backbone networks receive different elements as units, there are minor adjustments on how the shape embedding is obtained in the RASF. Some backbone networks accept vertex features as input. Since the vertex positions in meshes are irregular (meaning that some vertexes are closer to others, while some are further), it is hard to extract meaningful shapes given only the vertex positions. In this regard, we re-sample denser points on the faces of meshes and combine the re-sampled points along with the vertexes together, as shown in the middle of Fig. 2 (b). Then we feed the combined point set into RASF, to obtain shape embedding of the vertexes . Some backbone networks accept edge features as input, e.g. MeshCNN (hanocka2019meshcnn). For similar reasons, we combine the midpoint of the edges with the re-sampled points to obtain the shape embedding of the edges in a similar way. For those backbone networks that receive face feature as input (feng2019meshnet), we combine the barycenter of faces with the re-sampled points to obtain the shape embedding of faces. Note that the shape embeddings obtained through the above approaches are entirely compatible with any mesh-based backbone networks, owing to the flexibility of RASF.
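The face re-sampling step can be sketched with standard area-weighted barycentric sampling (the function name and sampling count are our own illustration, not necessarily the paper's exact procedure):

```python
import numpy as np

def sample_on_faces(verts, faces, n):
    """Sample n points uniformly on a triangle mesh surface.

    verts: (V, 3) vertex coordinates; faces: (F, 3) vertex indices per triangle.
    Faces are chosen proportionally to their area, then points are drawn with
    barycentric coordinates inside each chosen triangle.
    """
    tri = verts[faces]                                    # (F, 3, 3)
    a, b, c = tri[:, 0], tri[:, 1], tri[:, 2]
    area = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)
    f = np.random.choice(len(faces), n, p=area / area.sum())
    u, v = np.random.rand(n, 1), np.random.rand(n, 1)
    flip = (u + v) > 1                                    # reflect samples outside
    u[flip], v[flip] = 1 - u[flip], 1 - v[flip]           # back into the triangle
    return a[f] + u * (b[f] - a[f]) + v * (c[f] - a[f])   # (n, 3)
```

The re-sampled points are then merged with the vertexes (or edge midpoints, or face barycenters) before being fed into RASF.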

An equivalent representation of volumetric data is that a point lies at each position where the voxel value is 1, while no point exists where the voxel value is 0. We leverage these "virtual" points to fetch voxel features from RASF during the downstream tasks, as shown at the bottom of Fig. 2 (b). Besides, we adjust the receptive field of RASF to a fixed distance instead of K-nearest-neighbors. In this case, voxels that are outside and far from the shape surface (no occupied voxel exists within the receptive field of RASF) yield a zero-vector shape embedding, while voxels that are inside and far from the shape surface (all voxels around them are occupied) yield shape embeddings of the same value. The volumetric data is thus transformed into a C-channel voxel tensor and fed into the backbone network.
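The voxels-to-"virtual points" conversion can be sketched as follows (a minimal illustration; the grid-to-coordinate scaling is our assumption):

```python
import torch

def voxels_to_points(vox):
    """Occupied voxels (value 1) become "virtual" points; empty voxels (value 0) yield none.

    Integer grid indices are mapped to coordinates in [-1, 1]^3.
    """
    idx = vox.nonzero().float()                              # (M, 3) indices of occupied voxels
    extent = torch.tensor(vox.shape, dtype=torch.float32) - 1
    return idx / extent * 2 - 1                              # (M, 3) coordinates

vox = torch.zeros(4, 4, 4)
vox[0, 0, 0] = 1
vox[3, 3, 3] = 1
pts = voxels_to_points(vox)     # two opposite corners of the cube
```

These virtual points are then treated like a point cloud when indexing into RASF, except with a fixed-distance receptive field.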

We provide several schemes to pre-train RASF. The major scheme that we use in most of the experiments is reconstruction. Given the shape embeddings of all points, we randomly sample a subset of embeddings and concatenate them with their coordinates. The coordinates are indispensable, since the shape embeddings only encode the local geometry, without knowledge of the global geometry. The resulting tensor is fed into the reconstruction network, which consists of a shared MLP, a max-pooling layer and several linear layers. The last linear layer outputs 3M elements, which we reshape into an M×3 matrix: the predicted point set, where M is the number of predicted points. The loss function is the chamfer distance between the predicted point set and the ground truth.

The normal-estimation pretext task is to predict the normal of each point. We feed the shape embeddings and coordinates of each point into a shared MLP, which outputs the predicted normals. We use cosine similarity as the loss function.

The supervised scheme is shape classification, given only a small portion of points and their shape embeddings. The overall process is quite similar to that of the reconstruction task, except that the network outputs class probabilities instead of coordinates.

Each of these pre-training schemes pushes RASF to encode the local geometry. We empirically find that the self-supervised pre-training schemes (reconstruction and normal estimation) outperform the supervised one (classification). Self-supervised training makes RASF robust and transferable across different datasets and downstream tasks. For simplicity, we show the experimental results of reconstruction in Sec. 4 and analyze the different pre-training schemes in the ablation study (Sec. 5). In practical implementations, we pre-train RASF on ShapeNetPart (yi2016scalable) and fix its weights in the downstream tasks (all downstream tasks use the same RASF), as empirical study shows that fine-tuning the weights during the downstream tasks leads to unstable performance. The detailed pre-training settings and hyper-parameter analysis are presented in Appendices A.1 and A.6.

We analyze the performance of pretext tasks by visualizing RASF weights in two ways (Fig. 3) and verify RASF using a linear evaluation protocol.

The first visualization directly illustrates the channels in 3D and 2D. The 3D illustrations use visvis (Almar2020) to show the three-dimensional grids, while the 2D illustrations show the max-pooling output along one axis of the grids. We find that each channel focuses on a different area, representing a particular local geometry. The other visualization investigates RASF's response to geometrically-varying shapes. Specifically, we feed RASF a set of semi-ellipsoids with different curvatures, as shown in Fig. 3. It is observed that some of the channels (channels 6, 11, 12, 20, 25, 29, 30) correspond strongly to the ellipsoid curvature, i.e., the response of these channels changes gradually from large curvature to small curvature. On the other hand, some channels have the same response across different curvatures. We conjecture these channels could be related to other geometries, such as cones or cubes. (We randomly choose some channels for demonstration. Visualizations of all channels and detailed settings can be found in appendix A.4.)

Moreover, we verify RASF by a linear classification protocol on point cloud and voxel data. For point clouds, we experiment with three kinds of linear classifiers: 1) max-pooling over the points followed by a fully-connected (FC) layer; 2) a point-wise (shared) FC layer followed by max-pooling and an FC layer; 3) flattening the input to one dimension and using an FC layer as the classifier. For voxel data, we simply flatten the input to one dimension and use an FC layer as the classifier. The results are shown in Table 4.1. The RASF feature outperforms the raw feature in all settings, indicating that it is more discriminative. Especially for the flattened point cloud, the out-of-order raw feature leads to extreme performance degradation, while the RASF feature still yields a high accuracy.

We use five different datasets to evaluate the general effectiveness of RASF, which vary in tasks, characteristics and shape representations. A summarized introduction is listed in Table 2. All settings are identical for each backbone with and without RASF, including training hyper-parameters, train-test splits, and so on.

| Dataset | Train | Test | Representation | Description | Task | Metric |
|---|---|---|---|---|---|---|
| ModelNet10 (wu20153d) | 3,991 | 908 | Point Cloud & Voxel | 3D Objects (Rigid) | Classification | ACC |
| ModelNet40 (wu20153d) | 9,843 | 2,468 | Point Cloud & Voxel | 3D Objects (Rigid) | Classification | ACC |
| ShapeNetPart (yi2016scalable) | 12,137 | 2,874 | Point Cloud | 3D Objects (Rigid) | Part Segmentation | mIOU |
| S3DIS (armeni20163d) | 6-Fold Cross-Val | | Point Cloud | Indoor Scene (Rigid) | Semantic Segmentation | mIOU |
| SHREC10 (lian2011shape) | 300 | 300 | Mesh | 3D Objects (Rigid & Non-Rigid) | Classification | ACC |
| SHREC16 (lian2011shape) | 480 | 120 | Mesh | 3D Objects (Rigid & Non-Rigid) | Classification | ACC |
| HUMAN (maron2017convolutional) | 370 | 18 | Mesh | Human Bodies (Non-Rigid) | Part Segmentation | mIOU |

For the point cloud representation, we conduct experiments on ModelNet40 for classification, ShapeNetPart for part segmentation and S3DIS for semantic scene segmentation. We compare the performance with and without the RASF input on various point cloud backbones. For classification, we experiment on PointNet, PointNet++, KPConv, DGCNN and PCT. For part segmentation, we experiment on PointNet, PointNet++, KPConv and DGCNN. For scene segmentation, we evaluate the performance of RASF with DGCNN. All results are obtained by running the official public code ourselves. For RASF, we simply increase the input channels of each backbone by C (the channel dimension of RASF), while all other hyper-parameters remain at their defaults. Normals are excluded for all backbones. The number of sampled points and the data augmentation techniques for each backbone follow its original setting.

As shown in Table 4.3 and Table 4.3, RASF consistently improves the performance of all the backbone models on different tasks. For PointNet, which lacks local operations, RASF notably improves over the baseline. For the other backbones, which contain various neighbor-based operators, RASF still brings consistent improvements, even though RASF is itself based on local operations. This demonstrates that the effect of the input feature and the capacity of the backbone network can complement each other. In the scene segmentation task, where the point clouds contain richer and more complex local geometry, RASF notably improves the result over DGCNN, demonstrating that RASF remains robust even when the downstream data distribution is thoroughly distinct from the pre-training data. With the additional shape embedding input, networks gain a better understanding of shape semantics than with coordinates alone. RASF is thus able to broadly boost performance under various backbones and tasks with little memory and computation cost.

For the mesh representation, we use MeshCNN to evaluate the effectiveness of RASF. We conduct classification on the SHREC datasets and segmentation on the HUMAN dataset, following the settings in MeshCNN. MeshCNN considers the edge as the unit of a mesh and manually calculates edge features based on the two adjacent triangles. For both classification and segmentation, we train the model with the Adam (kingma2014adam) optimizer, linearly decreasing the learning rate to 0 in the later epochs, following the original MeshCNN schedules. For RASF, we extract shape embeddings of the edges following our proposed method for meshes, concatenate them with the original edge features, and enlarge the channels of the first layer of the model to feed them into the network.

As shown in Table 4.3, RASF notably boosts the performance of MeshCNN over diverse datasets and tasks, reaching a high classification accuracy on SHREC10 and SHREC16. Besides, we observe that the performance increase on meshes is much more significant than on other representations. For one reason, surface sampling on meshes yields a data distribution similar to point clouds, meaning little mismatch between pre-training and downstream training in terms of shape embedding. For another, sampling on the mesh surface actually introduces new geometry compared to using vertexes and edges only. Through RASF, this geometry is encoded in the shape embedding, resulting in a higher performance boost. Moreover, RASF can be integrated into mesh networks seamlessly without sacrificing efficiency in terms of training time or model size.

Note that MeshCNN accepts not only coordinate input but also hand-crafted edge features. Even in this case, RASF is still able to increase the performance of the network. The comparison between the hand-crafted features and RASF is thoroughly discussed in Sec. 5.

We adopt VoxNet to evaluate RASF on voxel data, using ModelNet10 and ModelNet40 for the classification task. The voxel data is given as cubic occupancy grids. We train the model with the Adam optimizer at an initial learning rate of 0.001, multiplying the learning rate by 0.5 every 20 epochs. We obtain the shape embedding of each voxel from RASF following the proposed implementation for voxels, then feed the shape embeddings into VoxNet by changing the input channels of its first layer. The best test-set accuracy over all epochs is reported.

RASF boosts the performance of VoxNet on both datasets (Table 4.3). Note that RASF is pre-trained on point clouds, in which the scale is different from voxels. What’s more, point clouds only exist on the surface of the objects, while voxels are solid, which indicates they have entirely different distributions. Even in this case, RASF could still increase the performance by aggregating local geometry. It enriches the shape semantics of voxel representation by transferring the learned geometry to distinct data distributions, demonstrating the robustness of RASF.

The adoption of a learnable grid in the shape embedding layer is motivated by the embedding layer in NLP. In this part we investigate how a learnable grid is superior to other architectural choices, including a PointNet module (consisting of a point-wise MLP, a max-pooling layer and a global MLP) and a simplified PointNet-like module (consisting of a single fully-connected layer and a max-pooling layer). Besides, we include the EdgeConv module from DGCNN, which has proven to be an effective neighborhood-based point cloud operator. We pre-train these three modules using the same reconstruction pretext task described in Sec. 3. As the results in Table 5.1 show, RASF performs better in each downstream task. EdgeConv yields comparable performance on point clouds, yet it deteriorates rapidly on mesh data. We argue that RASF is the better choice with respect to generalizability.

We compare the performance of the three pre-training schemes, including reconstruction, normal estimation and classification. We experiment on ModelNet40 for point clouds, SHREC10 for meshes and ModelNet10 for voxels. We use PointNet, MeshCNN and VoxNet as the backbones for point clouds, meshes and voxels respectively. As shown in the middle bar of Table 5.1, all the pre-training schemes improve the performance over baseline. RASF obtained from self-supervised pre-training consistently outperforms the supervised one. We argue that self-supervised pre-training enables RASF to learn more general and robust embeddings that are more related to the local geometry while less related to high-level semantics.

We evaluate the effect of the pre-trained RASF weights by comparing fixed, fine-tuned, and random-initialized RASF in the downstream tasks. Fixed RASF is the setting we use in all the experiments. Fine-tuned RASF refers to optimizing the pre-trained RASF together with the backbone network in the training of the downstream tasks, while random-initialized refers to a non-pre-trained RASF that is optimized within the downstream tasks. The detailed settings are the same as described in the previous paragraph. As shown in the bottom bar of Table 5.1, random-initialized RASFs are consistently worse than fixed RASFs by a large margin. It proves that RASF considered as a feature layer would not work without pre-training. Besides, fine-tuned RASFs yield unstable results compared to fixed RASFs. This phenomenon demonstrates the pre-trained RASF generalizes well in downstream tasks.

We analyze the complexity of RASF. The most computationally intense step in RASF is the K-nearest-neighbors search, whose naive time complexity is O(N²) for N points, while the trilinear interpolation costs O(NKC). The number of parameters in RASF is R³·C. As RASF can be kept fixed in the downstream tasks, it introduces no additional memory cost apart from its parameters. We evaluate the actual running time of RASF and multiple backbones with a fixed batch size (Table 5.1). Note that our current implementation of K-nearest-neighbors is a naive PyTorch version; the actual time cost is expected to be lower with a more sophisticated KNN implementation, e.g., Faiss (JDH17). All results are measured on an RTX 2080 Ti.
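To give a feel for the R³·C parameter count, a quick calculation shows it stays tiny compared to typical 3D backbones (the values of R and C below are illustrative, not the paper's exact settings):

```python
# RASF stores a single learnable R x R x R x C grid and nothing else,
# so its parameter count is simply R^3 * C.
R, C = 16, 32                  # illustrative resolution and channel dimension
rasf_params = R ** 3 * C       # 4096 * 32
assert rasf_params == 131072   # ~0.13M parameters
```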

In point cloud analysis, point normals are widely used as auxiliary features to improve performance. In mesh analysis, MeshCNN proposed to use geometric statistics around the edges as auxiliary features, namely the dihedral angle, two inner angles and two edge-length ratios for each face. In this work, we learn local geometry by optimizing the RASF grid in pretext tasks. We compare shape embeddings from RASF with these two kinds of hand-crafted features. For the point cloud baseline, we feed only the point coordinates as input, using a PointNet backbone. For the mesh baseline, we remove the hand-crafted features on edges and replace them with the edge midpoints, representing the positions of the edges. We experiment on SHREC10 for meshes, and on ModelNet40 (classification) and ShapeNetPart (segmentation) for point clouds.

Backbones with RASF consistently outperform the baselines by large margins (Table 5.1). Hand-crafted features also improve performance, but not as much as RASF, indicating that RASF provides richer geometric information. We attribute this to the reconstruction pre-training, which requires RASF to encode comprehensive local geometric information. We also notice that combining RASF with the hand-crafted features can further increase performance, demonstrating that the effect of RASF is somewhat orthogonal to these hand-crafted features. RASF brings additional notable improvements over the existing methods.

We propose Representation-Agnostic Shape Fields (RASF), a generalizable and computation-efficient shape embedding layer for 3D deep learning. Shape embeddings for various 3D shape representations (point clouds, meshes and voxels) are retrieved by coordinate indexing. We provide two effective schemes for RASF pre-training, namely shape reconstruction and normal estimation, which enable RASF to learn robust and general shape embeddings. Once trained, RASF can be plugged into any 3D neural network with negligible cost, broadly boosting performance across 3D representations, neural backbones and applications.

This work was supported by National Science Foundation of China (U20B2072, 61976137).

To demonstrate the general effectiveness of RASF, we use the same RASF weights in all the experiments below. We pre-train RASF on a large point cloud dataset, ShapeNetPart (yi2016scalable), which includes 12,137 training samples and 2,874 testing samples. We randomly sample points from each object as the input shape to RASF. During pre-training on ShapeNetPart, the number of neighbors K changes adaptively with the total number of points in a sample, so as to preserve a relatively stable scale in RASF across different shapes. In the reconstruction pretext task, we sample a subset of rows of the shape embeddings to feed into the reconstruction network. The choice of the RASF hyper-parameters (the resolution R, the channel dimension C and the number of neighbors K) is analyzed in Sec. A.6.

The training uses the Adam optimizer with a step-decayed learning rate and runs until the chamfer distance on the test set converges. At the end, it is hard to tell the difference between a reconstructed shape and the ground truth with human eyes.

When RASF is used in downstream tasks, it is applied only before the first layer of the backbone network, making it very simple to implement. Suppose RASF embeddings have C channels and the original backbone receives input with C_in channels. We simply enlarge the first layer to accept C_in + C channels to implement backbones with RASF. All other settings are identical between backbones with and without RASF.
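The channel enlargement can be sketched as follows (the layer shapes are illustrative; only the first layer changes, everything after it is unchanged):

```python
import torch
import torch.nn as nn

c_in, c_rasf = 3, 32                      # raw coordinate channels + RASF embedding channels
# Original backbone first layer would be nn.Conv1d(c_in, 64, 1);
# with RASF, only its input width grows.
first = nn.Conv1d(c_in + c_rasf, 64, kernel_size=1)

coords = torch.randn(8, c_in, 1024)       # (batch, channels, num_points)
se = torch.randn(8, c_rasf, 1024)         # shape embeddings from RASF
out = first(torch.cat([coords, se], dim=1))
assert out.shape == (8, 64, 1024)
```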

We illustrate the shapes reconstructed during the reconstruction task for RASF pre-training in Fig. 4. The predicted shapes are obtained on the test set after convergence. As shown in the figure, the reconstructed shapes and the ground truths are hardly distinguishable by eye, demonstrating that RASF has learned local geometry well enough for favourable reconstruction performance.

We design two approaches to visualize the pre-trained RASF. One is to directly illustrate the weights. The other is to feed RASF a set of semi-ellipsoids with different curvatures to explore how RASF responds to geometrically-varying shapes.

We visualize each channel of the pre-trained RASF. First, we visualize the three-dimensional grid using visvis, a Python library (Almar2020). In addition, we reduce the three-dimensional grid of each channel to two dimensions by max-pooling along the x-axis (the other two axes show a similar phenomenon). The max-pooling here corresponds to the max-pooling RASF uses to obtain shape embeddings. As illustrated in Fig. 6, each channel focuses on a different region, representing a particular local shape. Moreover, some channels show a characteristic of symmetry, which might indicate that they represent symmetrical local shapes.
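The axis-wise reduction is a few lines of code. `project_channel` is a hypothetical helper written for this sketch, not a function from the released code:

```python
def project_channel(channel):
    """Reduce one [R][R][R] RASF channel to a 2-D [R][R] map by
    max-pooling along the x-axis, mirroring the channel-wise max-pool
    used when retrieving shape embeddings."""
    R = len(channel)
    return [[max(channel[x][y][z] for x in range(R))  # collapse x
             for z in range(R)]
            for y in range(R)]
```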

We explore how RASF responds to geometrically-varying shapes with a proof-of-concept experiment. We generate a set of semi-ellipsoids with diverse curvatures by varying the radius lengths along the three axes. A portion of the generated semi-ellipsoids is illustrated in Fig. 7 (left). A semi-ellipsoid can be parameterized by spherical coordinates. Assuming the semi-ellipsoid axes coincide with the coordinate axes, we have:

$x = a \sin\theta \cos\varphi,$ (2)

$y = b \sin\theta \sin\varphi,$ (3)

$z = c \cos\theta,$ (4)

where $a$, $b$ and $c$ denote the radius lengths along the x-axis, y-axis and z-axis, with $\theta \in [0, \pi/2]$ and $\varphi \in [0, 2\pi)$. To vary the radius length along the x-axis, we fix the other two radii at 1 and increase $a$ from 0.1 to 2 in steps of 0.1, and likewise for the y-axis and z-axis. In this way, we generate three groups of semi-ellipsoids, each containing 20 semi-ellipsoids of different shapes for one axis. We feed these shapes into the pre-trained RASF and obtain their shape embeddings. We treat the peak of each semi-ellipsoid, marked red in Fig. 7 (left), as the central point of RASF, while the other points are considered the local shape of the peak. Finally, we obtain 20 shape embedding vectors for the 20 shapes in each group. We illustrate the shape embeddings by arranging them in order in rows, as shown in Fig. 7 (right).
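Under this parameterization, one group of semi-ellipsoids can be generated as follows. The sampling density (`n_theta`, `n_phi`) and function names are our own choices for this sketch:

```python
import math

def semi_ellipsoid(a, b, c, n_theta=16, n_phi=32):
    """Sample points on the upper half of an ellipsoid with radii (a, b, c):
        x = a sin(theta) cos(phi), y = b sin(theta) sin(phi), z = c cos(theta)
    with theta in [0, pi/2] and phi in [0, 2*pi).
    The peak (theta = 0) serves as the RASF center point."""
    pts = []
    for i in range(n_theta + 1):
        theta = (math.pi / 2) * i / n_theta
        for j in range(n_phi):
            phi = 2 * math.pi * j / n_phi
            pts.append((a * math.sin(theta) * math.cos(phi),
                        b * math.sin(theta) * math.sin(phi),
                        c * math.cos(theta)))
    return pts

# one group: vary a from 0.1 to 2.0 in steps of 0.1, with b = c = 1
group_x = [semi_ellipsoid(0.1 * (k + 1), 1.0, 1.0) for k in range(20)]
```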

We observe that some channels (channels 6, 11, 12, 20, 25, 29 and 30) correspond strongly to the ellipsoid curvature: their responses change gradually from large-curvature shapes to small-curvature shapes. On the other hand, some channels give the same response for shapes of different curvatures. We conjecture that these channels relate to other geometries, such as cones or cubes.

The ModelNet datasets were introduced by Wu et al. (wu20153d) for 3D object classification. ModelNet40 includes 40 categories of 3D rigid objects. ModelNet10 is a subset of ModelNet40. Point clouds (qi2017pointnet) and volumetric representations (wu20153d) are available for this dataset.

ShapeNetPart (yi2016scalable) is a point cloud dataset for benchmarking 3D shape segmentation. It contains shapes from categories. Each point in a point cloud is annotated with one of labels; most categories are labeled with two to five parts. The sizes of the train-set and test-set are and , respectively.

The Stanford Large-Scale 3D Indoor Spaces dataset (S3DIS) (armeni20163d) targets semantic indoor scene segmentation, containing indoor areas with rooms. Each point is annotated with one of the categories (e.g., board, chair, ceiling), plus clutter. We conduct cross-validation across the areas, following the same protocol as prior works (armeni20163d; qi2017pointnet; wang2019dynamic).

SHREC (lian2011shape) is a 30-class dataset for mesh classification, with 20 examples per class. The categories include rigid objects such as lamps, as well as non-rigid objects such as aliens. We follow the protocol in (ezuz2017gwcnn), which generates two kinds of train-test splits. The first randomly samples 10 of the 20 examples per class for training, forming SHREC10 with a 50%-50% train-test split. The other randomly samples 16 of the 20 for training, named SHREC16. Since (ezuz2017gwcnn) did not release their train-test split, we conduct the random split ourselves.

The HUMAN dataset was proposed by Maron et al. (maron2017convolutional) for mesh segmentation. The train-set includes models from SCAPE (anguelov2005scape), FAUST (bogo2014faust), MIT (vlasic2008articulated) and Adobe Fuse (adobe20163d), while the test-set includes models from the SHREC07 (giorgi2007shape) humans dataset. The models are manually annotated with 8 labels based on (kalogerakis2010learning).

We analyze the sensitivity of the hyper-parameters in RASF. There are four: the resolution , the number of channels , the number of neighbor points used to obtain a shape embedding, and the number of shape embeddings fed into the reconstruction network. We change one hyper-parameter at a time, pre-train RASF with the resulting setting, and adopt it in the downstream task. When we decrease the resolution, channels, and neighbor points, we increase the number of sampled shape embeddings at the same time, since the reconstruction network needs more input shape embeddings when RASF models smaller shapes. We adopt PointNet (qi2017pointnet) on ModelNet40 (wu20153d) for demonstration. The results are shown in Fig. 5: the RASF grid needs a proper size (resolution and number of channels) to achieve the best performance, and notably, increasing the size hurts more than decreasing it.