SVT-Net: A Super Light-Weight Network for Large Scale Place Recognition using Sparse Voxel Transformers

by   Zhaoxin Fan, et al.
Nanjing University
Tsinghua University

Point cloud-based large scale place recognition is fundamental for many applications such as Simultaneous Localization and Mapping (SLAM). Although previous methods have achieved good performance by learning short range local features, long range contextual properties have long been neglected, and model size has become a bottleneck for wider deployment. In this paper, we propose SVT-Net, a super light-weight network for large scale place recognition. Building on top of the high-efficiency 3D Sparse Convolution (SP-Conv), an Atom-based Sparse Voxel Transformer (ASVT) and a Cluster-based Sparse Voxel Transformer (CSVT) are proposed to learn both short range local features and long range contextual features. Consisting of ASVT and CSVT, SVT-Net achieves state-of-the-art performance in terms of both accuracy and speed with a super-light model size (0.9M). Two simplified versions of SVT-Net, named ASVT-Net and CSVT-Net, are also introduced; they also achieve state-of-the-art performance while further reducing the model size to 0.4M and 0.8M respectively.




1 Introduction

Figure 1: (Top) Pipeline of point cloud based place recognition: a database of point clouds is stored as reference sub-maps for each location. During inference, we retrieve the closest match to the query scan in the database using embedding features (descriptors) to get the location of the query scan. (Bottom) Compared with other SOTA methods, our proposed models perform the best in terms of both accuracy and model size.

Large scale place recognition and localization is fundamental for a wide range of applications such as Simultaneous Localization and Mapping (SLAM) [27, 28], autonomous driving [21, 11], and robot navigation [29, 33]. For example, the place recognition result is often used as a signal for loop closure [4] in SLAM systems when the GPS signal is not available. A line of works [22, 13, 43] uses images for place recognition and has shown promising performance. However, images are sensitive to illumination, weather change, diurnal variation, etc., making models based on them unstable and unreliable. Besides, because they lack depth information, image based methods struggle to fully understand the scene and are easily fooled by planar puzzles.

To tackle these challenges, a feasible solution is to replace images with point clouds collected by LiDAR. Scenes represented by point clouds are inherently invariant to illumination and weather changes, and they also contain accurate and detailed 3D information. Recently, a range of deep learning models [36, 44, 35, 24, 8, 41, 20] that utilize point clouds for place recognition have been proposed. Figure 1 (top) illustrates the common pipeline of these point cloud based place recognition methods. For a large scale region, a database of LiDAR scans tagged with UTM coordinates acquired from GPS/INS readings is constructed in advance. When a new query scan is collected by the LiDAR, the point cloud most similar to the query scan is retrieved from the database to determine the location of the query scan.
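The retrieval step of this pipeline can be sketched in a few lines of Python. This is a minimal illustration, assuming descriptors are plain lists of floats and the database is a list of (descriptor, UTM coordinate) pairs; the function names, the data layout and the toy coordinates are hypothetical and are not taken from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_desc, database, k=1):
    """Return the k database entries whose descriptors are closest to the query.

    `database` is a list of (descriptor, utm_coordinate) pairs. This layout
    is an assumption made for the example.
    """
    ranked = sorted(database, key=lambda item: euclidean(query_desc, item[0]))
    return ranked[:k]

# Toy database: 2-D descriptors tagged with made-up UTM coordinates.
database = [
    ([0.0, 1.0], (620100.0, 5735200.0)),
    ([1.0, 0.0], (620450.0, 5735900.0)),
    ([0.9, 0.1], (620460.0, 5735910.0)),
]

# The location of the query is taken from the best-matching submap.
best_desc, best_utm = retrieve([1.0, 0.05], database, k=1)[0]
print(best_utm)
```

A linear scan is used here only for clarity; an indexed nearest-neighbor structure would serve the same purpose on a large database.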

To achieve accurate retrieval, a powerful deep learning model is needed to learn discriminative global descriptors (embeddings) of point clouds. In fact, learning powerful scene descriptors is the key to the recognition task. However, we observe that, when learning global descriptors, most previous methods only consider how to better extract short range local features, while the equally important long range contextual properties have long been neglected. We argue that without awareness of long range contextual properties, the power of the final descriptors is greatly limited. Besides, we also notice that model size has been a bottleneck for further performance improvement and practical deployment. More specifically, in most SLAM or robot navigation systems, the available memory is tight. The smaller the model, the easier it is to deploy the place recognition algorithm on more hardware products and to serve more scenarios. Therefore, designing light-weight descriptor learning models with small size and fast running time is necessary.

Motivated by the above observations, we propose a novel super light-weight network named SVT-Net. In our work, the point clouds are first voxelized into sparse voxel representations to better characterize the structured information of the scene. Then, we choose the light-weight 3D Sparse Convolution (SP-Conv) [5] as our basic unit to extract local features, owing to its flexibility and powerful local feature learning ability. However, simply stacking SP-Conv layers may ignore long range contextual properties. Therefore, inspired by recently proposed Vision Transformer networks [7, 40], we propose two kinds of Sparse Voxel Transformers (SVTs), named Atom-based Sparse Voxel Transformer (ASVT) and Cluster-based Sparse Voxel Transformer (CSVT), on top of SP-Conv layers. ASVT and CSVT extract long range contextual features implicit in the sparse voxel representation from two perspectives: attending on different key atoms and clustering different key regions in the feature space. Interacting different atoms and different clusters respectively helps to obtain more discriminative descriptors. Since SP-Conv only conducts the convolution operation on non-empty voxels, it is efficient and flexible, and so are the two SVTs built upon it. Thanks to the strong capabilities of the two SVTs, our model can learn sufficiently powerful descriptors from an extremely shallow network architecture. Therefore, the model size of SVT-Net is very small, as shown in Figure 1 (bottom). Experimental results show that, though small, SVT-Net achieves state-of-the-art performance in terms of both accuracy and speed on the Oxford RobotCar dataset [26] and three in-house datasets [36]. Moreover, to further increase speed and reduce model size, we propose two simplified versions of SVT-Net, ASVT-Net and CSVT-Net, which also achieve state-of-the-art performance with model sizes of only 0.4M and 0.8M respectively. Our contributions can be summarized as:

  • We propose a novel light-weight point cloud based place recognition model named SVT-Net, as well as two simplified versions, ASVT-Net and CSVT-Net, which all achieve state-of-the-art performance in terms of both accuracy and speed with an extremely small model size.

  • We propose the Atom-based Sparse Voxel Transformer (ASVT) and the Cluster-based Sparse Voxel Transformer (CSVT) for learning long range contextual features hidden in point clouds. To the best of our knowledge, we are the first to propose Transformers for sparse voxel representations.

  • We have conducted extensive quantitative and qualitative experiments to verify the effectiveness and efficiency of our proposed models and to analyse what the two proposed Transformers actually learn.

2 Related Work

Large scale place recognition involves a wide range of technologies and fields. In this section, we briefly introduce two kinds of closely related works. Specifically, we first introduce some studies about image based or point cloud based place recognition. Then, we simply review some recently proposed Vision Transformer Networks.

2.1 Large scale place recognition

Large scale place recognition has long been of interest to researchers. In early years, hand-crafted features such as SIFT [25], SURF [2] and ORB [34] extracted from images were commonly used for this task [10, 9, 18]. Though simple, hand-crafted features have limited representational power. Therefore, with the development of deep learning, learned features have become more popular. A typical deep learning method for place recognition is NetVLAD [1], which learns global descriptors by clustering CNN features into several different visual words. A variety of follow-up works [43, 15] improve it and have achieved promising results. However, though much more powerful than hand-crafted features, learned image features still suffer from their sensitivity to illumination, weather change, diurnal variation, etc.

Recently, utilizing point clouds for large scale place recognition has attracted much attention owing to the robustness of point clouds to environmental changes. PointNetVLAD [36] is a pioneering work. It first uses PointNet [31] and NetVLAD [1] to learn global descriptors; a K-Nearest-Neighbors (KNN) algorithm is then used for retrieval and recognition. Zhang and Xiao [44] propose a context-aware attention mechanism to help the model learn stronger local features in their model PCAN. DAGC [35], SR-Net [8] and LPD-Net [24] all use Graph Convolutional Networks (GCNs) to better capture local features. More recently, SOE-Net [41] introduces a PointOE module that encodes local features from eight orientations to improve place recognition performance. The above mentioned methods all learn point cloud features from unstructured point-wise representations, which requires a huge model with a large number of parameters to learn reliable features. In contrast, Minkloc3D [20] uses sparse voxel representations as input and builds a simple Feature Pyramid Network (FPN) like architecture for learning point cloud descriptors, and it is the current state of the art. Minkloc3D has significantly reduced the model size. However, it neglects the importance of long range contextual properties hidden in the point cloud. In our work, we also use sparse voxel representations as input, but we further propose two Sparse Voxel Transformers (SVTs) to learn these long range contextual properties. Thanks to the strong capability of the two SVTs, the model size of our work is further reduced.

2.2 Vision transformers

The Transformer [38] was originally proposed for natural language processing (NLP) tasks [6, 16, 42, 19, 3]. The self-attention mechanism is at its core, owing to its ability to capture long range contextual information. At present, the Transformer has become the most important basic module in the NLP field. Inspired by this success, researchers have begun to ask whether the self-attention mechanism can also play a role in computer vision.

The Vision Transformer (ViT) [7] was proposed in response. It adopts the idea of self-attention and divides images into 16x16 words, so that images can be processed like natural language. A variety of follow-up works [40, 39, 23] improve it. For example, Wu et al. [40] propose the Visual Transformer (VT), which projects image features into tokens and processes these tokens with the classic Transformer [38], greatly reducing computational cost. PVT [39] introduces an FPN-like structure to better cope with dense prediction tasks. Swin-Transformer [23] presents a hierarchical architecture in which self-attention is limited to non-overlapping local windows to achieve higher efficiency. More recently, Jiang et al. [17] successfully employ vision Transformers in GANs. For a more comprehensive introduction to Vision Transformers on 2D images, we refer readers to [14]. The vision Transformers introduced so far are all designed for processing images. For point clouds, there are only a few Transformer-based works [45, 12], which means that Transformers for 3D vision are still under-explored. In this paper, we propose two kinds of Transformers that can be used for processing sparse voxel representations of point clouds. To the best of our knowledge, this is the first work designing Sparse Voxel Transformers in the literature.

3 Methodology

3.1 Problem statement

Let $\mathcal{M}$ be a database of pre-defined 3D submaps (represented as point clouds), and $Q$ be a query scan. The place recognition problem is defined as retrieving the submap $M_s$ from $\mathcal{M}$ that is closest to $Q$. To achieve accurate retrieval, we have to design a deep learning model that embeds all point clouds into discriminative global descriptors, e.g. $Q \rightarrow f_Q \in \mathbb{R}^{d}$, so that a subsequent K-Nearest-Neighbors (KNN) search can be used to find $M_s$. To employ 3D Sparse Convolution [5], we first voxelize all point clouds into sparse voxel representations, e.g. $Q \rightarrow V_Q$, where a voxel takes the value 1 if it is occupied by any point of $Q$ (a non-empty voxel) and 0 otherwise (an empty voxel). The 3D Sparse Convolution operation is performed only on the non-empty voxels. Hence, it is very efficient and flexible.
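The voxelization described above can be illustrated with a small sketch that keeps only the occupied voxels. It assumes points are (x, y, z) tuples and that quantization is a plain floor division by the step size; the function name and the toy values are illustrative, and the actual implementation in a sparse convolution library may differ.

```python
import math

def voxelize(points, step):
    """Quantize 3D points and return the set of occupied (non-empty) voxels.

    Each voxel is identified by its integer grid index. Empty voxels are
    simply absent from the set, which is what makes the representation sparse.
    """
    occupied = set()
    for x, y, z in points:
        index = (math.floor(x / step), math.floor(y / step), math.floor(z / step))
        occupied.add(index)
    return occupied

# Three toy points; the first two fall into the same voxel.
points = [(0.1, 0.2, 0.3), (0.4, 0.1, 0.2), (1.2, -0.3, 0.0)]
voxels = voxelize(points, step=0.5)
print(sorted(voxels))
```

Storing only the occupied indices means that later operations scale with the number of non-empty voxels rather than with the volume of the grid.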

Next, we first introduce the Atom-based Sparse Voxel Transformer (ASVT) and the Cluster-based Sparse Voxel Transformer (CSVT). Then, the overall network architecture of SVT-Net and the architectures of its two simplified versions (ASVT-Net and CSVT-Net) are introduced in detail. Finally, the loss function is presented.

Figure 2: The network architecture of ASVT.

3.2 Atom-based sparse voxel transformer

As mentioned before, simply stacking SP-Convs can only learn local information from nearby voxels. To capture the long range contextual properties hidden in the point cloud, we design ASVT, which adopts the idea of self-attention to aggregate information from both nearby and far-away voxels. In ASVT, we define each individual voxel as an atom. During processing, each atom interacts with all other atoms according to learned per-atom contributions. In this way, key atoms can be attended to by other atoms, so that both the local relationships of nearby atoms and the long range contextual relationships of far-away atoms are learned. Learning such long range contextual relationships is very important for the model. For example, assume a scene contains two atoms that belong to different instances of the same category. If only SP-Conv is used, the "same-category" information may be ignored due to the small receptive field. If ASVT is added to learn this kind of information, the model can better encode what the scene describes, and the final global descriptor becomes more powerful.

The architecture of ASVT is illustrated in Figure 2.

Let $X_{in}$ be the input sparse voxel features with $C$ channels learned by sparse convolutions (SP-voxel features for simplicity). We first learn the sparse voxel values (SP-values for simplicity) $X_v$, the SP-queries $X_q$, and the SP-keys $X_k$ through three different SP-Convs respectively:

$$X_v = \mathrm{SPConv}_v(X_{in}), \quad X_q = \mathrm{SPConv}_q(X_{in}), \quad X_k = \mathrm{SPConv}_k(X_{in}),$$

where $X_v$ has $C$ channels and $X_q$, $X_k$ have $C'$ channels. We often set $C' < C$ to reduce the computational cost of later steps; that is, the dimension of the SP-queries and SP-keys is reduced from $C$ to $C'$ for efficiency. After that, the SP-values (SP-queries/keys) are rearranged into a tensor of size $N \times C$ ($N \times C'$), where $N$ is the number of non-empty voxels.

Then, we use $X_q$ and $X_k$ to calculate the SP-attention map $S \in \mathbb{R}^{N \times N}$:

$$S = \mathrm{softmax}(X_q X_k^{\top}),$$

where $S$ encodes the contribution relationship of each atom with all the other atoms. In the following attending operation, these relationships contribute to aggregating both short range local information and long range contextual information by interacting atoms. The attending operation can be summarized as:

$$X_a = S X_v,$$

where $X_a \in \mathbb{R}^{N \times C}$ is called the atom-attended SP-voxel features. In $X_a$, the features of each atom have accepted contributions from all the other atoms. Thus it can encode meaningful contextual information to describe the scene.

Finally, we rearrange $X_a$ back to a sparse voxel representation with the same dimension as $X_{in}$ and regard it as a residual term. The final ASVT feature $F_{ASVT}$ is defined as the sum of $X_{in}$ and $X_a$:

$$F_{ASVT} = X_{in} + X_a.$$
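The atom-level attention described in this subsection can be sketched in plain Python on the rearranged list of non-empty voxels. This is a simplified illustration under stated assumptions: ordinary weight matrices stand in for the three SP-Convs, the attention map is normalized with a row-wise softmax, and all names (`atom_attention`, `w_q`, `w_k`, `w_v`) are hypothetical rather than taken from the authors' code.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def atom_attention(x_in, w_q, w_k, w_v):
    """Self-attention over N atoms (non-empty voxels) plus a residual sum.

    x_in: N x C features; w_q, w_k: C x C' with C' < C; w_v: C x C.
    The matrices replace the SP-Convs of the text (an assumption).
    """
    queries = matmul(x_in, w_q)                              # N x C'
    keys = matmul(x_in, w_k)                                 # N x C'
    values = matmul(x_in, w_v)                               # N x C
    attn = softmax_rows(matmul(queries, transpose(keys)))    # N x N
    attended = matmul(attn, values)                          # N x C
    out = [[a + b for a, b in zip(r_in, r_att)] for r_in, r_att in zip(x_in, attended)]
    return out, attn

# Toy example: N = 3 atoms, C = 2 channels, C' = 1.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0], [0.0]]
w_k = [[1.0], [0.0]]
w_v = [[1.0, 0.0], [0.0, 1.0]]
features, attention = atom_attention(x, w_q, w_k, w_v)
```

Because the attention map has one row and one column per non-empty voxel, its cost grows with the square of the number of atoms, which is one reason to keep the query and key dimension small.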

Figure 3: The network architecture of CSVT.

3.3 Cluster-based sparse voxel transformer

Another observation is that, in the sparse voxel representation, some atoms share the same characteristics. For example, atoms representing walls tend to form a plane, while atoms representing columns tend to form a cylinder-like structure. This means that atoms can be grouped into clusters according to their characteristics, and long range contextual properties can also be extracted from the interaction between these clusters. Motivated by this intuition, we propose CSVT, which is illustrated in Figure 3. As shown in the figure, CSVT consists of three components: a Tokenizer module, a Transformer module and a Projector module.

The Tokenizer module is used to transform the input SP-voxel features into tokens, where each token represents a cluster in the latent space. We again define $X_{in}$ as the initial SP-voxel features, with $C$ channels and $N$ non-empty voxels. To achieve the goals of the tokenizer, we first use an SP-Conv operation followed by a rearrange operation to generate a grouping map $A \in \mathbb{R}^{N \times L}$:

$$A = \phi(\mathrm{SPConv}(X_{in})),$$

where $\phi$ is the rearrange operation and $L$ is the number of tokens we choose to generate. After softmax normalization, $A$ stores the probabilities of each voxel belonging to each token. Therefore, we can use $A$ to capture the representations of the tokens, which amounts to grouping the voxels into different clusters:

$$T = \mathrm{softmax}(A)^{\top} \phi(X_{in}),$$

where $T \in \mathbb{R}^{L \times C}$ denotes the representations of the $L$ tokens, each described by $C$ features.

A Transformer module is then used to learn long range properties among different clusters through the interaction of these tokens. First, we generate values, keys, and queries using shared convolutional kernels:

$$T_v, T_k, T_q = \mathrm{Conv}(T).$$

Then, the tokens interact with each other through the following attention operation:

$$T' = \mathrm{softmax}(T_q T_k^{\top})\, T_v,$$

where $T'$ is the attended tokens.

The Projector module is then used to project the token features back to the sparse voxel representation. Specifically, we use $X_{in}$ and $T'$ to calculate a projection map $P \in \mathbb{R}^{N \times L}$:

$$P = \mathrm{softmax}\big(\phi(\mathrm{SPConv}(X_{in}))\, \mathrm{Conv}(T')^{\top}\big).$$

Then, the projection operation is defined as:

$$X_c = P\, T'.$$

Again, we rearrange $X_c$ back to a sparse voxel representation with the same dimension as $X_{in}$ and regard it as a residual term. The final CSVT feature $F_{CSVT}$ is defined as:

$$F_{CSVT} = X_{in} + X_c.$$
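The three CSVT stages (Tokenizer, Transformer, Projector) can be sketched on a toy list of non-empty voxels. This is a deliberately simplified illustration and not the authors' implementation: a single weight matrix `w_group` stands in for the SP-Conv that produces the grouping map, the softmax over tokens is an assumed normalization, and the learned projections inside the Transformer and Projector are replaced by identities. All names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cluster_attention(x_in, w_group):
    """Tokenize N voxels into L clusters, let the clusters interact, project back.

    x_in: N x C voxel features; w_group: C x L grouping weights (assumed).
    """
    # Tokenizer: soft membership of every voxel in every token, then pooling.
    membership = softmax_rows(matmul(x_in, w_group))         # N x L
    tokens = matmul(transpose(membership), x_in)              # L x C
    # Transformer: attention among the L tokens (identity projections assumed).
    token_attn = softmax_rows(matmul(tokens, transpose(tokens)))  # L x L
    attended = matmul(token_attn, tokens)                     # L x C
    # Projector: distribute token features back to voxels, add as a residual.
    projection = softmax_rows(matmul(x_in, transpose(attended)))  # N x L
    projected = matmul(projection, attended)                  # N x C
    out = [[a + b for a, b in zip(r, p)] for r, p in zip(x_in, projected)]
    return out, membership

# Toy example: N = 4 voxels, C = 2 channels, L = 2 tokens.
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
w_group = [[5.0, -5.0], [-5.0, 5.0]]
features, membership = cluster_attention(x, w_group)
```

Since attention is computed among a handful of tokens instead of among all voxels, this branch stays cheap even when the number of non-empty voxels is large.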

Figure 4: Pipeline of our proposed model.

3.4 Network architecture

The overall architecture of SVT-Net is built upon the ASVT and CSVT introduced above as well as the 3D Sparse Convolution (SP-Conv). Specifically, as shown in Figure 4, the initial sparse voxel representation is first fed into an initial SP-Conv layer with an output dimension of 32 to learn initial sparse voxel features. Then an SP-Res-Block consisting of two SP-Conv layers with a skip connection is used to enhance the learned features and increase the feature dimension to 64. Next, another SP-Conv layer is used to increase the feature dimension to the final descriptor's dimension $d$. After that, the SP-voxel features are fed into two branches that learn the ASVT feature and the CSVT feature using the two proposed Sparse Voxel Transformers (SVTs) respectively. The learned ASVT feature and CSVT feature are then fused by direct addition. Finally, the global descriptor is calculated using a GeM pooling operation [32, 20].

Thanks to the strong power of ASVT and CSVT, our proposed model SVT-Net achieves superior performance compared to previous methods even though its network architecture is simple and small.

Although we use both ASVT and CSVT in SVT-Net, it is also possible to use them separately in different networks. Therefore, we propose two simplified versions of SVT-Net: ASVT-Net, which keeps only the ASVT branch (CSVT is eliminated), and CSVT-Net, which keeps only the CSVT branch (ASVT is eliminated). According to our experimental results, both ASVT-Net and CSVT-Net also achieve state-of-the-art performance while further reducing the model size.
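The last two steps of the pipeline, fusing the two branch outputs by addition and pooling them into one descriptor, can be sketched as follows. GeM (generalized mean) pooling takes, per channel, the p-th root of the mean of the p-th powers over all non-empty voxels. The sketch assumes a fixed exponent `p=3.0` and a small clamp `eps`, which are common defaults rather than values reported in this paper; the function names are illustrative.

```python
def fuse_add(asvt_features, csvt_features):
    """Fuse two N x C feature lists by element-wise addition."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(asvt_features, csvt_features)]

def gem_pool(features, p=3.0, eps=1e-6):
    """Generalized-mean pooling over N voxels, one value per channel.

    p = 1 gives average pooling; large p approaches max pooling.
    Values are clamped to eps so that the power is well defined.
    """
    n = len(features)
    return [
        (sum(max(v, eps) ** p for v in channel) / n) ** (1.0 / p)
        for channel in zip(*features)
    ]

# Toy example: N = 2 voxels, C = 2 channels per branch.
asvt = [[0.5, 1.0], [1.5, 2.0]]
csvt = [[0.5, 1.0], [1.5, 2.0]]
fused = fuse_add(asvt, csvt)          # [[1.0, 2.0], [3.0, 4.0]]
descriptor = gem_pool(fused)          # one value per channel
```

The output length equals the channel count regardless of how many voxels the scan contains, which is what allows scans of different sparsity to be compared with a single distance.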

3.5 Loss function

To train our model, we adopt the following triplet loss as proposed in [20]:

$$\mathcal{L}(f_Q, f_{pos}, f_{neg}) = \max\{\, d(f_Q, f_{pos}) - d(f_Q, f_{neg}) + m,\ 0 \,\},$$

where $f_Q$ is the descriptor of the query scan, $f_{pos}$ and $f_{neg}$ are the descriptors of the positive sample and the negative sample respectively, and $m$ is a margin. $d(x, y)$ means the Euclidean distance between $x$ and $y$. To build informative triplets, we use batch-hard negative mining following [20].

After the network is trained, all point clouds are embedded into descriptors using the model. We then use the KNN algorithm to find the submap in the database that is most similar to, and therefore located closest to, the query scan $Q$.
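The triplet loss and the mining step can be illustrated with a short sketch. It assumes the usual margin-based form (distance to the positive minus distance to the negative plus a margin, clipped at zero) and the common batch-hard rule of picking, for every anchor, the farthest positive and the closest negative in the batch. The masks, names and toy descriptors are hypothetical; the exact mining procedure of [20] may differ in detail.

```python
import math

def dist(a, b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss; zero once the negative is far enough away."""
    return max(dist(anchor, positive) - dist(anchor, negative) + margin, 0.0)

def batch_hard_triplets(descs, pos_mask, neg_mask):
    """For each anchor pick the hardest positive and hardest negative in the batch."""
    triplets = []
    for i, anchor in enumerate(descs):
        pos = [j for j in range(len(descs)) if pos_mask[i][j]]
        neg = [j for j in range(len(descs)) if neg_mask[i][j]]
        if not pos or not neg:
            continue  # an anchor without a positive or a negative is skipped
        hardest_pos = max(pos, key=lambda j: dist(anchor, descs[j]))
        hardest_neg = min(neg, key=lambda j: dist(anchor, descs[j]))
        triplets.append((i, hardest_pos, hardest_neg))
    return triplets

# Toy batch: scans 0 and 1 show one place, scans 2 and 3 another.
descs = [[0.0, 0.0], [0.0, 0.3], [0.4, 0.0], [2.0, 0.0]]
place = [0, 0, 1, 1]
n = len(descs)
pos_mask = [[i != j and place[i] == place[j] for j in range(n)] for i in range(n)]
neg_mask = [[place[i] != place[j] for j in range(n)] for i in range(n)]

triplets = batch_hard_triplets(descs, pos_mask, neg_mask)
losses = [triplet_loss(descs[a], descs[p], descs[q]) for a, p, q in triplets]
```

A triplet whose loss is greater than zero is often called active; counting active triplets is what the dynamic batch sizing strategy in the implementation details relies on.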

4 Experiments

AR@1%: average recall at top 1% (%). AR@1: average recall at top 1 (%).

| Method | AR@1% Oxford | AR@1% U.S. | AR@1% R.A. | AR@1% B.D. | AR@1 Oxford | AR@1 U.S. | AR@1 R.A. | AR@1 B.D. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PointNetVLAD [36] | 80.3 | 72.6 | 60.3 | 65.3 | - | - | - | - |
| PCAN [44] | 83.8 | 79.1 | 71.2 | 66.8 | - | - | - | - |
| DAGC [35] | 87.5 | 83.5 | 75.7 | 71.2 | - | - | - | - |
| SOE-Net [41] | 96.4 | 93.2 | 91.5 | 88.5 | - | - | - | - |
| SR-Net [8] | 94.6 | 94.3 | 89.2 | 83.5 | 86.8 | 86.8 | 80.2 | 77.3 |
| LPD-Net [24] | 94.9 | 96 | 90.5 | 89.1 | 86.3 | 87 | 83.1 | 82.3 |
| Minkloc3D [20] | 97.9 | 95 | 91.2 | 88.5 | 93 | 86.7 | 80.4 | 81.5 |
| SVT-Net (Ours) | 97.8 | 96.5 | 92.7 | 90.7 | 93.7 | 90.1 | 84.3 | 85.5 |
| ASVT-Net (Ours) | 98 | 96.1 | 92 | 88.4 | 93.9 | 87.9 | 83.3 | 82.3 |
| CSVT-Net (Ours) | 97.7 | 95.5 | 92.3 | 89.5 | 93.1 | 88.3 | 82.7 | 83.3 |

Table 1: Accuracy comparison of our method with state-of-the-art methods at the baseline setting.

| Method | Time | Parameters |
| --- | --- | --- |
| PointNetVLAD [36] | - | 19.8M |
| PCAN [44] | - | 20.4M |
| LPD-Net [24] | - | 19.8M |
| Minkloc3D [20] | 12.16ms | 1.1M |
| SVT-Net (Ours) | 12.97ms | 0.9M |
| ASVT-Net (Ours) | 11.04ms | 0.4M |
| CSVT-Net (Ours) | 11.75ms | 0.8M |

Table 2: Efficiency comparison of our method with other methods.

4.1 Datasets and Metrics

We use the benchmark datasets proposed by [36] to evaluate our methods. The benchmark contains four datasets: one outdoor dataset named Oxford, generated from Oxford RobotCar [26], and three in-house datasets: university sector (U.S.), residential area (R.A.) and business district (B.D.). The benchmark contains 21,711, 400, 320, 200 submaps for training and 3,030, 80, 75, 200 submaps for testing for Oxford, U.S., R.A. and B.D. respectively. Ground points of each submap are removed, and each point cloud finally contains 4096 points. In training, point clouds are regarded as correct matches if they are at most 10m apart and as wrong matches if they are at least 50m apart. In testing, the retrieved point cloud is regarded as a correct match if it is within 25m of the query scan. Following previous works [36, 44, 35, 24, 8, 41, 20], we choose average recall at top N as our metric: if one of the top N retrieved submaps matches the query scan, the retrieval is regarded as correct. Among the possible values of N, average recall at top 1% and average recall at top 1 are most frequently reported.
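The metric can be made concrete with a short sketch. It assumes that each query comes with its retrieved submaps already ranked and flagged as correct or not, and that "top 1%" is turned into a count by rounding one hundredth of the database size with a minimum of 1; that rounding rule and all names are assumptions made for illustration.

```python
import math

def is_correct(query_xy, retrieved_xy, threshold_m=25.0):
    """A retrieval counts as correct if it lies within the distance threshold."""
    dx = query_xy[0] - retrieved_xy[0]
    dy = query_xy[1] - retrieved_xy[1]
    return math.hypot(dx, dy) <= threshold_m

def top_one_percent(database_size):
    """Number of candidates that make up the top 1% (at least one)."""
    return max(int(round(database_size / 100.0)), 1)

def recall_at_top_n(ranked_matches, n):
    """Percentage of queries with at least one correct match in their top n.

    ranked_matches: one list of booleans per query, ordered by descriptor
    distance (closest first).
    """
    hits = sum(1 for matches in ranked_matches if any(matches[:n]))
    return 100.0 * hits / len(ranked_matches)

# Toy example: three queries, two ranked candidates each.
ranked = [
    [True, False],   # correct at rank 1
    [False, True],   # correct only at rank 2
    [False, False],  # never correct
]
print(recall_at_top_n(ranked, 1), recall_at_top_n(ranked, 2))
```

Recall at top 1 is the stricter of the two reported numbers because only the single closest submap is allowed to match.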

4.2 Implementation details

In all experiments, we voxelize 3D point coordinates with a quantization step of 0.01. The voxelization and the following SP-Conv operations are performed with the MinkowskiEngine auto-differentiation library [5]. The dimension $d$ of the final descriptor is set to 256. The number of tokens $L$ is set to 8. In ASVT, the dimension of the SP-queries and SP-keys is reduced by a factor of 8 relative to the input. The margin in the loss function is set to 0.2. As in [20], to prevent embedding collapse in early epochs of training, we use a dynamic batch sizing strategy: during training, we count the number of active triplets, and when it falls below 70% of the current batch size, the batch size is increased by 40% until the maximum size of 256 elements is reached. Following previous work, we train two versions of the model: a baseline model and a refined model. The baseline model is trained using only the training set of the Oxford dataset, and the refined model is trained by adding the training sets of U.S. and R.A. (the training set of B.D. is not added). In the baseline setting, the initial batch size is 32 and the initial learning rate is

. The model is trained for 40 epochs and the learning rate is decayed by 10 at the end of the 30th epoch. The refined model is trained with an initial batch size of 16 and an initial learning rate of

. The model is trained for 80 epochs and the learning rate is decayed by a factor of 10 at the end of the 60th epoch. The model is implemented in PyTorch [30] and optimized with the Adam optimizer. Random jitter, random translation, random point removal and random erasing are adopted for data augmentation during training. All experiments are performed on a Tesla V100 GPU with 32 GB of memory.
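The dynamic batch sizing rule described in this subsection can be written as a small update function. The thresholds (70%, 40%, 256) come from the text; truncating the enlarged batch size to an integer and the function name are assumptions made for this sketch.

```python
def next_batch_size(batch_size, active_triplets,
                    active_ratio=0.7, growth=1.4, max_size=256):
    """Grow the batch when too few triplets are still active.

    batch_size: current number of elements per batch.
    active_triplets: number of triplets with non-zero loss observed
        for the current batch size.
    """
    if active_triplets < active_ratio * batch_size:
        # Too many triplets are already satisfied: enlarge the batch so
        # that harder examples are more likely to appear together.
        batch_size = min(int(batch_size * growth), max_size)
    return batch_size

# Starting from the baseline setting of 32 elements per batch.
size = 32
size = next_batch_size(size, active_triplets=20)   # 20 < 22.4, so the batch grows
size = next_batch_size(size, active_triplets=40)   # enough active triplets, unchanged
```

Growing the batch only when training starts to saturate keeps early epochs cheap while still supplying informative triplets later on.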

AR@1%: average recall at top 1% (%). AR@1: average recall at top 1 (%).

| Method | AR@1% Oxford | AR@1% U.S. | AR@1% R.A. | AR@1% B.D. | AR@1 Oxford | AR@1 U.S. | AR@1 R.A. | AR@1 B.D. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PointNetVLAD [36] | 80.1 | 90.1 | 93.1 | 86.5 | 63.3 | 86.1 | 82.7 | 80.1 |
| PCAN [44] | 86.4 | 94.1 | 92.3 | 87 | 70.7 | 83.7 | 82.3 | 80.3 |
| DAGC [35] | 87.8 | 94.3 | 93.4 | 88.5 | 71.4 | 86.3 | 82.8 | 81.3 |
| SOE-Net [41] | 96.4 | 97.7 | 95.9 | 92.6 | 89.3 | 91.8 | 90.2 | 89 |
| SR-Net [8] | 95.3 | 98.5 | 93.6 | 90.8 | 88.5 | 93.5 | 86.8 | 85.9 |
| LPD-Net [24] | 98.2 | 98.2 | 94.4 | 91.6 | 93 | 90.5 | 97.4 | 85.9 |
| Minkloc3D [20] | 98.5 | 99.7 | 99.3 | 96.7 | 94.8 | 97.2 | 96.7 | 94 |
| SVT-Net (Ours) | 98.4 | 99.9 | 99.5 | 97.2 | 94.7 | 97 | 95.2 | 94.4 |
| ASVT-Net (Ours) | 98.3 | 99.6 | 98.9 | 97 | 94.6 | 97.5 | 95 | 94.5 |
| CSVT-Net (Ours) | 98.6 | 99.8 | 98.7 | 97.3 | 94.8 | 96.6 | 96.2 | 94.3 |

Table 3: Accuracy comparison of our method with state-of-the-art methods at the refined setting.

4.3 Results

In this section, we experimentally answer the following questions: Can SVT-Net surpass existing methods in terms of accuracy? Does SVT-Net really meet the requirements of a super light-weight model in terms of model size and inference speed? And what features have ASVT and CSVT learned that help improve performance?

Accuracy: To verify the effectiveness of our method, we compare our models with PointNetVLAD [36], PCAN [44], DAGC [35], SR-Net [8], LPD-Net [24], SOE-Net [41] and Minkloc3D [20]. Table 1 shows the results of all methods at the baseline setting. SVT-Net outperforms the previous state-of-the-art methods on almost every dataset and metric, especially for average recall at top 1 on U.S., R.A., and B.D., where SVT-Net improves over Minkloc3D by 3.4%, 3.9% and 4% respectively. Compared to SVT-Net, the performance of ASVT-Net and CSVT-Net drops to some extent. However, in most cases they still clearly outperform the previous best model Minkloc3D. We attribute the accuracy gain to the two novel SVTs we design. Note that Minkloc3D is also built upon SP-Conv and shares the same loss function as our models, yet its performance is lower, which further confirms the benefit of our two proposed Transformers. For a comprehensive comparison, we also show the results of all models at the refined setting in Table 3. At the refined setting, our models still significantly outperform all models except Minkloc3D. Our models also perform better than Minkloc3D in most cases, although only by a small margin. The difference between our three models becomes narrow. We attribute this to all models approaching the upper bound of accuracy on this benchmark.

Model size and speed: To verify the efficiency of our method, we compare our models with previous works in terms of model size and inference time in Table 2 and Figure 1. For model size, SVT-Net and CSVT-Net save 18.2% and 27.3% of the parameters respectively compared to the current smallest model Minkloc3D. ASVT-Net has only 36.4% of the parameters of Minkloc3D, which is a significant reduction. It is worth noting that all three of our models outperform Minkloc3D by a large margin on most metrics at the baseline setting. Improving accuracy while drastically reducing the number of parameters further demonstrates the benefit of our two Transformers. For speed, compared to the current fastest model Minkloc3D, SVT-Net adds only negligible inference time, and both ASVT-Net and CSVT-Net run faster than Minkloc3D. In short, our models are competitive in terms of both model size and speed.
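The percentages in this paragraph follow directly from the parameter counts in Table 2. The short check below reproduces them; the variable and function names are illustrative only.

```python
# Parameter counts in millions, copied from Table 2.
BASELINE_PARAMS_M = 1.1  # Minkloc3D
OUR_PARAMS_M = {"SVT-Net": 0.9, "ASVT-Net": 0.4, "CSVT-Net": 0.8}

def saving_percent(params_m, baseline_m=BASELINE_PARAMS_M):
    """Share of the baseline parameters that is saved, in percent."""
    return 100.0 * (1.0 - params_m / baseline_m)

def fraction_percent(params_m, baseline_m=BASELINE_PARAMS_M):
    """Share of the baseline parameters that is kept, in percent."""
    return 100.0 * params_m / baseline_m

for name, params in OUR_PARAMS_M.items():
    print(f"{name}: saves {saving_percent(params):.1f}%, "
          f"keeps {fraction_percent(params):.1f}% of the baseline parameters")
```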

Figure 5: Visualization of what ASVT and CSVT have learned.

AR@1%: average recall at top 1% (%). AR@1: average recall at top 1 (%).

| Setting | AR@1% Oxford | AR@1% U.S. | AR@1% R.A. | AR@1% B.D. | AR@1 Oxford | AR@1 U.S. | AR@1 R.A. | AR@1 B.D. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A: L=4, d=256, add | 97.9 | 96.4 | 92.5 | 89 | 93.7 | 89 | 83.9 | 82.5 |
| B: L=6, d=256, add | 98 | 96.2 | 92.3 | 90.1 | 93.8 | 88.3 | 83.7 | 84.4 |
| C: L=10, d=256, add | 97.9 | 96.2 | 92 | 89.4 | 93.8 | 87.2 | 83.3 | 83.5 |
| D: L=8, d=128, add | 97.8 | 95.2 | 92 | 89 | 93.3 | 88.9 | 81.9 | 82.5 |
| E: L=8, d=384, add | 98.2 | 94.8 | 92.5 | 89 | 94.4 | 86.9 | 84.9 | 83.7 |
| F: L=8, d=512, add | 98 | 97.3 | 92.1 | 88.2 | 93.9 | 90.1 | 84 | 82.7 |
| G: L=8, d=256, cat | 97.5 | 93.4 | 85.8 | 84.7 | 92.7 | 81.9 | 73.9 | 77.1 |
| H: L=8, d=256, cat&spconv | 96.5 | 89.8 | 84.5 | 82.4 | 89.5 | 78.2 | 71.2 | 74 |
| SVT-Net: L=8, d=256, add | 97.8 | 96.5 | 92.7 | 90.7 | 93.7 | 90.1 | 84.3 | 85.5 |

Table 4: Results of ablation study.

What Transformers have learned: One may be interested in what ASVT and CSVT have learned that makes our models so effective. To explore this, we show some visualization results in Figure 5. The first row shows original point clouds randomly selected from Oxford, U.S., R.A. and B.D. respectively. In the second row, we visualize the features of each non-empty voxel after ASVT using t-SNE [37]. Different colors represent different positions of these features in the feature space. It can be seen that, by interacting each atom with all the others, the model indeed learns the relationships between atoms. Specifically, nearby atoms share the same color, which means they are attended similarly since they may belong to the same object parts. In addition, far-away atoms in 3D space that share the same implicit mode have similar colors, which means that long range contextual information, such as the relationship between semantically similar atoms located at different and far-away positions (e.g., the "same-category" information), has been discovered by the model.

In the third row, we visualize which token each non-empty voxel belongs to. Different colors represent different tokens. It can be seen that voxels belonging to the same token tend to represent the same objects and share some geometric characteristics. This observation means that voxels have indeed been clustered together in the feature space according to their geometric characteristics. The interaction between clusters or tokens can then enhance the model's understanding of the scene. For example, long range context properties such as the relative positions between clusters can be encoded through this kind of interaction.

In short, the visualization results confirm the intuition behind ASVT and CSVT, and both contribute to the performance improvement.

Figure 6: Visualization of training stability.

4.4 Ablation study

In this section, we study the impact of the number of tokens $L$, the dimension $d$ of the final global descriptor, the Transformer feature fusion strategy, and the training stability of our models. We design experiments A to H to evaluate the impact of $L$, $d$ and different fusion strategies. Table 4 shows the results under different values of $L$ and $d$ and different fusion strategies, including adding features (add), concatenating features (cat) and concatenating features followed by an SP-Conv (cat&spconv). "SVT-Net" in the last row of Table 4 refers to the model version we finally choose.

Impact of number of tokens: The number of tokens ($L$) decides how many clusters we divide the scene into. We change the value of $L$ and compare the results in Table 4. Comparing experiments A, B, C and SVT-Net shows that setting $L$ to 8 is the best choice. When $L$ is too small, interaction happens between only a few tokens, which cannot help our model fully discover long range properties between different regions. When $L$ is too large, the model is prone to overfitting.

Impact of descriptor dimension: To a certain extent, the dimension $d$ directly determines the global descriptor's capability of describing a scene. From experiments D, E, F and SVT-Net in Table 4, we find that when $d$ is smaller than 256, the results drop noticeably, which means that a small dimension degrades performance. As the dimension increases, the performance improves. However, when it is larger than 256, the gain is minimal while the model size increases significantly, to 1.8M and 3.0M for $d = 384$ and $d = 512$ respectively. Therefore, for a better trade-off between accuracy and model size, we choose $d = 256$ in our implementation.

Impact of fusion strategy: In SVT-Net, we need to fuse the features learned by ASVT and CSVT before aggregating voxel features into a global descriptor. In experiment G, we investigate the effectiveness of another fusion method, concatenation. In this case, the output dimension is 512. However, concatenating the two features does not perform as well as simply adding them (where the dimension is 256). We then ask whether the higher dimension causes the performance drop. Therefore, in experiment H, we add an additional SP-Conv layer after concatenation. Unfortunately, the performance of the model becomes even worse than before. We therefore conclude that direct addition is the best way to fuse the features of the two SVTs.

Training stability: We notice that the evaluation results differ slightly between training runs. To avoid bias, we train each model multiple times and show the boxplot of each model in Figure 6, which reflects the training stability of each model. Considering the trade-off between accuracy, model size, and training stability, we consider SVT-Net the best performing model.

5 Conclusions

In this paper, we introduce a super light-weight network for large scale place recognition named SVT-Net. In SVT-Net, two Sparse Voxel Transformers: Atom-based Sparse Voxel Transformer (ASVT) and Cluster-based Sparse Voxel Transformer (CSVT) are proposed to learn long range contextual properties. Extensive experiments have demonstrated that SVT-Net as well as its two simplified versions ASVT-Net and CSVT-Net can achieve state-of-the-art performances with an extremely light-weight network architecture. In the future, we will investigate how to migrate the two proposed Sparse Voxel Transformers into other point cloud based tasks.