Large scale place recognition and localization is fundamental for a wide range of applications like Simultaneous Localization and Mapping (SLAM) [27, 28], autonomous driving [21, 11], robot navigation [29, 33], etc. For example, the place recognition result is often used as a loop-closure signal in SLAM systems when the GPS signal is not available. A line of works [22, 13, 43] has chosen to use images for place recognition and shown promising performance. However, images are sensitive to illumination, weather change, diurnal variation, etc., making models based on them unstable and unreliable. Besides, due to the lack of depth information, image based methods struggle to fully understand the scene and are easily fooled by planar puzzles.
To tackle these challenges, a feasible solution is replacing images with point clouds collected by LiDAR. Scenes represented by point clouds are inherently invariant to illumination and weather changes while containing accurate and detailed 3D information. Recently, a range of deep learning models [36, 44, 35, 24, 8, 41, 20] that utilize point clouds for place recognition have been proposed. Figure 1 (top) illustrates the common pipeline of these point cloud based place recognition methods. For a large scale region, a database of LiDAR scans tagged with UTM coordinates acquired from GPS/INS readings is constructed in advance. When a query scan is collected by the LiDAR, the most similar point cloud to the query scan is retrieved from the database to determine the location of the query scan.
To achieve accurate retrieval, a powerful deep learning model is needed to learn discriminative global descriptors or embeddings of point clouds. In fact, learning powerful scene descriptors is the key to the recognition task. However, we observe that when learning global descriptors, most previous methods only consider how to better extract short range local features, while the equally important long range contextual properties have long been neglected. We argue that, lacking awareness of long range contextual properties, the power of the final descriptors is greatly limited. Besides, we also notice that model size has been a bottleneck for further performance improvement and practical deployment. More specifically, in most SLAM or robot navigation systems, the available memory is tight: the smaller the model size, the easier it is to deploy the place recognition algorithm on more hardware products and serve more scenarios. Therefore, designing light-weight descriptor learning models with small size and fast running time is necessary.
Motivated by the above observations, we propose a novel super light-weight network named SVT-Net. In our work, the point clouds are first voxelized into sparse voxel representations to better characterize the structured information of the scene. Then, we choose the light-weight 3D Sparse Convolution (SP-Conv) as our basic unit to extract local features owing to its flexibility and powerful local feature learning ability. However, simply stacking SP-Conv layers may ignore long range contextual properties. Therefore, inspired by recently proposed Vision Transformer networks [7, 40], we propose two kinds of Sparse Voxel Transformers (SVTs), named the Atom-based Sparse Voxel Transformer (ASVT) and the Cluster-based Sparse Voxel Transformer (CSVT), on top of the SP-Conv layers. ASVT and CSVT extract the long range contextual features implicit in the sparse voxel representation from two perspectives: attending to different key atoms and clustering different key regions in the feature space, thereby helping to obtain more discriminative descriptors through the interaction of different atoms and different clusters respectively. Since SP-Conv only performs convolutions on non-empty voxels, it is computationally efficient and flexible, and so are the two SVTs built upon it. Thanks to the strong capabilities of the two SVTs, our model can learn sufficiently powerful descriptors from an extremely shallow network architecture. Therefore, the model size of SVT-Net is very small, as shown in Figure 1 (bottom). Experimental results show that, though small, SVT-Net achieves state-of-the-art performance in terms of both accuracy and speed on the Oxford RobotCar dataset and three in-house datasets. What's more, to further increase speed and reduce model size, we propose two simplified versions of SVT-Net, ASVT-Net and CSVT-Net, which also achieve state-of-the-art performance with model sizes of only 0.8M and 0.4M respectively. Our contributions can be summarized as:
We propose a novel light-weight point cloud based place recognition model named SVT-Net as well as two simplified versions, ASVT-Net and CSVT-Net, which all achieve state-of-the-art performance in terms of both accuracy and speed with an extremely small model size.
We propose the Atom-based Sparse Voxel Transformer (ASVT) and the Cluster-based Sparse Voxel Transformer (CSVT) for learning the long range contextual features hidden in point clouds. To the best of our knowledge, we are the first to propose Transformers for sparse voxel representations.
We have conducted extensive quantitative and qualitative experiments to verify the effectiveness and efficiency of our proposed models and analyse what the two proposed Transformers actually learn.
2 Related Work
Large scale place recognition involves a wide range of technologies and fields. In this section, we briefly introduce two kinds of closely related works. Specifically, we first introduce studies on image based and point cloud based place recognition. Then, we briefly review some recently proposed Vision Transformer networks.
2.1 Large scale place recognition
Large scale place recognition has long attracted researchers' interest. In early years, hand-crafted features like SIFT , SURF  and ORB  extracted from images were commonly used for this task [10, 9, 18]. Though simple, hand-crafted features have limited power. Therefore, with the development of deep learning, learned features have become more popular. A typical deep learning method for place recognition is NetVLAD , which learns global descriptors by clustering CNN features into several different visual words. A variety of follow-up works [43, 15] have been proposed to improve it and have achieved promising results. However, though much more powerful than hand-crafted features, learned image features still suffer from their sensitivity to illumination, weather change, diurnal variation, etc.
Recently, utilizing point clouds for large scale place recognition has attracted much attention owing to point clouds' robustness to environmental changes. PointNetVLAD  is a pioneering work: it first uses PointNet  and NetVLAD  to learn global descriptors, and then a K-Nearest-Neighbors (KNNs) algorithm is used for retrieval and recognition. Zhang and Xiao  propose a context-aware attention mechanism to help the model learn stronger local features in their model PCAN. DAGC , SRNet  and LPD-Net  all use Graph Convolutional Networks (GCNs) to better capture local features. More recently, SOE-Net  introduces a PointOE module that encodes local features from eight orientations to improve place recognition performance. The above mentioned methods all learn point cloud features by taking unstructured point-wise representations as input, which requires a huge model with a large number of parameters to learn reliable features. In contrast, MinkLoc3D  uses sparse voxel representations as input, builds a simple Feature Pyramid Network (FPN)-like architecture for learning point cloud descriptors, and ranks as the current state-of-the-art. MinkLoc3D has significantly reduced the model size. However, it neglects the importance of the long range contextual properties hidden in the point cloud. In our work, we also use sparse voxel representations as input, but we further propose two Sparse Voxel Transformers (SVTs) to learn these long range contextual properties. Thanks to the strong capability of the two SVTs, the model size of our work is further reduced.
2.2 Vision transformers
The Transformer is originally proposed for natural language processing (NLP) tasks [6, 16, 42, 19, 3]. In the Transformer, the self-attention mechanism is the core of its function owing to its ability to capture long range contextual information. At present, the Transformer has become the most important basic module in the NLP field. Inspired by its great success in NLP, researchers have gradually begun to ask whether self-attention can also play a role in computer vision.
Therefore, the Vision Transformer (ViT)  was proposed recently. It adopts the idea of self-attention and divides images into 16x16 words, so that images can be processed like natural languages. A variety of follow-up works [40, 39, 23] have been proposed to improve it. For example, Wu et al.  propose the Visual Transformer (VT), which elegantly projects image features into tokens and processes these tokens by means of the classic Transformer , greatly reducing the computational cost. PVT  introduces an FPN-like structure to better cope with dense prediction tasks. Swin-Transformer  presents a hierarchical architecture in which self-attention is limited to non-overlapping local windows, achieving higher efficiency. More recently, Jiang et al.  successfully employ vision Transformers in GANs. For a more comprehensive introduction to Vision Transformers on 2D images, we refer readers to . The vision Transformers introduced so far are all designed for processing images. When it comes to Transformers for processing point clouds, there are only a few works [45, 12], which means that Transformers for 3D vision are still under-explored. In this paper, we propose two kinds of Transformers that can process sparse voxel representations of point clouds. To the best of our knowledge, this is the first work to design Sparse Voxel Transformers in the literature.
3.1 Problem statement
Let M be a database of pre-defined 3D submaps (represented as point clouds), and Q be a query scan. The place recognition problem is defined as retrieving the submap m* from M that is closest to Q. To achieve accurate retrieval, we have to design a deep learning model that can embed all point clouds into discriminative global descriptors, so that a following KNNs algorithm can be used for finding m*. To employ 3D Sparse Convolution , we first voxelize all point clouds into sparse voxel representations, where each voxel takes the value 1 if it is occupied by any point of the cloud (a non-empty voxel) and 0 otherwise (an empty voxel). The 3D Sparse Convolution operation is performed only among these non-empty voxels. Hence, it is very efficient and flexible.
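As a concrete illustration of the voxelization step, here is a minimal NumPy sketch (the function name and parameters are ours; the actual pipeline uses MinkowskiEngine). Quantizing points to an integer grid and keeping one entry per occupied cell yields the sparse representation, with empty voxels never materialized:

```python
import numpy as np

def voxelize(points, quantization_step=0.01):
    """Quantize 3D points and keep one entry per occupied (non-empty)
    voxel; empty voxels are never stored, which is what makes the
    sparse representation memory-efficient."""
    coords = np.floor(points / quantization_step).astype(np.int64)
    return np.unique(coords, axis=0)

points = np.array([[0.004, 0.004, 0.004],
                   [0.006, 0.006, 0.006],   # falls into the same voxel
                   [0.015, 0.002, 0.002]])
voxels = voxelize(points)   # two occupied voxels out of three points
```

Downstream sparse convolutions then operate only on these coordinate lists, so cost scales with the number of occupied voxels rather than the full grid volume.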
Next, we will first introduce the Atom-based Sparse Voxel Transformer (ASVT) and the Cluster-based Sparse Voxel Transformer (CSVT) respectively. Then, the overall network architecture of SVT-Net as well as the architectures of the two simplified versions (ASVT-Net and CSVT-Net) will be introduced in detail. The loss function will be presented finally.
3.2 Atom-based sparse voxel transformer
As mentioned before, simply stacking SP-Convs can only learn local information from nearby voxels. To capture the long range contextual properties hidden in the point cloud, we design ASVT, which adopts the idea of self-attention to aggregate information from both nearby and far-away voxels. In ASVT, we define each individual voxel as an atom. During processing, each atom interacts with all other atoms according to learned per-atom contributions. By doing so, different key atoms can be attended to by other atoms, so that both the local relationships of nearby atoms and the long range contextual relationships of far-away atoms are learned. Note that learning such long range contextual relationships is very important for the model. For example, assume a scene contains two atoms that belong to different instances of the same category. If only SP-Conv is used, the "same-category" information may be missed due to the small receptive field, while if ASVT is added to learn such information, the model can better encode what the scene describes. Hence the final global descriptor becomes more powerful.
The architecture of ASVT is illustrated in Figure 2.
Let F be the input sparse voxel features learned by sparse convolutions (SP-voxel features for simplicity), with feature dimension d. We first learn the sparse voxel values (SP-values for simplicity) V, SP-queries Q, and SP-keys K through three different SP-Convs respectively:

V = SPConv_v(F), Q = SPConv_q(F), K = SPConv_k(F),

where the output dimension of SPConv_q and SPConv_k is reduced to lower the computational cost in later steps. That is to say, the dimension of the SP-queries and SP-keys is reduced from d to d/8 for efficiency. After that, the SP-voxel features of the SP-values (SP-queries/keys) are rearranged into a tensor of size N x d (N x d/8), where N is the number of non-empty voxels.
Then, we use Q and K to calculate the SP-attention map A:

A = softmax(Q K^T),

where A encodes the contribution relationship of each atom with all the other atoms. In the following attending operation, these relationships contribute to aggregating both short range local information and long range contextual information by interacting atoms. The attending operation can be summarized as:

F_a = A V,

where F_a is called the atom-attended SP-voxel features. In F_a, the features of each atom have accepted contributions from all the other atoms. Thus it encodes meaningful contextual information to describe the scene.
Finally, we rearrange F_a back to sparse voxel representations with a dimension of d and regard it as a residual term. The final ASVT feature F_ASVT is defined as the sum of F and F_a:

F_ASVT = F + F_a.
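The ASVT computation can be sketched in a few lines of NumPy. Note this is a simplified illustration under our own assumptions: plain weight matrices stand in for the three SP-Convs (which really operate on sparse coordinate lists), and the queries/keys are reduced to d/8 as described in the text:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def atom_attention(F, Wq, Wk, Wv):
    """Self-attention over all N non-empty voxels (atoms): every atom
    aggregates contributions from every other atom, capturing long
    range context that a small convolution kernel would miss."""
    Q, K, V = F @ Wq, F @ Wk, F @ Wv   # queries/keys reduced to d // 8
    A = softmax(Q @ K.T, axis=-1)      # (N, N) SP-attention map
    return F + A @ V                   # attended features + residual

rng = np.random.default_rng(0)
N, d = 5, 16
F = rng.standard_normal((N, d))
out = atom_attention(F,
                     rng.standard_normal((d, d // 8)),
                     rng.standard_normal((d, d // 8)),
                     rng.standard_normal((d, d)))
```

The (N, N) attention map is what makes the receptive field global: it connects every pair of atoms regardless of their spatial distance.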
3.3 Cluster-based sparse voxel transformer
Another observation is that, in the sparse voxel representation, some atoms may share the same characteristics. For example, atoms representing walls always form a plane, while atoms representing columns easily form a cylinder-like structure. This means that atoms can actually be grouped into different clusters according to their characteristics, and long range contextual properties can also be extracted from the perspective of interaction between these clusters. Motivated by this intuition, we propose CSVT, which is illustrated in Figure 3. As shown in the figure, CSVT consists of three components: a Tokenizer module, a Transformer module and a Projector module.
The Tokenizer module is used to transform the input SP-voxel features into tokens, where each token represents a cluster in the latent space. We again denote the initial SP-voxel features as F. To achieve the goal of the tokenizer, we first use an SP-Conv operation followed by a rearrange operation to generate a grouping map G:

G = psi(SPConv_g(F)),

where psi is the rearrange operation and L is the number of tokens we choose to generate. G stores the probabilities of each voxel belonging to each token. Therefore, we can use G to capture representations of the tokens by grouping different voxels into different clusters:

T = softmax(G)^T F,

where T denotes the representations of the L tokens, with each of them described by d features.
A Transformer module is then used to learn long range properties among the different clusters through the interaction of these tokens. First, we generate values, keys, and queries using shared convolutional kernels:

V_t = W_v T, K_t = W_k T, Q_t = W_q T.

Then, the tokens are interacted with each other through the following attention operation:

T' = T + softmax(Q_t K_t^T) V_t,

where T' denotes the attended tokens.
The Projector module is then used to project the token features back to the sparse voxel representations. Specifically, we use F and T' to calculate a projection map P:

P = softmax(F W_p (T')^T).

Then, the projection operation is defined as:

F_c = P T',
Again, we rearrange F_c back to sparse voxel representations with a dimension of d and regard it as a residual term. The final CSVT feature F_CSVT is defined as:

F_CSVT = F + F_c.
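The three CSVT stages (Tokenizer, Transformer, Projector) can be sketched end-to-end in NumPy. As with the ASVT sketch, the weight matrices are our stand-ins for the SP-Convs and shared convolutional kernels, so this is an illustration of the computation pattern rather than the exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def csvt(F, Wg, Wq, Wk, Wv, Wp):
    """Tokenizer -> Transformer -> Projector over (N, d) voxel features
    with L tokens; linear maps stand in for the SP-Convs."""
    # Tokenizer: soft-assign voxels to L clusters, pool into tokens
    G = softmax(F @ Wg, axis=0)          # (N, L) grouping map
    T = G.T @ F                          # (L, d) token features
    # Transformer: interact the L tokens with each other
    Q, K, V = T @ Wq, T @ Wk, T @ Wv
    T = T + softmax(Q @ K.T, axis=-1) @ V
    # Projector: distribute token features back onto the voxels
    P = softmax((F @ Wp) @ T.T, axis=-1)   # (N, L) projection map
    return F + P @ T                       # residual connection

rng = np.random.default_rng(0)
N, d, L = 6, 16, 8
F = rng.standard_normal((N, d))
out = csvt(F, rng.standard_normal((d, L)),
           *[rng.standard_normal((d, d)) for _ in range(3)],
           rng.standard_normal((d, d)))
```

Because attention runs over only L tokens instead of all N voxels, this branch stays cheap even for large scenes.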
3.4 Network architecture
The overall architecture of SVT-Net is built upon the above introduced ASVT and CSVT as well as the 3D Sparse Convolution (SP-Conv). Specifically, as shown in Figure 4, the initial sparse voxel representation is first fed into an initial SP-Conv layer with an output dimension of 32 to learn initial sparse voxel features. Then an SP-Res-Block consisting of two SP-Conv layers with a skip connection is used to enhance the learned features and increase the feature dimension to 64. Next, another SP-Conv layer is used to increase the feature dimension to equal the final descriptor's dimension d. After that, the SP-voxel features are fed into two branches that learn the ASVT feature and the CSVT feature using the two proposed Sparse Voxel Transformers (SVTs) respectively. The learned ASVT feature and CSVT feature are fused by directly adding them together. Finally, the global descriptor is computed using a GeM pooling operation [32, 20].
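The final GeM (generalized-mean) pooling step admits a compact sketch; p = 1 recovers average pooling and large p approaches max pooling. The value of p and the clamping constant below are illustrative defaults, not the paper's trained values:

```python
import numpy as np

def gem_pool(F, p=3.0, eps=1e-6):
    """Generalized-mean pooling over the N non-empty voxels: raise
    features to the power p, average, and take the p-th root.
    Produces one (d,) global descriptor for the whole scene."""
    F = np.clip(F, eps, None)        # GeM assumes non-negative features
    return (F ** p).mean(axis=0) ** (1.0 / p)

rng = np.random.default_rng(0)
F = rng.random((100, 256))           # e.g. fused ASVT + CSVT features
descriptor = gem_pool(F)             # 256-dim global descriptor
```

In practice p is usually a learnable parameter, letting the network interpolate between average- and max-like pooling during training.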
Thanks to the strong power of ASVT and CSVT, though our network architecture is simple and small, our proposed model SVT-Net can achieve superior performance compared to previous methods.
Note that though we use both ASVT and CSVT in SVT-Net, it is also possible to use them separately. Therefore, we propose two simplified versions of SVT-Net, ASVT-Net and CSVT-Net, which retain only ASVT and only CSVT respectively. According to our experimental results, both ASVT-Net and CSVT-Net also achieve state-of-the-art performance while further reducing the model size.
3.5 Loss function
To train our model, we adopt the following triplet loss as proposed in :

L(a, p, n) = max( d(a, p) - d(a, n) + m, 0 ),

where a is the descriptor of the query scan, p and n are the descriptors of the positive sample and the negative sample respectively, and m is a margin. d(., .) denotes the Euclidean distance between two descriptors. To build informative triplets, we use batch-hard negative mining following .
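The triplet loss and batch-hard mining can be sketched as follows. This is a simplified version under our own assumptions (binary same-place labels within the batch; the actual implementation follows the cited prior work); a triplet is "active" when its loss is non-zero, a notion the dynamic batch sizing strategy relies on:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """max(d(a, p) - d(a, n) + m, 0) with Euclidean distances."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

def batch_hard_triplets(desc, labels, margin=0.2):
    """For each anchor, pick the hardest (farthest) positive and the
    hardest (closest) negative within the batch and return the per-
    anchor losses; non-zero entries correspond to active triplets."""
    losses = []
    for i, a in enumerate(desc):
        dist = np.linalg.norm(desc - a, axis=1)
        pos = (labels == labels[i]) & (np.arange(len(desc)) != i)
        neg = labels != labels[i]
        if pos.any() and neg.any():
            losses.append(max(dist[pos].max() - dist[neg].min() + margin, 0.0))
    return losses

# two well-separated places: every batch-hard triplet is inactive
desc = np.array([[0., 0.], [0., 1.], [5., 0.], [5., 1.]])
labels = np.array([0, 0, 1, 1])
losses = batch_hard_triplets(desc, labels)
```

Mining the hardest positive and negative per anchor keeps gradients informative: easy triplets contribute zero loss and would otherwise dominate a random sampling scheme.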
After the network is trained, all point clouds are embedded into descriptors using the model, and we use the KNNs algorithm to find the submap in the database that is most similar to, and hence located closest to, the query scan Q.
4.1 Datasets and Metrics
We use the benchmark datasets proposed by  to evaluate our methods. The benchmark contains four datasets: one outdoor dataset named Oxford generated from Oxford RobotCar  and three in-house datasets: university sector (U.S.), residential area (R.A.) and business district (B.D.). The benchmark contains 21,711, 400, 320 and 200 submaps for training and 3,030, 80, 75 and 200 submaps for testing for Oxford, U.S., R.A. and B.D. respectively. Ground points of each submap are removed and each point cloud finally contains 4,096 points. In training, point clouds are regarded as correct matches if they are at most 10 m apart and as wrong matches if they are at least 50 m apart. In testing, a retrieved point cloud is regarded as a correct match if it lies within 25 m of the query scan. Following previous works [36, 44, 35, 24, 8, 41, 20], we choose average recall at top N as our metric: if one of the top N retrieved submaps matches the query scan, the retrieval is regarded as correct. Among the top-N variants, average recall at top 1% and average recall at top 1 are most frequently reported.
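The average-recall-at-top-N metric can be made precise with a short sketch. Here the `is_match` matrix encodes the 25 m ground-truth criterion and is taken as an input (in the real benchmark it is derived from UTM coordinates); function and variable names are ours:

```python
import numpy as np

def average_recall_at_top_n(query_desc, db_desc, is_match, n=1):
    """A query counts as correct if ANY of its top-n nearest database
    descriptors is a true match; returns the percentage of correct
    queries."""
    hits = 0
    for i, q in enumerate(query_desc):
        dist = np.linalg.norm(db_desc - q, axis=1)
        top_n = np.argsort(dist)[:n]        # indices of n nearest entries
        hits += bool(is_match[i, top_n].any())
    return 100.0 * hits / len(query_desc)

queries = np.array([[0., 0.], [10., 0.]])
db = np.array([[0.1, 0.], [9., 0.], [5., 5.]])
is_match = np.array([[True, False, False],
                     [False, True, False]])
recall_at_1 = average_recall_at_top_n(queries, db, is_match, n=1)
```

"Top 1%" simply sets n to 1% of the database size, which is why it is always at least as high as recall at top 1.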
4.2 Implementation details
In all experiments, we voxelize the 3D point coordinates with a 0.01 quantization step. The voxelization and the following SP-Conv operations are performed by the MinkowskiEngine auto-differentiation library . The dimension d of the final descriptor is set to 256. The number of tokens L is set to 8. In ASVT, the dimension of the SP-queries and SP-keys is reduced by a factor of 8 from the input, i.e., from 256 to 32. The margin m in the loss function is set to 0.2. The same as in , to prevent embedding collapse in early epochs of training, we use a dynamic batch sizing strategy: during training, we count the number of active triplets, and when it falls below 70% of the current batch size, the batch size is increased by 40% until the maximum size of 256 elements is reached. Following previous work, we train two versions of our models: a baseline model and a refined model. The baseline model is trained only on the training set of the Oxford dataset, and the refined model is trained by adding the training sets of U.S. and R.A. (note that the training set of B.D. is not added). In the baseline setting, the initial batch size is 32 and the initial learning rate is . The model is trained for 40 epochs and the learning rate is decayed by a factor of 10 at the end of the 30th epoch. The refined model is trained with an initial batch size of 16 and an initial learning rate of . It is trained for 80 epochs and the learning rate is decayed by a factor of 10 at the end of the 60th epoch. The models are implemented in PyTorch and optimized with the Adam optimizer. Random jitter, random translation, random point removal and random erasing augmentation are adopted for data augmentation during training. All experiments are performed on a Tesla V100 GPU with 32 GB of memory.
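The dynamic batch sizing rule described above reduces to a one-line update (thresholds as stated in the text; the function name is ours):

```python
def update_batch_size(batch_size, active_fraction,
                      threshold=0.7, growth=1.4, max_size=256):
    """Grow the batch by 40% whenever the fraction of active (non-zero
    loss) triplets drops below 70%, capped at 256 elements; otherwise
    leave the batch size unchanged."""
    if active_fraction < threshold:
        batch_size = min(int(batch_size * growth), max_size)
    return batch_size
```

Starting from a batch of 32 with 50% active triplets, one update grows the batch to 44; repeated growth saturates at the 256-element cap. Larger batches restore hard triplets as training makes most pairs easy, which is what prevents the loss from collapsing to zero gradients.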
In this section, we would like to experimentally answer the following questions: Can SVT-Net surpass existing methods in terms of accuracy? Does SVT-Net really meet the requirement of super light weight in terms of model size and inference speed? And what features have ASVT and CSVT learned to help improve performance?
Accuracy: To verify the effectiveness of our method, we compare our models with PointNetVLAD , PCAN , DAGC , SR-Net , LPD-Net , SOE-Net  and MinkLoc3D . In Table 1, we show the results of all methods in the baseline setting. SVT-Net significantly outperforms all state-of-the-art methods, especially on the average recall at top 1 metric on U.S., R.A. and B.D., where SVT-Net wins by 3.4%, 3.9% and 4.0% over MinkLoc3D respectively. Compared to SVT-Net, the performance of ASVT-Net and CSVT-Net drops to some extent; however, they still largely outperform the previous best model MinkLoc3D. We attribute the accuracy gain to the two novel SVTs we design. Note that MinkLoc3D is also built upon SP-Conv and shares the same loss function as our models, yet its performance is not as strong, which further confirms the superiority of our two proposed Transformers. For a comprehensive comparison, we also show the results of all models in the refined setting in Table 3. In the refined setting, our models still significantly outperform all models except MinkLoc3D; in fact, they still perform better than MinkLoc3D in most cases, although only by a small margin. The difference between our three models also narrows. We attribute this to all models having reached the upper bound of accuracy.
Model size and speed: To verify the efficiency of our method, we compare our models with previous works in terms of model size and inference time in Table 2 and Figure 1. For model size, SVT-Net and CSVT-Net save 18.2% and 27.3% of the parameters respectively compared to the current smallest model, MinkLoc3D. ASVT-Net even has only 36.4% of MinkLoc3D's parameters, a significant reduction. It is worth noting that all three of our models outperform MinkLoc3D by a large margin in terms of accuracy in the baseline setting. The ability to significantly improve accuracy while drastically reducing parameters further demonstrates the superiority of our two Transformers. For speed, compared to the current fastest model MinkLoc3D, SVT-Net adds only negligible inference time, and both ASVT-Net and CSVT-Net run faster than MinkLoc3D. In short, our models perform well in terms of both model size and speed.
The first four result columns report average recall at top 1% (%) and the last four report average recall at top 1 (%), each for Oxford, U.S., R.A. and B.D. in turn.

| Configuration | Oxford | U.S. | R.A. | B.D. | Oxford | U.S. | R.A. | B.D. |
|---|---|---|---|---|---|---|---|---|
| A: L=4, d=256, add | 97.9 | 96.4 | 92.5 | 89.0 | 93.7 | 89.0 | 83.9 | 82.5 |
| B: L=6, d=256, add | 98.0 | 96.2 | 92.3 | 90.1 | 93.8 | 88.3 | 83.7 | 84.4 |
| C: L=10, d=256, add | 97.9 | 96.2 | 92.0 | 89.4 | 93.8 | 87.2 | 83.3 | 83.5 |
| D: L=8, d=128, add | 97.8 | 95.2 | 92.0 | 89.0 | 93.3 | 88.9 | 81.9 | 82.5 |
| E: L=8, d=384, add | 98.2 | 94.8 | 92.5 | 89.0 | 94.4 | 86.9 | 84.9 | 83.7 |
| F: L=8, d=512, add | 98.0 | 97.3 | 92.1 | 88.2 | 93.9 | 90.1 | 84.0 | 82.7 |
| G: L=8, d=256, cat | 97.5 | 93.4 | 85.8 | 84.7 | 92.7 | 81.9 | 73.9 | 77.1 |
| H: L=8, d=256, cat&spconv | 96.5 | 89.8 | 84.5 | 82.4 | 89.5 | 78.2 | 71.2 | 74.0 |
| SVT-Net: L=8, d=256, add | 97.8 | 96.5 | 92.7 | 90.7 | 93.7 | 90.1 | 84.3 | 85.5 |
What the Transformers have learned: One may be interested in what ASVT and CSVT have learned that makes our models so effective. To explore this, we show some visualization results in Figure 5. The first row shows original point clouds randomly selected from Oxford, U.S., R.A. and B.D. respectively. In the second row, we visualize the features of each non-empty voxel after ASVT using t-SNE , with different colors representing different distributions of these features in the feature space. It can be seen that, by interacting each atom with all the others, the model indeed learns the relationships between atoms. Specifically, nearby atoms share the same color, which means they are attended to similarly since they may belong to the same object parts. Moreover, far-away atoms in 3D space that share the same implicit mode also have similar colors, which means long range contextual information, such as the relationship between semantically similar atoms located in different and far-away positions (e.g., the "same-category" information), has been discovered by the model.
In the third row, we visualize which token each non-empty voxel belongs to, with different colors representing different tokens. It can be seen that voxels belonging to the same token always represent the same objects and share some geometric characteristics. This observation means that voxels have indeed been clustered together in the feature space according to their geometric characteristics. Obviously, the interaction between clusters or tokens enhances the model's understanding of the scene; for example, long range contextual properties such as the relative positions between clusters are encoded through this kind of interaction.
In a word, the visualization results confirm the intuitions behind ASVT and CSVT, and both contribute to the performance improvement.
4.4 Ablation study
In this section, we study the impact of the number of tokens L, the dimension of the final global descriptor d, the Transformer feature fusion strategy, and the training stability of our models. We design experiments A to H to evaluate the impacts of L, d and the fusion strategy. Table 4 shows the results under different values of L and d and different fusion strategies, including adding features (add), concatenating features (cat) and concatenating features followed by SP-Conv (cat&spconv). "SVT-Net" in the last row of Table 4 refers to the model version we finally choose.
Impact of the number of tokens: The number of tokens L decides how many clusters we divide the scene into. We vary the value of L and compare the results in Table 4. Comparing experiments A, B, C and SVT-Net shows that setting L to 8 is the best choice. When L is too small, interaction can only take place between a few tokens, which cannot help our model fully discover the long range properties between different regions. When L is too large, overfitting easily occurs.
Impact of descriptor dimension: To a certain extent, the dimension d directly determines the global descriptor's capability of describing a scene. From experiments D, E, F and SVT-Net in Table 4, we find that when d is smaller than 256, the results drop significantly, meaning a small dimension degrades performance. As the dimension increases, performance indeed gains. However, beyond 256 the increase is minimal while the model size increases significantly, to 1.8M and 3.0M parameters for d = 384 and d = 512 respectively. Therefore, for a better trade-off between accuracy and model size, we choose d = 256 in our implementation.
Impact of fusion strategy: In SVT-Net, we need to fuse the features learned by ASVT and CSVT before aggregating the voxel features into a global descriptor. In experiment G, we investigate the effectiveness of another fusion method, concatenation, in which the output dimension becomes 2d. However, the performance of concatenating the two features is not as good as simply adding them (which keeps the dimension at 256). We then asked whether it is the higher dimension that causes the performance drop. Therefore, in experiment H, we add an additional SP-Conv layer after the concatenation. Unfortunately, the performance becomes even worse. Therefore, we conclude that directly adding the two SVT features together is the best fusion strategy.
Training stability: We notice that each training run produces slightly different evaluation results. To avoid bias, we train each model multiple times and show a boxplot for each model in Figure 6, which reflects its training stability. Considering the trade-off between accuracy, model size and training stability, we claim that SVT-Net is the best-performing model.
In this paper, we introduce a super light-weight network for large scale place recognition named SVT-Net. In SVT-Net, two Sparse Voxel Transformers, the Atom-based Sparse Voxel Transformer (ASVT) and the Cluster-based Sparse Voxel Transformer (CSVT), are proposed to learn long range contextual properties. Extensive experiments demonstrate that SVT-Net as well as its two simplified versions, ASVT-Net and CSVT-Net, achieve state-of-the-art performance with an extremely light-weight network architecture. In the future, we will investigate how to migrate the two proposed Sparse Voxel Transformers to other point cloud based tasks.
-  Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5297–5307, 2016.
-  Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In European conference on computer vision, pages 404–417. Springer, 2006.
-  Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. arXiv preprint arXiv:2103.14899, 2021.
-  Xieyuanli Chen, Thomas Läbe, Andres Milioto, Timo Röhling, Olga Vysotska, Alexandre Haag, Jens Behley, Cyrill Stachniss, and FKIE Fraunhofer. Overlapnet: Loop closing for lidar-based slam. In Proc. of Robotics: Science and Systems (RSS), 2020.
-  Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
-  Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
-  Zhaoxin Fan, Hongyan Liu, Jun He, Qi Sun, and Xiaoyong Du. Srnet: A 3d scene recognition network using static graph and dense semantic fusion. In Computer Graphics Forum, volume 39, pages 301–311. Wiley Online Library, 2020.
-  Eduardo Fernández-Moral, Walterio Mayol-Cuevas, Vicente Arevalo, and Javier Gonzalez-Jimenez. Fast place recognition with plane-based maps. In 2013 IEEE International Conference on Robotics and Automation, pages 2719–2724. IEEE, 2013.
-  Dorian Gálvez-López and Juan D Tardos. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics, 28(5):1188–1197, 2012.
-  Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2020.
-  Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. Pct: Point cloud transformer. arXiv preprint arXiv:2012.09688, 2020.
-  Fei Han, Xue Yang, Yiming Deng, Mark Rentschler, Dejun Yang, and Hao Zhang. Sral: Shared representative appearance learning for long-term visual place recognition. IEEE Robotics and Automation Letters, 2(2):1172–1179, 2017.
-  Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on visual transformer. arXiv preprint arXiv:2012.12556, 2020.
-  Stephen Hausler, Sourav Garg, Ming Xu, Michael Milford, and Tobias Fischer. Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition. arXiv preprint arXiv:2103.01486, 2021.
-  Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
-  Yifan Jiang, Shiyu Chang, and Zhangyang Wang. Transgan: Two transformers can make one strong gan. arXiv preprint arXiv:2102.07074, 2021.
-  Edward Johns and Guang-Zhong Yang. From images to scenes: Compressing an image cluster into a single scene model for place recognition. In 2011 International Conference on Computer Vision, pages 874–881. IEEE, 2011.
-  Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334, 2021.
-  Jacek Komorowski. Minkloc3d: Point cloud based large-scale place recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1790–1799, 2021.
-  Jesse Levinson, Jake Askeland, Jan Becker, Jennifer Dolson, David Held, Soeren Kammel, J Zico Kolter, Dirk Langer, Oliver Pink, Vaughan Pratt, et al. Towards fully autonomous driving: Systems and algorithms. In 2011 IEEE Intelligent Vehicles Symposium (IV), pages 163–168. IEEE, 2011.
-  Yunpeng Li, Noah Snavely, and Daniel P Huttenlocher. Location recognition using prioritized feature matching. In European conference on computer vision, pages 791–804. Springer, 2010.
-  Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
-  Zhe Liu, Shunbo Zhou, Chuanzhe Suo, Peng Yin, Wen Chen, Hesheng Wang, Haoang Li, and Yun-Hui Liu. Lpd-net: 3d point cloud learning for large-scale place recognition and environment analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2831–2840, 2019.
-  David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
-  Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The oxford robotcar dataset. The International Journal of Robotics Research, 36(1):3–15, 2017.
-  Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015.
-  Raul Mur-Artal and Juan D Tardós. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
-  Anish Pandey, Shalini Pandey, and DR Parhi. Mobile robot navigation and obstacle avoidance techniques: A review. Int Rob Auto J, 2(3):00022, 2017.
-  Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
-  Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
-  Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning cnn image retrieval with no human annotation. IEEE transactions on pattern analysis and machine intelligence, 41(7):1655–1668, 2018.
-  Abhijeet Ravankar, Ankit A Ravankar, Yukinori Kobayashi, Yohei Hoshino, and Chao-Chung Peng. Path smoothing techniques in robot navigation: State-of-the-art, current and future challenges. Sensors, 18(9):3170, 2018.
-  Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In 2011 International conference on computer vision, pages 2564–2571. Ieee, 2011.
-  Qi Sun, Hongyan Liu, Jun He, Zhaoxin Fan, and Xiaoyong Du. Dagc: Employing dual attention and graph convolution for point cloud based place recognition. In Proceedings of the 2020 International Conference on Multimedia Retrieval, pages 224–232, 2020.
-  Mikaela Angelina Uy and Gim Hee Lee. Pointnetvlad: Deep point cloud based retrieval for large-scale place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4470–4479, 2018.
-  Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
-  Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
-  Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677, 2020.
-  Yan Xia, Yusheng Xu, Shuang Li, Rui Wang, Juan Du, Daniel Cremers, and Uwe Stilla. Soe-net: A self-attention and orientation encoding network for point cloud based place recognition. arXiv preprint arXiv:2011.12430, 2020.
-  Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
-  Jun Yu, Chaoyang Zhu, Jian Zhang, Qingming Huang, and Dacheng Tao. Spatial pyramid-enhanced netvlad with weighted triplet loss for place recognition. IEEE transactions on neural networks and learning systems, 31(2):661–674, 2019.
-  Wenxiao Zhang and Chunxia Xiao. Pcan: 3d attention map learning using contextual information for point cloud based retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12436–12445, 2019.
-  Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, and Vladlen Koltun. Point transformer. arXiv preprint arXiv:2012.09164, 2020.