Attention Models for Point Clouds in Deep Learning: A Survey

02/22/2021 ∙ by Xu Wang, et al. ∙ Beijing Jiaotong University

Recently, the advancement of 3D point clouds in deep learning has attracted intensive research in application domains such as computer vision and robotics. However, creating robust, discriminative feature representations from unordered and irregular point clouds is challenging. In this paper, our goal is to provide a comprehensive overview of point cloud feature representations that use attention models. More than 75 key contributions from the past three years are summarized in this survey, covering 3D object detection, 3D semantic segmentation, 3D pose estimation, point cloud completion, and other tasks. We provide a detailed characterization of (1) the role of attention mechanisms, (2) the usability of attention models in different tasks, and (3) the development trends of key technologies.




1 Introduction

Point clouds are an important data format that preserves the original geometric information in 3D space without any discretization. Meanwhile, deep learning has been widely and successfully applied to various tasks. It is therefore natural that more and more research aims to adapt deep learning to 3D point clouds, in areas such as computer vision [Wang et al.2019] and robotics [Behl et al.2019]. However, the unordered and irregular structure of 3D point clouds remains a significant challenge for deep learning. Traditional point cloud representation methods include bird's-eye view (BEV) [Yang et al.2018], multi-view [Yang and Wang2019], and 3D voxels [Maturana and Scherer2015]. The main problems of these methods are the fast growth of point set size [Hu et al.2020] and the loss of geometric information [Qin et al.2019]. To alleviate these problems, attention mechanisms are introduced to make neural networks focus on the important parts of the input data, helping to simplify point clouds and capture sufficient feature representations [Chaudhari et al.2019]. Thus, in this paper, we aim to provide a brief yet comprehensive survey of attention models for point clouds in deep learning.

There have been a few domain-specific surveys published [Nguyen et al.2018, Lee et al.2019a, Chaudhari et al.2019, Wu et al.2020, Liu et al.2019a]. Compared with existing surveys, the contributions of our work can be summarized as:

  1. To the best of our knowledge, this is the first survey to focus fully on attention models for point cloud tasks in deep learning, including computer vision, robotics and miscellaneous applications.

  2. This paper comprehensively covers recent and advanced progress in attention models for point clouds. It therefore allows readers to learn about state-of-the-art attention mechanisms from different perspectives.

The structure of this paper is as follows. We start with a short overview of attention mechanisms in section 2. We then present and discuss the attention mechanisms used in different tasks in sections 3 to 5. In section 6, we present development trends for future technology. Finally, we conclude the paper in section 7.

2 Overview

The human brain can focus on just a few salient regions with limited data, instead of an entire scene [Marblestone et al.2016]. Attention-based feature extraction is applied to the salient regions to acquire high-level feature representations, which makes learning more efficient. Inspired by this prior knowledge, the attention mechanism was first introduced in deep learning to help researchers choose important data for their tasks [Bahdanau et al.2015]. With continued research, attention mechanisms have achieved great success in natural language processing [Galassi et al.2020], computer vision [Wang and Tax2016], and robotics [Ferreira and Dias2014]. Next, we describe the two main types of attention mechanism used in the cited papers.

2.1 Sequence

Self-attention learns, for every token in an input sequence, which tokens in the same sequence are relevant to it [Chaudhari et al.2019]. Self-attention is sometimes called inner-attention. An example of self-attention is [Yang et al.2019].
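To make the definition concrete, the following is a minimal sketch of one common instantiation, scaled dot-product self-attention, applied to a small set of point features. It is an illustration under stated assumptions rather than the method of any cited paper: the learned query, key and value projections are omitted (identity projections), and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over one sequence.

    tokens: list of n feature vectors (lists of d floats).
    Assumption: queries, keys and values are the tokens themselves,
    i.e. the learned projection matrices are left out for brevity.
    Returns (outputs, weights); weights[i][j] is how much token i
    attends to token j of the same sequence.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in tokens
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * tok[i] for wj, tok in zip(w, tokens)) for i in range(d)]
        )
    return outputs, weights


if __name__ == "__main__":
    points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    out, w = self_attention(points)
    print(out)
    print(w)
```

Because every token attends to every other token, the cost grows quadratically with the number of points.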

Co-attention, on the other hand, processes multiple input sequences simultaneously and jointly learns their attentive feature weights to capture interactions between these inputs [Chaudhari et al.2019]. An example of co-attention is [You et al.2018].
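As an illustration of attending across two inputs, the sketch below builds a shared affinity matrix between two sequences and normalizes it in both directions. This is one simple way to realize co-attention, assumed here for illustration; the dot-product affinity and the function names are hypothetical and are not taken from the cited works.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its attention weight."""
    d = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]


def co_attention(seq_a, seq_b):
    """Jointly attend between two sequences with the same feature size.

    Assumption: the affinity between two elements is a plain dot product.
    Returns (attended_a, attended_b): for each element of seq_a a summary
    of seq_b, and for each element of seq_b a summary of seq_a.
    """
    affinity = [
        [sum(x * y for x, y in zip(a, b)) for b in seq_b] for a in seq_a
    ]
    # Each row is normalized over seq_b, each column over seq_a.
    a_over_b = [softmax(row) for row in affinity]
    b_over_a = [
        softmax([affinity[i][j] for i in range(len(seq_a))])
        for j in range(len(seq_b))
    ]
    attended_a = [weighted_sum(w, seq_b) for w in a_over_b]
    attended_b = [weighted_sum(w, seq_a) for w in b_over_a]
    return attended_a, attended_b


if __name__ == "__main__":
    point_feats = [[1.0, 0.0], [0.0, 1.0]]
    view_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(co_attention(point_feats, view_feats))
```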

2.2 Positions


Soft-attention is a differentiable deterministic process, so it can be trained with a backpropagation algorithm [Kingkan et al.2019]. The global-attention model is similar to the soft-attention model [Luong et al.2015].

Hard-attention, on the other hand, is a non-differentiable stochastic process and relies on a sampling-based method for training [Kingkan et al.2019]. The local-attention model is an intermediate between soft-attention and hard-attention [Luong et al.2015].
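The contrast between the two can be sketched in a few lines. In this hypothetical example, soft attention returns the expectation of the values under the attention distribution, which is a smooth function of the scores, whereas hard attention samples a single value, which is why it needs a sampling-based training method. The function names and the use of a softmax over raw scores are assumptions made for illustration.

```python
import math
import random


def attention_weights(scores):
    """Softmax turning raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(scores, values):
    """Deterministic: weighted average of all values."""
    w = attention_weights(scores)
    d = len(values[0])
    return [sum(wi * v[i] for wi, v in zip(w, values)) for i in range(d)]


def hard_attention(scores, values, rng):
    """Stochastic: sample one value according to the attention weights."""
    w = attention_weights(scores)
    index = rng.choices(range(len(values)), weights=w, k=1)[0]
    return values[index]


if __name__ == "__main__":
    scores = [0.0, 0.0]
    values = [[0.0, 2.0], [2.0, 0.0]]
    print(soft_attention(scores, values))
    print(hard_attention(scores, values, random.Random(0)))
```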

3 3D Computer Vision Applications

In this section, we review existing attention models for computer vision. We group the applications into subcategories, namely 3D recognition and retrieval, 3D detection, 3D segmentation, 3D classification and 3D registration.

3.1 3D Recognition and Retrieval

3.1.1 3D Recognition

3D object recognition is one of the most fundamental and intriguing problems in computer vision, with applications ranging from environment understanding to self-driving. Attention models are used to make the neural network focus on informative features and obtain a stronger representation. For gesture recognition, [Kingkan et al.2019] design an automatic feature extraction network using a soft-attention module. This work is based on the intuition that only particular points of body movement in an input point cloud are required for the network to classify the gestures.

[Li et al.2019b] propose a graph attention module that assigns different weights to different nodes by calculating their degree of relation in the local feature space. In their model, the attention module is applied four times in different layers to aggregate features and dynamically update node states, which yields a richer node feature representation. Similarly, [Xia et al.2020] apply a self-attention unit to better capture feature dependencies across long-range context. [Sun et al.2020] introduce a dual attention module (point-wise and channel-wise) that weighs the importance of points and features to enhance the feature representation.

Similar to the human visual system, attention mechanisms can be used for multimodal feature fusion. [You et al.2018] propose a point-view feature fusion method based on a soft-attention mask. [Lu et al.2020b] use global channel attention and spatial attention VLAD [Jégou et al.2010] to fuse point cloud and image features. [Zhao et al.2020a] present the MANet framework for high-precision 3D object recognition, which fuses point and view data. [Luo et al.2020] introduce an attention-embedding point-slice fusion strategy for a new shape representation.

3.1.2 3D Retrieval

To manage large-scale point cloud datasets, effective 3D shape retrieval algorithms are necessary. [Li et al.2020b] propose a multi-part attention network for 3D model retrieval and apply a novel self-attention module to explore the spatial relevance of local features. [Zhang and Xiao2019] apply a Point Contextual Attention network to identify the local features that contribute positively to the final global feature representation. [Lei et al.2019] report that view differences between features have no direct impact on retrieval performance. Their Representative-View Selection algorithm tends to choose only views that contribute to better performance. In other work, [Liu et al.2019b] develop a hierarchical self-attention to highlight informative elements at the point, scale and region levels. [Dovrat et al.2019] extend visual attention, focusing the subsequent task network on significant points. Experiments on various benchmark datasets show that these methods can effectively remove redundancy and result in an enhanced feature representation.

3.2 3D Detection

3D object detection is an important task in computer vision. However, point clouds are usually unordered, sparse and unevenly distributed, which heavily affects feature extraction and accurate object localization. [Paigwar et al.2019] extend the visual attention mechanism to multiple object detection. The attention module makes the network focus on smaller regions containing the objects of interest. [Wu and Ogai2020] exploit a self-attention mechanism to boost useful features and suppress useless ones. [Xie et al.2020] design two attention modules and a feature fusion module for 3D object detection that exploit contextual information at the patch, object and global scene levels. Similarly, [Liu et al.2020b] propose a Triple Attention module that jointly considers channel-wise, point-wise and voxel-wise attention. [Li et al.2020a] present an end-to-end geometric relation network architecture inspired by the self-attention mechanism. To address the boundary problem, [Wang et al.2020c] introduce an auxiliary corner attention module. Its key contribution is to force the network to focus on object boundaries.

3.3 3D Segmentation

3D semantic instance segmentation is a popular topic in computer vision. However, many challenges remain for 3D point cloud segmentation, such as large scenes and heterogeneous, anisotropic point distributions. [Qingyong et al.2019] combine Local Spatial Encoding and Attentive Pooling modules to automatically learn important local features. [Zhang et al.2020a] propose an Attention Adversarial Network based on adversarial learning. In the learning phase, the network can pay more attention to informative features of different regions. [Tu et al.2020] present an online attention-based spatial and temporal feature fusion method for high-precision, real-time semantic segmentation. Segmentation of 4D point clouds (3D point cloud videos) is a more challenging task, which requires capturing both spatial and temporal information. To address this, [Shi et al.2020] design a cross-frame global attention module. Instance segmentation aims to understand the geometric information of point clouds at both the semantic level and the instance level. For instance segmentation, [Liang et al.2019] introduce a graph neural network based on an attention mechanism that aggregates geometric and embedding information from neighbours. [Wen et al.2020a] model the relationships between neighbouring and central points with a learnable attention mechanism.
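The idea behind attentive pooling can be sketched as follows: instead of max- or mean-pooling the features of a point's neighbours, a score is computed for every neighbour and channel, normalized with a softmax over the neighbours, and used for a weighted sum. This is a simplified illustration in the spirit of attentive pooling, not the authors' implementation; the single linear scoring matrix `weight` stands in for a learned shared MLP and is a hypothetical parameter.

```python
import math


def attentive_pooling(neighbour_feats, weight):
    """Aggregate K neighbour features into one feature vector.

    neighbour_feats: K x d list of neighbour feature vectors.
    weight: d x d scoring matrix (assumed; stands in for a learned MLP).
    For each channel c, the scores of the K neighbours are normalized
    with a softmax and used to average that channel.
    """
    k = len(neighbour_feats)
    d = len(neighbour_feats[0])
    scores = [
        [sum(f[i] * weight[i][c] for i in range(d)) for c in range(d)]
        for f in neighbour_feats
    ]
    pooled = []
    for c in range(d):
        column = [scores[j][c] for j in range(k)]
        m = max(column)
        exps = [math.exp(s - m) for s in column]
        total = sum(exps)
        pooled.append(
            sum(e / total * neighbour_feats[j][c] for j, e in enumerate(exps))
        )
    return pooled


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0]]
    # With an all-zero scoring matrix every neighbour gets equal weight,
    # so the result reduces to mean pooling.
    print(attentive_pooling(feats, [[0.0, 0.0], [0.0, 0.0]]))
```

With trained scores, neighbours with informative features receive larger weights than under uniform averaging.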

3.4 3D Classification

3D classification is a critical task in computer vision and is widely used in autonomous vehicles and robotics. The weak representation ability of low-dimensional features and the presence of noisy points remain challenging. [Fuchs et al.2020] introduce the SE(3)-Transformer, a variant of the self-attention module that is equivariant to translations and rotations of the data. [Lee et al.2019b] present an attention-based network, Set Transformer, to model interactions among elements in point clouds. [Yang et al.2019] propose using attention layers to capture the relations between points. They also design a parameter-efficient Group Shuffle Attention to reduce the heavy computational cost of Multi-Head Attention. Airborne Laser Scanning (ALS) classification is a critical application of point clouds. [Shajahan et al.2019] design a view-based approach for roof classification that adds a self-attention network. [Bhattacharyya et al.2021] propose an elevation-attention module that pushes the network to take per-point elevation information into account for better ALS classification.

3.5 3D Registration

Point cloud registration is a key problem in computer vision; it aims to estimate the optimal rigid transformation between two or more point sets. However, point clouds have many properties that increase the complexity of this problem, such as local sparsity and noisy points. [Yew and Lee2018] add an attention layer to their network architecture that better identifies 3D local keypoints and descriptors for matching. Inspired by this work, [Lu et al.2019] develop a novel point weighting layer to learn the saliency of each point in an end-to-end framework. [Wang and Solomon2019] combine an attention-based module and a pointer generation layer to approximate combinatorial matching. [Lu et al.2020a] present an Attentive Point Aggregation module that generates keypoints by aggregating the positions and features of neighbouring points. This module also outputs an attentive feature map that helps to estimate the saliency uncertainty of each keypoint. [Qiao et al.] use self-attention and cross-attention to enhance structure information and correspondence information for feature aggregation.

4 Robotic Applications

In this section, we review a variety of attention mechanisms that can be applied to robotic tasks. We group the applications into 3D completion, 3D pose estimation and scene flow.

4.1 3D Completion

Point cloud completion is a challenging problem in robotics and computer vision. Point cloud shapes that are incomplete because of limited view angles or occlusion cannot be used directly in practical applications. [Zhang et al.2020c] use an attention module to reconstruct and refine the input point clouds; the generated points are more uniformly distributed, with fewer outliers and less noise.

[Wen et al.2020b] propose a Skip-Attention Network for point cloud completion. Their model extracts geometric information from local regions of incomplete point clouds to encode a complete shape representation at different resolutions. [Han et al.2020] design a Non-local Attention module that combines multi-resolution shape details and contributive local features for shape completion. [Zhang et al.2020b] add an Attention Unit to their multi-stage network. This unit assigns higher weights to the important points, which provide more valuable information for point cloud reconstruction.

4.2 3D Pose Estimation

3D pose estimation is widely applied in robotic tasks such as manipulation, grasping and navigation. The key challenge is to extract enough features from point clouds to estimate the pose in any environment [Yuan et al.2020]. [Yang et al.2020] present a 3D Spatial Attention Region Ensemble Network for real-time 3D hand pose estimation. With the help of a spatial attention mechanism, they extract sufficient local structure features of hand joints. 6D pose estimation, which covers 3D rotation and 3D translation, is another important branch of pose estimation. [Song et al.2020] propose a Point Attention module, with a geometric attention path and a channel attention path, to extract powerful features from point clouds. This module makes the neural network focus on useful geometric and channel information to create better feature representations. [Du et al.2020] introduce an attention predictor that effectively uses multi-level geometric information and channel-wise relations to generate a global descriptor. Unlike the single-input approaches above, multimodal inputs can provide additional feature information. [Yuan and Veltkamp2020] apply a graph attention network to effectively fuse color and depth features. Similarly, [Cheng et al.2019] use an attention mechanism to learn discriminative multimodal features from images and point clouds. The two works differ in their network architectures.

4.3 Scene Flow

Scene flow is the 3D displacement vector of each surface point between two consecutive frames [Wu et al.2019]. Scene flow estimation is a basis for numerous higher-level tasks, for example in robotics. It is noteworthy that each point in the first point cloud flows in only one direction to the second frame, so not all feature information is equally important. [Wang et al.2020a] propose a hierarchical attention learning network for scene flow estimation. The model includes two attention modules, a first attentive embedding and a second attentive embedding, which focus better on matched regions and features to find the right flow direction. Inspired by the above work, [Wang et al.2020b] present an attention cost volume structure to associate two point clouds and extract the embedded motion information. [Puy et al.2020] propose the FLOT module for scene flow estimation using optimal transport tools.
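The definition of scene flow can be written down directly when point correspondences are known. The sketch below assumes, purely for illustration, that point i in the first frame corresponds to point i in the second frame; real scans do not come with such correspondences, which is why the methods above must learn where each point should flow. The function name is hypothetical.

```python
def scene_flow(frame_t, frame_t1):
    """Per-point 3D displacement between two consecutive frames.

    Assumption: frame_t[i] and frame_t1[i] are the same surface point
    observed at times t and t+1 (known correspondences).
    """
    if len(frame_t) != len(frame_t1):
        raise ValueError("frames must contain the same number of points")
    return [
        tuple(b - a for a, b in zip(p, q)) for p, q in zip(frame_t, frame_t1)
    ]


if __name__ == "__main__":
    first = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    second = [(0.5, 0.0, 0.0), (1.0, 2.0, 1.0)]
    print(scene_flow(first, second))
```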

5 Miscellaneous Applications

In this section, we review attention models for applications not covered by the preceding two categories, namely 3D upsampling and 3D normal estimation.

5.1 3D Upsampling

Point clouds provide a flexible and scalable geometric representation suitable for a variety of applications, but their unordered and irregular structure must also be taken into account. To address this challenge, upsampling is used to obtain a dense and uniform point set from raw point clouds. [Li et al.2019a] propose a point cloud upsampling network, PU-GAN, to upsample points over patches on object surfaces. PU-GAN uses an adversarial network architecture to train a generator module that can produce rich and robust point distributions from the latent space. To avoid poor convergence, they introduce a self-attention unit to enhance the quality of feature integration. [Liu et al.2019b] present an unsupervised upsampling method, named L2G-AE, with a deep recurrent neural network. They leverage a hierarchical self-attention mechanism to help feature aggregation at three levels: point, scale and region. In contrast, [Liu et al.2020a] propose a self-supervised point cloud upsampling model, named SPU-Net, built on graph convolution. They integrate self-attention with graph convolution to simultaneously capture context information inside and among local regions. [Zhao et al.2020b] develop an upsampling and completion network called PUI-Net. Notably, they apply a channel attention mechanism to extract discriminative features from point clouds.

5.2 3D Normal Estimation

3D normal estimation is a fundamental task for many high-level applications, including 3D reconstruction, tracking, and rendering [Liu et al.2019c]. Traditional normal estimation methods, such as Principal Component Analysis, require manual tuning of hyper-parameters. Recent methods based on deep learning mainly aim at high-quality 3D normal estimation without manual parameter tuning.
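For reference, the traditional PCA approach can be sketched as follows: take the k nearest neighbours of a query point, compute their covariance matrix, and use the eigenvector with the smallest eigenvalue as the normal. The neighbourhood size `k` is exactly the kind of hyper-parameter that has to be tuned by hand. This is a minimal standard-library sketch with hypothetical function names; the smallest eigenvector is found by power iteration on a shifted matrix, whereas a practical implementation would more likely call a linear algebra library.

```python
import math


def covariance(points):
    """3x3 covariance matrix of a list of 3D points."""
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[i] - centroid[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / n
    return cov


def smallest_eigenvector(cov, iters=200):
    """Eigenvector of the smallest eigenvalue of a symmetric PSD 3x3 matrix.

    Power iteration on (trace * I - cov): its largest eigenvalue belongs
    to the eigenvector with the smallest eigenvalue of cov.
    """
    trace = cov[0][0] + cov[1][1] + cov[2][2]
    shifted = [
        [(trace if i == j else 0.0) - cov[i][j] for j in range(3)]
        for i in range(3)
    ]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(shifted[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def estimate_normal(points, query, k):
    """Unoriented normal at `query` from its k nearest neighbours."""
    neighbours = sorted(points, key=lambda p: math.dist(p, query))[:k]
    return smallest_eigenvector(covariance(neighbours))


if __name__ == "__main__":
    # A flat 3 x 3 patch in the z = 0 plane: the normal is along the z axis.
    plane = [(float(x), float(y), 0.0) for x in range(3) for y in range(3)]
    print(estimate_normal(plane, (1.0, 1.0, 0.0), k=9))
```

The sign of the returned vector is arbitrary, since PCA alone does not determine which side of the surface the normal points to.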

[Wang and Prisacariu2020] propose a temperature-adjusted multi-head self-attention module, TMHSA, which is combined with a deep neural network. The TMHSA softly fuses per-point weighted features from different aspects and outputs high-quality feature representations. [Matveev et al.2020] present an attention-based neural network model that improves neighborhood selection in point clouds and effectively incorporates geometric relations between the points. The use of a geometric attention module as a means of extracting a global feature representation is motivated by the fact that different quantities are defined locally.

6 Discussions

In this section, we discuss additional issues and highlight important challenges for future investigation.

  1. New applications. With the rapid development of 3D sensors and point cloud technologies, new application domains are steadily emerging, including autonomous driving, virtual reality and smart wearables. It will be necessary to explore other attention mechanisms suited to different applications. Indeed, attention mechanisms can be used in many applications beyond the aforementioned domains.

  2. Powerful attention model. Over the past several years, a large number of neural network architectures have been proposed, such as generative adversarial networks, graph neural networks and transformers. These architectures are important because they allow deep learning to handle many real-world cases. Moreover, with the rapid increase in the complexity of network architectures, it is difficult to extract sufficient feature representations without introducing more computational cost. Therefore, a lightweight attention model that provides powerful and meaningful feature representations for different network architectures would be an interesting direction for future investigation.

  3. Multimodal feature fusion. Modern application domains, such as autonomous driving, commonly rely on multiple sensors, e.g., RGB cameras, thermal, starlight, LiDAR and RADAR, to provide a more comprehensive understanding of the real-world environment. It is important to effectively capture all task-relevant information from multimodal data, especially when complex structures are involved. Therefore, an interesting direction for future study, including but not limited to the aforementioned point-view feature fusion methods, is attention-based techniques for effective and simplified multimodal feature extraction and fusion.

  4. Energy-efficiency balance. In recent years, capturing 3D point clouds has become easier, so the scale of point cloud datasets is gradually increasing. The majority of attention-based methods may have trouble scaling effectively to larger point sets. Furthermore, deep learning is increasingly being applied on handheld devices, where power consumption, computational cost and memory footprint will be the most significant obstacles. Therefore, it would be useful to explore ways of applying attention mechanisms not simply to boost model accuracy but also to balance accuracy against energy efficiency.

7 Conclusion

In this paper, we have provided a systematic review of state-of-the-art attention models for point clouds in deep learning. To the best of our knowledge, this is the first work of this kind. We group existing work into three intuitive categories: computer vision, robotics and miscellaneous applications. We also list several challenges and opportunities for future investigation in the field of attention-based 3D point cloud processing.


  • [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, 2015.
  • [Behl et al.2019] Aseem Behl, Despoina Paschalidou, Simon Donné, and Andreas Geiger. Pointflownet: Learning representations for rigid motion estimation from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7962–7971, 2019.
  • [Bhattacharyya et al.2021] Prarthana Bhattacharyya, Chengjie Huang, and Krzysztof Czarnecki. Self-attention based context-aware 3d object detection. arXiv preprint arXiv:2101.02672, 2021.
  • [Chaudhari et al.2019] Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, and Varun Mithal. An attentive survey of attention models. arXiv preprint arXiv:1904.02874, 2019.
  • [Cheng et al.2019] Yi Cheng, Hongyuan Zhu, Cihan Acar, Wei Jing, Yan Wu, Liyuan Li, Cheston Tan, and Joo-Hwee Lim. 6d pose estimation with correlation fusion. arXiv preprint arXiv:1909.12936, 2019.
  • [Dovrat et al.2019] Oren Dovrat, Itai Lang, and Shai Avidan. Learning to sample. CVPR, 2019.
  • [Du et al.2020] Juan Du, Rui Wang, and Daniel Cremers. Dh3d: Deep hierarchical 3d descriptors for robust large-scale 6dof relocalization. In European Conference on Computer Vision, pages 744–762. Springer, 2020.
  • [Ferreira and Dias2014] J. F. Ferreira and J. Dias. Attentional mechanisms for socially interactive robots–a survey. IEEE Transactions on Autonomous Mental Development, 6(2):110–125, 2014.
  • [Fuchs et al.2020] B. Fabian Fuchs, E. Daniel Worrall, Volker Fischer, and Max Welling. Se(3)-transformers: 3d roto-translation equivariant attention networks. NeurIPS, 2020.
  • [Galassi et al.2020] A. Galassi, M. Lippi, and P. Torroni. Attention in natural language processing. IEEE Transactions on Neural Networks and Learning Systems, pages 1–18, 2020.
  • [Han et al.2020] Zhizhong Han, Baorui Ma, Yu-Shen Liu, and Matthias Zwicker. Reconstructing 3d shapes from multiple sketches using direct shape optimization. IEEE Transactions on Image Processing, 29:8721–8734, 2020.
  • [Hu et al.2020] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, and Andrew Markham. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11108–11117, 2020.
  • [Jégou et al.2010] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3304–3311. IEEE, 2010.
  • [Kingkan et al.2019] Cherdsak Kingkan, Joshua Owoyemi, and Koichi Hashimoto. Point attention network for gesture recognition using point cloud data. In 29th British Machine Vision Conference, BMVC 2018, 2019.
  • [Lee et al.2019a] John Boaz Lee, Ryan A Rossi, Sungchul Kim, Nesreen K Ahmed, and Eunyee Koh. Attention models in graphs: A survey. ACM Transactions on Knowledge Discovery from Data (TKDD), 13(6):1–25, 2019.
  • [Lee et al.2019b] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pages 3744–3753. PMLR, 2019.
  • [Lei et al.2019] Yinjie Lei, Ziqin Zhou, Pingping Zhang, Yulan Guo, Zijun Ma, and Lingqiao Liu. Deep point-to-subspace metric learning for sketch-based 3d shape retrieval. Pattern Recognition, 96:106981, 2019.
  • [Li et al.2019a] Ruihui Li, Xianzhi Li, Chi-Wing Fu, Daniel Cohen-Or, and Pheng-Ann Heng. Pu-gan: a point cloud upsampling adversarial network. In Proceedings of the IEEE International Conference on Computer Vision, pages 7203–7212, 2019.
  • [Li et al.2019b] Zongmin Li, Jun Zhang, Guanlin Li, Yujie Liu, and Siyuan Li. Graph attention neural networks for point cloud recognition. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 387–392. IEEE, 2019.
  • [Li et al.2020a] Ying Li, Lingfei Ma, Weikai Tan, Chen Sun, Dongpu Cao, and Jonathan Li. Grnet: Geometric relation network for 3d object detection from point clouds. ISPRS Journal of Photogrammetry and Remote Sensing, 165:43–53, 2020.
  • [Li et al.2020b] Zirui Li, Junyu Xu, Yue Zhao, Wenhui Li, and Weizhi Nie. Mpan: Multi-part attention network for point cloud based 3d shape retrieval. IEEE Access, 8:157322–157332, 2020.
  • [Liang et al.2019] Zhidong Liang, Ming Yang, and Chunxiang Wang. 3d graph embedding learning with a structure-aware loss function for point cloud semantic instance segmentation. arXiv: Computer Vision and Pattern Recognition, 2019.
  • [Liu et al.2019a] Weiping Liu, Jia Sun, Wanyi Li, Ting Hu, and Peng Wang. Deep learning on point clouds and its application: A survey. Sensors, 19(19):4188, 2019.
  • [Liu et al.2019b] Xinhai Liu, Zhizhong Han, Xin Wen, Yu-Shen Liu, and Matthias Zwicker. L2g auto-encoder: Understanding point clouds by local-to-global reconstruction with hierarchical self-attention. In Proceedings of the 27th ACM International Conference on Multimedia, pages 989–997, 2019.
  • [Liu et al.2019c] Yongcheng Liu, Bin Fan, Gaofeng Meng, Jiwen Lu, Shiming Xiang, and Chunhong Pan. Densepoint: Learning densely contextual representation for efficient point cloud processing. In Proceedings of the IEEE International Conference on Computer Vision, pages 5239–5248, 2019.
  • [Liu et al.2020a] Xinhai Liu, Xinchen Liu, Zhizhong Han, and Yu-Shen Liu. Spu-net: Self-supervised point cloud upsampling by coarse-to-fine reconstruction with self-projection optimization. arXiv preprint arXiv:2012.04439, 2020.
  • [Liu et al.2020b] Zhe Liu, Xin Zhao, Tengteng Huang, hu ruolan, Yu Zhou, and Xiang Bai. Tanet: Robust 3d object detection from point clouds with triple attention. AAAI, pages 11677–11684, 2020.
  • [Lu et al.2019] Weixin Lu, Guowei Wan, Yao Zhou, Xiangyu Fu, Pengfei Yuan, and Shiyu Song. Deepvcp: An end-to-end deep neural network for point cloud registration. In Proceedings of the IEEE International Conference on Computer Vision, pages 12–21, 2019.
  • [Lu et al.2020a] Fan Lu, Guang Chen, Yinlong Liu, Zhongnan Qu, and Alois Knoll. Rskdd-net: Random sample-based keypoint detector and descriptor. arXiv preprint arXiv:2010.12394, 2020.
  • [Lu et al.2020b] Yuheng Lu, Fan Yang, Fangping Chen, and Don Xie. Pic-net: Point cloud and image collaboration network for large-scale place recognition. arXiv preprint arXiv:2008.00658, 2020.
  • [Luo et al.2020] Zhipeng Luo, Di Liu, Jonathan Li, Yiping Chen, Zhenlong Xiao, José Marcato Junior, Wesley Nunes Gonçalves, and Cheng Wang. Learning sequential slice representation with an attention-embedding network for 3d shape recognition and retrieval in mls point clouds. ISPRS Journal of Photogrammetry and Remote Sensing, 161:147–163, 2020.
  • [Luong et al.2015] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
  • [Marblestone et al.2016] Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration of deep learning and neuroscience. Frontiers in computational neuroscience, 10:94, 2016.
  • [Maturana and Scherer2015] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
  • [Matveev et al.2020] Albert Matveev, Alexey Artemov, Denis Zorin, and Evgeny Burnaev. Geometric attention for prediction of differential properties in 3d point clouds. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition, pages 113–124. Springer, 2020.
  • [Nguyen et al.2018] Tam V Nguyen, Qi Zhao, and Shuicheng Yan. Attentive systems: A survey. International Journal of Computer Vision, 126(1):86–110, 2018.
  • [Paigwar et al.2019] Anshul Paigwar, Ozgur Erkent, Christian Wolf, and Christian Laugier. Attentional pointnet for 3d-object detection in point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • [Puy et al.2020] Gilles Puy, Alexandre Boulch, and Renaud Marlet. Flot: Scene flow on point clouds guided by optimal transport. arXiv preprint arXiv:2007.11142, 2020.
  • [Qiao et al.] Zhijian Qiao, Zhe Liu, Chuanzhe Suo, Huanshu Wei, Zhuowen Shen, and Hesheng Wang. End-to-end 3d point cloud learning for registration task using virtual correspondences.
  • [Qin et al.2019] Can Qin, Haoxuan You, Lichen Wang, C-C Jay Kuo, and Yun Fu. Pointdan: A multi-scale 3d domain adaption network for point cloud representation. In Advances in Neural Information Processing Systems, pages 7192–7203, 2019.
  • [Qingyong et al.2019] Hu Qingyong, Yang Bo, Xie Linhai, Rosa Stefano, Guo Yulan, Wang Zhihua, Trigoni Niki, and Markham Andrew. Randla-net: Efficient semantic segmentation of large-scale point clouds. CVPR, pages 11105–11114, 2019.
  • [Shajahan et al.2019] Dimple A Shajahan, Vaibhav Nayel, and Ramanathan Muthuganapathy. Roof classification from 3-d lidar point clouds using multiview cnn with self-attention. IEEE Geoscience and Remote Sensing Letters, 2019.
  • [Shi et al.2020] Hanyu Shi, Guosheng Lin, Hao Wang, Tzu-Yi Hung, and Zhenhua Wang. Spsequencenet: Semantic segmentation network on 4d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4574–4583, 2020.
  • [Song et al.2020] Myoungha Song, Jeongho Lee, and Donghwan Kim. Pam: Point-wise attention module for 6d object pose estimation. arXiv preprint arXiv:2008.05242, 2020.
  • [Sun et al.2020] Qi Sun, Hongyan Liu, Jun He, Zhaoxin Fan, and Xiaoyong Du. Dagc: Employing dual attention and graph convolution for point cloud based place recognition. In Proceedings of the 2020 International Conference on Multimedia Retrieval, pages 224–232, 2020.
  • [Tu et al.2020] Xinyuan Tu, Jian Zhang, Runhao Luo, Kai Wang, Qingji Zeng, Yu Zhou, Yao Yu, and Sidan Du. Reconstruction of high-precision semantic map. Sensors, 20(21):6264, 2020.
  • [Wang and Prisacariu2020] Zirui Wang and Victor Adrian Prisacariu. Neighbourhood-insensitive point cloud normal estimation network. arXiv preprint arXiv:2008.09965, 2020.
  • [Wang and Solomon2019] Yue Wang and Justin M Solomon. Deep closest point: Learning representations for point cloud registration. In Proceedings of the IEEE International Conference on Computer Vision, pages 3523–3532, 2019.
  • [Wang and Tax2016] Feng Wang and David MJ Tax. Survey on the attention based rnn model and its applications in computer vision. arXiv preprint arXiv:1601.06823, 2016.
  • [Wang et al.2019] Kaiqi Wang, Ke Chen, and Kui Jia. Deep cascade generation on point sets. In IJCAI, volume 2019, page 4, 2019.
  • [Wang et al.2020a] Guangming Wang, Xinrui Wu, Zhe Liu, and Hesheng Wang. Hierarchical attention learning of scene flow in 3d point clouds. arXiv preprint arXiv:2010.05762, 2020.
  • [Wang et al.2020b] Guangming Wang, Xinrui Wu, Zhe Liu, and Hesheng Wang. Pwclo-net: Deep lidar odometry in 3d point clouds using hierarchical embedding mask optimization. arXiv preprint arXiv:2012.00972, 2020.
  • [Wang et al.2020c] Guojun Wang, Bin Tian, Yunfeng Ai, Tong Xu, Long Chen, and Dongpu Cao. Centernet3d: An anchor free object detector for autonomous driving. arXiv preprint arXiv:2007.07214, 2020.
  • [Wen et al.2020a] Xin Wen, Zhizhong Han, Geunhyuk Youk, and Yu-Shen Liu. Cf-sis: Semantic-instance segmentation of 3d point clouds by context fusion with self-attention. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1661–1669, 2020.
  • [Wen et al.2020b] Xin Wen, Tianyang Li, Zhizhong Han, and Yu-Shen Liu. Point cloud completion by skip-attention network with hierarchical folding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1939–1948, 2020.
  • [Wu and Ogai2020] Yutian Wu and Harutoshi Ogai. Realtime single-shot refinement neural network for 3d object detection from lidar point cloud. In 2020 59th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE), pages 332–337. IEEE, 2020.
  • [Wu et al.2019] Wenxuan Wu, Zhiyuan Wang, Zhuwen Li, Wei Liu, and Li Fuxin. Pointpwc-net: A coarse-to-fine network for supervised and self-supervised scene flow estimation on 3d point clouds. arXiv preprint arXiv:1911.12408, 2019.
  • [Wu et al.2020] Yutian Wu, Yueyu Wang, Shuwei Zhang, and Harutoshi Ogai. Deep 3d object detection networks using lidar data: A review. IEEE Sensors Journal, 21(2):1152–1171, 2020.
  • [Xia et al.2020] Yan Xia, Yusheng Xu, Shuang Li, Rui Wang, Juan Du, Daniel Cremers, and Uwe Stilla. Soe-net: A self-attention and orientation encoding network for point cloud based place recognition. arXiv preprint arXiv:2011.12430, 2020.
  • [Xie et al.2020] Qian Xie, Yu-Kun Lai, Jing Wu, Zhoutao Wang, Yiming Zhang, Kai Xu, and Jun Wang. Mlcvnet: Multi-level context votenet for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10447–10456, 2020.
  • [Yang and Wang2019] Ze Yang and Liwei Wang. Learning relationships for multi-view 3d object recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 7505–7514, 2019.
  • [Yang et al.2018] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.
  • [Yang et al.2019] Jiancheng Yang, Qiang Zhang, Bingbing Ni, Linguo Li, Jinxian Liu, Mengdie Zhou, and Qi Tian. Modeling point clouds with self-attention and gumbel subset sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3323–3332, 2019.
  • [Yang et al.2020] Jian Yang, Xu Jiang, and Xiaohong Ma. 3dsenet: 3d spatial attention region ensemble network for real-time 3d hand pose estimation. In 2020 10th International Conference on Information Science and Technology (ICIST), pages 96–104. IEEE, 2020.
  • [Yew and Lee2018] Zi Jian Yew and Gim Hee Lee. 3dfeat-net: Weakly supervised local 3d features for point cloud registration. In European Conference on Computer Vision, pages 630–646. Springer, 2018.
  • [You et al.2018] Haoxuan You, Yifan Feng, Rongrong Ji, and Yue Gao. Pvnet: A joint convolutional network of point cloud and multi-view for 3d shape recognition. In Proceedings of the 26th ACM International Conference on Multimedia, pages 1310–1318, 2018.
  • [Yuan and Veltkamp2020] Honglin Yuan and Remco C Veltkamp. 6d object pose estimation with color/geometry attention fusion. In 2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV), pages 529–535. IEEE, 2020.
  • [Yuan et al.2020] Honglin Yuan, Remco C Veltkamp, Georgios Albanis, Nikolaos Zioulis, Dimitrios Zarpalas, and Petros Daras. Shrec 2020 track: 6d object pose estimation. arXiv preprint arXiv:2010.09355, 2020.
  • [Zhang and Xiao2019] Wenxiao Zhang and Chunxia Xiao. Pcan: 3d attention map learning using contextual information for point cloud based retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12436–12445, 2019.
  • [Zhang et al.2020a] Gege Zhang, Qinghua Ma, Licheng Jiao, Fang Liu, and Qigong Sun. Attan: Attention adversarial networks for 3d point cloud semantic segmentation. In IJCAI, pages 789–796, 2020.
  • [Zhang et al.2020b] Wenxiao Zhang, Chengjiang Long, Qingan Yan, Alix LH Chow, and Chunxia Xiao. Multi-stage point completion network with critical set supervision. Computer Aided Geometric Design, 82:101925, 2020.
  • [Zhang et al.2020c] Wenxiao Zhang, Qingan Yan, and Chunxia Xiao. Detail preserved point cloud completion via separated feature aggregation. In European Conference on Computer Vision, pages 512–528, 2020.
  • [Zhao et al.2020a] Yaxin Zhao, Jichao Jiao, and Tangkun Zhang. Manet: Multimodal attention network based point-view fusion for 3d shape recognition. arXiv preprint arXiv:2002.12573, 2020.
  • [Zhao et al.2020b] Yifan Zhao, Jin Xie, Jianjun Qian, and Jian Yang. Pui-net: A point cloud upsampling and inpainting network. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 328–340. Springer, 2020.