1 Introduction
We live in a 3D world. Geometric data have attracted increasing research attention thanks to the popularity of 3D sensors, e.g., LiDAR and RGB-D cameras. In particular, we are interested in analyzing 3D point clouds with end-to-end deep learning, which theoretically requires the neural networks to consume 1) size-varying and 2) permutation-invariant sets. PointNet [30] and DeepSets [51] pioneered directly processing point sets. Several studies push this research direction forward by proposing either structural [19, 32, 42] or componential [23, 34] improvements. We argue that the relations between points are critical to representing a point cloud: a single point is non-informative without the other points in the same set; in other words, it is characterized by its relations to the other points. Inspired by recent advances in the NLP domain [39, 10], we introduce Point Attention Transformers (PATs), based on self-attention to model the relations with the powerful multi-head design [39]. Drawing on ideas from lightweight, high-performance model design, we propose a parameter-efficient Group Shuffle Attention (GSA) to replace the costly Multi-Head Attention [39], with superior performance.
Besides, prior studies [32, 23] demonstrate the effectiveness of hierarchical structures in point cloud reasoning. By sampling central subsets of the input points and grouping them with graph-based operations at multiple levels, the hierarchical structures mimic receptive fields in CNNs with bottom-up representation learning. Despite this great success, we find that the sampling operation is a bottleneck of the hierarchical structures.
Few prior works study sampling from high-dimensional embeddings. The most popular sampling operation on 3D point clouds is Furthest Point Sampling (FPS). However, it is task-dependent, i.e., designed exclusively for low-dimensional Euclidean space, without sufficiently utilizing the semantically high-level representations. Moreover, as illustrated in Figure 1, FPS is permutation-variant and sensitive to outliers in point clouds.
To this end, we propose a task-agnostic and permutation-invariant sampling operation, named Gumbel Subset Sampling (GSS), to address the set sampling problem. Importantly, it is end-to-end trainable. To our knowledge, ours is the first study to propose a differentiable subset sampling method. Equipped with Gumbel-Softmax [16, 26], our GSS samples soft virtual points in the training phase and produces hard selections in the test phase via annealing. With GSS, our PAT classification models perform better at lower computation cost.
2 Preliminaries
2.1 Deep Learning on Point Clouds
CNNs (especially 3D CNNs [31, 54]) dominate early-stage research on deep learning for 3D vision, where the point clouds are rendered into 2D multi-view images [37] or 3D voxels [31]. These methods require compute-intensive pre-rendering of the sparse points into voluminous representations with quantization artifacts [30]. To improve memory efficiency and running speed, several researchers [41, 36] introduce sparse CNNs on specific data structures.
On the other hand, deep learning directly on Euclidean-space point clouds has attracted research attention. By design, these networks should be able to process 1) size-varying and 2) permutation-invariant (or permutation-equivariant) point sets (called the Theoretical Conditions for simplicity). PointNet [30] and DeepSet [51] pioneer this direction, where a symmetric function (e.g., a shared FC before max-pooling) is used to learn each point's high-level representation before aggregation; however, the relations between points are not sufficiently captured in this way. To this end, PointNet++ [32] introduces a hierarchical structure based on the Euclidean-space nearest-neighbor graph, Kd-Net [19] designs spatial KD-trees for efficient information aggregation, and DGCNN [42] develops a graph neural network (GNN) approach with dynamic graph construction. Not all studies satisfy both Theoretical Conditions at the same time; for instance, Kd-Net [19] resamples the input points to evade the "size-varying" condition, and PointCNN [23] groups and processes the points via specific operators without the "permutation-invariant" condition.
2.2 Self-Attention
An attention mechanism [4, 47, 11, 49, 48] uses input-dependent weights to linearly combine the inputs. Mathematically, given an input X and a query Q to attend to the input X, the output of the attention layer is
Attn(Q, X) = S(Q, X) X, (1)
where S is a matrix function producing the attention weights A = S(Q, X). The common choices of S are additive, dot-product and general attention [25]. A self-attention simply lets the inputs attend to every input element themselves, i.e., Attn(X, X). As the attention layer is a single linear combination once the attention weights are produced, we call this form the vanilla self-attention.
An attention transformer [39] is a (fully) attentional model with state-of-the-art performance on neural machine translation and other NLP tasks. Importantly, it introduces Multi-Head Attention (MHA) to aggregate the inputs multiple times with different linear transformations. For a self-attention version¹,
MHA(X) = [Attn(X W_1, X W_1), ..., Attn(X W_h, X W_h)], (2)
¹Note that the projection weights for the query, key and value of Multi-Head Attention are shared in our derivation.
where h is the number of heads, and W_i is the projection weight of head i. Position-wise MLPs with nonlinearity are connected to the attention layers. Equipped with different attention weights, MHA introduces stronger capacity in a single layer than the vanilla self-attention.
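As a concrete illustration, the shared-projection multi-head self-attention described above can be sketched in NumPy (an illustrative sketch with arbitrary shapes and head count, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Ws):
    """MHA with shared Q/K/V projections per head: each head projects X
    once with W_i, attends to itself with scaled dot-product weights,
    and the head outputs are concatenated."""
    heads = []
    for W in Ws:                                   # W: (c, c_head)
        H = X @ W                                  # shared projection
        A = softmax(H @ H.T / np.sqrt(H.shape[1]))  # (N, N) attention weights
        heads.append(A @ H)                        # linear combination
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # 5 points, 8 channels
Ws = [rng.normal(size=(8, 4)) for _ in range(2)]   # 2 heads
Y = multi_head_self_attention(X, Ws)
print(Y.shape)  # (5, 8)
```

Note the operation is permutation-equivariant: permuting the rows of X permutes the rows of the output accordingly.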
2.3 Discrete Reparameterization
Variational Auto-Encoders (VAEs) [18] introduce an elegant reparameterization trick to enable continuous stochastic variables to backpropagate through neural network computation graphs. However, discrete stochastic variables are nontrivial to reparameterize. To this end, several stochastic gradient estimation methods have been proposed, e.g., REINFORCE-based methods [38, 33] and Straight-Through Estimators [5].
For a categorical distribution p = (p_1, ..., p_K), where K denotes the number of categories and p_i (with Σ_i p_i = 1) is the probability of category i, a Gumbel-Softmax [16, 26] is designed as a discrete reparameterization trick to estimate smooth gradients with a continuous relaxation of the categorical variable. Given i.i.d. Gumbel noise g = (g_1, ..., g_K) drawn from the Gumbel(0, 1) distribution, a soft categorical sample y can be drawn (or computed) by
y = softmax((log p + g) / τ). (3)
Eq. 3 is referred to as the gumbel_softmax operation on p. The parameter τ is the annealing temperature; as τ → 0, y degenerates into the Gumbel-Max form,
y_hard = one_hot(argmax_i (log p_i + g_i)), (4)
which is an unbiased sample from the distribution p.
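The two sampling forms above (the soft relaxation of Eq. 3 and its hard limit, Eq. 4) can be sketched as follows; the temperature values are illustrative:

```python
import numpy as np

def gumbel_softmax(log_p, tau, rng):
    """Draw a relaxed categorical sample (Eq. 3): softmax((log p + g) / tau)
    with i.i.d. g ~ Gumbel(0, 1)."""
    g = -np.log(-np.log(rng.uniform(size=log_p.shape)))
    z = (log_p + g) / tau
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

p = np.array([0.1, 0.6, 0.3])
soft = gumbel_softmax(np.log(p), tau=1.0, rng=np.random.default_rng(0))
# With the same noise and tau -> 0, the sample approaches one-hot (Eq. 4).
hard = gumbel_softmax(np.log(p), tau=0.01, rng=np.random.default_rng(0))
print(soft.round(3), hard.round(3))
```

Both samples live on the probability simplex; only the low-temperature one is (numerically) one-hot.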
In this way, we are able to draw differentiable samples (Eq. 3) from the distribution p in the training phase. In practice, τ starts at a high value (e.g., 1.0) and anneals to a small value (e.g., 0.1). Optimization over the Gumbel-Softmax distribution can be interpreted as solving a certain entropy-regularized linear program on the probability simplex [27]. In the test phase, discrete samples can be drawn with the Gumbel-Max trick (Eq. 4).
3 Point Attention Transformers
3.1 Overview
We describe our model in a top-down manner. As illustrated in Figure 2, we define N as the number of points, and c as the feature dimension apart from the xyz dimensions, e.g., c = 3 for RGB point clouds. An input 3D point cloud is first embedded into higher-level representations by an Absolute and Relative Position Embedding (ARPE) module (Section 3.2), in which each point is represented by its nearest neighbors' relative positions together with its own absolute position. We then use Group Shuffle Attention (GSA) (Section 3.3) blocks to mine relations between elements of the feature set, and the representation of each element becomes semantically stronger via the layer-by-layer attentional transformations.
For classification, we define M as the number of target classes; the final output assigns a single label to the input out of the M classes. Inspired by several prior studies with hierarchical structures [32, 23], we also adopt a downsampling structure (Section 3.5). After every GSA operation, we sample a subset (i.e., downsampling) for subsequent processing. The sampling operation can be either Furthest Point Sampling (FPS) or the proposed Gumbel Subset Sampling (GSS) from Section 3.4. The points remaining after the last downsampling are separately connected to shared MLPs before global average pooling for the classification output. For training, a cross-entropy loss is computed over every MLP output before averaging, referred to as the Element-wise Loss trick (Section 3.5).
For segmentation, the output assigns a label to every point. As the GSA operation adaptively aggregates global information into every local point, the downsampling structure, which introduces information loss, is not necessary for segmentation. A segmentation PAT is thus simply a stack of GSA layers connected to the ARPE module, followed by a shared MLP on each point for point-wise segmentation.
We describe the submodules in the following sections.
3.2 Absolute and Relative Position Embedding
We first consider how to represent a point cloud. For a single point p_i, we argue that its absolute position is informative but not rich enough; the point is also represented by the remaining points' relative positions (to p_i) in the same point cloud. Combining both, we obtain an Absolute and Relative Position Embedding (ARPE) module.
Given an input point cloud X = {p_1, ..., p_N}, for a point p_i, its position set is defined as
P_i = {[p_i, p_j − p_i] : j = 1, ..., N, j ≠ i}. (5)
A shared PointNet [30] is applied on the position set of each point, i.e.,
e_i = g(MAX_{f ∈ P_i} h(f)), (6)
where h and g are both MLPs with group normalization [43] and ELU activation [8]. Note that Eq. 6 is easy to parallelize over all points.
With O(N²) complexity, it is too costly to use the position set with all points in Eq. 6. Instead, only the top k nearest neighbors are considered ("Nearest-neighbor Graph" in Figure 2). However, sparsity and the number of input points are coupled; in other words, the top 32 neighbors among 256 points and those among 1,024 points differ greatly in scale. To make the ARPE module more robust to varying point numbers, we introduce a dilated sampling technique [23], i.e., the position set is constructed by sampling k points from the top k × d nearest neighbors, where the dilation rate d grows with the number of input points N from a base dilation rate defined on a reference number of points. If not specified, k = 32 with the base dilation rate defined on 1,024 points.
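A minimal sketch of the dilated neighbor construction, assuming brute-force distances and uniform sampling from the dilated neighborhood (the exact sampling scheme and hyperparameters here are assumptions for illustration):

```python
import numpy as np

def dilated_position_sets(points, k, dilation, rng=None):
    """For each point, uniformly sample k neighbors from its top k*dilation
    nearest neighbors, then build the [absolute position, relative position]
    features of Eq. 5 restricted to those neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    N = points.shape[0]
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (N, N)
    order = np.argsort(d2, axis=1)[:, 1:1 + k * dilation]          # drop self
    feats = []
    for i in range(N):
        idx = rng.choice(order[i], size=k, replace=False)          # dilated sampling
        rel = points[idx] - points[i]                              # relative positions
        absolute = np.repeat(points[i][None], k, axis=0)           # absolute position
        feats.append(np.concatenate([absolute, rel], axis=1))      # (k, 6)
    return np.stack(feats)                                          # (N, k, 6)

pts = np.random.default_rng(1).normal(size=(64, 3))
F = dilated_position_sets(pts, k=8, dilation=2)
print(F.shape)  # (64, 8, 6)
```

Each per-point feature set would then be consumed by the shared PointNet of Eq. 6.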
3.3 Group Shuffle Attention
We propose to use attention layers to capture the relations between points. MHA is successful at modeling relations thanks to its critical multi-head design [10, 40, 28]; however, we argue that it is voluminous for modeling point clouds. To this end, we propose a parameter-efficient Group Shuffle Attention (GSA) to replace MHA. There are two improvements over MHA:
Firstly, to get rid of position-wise MLPs, we integrate the nonlinearity into the attention module, named nonlinear self-attention,
Attn_σ(X) = S(X, X) σ(X), (7)
where we use Scaled Dot-Product attention [39] for S, i.e., S(Q, X) = softmax(QX^T / √c), and ELU activation [8] for σ. In other words, we use the "pre-activation" to attend to the "post-activation".
Secondly, we introduce compact group linear transformations [45, 7] with channel shuffle [53, 52], keeping the multi-head design. Let g be the number of groups, with the channel number c divisible by g; we split X by channels into g groups, X = [X^(1), ..., X^(g)], and define W_i as a learnable transformation weight for group i. A Group Attention (GroupAttn) is then defined as
GroupAttn(X) = [Attn_σ(X^(1) W_1), ..., Attn_σ(X^(g) W_g)]. (8)
However, a pure stack of GroupAttn layers blocks the information flow between groups. To enable efficient layer-by-layer transformations, we introduce a parameter-free channel shuffle [53] operator ψ; see Figure 3 (a) for an illustration.
For an element x ∈ R^c, we rewrite it as
x = [x^(1), ..., x^(g)], x^(i) = (x^(i)_1, ..., x^(i)_{c/g}), (9)
where x^(i) is the i-th group of channels. In this way, we define the channel shuffle as
ψ(x) = (x^(1)_1, x^(2)_1, ..., x^(g)_1, x^(1)_2, ..., x^(g)_{c/g}). (10)
In any modern deep learning framework, channel shuffle can be elegantly implemented end-to-end by "reshape - transpose - flatten".
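As a sketch, the "reshape - transpose - flatten" implementation in NumPy:

```python
import numpy as np

def channel_shuffle(x, groups):
    """'reshape - transpose - flatten': channel j of group i moves so that
    channels from different groups interleave, letting subsequent group
    transformations mix information across groups."""
    N, c = x.shape
    return x.reshape(N, groups, c // groups).transpose(0, 2, 1).reshape(N, c)

x = np.arange(8)[None, :]          # one element, 8 channels, 2 groups
print(channel_shuffle(x, 2))       # [[0 4 1 5 2 6 3 7]]
```

Shuffling with g groups is inverted by shuffling the result with c/g groups, which confirms the operator is a fixed permutation of channels.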
A Group Shuffle Attention (GSA) is simply a Group Attention followed by the channel shuffle, together with a residual connection [12] and group normalization (GN) [43]:
GSA(X) = GN(ψ(GroupAttn(X)) + X). (11)
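Putting the pieces above together, a single GSA layer can be sketched as below; the per-row standardization at the end is a simplified stand-in for group normalization, and all shapes are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1)

def group_shuffle_attention(X, Ws):
    """Sketch of one GSA layer: per-group nonlinear self-attention
    (pre-activation attends to post-activation), channel shuffle,
    residual connection, then a simple per-element normalization."""
    g = len(Ws)
    N, c = X.shape
    outs = []
    for Xi, Wi in zip(np.split(X, g, axis=1), Ws):
        H = Xi @ Wi                                   # group transformation
        S = softmax(H @ H.T / np.sqrt(H.shape[1]))    # scaled dot-product weights
        outs.append(S @ elu(H))                       # attend to post-activation
    Y = np.concatenate(outs, axis=1)
    Y = Y.reshape(N, g, c // g).transpose(0, 2, 1).reshape(N, c)  # channel shuffle
    Y = Y + X                                         # residual connection
    return (Y - Y.mean(axis=1, keepdims=True)) / (Y.std(axis=1, keepdims=True) + 1e-5)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Ws = [rng.normal(size=(4, 4)) for _ in range(2)]      # g = 2 groups
Y = group_shuffle_attention(X, Ws)
print(Y.shape)  # (6, 8)
```

Every step acts either point-wise or via attention weights that permute consistently with the input, which is the intuition behind the permutation-equivariance claim below.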
The following proposition theoretically guarantees the permutation-equivariance of GSA.
Proposition 1.
The Group Shuffle Attention operation is permutation-equivariant, i.e., given input X and a permutation matrix P of size N, GSA(PX) = P GSA(X).
Proof is provided in Appendix A.
3.4 Gumbel Subset Sampling
Although Furthest Point Sampling (FPS) is widely used in point cloud reasoning, it has several defects: 1) its sampling result depends on the initial point, i.e., it is not permutation-invariant; 2) it is task-dependent (designed for low-dimensional Euclidean space); 3) it is sensitive to outliers. To overcome these issues, we argue that an ideal sampling operation should be:

Permutation-invariant: the selected subset is consistent regardless of any permutation of the inputs;

Sampling from a high-dimensional embedding space: the sampling operation should be task-agnostic and less sensitive to outliers, by learning representative and robust embeddings;

Differentiable: this enables the sampling operation to be integrated into neural networks painlessly.
For these purposes, we develop a permutation-invariant, task-agnostic and differentiable Gumbel Subset Sampling (GSS). Given an input set X, which could be the output of a neural network layer, the goal is to select a representative subset with differentiable operations. We draw inspiration from Attention-based MIL pooling [15], where the pooling output is an average weighted by normalized scores produced element-wise, i.e.,
z = softmax(w X^T) X. (12)
Note that w is a learnable weight and could be replaced with an MLP.
We reinterpret Attention-based MIL pooling (Eq. 12) as competitively selecting one soft virtual point. Though differentiable, the virtual point is untraceable and less interpretable, especially when selecting multiple points. Instead, we use a hard, discrete selection with the end-to-end trainable gumbel_softmax (Eq. 3):
y = gumbel_softmax(w X^T) X. (13)
In the training phase, this provides smooth gradients via the discrete reparameterization trick. With annealing, it degenerates into a hard selection in the test phase.
A Gumbel Subset Sampling (GSS) is simply a multiple-point version of Eq. 13, defining a distribution over subsets,
GSS(X) = gumbel_softmax(W X^T) X, (14)
where gumbel_softmax is applied row-wise and each row of the learnable weight W selects one point.
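A sketch of GSS under these definitions, with a random linear map standing in for the learned scores; at a low temperature each selection row collapses onto a single input point:

```python
import numpy as np

def gumbel_subset_sampling(X, W, tau, rng):
    """Sketch of GSS (Eq. 14): each row of the score matrix W X^T goes
    through a gumbel_softmax over the N points; at high tau the output is
    a soft mixture of points, at low tau it collapses onto single points."""
    scores = (X @ W).T                                # (m, N) selection scores
    g = -np.log(-np.log(rng.uniform(size=scores.shape)))
    z = (scores + g) / tau
    z = z - z.max(axis=1, keepdims=True)              # numerical stability
    A = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return A @ X                                      # (m, c) sampled subset

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))                          # 16 points, 4 channels
W = rng.normal(size=(4, 6))                           # selects m = 6 points
Y_soft = gumbel_subset_sampling(X, W, tau=1.0, rng=np.random.default_rng(1))
Y_hard = gumbel_subset_sampling(X, W, tau=1e-4, rng=np.random.default_rng(1))
print(Y_soft.shape, Y_hard.shape)  # (6, 4) (6, 4)
```

With the same noise, the low-temperature output matches the Gumbel-Max selection of Eq. 4 applied row by row.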
The following proposition theoretically guarantees the permutation-invariance of GSS.
Proposition 2.
The Gumbel Subset Sampling operation is permutation-invariant, i.e., given input X and a permutation matrix P of size N, GSS(PX) = GSS(X).
Proof is provided in Appendix B.
3.5 Other Architecture Design
Downsampling Structure
In our classification models, we downsample the input points at 3 levels (from 1,024 points to 384, 128 and 64 points). Although GSS is theoretically superior to FPS, the Gumbel noises also serve as a (too) strong regularization. Instead of using GSS at all downsampling levels, we find that replacing the first downsampling with FPS performs slightly better in our experiments.
Element-wise Loss
We compute the classification loss as in segmentation [23]: a shared MLP is connected to each remaining point to predict the same target class, where the MLP is a stack of "FC - ELU - dropout [35]" layers. The final loss is the average of the element-wise cross entropies. The element-wise loss trick does not bring a performance boost, but training converges significantly faster. At inference, the final classification score is the average of the element-wise outputs.
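The element-wise loss reduces to averaging per-point cross entropies against a single shared label, e.g.:

```python
import numpy as np

def elementwise_classification_loss(logits, label):
    """Cross entropy computed per remaining point against the same target
    class, then averaged (the element-wise loss trick). At inference the
    class scores are averaged over points instead."""
    z = logits - logits.max(axis=1, keepdims=True)            # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[:, label].mean()

logits = np.array([[2.0, 0.5, 0.1],
                   [1.5, 1.0, 0.2]])   # 2 remaining points, 3 classes
loss = elementwise_classification_loss(logits, 0)
print(round(loss, 4))  # 0.4738
```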
4 Applications
In this section, we first demonstrate the effectiveness and efficiency of PATs on a point cloud classification benchmark, the ModelNet40 dataset [44] of CAD models. We then explore model performance on real-world datasets. We report segmentation results on the S3DIS dataset [2]. Furthermore, we propose a novel application of recognizing gestures with an event camera on the DVS128 Gesture Dataset [1]. To our knowledge, this is the first study to process event-camera streams as spatio-temporal point clouds, achieving state-of-the-art performance.
4.1 ModelNet40 Shape Classification
Dataset
We evaluate our classification model on the ModelNet40 [44] dataset of 40-category CAD models. The official split with 9,840 samples for training and 2,468 samples for testing is used in our experiments. We use the same preprocessed dataset as PointNet++ [32].
Method  Points  Accuracy (%) 

DeepSets [51]  5,000  90.0 
PointNet [30]  1,024  89.2 
KdNet [19]  1,024  90.6 
PointNet++ [32]  1,024  90.7 
KCNet [34]  1,024  91.0 
DGCNN [42]  1,024  92.2 
PointCNN [23]  1,024  92.2 
PAT (GSA only)  1,024  91.3 
PAT (GSA only)  256  90.9 
PAT (FPS)  1,024  91.4 
PAT (FPS + GSS)  1,024  91.7 
Experiment Setting
Classification PATs use ARPE to produce 1,024-dimensional embeddings, subsequently fed into 3 GSA layers with hidden size 1,024, followed by a shared MLP with 1,024 - 512 - 256 hidden sizes for a 40-category element-wise cross-entropy loss (Section 3.5). Several variants of PATs are considered in our experiments: "PAT (GSA only)" uses no downsampling; "PAT (FPS)" uses FPS downsampling after each GSA, with an FPS(384) - FPS(128) - FPS(64) downsampling structure; and "PAT (FPS + GSS)" uses GSS for all downsampling except the first, i.e., FPS(384) - GSS(128) - GSS(64).
Performance and Model Complexity
Classification performance on the test set is summarized in Table 1 together with recent state-of-the-art methods. Our PATs (including all variants) achieve comparable results on ModelNet40. Interestingly, the PAT using only 256 points (for training and test) outperforms the models prior to PointNet++ [32] that use 1,024 points.
We also evaluate the model complexity in terms of model size and forward time in Table 2. The forward time is recorded with a batch size of 8 on a single GTX 1080 GPU, the same hardware environment as the comparison models. As illustrated, our models achieve competitive performance with great parameter efficiency and acceptable speed. Due to the insufficient support for grouped linear layers in PyTorch (0.4.1) [29], there is still room for speed improvement through low-level implementation optimization. Note that the PATs with downsampling achieve better performance at even lower computation cost, and GSS further improves over FPS with a negligible burden.
4.2 S3DIS Indoor Scene Segmentation
Dataset
We evaluate our PAT segmentation models on a real-world point cloud semantic segmentation dataset, the Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) [2]. This dataset contains 3D RGB point clouds of 6 indoor areas, including 272 rooms in total. Each point belongs to one of 13 semantic categories (e.g., ceiling, floor, clutter).
Experiment Setting
We follow the same setting as the prior study [23], where each room is split into blocks of area 1.5 m × 1.5 m, and each point is represented as a 6D vector (XYZ, RGB). 2,048 points are sampled from each block during training, and all points are used block-wise at test time. We use 6-fold cross validation over the 6 areas, with 5 areas for training and 1 area for validation each time. As there are overlaps between areas except for Area 5 [20], we report the metrics on Area 5 separately.
Segmentation PATs use ARPE modules to produce 1,024-dimensional embeddings, followed by five 1,024-dimensional GSA layers. No downsampling is used. A shared MLP with the same structure as that in our classification PATs (Section 4.1) is used for 13-category segmentation. The Adam optimizer [17] is used to train the cross-entropy loss with a batch size of 16. The learning rate is initialized to 0.0001, then halved every 5 epochs. Training converges within 20 epochs.
Performance
Evaluation performance on all-area cross validation (AREAS) and on Area 5 is reported in Table 3. Our segmentation PAT achieves the best trade-off between segmentation performance and parameter efficiency. On Area 5, it outperforms all comparison models; on AREAS, our method achieves performance superior to all comparison models except PointCNN [23] in terms of mIoU, with a significantly smaller model size.
To further analyze the performance of PointCNN and our method, we compare per-class IoU and mean per-class accuracy (mAcc) on AREAS and Area 5. As depicted in Table 4, on AREAS, our method outperforms PointCNN in terms of mAcc; on Area 5, our method outperforms PointCNN in terms of both mIoU and mAcc, with superior per-class IoUs on the majority of classes.
Method  mIoU  mIoU on Area 5  Size (MB) 

RSNet [13]  56.47     
SPGraph [20]  62.1  58.04   
PointNet [30]  47.71  47.6  4.7 
DGCNN [42]  56.1    6.9 
PointCNN [23]  65.39  57.26  46.2 
PAT  64.28  60.07  6.1 
3D semantic segmentation results on S3DIS. Mean per-class IoU (mIoU, %) is used as the evaluation metric. Model sizes are obtained using the official codes.
Class  ceiling  floor  wall  beam  column  window  door  table  chair  sofa  bookcase  board  clutter  mIoU  mAcc  

AREAS  PointCNN  94.78  97.30  75.82  63.25  51.71  58.38  57.18  71.63  69.12  39.08  61.15  52.19  58.59  65.39  75.61 
PAT  93.01  98.36  73.54  58.51  38.87  77.41  67.74  62.70  67.30  30.63  59.60  66.61  41.39  64.28  76.45  
Area 5  PointCNN  92.31  98.24  79.41  0.00  17.6  22.77  62.09  74.39  80.59  31.67  66.67  62.05  56.74  57.26  63.86 
PAT  93.04  98.51  72.28  1.00  41.52  85.05  38.22  57.66  83.64  48.12  67.00  61.28  33.64  60.07  70.83 
4.3 Event Camera Streams as Point Clouds: DVS128 Gesture Recognition
Motivation and Dataset
Point cloud approaches are primarily designed for 3D spatial sensors, e.g., LiDAR and Matterport 3D cameras. However, there are numerous potential applications with point-based records. In this section, we explore a novel application to event cameras with point cloud approaches.
The Dynamic Vision Sensor (DVS) [24] is a biologically inspired event camera, which "transmits data only when a pixel detects a change" [1]. On the 128×128 sensor matrix, it records whether there is a change (exceeding a user-defined threshold) at the corresponding position, with microsecond resolution. In particular, we explore gesture recognition on the DVS128 Gesture Dataset [1], with 11 classes of gestures (10 named gestures, e.g., "Arm Roll", and 1 "others") collected from 122 users. The training data are collected from 94 users with 939 event streams, and the test data are collected from 24 users with 249 event streams. A gesture action record is a sequence of points, each of which is a pixel change represented as a 4-dimensional vector: abscissa x, ordinate y, timestamp t, and polarity (1 for appear and 0 for disappear). In this way, we regard the event stream as a spatio-temporal point cloud. See Figure 4 for an illustration.
Experiment Setting
We use a sliding-window approach to obtain training and test samples. Each sample covers a window of length 750 ms, sliding with a step size of 100 ms; see Figure 5 for a demonstration. After this preprocessing, there are 61,544 clips for training and 10,256 clips for testing. The step size can also be regarded as the maximum decision latency for a real-time application. For a stream containing multiple clips, its system-level prediction is the mode of all predicted labels. The System Accuracy based on the system-level prediction is used for evaluation [1].
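The system-level decision rule is simply the mode over clip predictions:

```python
from collections import Counter

def system_level_prediction(clip_predictions):
    """System-level label for an event stream: the mode of its per-clip
    predicted labels, as used by the System Accuracy metric."""
    return Counter(clip_predictions).most_common(1)[0][0]

print(system_level_prediction([3, 3, 7, 3, 1]))  # 3
```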
Classification PATs with the same structure as for ModelNet40 (Section 4.1) are used in this experiment. An SGD optimizer with a constant learning rate of 0.001 is used to train the PATs, which converge within 60 epochs.
Performance
There are few studies that apply point cloud approaches to event-camera streams. For a fair comparison, we implement a vanilla PointNet [30] and a PointNet++ [32] for this experiment. If not specified, 1,024 points sampled from the clips are used for training and evaluation.
As depicted in Table 5, all point cloud approaches run within the maximum decision latency (100 ms). We achieve state-of-the-art results on this dataset, with strong parameter efficiency. Interestingly, PAT (GSA only) with 256 points performs similarly to that with 1,024 points. We attribute this to the outliers and sparsity of the events. Note that the baseline CNN [1] runs on a low-power processor with a maximum decision latency of 105 ms. Our results indicate the potential of replacing CNNs with PATs, combined with general network quantization techniques [14].
Model  10-class  11-class  Time (ms)  Size (MB) 
CNN [1]  96.5  94.4     
PointNet [30]  89.5  88.8  2.8  6.5 
PointNet++ [32]  95.6  95.2  18.2  12 
PAT (GSA only)  96.9  95.6  16.9  5 
PAT (GSA only, N256)  96.9  95.6  7.5  5 
PAT (FPS)  96.5  95.2  12.7  5 
PAT (FPS + GSS)  97.4  96.0  13.1  5.8 
5 Ablation Study
In this section, we analyze the effectiveness of the components and tricks on the ModelNet40 dataset. As GSS has been proven effective in Section 4.1, we analyze the remaining components on PATs without downsampling. All experiments in the ablation study are conducted with 256 points.
Element-wise Loss. As depicted in Figure 6 (a), training converges faster with the element-wise loss. However, there is no performance boost in evaluation on the test set.
Number of Groups. As shown in Figure 6 (b), grouping is critical to the performance since it is coupled with the multi-head design in attention. Without grouping (g = 1), the model accuracy drops significantly, despite an even larger model size. With 8 groups, the model achieves the best trade-off between accuracy and model size.
Channel Shuffle vs. No Shuffle. To enable information to flow across groups, channel shuffle is critical to GSA (CS "On" vs. "Off" in Table 6); it is parameter-free and introduces negligible computation cost.
Embedding Layer. The ARPE module is shown to be very effective at boosting performance ("MLP" vs. "ARPE" in Table 6). It provides an improvement of approximately 0.8% consistently on 256 or 1,024 points.
GSA vs. MHA. We design two MHA counterparts for comparison: 1) MHA LG, with the same hidden size as GSA, and 2) MHA SM, with the hidden size tuned so that the model size is comparable to GSA. As depicted in Table 6, our PATs with GSA show superior performance in terms of both parameter efficiency and accuracy.
Group Norm vs. Layer Norm. We also consider the layer normalization [3] used in the original Multi-Head Attention [39]. As group normalization (GN) was proposed as an extension of layer normalization (LN), in our experiments ("GN" vs. "LN" in Table 6) the former (GN) consistently outperforms the latter.
Embedding  Attention  CS  Norm  Accuracy 
ARPE  GSA  On  90.9  
ARPE  GSA  Off  88.8  
MLP  GSA  On  90.1  
ARPE  MHA SM  On  89.3  
ARPE  MHA LG  On  89.9  
ARPE  GSA  On  89.9  
MLP  GSA  On  90.0  
6 Conclusion
We develop Point Attention Transformers (PATs) for point cloud reasoning. A parameter-efficient Group Shuffle Attention (GSA) is proposed to learn the relations between points. Besides, we design an end-to-end learnable and task-agnostic sampling operation, named Gumbel Subset Sampling (GSS). Results on several benchmarks demonstrate the effectiveness and efficiency of our methods. In the future, it would be interesting to apply GSS to general sets, e.g., to explore both effectiveness and interpretability in hierarchical multiple instance learning.
Acknowledgment This work was supported by the National Science Foundation of China (U1611461, 61521062). This work was partly supported by STCSM (18DZ1112300, 18DZ2270700). This work was also partially supported by a joint research grant of SJTU-BIGO LIVE, a joint research grant of SJTU-Minivision, and China's Thousand Talent Program.
References
 [1] Arnon Amir, Brian Taba, David J Berg, Timothy Melano, Jeffrey L McKinstry, Carmelo Di Nolfo, Tapan K Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In CVPR, pages 7388–7397, 2017.
 [2] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In CVPR, pages 1534–1543, 2016.
 [3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 [4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
 [5] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 [6] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 [7] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1251–1258, 2017.
 [8] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). In ICLR, 2016.
 [9] Taco S Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical cnns. In ICLR, 2018.
 [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 [11] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. Convolutional sequence to sequence learning. In ICML, 2017.
 [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [13] Qiangui Huang, Weiyue Wang, and Ulrich Neumann. Recurrent slice networks for 3d segmentation of point clouds. In CVPR, pages 2626–2635, 2018.
 [14] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In NIPS, pages 4107–4115, 2016.
 [15] Maximilian Ilse, Jakub M Tomczak, and Max Welling. Attention-based deep multiple instance learning. In ICML, 2018.
 [16] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.
 [17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [18] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
 [19] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In ICCV, pages 863–872, 2017.
 [20] Loïc Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR, pages 4558–4567, 2018.
 [21] Truc Le and Ye Duan. Pointgrid: A deep network for 3d shape understanding. In CVPR, pages 9204–9214, 2018.
 [22] Jiaxin Li, Ben M Chen, and Gim Hee Lee. So-net: Self-organizing network for point cloud analysis. In CVPR, pages 9397–9406, 2018.
 [23] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on xtransformed points. In NeurIPS, 2018.
 [24] Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 db 15 μs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2008.
 [25] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.
 [26] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
 [27] Gonzalo Mena, David Belanger, Scott Linderman, and Jasper Snoek. Learning latent permutations with gumbel-sinkhorn networks. In ICLR, 2018.
 [28] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander Ku. Image transformer. In ICML, 2018.
 [29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPSW, 2017.
 [30] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, pages 77–85, 2017.
 [31] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and multiview cnns for object classification on 3d data. In CVPR, pages 5648–5656, 2016.
 [32] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, pages 5105–5114, 2017.
 [33] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In NIPS, pages 3528–3536, 2015.
 [34] Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Mining point cloud local structures by kernel correlation and graph pooling. In CVPR, volume 4, 2018.

 [35] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [36] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. Splatnet: Sparse lattice networks for point cloud processing. In CVPR, pages 2530–2539, 2018.

 [37] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In CVPR, pages 945–953, 2015.
 [38] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063, 2000.
 [39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 6000–6010, 2017.
 [40] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
 [41] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics, 36(4):72, 2017.
 [42] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
 [43] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
 [44] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, pages 1912–1920, 2015.
 [45] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 5987–5995, 2017.
 [46] Saining Xie, Sainan Liu, Zeyu Chen, and Zhuowen Tu. Attentional shapecontextnet for point cloud recognition. In CVPR, pages 4606–4615, 2018.
 [47] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057, 2015.
 [48] Yichao Yan, Bingbing Ni, and Xiaokang Yang. Fine-grained recognition via attribute-guided attentive feature aggregation. In ACM MM, pages 1032–1040, 2017.
 [49] Yichao Yan, Bingbing Ni, and Xiaokang Yang. Predicting human interaction via relative attention model. In IJCAI, pages 3245–3251, 2017.
 [50] Jiancheng Yang, Qiang Zhang, Rongyao Fang, Bingbing Ni, Jinxian Liu, and Qi Tian. Adversarial attack and defense on point sets. arXiv preprint arXiv:1902.10899, 2019.
 [51] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In NIPS, pages 3391–3401, 2017.
 [52] Ting Zhang, Guo-Jun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolutions. In CVPR, 2017.
 [53] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, pages 6848–6856, 2018.
 [54] Wei Zhao, Jiancheng Yang, Yingli Sun, Cheng Li, Weilan Wu, Liang Jin, Zhiming Yang, Bingbing Ni, Pan Gao, Peijun Wang, et al. 3d deep learning from ct scans predicts tumor invasiveness of sub-centimeter pulmonary adenocarcinomas. Cancer Research, 78(24):6881–6889, 2018.
 [55] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In CVPR, 2017.
Appendix
Appendix A Proof of Permutation Equivariance of Group Shuffle Attention
Lemma 1 (Permutation matrix and permutation function).
$P_\sigma$, a permutation matrix of size $n$, acts on $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times c}$ as the permutation function $\sigma$:
$$P_\sigma X = [x_{\sigma(1)}, \ldots, x_{\sigma(n)}]^T. \quad (15)$$
Lemma 2.
Given $P_\sigma$, a permutation matrix of size $n$, and row-wise $\mathrm{softmax}$,
$$\mathrm{softmax}(P_\sigma A P_\sigma^T) = P_\sigma \, \mathrm{softmax}(A) \, P_\sigma^T. \quad (16)$$
Lemma 3.
Non-linear self-attention $\mathrm{Att}(X) = \mathrm{softmax}(XX^T)\,\rho(X)$, with $\rho$ an element-wise function, is permutation-equivariant.
Proof.
For a self-attention, $\mathrm{Att}(P_\sigma X) = \mathrm{softmax}(P_\sigma X X^T P_\sigma^T)\,\rho(P_\sigma X)$.
$\rho$ is an element-wise function, thus $\rho(P_\sigma X) = P_\sigma\,\rho(X)$. By Lemma 2,
$$\mathrm{Att}(P_\sigma X) = P_\sigma\,\mathrm{softmax}(XX^T)\,P_\sigma^T P_\sigma\,\rho(X) = P_\sigma\,\mathrm{Att}(X),$$
which implies non-linear self-attention is permutation-equivariant. ∎
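The equivariance argument above can be checked numerically. Below is a minimal NumPy sketch, assuming a toy self-attention of the form $\mathrm{softmax}(XX^T)\,\rho(X)$ with $\rho = \tanh$; both choices are illustrative assumptions, not the paper's exact operator.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable row-wise softmax."""
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Toy non-linear self-attention: softmax(X X^T) rho(X),
    with rho = tanh as an illustrative element-wise function."""
    return softmax(x @ x.T) @ np.tanh(x)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))   # n = 5 points, c = 4 channels
P = np.eye(5)[rng.permutation(5)] # random permutation matrix

# Equivariance: Att(P X) == P Att(X)
assert np.allclose(self_attention(P @ x), P @ self_attention(x))
```

Permuting the rows of the attention input permutes the rows of the output identically, matching Lemma 3.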
Proposition 1.
The Group Shuffle Attention operation is permutation-equivariant, i.e., given input $X \in \mathbb{R}^{n \times c}$ and permutation matrix $P_\sigma$ of size $n$,
$$\mathrm{GSA}(P_\sigma X) = P_\sigma\,\mathrm{GSA}(X).$$
Proof.
Beyond the self-attention of Lemma 3, GSA only introduces element-wise operations, which do not change the permutation-equivariance of $\mathrm{Att}(X)$. ∎
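The element-wise claim in the proof can also be checked: any operation that acts on each point's channels independently commutes with a permutation of the points. The sketch below uses a ShuffleNet-style channel shuffle as an illustrative stand-in for GSA's group-wise steps; it is an assumption for demonstration, not the paper's implementation.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels across groups (ShuffleNet-style). It touches
    only the channel axis, so it is element-wise w.r.t. the point axis."""
    n, c = x.shape
    return x.reshape(n, groups, c // groups).transpose(0, 2, 1).reshape(n, c)

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 8))   # n = 6 points, c = 8 channels
P = np.eye(6)[rng.permutation(6)]

# Channel-wise operations commute with point permutations:
assert np.allclose(channel_shuffle(P @ x, 4), P @ channel_shuffle(x, 4))
```

Composing such point-wise operations with a permutation-equivariant attention therefore preserves equivariance, as the proposition states.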
Appendix B Proof of Permutation Invariance of Gumbel Subset Sampling
Proposition 1.
The Gumbel Subset Sampling operation is permutation-invariant, i.e., given input $X \in \mathbb{R}^{n \times c}$ and permutation matrix $P_\sigma$ of size $n$,
$$\mathrm{GSS}(P_\sigma X) = \mathrm{GSS}(X).$$