SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks

06/18/2020 · by Fabian B. Fuchs, et al. · Bosch Center for Artificial Intelligence, University of Oxford, University of Amsterdam

We introduce the SE(3)-Transformer, a variant of the self-attention module for 3D point clouds which is equivariant under continuous 3D roto-translations. Equivariance is important to ensure stable and predictable performance in the presence of nuisance transformations of the input data. A positive corollary of equivariance is increased weight-tying within the model, leading to fewer trainable parameters and thus decreased sample complexity (i.e. we need less training data). The SE(3)-Transformer leverages the benefits of self-attention to operate on large point clouds with varying numbers of points, while guaranteeing SE(3)-equivariance for robustness. We evaluate our model on a toy N-body particle simulation dataset, showcasing the robustness of the predictions under rotations of the input. We further achieve competitive performance on two real-world datasets, ScanObjectNN and QM9. In all cases, our model outperforms a strong, non-equivariant attention baseline and an equivariant model without attention.


1 Introduction

Self-attention mechanisms (Vaswani et al., 2017) have enjoyed a sharp rise in popularity in the last few years. Their relative implementational simplicity, coupled with high efficacy on a wide range of tasks such as language modeling (Vaswani et al., 2017), image recognition (Parmar et al., 2019), and graph-based problems (Veličković et al., 2018), makes them an attractive component to use. However, this generality of application means that, for specific tasks, knowledge of existing underlying structure goes unused. In this paper, we propose the SE(3)-Transformer shown in Fig. 1, a self-attention mechanism specifically for 3D point cloud data, which adheres to equivariance constraints, improving robustness to nuisance transformations and general performance.

Point cloud data is ubiquitous across many fields, presenting itself in diverse forms such as 3D object scans (Uy et al., 2019), 3D molecular structures (Ramakrishnan et al., 2014), or N-body particle simulations (Kipf et al., 2018). Finding neural structures which can adapt to the varying number of points in an input, while respecting the irregular sampling of point positions, is challenging. Furthermore, an important property is that these structures should be invariant to global changes in overall input pose; that is, 3D translations and rotations of the input point cloud should not affect the output. In this paper, we find that the explicit imposition of equivariance constraints on the self-attention mechanism addresses these challenges. The SE(3)-Transformer uses the self-attention mechanism as a data-dependent filter particularly suited for sparse, non-voxelised point cloud data, while respecting and leveraging the symmetries of the task at hand.

Self-attention itself is a pseudo-linear map between sets of points. It can be seen to consist of two components: input-dependent attention weights and an embedding of the input, called a value embedding. In Fig. 1, we show an example of a molecular graph, where attached to every atom we see a value embedding vector and where the attention weights are represented as edges, with width corresponding to the attention weight magnitude. In the SE(3)-Transformer, we explicitly design the attention weights to be invariant to global pose. Furthermore, we design the value embedding to be equivariant to global pose. Equivariance generalises the translational weight-tying of convolutions. It ensures that transformations of a layer's input manifest as equivalent transformations of the output. SE(3)-equivariance in particular is the generalisation of translational weight-tying in 2D, known from conventional convolutions, to roto-translations in 3D. This restricts the space of learnable functions to a subspace which adheres to the symmetries of the task and thus reduces the number of learnable parameters. Meanwhile, it provides us with a richer form of invariance, since relative positional information between features in the input is preserved.


Figure 1: A) Each layer of the SE(3)-Transformer maps from a point cloud to a point cloud while guaranteeing equivariance. For classification, this is followed by an invariant pooling layer and an MLP. B) In each layer, for each node, attention is performed. Here, the red node attends to its neighbours. Attention weights (indicated by line thickness) are invariant w.r.t. rotation of the input.

Our contributions are the following:

  • We introduce a novel self-attention mechanism, guaranteed to be invariant to global rotations and translations of its input. It is also equivariant to permutations of the input point labels.

  • We show that the SE(3)-Transformer resolves an issue with concurrent SE(3)-equivariant neural networks, which suffer from angularly constrained filters.

  • We introduce a PyTorch implementation of spherical harmonics, which is 10x faster than SciPy on CPU and 100-1000x faster on GPU.

2 Background And Related Work

In this section we introduce the relevant background materials on self-attention, graph neural networks, and equivariance. We are concerned with point cloud based machine learning tasks, such as object classification or segmentation. In such a task, we are given a point cloud as input, represented as a collection of coordinate vectors $x_i \in \mathbb{R}^3$ with optional per-point features $f_i$.

2.1 The Attention Mechanism

The standard attention mechanism (Vaswani et al., 2017) can be thought of as consisting of three terms: a set of query vectors $q_i \in \mathbb{R}^p$ for $i = 1, \dots, m$, a set of key vectors $k_j \in \mathbb{R}^p$ for $j = 1, \dots, n$, and a set of value vectors $v_j \in \mathbb{R}^r$ for $j = 1, \dots, n$, where $p$ and $r$ are the dimensions of the low dimensional embeddings. We commonly interpret the key $k_j$ and the value $v_j$ as being 'attached' to the same point $j$. For a given query $q_i$, the attention mechanism can be written as

$\mathrm{Attn}(q_i, \{k_j\}, \{v_j\}) = \sum_{j=1}^{n} \alpha_{ij} v_j, \qquad \alpha_{ij} = \frac{\exp(q_i^\top k_j)}{\sum_{j'=1}^{n} \exp(q_i^\top k_{j'})} \qquad (1)

where we used a softmax as a nonlinearity acting on the weights. In general, the number of query vectors does not have to equal the number of input points (Lee et al., 2019). In the case of self-attention the query, key, and value vectors are embeddings of the input features, so

$q_i = h_Q(f_i), \qquad k_j = h_K(f_j), \qquad v_j = h_V(f_j) \qquad (2)$

where $h_Q$, $h_K$, and $h_V$ are, in the most general case, neural networks (van Steenkiste et al., 2018). For us, query $q_i$ is associated with a point $i$ in the input, which has a geometric location $x_i$. Thus if we have $n$ points, we have $n$ possible queries. For query $q_i$, we say that node $i$ attends to all other nodes $j \neq i$.
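As a concrete reference for Eqs. 1 and 2, the snippet below is a minimal sketch of single-head dot-product self-attention, assuming the embeddings $h_Q$, $h_K$, $h_V$ are plain linear maps; the function and variable names are illustrative and not taken from any released implementation.

```python
import torch

def self_attention(f, W_q, W_k, W_v):
    """Minimal single-head self-attention over a set of n feature vectors f: (n, d).

    W_q and W_k map features to a shared p-dimensional query/key space, W_v maps
    to the value space. Returns one output vector per input point (Eqs. 1 and 2).
    """
    q = f @ W_q                            # (n, p) queries, one per point
    k = f @ W_k                            # (n, p) keys
    v = f @ W_v                            # (n, d_v) values
    logits = q @ k.T                       # (n, n) unnormalised weights q_i . k_j
    alpha = torch.softmax(logits, dim=-1)  # softmax over the points being attended to
    return alpha @ v                       # (n, d_v) attention-weighted sum of values

# toy usage: 5 points with 8-dimensional features
n, d, p, d_v = 5, 8, 16, 8
f = torch.randn(n, d)
out = self_attention(f, torch.randn(d, p), torch.randn(d, p), torch.randn(d, d_v))
```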

Motivated by successes across a wide range of tasks in deep learning such as language modeling (Vaswani et al., 2017), image recognition (Parmar et al., 2019), graph-based problems (Veličković et al., 2018), and relational reasoning (van Steenkiste et al., 2018; Fuchs et al., 2020), a recent stream of work has applied forms of self-attention algorithms to point cloud data (Yang et al., 2019; Xie et al., 2018; Lee et al., 2019). One such example is the Set Transformer (Lee et al., 2019). When applied to object classification on ModelNet40 (Wu et al., 2015), the input to the Set Transformer is the cartesian coordinates of the points. Each layer embeds this positional information further while dynamically querying information from other points. The final per-point embeddings are downsampled and used for object classification.

Permutation equivariance A key property of self-attention is permutation equivariance: permutations of the point labels lead to the same permutations of the self-attention output. This guarantees that the attention output does not depend arbitrarily on input point ordering. Wagstaff et al. (2019) recently showed that this mechanism can theoretically approximate all permutation equivariant functions. The SE(3)-Transformer is a special case of this attention mechanism, inheriting permutation equivariance. However, it limits the space of learnable functions to rotation and translation equivariant ones.
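Permutation equivariance can be verified numerically; the sketch below reuses the illustrative self_attention function from the previous snippet and checks that shuffling the input points shuffles the outputs identically.

```python
import torch

# Shuffling the points must shuffle the outputs in exactly the same way.
torch.manual_seed(0)
n, d, p = 5, 8, 16
f = torch.randn(n, d)
W_q, W_k, W_v = torch.randn(d, p), torch.randn(d, p), torch.randn(d, d)
perm = torch.randperm(n)
out = self_attention(f, W_q, W_k, W_v)
out_perm = self_attention(f[perm], W_q, W_k, W_v)
assert torch.allclose(out_perm, out[perm], atol=1e-5)
```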

2.2 Graph Neural Networks

Attention scales quadratically with point cloud size, so it is useful to introduce neighbourhoods: instead of each point attending to all other points, it only attends to its nearest neighbours. Sets with neighbourhoods are naturally represented as graphs. Attention has previously been introduced on graphs under the names of intra-, self-, vertex-, or graph-attention (Lin et al., 2017; Vaswani et al., 2017; Veličković et al., 2018; Hoshen, 2017; Shaw et al., 2018). These methods were unified by Wang et al. (2017) with the non-local neural network. This has the simple form

$y_i = \frac{1}{C(\{f_j : j \in \mathcal{N}_i\})} \sum_{j \in \mathcal{N}_i} w(f_i, f_j)\, h(f_j) \qquad (3)$

where $w$ and $h$ are neural networks and $C$ normalises the sum as a function of all features in the neighbourhood $\mathcal{N}_i$. This has a similar structure to attention, and indeed we can see it as performing attention per neighbourhood. While non-local modules do not explicitly incorporate edge-features, it is possible to add them, as done in Veličković et al. (2018) and Hoshen (2017).
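A minimal sketch of Eq. 3 restricted to neighbourhoods is given below, assuming $w$ is an exponentiated dot product and $h$ a linear map; all names are illustrative.

```python
import torch

def nonlocal_layer(f, neighbours, w_net, h_net):
    """Neighbourhood-restricted non-local operation (Eq. 3).

    f:          (n, d) node features
    neighbours: list of index tensors; neighbours[i] are the nodes i attends to
    w_net:      maps a pair of feature vectors to a scalar weight
    h_net:      embeds a single feature vector
    """
    out = []
    for i in range(f.shape[0]):
        nbr = f[neighbours[i]]                             # (k_i, d) features in N_i
        w = torch.stack([w_net(f[i], fj) for fj in nbr])   # (k_i,) pairwise weights
        w = w / w.sum()                                    # normalisation C(...)
        out.append((w.unsqueeze(-1) * h_net(nbr)).sum(0))  # weighted sum over N_i
    return torch.stack(out)

# toy instantiation: attention-like weights and a linear embedding
d = 8
W_h = torch.randn(d, d)
w_net = lambda fi, fj: torch.exp(fi @ fj)
h_net = lambda x: x @ W_h
f = torch.randn(6, d)
neighbours = [torch.tensor([j for j in range(6) if j != i][:3]) for i in range(6)]
y = nonlocal_layer(f, neighbours, w_net, h_net)
```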

2.3 Equivariance

Given a set of transformations $T_g : \mathcal{V} \to \mathcal{V}$ for $g \in G$, where $G$ is an abstract group, a function $\phi : \mathcal{V} \to \mathcal{Y}$ is called equivariant if for every $g$ there exists a transformation $S_g : \mathcal{Y} \to \mathcal{Y}$ such that

$S_g[\phi(v)] = \phi(T_g[v]) \quad \text{for all } g \in G,\ v \in \mathcal{V} \qquad (4)$

The indices $g$ can be considered as parameters describing the transformation. Given a pair $(T_g, S_g)$, we can solve for the family of equivariant functions $\phi$ satisfying Equation 4. Furthermore, if $T_g$ and $S_g$ are linear and the map $\phi$ is also linear, then a very rich and developed theory already exists for finding $\phi$ (Cohen and Welling, 2017). In the equivariance literature, deep networks are built from interleaved linear maps $\phi$ and equivariant nonlinearities. In the case of 3D roto-translations it has already been shown that a suitable structure for $\phi$ is a tensor field network (Thomas et al., 2018), explained below. Note that Romero et al. (2020) recently introduced a 2D roto-translationally equivariant attention module for pixel-based image data.

Group Representations In general, the transformations $S_g$ and $T_g$ are called group representations. Formally, a group representation $\rho : G \to GL(N)$ is a map from a group $G$ to the set of $N \times N$ invertible matrices $GL(N)$. Critically, $\rho$ is a group homomorphism; that is, it satisfies $\rho(g_1 g_2) = \rho(g_1)\rho(g_2)$ for all $g_1, g_2 \in G$. Specifically for 3D rotations $G = SO(3)$, we have a few interesting properties: 1) its representations are orthogonal matrices, 2) all representations can be decomposed as

$\rho(g) = Q^\top \left[ \bigoplus_{\ell} D_\ell(g) \right] Q \qquad (5)$

where $Q$ is an orthogonal, $N \times N$, change-of-basis matrix (Chirikjian et al., 2001); each $D_\ell(g)$ for $\ell = 0, 1, 2, \dots$ is a $(2\ell+1) \times (2\ell+1)$ matrix known as a Wigner-D matrix (the 'D' stands for Darstellung, German for representation); and $\bigoplus$ is the direct sum or concatenation of matrices along the diagonal. The Wigner-D matrices are irreducible representations of SO(3); think of them as the 'smallest' representations possible. Vectors transforming according to $D_\ell$ (i.e. we set $Q = I$, $N = 2\ell + 1$) are called type-$\ell$ vectors. Type-0 vectors are invariant under rotations and type-1 vectors rotate according to 3D rotation matrices. Note, type-$\ell$ vectors have length $2\ell + 1$. They can be stacked, forming a feature vector transforming according to Eq. 5.
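The low-degree cases can be made concrete with a few lines of code: under a rotation, a type-0 scalar is untouched, a type-1 vector is multiplied by the 3x3 rotation matrix itself, and a stacked feature transforms by the block-diagonal direct sum of the two, as in Eq. 5 with $Q = I$. The snippet below is only an illustration of this transformation law.

```python
import torch

def random_rotation():
    # QR decomposition of a random matrix gives an orthogonal matrix; fix det = +1
    q, _ = torch.linalg.qr(torch.randn(3, 3))
    return q * torch.sign(torch.det(q))

R = random_rotation()
f0 = torch.randn(1)   # type-0 feature (length 2*0+1 = 1): rotation invariant
f1 = torch.randn(3)   # type-1 feature (length 2*1+1 = 3): rotates like a 3D vector

D = torch.block_diag(torch.eye(1), R)   # direct sum D_0 (+) D_1 acting on the stack
f_rot = D @ torch.cat([f0, f1])
assert torch.allclose(f_rot[:1], f0)        # scalar part unchanged
assert torch.allclose(f_rot[1:], R @ f1)    # vector part rotated by R
```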

Tensor Field Networks Tensor field networks (TFN) (Thomas et al., 2018) are neural networks which map point clouds to point clouds under the constraint of SE(3)-equivariance, where SE(3) is the group of 3D rotations and translations. For point clouds, the input is a vector field of the form

$f_{\mathrm{in}}(x) = \sum_{j=1}^{N} f_j\, \delta(x - x_j) \qquad (6)$

where $\delta$ is the Dirac delta function, the $\{x_j\}$ are the 3D point coordinates and the $\{f_j\}$ are point features, representing such quantities as atomic number or point identity. For equivariance to be satisfied, the features of a TFN transform under Eq. 5, where $Q = I$. Each $f_j$ is a concatenation of vectors of different types, where a subvector of type-$\ell$ is written $f_j^\ell$. A TFN layer computes the convolution of a continuous-in-space, learnable weight kernel $W^{\ell k} : \mathbb{R}^3 \to \mathbb{R}^{(2\ell+1) \times (2k+1)}$ from type-$k$ features to type-$\ell$ features. The type-$\ell$ output of the TFN layer at position $x_i$ is

$f^{\ell}_{\mathrm{out},i} = \sum_{k \geq 0} \sum_{j=1}^{N} W^{\ell k}(x_j - x_i)\, f^{k}_{\mathrm{in},j} \qquad (7)$

We can also include a sum over input channels, but we omit it here. Weiler et al. (2018), Thomas et al. (2018) and Kondor (2018) showed that the kernel $W^{\ell k}$ lies in the span of an equivariant basis $\{W^{\ell k}_J\}_{J=|k-\ell|}^{k+\ell}$. The kernel is a linear combination of these basis kernels, where the $J$th coefficient is a learnable function $\varphi^{\ell k}_J(\|x\|)$ of the radius $\|x\|$. Mathematically this is

$W^{\ell k}(x) = \sum_{J=|k-\ell|}^{k+\ell} \varphi^{\ell k}_J(\|x\|)\, W^{\ell k}_J(x), \qquad \text{where} \quad W^{\ell k}_J(x) = \sum_{m=-J}^{J} Y_{Jm}(x / \|x\|)\, Q^{\ell k}_{Jm} \qquad (8)$

Each basis kernel $W^{\ell k}_J : \mathbb{R}^3 \to \mathbb{R}^{(2\ell+1) \times (2k+1)}$ is formed by taking a linear combination of Clebsch-Gordan matrices $Q^{\ell k}_{Jm}$ of shape $(2\ell+1) \times (2k+1)$, where the $m$th linear combination coefficient is the $m$th dimension of the $J$th spherical harmonic $Y_J$. Each basis kernel completely constrains the form of the learned kernel in the angular direction, leaving the only learnable degree of freedom in the radial direction. Note that the angular dependence drops out only when $J = 0$ and $k = \ell$, which reduces the kernel to a scalar $w^{\ell\ell}$ multiplied by the identity, $w^{\ell\ell} I$, referred to as self-interaction (Thomas et al., 2018). As such we can rewrite the TFN layer as

$f^{\ell}_{\mathrm{out},i} = w^{\ell\ell} f^{\ell}_{\mathrm{in},i} + \sum_{k \geq 0} \sum_{j \neq i} W^{\ell k}(x_j - x_i)\, f^{k}_{\mathrm{in},j} \qquad (9)$

Eq. 7 and Eq. 9 present the convolution in message-passing form, where messages are aggregated from all nodes and feature types. They are also a form of nonlocal graph operation as in Eq. 3, where the weights are functions on edges and the features are node features. We will later see how our proposed attention layer unifies aspects of convolutions and graph neural networks.
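To make the message-passing view concrete, the sketch below implements a heavily simplified special case of Eq. 7 for type-0 (scalar) features only, where the angular basis is trivial and the entire kernel reduces to a learnable radial profile; class and function names are illustrative, and a full TFN additionally needs the spherical-harmonic/Clebsch-Gordan basis of Eq. 8 for higher types.

```python
import torch
import torch.nn as nn

class RadialKernel(nn.Module):
    """Learnable radial profile phi(||x||), the only free part of a TFN kernel.

    For the type-0 -> type-0 case the angular basis is a constant, so the whole
    kernel is a matrix-valued function of the distance alone.
    """
    def __init__(self, c_in, c_out, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, c_in * c_out))
        self.shape = (c_out, c_in)

    def forward(self, r):                                 # r: (..., 1) distances
        return self.net(r).view(*r.shape[:-1], *self.shape)

def tfn_layer_type0(x, f, radial):
    """Scalar part of Eq. 7: f_out,i = sum_j W(x_j - x_i) f_in,j."""
    rel = x[None, :, :] - x[:, None, :]                   # (n, n, 3) relative positions
    dist = rel.norm(dim=-1, keepdim=True)                 # (n, n, 1)
    W = radial(dist)                                      # (n, n, c_out, c_in)
    return torch.einsum('ijoc,jc->io', W, f)              # aggregate messages over j

n, c_in, c_out = 10, 4, 8
x, f = torch.randn(n, 3), torch.randn(n, c_in)
out = tfn_layer_type0(x, f, RadialKernel(c_in, c_out))    # rotation-invariant outputs
```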


Figure 2: Updating the node features using our equivariant attention mechanism in four steps. A more detailed description, especially of step 2, is provided in the Appendix. Steps 3 and 4 visualise a graph network perspective: features are passed from nodes to edges to compute keys, queries and values, which depend both on features and relative positions in a rotation-equivariant manner.

3 Method

Here, we present the SE(3)-Transformer. The layer can be broken down into a procedure of four steps as shown in Fig. 2, which we describe in the following section. These are the construction of a graph from a point cloud, the construction of equivariant edge functions on the graph, how to propagate SE(3)-equivariant messages on the graph, and how to aggregate them. We also introduce an alternative for the self-interaction layer, which we call attentive self-interaction.

3.1 Neighbourhoods

Given a point cloud $\{(x_i, f_i)\}_{i=1}^{N}$, we first introduce a collection of neighbourhoods $\mathcal{N}_i \subseteq \{1, \dots, N\}$, one centered on each point $i$. These neighbourhoods are computed either via the k-nearest-neighbours method or may already be defined. For instance, molecular structures have neighbourhoods defined by their bonding structure. Neighbourhoods reduce the computational complexity of the attention mechanism from quadratic in the number of points to linear. The introduction of neighbourhoods converts our point cloud into a graph. This step is shown as Step 1 of Fig. 2.
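A common way to build these neighbourhoods is a k-nearest-neighbour query on the point coordinates; the few lines below are an illustrative sketch (the helper name is ours, not from any released code).

```python
import torch

def knn_neighbourhoods(x, k):
    """Indices of the k nearest neighbours of each point, excluding the point itself."""
    dist = torch.cdist(x, x)                    # (n, n) pairwise Euclidean distances
    dist.fill_diagonal_(float('inf'))           # a point is not its own neighbour
    return dist.topk(k, largest=False).indices  # (n, k) neighbour indices

x = torch.randn(100, 3)                         # point cloud with 100 points
nbr_idx = knn_neighbourhoods(x, k=10)
```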

3.2 The SE(3)-Transformer

The SE(3)-Transformer itself consists of three components. These are 1) edge-wise attention weights $\alpha_{ij}$, constructed to be SE(3)-invariant on each edge $ij$, 2) edge-wise SE(3)-equivariant value messages, propagating information between nodes, as found in the TFN convolution of Eq. 7, and 3) a linear/attentive self-interaction layer. Attention is performed on a per-neighbourhood basis as follows:

$f^{\ell}_{\mathrm{out},i} = \underbrace{W^{\ell\ell}_V f^{\ell}_{\mathrm{in},i}}_{\text{(3) self-interaction}} + \sum_{k \geq 0} \sum_{j \in \mathcal{N}_i \setminus i} \underbrace{\alpha_{ij}}_{\text{(1) attention}}\, \underbrace{W^{\ell k}_V(x_j - x_i)\, f^{k}_{\mathrm{in},j}}_{\text{(2) value message}} \qquad (10)$

These components are visualised in Fig. 2. If we remove the attention weights, then we have a tensor field convolution, and if we instead remove the dependence of $W_V$ on $(x_j - x_i)$, we have a conventional attention mechanism. Provided that the attention weights $\alpha_{ij}$ are invariant, Eq. 10 is equivariant to SE(3)-transformations, because it is just a linear combination of equivariant value messages. Invariant attention weights can be achieved with the dot-product attention structure shown in Eq. 11. This mechanism consists of a normalised inner product between a query vector $q_i$ at node $i$ and a set of key vectors $\{k_{ij}\}_{j \in \mathcal{N}_i}$ along each edge $ij$ in the neighbourhood $\mathcal{N}_i$, where

$\alpha_{ij} = \frac{\exp(q_i^\top k_{ij})}{\sum_{j' \in \mathcal{N}_i \setminus i} \exp(q_i^\top k_{ij'})}, \qquad q_i = \bigoplus_{\ell \geq 0} \sum_{k \geq 0} W^{\ell k}_Q f^{k}_{\mathrm{in},i}, \qquad k_{ij} = \bigoplus_{\ell \geq 0} \sum_{k \geq 0} W^{\ell k}_K(x_j - x_i)\, f^{k}_{\mathrm{in},j} \qquad (11)$

Here, $\bigoplus$ is the direct sum, i.e. vector concatenation in this instance. The linear embedding matrices $W^{\ell k}_Q$ and $W^{\ell k}_K(x_j - x_i)$ are of TFN type (c.f. Eq. 8). The attention weights are invariant for the following reason. If the input features are SO(3)-equivariant, then the query $q_i$ and key vectors $k_{ij}$ are also SO(3)-equivariant, since the linear embedding matrices are of TFN type. The inner product of SO(3)-equivariant vectors transforming under the same representation is invariant: if $q_i \mapsto D(g) q_i$ and $k_{ij} \mapsto D(g) k_{ij}$, then $q_i^\top D(g)^\top D(g) k_{ij} = q_i^\top k_{ij}$, because of the orthonormality of representations of SO(3), mentioned in the background section. We follow the common practice from the self-attention literature (Vaswani et al., 2017; Lee et al., 2019) and choose a softmax nonlinearity to normalise the attention weights to unity, but in general any nonlinear function could be used.
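This invariance argument can be checked numerically: rotating equivariant query and key vectors with the same (orthogonal) representation leaves their inner product, and hence the attention logits, unchanged. The snippet below is a toy check with two stacked type-1 features.

```python
import torch

def random_rotation():
    q, _ = torch.linalg.qr(torch.randn(3, 3))
    return q * torch.sign(torch.det(q))

# Representation acting on a feature made of two stacked type-1 (3-dim) vectors.
R = random_rotation()
D = torch.block_diag(R, R)
q_vec, k_vec = torch.randn(6), torch.randn(6)
logit = q_vec @ k_vec
logit_rot = (D @ q_vec) @ (D @ k_vec)        # (Dq)^T (Dk) = q^T D^T D k = q^T k
assert torch.allclose(logit, logit_rot, atol=1e-5)
```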

Aside: Angular Modulation The attention weights add extra degrees of freedom to the TFN kernel in the angular direction. This is seen when Eq. 10 is viewed as a convolution with a data-dependent kernel $\alpha_{ij} W^{\ell k}_V(x_j - x_i)$. In the literature, SO(3)-equivariant kernels are decomposed as a sum of products of learnable radial functions and non-learnable angular kernels (c.f. Eq. 8). The fixed angular dependence of the basis kernels is a strange artifact of the equivariance condition in noncommutative algebras and, while necessary to guarantee equivariance, it is seen as overconstraining the expressiveness of the kernels. Interestingly, the attention weights introduce a means to modulate the angular profile of the kernel, while maintaining equivariance.

Channels, Self-interaction Layers, and Non-Linearities Analogous to conventional neural networks, the SE(3)-Transformer can straightforwardly be extended to multiple channels per representation degree $\ell$, so far omitted for brevity. This sets the stage for self-interaction layers. The attention layer (c.f. Fig. 2 and circles 1 and 2 of Eq. 10) aggregates information over nodes $j$ and input representation degrees $k$. In contrast, the self-interaction layer (c.f. circle 3 of Eq. 10) exchanges information solely between features of the same degree and within one node, much akin to 1x1 convolutions in CNNs. Self-interaction is an elegant form of learnable skip connection, transporting information from query point $i$ in one layer to query point $i$ in the next. This is crucial since, in the SE(3)-Transformer, points do not attend to themselves. In our experiments, we use two different types of self-interaction layer: (1) linear and (2) attentive, both of the form

$f^{\ell}_{\mathrm{out},i,c'} = \sum_{c} w^{\ell}_{i,c'c}\, f^{\ell}_{\mathrm{in},i,c} \qquad (12)$

Linear: Following Schütt et al. (2017), output channels are a learned linear combination of input channels using one set of weights per representation degree, shared across all points. As proposed in Thomas et al. (2018), this is followed by a norm-based non-linearity.

Attentive: We propose an extension of linear self-interaction, attentive self-interaction, combining self-interaction and nonlinearity. We replace the learned scalar weights with attention weights output from an MLP, shown in Eq. 13 ($\bigoplus$ means concatenation). These weights are SE(3)-invariant due to the invariance of inner products of features transforming under the same representation.

$w^{\ell}_{i,c'c} = \mathrm{MLP}\!\left( \bigoplus_{c, c'} f^{\ell\,\top}_{\mathrm{in},i,c'}\, f^{\ell}_{\mathrm{in},i,c} \right) \qquad (13)$
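A sketch of this idea for a single degree $\ell$ is shown below: the MLP sees only the invariant matrix of inner products between a node's channels, and its output is used as the channel-mixing weights of Eq. 12. Shapes and names are illustrative.

```python
import torch
import torch.nn as nn

def attentive_self_interaction(f_l, mlp):
    """f_l: (n, c, 2l+1) features of one degree l; returns (n, c_out, 2l+1)."""
    gram = torch.einsum('ncd,nkd->nck', f_l, f_l)      # (n, c, c) invariant inner products
    w = mlp(gram.flatten(1))                           # (n, c_out * c) mixing weights
    w = w.view(f_l.shape[0], -1, f_l.shape[1])         # (n, c_out, c)
    return torch.einsum('noc,ncd->nod', w, f_l)        # learned combination of channels

n, c, c_out, l = 5, 4, 4, 1
f_l = torch.randn(n, c, 2 * l + 1)
mlp = nn.Sequential(nn.Linear(c * c, 32), nn.ReLU(), nn.Linear(32, c_out * c))
out = attentive_self_interaction(f_l, mlp)
```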

3.3 Node and Edge Features

Point cloud data often has information attached to points (node features) and connections between points (edge features), which we would both like to pass as inputs into the first layer of the network. Node information can directly be incorporated via the tensors $f_j$ in Eqs. 6 and 10. For incorporating edge information, note that node $j$ is part of multiple neighbourhoods. One can replace $f_{\mathrm{in},j}$ with $f_{\mathrm{in},ij}$ in Eq. 10. Now, $f_{\mathrm{in},ij}$ can carry different information depending on which neighbourhood $\mathcal{N}_i$ we are currently performing attention over. In other words, $f_{\mathrm{in},ij}$ can carry information both about node $j$ and about edge $ij$. Alternatively, if the edge information is scalar, it can be incorporated into the weight matrices $W_V$ and $W_K$ as an input to the radial network (see step 2 in Fig. 2).

Figure 3: A model based on conventional self-attention (Set Transformer, left) and our rotation-equivariant version (SE(3)-Transformer, right) predict future locations and velocities in a 5-body problem. The respective left-hand plots show input locations, ground truth future locations, and the respective predictions. The right-hand plots show predicted locations and velocities for rotations of the input in steps of 10 degrees. The dashed curves show the predicted locations of a perfectly equivariant model.

4 Experiments

We test the efficacy of the SE(3)-Transformer on three datasets, each testing different aspects of the model. The N-body problem is an equivariant task: rotation of the input should result in rotated predictions of locations and velocities of the particles. Next, we evaluate on a real-world object classification task. Here, the network is confronted with large point clouds of noisy data with symmetry only around the gravitational axis. Finally, we test the SE(3)-Transformer on a molecular property regression task, which shines light on its ability to incorporate rich graph structures. We compare to publicly available, state-of-the-art results as well as a set of our own baselines. Specifically, we compare to the Set Transformer (Lee et al., 2019), a non-equivariant attention model, and Tensor Field Networks (Thomas et al., 2018), which is similar to the SE(3)-Transformer but does not leverage attention.

Similar to Sosnovik et al. (2020) and Worrall and Welling (2019), we measure the exactness of equivariance by applying uniformly sampled SO(3)-transformations to input and output. The distance between the two, averaged over samples, yields the equivariance error. Note that, unlike in Sosnovik et al. (2020), the error is not squared:

$\Delta_{\mathrm{EQ}} = \frac{\|L_s \Phi(f) - \Phi(L_s f)\|_2}{\|L_s \Phi(f)\|_2} \qquad (14)$
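A sketch of this measurement for a model with vector (type-1) outputs is given below; the rotation convention and the helper names are ours, and in practice the error is averaged over many sampled rotations.

```python
import torch

def equivariance_error(model, x, f, R):
    """Relative equivariance error (Eq. 14) for a single sampled rotation R."""
    out = model(x, f) @ R.T          # rotate the output of the unrotated input
    out_rot = model(x @ R.T, f)      # output of the rotated input
    return (out - out_rot).norm() / out.norm()

# toy check with a trivially equivariant model: the identity map on positions
model = lambda x, f: x
x, f = torch.randn(5, 3), torch.randn(5, 4)
q, _ = torch.linalg.qr(torch.randn(3, 3))
R = q * torch.sign(torch.det(q))
err = equivariance_error(model, x, f, R)   # ~0, up to floating point precision
```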

4.1 N-Body Simulations

In this experiment, we use an adaptation of the dataset from Kipf et al. (2018). Five particles each carry either a positive or a negative charge and exert repulsive or attractive forces on each other. The input to the network is the position of each particle at a specific time step, its velocity, and its charge. The task of the algorithm is then to predict the relative location and velocity 500 time steps into the future. We deliberately formulated this as a regression problem to avoid the need to predict multiple time steps iteratively. Even though it certainly is an interesting direction for future research to combine equivariant attention with, e.g., an LSTM, our goal here was to test our core contribution and compare it to related models. This task sets itself apart from the other two experiments by not being invariant but equivariant: when the input is rotated or translated, the output transforms accordingly (see Fig. 3).

We trained an SE(3)-Transformer with 4 equivariant layers, each followed by an attentive self-interaction layer (details are provided in the Appendix). Table 1 shows quantitative results. Our model outperforms both an attention-based but not rotation-equivariant approach (Set Transformer) and an equivariant approach which does not leverage attention (Tensor Field). The equivariance error shows that our approach is indeed fully rotation equivariant up to the precision of the computations.

Models compared: Linear, DeepSet (Zaheer et al., 2017), Tensor Field (Thomas et al., 2018), Set Transformer (Lee et al., 2019), SE(3)-Transformer. Metrics: MSE of position and MSE of velocity (with standard deviations where available).

Table 1: Predicting future locations and velocities in an electron-proton simulation.

4.2 Real-World Object Classification on ScanObjectNN

Figure 4: ScanObjectNN. Panel a): training without data augmentation. Panel b): training with data augmentation. The x-axis shows data augmentation on the test set: the value corresponds to the maximum rotation around a random axis in the x-y plane. If both training and test set are not rotated (rotation 0 in a), breaking the symmetry of the SE(3)-Transformer by providing the z-component of the coordinates as an additional, scalar input improves the performance significantly. Interestingly, the model learns to ignore the additional, symmetry-breaking input when the training set presents a rotation-invariant problem (strongly overlapping dark red circles and dark purple triangles in b).

ScanObjectNN is a recently introduced dataset for real-world object classification. The benchmark provides point clouds of 2902 objects across 15 different categories. We only use the coordinates of the points as input and object categories as training labels. We train an SE(3)-Transformer with 4 equivariant layers with linear self-interaction, followed by max-pooling and an MLP. Interestingly, the task is not fully rotation invariant, in a statistical sense, as the objects are aligned with respect to the gravity axis. This results in a performance loss when deploying a fully SO(3)-invariant model (see Fig. 4). In other words: when looking at a new object, it helps to know where 'up' is. We create an SO(2)-invariant version of our algorithm by additionally feeding the z-component of the coordinates as a type-0 field and the x, y position as an additional type-1 field (see Appendix). We dub this model SE(3)-Transformer +z. This way, the model can 'learn' which symmetries to adhere to by suppressing and promoting different inputs (compare Fig. 4a and Fig. 4b). In Table 2, we compare our model to the current state-of-the-art in object classification (PointGLR is a recently published preprint (Rao et al., 2020); the performance of the following models was taken from the official benchmark of the dataset as of June 4th, 2020, https://hkust-vgd.github.io/benchmark/: 3DmFV (Ben-Shabat et al., 2018), PointNet (Qi et al., 2017a), SpiderCNN (Xu et al., 2018), PointNet++ (Qi et al., 2017b), DGCNN (Wang et al., 2019)). Despite the dataset not playing to the strengths of our model (full SE(3)-invariance) and a much lower number of input points, the performance is competitive with models specifically designed for object classification.

Method (No. Points): Accuracy
DeepSet (1024): 71.4%
3DmFV (1024): 73.8%
Set Transformer (1024): 74.1%
PointNet (1024): 79.2%
SpiderCNN (1024): 79.5%
Tensor Field +z (128): 81.0%
PointNet++ (1024): 84.3%
SE(3)-Transformer +z (128): 85.0%
PointCNN (1024): 85.5%
DGCNN (1024): 86.2%
PointGLR (1024): 87.2%

Table 2: Classification accuracy on the 'object only' category of the ScanObjectNN dataset. The performance of the SE(3)-Transformer is averaged over 5 runs (standard deviation 0.7%).

4.3 QM9

Task (units): α (bohr³), Δε (meV), ε_HOMO (meV), ε_LUMO (meV), μ (D), C_v (cal/mol K)
WaveScatt (Hirn et al., 2017): .160, 118, 85, 76, .340, .049
NMP (Gilmer et al., 2020): .092, 69, 43, 38, .030, .040
SchNet (Schütt et al., 2017): .235, 63, 41, 34, .033, .033
Cormorant (Anderson et al., 2019): .085, 61, 34, 38, .038, .026
LieConv(T3) (Finzi et al., 2020): .084, 49, 30, 25, .032, .038
TFN (Thomas et al., 2018): .223, 58, 40, 38, .064, .101
SE(3)-Transformer (ours): .148, 53, 36, 33, .053, .057

Table 3: QM9 Mean Absolute Error. Top: Non-equivariant models. Bottom: Equivariant models

The QM9 regression dataset (Ramakrishnan et al., 2014) is a publicly available chemical property prediction task. There are 134k molecules with up to 29 atoms per molecule. Atoms are represented as 5-dimensional one-hot node embeddings in a molecular graph connected by 4 different chemical bond types (more details in the Appendix). 'Positions' of each atom are provided. We show results on the test set of Anderson et al. (2019) for 6 regression tasks in Table 3; lower is better. The table is split into non-equivariant (top) and equivariant models (bottom). Our nearest models are Cormorant and TFN (own implementation). We see that, while not state-of-the-art, we offer competitive performance, especially against Cormorant and TFN, which, like us, transform features under irreducible representations of SE(3); LieConv(T3) uses a left-regular representation of SE(3), which may explain its success.

5 Conclusion

We have presented an attention-based neural architecture designed specifically for point cloud data. This architecture is guaranteed to be robust to rotations and translations of the input, obviating the need for training-time data augmentation and ensuring stability to arbitrary choices of coordinate frame. The use of self-attention allows for anisotropic, data-adaptive filters, while the use of neighbourhoods enables scalability to large point clouds. We have also introduced the interpretation of the attention mechanism as a data-dependent nonlinearity, adding to the list of equivariant nonlinearities which we can use in equivariant networks. Furthermore, we provide pseudocode in the Appendix for a speed-up of the spherical harmonics computation of up to three orders of magnitude. This speed-up allowed us to train significantly larger versions of both the SE(3)-Transformer and the Tensor Field network (Thomas et al., 2018) and to apply these models to real-world datasets.

Our experiments showed that adding attention to a roto-translation-equivariant model consistently led to higher accuracy and increased training stability. Specifically for large neighbourhoods, attention proved essential for model convergence. On the other hand, compared to conventional attention, adding the equivariance constraints also increased performance in all of our experiments, while at the same time providing a mathematical guarantee of robustness with respect to rotations of the input data.

Broader Impact

The main contribution of the paper is a mathematically motivated attention mechanism which can be used for deep learning on point cloud based problems. We do not see a direct potential for negative societal impact. However, we would like to stress that this type of algorithm is inherently suited for classification and regression problems on molecules. The SE(3)-Transformer therefore lends itself to application in drug research. One concrete application we are currently investigating is to use the algorithm for early-stage suitability classification of molecules for inhibiting the reproductive cycle of the coronavirus. While research of this sort always requires intensive testing in wet labs, computer algorithms can be and are being used to filter out particularly promising compounds from large databases of millions of molecules.

Acknowledgements

We would like to express our gratitude to the Bosch Center for Artificial Intelligence and Koninklijke Philips N.V. for their support and contribution to open research in publishing our paper.

References

  • B. Anderson, T. S. Hy, and R. Kondor (2019) Cormorant: covariant molecular neural networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §D.3, §4.3, Table 3.
  • L. J. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv Preprint. Cited by: §D.3, Table 5.
  • Y. Ben-Shabat, M. Lindenbaum, and A. Fischer (2018) 3DmFV: three-dimensional point cloud classification in real-time using convolutional neural networks. IEEE Robotics and Automation Letters. Cited by: footnote 2.
  • G. S. Chirikjian, A. B. Kyatkin, and A. Buckingham (2001) Engineering applications of noncommutative harmonic analysis: with emphasis on rotation and motion groups. Appl. Mech. Rev. 54 (6), pp. B97–B98. Cited by: Appendix A, §2.3.
  • T. S. Cohen and M. Welling (2017) Steerable cnns. International Conference on Learning Representations (ICLR). Cited by: §2.3.
  • M. Finzi, S. Stanton, P. Izmailov, and A. Wilson (2020) Generalizing convolutional neural networks for equivariance to lie groups on arbitrary continuous data. Proceedings of the International Conference on Machine Learning, ICML. Cited by: Table 3.
  • F. B. Fuchs, A. R. Kosiorek, L. Sun, O. P. Jones, and I. Posner (2020) End-to-end recurrent multi-object tracking and prediction with relational reasoning. arXiv preprint. Cited by: §2.1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2020) Neural message passing for quantum chemistry. Proceedings of the International Conference on Machine Learning, ICML. Cited by: Table 3.
  • M. J. Hirn, S. Mallat, and N. Poilvert (2017) Wavelet scattering regression of quantum chemical energies. Multiscale Model. Simul. 15 (2), pp. 827–863. Cited by: Table 3.
  • Y. Hoshen (2017) VAIN: attentional multi-agent predictive modeling. Advances in Neural Information Processing Systems (NeurIPS). Cited by: §2.2.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In International Conference on Learning Representations, ICLR, Cited by: §D.1, §D.2, §D.3.
  • T. N. Kipf, E. Fetaya, K. Wang, M. Welling, and R. S. Zemel (2018) Neural relational inference for interacting systems. In Proceedings of the International Conference on Machine Learning, ICML, Cited by: §D.2, §1, §4.1.
  • R. Kondor (2018) N-body networks: a covariant hierarchical neural network architecture for learning atomic potentials. arXiv preprint. Cited by: §2.3.
  • J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh (2019) Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the International Conference on Machine Learning, ICML, Cited by: §D.1, §D.2, §2.1, §2.1, §3.2, Table 1, §4.
  • Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. International Conference on Learning Representations (ICLR). Cited by: §2.2.
  • N. Parmar, P. Ramachandran, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens (2019) Stand-alone self-attention in vision models. In Advances in Neural Information Processing System (NeurIPS), Cited by: §1, §2.1.
  • C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017a) PointNet: deep learning on point sets for 3d classification and segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: footnote 2.
  • C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017b) Pointnet++: deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems (NeurIPS). Cited by: footnote 2.
  • R. Ramakrishnan, P. Dral, M. Rupp, and A. von Lilienfeld (2014) Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1, pp. . Cited by: §D.3, §1, §4.3.
  • Y. Rao, J. Lu, and J. Zhou (2020) Global-local bidirectional reasoning for unsupervised representation learning of 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: footnote 2.
  • D. W. Romero, E. J. Bekkers, J. M. Tomczak, and M. Hoogendoorn (2020) Attentive group equivariant convolutional networks. Cited by: §2.3.
  • K. T. Schütt, P.-J. Kindermans, H. E. Sauceda, S. Chmiela, A. Tkatchenko, and K.-R. Müller (2017) SchNet: a continuous-filter convolutional neural network for modeling quantum interactions. Advances in Neural Information Processing Systems (NeurIPS). Cited by: §3.2, Table 3.
  • P. Shaw, J. Uszkoreit, and A. Vaswani (2018) Self-attention with relative position representations. Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT). Cited by: §2.2.
  • I. Sosnovik, M. Szmaja, and A. Smeulders (2020) Scale-equivariant steerable networks. International Conference on Learning Representations (ICLR). Cited by: §4.
  • N. Thomas, T. Smidt, S. M. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley (2018) Tensor field networks: rotation- and translation-equivariant neural networks for 3d point clouds. ArXiv Preprint. Cited by: Appendix A, §D.1, §D.1, §D.2, §2.3, §2.3, §3.2, Table 1, Table 3, §4, §5.
  • M. A. Uy, Q. Pham, B. Hua, D. T. Nguyen, and S. Yeung (2019) Revisiting point cloud classification: a new benchmark dataset and classification model on real-world data. In International Conference on Computer Vision (ICCV), Cited by: §1.
  • S. van Steenkiste, M. Chang, K. Greff, and J. Schmidhuber (2018) Relational neural expectation maximization: unsupervised discovery of objects and their interactions. International Conference on Learning Representations (ICLR). Cited by: §2.1, §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS). Cited by: §D.3, §1, §2.1, §2.1, §2.2, §3.2.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. International Conference on Learning Representations (ICLR). Cited by: §1, §2.1, §2.2.
  • E. Wagstaff, F. B. Fuchs, M. Engelcke, I. Posner, and M. A. Osborne (2019) On the limitations of representing functions on sets. International Conference on Machine Learning (ICML). Cited by: §2.1.
  • X. Wang, R. Girshick, A. Gupta, and K. He (2017) Non-local neural networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.2.
  • Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG). Cited by: footnote 2.
  • M. Weiler, M. Geiger, M. Welling, W. Boomsma, and T. Cohen (2018) 3D steerable cnns: learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Appendix A, Appendix B, Table 4, §2.3.
  • D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow (2016) Harmonic networks: deep translation and rotation equivariance. arXiv preprint arXiv:1612.04642. Cited by: §D.3.
  • D. E. Worrall and M. Welling (2019) Deep scale-spaces: equivariance over scale. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §4.
  • Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §D.1, §D.1, §2.1.
  • S. Xie, S. Liu, and Z. C. Z. Tu (2018) Attentional shapecontextnet for point cloud recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.1.
  • Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao (2018) Spidercnn: deep learning on point sets with parameterized convolutional filters. European Conference on Computer Vision (ECCV). Cited by: footnote 2.
  • J. Yang, Q. Zhang, and B. Ni (2019) Modeling point clouds with self-attention and gumbel subset sampling. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.1.
  • M. Zaheer, S. Kottur, S. Ravanbhakhsh, B. Póczos, R. Salakhutdinov, and A. Smola (2017) Deep Sets. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §D.1, §D.2, Table 1.

Appendix A Group Theory and Tensor Field Networks

Groups

A group is an abstract mathematical concept. Formally a group $(G, \circ)$ consists of a set $G$ and a binary composition operator $\circ : G \times G \to G$ (typically we just use the symbol $G$ to refer to the group). All groups must adhere to the following 4 axioms:

  • Closure: $g \circ h \in G$ for all $g, h \in G$

  • Associativity: $(g \circ h) \circ k = g \circ (h \circ k)$ for all $g, h, k \in G$

  • Identity: There exists an element $e \in G$ such that $e \circ g = g \circ e = g$ for all $g \in G$

  • Inverses: For each $g \in G$ there exists a $g^{-1} \in G$ such that $g^{-1} \circ g = g \circ g^{-1} = e$

In practice, we omit writing the binary composition operator $\circ$, so we would write $gh$ instead of $g \circ h$. Groups can be finite or infinite, countable or uncountable, compact or non-compact. Note that they are not necessarily commutative; that is, in general $gh \neq hg$.
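These axioms can be sanity-checked numerically for the rotation group used throughout the paper; the snippet below verifies them for random 3x3 rotation matrices under matrix multiplication (purely illustrative).

```python
import numpy as np
from scipy.spatial.transform import Rotation

g, h, k = (Rotation.random().as_matrix() for _ in range(3))
gh = g @ h
assert np.allclose(gh @ gh.T, np.eye(3)) and np.isclose(np.linalg.det(gh), 1.0)  # closure
assert np.allclose((g @ h) @ k, g @ (h @ k))                                     # associativity
assert np.allclose(g @ np.eye(3), g)                                             # identity
assert np.allclose(g @ g.T, np.eye(3))                                           # inverse: g^{-1} = g^T
```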

Actions/Transformations

Groups are useful concepts, because they allow us to describe the structure of transformations, also sometimes called actions. A transformation (operator) $T_g : \mathcal{X} \to \mathcal{X}$ is an injective map from a space $\mathcal{X}$ into itself. It is parameterised by an element $g$ of a group $G$. Transformations obey two laws:

  • Closure: $T_g \circ T_h$ is a valid transformation for all $g, h \in G$

  • Identity: There exists at least one element $e \in G$ such that for all $x \in \mathcal{X}$, $T_e[x] = x$

where $\circ$ denotes composition of transformations. For the expression $T_g[x]$, we say that $g$ acts on $x$. It can also be shown that transformations are associative under composition. To codify the structure of a transformation, we note that due to closure we can always write

$T_g \circ T_h = T_{gh} \qquad (15)$

If for any $x, y \in \mathcal{X}$ we can always find a group element $g$ such that $T_g[x] = y$, then we call $\mathcal{X}$ a homogeneous space. Homogeneous spaces are important concepts, because to each pair of points $x, y$ we can always associate at least one group element.

Equivariance and Intertwiners

As written in the main body of the text, equivariance is a property of functions $\phi : \mathcal{V} \to \mathcal{Y}$. Just to recap, given a set of transformations $T_g : \mathcal{V} \to \mathcal{V}$ for $g \in G$, where $G$ is an abstract group, a function $\phi$ is called equivariant if for every $g$ there exists a transformation $S_g : \mathcal{Y} \to \mathcal{Y}$ such that

$S_g[\phi(v)] = \phi(T_g[v]) \quad \text{for all } g \in G,\ v \in \mathcal{V} \qquad (16)$

If $\phi$ is linear and equivariant, then it is called an intertwiner. Two important questions arise: 1) how do we choose the transformations $T_g$ and $S_g$? 2) Once we have $(T_g, S_g)$, how do we solve for $\phi$? To answer these questions, we need to understand what kinds of $T_g$ and $S_g$ are possible. For this, we review representations.

Representations

A group representation $\rho : G \to GL(N)$ is a map from a group $G$ to the set of $N \times N$ invertible matrices $GL(N)$. Critically, $\rho$ is a group homomorphism; that is, it satisfies $\rho(g_1 g_2) = \rho(g_1)\rho(g_2)$ for all $g_1, g_2 \in G$. Representations can be used as transformation operators, acting on $N$-dimensional vectors $x \in \mathbb{R}^N$. For instance, for the group of 3D rotations, known as SO(3), we have that 3D rotation matrices $R_g$ act on (i.e., rotate) 3D vectors as

$x' = R_g\, x, \qquad R_g \in SO(3) \qquad (17)$

However, there are many more representations of SO(3) than just the 3D rotation matrices. Among representations, two representations $\rho$ and $\rho'$ of the same dimensionality are said to be equivalent if they can be connected by a similarity transformation

$\rho'(g) = Q^{-1} \rho(g)\, Q \quad \text{for all } g \in G \qquad (18)$

We also say that a representation $\rho$ is reducible if it can be written as

$\rho(g) = Q^{-1} \begin{bmatrix} \rho_1(g) & 0 \\ 0 & \rho_2(g) \end{bmatrix} Q \qquad (19)$

If the representations $\rho_1$ and $\rho_2$ are not reducible, then they are called irreducible representations of $G$, or irreps. In a sense, they are the atoms among representations, out of which all other representations can be constructed. Note that each irrep acts on a separate subspace, mapping vectors from that space back into it. We say that a subspace $V_\ell \subseteq \mathcal{V}$ is invariant under irrep $\rho_\ell$ if $\rho_\ell(g)\, v \in V_\ell$ for all $v \in V_\ell$ and $g \in G$.

Representation theory of SO(3)

As it turns out, all linear representations of compact groups (over a field of characteristic zero), such as SO(3), can be decomposed into a direct sum of irreps, as

$\rho(g) = Q^\top \left[ \bigoplus_{J} D_J(g) \right] Q \qquad (20)$

where $Q$ is an orthogonal, $N \times N$, change-of-basis matrix [Chirikjian et al., 2001]; and each $D_J(g)$ for $J = 0, 1, 2, \dots$ is a $(2J+1) \times (2J+1)$ matrix known as a Wigner-D matrix. The Wigner-D matrices are the irreducible representations of SO(3). We also mentioned that vectors transforming according to $D_J$ (i.e. we set $Q = I$) are called type-$J$ vectors. Type-0 vectors are invariant under rotations and type-1 vectors rotate according to 3D rotation matrices. Note, type-$J$ vectors have length $2J+1$. In the previous paragraph we mentioned that irreps act on orthogonal subspaces $V_J$. The orthogonal subspaces corresponding to the Wigner-D matrices are the spaces of spherical harmonics.

The Spherical Harmonics

The spherical harmonics $Y_J : S^2 \to \mathbb{C}^{2J+1}$ for $J \geq 0$ are square-integrable complex-valued functions on the sphere $S^2$. They have the satisfying property that they are rotated directly by the Wigner-D matrices as

$Y_J(R_g\, x) = D_J^{*}(g)\, Y_J(x) \qquad (21)$

where $D_J$ is the $J$th Wigner-D matrix and $D_J^{*}$ is its complex conjugate. They form an orthonormal basis for (the Hilbert space of) square-integrable functions on the sphere, $L^2(S^2)$, with inner product given as

$\langle f, h \rangle_{S^2} = \int_{S^2} f^{*}(x)\, h(x)\, dx \qquad (22)$

So $\langle Y_{Jm}, Y_{J'm'} \rangle_{S^2} = \delta_{JJ'}\delta_{mm'}$, where $Y_{Jm}$ is the $m$th element of $Y_J$. We can express any function $f$ in $L^2(S^2)$ as a linear combination of spherical harmonics, where

$f(x) = \sum_{J \geq 0} w_J^\top Y_J(x) \qquad (23)$

where each $w_J$ is a vector of coefficients of length $2J+1$. And in the opposite direction, we can retrieve the coefficients as

$w_J = \int_{S^2} f(x)\, Y_J^{*}(x)\, dx \qquad (24)$

following from the orthonormality of the spherical harmonics. This is in fact a Fourier transform on the sphere and the vectors $w_J$ can be considered Fourier coefficients. Critically, we can represent rotated functions as

(25)

The Clebsch-Gordan Decomposition

In the main text we introduced the Clebsch-Gordan coefficients. These are used in the construction of the equivariant kernels. They arise in the situation where we have a tensor product of Wigner-D matrices, which as we will see is part of the equivariance constraint on the form of the equivariant kernels. In representation theory a tensor product of representations is also a representation, but since it is not an easy object to work with, we seek to decompose it into a direct sum of irreps, which are easier. This decomposition is of the form of Eq. 20, written

$D_k(g) \otimes D_\ell(g) = Q^{\ell k \top} \left[ \bigoplus_{J=|k-\ell|}^{k+\ell} D_J(g) \right] Q^{\ell k} \qquad (26)$

In this specific instance, the change of basis matrices are given the special name of the Clebsch-Gordan coefficients. These can be found in many mathematical physics libraries.

Tensor Field Layers

In Tensor Field Networks Thomas et al. [2018] and 3D Steerable CNNs Weiler et al. [2018], the authors solve for the intertwiners between SO(3) equivariant point clouds. Here we run through the derivation again in our own notation.

We begin with a point cloud $\{(x_j, f_j)\}_{j=1}^{N}$, where $f_j$ is an equivariant point feature. Let's say that $f_j$ is a type-$k$ feature, which we write as $f_j^k$ to remind ourselves of the fact. Now say we perform a convolution with kernel $W^{\ell k}$, which maps from type-$k$ features to type-$\ell$ features. Then

(27)
(28)
(29)
(30)
sifting theorem (31)
(32)

Now let’s apply the equivariance condition to this expression, then

(33)
(34)

Now we notice that this expression should also be equal to Eq. 32, which is the convolution with an unrotated point cloud. Thus we end up at

(35)

which is sometimes referred to as the kernel constraint. To solve the kernel constraint, we notice that it is a linear equation and that we can rearrange it as

(36)

where we used the vectorisation identity $\mathrm{vec}(A X B) = (B^\top \otimes A)\, \mathrm{vec}(X)$ and the fact that the Wigner-D matrices are orthogonal. Using the Clebsch-Gordan decomposition we rewrite this as

(37)

Lastly, we can left multiply both sides by the Clebsch-Gordan matrix and introduce a shorthand for the result, noting that the Clebsch-Gordan matrices are orthogonal. This yields

(38)

Thus we have that the th subvector of is subject to the constraint

(39)

which is exactly the transformation law for the spherical harmonics from Eq. 21! Thus one way in which the kernel can be constructed is

(40)

Appendix B Recipe for Building an Equivariant Weight Matrix

One of the core operations in the SE(3)-Transformer is multiplying a feature vector $f^k$, which transforms according to a Wigner-D matrix $D_k(g)$, with a matrix $W^{\ell k}$ while preserving equivariance:

(41)

Here, as in the previous section, we showed how such a matrix can be constructed when mapping between features of type-$k$ and type-$\ell$, where $D_k$ is a block diagonal matrix of type-$k$ Wigner-D matrices and similarly $D_\ell$ is made of type-$\ell$ Wigner-D matrices. The matrix $W^{\ell k}$ is dependent on the relative position and is subject to the linear equivariance constraints, but it also has learnable components, which we did not show in the previous section. In this section, we show how such a matrix is constructed in practice.

Previously we showed that

(42)

which is an equivariant mapping between vectors of types $k$ and $\ell$. In practice, we have multiple input vectors of type-$k$ and multiple output vectors of type-$\ell$. For simplicity, however, we ignore this and pretend we only have a single input and single output. Note that this mapping has no learnable components. Note that the kernel constraint only acts in the angular direction, but not in the radial direction, so we can introduce scalar radial functions $\varphi^{\ell k}_J(\|x\|)$ (one for each $J$), such that

(43)

These radial functions act as an independent, learnable scalar factor for each degree $J$. The vectorised kernel $\mathrm{vec}(W^{\ell k}(x))$ has dimensionality $(2\ell+1)(2k+1)$. We can unvectorise the above, yielding

(44)
(45)

where $Q^{\ell k}_{Jm}$ is a slice from the Clebsch-Gordan matrix $Q^{\ell k}_J$, corresponding to spherical harmonic component $Y_{Jm}$. As we showed in the main text, we can also rewrite the unvectorised Clebsch-Gordan-spherical harmonic matrix-vector product as

(46)

In contrast to Weiler et al. [2018], we do not voxelise space and therefore the basis kernels will be different for each pair of points in each point cloud. However, the same basis kernels will be used multiple times in the network and even multiple times in the same layer. Hence, precomputing them at the beginning of each forward pass for the entire network can significantly speed up the computation. The Clebsch-Gordan coefficients do not depend on the relative positions and can therefore be precomputed once and stored on disk. Multiple libraries exist which approximate those coefficients numerically.

Appendix C Accelerated Computation of Spherical Harmonics

We wrote our own spherical harmonics library in PyTorch, which can generate spherical harmonics on the GPU. We found this critical to being able to run the SE(3)-Transformer and Tensor Field network baselines in a reasonable time. This library is accurate to within machine precision against the SciPy counterpart scipy.special.sph_harm and is 10x faster on CPU and 100-1000x faster on GPU. Here we outline our method to generate them.
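For illustration only, the snippet below evaluates the real spherical harmonics of degrees 0 and 1 in PyTorch using their closed-form expressions; it is not the paper's library, which handles arbitrary degrees and is benchmarked against scipy.special.sph_harm.

```python
import math
import torch

def real_sph_harm_l01(xyz):
    """Real spherical harmonics for degrees l = 0 and l = 1.

    xyz: (..., 3) unnormalised directions. Returns a dict {l: (..., 2l+1)}.
    Closed forms exist for these low degrees; higher degrees need recurrences.
    """
    unit = xyz / xyz.norm(dim=-1, keepdim=True)
    x, y, z = unit.unbind(-1)
    Y0 = 0.5 * math.sqrt(1.0 / math.pi) * torch.ones_like(x)          # constant Y_00
    c1 = math.sqrt(3.0 / (4.0 * math.pi))
    Y1 = torch.stack([c1 * y, c1 * z, c1 * x], dim=-1)                # m = -1, 0, 1
    return {0: Y0.unsqueeze(-1), 1: Y1}

# evaluate on the GPU if one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
Y = real_sph_harm_l01(torch.randn(1024, 3, device=device))
```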

The tesseral/real spherical harmonics are given as