The idea of equivariance to symmetry transformations provides one of the first theoretically grounded principles for neural network architecture design. Equivariant networks have shown excellent performance and data efficiency on vision and medical imaging problems that exhibit symmetries. Here we show how this principle can be extended beyond global symmetries to local gauge transformations, thereby enabling the development of equivariant convolutional networks on general manifolds. We implement gauge equivariant CNNs for signals defined on the icosahedron, which provides a reasonable approximation of spherical signals. By choosing to work with this very regular manifold, we are able to implement the gauge equivariant convolution using a single conv2d call, making it a highly scalable and practical alternative to Spherical CNNs. We evaluate the Icosahedral CNN on omnidirectional image segmentation and climate pattern segmentation, and find that it outperforms previous methods.
By and large, progress in deep learning has been achieved through intuition-guided experimentation. This approach is indispensable and has led to many successes, but has not produced a deep understanding of why and when certain architectures work well. As a result, every new application requires an extensive architecture search, which comes at a significant labor and energy cost.
Although a theory that tells us which architecture to use for any given problem is clearly out of reach, we can nevertheless come up with general principles to guide architecture search. One such rational design principle that has met with substantial empirical success (Winkels & Cohen, 2018; Zaheer et al., 2017; Lunter & Brown, 2018) maintains that network architectures should be equivariant to symmetry transformations.
Besides the ubiquitous convolutional network (which is translation equivariant), equivariant networks have been developed for sets, graphs, and homogeneous spaces like the plane and the sphere (see Sec. 3). In each case, the network is made equivariant to the global symmetries of the underlying space. In general, however, a manifold does not have global symmetries, and so it is not clear how one might develop equivariant or convolutional networks for such spaces.
In this paper we define a convolution-like operation on a general manifold M that is equivariant to local gauge transformations (Fig. 1). This gauge equivariant convolution takes as input a number of feature fields on M of various types (analogous to matter fields in physics), and produces as output new feature fields. Each field is represented by a number of feature maps, whose activations are interpreted as the coefficients of a geometrical object (e.g. scalar, vector, tensor, etc.) relative to a spatially varying frame (i.e. gauge). The network is constructed such that if one were to change the gauge, the coefficients would change in a predictable way so as to preserve their geometrical meaning. Indeed, the search for a geometrically sensible definition of “manifold convolution”, a key problem in geometric deep learning, leads inevitably to gauge equivariance.
Although the theory of gauge equivariant networks developed in this paper is very general, we apply it to one specific manifold: the icosahedron. This manifold has some global symmetries (discrete rotations), which nicely shows the difference between and interplay of local and global symmetries. In addition, the regularity and local flatness of this manifold allows for a very efficient implementation using existing deep learning primitives (i.e. conv2d). The resulting algorithm shows excellent performance and accuracy on a range of different problems.
The shift from global to local symmetries mirrors a key development of 20th century physics. Indeed, gauge invariance plays a key role in all modern physical theories. However, for all its centrality, gauge theory has a reputation for being abstract and difficult. So in order to keep this article accessible to a broad machine learning audience, we have chosen to emphasize geometrical intuition over mathematical formality.
The rest of this paper is organized as follows. In Sec. 2, we motivate the need for working with gauges, and define gauge equivariant convolution for general manifolds and fields. In section 3, we discuss related work on equivariant and geometrical deep learning. Then in section 4, we discuss the concrete instantiation and implementation of gauge equivariant CNNs for the icosahedron. Results on IcoMNIST, climate pattern segmentation, and omnidirectional RGB-D image segmentation are presented in Sec. 5.
Consider the problem of generalizing the classical convolution of two planar signals (e.g. a feature map and a filter) to signals defined on a manifold M. The first and most natural idea comes from thinking of planar convolution in terms of shifting a filter over a feature map. Observing that shifts are symmetries of the plane (mapping the plane onto itself while preserving its structure), one is led to the idea of transforming a filter on M by the symmetries of M. For instance, replacing shifts of the plane by rotations of the sphere, one obtains Spherical CNNs (Cohen et al., 2018b).
This approach works for any homogeneous space, where by definition it is possible to move from any point to any other point using an appropriate symmetry transformation (Kondor & Trivedi, 2018; Cohen et al., 2018c, a). On less symmetrical manifolds however, it may not be possible to move the filter from any point to any other point by symmetry transformations. Hence, transforming filters by symmetry transformations will in general not provide a recipe for weight sharing between filters at all points in M.
Instead of symmetries, one can move the filter by parallel transport (Schonsheck et al., 2018), but as shown in Fig. 2, this leaves an ambiguity in the filter orientation, because parallel transport is path dependent. This can be addressed by using only rotation invariant filters (Boscaini et al., 2015; Bruna et al., 2014), albeit at the cost of limiting expressivity.
The key issue we have just identified is that on a manifold, there is no natural way to choose a preferred gauge (local frame), relative to which we can position our measurement apparatus (i.e. filter), and relative to which we can describe measurements (i.e. responses). We must choose a gauge in order to numerically represent geometrical quantities and perform computations, but since it is arbitrary, the computations should be independent of it.
This does not mean however that the coefficients of the feature vectors should be invariant to gauge transformations, but rather that the feature vector itself should be invariant. That is, a gauge transformation leads to a change of basis of the feature space, so the feature vector coefficients should change equivariantly to ensure that the vector itself is unchanged.
Before showing how this is achieved, we note that on non-parallelizable manifolds such as the sphere, it is not possible to choose a smooth global gauge. Hence, in order to make the mathematics work smoothly, one is forced to work with locally defined gauges. In practice, we will have to discretize the manifold, but we will nevertheless give a local description for reasons of mathematical elegance.
The basic idea of gauge equivariant convolution is as follows. Lacking alternative options, we choose arbitrarily a smooth local gauge on subsets U of M (e.g. the red or blue gauge in Fig. 1). We can then position a filter at each point p in U, defining its orientation relative to the gauge. Then, we can match an input feature map against the filter at p to obtain the value of the output feature map at p. For the output to transform equivariantly, certain linear constraints are placed on the convolution kernel. We will now define this formally.
We define a gauge as a position-dependent invertible linear map w_p : R^2 → T_pM, where T_pM is the tangent space of M at p. This determines a frame (w_p e_1, w_p e_2) in T_pM, where (e_1, e_2) is the standard frame of R^2.
A gauge transformation (Fig. 1) is a position-dependent change of frame, which can be described by maps g_p taking values in GL(2, R) (the group of invertible 2 × 2 matrices). As indicated by the subscript, the transformation depends on the position p. To change the frame, simply compose w_p with g_p, i.e. w_p → w_p g_p. It follows that component vectors transform as v → g_p^{-1} v, so that the vector w_p v itself is invariant.
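This bookkeeping is easy to check numerically. A minimal numpy sketch (the variable names w, v, g are ours, not the paper's): the coefficients and the frame transform in opposite ways, so the geometric vector is unchanged.

```python
import numpy as np

# A tangent vector at p is stored as coefficients v relative to a gauge w_p
# (here: the columns of a 2x2 matrix). Under a gauge transformation g_p the
# frame becomes w @ g and the coefficients become inv(g) @ v, so the
# geometric vector w @ v is invariant.
rng = np.random.default_rng(0)
w = rng.normal(size=(2, 2))           # arbitrary invertible frame (gauge)
v = rng.normal(size=2)                # coefficients of a tangent vector
theta = 0.7                           # a gauge rotation g_p in SO(2)
g = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

w_new = w @ g                         # transformed frame
v_new = np.linalg.inv(g) @ v          # transformed coefficients

assert np.allclose(w @ v, w_new @ v_new)   # the vector itself is invariant
```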
If we derive our gauge from a coordinate system for M (as shown in Fig. 1), then a change of coordinates leads to a gauge transformation (g_p being the Jacobian of the coordinate transformation at p). But we can also choose a gauge independent of any coordinate system.
It is often useful to restrict the kinds of frames we consider, for example to only allow right-handed or orthogonal frames. Such restrictions limit the kinds of gauge transformations we can consider. For instance, if we allow only right-handed frames, g_p should have positive determinant (i.e. g_p in GL+(2, R)), so that it does not reverse the orientation. If in addition we allow only orthogonal frames, g_p must be a rotation, i.e. g_p in SO(2).
In mathematical terms, G is called the structure group of the theory, and limiting the kinds of frames we consider corresponds to a reduction of the structure group (Husemöller, 1994). Each reduction corresponds to some extra structure that is preserved, such as an orientation (G = GL+(2, R)) or Riemannian metric (G = O(2)). In the Icosahedral CNN (Fig. 4), we will want to preserve the hexagonal grid structure, which corresponds to a restriction to grid-aligned frames and a reduction of the structure group to C_6, the group of planar rotations by integer multiples of 2π/6. For the rest of this section, we will work in the Riemannian setting, i.e. use G = SO(2).
Before we can define gauge equivariant convolution, we will need the exponential map, which gives a convenient parameterization of the local neighbourhood of a point p. This map takes a tangent vector V in T_pM, follows the geodesic (shortest curve) in the direction of V with speed ‖V‖ for one unit of time, to arrive at a point q_V = exp_p V (see Fig. 3 and the Supp. Mat.).
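On the unit sphere the exponential map has a closed form, which makes for a quick sanity check. A generic sketch, not code from the paper (the helper name exp_map_sphere is ours):

```python
import numpy as np

def exp_map_sphere(p, v, eps=1e-12):
    """Exponential map on the unit sphere: follow the geodesic from p in
    tangent direction v, travelling arc length ||v||."""
    norm = np.linalg.norm(v)
    if norm < eps:
        return p
    return np.cos(norm) * p + np.sin(norm) * (v / norm)

p = np.array([0.0, 0.0, 1.0])         # north pole
v = np.array([np.pi / 2, 0.0, 0.0])   # tangent vector at p, length pi/2
q = exp_map_sphere(p, v)
assert np.allclose(q, [1.0, 0.0, 0.0])    # quarter great circle: equator
assert np.isclose(np.linalg.norm(q), 1)   # result stays on the sphere
```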
Having defined gauges, gauge transformations, and the exponential map, we are now ready to define gauge equivariant convolution. We begin with scalar input and output fields.
We define a filter as a locally supported function K : R^2 → R, where R^2 may be identified with T_pM via the gauge w_p. Then, writing q_v for exp_p w_p v, we define the scalar convolution of K and f at p as follows:

(K ⋆ f)(p) = ∫_{R^2} K(v) f(q_v) dv.    (1)
The gauge was chosen arbitrarily, so we must consider what happens if we change it. Since the filter is a function of a coordinate vector v, and v gets rotated by gauge transformations (v → g_p^{-1} v), the effect of a gauge transformation is a position-dependent rotation of the filters. For the convolution output to be called a scalar field, it has to be invariant to gauge transformations (w_p → w_p g_p and v → g_p^{-1} v). For this to be the case, the filter has to be rotation-invariant:

K(g^{-1} v) = K(v),   for all g in SO(2).    (2)
This is very limiting, but it is what it is: to map a scalar input field to a scalar output field in a gauge equivariant manner, we need to use rotationally symmetric filters. More general fields will yield less constrained filters.
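In the hexagonal discretization used later (Sec. 4) this constraint becomes very concrete: a rotation by 60 degrees cyclically permutes the six neighbours of a pixel, so a gauge equivariant scalar-to-scalar filter must assign them all the same weight. A small illustrative check (our own toy representation of a 1-ring filter):

```python
# On a 1-ring hexagonal stencil, a rotation by 60 degrees cyclically permutes
# the six neighbour positions. A scalar-to-scalar filter is invariant to this
# rotation iff all ring weights are equal.
center, ring = 1.5, [0.3] * 6          # rotation-invariant filter
rotated_ring = ring[1:] + ring[:1]     # effect of a 60-degree gauge rotation
assert rotated_ring == ring

asym_ring = [0.3, 0.1, 0.3, 0.3, 0.3, 0.3]   # not rotation invariant
assert (asym_ring[1:] + asym_ring[:1]) != asym_ring
```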
Intuitively, a field is an assignment of some geometrical quantity (feature vector) of the same type to each point p of M. The type of a quantity is determined by its transformation behaviour under gauge transformations. For instance, the word vector field is reserved for a field of tangent vectors v(p) in T_pM, which transform like v → g_p^{-1} v as we saw before. We emphasize again that geometrically speaking, a vector is not a list of scalars (coordinate vector), because scalars are invariant to gauge transformations.
In the general case, the transformation behaviour of a C-dimensional geometrical quantity is described by a representation ρ of the structure group G. This is a mapping ρ : G → R^{C×C} that satisfies ρ(gh) = ρ(g)ρ(h), where gh denotes the composition of transformations in G, and ρ(g)ρ(h) denotes multiplication of the matrices ρ(g) and ρ(h). The simplest examples are the trivial representation ρ(g) = 1, which describes the transformation behaviour of scalars, and ρ(g) = g, which describes the transformation behaviour of (tangent) vectors. A field that transforms like ρ will be called a ρ-field.
In Section 4 on Icosahedral CNNs, we will consider one more type of representation, namely the regular representation of C_6. The group C_6 can be described as the planar rotations by multiples of 2π/6, or as the integers 0, ..., 5 with addition mod 6. Features that transform like the regular representation of C_6 are 6-dimensional, with one component for each rotation. One can obtain a regular feature by taking a filter at p, rotating it by k · 2π/6 for k = 0, ..., 5, and matching each rotated filter against the input signal. When the gauge is changed, the filter and all rotated copies are rotated, and so the components of a regular feature are cyclically shifted. Hence, ρ(g) is a cyclic permutation matrix that shifts the coordinates by k steps for g = k · 2π/6. Further examples of representations that are useful in convolutional networks may be found in (Cohen & Welling, 2017; Weiler et al., 2018a; Thomas et al., 2018; Hy et al., 2018).
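The cyclic-shift structure of the regular representation is easy to verify numerically. A sketch with our own helper name rho_regular (not an API from the paper):

```python
import numpy as np

def rho_regular(k, n=6):
    """Regular representation of C_n: a cyclic permutation matrix that
    shifts the n orientation channels by k steps."""
    return np.roll(np.eye(n, dtype=int), k, axis=0)

# Homomorphism property: rho(a) @ rho(b) == rho(a + b mod 6).
for a in range(6):
    for b in range(6):
        assert np.array_equal(rho_regular(a) @ rho_regular(b),
                              rho_regular((a + b) % 6))

# Acting on a regular feature vector cyclically shifts its components.
f = np.arange(6)
assert np.array_equal(rho_regular(1) @ f, np.roll(f, 1))
```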
Now consider a stack of C_in input feature maps on M, which represents a C_in-dimensional ρ_in-field f (e.g. C_in = 1 for a single scalar, C_in = 2 for a vector, C_in = 6 for a regular feature, or any multiple of these, etc.). We will define a convolution operation that takes such a field and produces as output a C_out-dimensional ρ_out-field. For this we need a filter bank with C_out output channels and C_in input channels, which we will describe mathematically as a matrix-valued kernel K : R^2 → R^{C_out × C_in}.
We can think of K(v) as a linear map from the input feature space (“fiber”) at p to the output feature space at p, these spaces being identified with R^{C_in} resp. R^{C_out} by the choice of gauge at p. This suggests that we need to modify Eq. 1 to make sure that the kernel matrix K(v) is multiplied by a feature vector at p, not one at q_v. This is achieved by transporting f(q_v) to p along the unique geodesic connecting them (for points that are close enough, there is always a unique geodesic; since the kernel has local support, p and q_v will be close for all non-zero terms), before multiplying by K(v).
As f(q_v) is transported to p, it undergoes a transformation which will be denoted g_{p←q_v} (see Fig. 2). This transformation acts on the feature vector via the representation ρ_in. Thus, we obtain the generalized form of Eq. 1 for general fields:

(K ⋆ f)(p) = ∫_{R^2} K(v) ρ_in(g_{p←q_v}) f(q_v) dv.    (3)
Under a gauge transformation, we have:

w_p → w_p g_p,   v → g_p^{-1} v,   f(q_v) → ρ_in(g_{q_v})^{-1} f(q_v),   g_{p←q_v} → g_p^{-1} g_{p←q_v} g_{q_v}.    (4)
For K ⋆ f to be well defined as a ρ_out-field, we want it to transform like one, i.e. (K ⋆ f)(p) → ρ_out(g_p)^{-1} (K ⋆ f)(p). Or, in other words, the convolution should be gauge equivariant. This will be the case if and only if K satisfies:

K(g^{-1} v) = ρ_out(g)^{-1} K(v) ρ_in(g),   for all g in G.    (5)
This concludes our presentation of the general case. A gauge equivariant convolution on M is defined relative to a local gauge by Eq. 3, where the kernel satisfies the equivariance constraint of Eq. 5. By defining gauges on local charts that cover M and convolving inside each one, we automatically get a globally well-defined operation, because switching charts corresponds to a gauge transformation (Fig. 1), and the convolution is gauge equivariant.
On flat regions of the manifold, the exponential parameterization can be simplified to q_v = p + v if we use an appropriate local coordinate system. Moreover, in such a flat chart, parallel transport is trivial, i.e. g_{p←q_v} equals the identity. Thus, on a flat region, our convolution boils down to a standard convolution / correlation:

(K ⋆ f)(p) = ∫_{R^2} K(v) f(p + v) dv.    (6)
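As a sanity check of this special case, the discrete version of the flat-region formula is ordinary 2D cross-correlation, which can be written out directly (a generic sketch, not the paper's implementation):

```python
import numpy as np

def correlate2d_valid(f, K):
    """Plain 2D cross-correlation with 'valid' output size: at each output
    position, multiply the filter against the patch under it and sum."""
    kh, kw = K.shape
    H, W = f.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(K * f[i:i + kh, j:j + kw])
    return out

f = np.arange(16.0).reshape(4, 4)
K = np.array([[0.0, 1.0], [2.0, 3.0]])
out = correlate2d_valid(f, K)
assert out.shape == (3, 3)
assert out[0, 0] == 0 * 0 + 1 * 1 + 2 * 4 + 3 * 5   # = 24
```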
Moreover, we can recover group convolutions, spherical convolutions, and convolutions on other homogeneous spaces as special cases (see supplementary material; a formal proof will be given in a forthcoming theoretical paper).
The work presented in this paper brings together two heretofore disparate lines of research: equivariant and geometrical deep learning. In the following, we will survey the most closely related work in these fields.
Equivariant networks have been proposed for permutation-equivariant analysis and prediction of sets (Zaheer et al., 2017; Hartford et al., 2018), graphs (Kondor et al., 2018b; Hy et al., 2018; Maron et al., 2019), translations and rotations of the plane and 3D space (Oyallon & Mallat, 2015; Cohen & Welling, 2016, 2017; Marcos et al., 2017; Weiler et al., 2018b, a; Worrall et al., 2017; Worrall & Brostow, 2018; Winkels & Cohen, 2018; Winkens et al., 2018; Thomas et al., 2018; Bekkers et al., 2018), and the sphere (see below). Equivariance to finite groups was studied in (Ravanbakhsh et al., 2017). Equivariant CNNs can be defined for any homogeneous space, where the theory is now well understood. These models can be classified as regular G-CNNs (which use scalar and regular features) (Kondor & Trivedi, 2018) and steerable G-CNNs (which use general fields) (Cohen et al., 2018c, a). In this paper we generalize G-CNNs to general manifolds.
Geometric deep learning (Bronstein et al., 2017) is concerned with the generalization of (convolutional) neural networks to manifolds. Many definitions of manifold convolution have been proposed, and some of them (those called “intrinsic”) are gauge equivariant (although to the best of our knowledge, the relevance of gauge theory has not been observed before). However, these methods are all limited to particular feature types (typically scalar), and/or use a parameterization of the kernel that is not maximally flexible.
Masci et al. (2015) define a convolution that is essentially the same as our scalar-to-regular convolution, followed by a max-pooling over orientations, which in our terminology maps a regular field to a scalar field. As shown experimentally in (Cohen & Welling, 2016, 2017) and in this paper, it is often more effective to use convolutions that preserve orientation information. Another solution is to align the filter with the maximum curvature direction (Boscaini et al., 2016), but this approach is not intrinsic and does not work for flat surfaces or uniformly curved spaces like spheres.
(Poulenard & Ovsjanikov, 2018) define a multi-directional convolution for “directional functions” (somewhat similar to what we call regular fields), but they parameterize the kernel by a scalar function on the tangent space, which is very limited compared to our matrix-valued kernel (which is the most general kernel mapping fields to fields).
Besides the general theoretical framework of gauge equivariant convolution, we present in this paper a specific model (the Icosahedral CNN), which can be viewed as a fast and simple alternative to Spherical CNNs (Cohen et al., 2018b; Esteves et al., 2018; Boomsma & Frellsen, 2017; Su & Grauman, 2017; Perraudin et al., 2018; Jiang et al., 2018; Kondor et al., 2018a). Liu et al. (2019) use a spherical grid based on a subdivision of the icosahedron, and convolve over it using a method that is similar to the one presented in Sec. 4, but this method is not equivariant and does not take into account gauge transformations. We show in Sec. 5 that both are important for good performance.
To deeply understand gauge equivariant networks, we recommend studying the mathematics of gauge theory: principal fiber bundles (Schuller, 2016; Husemöller, 1994; Steenrod, 1951). The work presented in this paper can be understood as replacing the principal bundle used in G-CNNs over homogeneous spaces (Cohen et al., 2018a) by the frame bundle of M, which is another principal bundle.
In this section we will describe a concrete method for performing gauge equivariant convolution on the icosahedron. The very special shape of this manifold makes it possible to implement gauge equivariant convolution in a way that is both numerically convenient (no interpolation is required to rotate filters), and computationally efficient (the heavy lifting is done by a single conv2d call).
The icosahedron is a regular solid with 20 faces, 30 edges, and 12 vertices (see Fig. 4, left). It has 60 rotational symmetries. The set of global (orientation preserving) symmetries of the icosahedron will be denoted I. (As an abstract group, I is isomorphic to the alternating group A5, but we write I to emphasize that it is realized by a set of 3D rotations.)
Whereas general manifolds, and even spheres, do not admit completely regular and symmetrical pixelations, we can define an almost perfectly regular grid of pixels on the icosahedron. This grid is constructed through a sequence of grid-refinement steps. We begin with a grid consisting of the corners of the icosahedron itself. Then, for each triangular face, we subdivide it into 4 smaller triangles, thus introducing 3 new points on the center of the edges of the original triangle. This process is repeated r times to obtain a grid with N = 5 · 2^{2r+1} + 2 points (Fig. 4, left).
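The point count can be derived from the Euler characteristic of the sphere, since each refinement step quadruples the number of faces and edges. A quick check (the formula N = 5 · 2^{2r+1} + 2 is equivalent to 10 · 4^r + 2):

```python
# Each refinement step subdivides every triangle into 4, so faces and edges
# quadruple; the vertex count then follows from Euler's formula V - E + F = 2.
def ico_grid_size(r):
    faces = 20 * 4 ** r
    edges = 30 * 4 ** r
    return 2 + edges - faces           # V = 2 + E - F

assert ico_grid_size(0) == 12          # the bare icosahedron
assert all(ico_grid_size(r) == 5 * 2 ** (2 * r + 1) + 2 for r in range(8))
```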
Each grid point (pixel) in the grid has 6 neighbours, except for the 12 corners of the icosahedron, which have 5. Thus, one can think of the non-corner grid points as hexagonal pixels, and the corner points as pentagonal pixels.
Notice that the grid is perfectly symmetrical, which means that if we apply an icosahedral symmetry to a grid point, we will always again get a grid point. Thus, in addition to talking about gauge equivariance, for this manifold / grid, we can also talk about (exact) equivariance to global transformations (the 3D rotations in I). Because these global symmetries act by permuting the pixels and changing the gauge, one can see that a gauge equivariant network is automatically equivariant to global transformations. This will be demonstrated in Section 5.
We define an atlas consisting of 5 overlapping charts on the icosahedron, as shown in Fig. 4. Each chart is an invertible map φ_i : U_i → V_i, where U_i is a region of the icosahedron and V_i a region of the plane. The regions U_i and V_i are shown in Fig. 4. The maps themselves are linear on faces, and defined by hard-coded correspondences between the corner points in U_i and points in the plane.
Each chart covers all the points in 4 triangular faces of the icosahedron. Together, the 5 charts cover all 20 faces of the icosahedron.
We divide each chart into an exterior, consisting of border pixels, and an interior, consisting of pixels whose neighbours are all contained in the chart. In order to ensure that every pixel (except for the corners) is contained in the interior of some chart, we add a strip of pixels to the left and bottom of each chart, as shown in Fig. 4 (center). Then the interior of each chart (plus two exterior corners) has a nice rectangular shape of 2^r × 2^{r+1} pixels, and every non-corner pixel is contained in exactly one interior.
So if we know the values of the field in the interior of each chart, we know the whole field (except for the corners, which we ignore). However, in order to compute a valid convolution output at each interior pixel (assuming a hexagonal filter with one ring, i.e. a masked filter), we will still need the exterior pixels to be filled in as well (introducing a small amount of redundancy). See Sec. 4.6.1.
For the purpose of computation, we fix a convenient gauge in each chart. This gauge is defined in each V_i as the constant orthogonal frame (e_1, e_2), aligned with the x and y directions of the plane (just like the red and blue gauge in Fig. 1). When mapped to the icosahedron via (the Jacobian / pushforward of) the inverse chart map, the resulting frames are aligned with the grid, and the basis vectors make an angle of 2π/6.
Pixels in the intersection of charts (including all border pixels) are represented in both charts. Although the local frames are numerically constant and identical in both charts V_i and V_j, if we push them to the icosahedron via the two inverse chart maps, we may not get the same frame. In other words, when switching from chart i to chart j, there may be a gauge transformation g_{ji} in C_6, which rotates the frame at p (see Fig. 1).
A stack of feature fields is represented as an array of shape (B, C, R, 5, H, W), where B is the batch size, C the number of fields, R is the dimension of the fields (R = 1 for scalars and R = 6 for regular features), 5 is the number of charts, and H and W are the height and width of each local chart (H = 2^r + 2 and W = 2^{r+1} + 2 at resolution r, including a 1-pixel padding region on each side, see Fig. 4). We can always reshape such an array to shape (B, C · R, 5 · H, W), resulting in an array that can be viewed as a stack of rectangular feature maps of shape 5H × W. Such an array can be input to conv2d.
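A sketch of this memory layout (the sizes and the exact reshape shown are one workable option, chosen by us for illustration, and may differ from the paper's code):

```python
import numpy as np

# Hypothetical sizes at r = 4: B batch, C fields, R = 6 orientation channels,
# 5 charts of H x W pixels. Merging the field/orientation axes into channels
# and the chart axis into height yields an ordinary image batch for conv2d.
B, C, R, H, W = 2, 8, 6, 18, 34
f = np.zeros((B, C, R, 5, H, W))
f2d = f.reshape(B, C * R, 5 * H, W)    # adjacent axes merge contiguously
assert f2d.shape == (2, 48, 90, 34)
```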
Gauge equivariant convolution on the icosahedron is implemented in three steps: G-Padding, kernel expansion, and 2d convolution.
In a standard CNN, we can only compute a valid convolution output at positions where the filter fits inside the input image in its entirety. If the output is to be of the same size as the input, one uses zero padding. Likewise, the IcoConv requires padding, only now the padding border of a chart consists of pixels that are also represented in the interior of another chart (Sec. 4.3). So instead of zero padding, we copy the pixels from the neighbouring chart. We always use hexagonal filters with 1 ring, which can be represented as a 3 × 3 filter on a square grid, so we pad by 1 pixel.
As explained in Sec. 4.4, when transitioning between charts one may have to perform a gauge transformation on the features. Since scalars are invariant quantities, transition padding amounts to a simple copy in this case. Regular features (having 6 orientation channels) transform by cyclic shifts (Sec. 2.3), where the shift is determined by the relative gauge transformation g_{ji} between the charts (Fig. 4), so we must cyclically shift the channels up or down before copying to get the correct coefficients in the new chart. The whole padding operation is implemented by four indexing + assignment operations (top, bottom, left, right) using fixed pre-computed indices (see Supp. Mat.).
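A sketch of the transition rule for a single border pixel (transition_copy is our illustrative helper, not the paper's precomputed-index implementation):

```python
import numpy as np

# When a border pixel is copied in from a neighbouring chart whose frame is
# rotated by k * 60 degrees relative to ours, the six orientation channels of
# a regular feature must be cyclically shifted by k steps.
def transition_copy(feature, k):
    """feature: (6,) regular feature from the neighbouring chart;
    k: relative gauge rotation between the charts, in units of 60 degrees."""
    return np.roll(feature, k)

f = np.arange(6)
assert np.array_equal(transition_copy(f, 0), f)                       # plain copy
assert np.array_equal(transition_copy(transition_copy(f, 2), -2), f)  # invertible
```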
For the convolution to be gauge equivariant, the kernel must satisfy Eq. 5. The kernel is stored in an array of learnable parameters with spatial shape 3 × 3, with the top-right and bottom-left pixel of each filter fixed at zero so that it corresponds to a 1-ring hexagonal kernel.
Eq. 5 says that if we linearly combine the input channels (columns) by ρ_in(g) and the output channels (rows) by ρ_out(g)^{-1}, the result should equal the original kernel with each channel rotated by g. This is the case if we use the weight-sharing scheme shown in Fig. 6.
Weight sharing can be implemented in two ways. One can construct a basis of kernels, each of which has spatial shape 3 × 3 and has value 1 at all pixels of a certain color/shade, and 0 elsewhere. Then one can construct the full kernel by linearly combining these basis filters using learned weights (one for each input/output channel and basis kernel) (Cohen & Welling, 2017). Alternatively, for scalar and regular features, one can use a set of precomputed indices to expand the kernel as shown in Fig. 6, using a single indexing operation.
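A sketch of kernel expansion for a scalar-to-regular filter, using an illustrative indexing convention of our own (not the paper's): seven free weights (one center, six ring) are expanded into six rotated 3 × 3 copies, one per output orientation.

```python
import numpy as np

# Hex 1-ring embedded in a 3x3 grid: the top-right and bottom-left cells are
# unused (zero). RING lists the six neighbour cells in 60-degree rotation
# order, so rotating the filter cyclically shifts the ring weights.
RING = [(1, 2), (0, 1), (0, 0), (1, 0), (2, 1), (2, 2)]

def hex_kernel(center, ring, k=0):
    """Place 7 shared weights into a 3x3 filter, rotated by k * 60 degrees."""
    K = np.zeros((3, 3))
    K[1, 1] = center
    for i, (a, b) in enumerate(RING):
        K[a, b] = ring[(i - k) % 6]
    return K

w_center, w_ring = 0.5, [1, 2, 3, 4, 5, 6]
# Expand one filter into 6 rotated copies (one per output orientation).
expanded = np.stack([hex_kernel(w_center, w_ring, k) for k in range(6)])
assert expanded.shape == (6, 3, 3)
assert all(K[0, 2] == 0 and K[2, 0] == 0 for K in expanded)   # hex mask kept
assert np.isclose(expanded.sum(axis=(1, 2)).std(), 0)         # weights shared
```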
The complete algorithm can be summarized as

GConv(f) = conv2d(GPad(f), Expand(w)),

where f and GPad(f) both have shape (B, C_in · R_in, 5 · H, W), the weights w contain one free parameter per hexagonal kernel position (7 for a 1-ring filter) per input/output channel pair, and the expanded kernel Expand(w) has shape (C_out · R_out, C_in · R_in, 3, 3). The output of GConv has shape (B, C_out · R_out, 5 · H, W).
In order to validate our implementation, highlight the potential benefits of our method, and determine the necessity of each part of the algorithm, we perform a number of experiments with the MNIST dataset, projected to the icosahedron.
We generate three different versions of the training and test sets, differing in the transformations applied to the data. In the N condition, No rotations are applied to the data. In the I condition, we apply all Icosahedral symmetries (rotations) to each digit. Finally, in the R condition, we apply random continuous rotations to each digit. All signals are represented as explained in Sec. 4.5 / Fig. 4 (right), using resolution r = 4, i.e. as an array of shape (B, 1, 1, 5, 18, 34).
We evaluate the full model, which uses one gauge equivariant scalar-to-regular convolution layer, followed by regular-to-regular (R2R) layers and FC layers (see Supp. Mat. for architectural details). We also evaluate a model that uses only scalar-to-regular (S2R) convolution layers, followed by orientation pooling (a max over the orientation channels of each regular feature, thus mapping a regular feature to a scalar), as in (Masci et al., 2015). Finally, we consider a model that uses only rotation-invariant filters, i.e. scalar-to-scalar (S2S) convolutions, similar to standard graph CNNs (Boscaini et al., 2015; Kipf & Welling, 2017).
In addition, we perform an ablation study where we disable each part of the algorithm. The first baseline is obtained from the full R2R network by disabling gauge padding (Sec. 4.6.1), and is called the No Pad (NP) network. In the second baseline, we disable the kernel Expansion (Sec. 4.6.2), yielding the NE condition. The third baseline, called NP+NE uses neither gauge padding nor kernel expansion, and amounts to a standard CNN applied to the same input representation. We adapt the number of channels so that all networks have roughly the same number of parameters.
Table 1: Test accuracy of each architecture under the six train/test conditions N/N, N/I, N/R, I/I, I/R, and R/R.
As shown in Table 1, icosahedral CNNs achieve excellent performance, with a test accuracy of up to 99.43%, which is a strong result even on planar MNIST for non-augmented and non-ensembled models. The full R2R model performs best in all conditions (though not significantly in the N/N condition), showing that both gauge padding and kernel expansion are necessary, and that our general (R2R) formulation works better in practice than using scalar fields (S2S or S2R). We notice also that non-equivariant models (NP+NE, NP, NE) do not generalize well to transformed data, a problem that is only partly solved by data augmentation. On the other hand, the models S2S, S2R, and R2R are exactly equivariant to the icosahedral symmetries, and so generalize perfectly to icosahedrally transformed test data, even when these were not seen during training. None of the models automatically generalize to continuously rotated inputs (R), but the equivariant models are closer, and can get even closer when using data augmentation during training.
We evaluate our method on the climate pattern segmentation task proposed by Mudigonda et al. (2017). The goal is to segment extreme weather events (Atmospheric Rivers (AR) and Tropical Cyclones (TC)) in data from global climate simulations.
We use the exact same data and evaluation methodology as (Jiang et al., 2018). The preprocessed data released by (Jiang et al., 2018) consists of 16-channel spherical images, which we reinterpret as icosahedral signals (introducing slight distortion). See (Mudigonda et al., 2017) for a detailed description of the data.
We compare an R2R and an S2R model (details in Supp. Mat.). As shown in Table 2, our models outperform both competing methods in terms of per-class and mean accuracy. The difference between our R2R and S2R model seems small in terms of accuracy, but when evaluated in terms of mean average precision (a more common and appropriate evaluation metric for segmentation tasks), the R2R model clearly outperforms.
Table 2: Climate pattern segmentation accuracy (%) per class, mean accuracy, and mAP.

Model | BG | TC | AR | Mean | mAP
Mudigonda et al. | 97 | 74 | 65 | 78.67 | -
Jiang et al. | 97 | 94 | 93 | 94.67 | -
For our final experiment, we evaluate icosahedral CNNs on the 2D-3D-S dataset (Armeni et al., 2017), which consists of 1413 omnidirectional RGB+D images with pixelwise semantic labels in 13 classes. Following Jiang et al. (2018), we sample the data on an icosahedral grid using bilinear interpolation, while using nearest-neighbour interpolation for the labels. Evaluation is performed by mean intersection over union (mIoU) and pixel accuracy (mAcc).
The network architecture is a residual (He et al., 2016) U-net (Ronneberger et al., 2015) with regular-to-regular convolutions. The network consists of a downsampling and an upsampling network. The downsampling network takes as input a signal at the full input resolution and outputs feature maps at successively lower resolutions, with the number of channels increasing as the resolution decreases. The upsampling network is the reverse of this. We perform a pooling over orientation channels at the end, right before applying softmax.
Table 3: Semantic segmentation results on 2D-3D-S, compared to (Jiang et al., 2018).
In this paper we have presented the general theory of gauge equivariant convolutional networks on manifolds, and demonstrated their utility in a special case: learning with spherical signals using the icosahedral CNN. We have demonstrated that this method performs very well on a range of different problems and is highly scalable. The results further show that our general formulation using regular feature fields has benefits over using scalar fields as is commonly done in geometric deep learning today. And finally, our ablation study shows that each part of the algorithm is required for good performance.
Although we have only touched on the connections to physics and geometry, there are indeed interesting and deep connections, which we plan to elaborate on in a future publication. From the perspective of the mathematical framework of principal fiber bundles, our definition of manifold convolution is entirely natural. Indeed, on general manifolds, gauge equivariance is not just “nice to have” but necessary in order for the convolution to be geometrically well-defined.
In future work, we hope to apply the theory to more general manifolds. Additionally, we believe that our chart-based approach to convolution on manifolds can in principle scale to even bigger problems by using smaller charts, thus opening the door to learning from high-resolution planetary scale spherical signals that arise in the earth and climate sciences, as well as cosmology.
We would like to thank Chiyu “Max” Jiang and Mayur Mudigonda for help obtaining and interpreting the climate data, and Erik Verlinde for helpful discussions. The climate dataset released by Jiang et al. (2018) and the Stanford 2D-3D-S dataset were downloaded and evaluated by QUvA researchers.
Armeni, I., et al. Joint 2D-3D-Semantic Data for Indoor Scene Understanding. 2017.
For more information on manifolds, fiber bundles, connections, parallel transport, the exponential map, etc., we highly recommend the lectures by Schuller (2016), as well as the book by Nakahara (2003); both explain these concepts very clearly and at a useful level of abstraction.
From the perspective of the theory of principal fiber bundles, our work can be understood as follows. A fiber bundle is a space $E$ consisting of a base space $M$ (the manifold in our paper), with at each point $x \in M$ a space $E_x$ called the fiber at $x$. The bundle is defined in terms of a projection map $\pi : E \to M$, which determines the fibers as $E_x = \pi^{-1}(x)$. A principal bundle $P$ is a fiber bundle where the fiber carries a transitive and free right action of a group $G$ (the structure group).
One can think of the fiber $P_x$ of a principal bundle as a (generalized) space of frames at $x$. Due to the free and transitive action of $G$ on $P_x$, we have that $P_x$ is isomorphic to $G$ as a $G$-space, meaning that it looks like $G$ except that it does not have a distinguished origin or identity element as $G$ does (i.e. there is no natural choice of frame).
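The torsor property (a free and transitive action) can be checked concretely for a small structure group. Here is a minimal sketch for the cyclic group of 6 rotations, which is the planar rotation group relevant to the hexagonal grids of the icosahedral CNN; the encoding of the group as integers mod 6 is of course just illustrative:

```python
# Torsor property of Z6: the right action of G on a fiber P_x ~ G is
# free and transitive, i.e. for any two "frames" p, q there is exactly
# one group element g with p * g = q.
G = list(range(6))  # Z6, encoded as integers mod 6

def act(p, g):
    """Right action of g on a frame p (addition mod 6)."""
    return (p + g) % 6

for p in G:
    for q in G:
        transporters = [g for g in G if act(p, g) == q]
        assert len(transporters) == 1  # free (at most 1) and transitive (at least 1)
```

Because there is exactly one transporter for every pair of frames, the fiber looks like a copy of the group with no preferred identity element, exactly as described above.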
A gauge transformation is then defined as a principal bundle automorphism, i.e. a map $h : P \to P$ that maps fibers to fibers in a $G$-equivariant manner. Sometimes the automorphism is required to fix the base space, i.e. to project down to the identity map on $M$ via $\pi$. Such an automorphism will map each fiber onto itself, so it restricts to a $G$-space automorphism on each fiber.
Given a principal bundle $P$ and a vector space $V$ with a representation $\rho$ of $G$, we can construct the associated bundle $P \times_\rho V$, whose elements are the equivalence classes of the following equivalence relation on $P \times V$: $(p, v) \sim (p g, \rho(g^{-1}) v)$ for all $g \in G$.
The associated bundle $P \times_\rho V$ is a fiber bundle over the same base space as $P$, with fiber isomorphic to $V$.
A (matter) field is described as a section of the associated bundle $P \times_\rho V$, i.e. a map $s : M \to P \times_\rho V$ that satisfies $\pi(s(x)) = x$. Locally, one can describe a section as a function $f : U \to V$ (as we do in the paper), but globally this is not possible unless the bundle is trivial.
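Written out in a local trivialization, the coefficient function of a section transforms with the representation under a change of gauge. The following is the standard transformation rule (conventions for where the inverse appears vary between references):

```latex
f(x) \;\longmapsto\; \rho\!\left(g(x)^{-1}\right) f(x),
\qquad g : U \to G ,
```

where $g$ is the local description of the gauge transformation on the chart domain $U$. It is this transformation law that the kernel constraint of the gauge equivariant convolution must respect.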
The group of automorphisms of $P$ (gauge transformations) acts on the space of fields (sections of the associated bundle). It is this group that we wish to be equivariant to.
From this mathematical perspective, our work amounts to replacing the principal bundle used in the work on regular and steerable G-CNNs of Cohen et al. (2018a, c) (it is more common to use the letter $G$ for the supergroup and $H$ for the subgroup, but that leads to a principal $H$-bundle $G \to G/H$, which is inconsistent with the main text, where we use a principal $G$-bundle) by another principal bundle, namely the frame bundle of $M$. Hence, this general theory can describe in a unified way the most prominent and geometrically natural methods of geometric deep learning (Masci et al., 2015; Boscaini et al., 2016), as well as all G-CNNs on homogeneous spaces.
Indeed, if we build a gauge equivariant CNN on a homogeneous space (e.g. the sphere $S^2$), it will (under mild conditions) automatically be equivariant to the left action of the global symmetry group as well. To see this, note that the left action of the group on itself (the total space of the principal bundle) can be decomposed into an action on the base space (permuting the fibers), and an action on the fibers (cosets) that factors through the structure group (see e.g. Sec. 2.1 of Cohen et al. (2018c)). The action on the base space preserves the local neighbourhoods from which we compute filter responses, and equivariance to the action on the fibers is ensured by the kernel constraint. Since G-CNNs (Cohen et al., 2018a) and gauge equivariant CNNs employ the most general equivariant map, we conclude that they are indeed the same for such bundles. Thus, “gauge theory is all you need”. (We plan to expand this argument in a future paper.)
Most modern theories of physics are gauge theories, meaning they are based on this mathematical framework. In such theories, any construction is required to be gauge invariant (i.e. the coefficients must be gauge equivariant), for otherwise the predictions will depend on the way in which we choose to represent physical quantities. This logic applies not just to physics theories, but, as we have argued in the paper, also to neural networks and other models used in machine learning. Hence, it is only natural that the same mathematical framework is applicable in both fields.
Our main model consists of convolution layers followed by linear layers. The first layer is a scalar-to-regular gauge equivariant convolution layer, and the following layers are regular-to-regular layers, with the number of output channels and the stride varying per layer.
In between convolution layers, we use batch normalization (Ioffe & Szegedy, 2015) and ReLU nonlinearities. When using batch normalization, we average over the group of feature maps belonging to each regular field, to make sure the operation is equivariant. Any pointwise nonlinearity is equivariant, because we use only trivial and regular representations, which are realized by permutation matrices.
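Both facts are easy to verify numerically. The following is a small numpy sketch (illustrative only, not the paper's implementation): a pointwise nonlinearity commutes with any permutation matrix, and a batch norm whose statistics are computed jointly over the orientation channels of one regular field commutes with the cyclic channel shift of the regular representation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=6)  # one regular feature: 6 orientation channels

# Regular representation of a rotation: cyclic permutation of the channels
P = np.roll(np.eye(6), 1, axis=0)

relu = lambda v: np.maximum(v, 0.0)

# A pointwise nonlinearity commutes with a permutation matrix
assert np.allclose(relu(P @ x), P @ relu(x))

def field_norm(v, eps=1e-5):
    """Normalize using statistics shared across the orientation channels
    of one regular field; shared statistics are permutation-invariant."""
    return (v - v.mean()) / np.sqrt(v.var() + eps)

assert np.allclose(field_norm(P @ x), P @ field_norm(x))
```

Per-channel statistics would break the second assertion, which is why the averaging over each field's group of feature maps is needed.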
After the convolution layers, we perform global pooling over spatial and orientation channels, yielding an invariant representation. We map this representation through 3 FC layers before applying softmax.
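The invariance of this global pooling can be seen in a short numpy sketch (array sizes are hypothetical): a global rotation cyclically shifts the orientation axis and permutes the spatial positions, and a mean over both axes is unchanged by either operation. An arbitrary permutation stands in for the specific pixel permutation a rotation would induce.

```python
import numpy as np

rng = np.random.default_rng(1)
# Feature map: (fields, orientations, positions) -- sizes are illustrative
f = rng.normal(size=(4, 6, 20))

def global_pool(feat):
    """Average over orientation and spatial axes -> one value per field."""
    return feat.mean(axis=(1, 2))

# A (simplified) global rotation: shift orientations, permute positions
perm = rng.permutation(20)
f_rot = np.roll(f, shift=1, axis=1)[:, :, perm]

assert np.allclose(global_pool(f), global_pool(f_rot))
```

Since the pooled descriptor is invariant, the FC layers and softmax that follow automatically produce invariant predictions.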
The other models are obtained from this one by replacing the convolution layers by scalar-to-regular + orientation pooling or scalar-to-scalar layers, or by disabling G-padding and/or kernel expansion, always adjusting the number of channels to keep the number of parameters roughly the same.
The model was trained for a fixed number of epochs, or equivalently one epoch of the augmented dataset (where each instance is transformed by each icosahedron symmetry, i.e. each rotation).
For the climate experiments, we used a U-net with regular-to-regular convolutions. The first layer is a scalar-to-regular convolution with 16 output channels. The downsampling path consists of strided regular-to-regular layers with increasing numbers of output channels, and maps the input signal to a coarser resolution.
The decoder is the reverse of the encoder in terms of resolution and number of channels. Upsampling is performed by bilinear interpolation (which is exactly equivariant), before each convolution layer (which uses stride 1). As usual in the U-net architecture, each layer in the upsampling path takes as input the output of the previous layer, as well as the output of the encoder path at the same resolution.
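On the icosahedral grid, the paper uses bilinear interpolation for upsampling, which is exactly equivariant. As a toy analogue on a square grid (a sketch, not the actual icosahedral operation), factor-2 upsampling by pixel repetition commutes exactly with a flip symmetry:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(8, 8))

def upsample2(a):
    """Factor-2 upsampling by pixel repetition (toy stand-in for the
    bilinear upsampling used on the icosahedral grid)."""
    return np.repeat(np.repeat(a, 2, axis=0), 2, axis=1)

# Upsampling commutes with the flip symmetry -> exactly equivariant
assert np.allclose(upsample2(x[:, ::-1]), upsample2(x)[:, ::-1])
```

Because interpolation-based upsampling respects the grid symmetry exactly, the decoder does not degrade the equivariance established by the convolution layers.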
Each convolution layer is followed by equivariant batchnorm and ReLU.
The model was trained for a fixed number of epochs with a fixed batch size.
For the 2D-3D-S experiments, we used a residual U-Net with the following architecture.
The input layer is a scalar-to-regular layer with 8 channels, followed by batch norm and ReLU. Then we apply 4 residual blocks with 16, 32, 64, 64 output channels, each of which uses stride=2. In the upsampling stream, we use 32, 16, 8, and 8 channels for the residual blocks, respectively. Each upsampling layer receives input from the corresponding downsampling layer, as well as from the previous layer. Upsampling is performed using bilinear interpolation, and downsampling by hexagonal max pooling.
The input signal is downsampled to a coarser resolution by the downsampling stream.
Each residual block consists of a convolution, batch normalization, a skip connection, and a ReLU.
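The block structure described above can be sketched as follows. This is a minimal numpy mock-up in which a channel-mixing matrix stands in for the gauge equivariant convolution; all names and sizes are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(3)

def batchnorm(x, eps=1e-5):
    """Per-channel normalization over the spatial axis."""
    return (x - x.mean(axis=1, keepdims=True)) / np.sqrt(
        x.var(axis=1, keepdims=True) + eps)

def residual_block(x, W):
    """conv (here: 1x1 channel mixing) -> batchnorm -> skip -> ReLU."""
    y = W @ x                  # stand-in for the gauge equivariant convolution
    y = batchnorm(y)
    y = y + x                  # skip connection
    return np.maximum(y, 0.0)  # ReLU

C, N = 8, 30                   # channels, spatial positions (illustrative)
x = rng.normal(size=(C, N))
W = rng.normal(size=(C, C)) / np.sqrt(C)
out = residual_block(x, W)
assert out.shape == (C, N) and (out >= 0).all()
```

In the real network the channel mixing is the gauge equivariant convolution, and strided blocks additionally change the resolution, so the skip path must be pooled or projected to match.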