1 Covariance and Uniqueness
It is well known that the principle of covariance, or coordinate independence, lies at the heart of the theory of relativity. The theory of special relativity was constructed to describe Maxwell’s theory of electromagnetism in a way that satisfies the special principle of covariance, which states “If a system of coordinates $K$ is chosen so that, in relation to it, physical laws hold good in their simplest form, the same laws hold good in relation to any other system of coordinates $K'$ moving in uniform translation relatively to $K$” Einstein (1916).
The transformation between $K$ and $K'$, in other words between different inertial frames, can always be achieved through an element of the (global) Lorentz group. With the benefit of hindsight, there is no good reason why physics should only be covariant under a global change of coordinates. Indeed, soon after the development of special relativity, Einstein started to develop his ideas for a theory that is covariant with respect to a local, spacetime-dependent change of coordinates. In his words, the general principle of covariance states that “The general laws of nature are to be expressed by equations which hold good for all systems of coordinates, that is, are covariant with respect to any substitutions whatever (generally covariant)” Einstein (1916). The rest is history: the incorporation of the mathematics of Riemannian geometry in order to achieve general covariance, and the formulation of the general relativity (GR) theory of gravity. It is important to note that the seemingly innocent assumption of general covariance is in fact so powerful that it determines GR as the unique theory of gravity compatible with this principle, and the equivalence principle in particular, up to short-distance corrections.^{1}
[Footnote 1: The uniqueness argument roughly goes as follows. In order to achieve full general covariance, the only ingredients for the gravitational part of the action are the Riemann tensor and its derivatives, with all the indices contracted. From a simple scaling argument one can show that all of them apart from the Ricci scalar have subleading effects at large distances. Note that this universality makes no assumptions on the matter fields that may be coupled to gravity.]
In a completely different context, it has become clear in recent years that a coordinate-independent description is also desirable for convolutional networks. A covariant inference process is particularly useful in situations where the distribution of characteristic patterns is symmetric. Important practical examples include satellite imagery and biomedical microscopy imagery, which often do not exhibit a preferred global rotation or chirality. In order to ensure that the inferred information of a network is equivalent for transformed samples, the network architecture has to be designed to be equivariant [Footnote 2: In this article we will use the words “equivariant” and “covariant” interchangeably as they convey the same concept.] under the corresponding group action [Footnote 3: To define the equivariance of a map requires the definition of a group action on the domain and codomain. One specific choice of group action is the co/contravariant transformation of tensors.]. A wide range of equivariant models has been proposed for signals on flat Euclidean spaces $\mathbb{R}^d$. In particular, equivariance w.r.t. (subgroups of) the Euclidean groups $E(d)$ of translations, rotations and mirrorings of $\mathbb{R}^d$ has been investigated for planar images ($d=2$) Cohen & Welling (2016, 2017); Worrall et al. (2017); Weiler et al. (2018b); Hoogeboom et al. (2018) and volumetric signals ($d=3$) Winkels & Cohen (2018); Worrall & Brostow (2018); Weiler et al. (2018a), and has generally been found to outperform non-equivariant models in accuracy and data efficiency. Equivariance has further proven to be a powerful principle in generalizing convolutional networks to more general spaces like the sphere Cohen et al. (2018b). In general, it has been shown in Kondor & Trivedi (2018); Cohen et al. (2018c, a) that (globally) equivariant networks can be generalized to arbitrary homogeneous spaces $H/G$, where $G \leq H$ is a subgroup [Footnote 4: Note that we are using an inverted definition of $G$ and $H$ w.r.t. the original paper to stay compatible with the convention used in Cohen et al. (2019), discussed below.]. The feature spaces of such networks are formalized as spaces of sections of vector bundles over $H/G$, associated to the principal bundle $H \to H/G$. Our previous examples are in this setting interpreted as $E(d)$-equivariant networks on Euclidean space $\mathbb{R}^d = E(d)/O(d)$ and $SO(3)$-equivariant networks on the sphere $S^2 = SO(3)/SO(2)$. This description includes Poincaré-equivariant networks on Minkowski spacetime, since Minkowski space arises as the quotient of the Poincaré group w.r.t. the Lorentz group.

Note that the change of coordinates required here is a global one. Global symmetries are extremely natural and readily applicable when the underlying space is homogeneous, i.e. the group action is transitive, meaning the space contains only a single orbit. At the same time, it is clearly desirable to have an effective CNN on an arbitrary surface, which is often not equipped with a global symmetry. If the previous work on homogeneous spaces is based on an equivariance requirement analogous to the special principle of covariance, then what one needs for general surfaces is an analogue of the general principle of covariance. In other words, we would like to have covariance with respect to local, location-dependent coordinate transformations.
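As a small illustration of the global equivariance discussed above, the following sketch checks numerically that a planar filter commutes with 90-degree rotations exactly when its kernel is itself invariant under that rotation. The setup is an assumption made for illustration only (a scalar image on a periodic square grid, a 3x3 kernel, and the hypothetical helper names `circular_correlate` and `is_rot90_equivariant`); it is not the construction of any of the cited papers.

```python
import numpy as np

def circular_correlate(image, kernel):
    """Cross-correlate a square image with a 3x3 kernel, using periodic boundaries."""
    out = np.zeros(image.shape, dtype=float)
    for a in (-1, 0, 1):
        for b in (-1, 0, 1):
            # After rolling by (-a, -b), entry (i, j) holds image[i + a, j + b].
            out += kernel[a + 1, b + 1] * np.roll(image, shift=(-a, -b), axis=(0, 1))
    return out

def is_rot90_equivariant(image, kernel):
    """Check that rotating the input by 90 degrees rotates the output in the same way."""
    rotate_then_filter = circular_correlate(np.rot90(image), kernel)
    filter_then_rotate = np.rot90(circular_correlate(image, kernel))
    return bool(np.allclose(rotate_then_filter, filter_then_rotate))

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))

# A kernel that is unchanged by a 90-degree rotation (discrete Laplacian).
symmetric_kernel = np.array([[0.0, 1.0, 0.0],
                             [1.0, -4.0, 1.0],
                             [0.0, 1.0, 0.0]])
# A kernel without that symmetry.
generic_kernel = np.arange(9, dtype=float).reshape(3, 3)

print(is_rot90_equivariant(image, symmetric_kernel))
print(is_rot90_equivariant(image, generic_kernel))
```

The symmetric kernel passes the check and the generic one does not, which is the scalar-field special case of the kernel constraints that equivariant networks impose.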
This requirement for local transformation equivariance of convolutional networks on general manifolds has been recognized and described in Cohen et al. (2019). A choice of local coordinates is thereby formalized as a gauge of the tangent space [Footnote 5: The gauge is equivalent to choosing a basis of the tangent space $T_pM$ by mapping the standard basis of $\mathbb{R}^d$ to $T_pM$. Explicitly, a coefficient vector in $\mathbb{R}^d$ determines a vector in $T_pM$. Coordinate bases and frame (vielbein) bases are examples of choices of the gauge.]. Similar to the general theory of equivariant networks on homogeneous spaces, the feature fields of these networks are realized as sections of vector bundles over $M$, this time associated to the frame bundle of the manifold. Local transformations are described as position-dependent gauge transformations, each given by an element of the structure group. When the frame bundle is chosen to be the orthonormal frame bundle, the structure group is reduced to $O(d)$, and in our analogy this corresponds to the vierbein formulation of GR, where the group is the Lorentz group.
Note that the parallel between the two problems regarding general covariance forces us to employ the same mathematical language of (pseudo-)Riemannian geometry. Interestingly, we will argue in the next section that our formalism is basically unique once general covariance, along with some basic assumptions, is demanded. This can be compared with the long-distance uniqueness of GR once covariance is required.
2 The Covariant Convolution
In CNNs we are interested in devising a linear map between the input feature space and the output feature space of every pair of subsequent layers in the convolutional network. In this section we will argue that the four properties of 1) linearity, 2) locality, 3) covariance, and 4) weight sharing are sufficient to uniquely determine its form, which we give in (4) and (5) below.
In mathematical terms, we describe the feature space in the $i$-th layer in terms of a fiber bundle over the manifold $M$ whose fiber is a vector space, and which is associated to the principal bundle through a representation of the structure group, together with its projection onto $M$. The bundle structure captures the transformation properties of the feature fields under a change of coordinates of the manifold $M$. For now we focus on a subregion of the surface which admits a single coordinate chart and a local trivialisation of the bundles. In this language, the feature field corresponds to a local section [Footnote 6: Without the risk of causing confusion we will sometimes consider the feature field as a map from the local region to the fiber, where implicitly we have used the local trivialisation to write the section as a pair of base point and fiber element, and ignored the first entry.] of the fiber bundle, and the linear map is between the spaces of sections of the input and output bundles. Moreover, we require the linear map to satisfy the following locality condition: given the distance function on the manifold $M$, which in our case will be supplied by the metric, the output at a point $x$ vanishes whenever the input vanishes at all points $y$ within distance $R$ of $x$, for some fixed positive number $R$. The linearity and the locality of the map immediately lead to the following form of the map. To illustrate this, consider the simplified scenario in which the input and output features are just numbers (scalars) which do not transform under coordinate transformations, and $M$ is replaced with a finite set equipped with the distance function. The above requirements then immediately lead to the matrix form of the map, in which each output entry is a weighted sum of the input entries within distance $R$. Similarly, for our case we are led to the linear map
(1) 
where the integration runs over the ball centered at $x$ with radius $R$, and the integrand contains what will turn out to be the convolution kernel.
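The simplified finite-set scenario described above can be sketched as follows. The choices here are assumptions for illustration: the finite set is taken to be $n$ points on a cycle, the locality condition is read with a strict inequality $d(x,y) < R$, and the helper names `cyclic_distance` and `local_linear_map` are hypothetical.

```python
import numpy as np

def cyclic_distance(n):
    """Pairwise distances between n points on a cycle (a toy finite stand-in for M)."""
    idx = np.arange(n)
    diff = np.abs(idx[:, None] - idx[None, :])
    return np.minimum(diff, n - diff)

def local_linear_map(weights, distance, radius):
    """Keep only matrix entries k[x, y] with d(x, y) < radius; all others are zero."""
    return np.where(distance < radius, weights, 0.0)

rng = np.random.default_rng(0)
n, radius = 10, 2
dist = cyclic_distance(n)
k = local_linear_map(rng.normal(size=(n, n)), dist, radius)

# An input that vanishes on the ball around x must give zero output at x.
x = 3
f_in = rng.normal(size=n)
f_in[dist[x] < radius] = 0.0
f_out = k @ f_in
print(f_out[x])
```

Linearity gives the matrix, and locality forces its entries to vanish outside the ball, which is the discrete counterpart of an integral over a ball against a kernel.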
In the next step we will impose the condition of general covariance to restrict the form of the kernel. In the case of homogeneous spaces, and when we require just special covariance, we can phrase the problem in the following general form. Suppose that the input feature and the output feature form representations $\rho_{\rm in}$ and $\rho_{\rm out}$ of a group. It is then clear from the consistency of the action with the above map that the kernel must transform with $\rho_{\rm out}(g)$ acting from the left and $\rho_{\rm in}(g)^{-1}$ acting from the right, and this is precisely what is described in Cohen et al. (2018a, 2019). Once we promote the group element to be location-dependent, the analogous requirement involves the group element at the output point on the left and the group element at the input point on the right. In our case, the group under discussion is that of local changes of coordinates [Footnote 7: In this proceeding we mainly work with the coordinate basis of the tangent space, while another common choice is the orthonormal (vielbein) basis. The former has the advantage of being directly related to the covariance in the context of GR and the latter has the advantage of being closer to the philosophy of gauge theory.], with the corresponding consistent change of the metric. Note that this is not just a mathematical formality: one needs to deal with changes of coordinates when working with manifolds that cannot be covered with one coordinate chart, such as a sphere.
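The consistency argument for the global case can be checked numerically in a toy setting. The sketch below assumes the standard rule that the kernel is multiplied by the output representation on the left and by the inverse input representation on the right; the specific representations (two rotation representations of the planar rotation group) and all variable names are illustrative assumptions.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
theta = 0.7
rho_in = rotation(theta)       # assumed input representation: ordinary planar vectors
rho_out = rotation(2 * theta)  # assumed output representation: rotates twice as fast

kernel = rng.normal(size=(2, 2))  # kernel value relating one input point to one output point
f_in = rng.normal(size=2)

# Transform the kernel and the input, then compare with the transformed output.
kernel_transformed = rho_out @ kernel @ np.linalg.inv(rho_in)
lhs = kernel_transformed @ (rho_in @ f_in)
rhs = rho_out @ (kernel @ f_in)
print(np.allclose(lhs, rhs))
```

In the local case the two factors would be evaluated at different points, which is what motivates the use of parallel transport in the text that follows.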
However, it is unwieldy to work with group elements at two different points. Instead, we would like to encode the information in another way so that we can work with gauge/coordinate transformations at one single point when talking about the transformation of the kernel. Here the relevant concept is parallel transport. Given a bundle with connection on $M$ and a path $\gamma$ connecting two points, for every element of the fiber at the starting point there is a unique section along $\gamma$ that is flat along $\gamma$ and takes that value at the starting point. In coordinates, this means that the covariant derivative of the section along the tangent of $\gamma$ vanishes at every point of the path. Note that parallel transport is generically path-dependent; in other words, transporting a vector between the same two points along different paths yields different results unless the bundle is flat.
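The path dependence of parallel transport can be made concrete on the unit sphere with its Levi-Civita connection, where transport along a great-circle arc is a rigid rotation about the axis perpendicular to the plane of the arc. The sketch below is an illustration under that assumption; the function name `transport_along_great_circle` and the chosen points are hypothetical.

```python
import numpy as np

def transport_along_great_circle(p, q, v):
    """Parallel transport the tangent vector v from p to q along the shorter great-circle arc.

    Assumes p and q are distinct, non-antipodal unit vectors and v is tangent at p.
    """
    axis = np.cross(p, q)
    axis = axis / np.linalg.norm(axis)
    angle = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    # Rodrigues' rotation formula for rotating v about `axis` by `angle`.
    return (v * np.cos(angle)
            + np.cross(axis, v) * np.sin(angle)
            + axis * np.dot(axis, v) * (1.0 - np.cos(angle)))

north = np.array([0.0, 0.0, 1.0])
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
v_start = np.array([1.0, 0.0, 0.0])  # tangent to the sphere at the north pole

# Path 1: north pole -> a -> b (two geodesic segments).
v_via_a = transport_along_great_circle(a, b, transport_along_great_circle(north, a, v_start))
# Path 2: north pole -> b directly (one geodesic segment).
v_direct = transport_along_great_circle(north, b, v_start)

print(v_via_a, v_direct)
```

The two results are both tangent at `b` but differ by a quarter turn, reflecting the curvature enclosed between the two paths.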
However, in our application we always have a uniquely distinguished path between the two points in practice. Namely, in the CNN context we let the ball containing the support of the kernel be so small that every point in the ball is connected to the center by a unique geodesic. We hence replace the input feature at each point of the ball with its parallel transport along the unique geodesic from that point to the center point. With the corresponding new kernel, we arrive at the transformation property
(2) 
In fact, this geodesic description of the points provides us with an alternative, convenient way to parametrise the points we integrate over. Let $v$ be a vector in the tangent space $T_xM$ at $x$. There is a unique geodesic flow starting from $x$ with initial velocity $v$. We will denote the endpoint of this flow by $\exp_x v$. We can hence trade the integration within a small ball in our manifold for an integration within a ball of some radius in its tangent space $T_xM$, and write the kernel as a function of the point $x$ and the tangent vector $v$.
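For a concrete instance of this parametrisation, the endpoint of the geodesic flow has a closed form on the unit sphere. The sketch below assumes the unit sphere embedded in three-dimensional space; the function name `sphere_exp` is hypothetical.

```python
import numpy as np

def sphere_exp(p, v):
    """Endpoint of the unit-sphere geodesic that starts at p with initial velocity v.

    Assumes p is a unit vector and v is tangent to the sphere at p.
    """
    speed = np.linalg.norm(v)
    if speed == 0.0:
        return p.copy()
    return np.cos(speed) * p + np.sin(speed) * v / speed

p = np.array([0.0, 0.0, 1.0])
v = np.array([0.3, 0.4, 0.0])  # tangent vector of length 0.5 at p
q = sphere_exp(p, v)

# The geodesic distance from p to the endpoint equals the length of v.
geodesic_distance = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
print(q, geodesic_distance)
```

Tangent vectors of length below a given radius therefore label exactly the points of the geodesic ball of that radius, as long as the radius is small enough for the geodesics to be unique.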
Apart from accommodating the transformation of the input and output feature fields, one also has to make sure that the integration measure remains invariant. From here we conclude that the kernel should contain the factor $\sqrt{|g|}$ of the volume form, with $g$ the determinant of the metric, and we can hence write the kernel with this factor split off. Note also that this volume factor is simply 1 if one works with a gauge that corresponds to an orthonormal basis.
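The role of the volume factor can be verified with a short numerical check. The sketch assumes the standard volume element, the square root of the absolute metric determinant, and a linear change of coordinates with Jacobian `jac`; all names are illustrative.

```python
import numpy as np

def volume_factor(metric):
    """Square root of the absolute value of the metric determinant."""
    return np.sqrt(abs(np.linalg.det(metric)))

rng = np.random.default_rng(0)
a = rng.normal(size=(2, 2))
metric = a.T @ a + np.eye(2)                    # a positive-definite metric in old coordinates
jac = rng.normal(size=(2, 2)) + 2 * np.eye(2)   # Jacobian of old coordinates w.r.t. new ones

# Components of the same metric in the new coordinates.
metric_new = jac.T @ metric @ jac

# The coordinate measure picks up |det jac|, and the volume factor compensates for it.
lhs = volume_factor(metric_new)
rhs = abs(np.linalg.det(jac)) * volume_factor(metric)
print(np.isclose(lhs, rhs), volume_factor(np.eye(2)))
```

The product of the volume factor and the coordinate measure is therefore unchanged, and in an orthonormal basis, where the metric components form the identity matrix, the factor equals 1.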
At this stage, we arrive at the following form of our linear map
(3) 
From this stage onwards, we would like to be less abstract and focus on the groups and representations we encounter in real problems, namely the case in which the feature bundle is a tensor product of copies of the tangent and the cotangent bundles. To see that this is sufficient and to make contact with previous work, note for instance that any irreducible representation $\rho_j$ of $SO(3)$, the spin-$j$ irreducible representation of dimension $2j+1$, can be expressed as a linear combination of tensor powers of the vector representation $\rho_1$. Specifically, we have $\rho_1^{\otimes n} = \rho_n \oplus \bigoplus_{j<n} c_{n,j}\,\rho_j$, where $c_{n,j}\,\rho_j$ denotes the direct sum of $c_{n,j}$ copies of $\rho_j$, with $c_{n,j}$ a nonnegative integer for all $j<n$. In other words, $\rho_1^{\otimes n}$ contains precisely one copy of the spin-$n$ irreducible representation as well as other irreducibles with lower spins. This ensures that every $\rho_j$ is in the span of $1, \rho_1, \rho_1^{\otimes 2}, \ldots, \rho_1^{\otimes j}$. In other words, the irreducible representations and the tensor powers of the vector representation are equally good bases for representations.
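The multiplicities in such a decomposition can be computed with the Clebsch-Gordan rule for the rotation group in three dimensions, under which tensoring spin $j$ with the vector representation gives spins $|j-1|$ to $j+1$. The sketch below assumes that group and that rule; the function names `tensor_with_vector` and `vector_power` are hypothetical.

```python
from collections import Counter

def tensor_with_vector(mults):
    """Tensor a representation, given as {spin: multiplicity}, with the spin-1 representation."""
    out = Counter()
    for j, m in mults.items():
        # Clebsch-Gordan rule: spin j times spin 1 contains spins |j - 1|, ..., j + 1.
        for k in range(abs(j - 1), j + 2):
            out[k] += m
    return out

def vector_power(n):
    """Decompose the n-th tensor power of the spin-1 representation into irreducibles."""
    mults = Counter({0: 1})  # the zeroth power is the trivial representation
    for _ in range(n):
        mults = tensor_with_vector(mults)
    return mults

decomposition = vector_power(3)
dimension = sum((2 * j + 1) * m for j, m in decomposition.items())
print(dict(decomposition), dimension)
```

The top spin always appears exactly once, and the dimensions add up to $3^n$, consistent with the statement that tensor powers of the vector representation span all integer-spin representations.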
To ease notation, we will assume that the input and output feature fields are sections of tensor products of tangent bundles, while the cases involving the cotangent bundles can be treated with a straightforward generalisation of our formula. In this case, we can write the output feature field explicitly in components with upper tensor indices, and similarly for the input, and the transformation property (2) is succinctly summarised by the tensor and index structure of the kernel function, which carries upper indices matching the output field and lower indices matching the input field. In other words, for a fixed point and a fixed tangent vector at that point, the kernel is a tensor at that point. Explicitly, we now have for these cases
(4)  
Finally, we would like to impose the weight sharing condition, which we phrase in the following way: when the (localised) input signal is parallel transported along a curve, the output signal should also be equal to the parallel transport of the previous result. First, we need to explain what we mean by parallel transporting the input feature field along a curve $\gamma$ from a point $x$ to a point $x'$. For the value of the field at the point $x$ itself it is clear that we can simply parallel transport it to $x'$. Suppose that a point $y$ is connected to $x$ by the geodesic flow starting from $x$ with initial velocity $v$. We then also transport $v$ to $x'$, and transport the value of the field at $y$ by transporting it to $x'$ and then further along the geodesic flow starting from $x'$ with the transported initial velocity. After this prescription, depicted in Figure 1, it is clear how one should define the kernel at $x'$ such that the weight sharing condition is true. Recall that for a given tangent vector, the kernel is a tensor at $x$. Now parallel transport it along $\gamma$ to obtain a tensor at $x'$, and define that
(5) 
In other words, we simultaneously parallel transport the dependence on the tangent vector. The vanishing of the covariant derivative of the output feature along the curve then just comes from the vanishing of the covariant derivative of the volume form and the definition (5).
Note that (5) means that, along the path, the kernel is completely determined by the kernel at any point on the path. Suppose further that we select a reference point $x_0$ on the manifold. For any point $x$ that is connected to $x_0$ by a unique geodesic, parallel transport along that geodesic then unambiguously “shares” the kernel at $x_0$ with $x$. On the other hand, when $x$ is connected to $x_0$ by more than one geodesic, general covariance dictates the relation between the outputs corresponding to different geodesics. Moreover, this covariance also holds for transport along different paths (not necessarily geodesics) in general. More precisely, we see how different kernels, related again by a local change of coordinates, can be compensated by a transformation of the input and output feature fields. We hence see that our simple and general assumptions in fact completely determine the form of the convolution map.
3 Discussion
After pointing out the parallel between special and general relativity and equivariant CNNs, it is also important to point out the crucial differences. In the CNN setup, the geometry is always held fixed and we do not consider dynamics of the metric. From this point of view the closer analogy is perhaps the study of field theories in a fixed curved spacetime where the backreaction of the matter fields to the spacetime geometry has been ignored. It would be interesting to explore equivariant CNNs with geometry that evolves between layers in future work. It is certainly tempting to treat the direction of different layers as a part of the spacetime, either as the temporal or the holographic direction ’t Hooft (1993). This interpretation is particularly relevant if all feature spaces carry the same group representation.
Acknowledgements
The work of MC is supported by ERC starting grant H2020 ERC StG #640159 and an NWO Vidi grant. The work of VA is supported by ERC starting grant #640159.
References
 Cohen et al. (2018a) Cohen, T., Geiger, M., and Weiler, M. A General Theory of Equivariant CNNs on Homogeneous Spaces. 2018a.
 Cohen & Welling (2016) Cohen, T. S. and Welling, M. Group equivariant convolutional networks. In ICML, 2016.
 Cohen & Welling (2017) Cohen, T. S. and Welling, M. Steerable CNNs. In ICLR, 2017.
 Cohen et al. (2018b) Cohen, T. S., Geiger, M., Koehler, J., and Welling, M. Spherical CNNs. In ICLR, 2018b.
 Cohen et al. (2018c) Cohen, T. S., Geiger, M., and Weiler, M. Intertwiners between Induced Representations (with Applications to the Theory of Equivariant Neural Networks). 2018c.
 Cohen et al. (2019) Cohen, T. S., Weiler, M., Kicanaoglu, B., and Welling, M. Gauge equivariant convolutional networks and the icosahedral cnn. arXiv preprint arXiv:1902.04615, 2019.
 Einstein (1916) Einstein, A. The foundation of the general theory of relativity. Annalen der Physik, 49(7):769–822, 1916.
 Hoogeboom et al. (2018) Hoogeboom, E., Peters, J. W. T., Cohen, T. S., and Welling, M. HexaConv. In ICLR, 2018.
 Kondor & Trivedi (2018) Kondor, R. and Trivedi, S. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.
 ’t Hooft (1993) ’t Hooft, G. Dimensional reduction in quantum gravity. Conf. Proc., C930308:284–296, 1993.
 Weiler et al. (2018a) Weiler, M., Geiger, M., Welling, M., Boomsma, W., and Cohen, T. S. 3D Steerable CNNs: Learning Rotationally Equivariant Features in Volumetric Data. In NIPS, 2018a.
 Weiler et al. (2018b) Weiler, M., Hamprecht, F. A., and Storath, M. Learning Steerable Filters for Rotation Equivariant CNNs. In CVPR, 2018b.

 Winkels & Cohen (2018) Winkels, M. and Cohen, T. S. 3D GCNNs for Pulmonary Nodule Detection. In International Conference on Medical Imaging with Deep Learning (MIDL), 2018.
 Worrall & Brostow (2018) Worrall, D. E. and Brostow, G. J. Cubenet: Equivariance to 3d rotation and translation. In ECCV, 2018.
 Worrall et al. (2017) Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and Brostow, G. J. Harmonic Networks: Deep Translation and Rotation Equivariance. In CVPR, 2017.