1 Introduction
The ability to learn constraints or symmetries is a foundational property of intelligent systems. Humans are able to discover patterns and regularities in data that provide compressed representations of reality, such as translation, rotation, intensity, or scale symmetries. Indeed, we see the value of such constraints in deep learning. Fully connected networks are more flexible than convolutional networks, but convolutional networks are more broadly impactful because they enforce the translation equivariance symmetry: when we translate an image, the outputs of a convolutional layer translate in the same way (LeCun et al., 1998; Cohen and Welling, 2016b). Further gains have been achieved by recent work hardcoding additional symmetries, such as rotation equivariance, into convolutional neural networks (e.g., Cohen and Welling, 2016b; Worrall et al., 2017; Zhou et al., 2017; Marcos et al., 2017).
But we might wonder whether it is possible to learn that we want to use a convolutional neural network. Moreover, we typically do not know which constraints are suitable for a given problem, and to what extent those constraints should be enforced. The class label for the digit ‘6’ is rotationally invariant up until it becomes a ‘9’. Like biological systems, we would like to automatically discover the appropriate symmetries. This task appears daunting, because standard learning objectives such as maximum likelihood select for flexibility, rather than constraints (MacKay, 2003; Minka, 2001).
In this paper, we provide an extremely simple and practical approach to automatically discovering invariances and equivariances from training data alone. Our approach operates by learning a distribution over augmentations, then training with augmented data, leading to the name Augerino. Augerino (1) can learn both invariances and equivariances over a wide range of symmetry groups, including translations, rotations, scalings, and shears; (2) can discover partial symmetries, such as rotations not spanning the full range from $-\pi$ to $\pi$; (3) can be combined with any standard architecture, loss function, or optimization algorithm with little overhead; (4) performs well on regression, classification, and segmentation tasks, for both image and molecular data.
To our knowledge, Augerino is the first approach that can learn symmetries in neural networks from training data alone, without requiring a validation set or a special loss function. In Sections 3 to 5 we introduce Augerino and show why it works. The accompanying code can be found at https://github.com/gbenton/learninginvariances.
2 Related Work
There has been an explosion of work constructing convolutional neural networks that have hardcoded invariance or equivariance to a set of transformations, such as rotation (Cohen and Welling, 2016b; Worrall et al., 2017; Zhou et al., 2017; Marcos et al., 2017) and scaling (Worrall and Welling, 2019; Sosnovik et al., 2019). While recent methods use a representation-theoretic approach to find a basis of equivariant convolutional kernels (Cohen and Welling, 2016a; Worrall et al., 2017; Weiler and Cesa, 2019), the older method of Laptev et al. (2016) pools network outputs over many hardcoded transformations of the input for fixed invariances, but does not consider equivariances.
With a desire to automate the machine learning pipeline, Cubuk et al. (2019) introduced AutoAugment, in which reinforcement learning is used to find an optimal augmentation policy within a discrete search space. At the expense of a massive computational budget for the search, AutoAugment brought substantial gains in image classification performance, including state-of-the-art results on ImageNet. The AutoAugment framework was extended first to Fast AutoAugment in Lim et al. (2019), improving both the speed and accuracy of AutoAugment by using Bayesian data augmentation (Tran et al., 2017). Both Cubuk et al. (2019) and Lim et al. (2019) apply a reinforcement learning approach to searching the space of augmentations, significantly differing from our work, which directly optimizes distributions over augmentations with respect to the training loss. Faster AutoAugment (Hataya et al., 2019), which uses a GAN framework to match augmentations to the data distribution, and Differentiable Automatic Data Augmentation (Li et al., 2020), which applies a DARTS (Liu et al., 2018) bilevel optimization procedure to learn augmentation from the validation loss, are most similar to Augerino in the discovery of distributions over augmentations. Both methods learn augmentations from data using the reparametrization trick; however, unlike Li et al. (2020) and Liu et al. (2018), we learn augmentations directly from the training loss without need for GAN training or the complex DARTS procedure (Liu et al., 2018; Xu et al., 2019; Liang et al., 2019), and are specifically learning degrees of invariances and equivariances.
To the best of our knowledge, Augerino is the first work to learn invariances and equivariances in neural networks from training data alone. The ability to automatically discover symmetries enables us to uncover interpretable salient structure in data, and provide better generalization.
3 Augerino: Learning Invariances through Augmentation
A simple way of constructing a model invariant to a given group of transformations is to average the outputs of an arbitrary model for the inputs transformed with all the transformations in the group. For example, if we wish to make a given image classifier invariant to horizontal reflections, we can average the predictions of the network for the original and reflected input.
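The reflection example above can be sketched in a few lines of plain Python. The `model` below is a hypothetical toy stand-in for an image classifier (it is not part of the method); the point is only that averaging its outputs over the two-element reflection group yields predictions that no longer change when the input is flipped.

```python
def hflip(image):
    """Reflect an image (a list of rows) horizontally."""
    return [row[::-1] for row in image]

def model(image):
    """Hypothetical toy classifier that is NOT reflection invariant:
    the first class score is the fraction of mass in the left half."""
    total = sum(sum(row) for row in image)
    left = sum(sum(row[: len(row) // 2]) for row in image)
    p = left / total if total else 0.5
    return [p, 1.0 - p]

def reflection_averaged(image):
    """Average the model outputs over the original and reflected input."""
    outs = [model(image), model(hflip(image))]
    return [sum(o[k] for o in outs) / len(outs) for k in range(2)]

image = [[1.0, 0.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0]]

# The raw model changes under reflection; the averaged model does not.
print(model(image), model(hflip(image)))
print(reflection_averaged(image), reflection_averaged(hflip(image)))
```

The same construction extends to any finite set of transformations by averaging over all of its elements.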
Augerino functions by sampling multiple augmentations from a parameterized distribution then applying these augmentations to an input to acquire multiple augmented samples of the input. The augmented input samples are each then passed through the model, with the final prediction being generated by averaging over the individual outputs. We present the Augerino framework in Figure 1.
Now, suppose we are working with a set $S$ of transformations. Relevant transformations may not always form a group structure, such as rotations $R_\phi$ by limited angles in the range $\phi \in [-\theta, \theta]$. Given a neural network $f_w$, with parameters $w$, we can make a new model $f_{\text{aug-avg}}$ which is approximately invariant to transformations in $S$ by averaging the outputs over a parametrized distribution $\mu_\theta(\cdot)$ of the transformations $g \in S$:
$$f_{\text{aug-avg}}(x) = \mathbb{E}_{g \sim \mu_\theta} f_w(gx). \tag{1}$$
Since the cross-entropy loss $\ell$ for classification is linear in the class probabilities, we can pull the expectation outside of the loss:
$$\ell(f_{\text{aug-avg}}(x)) = \ell\left(\mathbb{E}_{g \sim \mu_\theta} f_w(gx)\right) = \mathbb{E}_{g \sim \mu_\theta} \ell(f_w(gx)). \tag{2}$$
As stochastic gradient descent only requires an unbiased estimator of the gradients, we can train the augmentation-averaged model $f_{\text{aug-avg}}$ exactly by minimizing the loss of $f_w$ averaged over a finite number of samples $g \sim \mu_\theta$ at training time, using a Monte Carlo estimator.
To learn the invariances we can also backpropagate through to the parameters $\theta$ of the distribution $\mu_\theta$ by using the reparametrization trick (Kingma and Welling, 2013). For example, for a uniform distribution over rotations $R_\phi$ with angles $\phi \sim U[-\theta, \theta]$, we can parametrize the rotation angle by $\phi = \theta\epsilon$ with $\epsilon \sim U[-1, 1]$. The loss $\ell(\cdot)$ for the augmentation-averaged model on an input $x$ can be computed as
$$\ell_{\theta,w}(x) = \mathbb{E}_{\epsilon \sim U[-1,1]} \ell(f_w(R_{\epsilon\theta} x)). \tag{3}$$
Specifically, during training we can use a single sample from the augmentation distribution to estimate the gradients. The learned range of rotations would correspond to the extent to which rotational invariance is present in the data. With a more general set of transformations, we can similarly define a distribution $\mu_\theta(\cdot)$ over the transformation elements using the reparametrization trick $g = g_\epsilon$, with $g_\epsilon$ a deterministic function of $\theta$ and of noise $\epsilon$ drawn from a fixed base distribution. The reparameterized loss is then
$$\ell_{\theta,w}(x) = \mathbb{E}_{\epsilon} \ell(f_w(g_\epsilon x)). \tag{4}$$
In Section 3.2 we describe a parameterization of the set of affine transformations which includes translations, rotations, and scalings of the input as special cases. In this fashion, we can jointly train the parameters of the augmentation-averaged model $f_{\text{aug-avg}}$, which consist of the weights $w$ of $f_w$ and the parameters $\theta$ of the augmentation distribution $\mu_\theta$.
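The reparametrized rotation example can be illustrated with a small numerical sketch. Here the rotation angle is drawn as the width parameter times uniform noise on $[-1, 1]$, and the gradient with respect to the width is estimated by the chain rule. The function `toy_loss` is a hypothetical stand-in for the loss of the network on the rotated input (an assumption made purely for illustration); in practice an automatic differentiation library would carry the gradient through the network.

```python
import random

def sample_angle(theta, rng):
    """Reparameterized draw: phi = theta * eps with eps ~ U[-1, 1]."""
    eps = rng.uniform(-1.0, 1.0)
    return theta * eps, eps

def toy_loss(phi):
    """Hypothetical stand-in for loss(f_w(R_phi x)); grows with the angle."""
    return phi ** 2

def toy_loss_grad(phi):
    """Derivative of toy_loss with respect to the angle phi."""
    return 2.0 * phi

def loss_and_grad(theta, n_samples, seed=0):
    """Monte Carlo estimate of the expected loss and its gradient in theta."""
    rng = random.Random(seed)
    loss_sum = grad_sum = 0.0
    for _ in range(n_samples):
        phi, eps = sample_angle(theta, rng)
        loss_sum += toy_loss(phi)
        # Chain rule through the reparameterization: d phi / d theta = eps.
        grad_sum += toy_loss_grad(phi) * eps
    return loss_sum / n_samples, grad_sum / n_samples

loss, grad = loss_and_grad(theta=0.5, n_samples=20000)
print(loss, grad)
```

For this toy loss the expectations are available in closed form ($\theta^2/3$ for the loss and $2\theta/3$ for its gradient), which gives a simple check on the estimator.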
Test-time Augmentation
Regularized Loss
Invariances correspond to constraints on the model, and in general the most unconstrained model may be able to achieve the lowest training loss. However, we have a prior belief that a model should preserve some level of invariance, even if standard losses cannot account for this preference. To bias training towards solutions that incorporate invariances, we add a regularization penalty to the network loss function that promotes broader distributions over augmentations. Our final loss function is given by
$$\ell_{\theta,w}(x) = \mathbb{E}_{\epsilon} \ell(f_w(g_\epsilon x)) + \lambda R(\theta), \tag{5}$$
where $R(\theta)$ is a regularization function encouraging coverage of a larger volume of transformations and $\lambda$ is the regularization weight (the form of $R(\theta)$ is discussed in Section 3.2). In practice we find that the choice of $\lambda$ is largely unimportant; the insensitivity to the choice of $\lambda$ is demonstrated throughout Sections 4 and 6, in which performance is consistent for various values of $\lambda$. This is due to the fact that there is essentially no gradient signal for $\theta$ over the range of augmentations consistent with the data, so even a small push is sufficient. We discuss further why Augerino is able to learn the correct level of invariance, without sensitivity to $\lambda$ and from training data alone, in Section 5.
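A minimal sketch of the regularized objective follows. It assumes the regularizer is the negative squared norm of the augmentation bounds, which is one reading of the "negative regularization on the bounds" mentioned in Section 3.2; the exact form, along with the names `regularizer`, `augerino_loss`, and `reg_weight`, is an assumption for illustration rather than the reference implementation.

```python
def regularizer(theta):
    """Assumed form: negative squared L2 norm of the augmentation bounds.
    Wider augmentation distributions give a more negative value."""
    return -sum(t * t for t in theta)

def augerino_loss(data_loss, theta, reg_weight):
    """Data-fit term plus the weighted regularizer on the bounds."""
    return data_loss + reg_weight * regularizer(theta)

# With the data loss held fixed, wider bounds lower the total objective.
narrow = augerino_loss(data_loss=1.0, theta=[1.0, 1.0], reg_weight=0.1)
wide = augerino_loss(data_loss=1.0, theta=[2.0, 2.0], reg_weight=0.1)
print(narrow, wide)
```

When minimizing, the regularizer only matters where the data-fit term is flat in the bounds, which is the regime described in the paragraph above.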
We refer to the proposed method as Augerino (footnote 1: https://en.wikipedia.org/wiki/Augerino). We summarize the method in Algorithm 1.
3.1 Extension to Equivariant Predictions
We now generalize Augerino to problems where the targets are equivariant rather than invariant to a certain set of transformations. We say that target values are equivariant to a set of input transformations if the targets for a transformed input are transformed in the same way as the input. Formally, a function $f$ is equivariant to a symmetry transformation $g$ if applying $g$ to the input of the function is the same as applying $g$ to the output, such that $f(gx) = g f(x)$. For example, in image segmentation, if the input image is rotated the target segmentation mask should also be rotated by the same angle, rather than being unchanged.
To make the Augerino model equivariant to transformations sampled from $\mu_\theta(\cdot)$, we can average the inversely transformed outputs of the network $f_w$ for transformed inputs:
$$f_{\text{aug-eq}}(x) = \mathbb{E}_{g \sim \mu_\theta} g^{-1} f_w(gx). \tag{6}$$
Supposing that $g$ acts linearly on the image, the model is equivariant:
$$f_{\text{aug-eq}}(hx) = \mathbb{E}_{g \sim \mu_\theta} g^{-1} f_w(ghx) = \mathbb{E}_{g \sim \mu_\theta} h (gh)^{-1} f_w(ghx) \tag{7}$$
$$= h\, \mathbb{E}_{u \sim \mu_\theta} u^{-1} f_w(ux) = h f_{\text{aug-eq}}(x), \tag{8}$$
where $u = gh$ and the distribution is right invariant: for any measurable set $S$, $\mu_\theta(S) = \mu_\theta(Sh)$. If the distribution over the transformations is uniform then the model is equivariant.
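The equivariance argument can be checked numerically on a small example. The sketch below uses the group of rotations by multiples of 90 degrees acting on 2D vectors, with the uniform distribution over its four elements (which is right invariant). The map `f` is a hypothetical nonlinear function standing in for the network, and inputs and outputs are assumed to transform in the same way; none of these choices come from the paper's experiments.

```python
def rot90(v, k=1):
    """Rotate a 2D vector by k * 90 degrees counter-clockwise."""
    x, y = v
    for _ in range(k % 4):
        x, y = -y, x
    return (x, y)

def f(v):
    """Hypothetical network output: a nonlinear map with no built-in symmetry."""
    x, y = v
    return (x * x + y, 3.0 * x)

def f_aug_eq(v):
    """Average g^{-1} f(g v) over the four rotations g (uniform distribution)."""
    outs = [rot90(f(rot90(v, k)), -k) for k in range(4)]
    return (sum(o[0] for o in outs) / 4.0, sum(o[1] for o in outs) / 4.0)

v = (1.0, 2.0)
lhs = f_aug_eq(rot90(v))   # transform the input, then apply the averaged model
rhs = rot90(f_aug_eq(v))   # apply the averaged model, then transform the output
print(lhs, rhs)
```

The raw map `f` fails the same check, so the equivariance comes entirely from averaging the inversely transformed outputs.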
3.2 Parameterizing Affine Transformations
We now show how to parameterize a distribution over the set of affine transformations of data (e.g. images). With this parameterization, Augerino can learn from a broad variety of augmentations including translations, rotations, scalings and shears.
The set of affine transformations forms an algebraic structure known as a Lie group. To apply the reparametrization trick, we can parametrize elements of this Lie group in terms of its Lie algebra via the exponential map (Falorsi et al., 2019). With a very simple approach, we can define bounds $\theta_i$ on a uniform distribution over the different exponential generators $G_i$ in the Lie algebra:
$$g_\epsilon = \exp\left(\sum_i \epsilon_i \theta_i G_i\right), \quad \epsilon \sim U[-1, 1]^k, \tag{9}$$
where $k$ is the number of generators and exp is the matrix exponential function: $\exp(A) = \sum_{n=0}^{\infty} \frac{1}{n!} A^n$. (Footnote 2: Mathematically speaking, this distribution is a pushforward by the exp map of the uniform distribution on a cube with side lengths set by the bounds $\theta_i$.)
The generators of the affine transformations in 2D, $G_1, \dots, G_6$, correspond to translation in $x$, translation in $y$, rotation, scaling in $x$, scaling in $y$, and shearing; we write out these generators in Appendix A. The exponential map of each generating matrix produces an affine matrix that can be used to transform the coordinate grid points of the input as in Jaderberg et al. (2015). To ensure that the parameters $\theta_i$ are positive, we learn parameters $\tilde{\theta}_i$ where $\theta_i = \log(1 + \exp \tilde{\theta}_i)$. In maximizing the volume of transformations covered, it would be geometrically sensible to maximize the Haar measure of the set of transformations that are covered by Augerino, which is similar to the volume covered in the Lie algebra, $\prod_i \theta_i$. However, we find that even the negative $L_2$ regularization $R(\theta) = -\|\theta\|^2$ on the bounds $\theta_i$ is sufficient to bias the model towards invariance. More intuitively, the regularization penalty biases solutions towards values of $\theta$ which induce broad distributions over affine transformations, $\mu_\theta$. We apply the regularization penalty on both classification and regression problems, using cross entropy and mean squared error loss, respectively. This regularization method is effective, interpretable, and leads to the discovery of the correct level of invariance for a wide range of $\lambda$.
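The sampling scheme described above can be sketched with the standard library alone. The sketch draws uniform noise, scales it by positive widths, forms a linear combination of generators, and applies a truncated power series for the matrix exponential. The two generators shown (rotation and translation along the first axis, in homogeneous coordinates) are written from the standard Lie algebra of the 2D affine group, and the softplus map used to keep widths positive is an assumed choice; the function names are illustrative only.

```python
import math
import random

# Assumed generators in homogeneous 2D coordinates: rotation, x-translation.
G_ROT = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
G_TX = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def expm(a, terms=30):
    """Matrix exponential via the truncated power series sum_n A^n / n!."""
    result = [[float(i == j) for j in range(3)] for i in range(3)]
    term = [row[:] for row in result]
    for n in range(1, terms):
        term = [[v / n for v in row] for row in matmul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(3)]
                  for i in range(3)]
    return result

def softplus(t):
    """Map an unconstrained parameter to a positive width (assumed choice)."""
    return math.log1p(math.exp(t))

def sample_affine(raw_widths, generators, rng):
    """Draw exp(sum_i eps_i * width_i * G_i) with eps_i ~ U[-1, 1]."""
    alg = [[0.0] * 3 for _ in range(3)]
    for raw, gen in zip(raw_widths, generators):
        coeff = rng.uniform(-1.0, 1.0) * softplus(raw)
        alg = [[alg[i][j] + coeff * gen[i][j] for j in range(3)]
               for i in range(3)]
    return expm(alg)

# Exponentiating (pi/2) * G_ROT should give a 90 degree rotation matrix.
rotation = expm([[math.pi / 2 * v for v in row] for row in G_ROT])
sample = sample_affine([0.0, -1.0], [G_ROT, G_TX], random.Random(0))
print(rotation)
print(sample)
```

The resulting 3x3 matrix can be applied to homogeneous pixel coordinates; for small 3x3 matrices the truncated series is adequate, whereas a library routine would normally be preferred.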
4 Shades of Invariance
We can broadly classify invariances in three distinct ways. First, there are cases in which we wish to be completely invariant to transformations in the data, such as to rotations on the rotMNIST dataset. There are also cases in which we want to be only partially invariant to transformations, i.e. soft invariance, such as if we are asking if a picture is right side up or upside down. Lastly, there are cases in which we wish there to be no invariance to transformations, such as when we wish to predict the rotations themselves. We show that Augerino can learn full invariance, soft invariance, and no invariance to rotations. We then explain in Section 5 why Augerino is able to discover the correct level of invariance from training data alone. Incidentally, soft invariances are the most representative of real-world problems, and also the most difficult to correctly encode a priori, which is where we most need to learn invariances.
For the experiments in this and all following sections we use the CNN architecture from Laine and Aila (2016). We compare Augerino trained with three values of the regularization weight $\lambda$ from Equation 5, corresponding to low, standard, and high levels of regularization. To further emphasize the need for invariance to be learned as opposed to just embedded in a model, we also show predictions generated from an invariant steerable network (Cohen and Welling, 2016a). Specific experimental and training details are in Appendix C.
4.1 Full Rotational Invariance: rotMNIST
The rotated MNIST dataset (rotMNIST) consists of the MNIST dataset with the input images randomly rotated. As the dataset has an inherent augmentation present (random rotations), we desire a model that is invariant to such augmentations. With Augerino, we aim to approximate invariance to rotations by learning an augmentation distribution that is uniform over all rotations in $[-\pi, \pi]$.
Figure 2 shows the learned distribution over rotations to apply to images input into the model. On top of learning the correct augmentation through automatic differentiation using only the training data, we achieve test accuracy. We also see the level of regularization has little effect on performance. To our knowledge, only Weiler and Cesa (2019) achieve better performance on the rotMNIST dataset, using the correct equivariance already hardcoded into the network.
4.2 Soft Invariance: Mario & Iggy
We show that Augerino can learn soft invariances, e.g. invariance to a subset of transformations such as only partial rotations. To this end, we consider a dataset in which the labels are dependent on both image and pose. We use the sprites for the characters Mario and Iggy from Super Mario World (Nintendo, 1990), randomly rotated within two angular intervals. There are four labels in the dataset: one for the Mario sprite in the upper half plane, one for the Mario sprite in the lower half plane, one for the Iggy sprite in the upper half plane, and one for the Iggy sprite in the lower half plane; we show an example demonstrating each potential label in Figure 3.
In Figure 3, we see that too much rotational augmentation would make it impossible to correctly identify the pose. The limited rotations present in the data give that the labels are invariant to rotations of up to radians. Augerino learns the correct augmentation distribution within approximately radians, and the predicted class labels follow the desired invariances, with predictions that are invariant to rotations only within subsets of .
4.3 Avoiding Invariance: Olivetti Faces
To test that Augerino can avoid unwanted invariances we train the model on the rotated Olivetti faces dataset (Hinton and Salakhutdinov, 2008). This dataset consists of distinct images of different people. We select the images of people to generate the training set, randomly rotating each image in , retaining the angle of rotation as the new label. We then crop the result to pixel square images. We repeat the process times for each image, generating training images. Figure 4 shows the data generating process and the corresponding label. Augmenting the image with any rotation would make it impossible to learn the angle by which the original image was rotated.
We find experimentally in Figure 4 that when we initialize the Augerino model such that the distribution over the rotation generating matrix is uniform , training for epochs reduces the distribution on the rotational augmentation to have domain of support radians wide. The model learns a nearly fixed transformation in each of the other spaces of affine transformation, all with domains of support for the weights under units wide.
5 Why Augerino Works
The conventional wisdom is that it is impossible to learn invariances directly from the training loss as invariances are constraints on the model which make it harder to fit the data (van der Wilk et al., 2018). Given data that has invariance to some augmentation, the training loss will not be improved by widening our distribution over this augmentation, even if it helps generalization: we would want a model to be invariant to rotations of a ‘6’ up until it looks more like a ‘9’, but no invariance will achieve the same training loss. However, it is sufficient to add a simple regularization term to encourage the model to discover invariances. In practice we find that the final distribution over augmentations is insensitive to the level of regularization, and that even a small amount of regularization will enable Augerino to find wide distributions over augmentations that are consistent with the precise level of invariances in the data.
We illustrate the learning of invariances with Augerino in panel (a) of Figure 5. Suppose only a limited degree of invariance is present in the data, as in Section 4.2. Then the training loss for the augmentation parameters will be flat for augmentations within the range of invariance present in the data (shown in white), and then will increase sharply beyond this range (corresponding region of Augerino parameters is shown in blue). The regularized loss in Eq. (5) will push the model to increase the level of invariance within the flat region of the training loss, but will not push it beyond the degree of invariance present in the data unless the regularization strength is extreme.
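The argument in this paragraph can be illustrated with a toy one-dimensional problem. The sketch assumes a hypothetical training loss that is exactly flat up to a true invariance width `theta_star` and rises quadratically beyond it, combined with an assumed negative squared-width regularizer; all names and constants are illustrative and not taken from the experiments.

```python
def toy_data_loss(theta, theta_star=1.0, sharpness=50.0):
    """Hypothetical training loss: flat up to theta_star, then rising sharply."""
    excess = max(0.0, theta - theta_star)
    return sharpness * excess ** 2

def best_width(reg_weight, grid_size=3001, theta_max=3.0):
    """Grid search for the width minimizing data loss minus reg_weight * theta^2."""
    best_theta, best_val = 0.0, float("inf")
    for i in range(grid_size):
        theta = theta_max * i / (grid_size - 1)
        val = toy_data_loss(theta) - reg_weight * theta ** 2
        if val < best_val:
            best_theta, best_val = theta, val
    return best_theta

# Regularization weights spanning two orders of magnitude.
widths = {lam: best_width(lam) for lam in (0.01, 0.1, 1.0)}
print(widths)
```

In this toy setting the selected width stays close to `theta_star` across the range of weights, while with no regularization the flat region gives no preference and the search simply returns the first grid point.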
We demonstrate the effect described above for the Mario and Iggy classification problem of Section 4.2 in panel (b) of Figure 5. We use a network trained with Augerino and visualize the loss and gradient with respect to the range of rotations applied to the input with and without regularization. Without regularization, the loss is almost completely flat until the value of which is the true degree of rotational invariance in the data. With regularization we add an incentive for the model to learn larger values of the rotation range. Consequently, the loss achieves its optimum close to the optimal value of the parameter at and then quickly grows beyond that value. Figure 6 displays the results of panel (b) of Figure 5 in action; gradient signals push augmentation distributions that are too wide down and too narrow up to the correct width.
Incidentally, the Augerino solutions are substantially flatter than those obtained by standard training, as shown in Appendix F, Figure 9, which may also make them more easily discoverable by procedures such as SGD. We also see that these solutions indeed provide better generalization.
Figure 5: (a) Augerino training. (b) Loss function and gradient.
6 Image Recognition
As Augerino learns a set of augmentations specific to a given dataset, we expect to see that Augerino is capable of boosting performance over applying any level of fixed augmentation. Using the CIFAR dataset, we compare Augerino to training on data with no augmentation, fixed, commonly applied augmentations, and the augmentations as given by Fast AutoAugment (Lim et al., 2019).
No Aug.  Fixed Aug.  Augerino ( copies)  Augerino ( copy)  Fast AA
Test Accuracy
Table 1: Augerino provides a boost in performance with minimal increased training time. Error bars are reported as the standard deviation in accuracy for Augerino trained over repeated trials.
Table 1 shows that Augerino is competitive with advanced models that seek data-based augmentation schemes. The gains in performance are accompanied by notable simplifications in setup: we do not require a validation set and the augmentation is learned concurrently with training (there is no preprocessing to search for an augmentation policy). In Appendix F we show that Augerino finds flatter solutions in the loss surface, which are known to generalize (Maddox et al., 2020). To further address the choice of regularization parameter, we train a number of models on CIFAR with varying levels of regularization. In Figure 9 we present the test accuracy of models for different regularization parameters along with the corresponding effective dimensionalities of the networks as a measure of the flatness of the optimum found through training. Maddox et al. (2020) show that effective dimensionality can capture the flatness of optima in parameter space and is strongly correlated with generalization, with lower effective dimensionality implying flatter optima and better generalization.
The results of the experiment presented in Figure 9 solidify Augerino’s capability to boost performance on image recognition tasks, and demonstrate that the inclusion of regularization is helpful, but not necessary, to train accurate models. If the regularization parameter becomes too large, as can be seen in the rightmost violins of Figure 9, training can become unstable with more variance in the accuracy achieved. We observe that while it is possible to achieve good results with no regularization, the inclusion of an inductive bias that we ought to include some invariances (by adding a regularization penalty) improves performance.
7 Molecular Property Prediction
We test our method on the molecular property prediction dataset QM9 (Blum and Reymond, 2009; Rupp et al., 2012), which consists of small organic molecules with features given by the coordinates of the atoms in 3D space and their charges. We focus on the HOMO task of predicting the energy of the highest occupied molecular orbital, and we learn Augerino augmentations in the space of affine transformations of the atomic coordinates in $\mathbb{R}^3$. We parametrize the transformation as before with a uniform distribution for each of the generators listed in Appendix A. We use the LieConv model introduced in Finzi et al. (2020), both with no equivariance (LieConv-Trivial) and with 3D translational equivariance (LieConv-T(3)). We train the models for 500 epochs on MAE (additional training details are given in Appendix C) and report the test performance in Table 2. Augerino performs much better than using no augmentations and is competitive with the hand-chosen random rotation and translation augmentation (SE(3)) that incorporates domain knowledge about the problem. We detail the learned distribution over affine transformations in Appendix E. Augerino is useful both for the non-equivariant LieConv-Trivial model and for the translationally equivariant LieConv-T(3) model, suggesting that Augerino can complement architectural equivariance.
Table 2:
Task:  HOMO (meV)  LUMO (meV)
Columns per task:  No Aug.  Augerino  SE(3)
LieConv-Trivial
LieConv-T(3)
8 Semantic Segmentation
In Section 3.1 we showed how Augerino can be extended to equivariant problems. In semantic segmentation the targets are perfectly aligned with the inputs and the network should be equivariant to any transformations present in the data. To test Augerino in an equivariant learning setting we construct rotCamVid, a variation of the CamVid dataset (Brostow et al., 2008a, b) where all the training and test points are rotated by a random angle (see Appendix Figure 7). For any fixed image we always use the same rotation angle, so no two copies of the same image with different rotations are present in the data. We use the FC-DenseNet segmentation architecture (Jégou et al., 2017). We train Augerino with a Gaussian distribution over random rotations and translations.
9 Color-Space Augmentations
In the previous sections we have focused on learning spatial invariances with Augerino. Augerino is general and can be applied to arbitrary differentiable input transformations. In this section, we demonstrate that Augerino can learn color-space invariances.
We consider two color-space augmentations: brightness adjustments and contrast adjustments. Each of these can be implemented as simple differentiable transformations to the RGB values of the input image (for details, see Appendix D). We use Augerino to learn a uniform distribution over the brightness and contrast adjustments on STL-10 (Coates et al., 2011) using the CNN architecture of Section 4. For both Augerino and the baseline model, we use standard spatial data augmentation: random translations, flips and cutout (DeVries and Taylor, 2017). The baseline model achieves accuracy where the mean and standard deviation are computed over independent runs. The Augerino model achieves a slightly higher accuracy and learns to be invariant to noticeable brightness and contrast changes in the input image (see Appendix Figure 8).
10 Conclusion
We have introduced Augerino, a framework that can be seamlessly deployed with standard model architectures to learn symmetries from training data alone, and improve generalization. Experimentally, we see that Augerino is capable of recovering ground truth invariances, including soft invariances, ultimately discovering an interpretable representation of the dataset. Augerino’s ability to recover interpretable and accurate distributions over augmentations leads to increased performance over both task-specific specialized baselines and competing data-based augmentation schemes on a variety of tasks including molecular property prediction, image segmentation, and classification.
Broader Impacts
Our work is largely methodological and we anticipate that Augerino will primarily see use within the machine learning community. Augerino’s ability to uncover invariances present within the data, without modifying the training procedure and with a plug-and-play design that is compatible with any network architecture, makes it an appealing method to be deployed widely. We hope that learning invariances from data is an avenue that will see continued inquiry and that Augerino will motivate further exploration.
Acknowledgements
This research is supported by an Amazon Research Award, Facebook Research, Amazon Machine Learning Research Award, NSF I-DISRE 193471, NIH R01 DA048764-01A1, NSF IIS-1910266, and NSF 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science.
References
 There are many consistent explanations of unlabeled data: why you should average. ICLR.
 B-spline CNNs on Lie groups. In International Conference on Learning Representations.
 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131, pp. 8732. Cited by: §7.
 Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters. Cited by: §8.
 Segmentation and recognition using structure from motion point clouds. In ECCV (1), pp. 44–57. Cited by: §8.

 An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gordon, D. Dunson, and M. Dudík (Eds.), Proceedings of Machine Learning Research, Vol. 15, Fort Lauderdale, FL, USA, pp. 215–223. Cited by: §9.
 A general theory of equivariant CNNs on homogeneous spaces. In Advances in Neural Information Processing Systems, pp. 9142–9153.
 Steerable cnns. arXiv preprint arXiv:1612.08498. Cited by: §2, §4.
 Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999. Cited by: §1, §2.
 Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 113–123. Cited by: §2.
 A kernel theory of modern data augmentation. Proceedings of machine learning research 97, pp. 1528.
 Improved regularization of convolutional neural networks with cutout. External Links: 1708.04552 Cited by: §9.
 Reparameterizing distributions on lie groups. arXiv preprint arXiv:1903.02958. Cited by: §3.2.
 Generalizing convolutional neural networks for equivariance to lie groups on arbitrary continuous data. arXiv preprint arXiv:2002.12880. Cited by: Appendix E, §7.
 Faster autoaugment: learning augmentation strategies using backpropagation. arXiv preprint arXiv:1911.06987. Cited by: §2.
 Using deep belief nets to learn covariance kernels for gaussian processes. In Advances in neural information processing systems, pp. 1249–1256. Cited by: §4.3.
 Deep learning on Lie groups for skeleton-based action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6099–6108.
 Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025. Cited by: §3.2.
 The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 11–19. Cited by: §8.
 Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.

 Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: §4.
 TI-pooling: transformation-invariant pooling for feature learning in convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 289–297. Cited by: §2.
 An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, New York, NY, USA, pp. 473–480. ISBN 9781595937933. Cited by: Appendix B.
 Convolutional networks for images, speech, and time series, the handbook of brain theory and neural networks. MIT Press, Cambridge, MA. Cited by: §1.
 DADA: differentiable automatic data augmentation. arXiv preprint arXiv:2003.03780. Cited by: §2.
 Darts+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035. Cited by: §2.
 Fast autoaugment. In Advances in Neural Information Processing Systems, pp. 6662–6672. Cited by: §2, Table 1, §6.
 Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §2.
 Information theory, inference and learning algorithms. Cambridge university press. Cited by: §1.
 Rethinking parameter counting in deep models: effective dimensionality revisited. arXiv preprint arXiv:2003.02139. Cited by: Appendix F, Appendix, §6.

 Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5048–5057. Cited by: §1, §2.
 Automatic choice of dimensionality for PCA. In Advances in neural information processing systems, pp. 598–604. Cited by: §1.
 Super mario world. Cited by: §4.2.
 Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters 108, pp. 058301. Cited by: §7.
 Scale-equivariant steerable networks. arXiv preprint arXiv:1910.11093. Cited by: §2.
 A bayesian data augmentation approach for learning deep models. In Advances in neural information processing systems, pp. 2797–2806. Cited by: §2.
 Learning invariances using the marginal likelihood. pp. 9938–9948. Cited by: §5.
 General E(2)-equivariant steerable CNNs. pp. 14334–14345. Cited by: §2, §4.1.
 Harmonic networks: deep translation and rotation equivariance. pp. 5028–5037. Cited by: §1, §2.
 Deep scale-spaces: equivariance over scale. pp. 7364–7376. Cited by: §2.
 PC-DARTS: partial channel connections for memory-efficient differentiable architecture search. arXiv preprint arXiv:1907.05737. Cited by: §2.
 DADA: deep adversarial data augmentation for extremely low data regime classification. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2807–2811.
 Oriented response networks. pp. 519–528. Cited by: §1, §2.
Appendix A Lie Group Generators
The six Lie group generating matrices for affine transformations in 2D are,
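$$
G_1 = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},\quad
G_2 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix},\quad
G_3 = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},\quad
G_4 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},\quad
G_5 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix},\quad
G_6 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.
\tag{10}
$$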
Applying the exponential map to these matrices produces affine matrices that can be used to transform images. In order, these matrices correspond to translations in x, translations in y, rotations, scaling in x, scaling in y, and shearing.
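As an illustration of the exponential map described above, the following minimal sketch builds an affine matrix from a weighted sum of generators. The generator entries follow one common convention in homogeneous coordinates; in particular, the sign of the rotation generator and the symmetric form of the shear generator are assumptions, and all function and key names are hypothetical.

```python
import math

# Assumed generator convention (3x3, homogeneous coordinates).
GENERATORS = {
    "translate_x": [[0, 0, 1], [0, 0, 0], [0, 0, 0]],
    "translate_y": [[0, 0, 0], [0, 0, 1], [0, 0, 0]],
    "rotate":      [[0, -1, 0], [1, 0, 0], [0, 0, 0]],
    "scale_x":     [[1, 0, 0], [0, 0, 0], [0, 0, 0]],
    "scale_y":     [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    "shear":       [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
}


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def expm(m, terms=30):
    """Matrix exponential via a truncated Taylor series.

    Adequate for the small, well-scaled 3x3 matrices used here.
    """
    result = [[float(i == j) for j in range(3)] for i in range(3)]
    term = [row[:] for row in result]  # current series term, starts at I
    for n in range(1, terms):
        term = [[x / n for x in row] for row in matmul(term, m)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result


def affine_from_weights(weights):
    """Return exp(sum_i w_i G_i) for a dict mapping generator name to weight."""
    combo = [[0.0] * 3 for _ in range(3)]
    for name, w in weights.items():
        g = GENERATORS[name]
        for i in range(3):
            for j in range(3):
                combo[i][j] += w * g[i][j]
    return expm(combo)


# A weight of pi/2 on the rotation generator gives a quarter turn.
quarter_turn = affine_from_weights({"rotate": math.pi / 2})
```

Under this convention, a weight on a translation generator shifts the image by that amount, while a weight w on a scaling generator scales the corresponding axis by exp(w).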
Appendix B Semantic Segmentation: Details
To generate the rotCamVid dataset, we rotate all images in the CamVid dataset by a random angle, analogously to the rotMNIST dataset (Larochelle et al., 2007). We note that rotCamVid contains only a single rotated copy of each image, which is not the same as applying rotational augmentation during training. When computing the training loss and test accuracy, we ignore the padding pixels that appear as a result of rotating the image.
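One way to ignore padding pixels when scoring a segmentation is to carry a validity mask alongside each rotated image. The sketch below is a minimal illustration of that idea for pixel-wise accuracy; the function name and the mask representation are assumptions, not details taken from the experiments.

```python
def masked_pixel_accuracy(pred, target, valid):
    """Fraction of correctly labelled pixels among those marked valid.

    pred, target: 2D lists of class indices.
    valid: 2D list of booleans, False for padding pixels introduced by
    the rotation (assumed representation).
    """
    correct = 0
    total = 0
    for pred_row, target_row, valid_row in zip(pred, target, valid):
        for p, t, v in zip(pred_row, target_row, valid_row):
            if v:
                total += 1
                correct += int(p == t)
    if total == 0:
        raise ValueError("mask excludes every pixel")
    return correct / total


# Toy 2x2 example: the bottom-right pixel is padding and is not scored.
pred = [[1, 2], [0, 0]]
target = [[1, 1], [0, 3]]
valid = [[True, True], [True, False]]
accuracy = masked_pixel_accuracy(pred, target, valid)
```

The same mask can be applied to a per-pixel loss by summing only over valid pixels and dividing by their count.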
For the segmentation experiment we used the simpler augmentation distribution covering rotations and translations instead of the affine transformations (Section 3.2). We use a Gaussian parameterization of the distribution:
(11) 
where are trainable parameters, and is the affine transformation matrix for the random sample ; and are the width and height of the image.
Augerino achieves pixelwise segmentation accuracy of while the baseline model with standard augmentation achieves .
Appendix C Training Details
Network Training Hyperparameters
We train the networks in Sections 4 and 6 for epochs, using an initial learning rate of with a cosine learning rate schedule and a batch size of . We use the cross entropy loss function for all classification tasks, and mean squared error for all regression tasks except for QM9 where we use mean absolute error.
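For reference, a cosine learning rate schedule can be sketched as follows. This assumes the standard form that anneals from the initial rate to a minimum over the full run without warm restarts; the function name, the `min_lr` parameter, and the numeric values in the example are illustrative placeholders rather than the settings used in the experiments.

```python
import math


def cosine_lr(epoch, total_epochs, initial_lr, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing (no restarts)."""
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + cosine)


# Placeholder values: 100 epochs starting from a rate of 0.1.
schedule = [cosine_lr(e, 100, 0.1) for e in range(101)]
```

The rate starts at `initial_lr`, passes through the midpoint between `initial_lr` and `min_lr` halfway through training, and reaches `min_lr` at the final epoch.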
Train- and Test-Time Augmentations
In Algorithm 1 we include a term ncopies that denotes the number of augmentations sampled for each input during training. We find that Augerino achieves strong performance, with minimally increased training time, when ncopies is set to 1 at train time and increased at test time. Thus we train using a single augmentation for each input, and then apply multiple augmentations at test time to increase accuracy, as seen in Table 1.
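The role of ncopies can be sketched as averaging the network output over several sampled augmentations of the same input. The code below is a minimal illustration under that assumption; `model`, `sample_augmentation`, and `predict_averaged` are hypothetical names, and the toy model and augmentation stand in for a real network and a learned augmentation distribution.

```python
import random


def predict_averaged(model, x, sample_augmentation, ncopies, rng):
    """Average model outputs over ncopies randomly augmented copies of x."""
    if ncopies < 1:
        raise ValueError("ncopies must be at least 1")
    total = None
    for _ in range(ncopies):
        out = model(sample_augmentation(x, rng))
        total = list(out) if total is None else [a + b for a, b in zip(total, out)]
    return [t / ncopies for t in total]


# Toy stand-ins: a scalar input, a small random shift as the augmentation,
# and a two-output "model" whose outputs sum to one.
def toy_model(x):
    return [x * x, 1.0 - x * x]


def toy_augmentation(x, rng):
    return x + rng.uniform(-0.1, 0.1)


rng = random.Random(0)
train_style = predict_averaged(toy_model, 0.5, toy_augmentation, 1, rng)
test_style = predict_averaged(toy_model, 0.5, toy_augmentation, 8, rng)
```

With ncopies set to 1 the cost per input matches ordinary augmented training; larger values at test time trade extra forward passes for a lower-variance prediction.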
Appendix D Color-Space Augmentations: Details
In Section 9, we apply Augerino to learning color-space invariances on the STL-10 dataset. We consider two transformations:

Brightness adjustment by a value transforms the intensity in each channel additively:
(12) A positive value increases brightness, and a negative value decreases it.

 Contrast adjustment by a value transforms the intensity in each channel as follows (see https://www.dfstudios.co.uk/articles/programming/image-programming-algorithms/image-processing-algorithms-part-5-contrast-adjustment/):
(13)
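The two transformations can be sketched in code as follows. This is an illustration only: it assumes 8-bit intensities in [0, 255] with clamping, and a contrast correction factor of the form given in the article linked above; the experiments may use a different intensity range or parameterization. All function names are hypothetical.

```python
def clamp(value, low=0.0, high=255.0):
    """Restrict an intensity to the valid range (assumed to be 8-bit)."""
    return max(low, min(high, value))


def adjust_brightness(pixel, delta):
    """Add delta to every channel of an (R, G, B) pixel."""
    return tuple(clamp(c + delta) for c in pixel)


def adjust_contrast(pixel, level):
    """Scale each channel around the mid-gray value 128.

    level is assumed to lie in [-255, 255]; 0 leaves the pixel unchanged.
    """
    factor = 259.0 * (level + 255.0) / (255.0 * (259.0 - level))
    return tuple(clamp(factor * (c - 128.0) + 128.0) for c in pixel)


brighter = adjust_brightness((10, 200, 250), 20)
unchanged = adjust_contrast((10, 200, 250), 0)
```

Both operations are differentiable in their parameter away from the clamping boundaries, which is what allows a distribution over them to be learned by gradient descent.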
Appendix E QM9 Experiment
We reproduce the training details from Finzi et al. (2020). For affine transformations in 3D, there are 9 generators: 3 for translation, 3 for rotation, 2 for squeezing, and 1 for scaling, a straightforward extension of those listed in Equation 10 to three dimensions. As before, we parameterize the bounds on the uniform distribution for each of these generators. We use a regularization strength of .
Appendix F Width of Augerino Solutions
To help explain the improved generalization seen with Augerino, we train models on CIFAR both with and without Augerino. In Figure 9 we present the test error of both types of models, along with the corresponding effective dimensionalities and the sensitivity of the networks to parameter perturbations as measures of the flatness of the optimum found through training. Maddox et al. (2020) show that effective dimensionality can capture the flatness of optima in parameter space and is strongly correlated with generalization, with lower effective dimensionality implying flatter optima and better generalization. Overall, we see that Augerino enables networks to find much flatter solutions in the loss surface, corresponding to better compressions of the data and better generalization.
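For intuition, effective dimensionality is commonly defined from the eigenvalues of the loss Hessian as the sum of λ_i / (λ_i + z) for a regularization constant z > 0. The sketch below assumes that definition and nonnegative eigenvalues; the function name and the toy spectra are illustrative, not measurements from the experiments.

```python
def effective_dimensionality(eigenvalues, z=1.0):
    """Sum of lam / (lam + z) over nonnegative Hessian eigenvalues."""
    if z <= 0:
        raise ValueError("z must be positive")
    if any(lam < 0 for lam in eigenvalues):
        raise ValueError("eigenvalues are assumed to be nonnegative")
    return sum(lam / (lam + z) for lam in eigenvalues)


# Toy spectra: a sharp optimum has many large eigenvalues, while a flat
# optimum has only a few directions with significant curvature.
sharp = effective_dimensionality([100.0, 100.0, 100.0, 100.0])
flat = effective_dimensionality([100.0, 0.01, 0.01, 0.01])
```

Each eigenvalue much larger than z contributes close to one and each eigenvalue much smaller than z contributes close to zero, so the quantity roughly counts the directions in parameter space along which the loss is sharply curved.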