Learning Invariances in Neural Networks

10/22/2020 ∙ by Gregory Benton, et al.

Invariances to translations have imbued convolutional neural networks with powerful generalization properties. However, we often do not know a priori what invariances are present in the data, or to what extent a model should be invariant to a given symmetry group. We show how to learn invariances and equivariances by parameterizing a distribution over augmentations and optimizing the training loss simultaneously with respect to the network parameters and augmentation parameters. With this simple procedure we can recover the correct set and extent of invariances on image classification, regression, segmentation, and molecular property prediction from a large space of augmentations, on training data alone.




1 Introduction

The ability to learn constraints or symmetries is a foundational property of intelligent systems. Humans are able to discover patterns and regularities in data that provide compressed representations of reality, such as translation, rotation, intensity, or scale symmetries. Indeed, we see the value of such constraints in deep learning. Fully connected networks are more flexible than convolutional networks, but convolutional networks are more broadly impactful because they enforce the translation equivariance symmetry: when we translate an image, the outputs of a convolutional layer translate in the same way (LeCun et al., 1998; Cohen and Welling, 2016b). Further gains have been achieved by recent work hard-coding additional symmetries, such as rotation equivariance, into convolutional neural networks (e.g., Cohen and Welling, 2016b; Worrall et al., 2017; Zhou et al., 2017; Marcos et al., 2017).

But we might wonder whether it is possible to learn that we want to use a convolutional neural network. Moreover, we typically do not know which constraints are suitable for a given problem, and to what extent those constraints should be enforced. The class label for the digit ‘6’ is rotationally invariant up until it becomes a ‘9’. Like biological systems, we would like to automatically discover the appropriate symmetries. This task appears daunting, because standard learning objectives such as maximum likelihood select for flexibility, rather than constraints (MacKay, 2003; Minka, 2001).

In this paper, we provide an extremely simple and practical approach to automatically discovering invariances and equivariances, from training data alone. Our approach operates by learning a distribution over augmentations, then training with augmented data, leading to the name Augerino. Augerino (1) can learn both invariances and equivariances over a wide range of symmetry groups, including translations, rotations, scalings, and shears; (2) can discover partial symmetries, such as rotations not spanning the full range from $-\pi$ to $\pi$; (3) can be combined with any standard architecture, loss function, or optimization algorithm with little overhead; (4) performs well on regression, classification, and segmentation tasks, for both image and molecular data.

To our knowledge, Augerino is the first approach that can learn symmetries in neural networks from training data alone, without requiring a validation set or a special loss function. In Sections 3-5 we introduce Augerino and show why it works. The accompanying code can be found at https://github.com/g-benton/learning-invariances.

2 Related Work

There has been an explosion of work constructing convolutional neural networks that have hard-coded invariance or equivariance to a set of transformations, such as rotation (Cohen and Welling, 2016b; Worrall et al., 2017; Zhou et al., 2017; Marcos et al., 2017) and scaling (Worrall and Welling, 2019; Sosnovik et al., 2019). While recent methods use a representation theoretic approach to find a basis of equivariant convolutional kernels (Cohen and Welling, 2016a; Worrall et al., 2017; Weiler and Cesa, 2019), the older method of Laptev et al. (2016) pools network outputs over many hard-coded transformations of the input for fixed invariances, but does not consider equivariances.

With a desire to automate the machine learning pipeline, Cubuk et al. (2019) introduced AutoAugment, in which reinforcement learning is used to find an optimal augmentation policy within a discrete search space. At the expense of a massive computational budget for the search, AutoAugment brought substantial gains in image classification performance, including state-of-the-art results on ImageNet. The AutoAugment framework was extended first to Fast AutoAugment in Lim et al. (2019), improving both the speed and accuracy of AutoAugment by using Bayesian data augmentation (Tran et al., 2017). Both Cubuk et al. (2019) and Lim et al. (2019) apply a reinforcement learning approach to searching the space of augmentations, significantly differing from our work, which directly optimizes distributions over augmentations with respect to the training loss.

Faster AutoAugment (Hataya et al., 2019), which uses a GAN framework to match augmentations to the data distribution, and Differentiable Automatic Data Augmentation (Li et al., 2020), which applies a DARTS (Liu et al., 2018) bi-level optimization procedure to learn augmentations from the validation loss, are most similar to Augerino in the discovery of distributions over augmentations. Both methods learn augmentations from data using the reparametrization trick; however, unlike Li et al. (2020) and Liu et al. (2018), we learn augmentations directly from the training loss without the need for GAN training or the complex DARTS procedure (Liu et al., 2018; Xu et al., 2019; Liang et al., 2019), and we specifically learn degrees of invariance and equivariance.

To the best of our knowledge, Augerino is the first work to learn invariances and equivariances in neural networks from training data alone. The ability to automatically discover symmetries enables us to uncover interpretable salient structure in data, and provide better generalization.

3 Augerino: Learning Invariances through Augmentation

A simple way of constructing a model invariant to a given group of transformations is to average the outputs of an arbitrary model for the inputs transformed with all the transformations in the group. For example, if we wish to make a given image classifier invariant to horizontal reflections, we can average the predictions of the network for the original and reflected input.
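The reflection example above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `toy_model` is a hypothetical two-class model invented here so that its output visibly depends on column position, and images are plain nested lists.

```python
from typing import Callable, List

Image = List[List[float]]  # rows of non-negative pixel intensities


def hflip(image: Image) -> Image:
    """Reflect an image horizontally by reversing every row."""
    return [row[::-1] for row in image]


def toy_model(image: Image) -> List[float]:
    """Hypothetical two-class model whose score depends on column position,
    so on its own it is NOT invariant to horizontal reflection."""
    score = sum(pixel * (col + 1) for row in image for col, pixel in enumerate(row))
    p = score / (score + 1.0)
    return [p, 1.0 - p]


def flip_invariant(model: Callable[[Image], List[float]], image: Image) -> List[float]:
    """Average the model's predictions over the group {identity, reflection}."""
    original = model(image)
    reflected = model(hflip(image))
    return [(a + b) / 2.0 for a, b in zip(original, reflected)]


image = [[1.0, 0.0, 0.0],
         [1.0, 0.5, 0.0]]
print(toy_model(image), toy_model(hflip(image)))   # the raw model changes under reflection
print(flip_invariant(toy_model, image))            # the averaged model does not
print(flip_invariant(toy_model, hflip(image)))
```

Because reflecting twice returns the original image, the two averaged predictions are built from the same pair of outputs, which is why the averaged model is exactly invariant for this two-element group.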

Augerino functions by sampling multiple augmentations from a parameterized distribution then applying these augmentations to an input to acquire multiple augmented samples of the input. The augmented input samples are each then passed through the model, with the final prediction being generated by averaging over the individual outputs. We present the Augerino framework in Figure 1.

Figure 1: The Augerino framework. Augmentations are sampled from a distribution governed by parameters $\theta$, and applied to an input to produce multiple augmented inputs. These augmented inputs are then passed to a neural network with weights $w$, and the final prediction is generated by averaging over the multiple outputs. Augerino discovers invariances by learning $\theta$ from training data alone.

Now, suppose we are working with a set $S$ of transformations. Relevant transformations may not always form a group structure, such as rotations by limited angles in the range $[-\theta, \theta]$. Given a neural network $f_w$, with parameters $w$, we can make a new model $\bar{f}$ which is approximately invariant to transformations in $S$ by averaging the outputs over a parametrized distribution $\mu_\theta(\cdot)$ of the transformations $g \in S$:

$$\bar{f}(x) = \mathbb{E}_{g \sim \mu_\theta} f_w(gx). \tag{1}$$

Since the cross-entropy loss $\ell$ for classification is linear in the class probabilities, we can pull the expectation outside of the loss:

$$\ell\big(\bar{f}(x)\big) = \ell\big(\mathbb{E}_{g \sim \mu_\theta} f_w(gx)\big) = \mathbb{E}_{g \sim \mu_\theta}\, \ell\big(f_w(gx)\big). \tag{2}$$

As stochastic gradient descent only requires an unbiased estimator of the gradients, we can train the augmentation averaged model $\bar{f}$ exactly by minimizing the loss of $f_w$ averaged over a finite number of samples $g \sim \mu_\theta$ at training time, using a Monte Carlo estimator.

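To learn the invariances we can also backpropagate through to the parameters $\theta$ of the distribution $\mu_\theta$ by using the reparametrization trick (Kingma and Welling, 2013). For example, for a uniform distribution over rotations with angles $U[-\theta, \theta]$, we can parametrize the rotation angle by $\phi = \theta\epsilon$ with $\epsilon \sim U[-1, 1]$. The loss for the augmentation-averaged model on an input $x$ can be computed as

$$L(x; w, \theta) = \mathbb{E}_{\phi \sim U[-\theta, \theta]}\, \ell\big(f_w(R_\phi x)\big) = \mathbb{E}_{\epsilon \sim U[-1, 1]}\, \ell\big(f_w(R_{\theta\epsilon} x)\big), \tag{3}$$

where $R_\phi$ denotes rotation of the input by the angle $\phi$ and $f_w$ is the network with weights $w$. Specifically, during training we can use a single sample from the augmentation distribution to estimate the gradients. The learned range of rotations would correspond to the extent rotational invariance is present in the data. With a more general set of transformations, we can similarly define a distribution $\mu_\theta(\cdot)$ over the transformation elements using the reparametrization trick $g = g_\theta(\epsilon)$, with $g \sim \mu_\theta(\cdot)$ and $\epsilon \sim U[-1, 1]^k$. The reparameterized loss is then

$$L(x; w, \theta) = \mathbb{E}_{\epsilon \sim U[-1, 1]^k}\, \ell\big(f_w(g_\theta(\epsilon)\, x)\big). \tag{4}$$

In Section 3.2 we describe a parameterization of the set of affine transformations which includes translations, rotations, and scalings of the input as special cases. In this fashion, we can train all the parameters of the augmentation averaged model, consisting of both the weights $w$ of $f_w$ and the parameters $\theta$ of the augmentation distribution $\mu_\theta$.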
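To make the reparametrization step concrete, the sketch below estimates the rotation loss with samples $\phi = \theta\epsilon$ and computes its gradient with respect to the width $\theta$ by the chain rule. It is only an illustration under stated assumptions: the "network" is a hypothetical fixed classifier `p = sigmoid(k * x)` acting on a 2D point rather than an image, and the helper names (`rotate`, `nll`, `reparam_loss`, `reparam_grad`) are invented here.

```python
import math
import random


def rotate(point, phi):
    """Rotate a 2D point counter-clockwise by the angle phi (radians)."""
    x, y = point
    return (x * math.cos(phi) - y * math.sin(phi),
            x * math.sin(phi) + y * math.cos(phi))


def nll(point, k=5.0):
    """Loss -log p for a hypothetical model p = sigmoid(k * x_coordinate)."""
    z = k * point[0]
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)  # stable softplus(-z)


def reparam_loss(theta, point, eps):
    """Monte Carlo loss estimate with phi = theta * eps, eps drawn from U[-1, 1]."""
    return sum(nll(rotate(point, theta * e)) for e in eps) / len(eps)


def reparam_grad(theta, point, eps, k=5.0):
    """Pathwise gradient of reparam_loss with respect to theta.

    Chain rule: dphi/dtheta = eps, dx'/dphi = -y',
    dloss/dx' = -k * sigmoid(-k * x') = -k / (1 + exp(k * x')).
    """
    total = 0.0
    for e in eps:
        x_rot, y_rot = rotate(point, theta * e)
        dloss_dx = -k / (1.0 + math.exp(k * x_rot))
        total += dloss_dx * (-y_rot) * e
    return total / len(eps)


rng = random.Random(0)
eps = [rng.uniform(-1.0, 1.0) for _ in range(256)]
point = (1.0, 0.0)
for theta in (0.5, 1.5, 2.5):
    print(theta, reparam_loss(theta, point, eps), reparam_grad(theta, point, eps))
```

Holding the noise samples `eps` fixed is what makes the loss a deterministic, differentiable function of `theta`; in a deep learning framework the same gradient would come from automatic differentiation rather than a hand-written chain rule.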

Test-time Augmentation

At test time we sample multiple transformations and make a prediction by averaging over the predictions generated for each transformed input, approximating the expectation in Equation (1). We discuss experimental design choices for train and test time augmentation in Appendix C.

Regularized Loss

Invariances correspond to constraints on the model, and in general the most unconstrained model may be able to achieve the lowest training loss. However, we have a prior belief that a model should preserve some level of invariance, even if standard losses cannot account for this preference. To bias training towards solutions that incorporate invariances, we add a regularization penalty to the network loss function that promotes broader distributions over augmentations. Our final loss function is given by

$$\min_{w, \theta}\; \mathbb{E}_{g \sim \mu_\theta}\, \ell\big(f_w(gx)\big) + \lambda R(\theta), \tag{5}$$

where $R(\theta)$ is a regularization function encouraging coverage of a larger volume of transformations and $\lambda$ is the regularization weight (the form of $R(\theta)$ is discussed in Section 3.2). In practice we find that the choice of $\lambda$ is largely unimportant; the insensitivity to the choice of $\lambda$ is demonstrated throughout Sections 4 and 6, in which performance is consistent for various values of $\lambda$. This is due to the fact that there is essentially no gradient signal for $\theta$ over the range of augmentations consistent with the data, so even a small push is sufficient. We discuss further why Augerino is able to learn the correct level of invariance, without sensitivity to $\lambda$ and from training data alone, in Section 5.

We refer to the proposed method as Augerino (footnote 1: https://en.wikipedia.org/wiki/Augerino). We summarize the method in Algorithm 1.

Inputs: Dataset $\mathcal{D}$; parametric family $g_\theta$ of data augmentations and a distribution $\mu_\theta$ over the parameters $\theta$; neural network $f_w$ with parameters $w$; number $n$ of augmented inputs to use during training; number of training steps $N$.
for $i = 1, \ldots, N$ do
       Sample a mini-batch $\tilde{x}$ from $\mathcal{D}$;
       For each datapoint in $\tilde{x}$ sample $n$ transformations from $\mu_\theta$;
       Average predictions of the network $f_w$ over the $n$ data transformations of $\tilde{x}$;
       Compute the loss (5), using the averaged predictions;
       Take the gradient step to update the parameters $w$ and $\theta$;
end for
Algorithm 1 Learning Invariances with Augerino
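The steps of Algorithm 1 can be sketched end to end on a toy problem. Everything specific below is an assumption made for illustration: inputs are angles rather than images, the "network" is a one-parameter model `p = sigmoid(w * cos(angle))`, the augmentation is a rotation by `theta * eps`, the regularizer is taken to be `-lam * theta**2`, the width is kept non-negative by clipping, and the gradients are written by hand where a real implementation would use automatic differentiation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))


def loss_and_grads(w, theta, alpha, y, eps, lam):
    """Regularized loss on the averaged prediction for one datapoint, plus gradients.

    Each augmented copy rotates the input angle `alpha` by theta * e.
    """
    copies = [alpha + theta * e for e in eps]
    probs = [sigmoid(w * math.cos(a)) for a in copies]
    p_bar = sum(probs) / len(probs)                    # averaged prediction
    p_bar = min(max(p_bar, 1e-12), 1.0 - 1e-12)        # guard the logarithms
    loss = -(y * math.log(p_bar) + (1 - y) * math.log(1.0 - p_bar)) - lam * theta ** 2
    dloss_dp = -y / p_bar + (1 - y) / (1.0 - p_bar)
    grad_w, grad_theta = 0.0, -2.0 * lam * theta
    for a, e, p in zip(copies, eps, probs):
        dp_dz = p * (1.0 - p) / len(probs)
        grad_w += dloss_dp * dp_dz * math.cos(a)
        grad_theta += dloss_dp * dp_dz * (-w * math.sin(a) * e)
    return loss, grad_w, grad_theta


def train(data, steps=300, n_copies=4, batch_size=8, lam=0.01, lr=0.1, seed=0):
    """Jointly update the network weight w and the augmentation width theta."""
    rng = random.Random(seed)
    w, theta = 1.0, 0.1
    for _ in range(steps):
        batch = rng.sample(data, min(batch_size, len(data)))       # mini-batch
        grad_w = grad_theta = 0.0
        for alpha, y in batch:
            eps = [rng.uniform(-1.0, 1.0) for _ in range(n_copies)]  # sample transformations
            _, gw, gt = loss_and_grads(w, theta, alpha, y, eps, lam)
            grad_w += gw / len(batch)
            grad_theta += gt / len(batch)
        w -= lr * grad_w                                           # gradient step on w
        theta = max(theta - lr * grad_theta, 0.0)                  # gradient step on theta
    return w, theta


# Toy data: class 1 for angles near 0, class 0 for angles near pi.
data_rng = random.Random(1)
data = [(data_rng.uniform(-math.pi / 4, math.pi / 4), 1) for _ in range(64)]
data += [(data_rng.uniform(3 * math.pi / 4, 5 * math.pi / 4), 0) for _ in range(64)]
print(train(data))
```

The point of the sketch is the structure of one step (sample a batch, sample transformations, average predictions, evaluate the regularized loss, update both parameter sets), not the particular values it prints.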

3.1 Extension to Equivariant Predictions

We now generalize Augerino to problems where the targets are equivariant rather than invariant to a certain set of transformations. We say that target values are equivariant to a set of input transformations if the targets for a transformed input are transformed in the same way as the input. Formally, a function $f$ is equivariant to a symmetry transformation $g$ if applying $g$ to the input of the function is the same as applying $g$ to the output, such that $f(gx) = g f(x)$. For example, in image segmentation if the input image is rotated the target segmentation mask should also be rotated by the same angle, rather than being unchanged.

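To make the Augerino model equivariant to transformations sampled from $\mu_\theta$, we can average the inversely transformed outputs of the network for transformed inputs:

$$f_{\text{aug-eq}}(x) = \mathbb{E}_{g \sim \mu_\theta}\, g^{-1} f_w(gx). \tag{6}$$

Supposing that $g$ acts linearly on the image, then the model is equivariant:

$$f_{\text{aug-eq}}(hx) = \mathbb{E}_{g \sim \mu_\theta}\, g^{-1} f_w(ghx) = \mathbb{E}_{g \sim \mu_\theta}\, h (gh)^{-1} f_w(ghx) = h\, \mathbb{E}_{u \sim \mu_\theta}\, u^{-1} f_w(ux) = h f_{\text{aug-eq}}(x), \tag{7}$$

where $u = gh$ and the distribution is right invariant: for any measurable set $S$, $\mu_\theta(S) = \mu_\theta(Sh)$. If the distribution over the transformations is uniform then the model is equivariant.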
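The averaging of inversely transformed outputs can be checked on a small discrete example. The sketch below swaps in the group of cyclic shifts of a 1D signal (not the image transformations used in the paper) because a uniform distribution over that finite group can be averaged exactly; `toy_network` is a hypothetical position-dependent map chosen so that it is not equivariant by itself.

```python
from typing import Callable, List

Signal = List[float]


def shift(x: Signal, s: int) -> Signal:
    """Cyclically shift a signal s positions to the right (negative s shifts left)."""
    s %= len(x)
    return x[-s:] + x[:-s] if s else list(x)


def toy_network(x: Signal) -> Signal:
    """Hypothetical position-dependent map, so it is not shift equivariant."""
    return [(i + 1) * v for i, v in enumerate(x)]


def equivariant_average(f: Callable[[Signal], Signal], x: Signal) -> Signal:
    """Average g^{-1} f(g x) over all cyclic shifts g (a uniform distribution)."""
    n = len(x)
    out = [0.0] * n
    for s in range(n):
        undone = shift(f(shift(x, s)), -s)   # apply g, run f, then apply g^{-1}
        out = [acc + value / n for acc, value in zip(out, undone)]
    return out


x = [1.0, 2.0, 0.5, -1.0]
print(toy_network(shift(x, 1)), shift(toy_network(x), 1))   # differ: not equivariant
print(equivariant_average(toy_network, shift(x, 1)))
print(shift(equivariant_average(toy_network, x), 1))        # agree up to rounding
```

Shifting the input only relabels which group element appears in each term of the average, which is the right-invariance property used in the derivation above.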

3.2 Parameterizing Affine Transformations

We now show how to parameterize a distribution over the set of affine transformations of data (e.g. images). With this parameterization, Augerino can learn from a broad variety of augmentations including translations, rotations, scalings and shears.

The set of affine transformations forms an algebraic structure known as a Lie Group. To apply the reparametrization trick, we can parametrize elements of this Lie Group in terms of its Lie Algebra via the exponential map (Falorsi et al., 2019). With a very simple approach, we can define bounds $\theta_i$ on a uniform distribution over the different exponential generators $G_i$ in the Lie Algebra:

$$g_\theta(\epsilon) = \exp\Big(\sum_i \epsilon_i \theta_i G_i\Big), \qquad \epsilon \sim U[-1, 1]^k, \tag{8}$$

where exp is the matrix exponential function: $\exp(A) = \sum_{n=0}^{\infty} \frac{1}{n!} A^n$. (Footnote 2: Mathematically speaking, this distribution is a pushforward by the exp map of a uniform distribution on a cube scaled to side lengths $\theta_i$.)

The generators of the affine transformations in $\mathbb{R}^2$, $G_1, \ldots, G_6$, correspond to translation in $x$, translation in $y$, rotation, scaling in $x$, scaling in $y$, and shearing; we write out these generators in Appendix A. The exponential map of each generating matrix produces an affine matrix that can be used to transform the coordinate grid points of the input like in Jaderberg et al. (2015). To ensure that the parameters $\theta_i$ are positive, we learn parameters $\tilde{\theta}_i$ where $\theta_i = \log(1 + \exp \tilde{\theta}_i)$. In maximizing the volume of transformations covered, it would be geometrically sensible to maximize the Haar measure of the set of transformations that are covered by Augerino, which is similar to the volume covered in the Lie Algebra, $\mathrm{Vol}(\theta) = \prod_i \theta_i$. However, we find that even the negative $L_2$ regularization $R(\theta) = -\|\theta\|^2$ on the bounds is sufficient to bias the model towards invariance. More intuitively, the regularization penalty biases solutions towards values of $\theta$ which induce broad distributions over affine transformations, $\mu_\theta$.

We apply the regularization penalty on both classification and regression problems, using cross entropy and mean squared error loss, respectively. This regularization method is effective, interpretable, and leads to the discovery of the correct level of invariance for a wide range of $\lambda$.
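As a rough sketch of this parameterization, the code below samples an affine matrix by exponentiating a random combination of generators. The specific 3x3 generator matrices are an assumption (the standard homogeneous-coordinate choice; the paper's own are in its Appendix A, which is not reproduced here), the softplus map is one common way to keep the widths positive, and the truncated power series stands in for a library matrix exponential.

```python
import math
import random

# Assumed generators of 2D affine maps in homogeneous coordinates (3x3).
GENERATORS = [
    [[0, 0, 1], [0, 0, 0], [0, 0, 0]],   # translation in x
    [[0, 0, 0], [0, 0, 1], [0, 0, 0]],   # translation in y
    [[0, -1, 0], [1, 0, 0], [0, 0, 0]],  # rotation
    [[1, 0, 0], [0, 0, 0], [0, 0, 0]],   # scaling in x
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],   # scaling in y
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],   # shearing
]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def mat_exp(a, terms=30):
    """Matrix exponential via the truncated series sum_n A^n / n!."""
    result = [[float(i == j) for j in range(3)] for i in range(3)]
    term = [row[:] for row in result]
    for n in range(1, terms):
        term = [[v / n for v in row] for row in matmul(term, a)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def softplus(raw):
    """Map an unconstrained parameter to a positive width."""
    return math.log1p(math.exp(-abs(raw))) + max(raw, 0.0)


def sample_affine(theta, rng):
    """Sample exp(sum_i eps_i * theta_i * G_i) with each eps_i drawn from U[-1, 1]."""
    eps = [rng.uniform(-1.0, 1.0) for _ in GENERATORS]
    algebra = [[sum(e * t * g[i][j] for e, t, g in zip(eps, theta, GENERATORS))
                for j in range(3)] for i in range(3)]
    return mat_exp(algebra)


raw_params = [-2.0, -2.0, 0.5, -3.0, -3.0, -3.0]   # hypothetical learned values
theta = [softplus(r) for r in raw_params]
for row in sample_affine(theta, random.Random(0)):
    print(row)
```

The resulting matrix would then be applied to the coordinate grid of the input; that resampling step is omitted here.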

4 Shades of Invariance

We can broadly classify invariances in three distinct ways: first there are cases in which we wish to be completely invariant to transformations in the data, such as to rotations on the rotMNIST dataset. There are also cases in which we want to be only partially invariant to transformations, i.e. soft invariance, such as if we are asking if a picture is right side up or upside down. Lastly, there are cases in which we wish there to be no invariance to transformations, such as when we wish to predict the rotations themselves. We show that Augerino can learn full invariance, soft invariance, and no invariance to rotations. We then explain in Section 5 why Augerino is able to discover the correct level of invariance from training data alone. Incidentally, soft invariances are the most representative of real-world problems, and also the most difficult to correctly encode a priori — where we most need to learn invariances.

For the experiments in this and all following sections we use a -layer CNN architecture from Laine and Aila (2016). We compare Augerino trained with three values of $\lambda$ from Equation 5, corresponding to low, standard, and high levels of regularization. To further emphasize the need for invariance to be learned as opposed to just embedded in a model we also show predictions generated from an invariant -steerable network (Cohen and Welling, 2016a). Specific experimental and training details are in Appendix C.

4.1 Full Rotational Invariance: rotMNIST

The rotated MNIST dataset (rotMNIST) consists of the MNIST dataset with the input images randomly rotated. As the dataset has an inherent augmentation present (random rotations), we desire a model that is invariant to such augmentations. With Augerino, we aim to approximate invariance to rotations by learning an augmentation distribution that is uniform over all rotations in $[-\pi, \pi]$.


Figure 2 shows the learned distribution over rotations to apply to images input into the model. On top of learning the correct augmentation through automatic differentiation using only the training data, we achieve test accuracy. We also see the level of regularization has little effect on performance. To our knowledge, only Weiler and Cesa (2019) achieve better performance on the rotMNIST dataset, using the correct equivariance already hard-coded into the network.

Figure 2: Left: Samples of the rotated digits in the data. Center: The initial and learned distributions over rotations. Right: The prediction probabilities of the correct class label over rotated versions of an image; the model learns to be approximately invariant to rotations under all levels of regularization.

4.2 Soft Invariance: Mario & Iggy

We show that Augerino can learn soft invariances, e.g. invariance to a subset of transformations such as only partial rotations. To this end, we consider a dataset in which the labels are dependent on both image and pose. We use the sprites for the characters Mario and Iggy from Super Mario World (Nintendo, 1990), randomly rotated in the intervals $[-\pi/4, \pi/4]$ and $[3\pi/4, 5\pi/4]$. There are 4 labels in the dataset: one for the Mario sprite in the upper half plane, one for the Mario sprite in the lower half plane, one for the Iggy sprite in the upper half plane, and one for the Iggy sprite in the lower half plane; we show an example demonstrating each potential label in Figure 3.

In Figure 3, we see that too much rotational augmentation would make it impossible to correctly identify the pose. The limited rotations present in the data give that the labels are invariant to rotations of up to radians. Augerino learns the correct augmentation distribution within approximately radians, and the predicted class labels follow the desired invariances, with predictions that are invariant to rotations only within subsets of .

Figure 3: Left: Example data from the constructed Mario dataset. Labels are dependent on both the character, Mario or Iggy, and the rotation, upper half- or lower half-plane. Center: The initial and learned distribution over rotations. Rotations in the data are limited to and , meaning that augmenting an image by no more than radians will keep the rotation in the same half of the plane as where it started. The learned distributions approximate the invariance to rotations in that is present in the data. Right: The predicted probability of label for input images of Mario rotated at various angles. -steerable model is invariant, and incapable of distinguishing between inputs of different rotations.
Figure 4: Left: The data generating process for the Olivetti faces dataset. The labels correspond to the rotation of the input image. Center: The initialized and learned distributions over rotations. Right: The predictions generated as an input is rotated. Here we see that there is no invariance present for any level of regularization - as the image rotates the predicted label changes accordingly. The -steerable network fails for this task, as the invariance to rotations prevents us from being able to predict the rotation of the image.

4.3 Avoiding Invariance: Olivetti Faces

To test that Augerino can avoid unwanted invariances we train the model on the rotated Olivetti faces dataset (Hinton and Salakhutdinov, 2008). This dataset consists of distinct images of different people. We select the images of people to generate the training set, randomly rotating each image in , retaining the angle of rotation as the new label. We then crop the result to pixel square images. We repeat the process times for each image, generating training images. Figure 4 shows the data generating process and the corresponding label. Augmenting the image with any rotation would make it impossible to learn the angle by which the original image was rotated.

We find experimentally in Figure 4 that when we initialize the Augerino model such that the distribution over the rotation generating matrix is uniform , training for epochs reduces the distribution on the rotational augmentation to have domain of support radians wide. The model learns a nearly fixed transformation in each of the other spaces of affine transformation, all with domains of support for the weights under units wide.

5 Why Augerino Works

The conventional wisdom is that it is impossible to learn invariances directly from the training loss as invariances are constraints on the model which make it harder to fit the data (van der Wilk et al., 2018). Given data that has invariance to some augmentation, the training loss will not be improved by widening our distribution over this augmentation, even if it helps generalization: we would want a model to be invariant to rotations of a ‘6’ up until it looks more like a ‘9’, but no invariance will achieve the same training loss. However, it is sufficient to add a simple regularization term to encourage the model to discover invariances. In practice we find that the final distribution over augmentations is insensitive to the level of regularization, and that even a small amount of regularization will enable Augerino to find wide distributions over augmentations that are consistent with the precise level of invariances in the data.

We illustrate the learning of invariances with Augerino in panel (a) of Figure 5. Suppose only a limited degree of invariance is present in the data, as in Section 4.2. Then the training loss for the augmentation parameters will be flat for augmentations within the range of invariance present in the data (shown in white), and then will increase sharply beyond this range (corresponding region of Augerino parameters is shown in blue). The regularized loss in Eq. (5) will push the model to increase the level of invariance within the flat region of the training loss, but will not push it beyond the degree of invariance present in the data unless the regularization strength is extreme.

We demonstrate the effect described above for the Mario and Iggy classification problem of Section 4.2 in panel (b) of Figure 5. We use a network trained with Augerino and visualize the loss and gradient with respect to the range of rotations applied to the input with and without regularization. Without regularization, the loss is almost completely flat until the value of which is the true degree of rotational invariance in the data. With regularization we add an incentive for the model to learn larger values of the rotation range. Consequently, the loss achieves its optimum close to the optimal value of the parameter at and then quickly grows beyond that value. Figure 6 displays the results of panel (b) of Figure 5 in action; gradient signals push augmentation distributions that are too wide down and too narrow up to the correct width.
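The shape of the loss described above can be reproduced on a toy problem. The sketch below is not the Mario and Iggy experiment: it assumes data angles uniform on $[-\pi/4, \pi/4]$ with a single label, a fixed sharp classifier `p = sigmoid(sharpness * cos(angle))` standing in for a trained network, a regularizer `-lam * theta**2`, and a deterministic midpoint rule in place of sampling.

```python
import math


def softplus(z):
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)


def expected_loss(theta, sharpness=20.0, n=80):
    """Midpoint-rule estimate of the expected loss for rotation width theta.

    The augmentation rotates each data angle alpha by theta * eps with eps
    uniform on [-1, 1]; the per-example loss is -log sigmoid(sharpness * cos).
    """
    total = 0.0
    for i in range(n):
        alpha = -math.pi / 4 + (i + 0.5) * (math.pi / 2) / n
        for j in range(n):
            eps = -1.0 + (j + 0.5) * 2.0 / n
            total += softplus(-sharpness * math.cos(alpha + theta * eps))
    return total / (n * n)


def regularized_loss(theta, lam=0.01):
    """Unregularized loss plus lam * R(theta) with R(theta) = -theta**2."""
    return expected_loss(theta) - lam * theta ** 2


thetas = [0.05 * k for k in range(1, 31)]           # grid from 0.05 to 1.50
best = min(thetas, key=regularized_loss)
for theta in (0.2, 0.5, 0.8, 1.2):
    print(theta, expected_loss(theta), regularized_loss(theta))
print("regularized minimizer on the grid:", best)
```

In this toy setting the unregularized loss stays near zero while rotated inputs remain on the correct side and rises quickly once they cross over, so a small regularization weight is enough to move the minimizer out to the neighbourhood of the true invariance range without pushing it far beyond.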

Incidentally, the Augerino solutions are substantially flatter than those obtained by standard training, as shown in Appendix F, Figure 9, which may also make them more easily discoverable by procedures such as SGD. We also see that these solutions indeed provide better generalization.

(a) Augerino training
(b) Loss function and Gradient
Figure 5: (a): A visualization of the space of possible transformations. Augerino expands to fill out the invariances in the dataset but is halted at the boundary where harmful transformations increase the training loss like rotating a 6 to a 9. (b): Loss value as a function of the rotation range applied to the input on the Mario and Iggy classification problem of Section 4.2 and its derivative. Without regularization the loss is flat for augmentations within the range corresponding to the true rotational invariance range in the data, and grows sharply beyond this range.
Figure 6: The distribution over rotation augmentations for the Mario and Iggy dataset over training iterations for various initializations. Regardless of whether we start with too wide, too narrow, or approximately the correct distribution over rotations, Augerino converges to the appropriate width.

6 Image Recognition

As Augerino learns a set of augmentations specific to a given dataset, we expect to see that Augerino is capable of boosting performance over applying any level of fixed augmentation. Using the CIFAR-10 dataset, we compare Augerino to training on data with no augmentation, with fixed, commonly applied augmentations, and with the augmentations given by Fast AutoAugment (Lim et al., 2019).

No Aug. Fixed Aug. Augerino ( copies) Augerino ( copy) Fast AA
Test Accuracy
We compare models trained with no augmentation, a fixed commonly applied set of augmentations (including flipping, cropping, and color-standardization), Augerino, and Fast AutoAugment (Lim et al., 2019). Augerino with

provides a boost in performance with minimal increased training time. Error bars are reported as the standard deviation in accuracy for Augerino trained over

Table 1: Test accuracy for models trained on CIFAR-10 with different augmentations applied to the training data.

Table 1 shows that Augerino is competitive with advanced models that seek data-based augmentation schemes. The gains in performance are accompanied by notable simplifications in setup: we do not require a validation set and the augmentation is learned concurrently with training (there is no pre-processing to search for an augmentation policy). In Appendix F we show that Augerino finds flatter solutions in the loss surface, which are known to generalize well (Maddox et al., 2020). To further address the choice of regularization parameter, we train a number of models on CIFAR-10 with varying levels of regularization. In Figure 9 we present the test accuracy of models for different regularization parameters along with the corresponding effective dimensionalities of the networks as a measure of the flatness of the optimum found through training. Maddox et al. (2020) show that effective dimensionality can capture the flatness of optima in parameter space and is strongly correlated to generalization, with lower effective dimensionality implying flatter optima and better generalization.

The results of the experiment presented in Figure 9 solidify Augerino’s capability to boost performance on image recognition tasks as well as demonstrate that the inclusion of regularization is helpful, but not necessary to train accurate models. If the regularization parameter becomes too large, as can be seen in the rightmost violins of Figure 9, training can become unstable with more variance in the accuracy achieved. We observe that while it is possible to achieve good results with no regularization, the inclusion of an inductive bias that we ought to include some invariances (by adding a regularization penalty) improves performance.

7 Molecular Property Prediction

We test our method on the molecular property prediction dataset QM9 (Blum and Reymond, 2009; Rupp et al., 2012), which consists of small organic molecules with features given by the coordinates of the atoms in 3D space and their charges. We focus on the HOMO task of predicting the energy of the highest occupied molecular orbital, and we learn Augerino augmentations in the space of affine transformations of the atomic coordinates in $\mathbb{R}^3$. We parametrize the transformation as before with a uniform distribution for each of the generators listed in Appendix A. We use the LieConv model introduced in Finzi et al. (2020), both with no equivariance (LieConv-Trivial) and with 3D translational equivariance (LieConv-T(3)). We train the models for 500 epochs on MAE (additional training details are given in Appendix C) and report the test performance in Table 2. Augerino performs much better than using no augmentations and is competitive with the hand-chosen random rotation and translation augmentation (SE(3)) that incorporates domain knowledge about the problem. We detail the learned distribution over affine transformations in Appendix E. Augerino is useful both for the non-equivariant LieConv-Trivial model and for the translationally equivariant LieConv-T(3) model, suggesting that Augerino can complement architectural equivariance.

HOMO (meV) LUMO (meV)
No Aug. Augerino SE(3) No Aug. Augerino SE(3)
Table 2: Test MAE (in meV) on QM9 tasks trained with specified augmentation.

8 Semantic Segmentation

In Section 3.1 we showed how Augerino can be extended to equivariant problems. In semantic segmentation the targets are perfectly aligned with the inputs and the network should be equivariant to any transformations present in the data. To test Augerino in an equivariant learning setting we construct rotCamVid, a variation of the CamVid dataset (Brostow et al., 2008a,b) where all the training and test points are rotated by a random angle (see Appendix Figure 7). For any fixed image we always use the same rotation angle, so no two copies of the same image with different rotations are present in the data. We use the FC-Densenet segmentation architecture (Jégou et al., 2017). We train Augerino with a Gaussian distribution over random rotations and translations.

In Appendix Figure 7 we visualize the training data and learned augmentations for Augerino. Augerino is able to successfully recover rotational augmentation while matching the performance of the baseline. For further details, please see Appendix B.

9 Color-Space Augmentations

In the previous sections we have focused on learning spatial invariances with Augerino. Augerino is general and can be applied to arbitrary differentiable input transformations. In this section, we demonstrate that Augerino can learn color-space invariances.

We consider two color-space augmentations: brightness adjustments and contrast adjustments. Each of these can be implemented as simple differentiable transformations to the RGB values of the input image (for details, see Appendix D). We use Augerino to learn a uniform distribution over the brightness and contrast adjustments on STL-10 (Coates et al., 2011) using the -layer CNN architecture (see Section 4). For both Augerino and the baseline model, we use standard spatial data augmentation: random translations, flips and cutout (DeVries and Taylor, 2017). The baseline model achieves accuracy where the mean and standard deviation are computed over independent runs. The Augerino model achieves a slightly higher accuracy and learns to be invariant to noticeable brightness and contrast changes in the input image (see Appendix Figure 8).
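As a hedged illustration of what such differentiable color-space transformations might look like, the sketch below uses an additive brightness shift and a contrast scaling about the image mean. These particular formulas, the exponential parameterization of the contrast factor, and all function names are assumptions made here; the paper's own definitions are in its Appendix D, which is not reproduced in this text.

```python
import math
import random
from typing import List

Image = List[List[float]]  # each inner list is one pixel's [R, G, B] values


def adjust_brightness(image: Image, shift: float) -> Image:
    """Add the same offset to every channel value."""
    return [[v + shift for v in pixel] for pixel in image]


def adjust_contrast(image: Image, factor: float) -> Image:
    """Scale every value's distance from the image mean by `factor`."""
    values = [v for pixel in image for v in pixel]
    mean = sum(values) / len(values)
    return [[mean + factor * (v - mean) for v in pixel] for pixel in image]


def sample_color_augmentation(image: Image, brightness_width: float,
                              contrast_width: float, rng: random.Random) -> Image:
    """Reparameterized sample: shift = width * eps, factor = exp(width * eps)."""
    shift = brightness_width * rng.uniform(-1.0, 1.0)
    factor = math.exp(contrast_width * rng.uniform(-1.0, 1.0))
    return adjust_contrast(adjust_brightness(image, shift), factor)


image = [[0.2, 0.4, 0.6], [0.8, 0.1, 0.3]]
print(sample_color_augmentation(image, 0.3, 0.5, random.Random(0)))
```

Both operations are affine in the pixel values and smooth in the sampled shift and factor, so the widths of their distributions could be learned with the same reparametrization used for the spatial transformations.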

10 Conclusion

We have introduced Augerino, a framework that can be seamlessly deployed with standard model architectures to learn symmetries from training data alone, and improve generalization. Experimentally, we see that Augerino is capable of recovering ground truth invariances, including soft invariances, ultimately discovering an interpretable representation of the dataset. Augerino’s ability to recover interpretable and accurate distributions over augmentations leads to increased performance over both task-specific specialized baselines and competing data-based augmentation schemes on a variety of tasks including molecular property prediction, image segmentation, and classification.

Broader Impacts

Our work is largely methodological and we anticipate that Augerino will primarily see use within the machine learning community. Augerino’s ability to uncover invariances present within the data, without modifying the training procedure and with a plug-and-play design that is compatible with any network architecture, makes it an appealing method to be deployed widely. We hope that learning invariances from data is an avenue that will see continued inquiry and that Augerino will motivate further exploration.


This research is supported by an Amazon Research Award, Facebook Research, Amazon Machine Learning Research Award, NSF I-DISRE 193471, NIH R01 DA048764-01A1, NSF IIS-1910266, and NSF 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science.


  • B. Athiwaratkun, M. Finzi, P. Izmailov, and A. G. Wilson (2019) There are many consistent explanations of unlabeled data: why you should average. ICLR.
  • E. J. Bekkers (2020) B-spline cnns on lie groups. In International Conference on Learning Representations.
  • L. C. Blum and J.-L. Reymond (2009) 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131, pp. 8732. Cited by: §7.
  • G. J. Brostow, J. Fauqueur, and R. Cipolla (2008a) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters. Cited by: §8.
  • G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla (2008b) Segmentation and recognition using structure from motion point clouds. In ECCV (1), pp. 44–57. Cited by: §8.
  • A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gordon, D. Dunson, and M. Dudík (Eds.), Proceedings of Machine Learning Research, Vol. 15, Fort Lauderdale, FL, USA, pp. 215–223. Cited by: §9.
  • T. S. Cohen, M. Geiger, and M. Weiler (2019) A general theory of equivariant cnns on homogeneous spaces. In Advances in Neural Information Processing Systems, pp. 9142–9153.
  • T. S. Cohen and M. Welling (2016a) Steerable cnns. arXiv preprint arXiv:1612.08498. Cited by: §2, §4.
  • T. Cohen and M. Welling (2016b) Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999. Cited by: §1, §2.
  • E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 113–123. Cited by: §2.
  • T. Dao, A. Gu, A. J. Ratner, V. Smith, C. De Sa, and C. Ré (2019) A kernel theory of modern data augmentation. Proceedings of machine learning research 97, pp. 1528.
  • T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §9.
  • L. Falorsi, P. de Haan, T. R. Davidson, and P. Forré (2019) Reparameterizing distributions on lie groups. arXiv preprint arXiv:1903.02958. Cited by: §3.2.
  • M. Finzi, S. Stanton, P. Izmailov, and A. G. Wilson (2020) Generalizing convolutional neural networks for equivariance to lie groups on arbitrary continuous data. arXiv preprint arXiv:2002.12880. Cited by: Appendix E, §7.
  • R. Hataya, J. Zdenek, K. Yoshizoe, and H. Nakayama (2019) Faster autoaugment: learning augmentation strategies using backpropagation. arXiv preprint arXiv:1911.06987. Cited by: §2.
  • G. E. Hinton and R. R. Salakhutdinov (2008) Using deep belief nets to learn covariance kernels for gaussian processes. In Advances in neural information processing systems, pp. 1249–1256. Cited by: §4.3.
  • Z. Huang, C. Wan, T. Probst, and L. Van Gool (2017) Deep learning on lie groups for skeleton-based action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6099–6108.
  • M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025. Cited by: §3.2.
  • S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio (2017) The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 11–19. Cited by: §8.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.
  • S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: §4.
  • D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys (2016) TI-pooling: transformation-invariant pooling for feature learning in convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 289–297. Cited by: §2.
  • H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio (2007) An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, New York, NY, USA, pp. 473–480. ISBN 9781595937933. Cited by: Appendix B.
  • Y. LeCun, Y. Bengio, et al. (1998) Convolutional networks for images, speech, and time series, the handbook of brain theory and neural networks. MIT Press, Cambridge, MA. Cited by: §1.
  • Y. Li, G. Hu, Y. Wang, T. Hospedales, N. M. Robertson, and Y. Yang (2020) DADA: differentiable automatic data augmentation. arXiv preprint arXiv:2003.03780. Cited by: §2.
  • H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang, and Z. Li (2019) Darts+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035. Cited by: §2.
  • S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim (2019) Fast autoaugment. In Advances in Neural Information Processing Systems, pp. 6662–6672. Cited by: §2, Table 1, §6.
  • H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §2.
  • D. J. MacKay (2003) Information theory, inference and learning algorithms. Cambridge university press. Cited by: §1.
  • W. J. Maddox, G. Benton, and A. G. Wilson (2020) Rethinking parameter counting in deep models: effective dimensionality revisited. arXiv preprint arXiv:2003.02139. Cited by: Appendix F, Appendix, §6.
  • D. Marcos, M. Volpi, N. Komodakis, and D. Tuia (2017) Rotation equivariant vector field networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5048–5057. Cited by: §1, §2.
  • T. P. Minka (2001) Automatic choice of dimensionality for pca. In Advances in neural information processing systems, pp. 598–604. Cited by: §1.
  • Nintendo (1990) Super mario world. Cited by: §4.2.
  • M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld (2012) Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters 108, pp. 058301. Cited by: §7.
  • I. Sosnovik, M. Szmaja, and A. Smeulders (2019) Scale-equivariant steerable networks. arXiv preprint arXiv:1910.11093. Cited by: §2.
  • T. Tran, T. Pham, G. Carneiro, L. Palmer, and I. Reid (2017) A bayesian data augmentation approach for learning deep models. In Advances in neural information processing systems, pp. 2797–2806. Cited by: §2.
  • M. van der Wilk, M. Bauer, S. John, and J. Hensman (2018) Learning invariances using the marginal likelihood. In Advances in Neural Information Processing Systems, pp. 9938–9948. Cited by: §5.
  • M. van der Wilk (2020)
  • M. Weiler and G. Cesa (2019) General E(2)-equivariant steerable cnns. In Advances in Neural Information Processing Systems, pp. 14334–14345. Cited by: §2, §4.1.
  • D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow (2017) Harmonic networks: deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5028–5037. Cited by: §1, §2.
  • D. Worrall and M. Welling (2019) Deep scale-spaces: equivariance over scale. In Advances in Neural Information Processing Systems, pp. 7364–7376. Cited by: §2.
  • Y. Xu, L. Xie, X. Zhang, X. Chen, G. Qi, Q. Tian, and H. Xiong (2019) Pc-darts: partial channel connections for memory-efficient differentiable architecture search. arXiv preprint arXiv:1907.05737. Cited by: §2.
  • X. Zhang, Z. Wang, D. Liu, and Q. Ling (2019) Dada: deep adversarial data augmentation for extremely low data regime classification. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2807–2811.
  • Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao (2017) Oriented response networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 519–528. Cited by: §1, §2.

Appendix A Lie Group Generators

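The six Lie group generating matrices for affine transformations in 2D are

$$
G_1 = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},\quad
G_2 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix},\quad
G_3 = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},\quad
G_4 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},\quad
G_5 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix},\quad
G_6 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.
\tag{10}
$$

Applying the exponential map to these matrices produces affine matrices that can be used to transform images. In order, these matrices correspond to translations in $x$, translations in $y$, rotations, scaling in $x$, scaling in $y$, and shearing.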
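
As a minimal sketch of how the exponential map turns generator weights into an affine matrix, the snippet below builds a weighted sum of six 3x3 generators in homogeneous coordinates and exponentiates it. The specific basis (including the sign convention for rotation and the symmetric shear), the function names, and the truncated Taylor series used in place of a library matrix exponential are all assumptions made for illustration, not the authors' implementation.

```python
import math

# Assumed basis, ordered as in the text: x-translation, y-translation,
# rotation, x-scaling, y-scaling, shearing (homogeneous coordinates).
GENERATORS = [
    [[0, 0, 1], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 1], [0, 0, 0]],
    [[0, -1, 0], [1, 0, 0], [0, 0, 0]],
    [[1, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def expm(m, terms=30):
    """Matrix exponential of a 3x3 matrix via a truncated Taylor series."""
    result = [[float(i == j) for j in range(3)] for i in range(3)]
    term = [row[:] for row in result]  # holds m^n / n!, starting at n = 0
    for n in range(1, terms):
        term = [[v / n for v in row] for row in matmul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(3)]
                  for i in range(3)]
    return result


def affine_from_weights(weights):
    """Return exp(sum_i w_i G_i) for six generator weights."""
    m = [[sum(w * g[i][j] for w, g in zip(weights, GENERATORS))
          for j in range(3)] for i in range(3)]
    return expm(m)


# A weight of pi/2 on the rotation generator gives a quarter turn.
quarter_turn = affine_from_weights([0, 0, math.pi / 2, 0, 0, 0])
```

Because the translation generators are nilpotent, putting weights only on the first two entries yields the identity plus a pure shift, while a weight of log 2 on the fourth generator doubles the x-axis scale.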

Appendix B Semantic Segmentation: Details

In Section 8, we apply Augerino to semantic segmentation on the rotCamVid dataset (see Figure 7).

To generate the rotCamVid dataset, we rotate all images in the CamVid dataset by a random angle, analogously to the rotMNIST dataset (Larochelle et al., 2007). We note that rotCamVid contains only a single rotated copy of each image, which is not the same as applying rotational augmentation during training. When computing the training loss and test accuracy, we ignore the padding pixels that appear due to rotating the image.
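
Ignoring padding pixels can be sketched as a masked accuracy over flattened label maps. This is an illustration only: the function name `masked_pixel_accuracy` and the `valid_mask` argument (true for real image pixels, false for padding introduced by the rotation) are hypothetical and not taken from the paper's code.

```python
def masked_pixel_accuracy(predicted, target, valid_mask):
    """Fraction of correctly labeled pixels, counting only valid pixels.

    All three arguments are flat sequences of equal length. Padding
    pixels (valid_mask entry is falsy) do not affect the result.
    """
    if not (len(predicted) == len(target) == len(valid_mask)):
        raise ValueError("inputs must have the same length")
    n_valid = sum(1 for m in valid_mask if m)
    if n_valid == 0:
        raise ValueError("no valid pixels to evaluate")
    n_correct = sum(1 for p, t, m in zip(predicted, target, valid_mask)
                    if m and p == t)
    return n_correct / n_valid


# Two real pixels (one correct) and two padding pixels that are ignored.
accuracy = masked_pixel_accuracy([1, 2, 0, 0], [1, 3, 5, 0], [1, 1, 0, 0])
```

The same mask can be applied to a per-pixel training loss by averaging the loss over valid pixels only.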

For the segmentation experiment we used the simpler augmentation distribution covering rotations and translations instead of the affine transformations (Section 3.2). We use a Gaussian parameterization of the distribution:


where are trainable parameters, and is the affine transformation matrix for the random sample ; and are the width and height of the image.

Augerino achieves pixel-wise segmentation accuracy of while the baseline model with standard augmentation achieves .

Figure 7: Augmentations learned by Augerino on the rotCamVid dataset. (a): original data from rotCamVid; (b)-(d): three random samples of augmentations from the learned Augerino distribution. Augerino learns to be invariant to rotations but not translations.

Appendix C Training Details

Network Training Hyperparameters

We train the networks in Sections 4 and 6 for epochs, using an initial learning rate of with a cosine learning rate schedule and a batch size of . We use the cross entropy loss function for all classification tasks, and mean squared error for all regression tasks except for QM9 where we use mean absolute error.
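
A cosine learning rate schedule can be written in a few lines. The sketch below assumes the common form that anneals from the initial rate to zero over the full run without warm restarts; the function name is hypothetical, and the epoch count and learning rate in the usage line are placeholder values rather than the settings used in the experiments.

```python
import math


def cosine_lr(epoch, total_epochs, initial_lr, final_lr=0.0):
    """Learning rate at `epoch` under cosine annealing (no restarts)."""
    progress = epoch / total_epochs
    return final_lr + 0.5 * (initial_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values for illustration only.
schedule = [cosine_lr(e, total_epochs=100, initial_lr=0.1) for e in range(101)]
```

The rate starts at `initial_lr`, passes through half of it at the midpoint, and decays smoothly to `final_lr` at the last epoch.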

Train- and Test-Time Augmentations

In Algorithm 1 we include a term ncopies that denotes the number of sampled augmentations during training. We find that we can achieve strong performance with Augerino, with minimal increase in training time, by setting ncopies to 1 at train time and increasing ncopies at test time. That is, we train using a single augmentation for each input, and then apply multiple augmentations at test time to increase accuracy, as seen in Table 1.
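
The role of ncopies can be illustrated with a small prediction helper that averages model outputs over sampled augmentations. This is a sketch under stated assumptions: `predict_with_augmentations`, `model`, and `sample_augmentation` are hypothetical names, the model is any callable returning a list of numbers, and the toy augmentation below simply adds uniform noise to a scalar input.

```python
import random


def predict_with_augmentations(model, sample_augmentation, x, ncopies=1, rng=None):
    """Average model outputs over `ncopies` randomly augmented copies of x."""
    rng = rng or random.Random()
    outputs = [model(sample_augmentation(x, rng)) for _ in range(ncopies)]
    # Element-wise mean across the sampled copies.
    return [sum(values) / ncopies for values in zip(*outputs)]


# Toy stand-ins: a two-output "model" and an additive-noise augmentation.
toy_model = lambda x: [x, 2.0 * x]
toy_augment = lambda x, rng: x + rng.uniform(-1.0, 1.0)

train_style = predict_with_augmentations(toy_model, toy_augment, 0.5, ncopies=1)
test_style = predict_with_augmentations(toy_model, toy_augment, 0.5, ncopies=8)
```

With ncopies set to 1 each forward pass costs the same as in standard training; raising it at test time trades extra forward passes for a lower-variance averaged prediction.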

Appendix D Color-Space Augmentations: Details

Figure 8: Color-space augmentation distribution learned by Augerino. (a): original data from STL-10; (b)-(d): three random samples of augmentations from the learned Augerino distribution. Augerino learns to be invariant to a broad range of color and contrast adjustments while matching the performance of the baseline.

In Section 9, we apply Augerino to learning color-space invariances on the STL-10 dataset. We consider two transformations:

We apply brightness and contrast adjustments sequentially and independently from each other. We learn the range of a uniform distribution over the values in (12), (13). The learned data augmentation strategy is visualized in Figure 8.
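
Sampling from a uniform distribution over brightness and contrast adjustments, applied sequentially and independently, might look as follows. The exact forms of (12) and (13) are not reproduced here: the additive brightness shift, the mean-preserving contrast scaling with a log-scale parameter, and all function and parameter names are assumptions chosen for illustration.

```python
import math
import random


def adjust_brightness(pixels, shift):
    """Assumed form: add the same offset to every channel value."""
    return [p + shift for p in pixels]


def adjust_contrast(pixels, log_scale):
    """Assumed form: scale deviations from the image mean by exp(log_scale)."""
    mean = sum(pixels) / len(pixels)
    scale = math.exp(log_scale)
    return [mean + scale * (p - mean) for p in pixels]


def sample_color_augmentation(pixels, brightness_width, contrast_width, rng):
    """Draw both parameters uniformly and apply brightness, then contrast."""
    shift = rng.uniform(-brightness_width, brightness_width)
    log_scale = rng.uniform(-contrast_width, contrast_width)
    return adjust_contrast(adjust_brightness(pixels, shift), log_scale)


# Flattened RGB values in [0, 1]; the widths play the role of learned ranges.
image = [0.2, 0.4, 0.6]
augmented = sample_color_augmentation(image, 0.1, 0.3, random.Random(0))
```

Both transformations are smooth in their parameters, so the widths of the two uniform ranges can receive gradients through reparameterized samples.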

Appendix E QM9 Experiment

We reproduce the training details from Finzi et al. (2020). For affine transformations in 3D, there are 9 generators: 3 for translation, 3 for rotation, 2 for squeezing, and 1 for scaling, a straightforward extension of those listed in equation 10 to 3 dimensions. As before, we parameterize the bounds of the uniform distribution for each of these generators. We use a regularization strength of .

Appendix F Width of Augerino Solutions

To help explain the increased generalization seen when using Augerino, we train models on CIFAR- both with and without Augerino. In Figure 9 we present the test error of both types of models for along with the corresponding effective dimensionalities and sensitivity to parameter perturbations of the networks, as a measure of the flatness of the optimum found through training. Maddox et al. (2020) show that effective dimensionality can capture the flatness of optima in parameter space and is strongly correlated with generalization, with lower effective dimensionality implying flatter optima and better generalization. Overall, we see that Augerino enables networks to find much flatter solutions in the loss surface, corresponding to better compressions of the data and better generalization.
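
For reference, the effective dimensionality of Maddox et al. (2020) is the sum of λ_i / (λ_i + z) over the eigenvalues λ_i of the Hessian of the loss, with z > 0 a regularization constant. The sketch below evaluates that sum for a given list of eigenvalues; the function name and the two toy spectra are hypothetical, and computing Hessian eigenvalues for a real network is outside its scope.

```python
def effective_dimensionality(eigenvalues, z=1.0):
    """Sum of lam / (lam + z) over non-negative Hessian eigenvalues."""
    if z <= 0:
        raise ValueError("z must be positive")
    return sum(lam / (lam + z) for lam in eigenvalues)


# Toy spectra: many large eigenvalues (sharp) versus few (flat).
sharp_spectrum = [100.0] * 10
flat_spectrum = [100.0] * 2 + [0.01] * 8

n_eff_sharp = effective_dimensionality(sharp_spectrum)
n_eff_flat = effective_dimensionality(flat_spectrum)
```

Eigenvalues much larger than z each contribute close to one and eigenvalues much smaller than z contribute close to zero, so the flat spectrum yields the lower value.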

Figure 9: Top: Test error and train loss as a function of perturbation lengths along random rays from the SGD found training solution for models. Each curve represents a different ray. Bottom: Test error and effective dimensionality for models trained on CIFAR-. Results from

random initializations are presented violin-plot style where width represents the kernel density estimate at the corresponding