1 Introduction
A dataset can be described in terms of natural factors of variation of the data: for example, images of objects can present those objects with different poses, illuminations, colors, etc. Prediction consistency of models with respect to changes in these factors is a desirable property for out-of-domain generalization (bengio_representation_2013; lenc18understanding). However, state-of-the-art Convolutional Neural Networks (CNNs) struggle when presented with "unusual" examples, e.g. a bus upside down (alcorn2019strike). Indeed, CNNs lack robustness not only to changes in pose, but even to simple geometric transformations such as small translations and rotations (Engstrom19; Azulay19).

Invariance to factors of variation can be learned directly from data, built in via architectural inductive biases, or encouraged via data augmentation (Fig. 1A). As of today, data augmentation is the predominant method for encouraging invariance to a set of transformations. Yet, even with data augmentation, models fail to generalize to held-out objects and to learn invariance. For example, Engstrom19 found that augmenting with rotation and translation does not lead to invariance to these very same transformations at test time. A complementary research direction ensures that a model responds predictably to transformations, using group equivariance theory (see pmlrv48cohenc16; cohen_general_2020 among others). Provably invariant models have limited large-scale applications, as they require a priori knowledge of the underlying factors of variation (pmlrv48cohenc16; finzi_generalizing_2020). Recent work tackles the automatic discovery of symmetries in data (benton2020learning; zhou_metalearning_2020; dehmamy2021lie; Hashimoto2017). However, these methods have mostly been applied to synthetic or artificially augmented datasets, are not directly transferable to real data settings, and can even hurt performance when transferred (dehmamy2021lie).
One way to characterize the consistency of a model's response is by measuring its equivariance or invariance to changes of the input. A model f is equivariant to a transformation t of an input x if the model's output transforms in a corresponding manner via an output transformation t', i.e. f(t(x)) = t'(f(x)) for any x. Invariance is a special type of equivariance, where the model's output is unchanged by the transformation, i.e. f(t(x)) = f(x). To understand whether CNNs and other state-of-the-art models are invariant to changes in the data factors of variation, one needs explicit annotations about such factors. While this is trivial for synthetic datasets with known factors, identifying the factors of variation in real datasets is a complex task. As such, prior work turned to synthetic settings to show that knowing the underlying factors improves generalization (ChengRotDCF; locatello_weaklysupervised_2020).
Here, we take a first step towards understanding the links between data augmentations and factors of variation in natural images, in the context of image classification. We do so by carefully studying the roles of data augmentation, architectural inductive biases, and the data itself in encouraging invariance to these factors. We primarily focus on a ResNet18 trained on ImageNet as a benchmark for large-scale vision models (deng2009imagenet; He2015), which we also compare to the recently proposed vision transformer (ViT) (dosovitskiy2020image). While previous works study the invariance properties of neural networks to a set of transformations (lenc18understanding; Myburgh2021; Zhang2019; Kauderer2018; Engstrom19), we ground invariances in ImageNet's factors of variation. We make the following contributions (summarized in Fig. 1):

What transformations does standard data augmentation correspond to? In Sec. 2, we demonstrate that the success of the popular random resized crop (RRC) augmentation amounts to a precise combination of translation and scaling. To tease out the relative roles of these factors, we decomposed RRC into separate augmentations. While neither augmentation alone was sufficient to fully replace RRC, we observed that, despite the approximate translation invariance built into CNNs, translation alone is sufficient to bring performance close to RRC, whereas the contribution of scale is comparatively minor.

What types of invariance are present in ImageNet-trained models? Do these invariances derive from the data augmentation, the architectural bias, or the data itself (Fig. 1A)? In Sec. 3, we demonstrate that when invariance is present, it is primarily learned from the data independent of the augmentation strategy used, with the notable exception of translation invariance, which is enhanced by standard data augmentation. We also found that architectural bias has a minimal impact on invariance to the majority of transformations.

Which transformations account for intra-class variations in ImageNet? How do they relate to the models' invariance properties discovered in Sec. 3? In Sec. 4, we show that appearance transformations, often absent from standard augmentations, account for intra-class changes in factors of variation in ImageNet. We found that training enhances a model's natural invariance to transformations that account for ImageNet variations (including appearance transformations), and decreases invariance to transformations that do not seem to affect factors of variation. We also found that factors of variation are unique per class, despite common data augmentations applying the same transformations across all classes.
Our results demonstrate that the relationship among architectural inductive biases, the data itself, the augmentations used, and invariance is often more complicated than it may first appear, even when the relationship seems intuitive (such as for convolution and translation invariance). Furthermore, invariance generally stems from the data itself and aligns with the data factors of variation. By understanding both which invariances are desirable and how best to encourage them, we hope to guide future research into building more robust and generalizable models for large-scale, realistic applications. Code to reproduce our results is in the supplementary material.
2 Decomposing standard data augmentation techniques
Data augmentation improves performance and generalization by transforming inputs to increase the amount of training data and its variations (Niyogi98incorporatingprior). Transformations typically considered include taking a crop of random size and location (random resized crop), horizontal flipping, and color jittering (He2015; Touvron2019; Simonyan15). Here, we focus on random resized crop (denoted RandomResizedCrop; RRC), which is commonly used for training ResNets¹.

¹We follow the procedure of RandomResizedCrop as implemented in the PyTorch library (PyTorch): https://pytorch.org/vision/stable/_modules/torchvision/transforms/transforms.html.

For an image of width W and height H, RandomResizedCrop (1) samples a scale factor s from a uniform distribution and an aspect ratio, (2) takes a square crop whose area is s·W·H in any part of the image, and (3) resizes the crop, typically to 224×224 for ImageNet. Thus, the area of an object selected by the crop is randomly scaled proportionally to 1/s, which encourages the model to be scale invariant. The crop is also taken at any location of the image within its boundaries, which is equivalent to applying a translation whose parameters depend on the percentage of the area chosen for the crop. However, the way translation and scale interact remains obscure. In this section, we contrast the roles of translation and scale in RRC and analyze the impact of the parameters used to determine these augmentations.

2.1 The gain of RandomResizedCrop is largely driven by translation rather than scale
To study the role of both the scaling and translation steps, we separate RandomResizedCrop into two component data augmentations:


RandomSizeCenterCrop takes a crop of random size, always at the center of the image. The distributions for scale and aspect ratio are the same as those used in RandomResizedCrop. This augmentation impacts scale, but not translation.

FixedSizeRandomCrop takes a crop of fixed size (224×224) at any location of the image (the image is first resized to 256 on its shorter side). This augmentation impacts translation, but not scale.
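The two component augmentations can be sketched as follows. This is a minimal PIL-based illustration, not the authors' code; the 224/256 crop and resize values and the scale range are the standard ImageNet defaults assumed here.

```python
import random
from PIL import Image

def random_size_center_crop(img, size=224, scale=(0.08, 1.0)):
    """Square crop of random area fraction, always centered
    (impacts scale, but not translation)."""
    w, h = img.size
    area = random.uniform(*scale) * w * h
    side = min(int(round(area ** 0.5)), w, h)  # keep crop inside the image
    left, top = (w - side) // 2, (h - side) // 2
    crop = img.crop((left, top, left + side, top + side))
    return crop.resize((size, size))

def fixed_size_random_crop(img, size=224, resize=256):
    """Fixed-size crop at a random location
    (impacts translation, but not scale)."""
    w, h = img.size
    s = resize / min(w, h)                      # shorter side -> `resize`
    img = img.resize((round(w * s), round(h * s)))
    w, h = img.size
    left = random.randint(0, w - size)
    top = random.randint(0, h - size)
    return img.crop((left, top, left + size, top + size))
```

Both functions return a 224×224 image, mirroring the evaluation crop size used throughout the paper.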


As with RRC, both of these transformations can remove information from the image (for example, translation can shift a portion of the image out of the frame whereas zooming in will remove the edges of the image), but neither augmentation can fully reproduce the effect of RRC by itself.
We train a ResNet18 on ImageNet and report results on the validation set, as is commonly done (e.g. in Touvron2019), since the labelled test set is not publicly available. FixedSizeCenterCrop corresponds to what is usually done for augmenting validation/test images, i.e. resizing the image to 256 on its shorter side and taking a center crop of size 224×224. We apply FixedSizeCenterCrop during all evaluation steps and refer to it as "no augmentation". Training details are in Appendix A.
RRC combines scale and translation in a precise manner. Table 0(a) shows that RandomResizedCrop performs best in top-1 accuracy. Augmenting by taking a crop of random size (RandomSizeCenterCrop), a proxy for scale invariance of the center object, performs on par with FixedSizeRandomCrop, a proxy for invariance to translation, and both bring a substantial improvement compared to no augmentation (67.9% vs. 63.5%). However, neither is sufficient to fully recapture the performance of RRC. To further test whether RRC amounts to translation and scale, we augment the image by translating it by at most 30% in width and height, followed by taking a center crop of random size (denoted T.(30%) + RandomSizeCenterCrop). This method improves top-1 accuracy over RandomSizeCenterCrop and FixedSizeRandomCrop and almost matches RRC, though a small gap remains, showing that scaling and translation interact in RRC in a precise manner that is difficult to reproduce when the two transformations are applied sequentially.
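The translation-only augmentation T.(30%) (a resize, a random shift of at most 30% of each dimension, then a center crop) might be implemented as follows. This PIL sketch, its 224/256 size values, and its border handling (zero padding) are our assumptions, not the paper's code.

```python
import random
from PIL import Image

def translate_center_crop(img, size=224, resize=256, max_shift=0.30):
    """T.(30%): resize the shorter side, translate by up to +/-30% of each
    dimension, then take a center crop. Implemented as an offset crop;
    PIL fills regions outside the image with black when the crop box
    exceeds the image bounds."""
    w, h = img.size
    s = resize / min(w, h)
    img = img.resize((round(w * s), round(h * s)))
    w, h = img.size
    dx = round(random.uniform(-max_shift, max_shift) * w)
    dy = round(random.uniform(-max_shift, max_shift) * h)
    left = (w - size) // 2 + dx   # center crop, shifted by the translation
    top = (h - size) // 2 + dy
    return img.crop((left, top, left + size, top + size))
```

Composing this with `random_size_center_crop`-style scaling would give the T.(30%) + RandomSizeCenterCrop variant discussed above.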
RRC's performance is driven by translation. FixedSizeRandomCrop impacts translation, but its behavior is contrived: for example, content in the top left corner of an image will never end up in the bottom left corner of a crop. Thus, to further disambiguate the role of translation versus scale, we also experiment with T.(30%) only: we resize the image to 256 on its shorter side, apply a random translation of at most 30%, and take a center crop of size 224×224. The gain in performance from T.(30%) compared to no augmentation is surprising given the (approximate) translation invariance built into convolutional architectures such as ResNets. Furthermore, T.(30%) performs almost as well as T.(30%) + RandomSizeCenterCrop, which includes scale augmentation as well. Thus, adding the change in scale to the translation does not further improve performance, which was not the case when comparing FixedSizeRandomCrop and RandomSizeCenterCrop. We compare in Appendix B the validation images that become correctly classified (compared to no augmentation) when using FixedSizeRandomCrop, T.(30%), and T.(30%) + RandomSizeCenterCrop, but no clear pattern emerged.

2.2 Trade-off between variance and magnitude of augmentation
What role does the distribution over augmentations in RandomResizedCrop play? The default range for the scale factor s is (0.08, 1.0), so a crop can cover as little as 8% of the image area, increasing the apparent size of an object in the crop by up to a factor of 1/0.08 = 12.5 in area. Does only the range of augmentation magnitudes matter? What if we change the shape of this distribution, for the same range of values?
To test this, we modified RandomResizedCrop to use a Beta distribution over the standard interval for the scale factor s. Fixing one shape parameter, we vary the other, which changes the shape of the distribution. A shape value of 1 recovers the uniform distribution, while smaller values lead to heavily sampling values of s near 1 (and vice versa; see Appendix Fig. 5 for visualizations). Table 0(b) shows that just varying the shape of the distribution results in a drop in performance. To explain this drop, we examine the discrepancy between average apparent object sizes during training and evaluation, as in Touvron2019. We note that smaller shape values reduce the discrepancy between apparent object sizes during training and evaluation (see Appendix C for the mean of s). However, very small values (e.g., 0.1) also decrease performance, as they do not encourage scale invariance: they sample s near 1 (no scale change) with very little variance. Thus, we observe a trade-off between the variability needed to induce invariance and the consistency between training and evaluation object sizes.
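A minimal sketch of the modified scale sampling. The exact Beta parameterization used in the paper is our assumption; here we fix the first shape parameter to 1 and vary the second, so that a value of 1 recovers the uniform distribution and smaller values concentrate mass near s = 1 (crops close to the full image).

```python
import random

def sample_scale(shape, low=0.08, high=1.0):
    """Sample the RRC scale factor s from a Beta(1, shape) distribution
    stretched to [low, high].

    shape = 1   -> uniform on [low, high] (the RandomResizedCrop default)
    shape < 1   -> heavily samples s near `high` (little scale change)
    shape > 1   -> heavily samples s near `low` (aggressive zoom-in)
    """
    return low + (high - low) * random.betavariate(1.0, shape)
```

For example, `sample_scale(0.2)` mostly returns values close to 1, which in the paper's terms reduces the train/eval object-size discrepancy but also removes the variability that encourages scale invariance.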
3 Invariance across architectures and augmentations
So far, we have measured the impact of decomposed augmentations on model performance, but how do these augmentations impact invariance to these and other transformations? To what extent do other elements, such as architectural inductive bias and learning, contribute to these invariances? And finally, how do these invariances differ across categories of transformations? In this section, we address these questions by defining a metric for invariance and evaluating this metric for a number of transformations across combinations of architectures, augmentations, and training.
3.1 Measuring invariance
A model is considered invariant to a transformation t with a specific magnitude if applying t leaves the output unchanged. We choose to measure invariance via the cosine distance d between the embeddings of a sample x and its transformed version t(x), relative to a baseline distance d̄_t:

    S_t(x) = d(φ(x), φ(t(x))) / d̄_t    (1)

where φ generates the embedding from the penultimate layer of a ResNet18 or a vision transformer trained on ImageNet (dosovitskiy2020image; rw2019timm); see Appendix D for training details. The baseline d̄_t is the mean embedding distance across different samples under the same transformation, to account for the effect different transformations (and magnitudes) may have on the distribution of embeddings. The closer S_t(x) is to 0, the more invariant the model is to t. We report the distribution of S_t across pairs.
To measure how invariance changes as transformations intensify, we report invariance as a function of transformation magnitude, with magnitude 0 meaning no transformation and magnitude 9 corresponding to a large transformation (see Appendix Fig. 6 for examples). We expect invariance to decrease as transformation magnitude increases for the majority of settings.
3.2 Invariance to geometric and appearance transformations
To understand the extent and source of invariance, we measured invariance to common appearance and geometric transformations for both ResNet18 and ViT models. First, we found that models indeed featured invariance to a number of common transformations, including translation and scale (Fig. 2). Examining the impact of architecture, we observed that for translation, both ResNet18 and ViT models learned to be invariant, with ViT models consistently achieving slightly higher translation invariance than ResNet18. Surprisingly, untrained ViT models also featured stronger invariance to translation when compared to untrained ResNet18 (Fig. 2a, b, compare orange with light blue), suggesting that despite the convolutional inductive bias present in ResNet18, translation invariance is more prominent in ViT models.
We next examined the impact of training on invariance. While training consistently increased invariance to translation and to appearance-based transformations (Fig. 2a,b and additional figures in Appendix Fig. 7), training resulted in no change in average invariance to scale (zooming in) but reduced the variability of this invariance (Fig. 2). In contrast, training reduced invariance to rotation and shear (Fig. 2d and 7). Finally, we examined the role of augmentation in learning invariance. Consistent with our findings in Sec. 2.1, we found standard augmentation improves translation invariance (albeit only slightly; compare dark and medium blue in Fig. 2a,b). Surprisingly, it barely increases scale invariance and has equivocal effects on other transformations.
What role does the architectural inductive bias in ResNets play? To answer this, we measure equivariance to assess whether models encode predictable responses to transformations. We evaluate equivariance by examining the alignment of embedding responses to transformations. We find equivariance to translation for untrained ResNet18 that is absent for ViT, highlighting the architectural inductive bias of CNNs to translation. Regardless, we observe trained ViT and ResNet18 are able to learn invariance to both geometric and appearance transformations. Does invariance correspond to actual factors of variation in ImageNet?
4 Characterizing factors of variation with similarity search
In the previous section, we assessed whether ResNet18 and ViT are invariant to a set of transformations, and explored to what extent learning, data augmentation and architectural inductive biases impact this invariance. But do learned invariances correspond to the transformations that actually affect factors of variation in the data? What are the transformations that explain variations in ImageNet? In this section, we design a metric to answer these questions by comparing how well each transformation allows us to travel from one image to another image with the same label, thus capturing a factor of variation. Finally, we relate our findings to models’ invariances from the previous section.
Characterizing the factors of variation present in natural images is challenging since we do not have access to a generative model of these images. Here, we introduce a metric to determine these transformations based on a simple idea: factors of variation in the data should be able to explain the differences between images of the same class. For example, suppose one of the primary factors of variation in a dataset of animals is pose. In this case, by modifying the pose of one image of a dog, we should be able to match another image of a dog with a different pose. Concretely, we measure the change in similarity a transformation brings to image pairs. For an image pair (x1, x2) from the same class and a transformation t, we measure the percent similarity change as

    Δsim(x1, x2, t) = (sim(φ(t(x1)), φ(x2)) - sim(φ(x1), φ(x2))) / sim(φ(x1), φ(x2))    (2)

where sim measures cosine similarity (see Appendix E). To control for any effect of data augmentation, we take φ to be a ResNet18 model up to the penultimate layer, trained without data augmentation², and report values on training image pairs. A higher similarity among pairs implies a stronger correspondence between the transformation and factors of variation. We select a relevant pool of transformations using AutoAugment, an automated augmentation search procedure across several image datasets (cubuk2019autoaugment). The set of transformations encompasses geometric and appearance transformations of varying magnitudes³. For each transformation t, we report Δsim averaged over image pairs.

²Note that if φ were fully invariant to t, then Δsim would be zero. We thus chose a φ trained without augmentation to reduce translation invariance, and confirmed in Sec. 3 that full invariance is not achieved.
³See Appendix 6 for the full list of possible transformations.

4.1 How do transformations affect similarity of pairs?
Are there any transformations which consistently increase the similarity between image pairs across all classes? In Fig. 3A we show the distribution of average similarity changes across each transformation. We observe no transformation increases average similarity of image pairs across all classes, including geometric and appearance transformations. We find the same pattern holds whether a ResNet18 is trained with or without standard augmentations (RRC + horizontal flips). This result suggests that although standard approaches to data augmentation apply the same transformation distribution to all classes, no single transformation consistently improves similarity (see Appendix E).
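The per-pair metric of Eq. (2) can be sketched as follows. Here `phi` stands in for the penultimate-layer embedding model (any callable in this sketch), and reporting the change as a percentage is our assumption.

```python
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def percent_similarity_change(phi, x1, x2, t):
    """Eq. (2): relative change in embedding similarity of a same-class
    pair (x1, x2) when transforming x1 with t. Positive values mean the
    transformation moves x1 closer to x2 in embedding space, i.e. t
    captures a factor of variation separating the pair."""
    base = cosine_sim(phi(x1), phi(x2))
    transformed = cosine_sim(phi(t(x1)), phi(x2))
    return 100.0 * (transformed - base) / abs(base)
```

Averaging this quantity over many same-class pairs, per transformation and per class, gives the rankings analyzed in this section.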
Does using combinations of transformations increase similarity?
To answer this, we repeat the analysis using subpolicies, which combine two transformations of varying magnitudes. We found that while subpolicies can help or hurt by a larger margin as they apply multiple transformations, no subpolicy increases average similarity across all classes (Fig. 8).
Training augmentations dampen pair similarity changes.
Similar to our earlier results, we observed that training with augmentations induces more invariance relative to training without augmentations. Though all transformations consistently decrease similarity for models trained both with and without augmentation, models trained with augmentation exhibited both a smaller decrease in similarity and lower variance (Fig. 3A). Training with augmentation also reduced the decrease in similarity for transformations beyond simply scale and translation (Appendix Fig. 9), suggesting that these augmentations impact the response even to transformations not used during training.
4.2 Appearance transformations are more prevalent
Data augmentation and most of the literature on invariant models often rely only on geometric transformations such as translations, rotations, and scaling. However, it is not clear whether the factors of variation in natural images are primarily geometric. If this were the case, we would expect the top transformations from our similarity search to be geometric rather than appearance-based. In contrast, we found that appearance transformations accounted for more variation in ImageNet than geometric transformations, consistent with recent work (cubuk2019autoaugment): the majority of the top transforms were appearance-based rather than geometric. We confirmed this difference is not due to a sampling bias by ensuring an approximately even split between geometric and appearance transformations in the pool. In fact, if we isolate geometric transformations, we find that for many classes the top transformation is the identity, suggesting geometric transformations are worse when applied to an entire class than no transformation at all. We find a similar pattern among the top transforms per class for a ResNet18 trained with standard augmentations: for the majority of classes, the top transformation alters appearance, not geometry. In Appendix E.1 we also examine local variation of foregrounds under translation and scale.
4.3 Are factors of variation specific to each class?
In Sec. 4.1, we showed no transformation (or subpolicy) consistently increased similarity across classes. However, the factors of variation and consequently, the optimal transformation, may be different for different classes, especially those which are highly different. Can we instead consistently increase similarity if we allow flexibility for transformations to be class specific?
To test this hypothesis, we examined the top transformation for each class. In contrast to the global result, we found that the top transformation per class consistently increased the average similarity for all classes (Fig. 3B), compared to effectively no change for the top transformation across all classes. Similar to Hauberg16, who learn per-class transformations for data augmentation that boost classification performance on MNIST (lecun2010mnist), we applied per-class data augmentation variants on ImageNet, but observed no significant boost in classification performance. We leave further applications of per-class augmentation to future work.

Data augmentation is not beneficial for all classes.
Since the optimal transformation differs across classes, do standard augmentations benefit all classes or only some? To test this, we examined the impact of RRC on the performance of individual ImageNet classes. Interestingly, we found that, on average⁴, a number of classes have lower performance when using RRC. Critically, some classes are consistently hurt by the use of RRC, with a substantial drop in top-1 accuracy. We systematically study these classes in Appendix D.1, but no clear pattern emerged.

⁴Computed over 25 pairwise comparisons of 5 runs with both augmentations.
4.4 Training increases invariance for factors of variation
In Sec. 3, we showed that training increases invariance to a number of transformations, independent of architectural bias and augmentation and in Sec. 4.1, we used similarity search to characterize the transformations present in natural images. However, do the invariances learned over training correspond to the factors of variation present in natural images?
To test this, we asked whether the transformations to which networks learn to be invariant correspond to the transformations that are highly ranked in similarity search. We found that transformations exhibiting increased invariance over training were substantially more likely to be ranked in the top 5 transformations per class than transformations exhibiting decreased or minimal change in invariance over training (Fig. 3C). This result demonstrates that training increases invariance to the factors of variation present in real data, regardless of whether an inductive bias or data augmentation is specifically designed to encourage that invariance.
4.5 Characterizing factors of variation through pair similarity
We have shown that there exist factors of variation which consistently increase similarity for a given class, but it remains unclear why a particular factor might impact a particular class. Here, we investigate whether related classes feature related factors of variation.
One prominent pattern which emerged was among textile-like classes such as velvet, wool, handkerchief, and envelope. When considering single transformations, rotation is the top transformation for these classes; for subpolicies, it is rotation plus an appearance transformation (such as color or posterize) (see Appendix E.2 for a full list). The relationship between rotation and textiles makes intuitive sense: fabrics generally do not have a canonical orientation and can appear in many different colors.
To test this systematically, we measured pairwise class similarity using WordNet (wordnet) and compared it to the similarity between the top transforms for each class: we computed class similarities from the most specific common ancestor in the WordNet tree and compared them against the Spearman rank correlation of transformation types. We found that while dissimilar classes often have similar transformations, similar classes consistently exhibit more similar transformations (Fig. 4; Appendix E.3).
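A minimal sketch of the two ingredients of this comparison: the depth of the most specific common ancestor as a class-similarity proxy, and a Spearman rank correlation between per-class transformation rankings. The toy parent-pointer hierarchy and class names below are hypothetical stand-ins for WordNet, and this simple rank correlation does not handle ties.

```python
import numpy as np

def lca_depth(tree, a, b):
    """Depth of the most specific common ancestor of classes a and b in a
    parent-pointer tree (root has depth 0). Deeper ancestors mean more
    similar classes."""
    def ancestors(n):
        path = [n]
        while n in tree:
            n = tree[n]
            path.append(n)
        return path
    pa, pb = ancestors(a), set(ancestors(b))
    lca = next(n for n in pa if n in pb)  # first shared node on a's path
    return len(ancestors(lca)) - 1

def spearman_rank(x, y):
    """Spearman correlation between two score vectors (e.g. per-class
    transformation rankings). Ties are not handled in this sketch."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

In the paper's setting, WordNet supplies the hierarchy and the vectors x, y would be the similarity-change scores of each transformation for two classes.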
5 Related work
Data augmentation approaches.
Standard data augmentations often amount to taking a crop of random size and location, horizontal flipping, and color jittering (He2015; Touvron2019; Simonyan15). In self-supervised learning, NEURIPS2020_70feb62b boost performance by using multiple crops of the same image. Hauberg16 learn per-class augmentations and improve performance on the MNIST dataset (lecun2010mnist). AutoAugment is a reinforcement-learning-based technique that discovers the data augmentations that most aid downstream performance (cubuk2019autoaugment). antoniou2018data train a model based on Generative Adversarial Networks (GANs; NIPS2014_5ca3e9b1) to generate new training samples. Wilk2018 follow a Bayesian approach and integrate invariance into the prior. Recent works aim to automatically discover symmetries in a dataset (Hashimoto2017) and enforce equivariance or invariance to them (benton2020learning; zhou_metalearning_2020; dehmamy2021lie). While these methods are promising, they have mostly been applied to synthetic datasets or augmented versions of real datasets. Their application to a real dataset such as ImageNet is not straightforward: when we applied the Augerino model (benton2020learning) to ImageNet, we found it struggled to discover effective augmentations composed of multiple transformations (see Appendix G).

Consistency of neural architectures.
Zhang2019 show that invariance to translation is lost in neural networks and propose a solution based on anti-aliasing by low-pass filtering. Kauderer2018 study the source of CNNs' translation invariance on a translated MNIST dataset (lecun2010mnist) using translation-sensitivity maps based on the Euclidean distance of embeddings, and find that data augmentation has a bigger effect on translation invariance than architectural choices. Very recently, Myburgh2021 replace Euclidean distance with cosine similarity and find that fully connected layers contribute more to translation invariance than convolutional ones. lenc18understanding study the equivariance, invariance, and equivalence of different convolutional architectures with respect to geometric transformations. Here, we study invariance on a larger set of transformations with varying magnitudes and compare ResNet18 and ViT. Touvron2019 and Engstrom19 explore the specificities of standard data augmentations, and Engstrom19 found that a model augmented at training time with rotations and translations still fails on augmented test images. We differ from these works by grounding invariance in the natural factors of variation of the data, which we try to characterize. To the best of our knowledge, the link between invariance and the data factors of variation has not yet been studied on a large-scale dataset of real images.
6 Discussion
In this work, we explored the source of invariance in both convolutional and vision transformer architectures trained for classification on ImageNet, and how these invariances relate to the factors of variation of ImageNet. We compared the impact of data augmentation, architectural bias, and the data itself on the models' invariances. We observed that RandomResizedCrop relies on a precise combination of translation and scale that is difficult to reproduce and that, surprisingly, augmenting with translation alone recaptures most of the improvement, despite the (approximate) invariance to translation built into convolutional architectures. By analyzing the source of invariance in ResNet18 and ViT, we demonstrated that invariance generally stems from the data itself rather than from architectural bias, and is only slightly increased by data augmentation. Finally, we showed that the transformations that explain the variations in ImageNet are class-specific and mostly appearance-based. Interestingly, we found that invariance and factors of variation align: training enhances a model's natural invariance to transformations that account for ImageNet variations, while it has the opposite effect for changes that do not affect factors of variation.
Limitations and future work
We provide an analysis of the invariance properties of models using a specific set of metrics based on model performance and on similarities of input embeddings. With these metrics, some of our conclusions are shared with existing works, but a different set of metrics could potentially bring more insight. Additionally, we only experiment on ImageNet; it would be interesting to perform the same type of analysis on a wider range of datasets and data types. Does a handful of transformations describe the variations of most standard image datasets? Our study sheds light on ImageNet's factors of variation, but some conclusions remain obscure, such as the role of scaling. This emphasizes the difficulty of performing a systematic study of real datasets.
Finally, our findings spark further exploration. Could tailoring augmentations per class or introducing appearancebased augmentations improve performance?
Potential negative societal impacts.
While our work is concerned with robustness of vision models which can have various societal impacts, our work is primarily analytical and we do not propose a new model. Hence, we do not foresee any potential negative societal impacts of our findings.
References
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Did you describe the limitations of your work? See the Discussion section.

Did you discuss any potential negative societal impacts of your work? See the Discussion section.

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? We did not include theoretical results.

Did you include complete proofs of all theoretical results? We did not include theoretical results.


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Our code to reproduce the main experiment is included in the supplementary materials.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? We report standard error of the mean (SEM) for all experiments.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Appendix A.


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? We cite all assets and existing code / pretrained models we used (ImageNet dataset, WordNet database, U2Net, pretrained ViT, Augerino library, AutoAugment).

Did you mention the license of the assets? We used existing assets whose licenses can be found in the references we cite. While ImageNet does not own the copyright of the images, it allows non-commercial use of the data under the terms of access written at https://imagenet.org/download.php.

Did you include any new assets either in the supplemental material or as a URL? While we don’t provide any new assets such as datasets, we provide code to reproduce our main results in the Supplementary.

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? We use the ImageNet dataset as is and reference it.

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? We use the ImageNet dataset as is and reference it.


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable? We did not conduct research with human subjects or use crowdsourcing.

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? We did not conduct research with human subjects or use crowdsourcing.

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? We did not conduct research with human subjects or use crowdsourcing.

Appendix A Training details
Training a regular ResNet18
The experiments of Sec. 2.1 are conducted using 1 seed to cross-validate between 3 learning rates and 3 weight decay parameters. Models are trained with Stochastic Gradient Descent with momentum equal to 0.9 [journals/nn/Qian99] on all parameters. We use a learning rate annealing scheme, decreasing the learning rate by a factor of 0.1 every 30 epochs. We train all models for 150 epochs. Then, we select the best learning rate and weight decay for each method and run 5 different seeds to report mean and standard deviation. We use the validation set of ImageNet to perform cross-validation and report performance on it. Our code is a modification of the PyTorch example found at
https://github.com/pytorch/examples/tree/master/imagenet. Note that we also tried one seed with cross-validation of the hyperparameters of T.(50%) + RandomSizeCenterCrop, i.e. with 50% translation; this gives poorer performance than T.(30%) (top-1 accuracy ).
Training the Augerino model
In Appendix G we train the Augerino method on top of the ResNet18 architecture. We employ Augerino on top of the FixedSizeCenterCrop preprocessing, so as not to induce any invariance through data augmentation. The experiments reported in Appendix G are conducted using 5 seeds to cross-validate between 7 regularization values (). We use the best learning rate and weight decay values of the ResNet18 trained with FixedSizeCenterCrop (learning rate and weight decay ). The parameters of the distribution bounds, specific to Augerino, are trained with a learning rate of and no weight decay, as in the original Augerino code (https://github.com/gbenton/learninginvariances). Models are trained with Stochastic Gradient Descent with momentum equal to 0.9 [journals/nn/Qian99] on all parameters. We use a learning rate annealing scheme, decreasing both the learning rate of the ResNet18 and that of the Augerino bound parameters by a factor of 0.1 every 30 epochs. We train all models for 150 epochs. During training, 1 copy of the image transformed with the sampled transformation values is used; during validation and test, 4 transformed versions of the image are used, as in the original Augerino code. We use the validation set of ImageNet to perform cross-validation and report performance on it.
Total amount of compute, type of resources used
Our main code runs with the following configuration: pytorch 1.8.1+cu111, torchvision 0.9.1+cu111, python 3.9.4. The full list of packages used is released with our code. We use DistributedDataParallel and ran the experiments on 4 GPUs of 480GB memory (NVIDIA GPUs of types P100 and V100) and 20 CPUs, on an internal cluster. With this setting, training a ResNet18 on ImageNet with RandomResizedCrop takes approximately 6 mins, while other augmentations (e.g. T.(30%) + RandomSizeCenterCrop) can take up to approximately 20 mins, depending on the type of GPU used.
For the experiments of Sec. G we use pytorch 1.7.1+cu110, torchvision 0.8.2+cu110, python 3.9.2 and torchdiffeq 0.2.1 as the Augerino original code relies on torchdiffeq and we could not run torchdiffeq with pytorch 1.8.1 (known issue, see https://github.com/rtqichen/torchdiffeq/issues/152).
Appendix B Comparing the samples helped by translation and/or scaling
In Section 2.1, we find that T.(30%) performs on par with T.(30%) + RandomSizeCenterCrop, and both outperform FixedSizeRandomCrop. To study this further, we compute, for each of the three methods, the lists of samples that are incorrectly classified under no augmentation but correctly classified by that method. As we trained seeds per method, we have 25 lists of “helped samples” for each method. We then compare methods using the intersection-over-union (IoU) of their respective lists. The IoUs of each method’s lists with itself are:

T.(30%)/T.(30%):

T.(30%) + RandomSizeCenterCrop/T.(30%) + RandomSizeCenterCrop:

FixedSizeRandomCrop/FixedSizeRandomCrop:
while the crossmethods lists IoU are:

FixedSizeRandomCrop/T.(30%) + RandomSizeCenterCrop:

FixedSizeRandomCrop/T.(30%):

T.(30%)+RandomSizeCenterCrop/T.(30%):
Thus, we see no consistent pattern in either the intra-method or cross-method IoUs.
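This comparison can be sketched as follows, assuming each method’s “helped samples” lists are available as Python sets of validation-image indices (the helper names are illustrative, not from our released code):

```python
from itertools import combinations, product

def iou(a, b):
    """Intersection-over-union of two sets of sample indices."""
    return len(a & b) / len(a | b) if a | b else 0.0

def intra_method_ious(lists):
    """IoUs between every unordered pair of one method's 'helped samples' lists."""
    return [iou(a, b) for a, b in combinations(lists, 2)]

def cross_method_ious(lists_a, lists_b):
    """IoUs between every pair of lists drawn from two different methods."""
    return [iou(a, b) for a, b in product(lists_a, lists_b)]
```

A consistent set of helped samples would show intra-method IoUs well above the cross-method ones; we observe no such separation.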
Appendix C Varying the distribution over the scale
For the experiment in Sec. 2.2, we used the best learning rate and weight decay ( and ) found by cross-validation for RandomResizedCrop. We run training seeds for each distribution.
The expected value of the nonstandard Beta distribution for is . As explained in Touvron2019, using a random scale induces a discrepancy between the average apparent object sizes during training and evaluation: their ratio is proportional to the expected value of . Thus, the smaller , the smaller the expected value of , which reduces this discrepancy.
However, too small values of (e.g. ) also yield lower performance. Recall that RandomResizedCrop scales an object in the selected crop by a factor proportional to . While we do not have a closed-form expression for the variance of (the inverse of the nonstandard Beta over ), we computed estimates of it using samples drawn from .^5

^5 We also checked that the estimated values were sensible by computing the probability density function (pdf) of from the pdf of via the change-of-variables formula, and computing the variance with Wolfram Alpha [wolfram].

Table 2: Estimated variance of the inverse scale for different values of the Beta parameter.

Parameter value  Estimated variance
0.1  0.823
0.5  3.193
1 (RRC)  4.962
2  6.809
3  7.626
10  7.317
Table 2 shows that the variance for is 3 and 5 times smaller than the variances for and respectively. Thus, while all three values give distributions that peak close to , the value gives a smaller variance and thus encourages scale invariance less. In our view, this explains the poorer performance of .
Appendix D Invariance and Equivariance
Experimental details
We use a pretrained ViT (L/16) and an untrained ViT from rw2019timm, using the `forward_features` method to generate embeddings; see timm’s documentation for details: https://rwightman.github.io/pytorchimagemodels/feature_extraction/. The trained ViT achieves Top-1 accuracy on ImageNet. For ResNet18, we use the pretrained ResNet18 available from PyTorch (https://pytorch.org/hub/pytorch_vision_resnet/). The trained ResNet18 achieves Top-1 accuracy. For all our measures, we control for the difference in test-set versus train-set sizes by limiting the total number of embedding comparisons to pairs.
Equivariance
We also explore whether models are equivariant to transformations. A model is said to be equivariant if it responds predictably to a given transformation. Formally, a model f is equivariant to a transformation T of an input x if the model’s output transforms in a corresponding manner via an output transformation T′, i.e. f(T(x)) = T′(f(x)) for any x.
To disambiguate invariance from equivariance, we measure equivariance by examining whether embeddings respond predictably to a given transformation. To do so, we measure alignment among embedding differences: we first produce the embedding differences , then measure the pairwise alignment of these differences via cosine similarity. We compare against a baseline B in which we shuffle the rows in each column independently. We report , where are elements from the baseline. A higher value implies higher equivariance, with indicating no equivariance above the baseline. In Fig. 14, we find equivariance to translation for the untrained ResNet18 that is absent for ViT, highlighting the architectural inductive bias of CNNs towards translation. Although for some magnitudes we also observe equivariance to zooming out, we note this is likely due to zooming out introducing padding rather than true equivariance to scale changes. We also observe equivariance to appearance transformations such as posterize for ResNet18 that is also absent for ViT.
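A minimal sketch of this measurement, assuming the reported statistic is the mean pairwise cosine alignment of the embedding differences minus the same statistic for the column-shuffled baseline (the exact form of the comparison is elided above):

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def equivariance_score(emb, emb_t, seed=0):
    """Alignment of embedding differences d_i = emb_t[i] - emb[i], relative to a
    baseline in which each coordinate (column) is shuffled independently across
    examples. Positive values suggest equivariance above the baseline."""
    diffs = [[a - b for a, b in zip(et, e)] for et, e in zip(emb_t, emb)]

    def mean_pairwise(ds):
        sims = [cosine(ds[i], ds[j])
                for i in range(len(ds)) for j in range(i + 1, len(ds))]
        return sum(sims) / len(sims)

    rng = random.Random(seed)
    baseline = [row[:] for row in diffs]
    for c in range(len(diffs[0])):  # shuffle each column independently
        col = [row[c] for row in baseline]
        rng.shuffle(col)
        for r, v in enumerate(col):
            baseline[r][c] = v
    return mean_pairwise(diffs) - mean_pairwise(baseline)
```

The column shuffle preserves the marginal distribution of each coordinate while destroying cross-coordinate structure, so a constant-direction response to a transformation stands out against it.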
D.1 Classes consistently hurt by RandomResizedCrop
Table 3: Classes consistently hurt by RandomResizedCrop, the percentage of comparisons in which they are hurt, and the average loss in Top-1 accuracy.

Class  % of comparisons where hurt  Loss in Top-1 accuracy (%)
cassette player  100.0  22.00
maillot  100.0  21.20
palace  100.0  4.40
academic gown  92.0  13.57
missile  88.0  11.82
mashed potato  88.0  9.09
digital watch  84.0  7.52
barn spider  80.0  7.70
Indian elephant  80.0  7.20
miniskirt  80.0  8.00
pier  80.0  5.30
wool  80.0  6.80
ear  80.0  10.60
brain coral  72.0  3.89
crate  72.0  4.33
fountain pen  72.0  5.22
space bar  72.0  6.33
We compare the per-class Top-1 accuracy when training with RandomResizedCrop augmentation versus FixedSizeCenterCrop (i.e., no augmentation). On average, over 25 pairwise comparisons of 5 runs with each augmentation, of classes are hurt by the use of RandomResizedCrop. We list in Table 3 the classes that are hurt in more than of the 25 comparisons, together with the average decrease in Top-1 accuracy when it occurs. We do not see a pattern in these classes when reading their labels. We confirm the lack of pattern by computing the similarities of the classes listed in Table 3 using the most specific common ancestor in the WordNet [wordnet] tree. The similarity of the classes that are consistently hurt (17 classes) is ( if we include each class’s similarity with itself), while the similarity between the classes that are consistently hurt and those that are consistently helped ( of comparisons) is , and between the classes consistently hurt and everything else (neither helped nor hurt) is .
Appendix E Similarity Search
Experimental details
We use the same models trained with standard augmentations as for equivariance. For the no-augmentation ResNet18, we use the training procedure outlined in Appendix A. We sample pairs from each class and compute SimChange for each pair. We pool transformations from subpolicies discovered by AutoAugment on ImageNet and SVHN, plus additional rescaling for zooming in and out. We extend the implementation of AutoAugment provided in the DeepVoltaire library (https://github.com/DeepVoltaire/AutoAugment). Note that we disregard the learned probabilities from AutoAugment and instead apply each transformation independently for our similarity search analysis. For subpolicies, we apply each transformation in sequence. Transformations include ‘equalize’, ‘solarize’, ‘shearX’, ‘invert’, ‘translateY’, ‘shearY’, ‘color’, ‘rescale’, ‘autocontrast’, ‘rotate’, ‘posterize’, ‘contrast’, ‘sharpness’, ‘translateX’ with varying magnitudes. We illustrate the effect of transformations in Fig. 6.

No transformation consistently increases pair similarity
In Fig. 7 we show the distribution of the similarity change of each transformation over classes. While some transformations increase similarity among pairs for some outlier classes, the distributions are below or near 0 for all transformations. The top single transformation across all classes is posterize, which increases similarity by , implying no statistically significant increase. In contrast, we find that on average 6.5 of the top 10 transformations per class increase similarity by .

Geometric transformations
We also examine the distributions for transformations not in the standard augmentations by excluding all translations and rescales. In Fig. 9 we see that even when translation and scale are excluded, standard data augmentation drastically decreases the variation of similarity changes for other transformations not used during training.
E.1 Measuring local variation in foregrounds
To supplement our analysis of global transformations, we also directly measure local variation in ImageNet using foregrounds extracted with U2Net [Qin_2020] trained on DUTS [wang2017]. We measure the center coordinates and area of bounding boxes around the foreground object, relative to the image frame, using a threshold of to determine the bounding box. We measure foreground variation over all training images in ImageNet. We observe more variation in scale, with bounding-box areas ranging over of the image, than in translation, with boxes centered on average ( (SEM)).
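This per-image measurement can be sketched as follows; the row-major float-grid mask format is an assumption about the saliency output, and the 0.5 threshold is purely illustrative (the threshold value used in the text is elided above).

```python
def foreground_stats(mask, threshold=0.5):
    """Relative center (cx, cy) and relative area of the bounding box around
    mask pixels above `threshold`. `mask` is a row-major grid of floats in
    [0, 1], as produced by a saliency model such as U2-Net. Returns None if
    no pixel exceeds the threshold."""
    h, w = len(mask), len(mask[0])
    rows = [r for r in range(h) if any(v > threshold for v in mask[r])]
    cols = [c for c in range(w) if any(mask[r][c] > threshold for r in range(h))]
    if not rows or not cols:
        return None
    top, bottom, left, right = min(rows), max(rows), min(cols), max(cols)
    box_w, box_h = right - left + 1, bottom - top + 1
    cx = (left + right + 1) / 2 / w   # relative x of the box center
    cy = (top + bottom + 1) / 2 / h   # relative y of the box center
    return cx, cy, (box_w * box_h) / (w * h)
```

Aggregating cx, cy, and the area fraction over all training images gives the translation and scale statistics reported above.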
E.2 Textiles Weighted Boost
To account for both the size of the similarity increases and the proportion of images increased, we rank transformations by their weighted boost, defined as the average percent boost times the proportion of image pairs boosted. We examine the top 10 transformations by weighted boost and find that rotate, with magnitudes ranging from 3 to 9, is the top transformation in all 10 for the ResNet18 trained with standard augmentation. For rotation, the corresponding classes are velvet, handkerchief, envelope, and wool, with velvet appearing 4 times among the top 10. For subpolicies, we find that for both the ResNet18 trained with and without standard augmentations, rotation combined with a color transformation of varying magnitudes is the top transformation, again corresponding to velvet, wool, handkerchief, and envelope, with jigsaw puzzle as an additional class.
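A minimal sketch of this ranking, assuming the average runs over the boosted pairs only (the text above leaves this detail implicit):

```python
def weighted_boost(percent_changes):
    """Weighted boost of one transformation on one class: the average percent
    similarity boost over boosted pairs, times the proportion of pairs boosted."""
    boosted = [b for b in percent_changes if b > 0]
    if not boosted:
        return 0.0
    avg_boost = sum(boosted) / len(boosted)
    return avg_boost * (len(boosted) / len(percent_changes))

def rank_by_weighted_boost(results):
    """results maps a transformation key (e.g. ('rotate', 3)) to the list of
    percent similarity changes over image pairs; returns keys ranked best-first."""
    return sorted(results, key=lambda k: weighted_boost(results[k]), reverse=True)
```

This combines the magnitude and the frequency of the effect so that a transformation helping a few pairs enormously does not automatically outrank one helping most pairs moderately.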
E.3 WordNet Similarity Search
To study whether similar classes have similar factors of variation, we measured class similarity using the WordNet hierarchy. We compute similarity using several methods provided in the NLTK library (https://www.nltk.org/howto/wordnet.html), including the Wu-Palmer score, Leacock-Chodorow similarity, and path similarity, all of which compute similarity via the lowest common ancestor in the WordNet tree. To compare transformations, we computed the Spearman rank correlation of the top transformation in each class, ranked both by average percent similarity change and by proportion of image pairs boosted; we found no significant difference between the two. We compare class WordNet similarity against transformation ranks for all ImageNet classes.
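As an illustration of the lowest-common-ancestor idea, a minimal path-similarity implementation over a toy hypernym tree (the actual analysis uses NLTK's WordNet interface; Wu-Palmer and Leacock-Chodorow differ in normalization but rely on the same ancestor search):

```python
def path_similarity(node_a, node_b, parent):
    """1 / (shortest hypernym-path length + 1), mirroring NLTK's WordNet
    path_similarity. `parent` maps each node to its hypernym (None at the root)."""
    def ancestors(n):
        chain, d = {}, 0
        while n is not None:
            chain[n] = d      # distance from the start node up to ancestor n
            n, d = parent[n], d + 1
        return chain

    up_a, up_b = ancestors(node_a), ancestors(node_b)
    # shortest path goes through the lowest common ancestor
    dist = min(up_a[n] + up_b[n] for n in up_a if n in up_b)
    return 1.0 / (dist + 1)
```

For example, two siblings two edges apart score 1/3, while a node compared with itself scores 1.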
Appendix F Additional experiment: How does augmenting validation images affect accuracy?
Standard data augmentation methods are a proxy for geometric transformations such as translation and scale; do they actually bring invariance to these transformations? If so, the performance of a model trained with data augmentation should remain unchanged even when validation images are augmented. To answer this question, similarly to Engstrom19, we augment the images during evaluation to test for scale invariance. Indeed, robustness to augmentations is used as a generalization metric in Aithal2021.
Specifically, we preprocess the validation images with the regular validation preprocessing FixedSizeCenterCrop, and then augment the images by taking a RandomSizeCenterCrop (disabling the aspect-ratio change). This scales the object in the crop. Using RandomSizeCenterCrop, we vary , which specifies the lower bound of the uniform distribution . This effectively varies the maximal increase in size that can be applied, which is proportional to . We compare the models trained with RandomResizedCrop and with FixedSizeCenterCrop (no augmentation).
Specifically, we select values of the per-axis increase in size uniformly between and , and set . Thus we range from no augmentation () to scaling the image by a factor potentially as large as when . Note that for , , matching the lower bound of RandomResizedCrop’s range of . To match the evaluation procedure, we apply FixedSizeCenterCrop and then augment the crop. For each model, we evaluate the training seeds of the best hyperparameter setting, running each augmentation experiment with different test seeds. We compute performance averaged over the test seeds and report the mean over training seeds with the standard error of the mean.
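A sketch of this evaluation-time augmentation, assuming the sampled scale acts on area (as in RandomResizedCrop) so that the crop side scales with sqrt(s); the function returns only the crop box, with the subsequent resize back to the model's input size left out:

```python
import random

def random_size_center_crop_box(width, height, s_min, rng=None):
    """Sample an area scale s ~ U(s_min, 1) and return a centered crop box
    (left, top, right, bottom) covering fraction s of the image area, with
    the aspect ratio kept fixed (aspect-ratio jitter disabled)."""
    rng = rng or random
    s = rng.uniform(s_min, 1.0)
    cw, ch = round(width * s ** 0.5), round(height * s ** 0.5)
    left, top = (width - cw) // 2, (height - ch) // 2
    return left, top, left + cw, top + ch
```

Sweeping s_min from 1 down to the training-time lower bound reproduces the axis of Fig. 11, from no augmentation to the strongest evaluation-time zoom.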
Fig. 11 shows that accuracy on the validation set decreases as we increase the magnitude of augmentations on validation images. The strongest decrease is for the no-augmentation model, while the model trained with RandomResizedCrop is more robust. However, while the latter was trained with augmentations up to , its performance already decreases for the smallest values.^6 This suggests it may exhibit only partial invariance.

^6 Note that we apply FixedSizeCenterCrop, which resizes the image to on the shorter dimension, before RandomSizeCenterCrop. Thus, compared to what is done at training, there is an additional scaling of for the same value .
Appendix G Additional experiment: Augerino [benton2020learning] on ImageNet
We have shown that standard data augmentation methods rely on a precise combination of transformations and parameters and need to be hand-tuned. To overcome these issues, and potentially discover relevant factors of variation in the data, recent methods have been proposed to automatically discover symmetries present in a dataset [benton2020learning, zhou_metalearning_2020, dehmamy2021lie]. We assess the potential of a state-of-the-art model of this type, the Augerino model [benton2020learning], to tackle ImageNet.
G.1 The Augerino model and our modifications
Augerino method
Augerino is a method for automatic discovery of relevant equivariances and invariances from training data only, given a downstream task. Given a neural network parametrized by , Augerino creates a model approximately invariant to a group of transformations by averaging the outputs over a distribution over :
(3) 
Augerino considers the group of affine transformations in 2D, Aff(2), composed of 6 generators corresponding to translation in x, translation in y, rotation, uniform scaling, stretching, and shearing.^7

^7 While the original paper benton2020learning mentions scale in x, scale in y, and shearing, the generators of Aff(2) employed in the paper in fact correspond to uniform scaling, stretching, and shearing.
Instead of directly using a distribution over transformations in image space, the distribution is parametrized over the Lie algebra of . For insights on Lie groups and Lie algebras, we refer the interested reader to hall2015lie. Thus, specifies the bounds of a uniform distribution in the Lie algebra. It is 6-dimensional; each coordinate specifies the bounds of the distribution over the subgroup : . When the value sampled in the Lie algebra is , the corresponding transformation is the identity. The smaller , the smaller the range of transformations in that is used, and a Dirac distribution at always returns the identity transformation, i.e. no use of the transformation .
The value of is learned along with the parameters of the network , specifying which transformations are relevant for the task at hand. In the case of classification, the cross-entropy loss is linear and thus the expectation can be taken outside the loss.
(4) 
Furthermore, Augerino employs a negative regularization on , parametrized by , to encourage wider distributions. The resulting training objective is:
(5) 
where a larger pushes for a wider uniform distribution. Every time an image is fed to the model, Augerino (1) draws a sample from the uniform distribution on the Lie algebra of , (2) computes the transformation matrix in image space through the exponential map, (3) augments the image with the corresponding transformation, and (4) feeds the augmented image to the neural network to perform classification. See benton2020learning for more details on the model.
Regularizing by transformations
We noticed that the original code library of benton2020learning disables regularization if any of the coordinates of (one per transformation in Aff(2)) has reached a certain value. From our initial experiments, we understand this is to prevent underflow errors. However, we find it too strict if we want to learn multiple transformations at the same time, so we instead shut down regularization on each coordinate of individually when a specific value for that coordinate is reached.^8 This also allows us to control the regularization of each transformation separately. While benton2020learning found their model insensitive to the regularization strength, we find it to be a key parameter when applied to ImageNet. This is a first difficulty we faced when trying to tackle a real dataset with an equivariant model. While the original model always considers all possible transformations, we also modified it to consider any transformation separately, as well as any combination of transformations.

^8 If the value goes back below this threshold at the next iteration, regularization is enabled again.
G.2 Augerino on translation discovery
We study whether Augerino discovers translation as a relevant transformation to improve performance on ImageNet. We shut down regularization once the bound of the distribution (separately for the x- and y-axes) reaches a value corresponding to translation in image space. We cross-validate between multiple seeds and bound regularization parameters. As Augerino is also employed at validation time, we perform testing with different seeds. We also show in Table 5 the results with Augerino disabled at evaluation.
Table 4 shows that for different values of the regularization parameter (see Equation 5), different bounds are learned by the model. In Fig. 12 we compare the learned bounds of the distribution. The bounds saturate at a value corresponding in image space to , i.e. the point at which regularization is disabled (shown in dashed lines). This shows that, contrary to the experiments in benton2020learning, the regularization term in Equation 5 strongly impacts the learned bounds. The best performance is achieved for , with learned bounds corresponding to sampling a translation in on the x-axis and in on the y-axis. Augerino discovers translation as a relevant augmentation, and learns values that improve over the FixedSizeCenterCrop (no augmentation) method.^9

^9 We use Augerino on top of the FixedSizeCenterCrop preprocessing. For comparison, a model trained with T.(30%) and FixedSizeCenterCrop preprocessing achieves Top-1 accuracy.

Table 5 shows the performance of the model trained with Augerino for translation discovery when Augerino is not employed at evaluation time. We note the numbers are slightly higher (by up to for ) than those reported in Table 4.
Acc@1 SEM  

0.01  
0.1  
0.2  
0.4  
0.6  
0.8  
1.0 
G.3 Augerino on translation and scale
When we use Augerino on the scale-translation group, we want to control the parameters so that we can use specific shutdown values corresponding to translation and scaling, and apply translation before scaling as is done in the T.(30%) + RandomSizeCenterCrop method. Thus, when using the Augerino model to learn parameters for the scale-translation group, we made a few modifications to the original code. First, we explicitly compute translation before scaling. Second, Augerino’s code uses the affine_grid and grid_sample methods in PyTorch, where the former performs an inverse warping. That is, for a given scale , the inverse scaling is performed. This does not impact translation, which is performed in both directions (negative and positive). For scale, we shut down regularization when the value corresponds to , as the inverse scaling will be performed. Third, we want to sample scales corresponding to an increase in size (zooming in), in order to mimic the effect of RandomResizedCrop, which selects only a subset of the image; hence we take . A sampled value which is negative corresponds to a positive scaling by Augerino, given the inverse warping. We shut down the regularization parametrized by when translation has reached of the width/height, and scaling has reached (the object appears times larger).
Table 6 shows the performance of Augerino when trained on ImageNet with the possibility to learn translation and scale in conjunction. The best performance is achieved with , and we see in Fig. 13 that the model has learned to augment with x-translations only (the red marker being close to , and the green and blue markers being close to ). Interestingly, when Augerino is disabled at evaluation time, Table 5 shows that the Top-1 accuracy is much higher for large values of the regularization compared to Table 6, with the best performance at . This means that with a more aggressive augmentation, the use of Augerino hurts during inference and only slightly helps during training compared to the no-augmentation case (see Table 0(a) for FixedSizeCenterCrop). Still, as in the translation-only experiment, we note that the value of greatly impacts the results, and that the gain in performance compared to no augmentation is quite small. More importantly, the performance is lower than when using translation only (see Table 4), which the model could have fallen back on if scale were not a relevant transformation. For comparison, a model trained with T.(30%) + RandomSizeCenterCrop and FixedSizeCenterCrop preprocessing achieves Top-1 accuracy. We conclude that while Augerino is a promising model for the automatic discovery of relevant symmetries in data, it remains a challenge to apply such methods to a real, large-scale dataset such as ImageNet.
Acc@1 SEM  

0.01  
0.1  
0.2  
0.4  
0.6  
0.8  
1.0 