Convolutional neural networks (CNNs) define state-of-the-art performance in many computer vision tasks, such as image classification [krizhevsky2012imagenet], object detection [sermanet2013overfeat, girshick2014rich], and segmentation [girshick2014rich]. Although these tasks derive from problems solved by the human visual system, a number of recent results have shown that CNNs differ in intriguing ways from human vision, indicating fundamental deficiencies in our understanding of the workings of these models [szegedy2013intriguing, Nguyen_2015_CVPR, fawzi2015manitest, dodge2017study, richardwebster2018psyphy, geirhos2018generalisation, azulay2018deep, hendrycks2018benchmarking, hosseini2018semantic, alcorn2019strike, ilyas2019adversarial]. This paper focuses on one such result, namely that CNNs appear to make classifications based on superficial textural features [geirhos2018imagenet, baker2018deep] rather than on the shape information preferentially used by humans [landau1988importance, kucker2018reproducibility]. When presented with images with conflicting shape and texture information (e.g. elephant-textured knives), ImageNet-trained CNNs tend to classify these images according to their texture, whereas humans classify them according to shape [geirhos2018imagenet].
From the point of view of computer vision, texture bias is an important phenomenon for several reasons. First, it may be related to the vulnerability of CNNs to adversarial examples [szegedy2013intriguing], which may exploit features that are informative regarding the class label but undetectable to the human visual system [ilyas2019adversarial]. Second, a CNN preference for texture could indicate an inductive bias different from that of humans, a bias that could make it more difficult for models to learn human-relevant vision tasks in small-data regimes, and to generalize to distributions other than the one on which the model is trained.
In addition to these engineering considerations, texture bias raises important scientific questions. ImageNet-trained CNNs have emerged as the model of choice in neuroscience for modelling electrophysiological and neuroimaging data from primate visual cortex [yamins2014performance, khaligh2014deep, cichy2016comparison, ponce2019evolving, bashivan2019neural]. Evidence that CNNs are in fact preferentially driven by texture indicates a significant divergence from primate visual processing, in which shape bias is well documented [landau1988importance]. This mismatch raises an important puzzle for human-machine comparison studies. As we show in Section 6, even models specifically designed to match neural data exhibit a strong texture bias.
This paper explores the origins of texture bias in ImageNet-trained CNNs, looking at the effects of model architecture, dataset, task, and training procedure. We begin by probing the inductive biases present in CNN architectures. Do the texture-driven classifications documented by [geirhos2018imagenet] come from the model, as an inherent property of the CNN architecture or learning process, or from the data, as an accidentally useful regularity that CNNs are expressive enough to exploit? In Section 3, we examine the CNN inductive biases by studying the learning dynamics of two popular CNN architectures trained on three datasets of ambiguous images that could be classified according to either shape or texture. When given limited time or data, do models find it easier to learn shape or texture features? We find that across datasets and model architectures, shape information is learned at least as easily as texture information, a finding which argues against the idea that these models have a texture-favoring inductive bias.
In Section 4, we seek to tease apart the extent to which shape information is represented in a model from the extent to which it is used in a model’s classification decisions. Geirhos et al. showed that CNNs preferentially categorize objects according to texture, but this does not rule out the possibility that these models contain correct shape information that went unused. A fabric salesman, for example, might categorize an object as “velvet” while remaining fully aware that it is also an armchair. In a sign of cautious optimism for CNN shape representations, we show that it is possible to extract more shape information from a CNN’s later layers than is reflected in the model’s classifications. We also study how this information loss occurs as data flows through a network, and find that after the final convolutional layer, subsequent classification layers ablate progressively more shape information while preserving texture.
Geirhos et al. [geirhos2018imagenet] showed that texture bias can be mitigated by modifying the model’s training with texture-varying data augmentation. In Section 5, we ask whether the same effect can be achieved by modifying the model’s objective function. We show that although models trained on some self-supervised objectives are less texture-biased than supervised models, other self-supervised objectives lead to a stronger texture preference. Further, what improvements in shape bias we do observe are mostly driven by loss of texture accuracy rather than by increases in shape accuracy. In Section 6, we examine the shape bias of a wide range of modern ImageNet architectures. We find that architectures that perform better on ImageNet exhibit higher shape bias, but neither architectures designed to match the human visual system nor self-attention-based models are substantively different from ordinary CNNs.
In Section 7, we look beyond the roles of model, task, and dataset in shaping texture bias to examine the effect of training methodology. We show that random-crop preprocessing, a widely used form of data augmentation, increases models’ texture bias. We also explore the effects of learning rate and weight decay on shape bias, and show that higher learning rates tend to result in higher shape accuracy and shape bias.
2 Related work
Adversarial examples. Early work investigating discrepancies between human perception and classification behavior in neural networks focused on adversarial examples [biggio2013evasion, szegedy2013intriguing, Nguyen_2015_CVPR]. Ever since Szegedy et al. [szegedy2013intriguing] showed that small perturbations could dramatically affect neural networks’ predictions, many researchers have sought to characterize the source of this sensitivity and to defend against it. Adversarial perturbations are not entirely misaligned with human perception [elsayed2018adversarial, zhou2019humans], and reflect true features of the training distribution [ilyas2019adversarial]. Current state-of-the-art defenses on ImageNet are based on adversarial training [szegedy2013intriguing, goodfellow2014explaining, madry2018towards, xie2019feature] or randomized smoothing [pmlr-v97-cohen19c], and images generated by optimizing class confidence under these models are more perceptually recognizable to humans [santurkar2019, 2019arXiv190600945E, tsipras2018robustness, kaur2019perceptually]. However, models that are robust to adversarial examples generated with respect to the norm are not robust to other forms of imperceptible perturbations, such as shifting of pixels [xiao2018spatially, tramer2019adversarial].
Sensitivity of CNNs to non-shape features. Our work builds on recent evidence for bias for texture versus shape in neural networks based on classification behavior with ambiguous stimuli [geirhos2018imagenet, baker2018deep]. Other studies have shown that CNNs are sensitive to Fourier statistics of the training set [jo2017measuring, tsuzuku2019structural, yin2019fourier], as well as to a wide range of other image manipulations that have little effect on human judgments [fawzi2015manitest, dodge2017study, richardwebster2018psyphy, geirhos2018generalisation, azulay2018deep, hendrycks2018benchmarking, hosseini2018semantic, alcorn2019strike]. Moreover, CNNs are relatively insensitive to manipulations such as grid scrambling that make images nearly unrecognizable to humans [brendel2019approximating], and are far superior to humans at classifying ImageNet images where the foreground object has been removed [zhu2016object].
Similarity of human and CNN perceptual biases. Despite these differences, ImageNet-trained CNNs appear to share some perceptual biases and representational characteristics with humans. Previous studies found preferences for shape over color [ritter2017] and perceptual shape over physical shape [kubilius2016deep]. Euclidean distance in the representation space of CNN hidden layers correlates well with human perceptual similarity [johnson2016perceptual, zhang2018unreasonable], although, when used to generate perceptual distortions, simple biologically inspired models are a better match to human perception [berardino2017eigen]. CNN representations also provide an effective basis for modeling the activity of primate visual cortex [yamins2014performance, khaligh2014deep, cadieu2014deep], even though CNNs’ image-level confusions differ from those of humans [rajalingham2018large].
3 Do CNNs more readily learn texture?
We used three datasets in which images differed in both shape and texture, meaning that for each dataset, we could define both shape and texture classification tasks. Examining the amounts of time and data required for models to perform well on these tasks is a way to determine whether shape or texture information is more easily exploited by CNNs. Instead of adopting a single parametric model (e.g. [heeger1995pyramid, portilla2000parametric, efros2001image, gatys2016image]), we construe texture broadly to include natural textures (e.g. dog fur), small units repeating across space, and surface-level noise patterns. Each of our datasets (Figure 2) captures a possible interpretation of texture. See Appendix C.1.1 for additional discussion of the datasets and their limitations. All datasets consisted of colored 224 × 224 images with both shape and texture labels.
Geirhos Style-Transfer dataset. Introduced as the “cue-conflict” probe stimulus set by [geirhos2018imagenet], this dataset contains images generated using neural style transfer [gatys2016image], which combines the content (shape) of one target natural image with the style (texture) of another. The dataset consists of 1200 images rendered from 16 shape classes, with 10 exemplars each, and 16 texture classes, with 3 exemplars each, used with permission from [geirhosrepo].
Navon dataset. Introduced by psychologist David Navon in the 1970s to study how people process global versus local visual information [navon1977forest], Navon figures consist of a large letter (“shape”) rendered in small copies of another letter (“texture”). Unlike the Geirhos Style-Transfer stimuli, the primitives for shape and texture are identical apart from scale, allowing for a more direct comparison of the two feature types. We rendered each possible shape-texture combination (26 × 26 letters) at 5 positions, yielding a total of 3250 items after excluding items with matching shape and texture. Each image was rotated by an angle drawn from [-45, 45] degrees.
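The item count for the Navon dataset can be reproduced by enumerating the letter pairs. The following is a small sketch of that arithmetic only (it does not render any images), and the variable names are ours:

```python
from itertools import product
from string import ascii_uppercase

N_POSITIONS = 5  # from the text: each combination is rendered at 5 positions

# All (shape letter, texture letter) pairs on the 26 x 26 grid,
# dropping the pairs in which shape and texture are the same letter.
pairs = [(s, t) for s, t in product(ascii_uppercase, repeat=2) if s != t]
n_items = len(pairs) * N_POSITIONS

print(len(pairs), n_items)
```

The 26 matching pairs on the diagonal are the only ones removed, which leaves 650 pairs before multiplying by the number of positions.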
We found that BagNet-17, a model that makes classifications based on local image patches without considering their spatial configuration [brendel2019approximating], was able to classify shape less well than texture for these stimuli (Figure A.1), suggesting it is necessary to use global features to classify shape for this dataset. In his experiments, Navon found that humans process the shape of these stimuli more rapidly than texture.
ImageNet-C dataset. ImageNet-C [hendrycks2018benchmarking] consists of ImageNet images corrupted by different kinds of noise (e.g. shot noise, fog, motion blur). Here, we take noise type as “texture” and ImageNet class as “shape”. The original dataset contains 19 textures each at 5 levels, with 1000 ImageNet classes per level and 50 exemplars per class. To balance shape and texture, for each of 5 subsets of the data (dataset “versions”), we subsampled 19 shapes, yielding a total of 90,250 items per version.
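The per-version item count follows from multiplying the four factors named above. A quick check of that arithmetic (variable names are ours):

```python
n_textures = 19   # noise types, treated as "texture"
n_levels = 5      # severity levels per noise type
n_shapes = 19     # ImageNet classes subsampled per version, treated as "shape"
n_exemplars = 50  # images per class

items_per_version = n_textures * n_levels * n_shapes * n_exemplars
print(items_per_version)
```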
We trained AlexNet [krizhevsky2012imagenet] and ResNet-50 [he2016deep] models separately for each task using [5%, 10%, 20%, 30%, …, 90%, 100%] of the training data, and compared validation accuracies for the shape and texture classification tasks. In subsampling the training data, we enforced the condition that an instance of each shape and texture class appear at least once. We trained all models for 90 epochs using Adam [kingma2014adam] with a learning rate of , weight decay of , and batch size of 64. For details, see Appendix C.1.3.
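The text does not say how the class-coverage condition was enforced during subsampling. One simple possibility is rejection sampling, sketched below under that assumption; the function name, the toy item list, and the retry limit are all hypothetical:

```python
import random

def subsample_with_coverage(items, fraction, seed=0, max_tries=1000):
    """Return sorted indices of a random subsample of `items`, a list of
    (shape_label, texture_label) pairs, in which every shape class and
    every texture class appears at least once."""
    rng = random.Random(seed)
    shapes = {s for s, _ in items}
    textures = {t for _, t in items}
    k = max(1, round(fraction * len(items)))
    for _ in range(max_tries):
        idx = rng.sample(range(len(items)), k)
        covers_shapes = {items[i][0] for i in idx} == shapes
        covers_textures = {items[i][1] for i in idx} == textures
        if covers_shapes and covers_textures:
            return sorted(idx)
    raise ValueError("no subsample covering every class was found")

# Toy stand-in for a cue-conflict dataset: 16 x 16 classes, no matching pairs.
toy_items = [(s, t) for s in range(16) for t in range(16) if s != t] * 5
subset = subsample_with_coverage(toy_items, fraction=0.05)
print(len(subset))
```

Rejection sampling keeps the subsample uniform among all subsets that satisfy the condition, at the cost of occasional retries at the smallest fractions.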
As shown in Figure 3, we found that both AlexNet and ResNet-50 models learned to classify shape at least as readily as texture. Indeed, AlexNet models required less data to achieve high accuracy on shape than on texture classification for most tasks, a difference that was most pronounced for the Geirhos Style Transfer dataset. For both AlexNet and ResNet-50, the final accuracy achieved for the Geirhos Style Transfer task was substantially higher for shape than for texture. For ImageNet-C, the already large proportion of data present at 5% makes it difficult to draw conclusions about data efficiency.
We see these patterns recapitulated in the learning dynamics over training time using the full training data. For the AlexNet models, for the Geirhos Style Transfer and Navon datasets, shape accuracy rises more quickly than does texture accuracy.
4 To what extent are shape and texture represented in ImageNet models?
Geirhos and colleagues’ finding that ImageNet-trained CNNs prefer to classify ambiguous stimuli according to their texture rather than their shape [geirhos2018imagenet] does not rule out the possibility that, even if texture preferentially drives classification decisions, shape information is still correctly represented within the model. In the following experiments, we tested the extent to which it is possible to directly read out stimulus shape versus texture information at different layers of AlexNet and ResNet-50.
To test for the presence of shape and texture information in AlexNet and ResNet-50, we trained linear multinomial logistic regression classifiers on two classification tasks. Taking as input activations from a given layer of a frozen, ImageNet-trained model, the classifier predicted either (i) the shape of a Geirhos Style Transfer image or (ii) its texture. We used the center-crop versions of AlexNet and ResNet-50 presented in Table 2. For AlexNet, we looked at “pool3” (the output of the final convolutional layer, including the max pool), “fc6” (the first linear layer of the classifier, including the ReLU), and “fc7” (the second linear layer of the classifier). For ResNet-50, we considered the layers “pre-pool” (output of the final bottleneck layer) and “post-pool” (the input to the final output layer, following the global average pool). We normalized the Geirhos Style Transfer train and validation items by the ImageNet train set statistics.
We trained linear classifiers to classify either the shape or texture of the Geirhos Style Transfer images given activations from some model layer. For each model layer-task pair, we first found a learning rate that effectively optimized the classifier, then searched over weight decay settings. We evaluated the mean classification accuracy for classifiers trained separately on each of the 5 splits of the data described in C.1.2. We trained each classifier for 90 epochs using Adam [kingma2014adam] at batch size 64.
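As an illustration of what such a linear readout involves, the sketch below fits a multinomial logistic regression to fixed feature vectors. It is a toy under stated assumptions rather than the training code used here: it uses plain SGD instead of Adam, random 2-D points stand in for frozen layer activations, and all names and hyperparameter values are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def logits_for(W, b, x):
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(b))]

def train_linear_probe(feats, labels, n_classes, lr=0.1, weight_decay=1e-4, epochs=50):
    """Fit softmax regression on fixed features with per-example SGD."""
    dim = len(feats[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = softmax(logits_for(W, b, x))
            for c in range(n_classes):
                g = p[c] - (1.0 if c == y else 0.0)  # d(cross-entropy)/d(logit_c)
                for d in range(dim):
                    W[c][d] -= lr * (g * x[d] + weight_decay * W[c][d])
                b[c] -= lr * g
    return W, b

def predict(W, b, x):
    scores = logits_for(W, b, x)
    return max(range(len(scores)), key=lambda c: scores[c])

# Toy "activations": three well-separated clusters, one per class.
rng = random.Random(0)
centers = [(3.0, 0.0), (-3.0, 0.0), (0.0, 3.0)]
feats, labels = [], []
for label, (cx, cy) in enumerate(centers):
    for _ in range(20):
        feats.append([cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)])
        labels.append(label)

W, b = train_linear_probe(feats, labels, n_classes=3)
accuracy = sum(predict(W, b, x) == y for x, y in zip(feats, labels)) / len(labels)
print(f"train accuracy: {accuracy:.2f}")
```

Because the backbone is frozen, only `W` and `b` are learned, so the accuracy reflects how linearly accessible the label is in the chosen layer.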
We found that, despite the high degree of texture bias in AlexNet and ResNet-50’s classifications, shape can nonetheless be decoded with high accuracy (66%; chance = 6.25%). In fact, using linear classifiers, it is possible to classify shape (77.9%) more accurately than texture (65.6%) from AlexNet pool3 activations. In ResNet-50, while texture classification accuracy (80.9%) is higher than shape accuracy (66.2%) for the pre-global average pool activations, shape accuracy is still high. Interestingly, shape accuracy decreases through the fully-connected layers of the AlexNet classifier, and following the global average pool operation in ResNet-50, suggesting that these models’ classification layers remove shape information.
5 Does training objective affect shape bias?
One hypothesis is that texture bias is driven by the joint image-label statistics of the ImageNet dataset. To correctly label the many dog breeds in the dataset, for instance, a model would have to make texture-based distinctions between similarly shaped objects. To test this hypothesis, we compared the shape bias of models trained with standard supervised learning to models trained with self-supervised objectives different from supervised classification. To explore the interaction of objective and model architecture, we used each objective to train models with both AlexNet and ResNet-50 base architectures.
5.1.1 Self-supervised losses
We paired each self-supervised objective with two architecture backbones: AlexNet and ResNet-50 v2, allowing for comparison with architecture-matched supervised counterparts. We restricted ourselves to self-supervised approaches that learn representations of entire images and excluded patch-based approaches. The objectives were:
Rotation classification. Input images are rotated 0, 90, 180, or 270 degrees, and the task is to predict which rotation was applied (chance = 25%) [gidaris2018unsupervised, kolesnikov2019revisiting]. In their original presentation of this loss, Gidaris et al. [gidaris2018unsupervised] argued that to do well on this task, a model must understand which objects are present in an image, along with their location, pose, and semantic characteristics, perhaps portending high shape bias for this objective.
Exemplar. This objective, first introduced by [dosovitskiy2014discriminative], learns a representation where different augmentations of the same image are close in the embedding space. Our implementation follows [kolesnikov2019revisiting]. At training time, each batch consists of 8 copies each of 512 dataset examples with different augmentations (random crops from the original image, converted to grayscale with a probability of). A triplet loss encourages distances between augmentations of the same example to be smaller than distances to other examples.
BigBiGAN. The BiGAN framework [donahue2016adversarial, dumoulin2016adversarially] jointly learns a generator that converts latent codes to images and an encoder that converts dataset images to latent codes. At training time, the encoder and generator are optimized adversarially with a discriminator. The discriminator is optimized to distinguish pairs of sampled latents with their corresponding generator output from pairs of dataset images with their corresponding latents, whereas the generator and encoder are optimized to minimize the discriminator’s performance. We use the representation from the penultimate layer of a ResNet-50 v2 encoder [donahue2019large].
AlexNet. We trained AlexNet models from scratch using a modified version of the code provided by Kolesnikov et al. [kolesnikov2019revisiting]. Unlike AlexNet models used for other experiments, these models were trained using TensorFlow rather than PyTorch, and thus the shape and texture bias of the baseline supervised model are slightly different. See Appendix C.2.1 for full training details.
ResNet-50 v2. For rotation, exemplar, and supervised losses, we used ResNet-50 v2 models made available as part of the Visual Task Adaptation Benchmark [zhai2019visual, vtab_tfhub]. For BigBiGAN, we used the public model [bigbigan_tfhub].
5.1.3 Evaluation of shape bias
To facilitate comparison across supervised and self-supervised tasks, we trained classifiers on top of the learned representations using the standard ImageNet dataset and softmax cross-entropy objective. Specifically, we froze all convolution layers and reinitialized and retrained later layers. For ResNet-50 v2, this procedure amounts to training a multinomial logistic regression classifier on top of the average-pooled representation, whereas for AlexNet, the classifier is a multilayer perceptron with two 4096-dimensional hidden layers. For completeness, we also investigated the performance of logistic regression classifiers trained directly on the final max-pooling layer of AlexNet networks (Appendix Table B.1), as well as k-nearest neighbors classifiers (Appendix Table B.2); results were similar. We provide full training details in Appendix C.2.2.
We used the same measure of shape bias as Geirhos et al. [geirhos2018imagenet]: the percentage of the time the model classified a probe item from the Geirhos Style-Transfer dataset according to shape, provided it classified either shape or texture correctly (see Appendix C.3 for additional details). Importantly, a high shape bias does not imply high shape accuracy.
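The measure can be written down compactly. The sketch below is our reading of the definition above, assuming every probe item is a cue-conflict item (its shape and texture labels differ), so that a prediction can match at most one of the two; the function name and the toy labels are hypothetical:

```python
def shape_bias(preds, shape_labels, texture_labels):
    """Fraction of shape decisions among the items for which the prediction
    matched either the shape label or the texture label."""
    shape_hits = sum(p == s for p, s in zip(preds, shape_labels))
    texture_hits = sum(p == t for p, t in zip(preds, texture_labels))
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Four toy probe items: two shape decisions, one texture decision, one miss.
preds = ["cat", "elephant", "dog", "car"]
shapes = ["cat", "knife", "dog", "boat"]
textures = ["elephant", "elephant", "clock", "bear"]

bias = shape_bias(preds, shapes, textures)
shape_accuracy = sum(p == s for p, s in zip(preds, shapes)) / len(preds)
print(bias, shape_accuracy)
```

Items on which the model predicts neither label drop out of the denominator, which is why a model can reach a high shape bias while its shape accuracy stays low.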
| Objective | Shape Bias | Shape Accuracy | Texture Accuracy | ImageNet Top-1 Acc. |
We found effects of both objective and base architecture on shape bias (Table 1). The rotation model had significantly higher shape bias than supervised models for both architectures. We note, however, that for both base architectures, this gain appears to be mostly driven by the large drop in texture accuracy exhibited by the Rotation model relative to the supervised baselines, and only secondarily by a small increase in shape accuracy. BigBiGAN also had higher shape bias than its supervised counterpart, but the exemplar objective led to lower shape bias with ResNet-50, and highly similar shape bias with AlexNet. Thus, CNNs seem predisposed to learn textural features regardless of the objective. Rotation may have lower texture bias than other tasks because rotationally invariant texture features are not useful for performing the rotation classification task.
In general, shape accuracy was higher for models with AlexNet than ResNet-50 architecture (log odds, 95% CI [0.83, 1.24], logistic regression; see Appendix C.2.3 for details), whereas the reverse was true for texture accuracy (log odds , 95% CI [0.67, 0.92]). Interestingly, the effects of architecture and task appear to be largely independent: the rank order of shape bias across tasks was similar for the two model architectures.
6 Does architecture influence shape bias?
In Section 5, AlexNet consistently exhibited higher shape bias than ResNet-50 v2 for each of the self-supervised losses investigated. We thus searched for systematic patterns across a wider range of architectures.
6.1 Shape bias correlates with ImageNet accuracy
Figure 5 shows the shape bias, shape accuracy, and texture accuracy of 16 high-performing ImageNet models trained with the same hyperparameters (see Appendix C.4.1 for details). Both shape bias and shape accuracy correlated with ImageNet top-1 accuracy, whereas texture accuracy had no significant relationship. These results suggest that better-performing ImageNet models are more effective at extracting shape information. However, AlexNet, with an ImageNet top-1 accuracy of 57.0% and a shape bias of 29.8%, demonstrates that it is also possible for models to exhibit high shape bias without high ImageNet accuracy.
6.2 Shape bias in neurally motivated models
Human visual judgments are well known to be shape-biased. Would a model explicitly built to match the primate visual system display lower texture bias than standard CNNs? We tested the shape bias of the CORNet models introduced by Kubilius et al. These models have architectures specifically designed to better match the primate ventral visual pathway both structurally (depth, presence of recurrent and skip connections, etc.) and functionally (behavioral and neural measurements) [kubilius2018cornet]. We computed the shape bias of CORNet-Z, -R, and -S, using the publicly available trained models [cornetrepo]. The simplest of these models, CORNet-Z, had a shape bias of 14.9%, shape accuracy of 9.2%, and texture accuracy of 52.2%. CORNet-R, which incorporates recurrent connections, had a shape bias of 36.7%, and shape and texture accuracies of 19.6% and 33.8%, respectively. CORNet-S, the model with the highest BrainScore [schrimpf2018brain], had a shape bias of 20.3%, and shape and texture accuracies of 13.3% and 51.9%. Taken together, these models did not exhibit a greater shape bias than other models we tested. Perhaps surprisingly, the texture accuracy was in the high range of those we observed.
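Because cue-conflict items have different shape and texture labels, a single prediction cannot count as both a shape hit and a texture hit. Under the definition in Section 5.1.3, the shape bias should therefore equal shape accuracy divided by the sum of shape and texture accuracy, provided all three are computed on the same items. The following sketch checks that reading against the CORNet numbers reported above; the dictionary name is ours, and small differences are expected because the reported values are rounded:

```python
# (shape accuracy %, texture accuracy %, reported shape bias %)
reported = {
    "CORNet-Z": (9.2, 52.2, 14.9),
    "CORNet-R": (19.6, 33.8, 36.7),
    "CORNet-S": (13.3, 51.9, 20.3),
}

implied = {}
for name, (shape_acc, texture_acc, bias) in reported.items():
    implied[name] = 100 * shape_acc / (shape_acc + texture_acc)
    print(f"{name}: implied {implied[name]:.1f}%, reported {bias:.1f}%")
```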
6.3 Shape bias of attention vs. convolution
We wondered whether convolution itself could be a cause of texture bias. Ramachandran et al. [ramachandran2019stand] recently proposed a novel approach to image classification that replaces every convolutional layer in ResNet-50 v1 with local attention, where attention weights are determined based on both relative spatial position and content. We compared this model (with spatial extent , and 8 attention heads) against a baseline ResNet-50 v1 model trained with the same hyperparameters. The attention model had a shape bias of 20.2% (shape accuracy: 12.8%; texture accuracy: 50.7%), similar to the baseline’s shape bias of 23.2% (shape accuracy: 14.4%; texture accuracy: 47.7%). Thus, use of attention in place of convolution appears to have little effect upon texture bias.
7 Does the training process contribute to texture bias?
In addition to model architecture, training procedure details are an important determiner of the representations a model ends up learning [fahlman1988empirical, wilson2017marginal, kornblith2019better, mehta2019implicit, li2019towards, muller2019does, yin2019fourier]. Do training procedures influence the shape bias of CNNs?
7.1 Does random-crop preprocessing bias models towards texture?
Geirhos et al. [geirhos2018imagenet] followed the standard practice of augmenting their original dataset with random crops: crop shapes are sampled as random proportions of the original image size from [0.08, 1.0] with aspect ratio sampled from [0.75, 1.33], and then resized to px [random_resized_crop]. We hypothesized that such preprocessing might remove global shape information from the image, since for large central objects, randomly varying parts of the object’s shape may appear in the crop, rendering shape a less reliable feature relative to texture. We used the Geirhos Style Transfer dataset to evaluate the shape bias, and shape and texture accuracies, of AlexNet, VGG16 [simonyan2014very], ResNet-50 [he2016deep, torchvision_models, imagenet_training], and Inception-ResNet v2 [tf_inception] models trained on ImageNet with random- versus center-crop preprocessing.
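To make the crop sampling concrete, the sketch below draws crop boxes in the style of a random-resized-crop transform. It is a simplified stand-in rather than the preprocessing code used for the models here: the area range [0.08, 1.0] comes from the text, the aspect-ratio bounds 3/4 and 4/3 and the log-uniform ratio sampling are assumptions modelled on torchvision's defaults, and the fallback to a central square crop is our simplification.

```python
import math
import random

def sample_crop(height, width, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3),
                rng=random, attempts=10):
    """Sample (top, left, crop_height, crop_width) for a random crop."""
    area = height * width
    for _ in range(attempts):
        target_area = area * rng.uniform(*scale)
        # Aspect ratio (width / height), sampled uniformly in log space.
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        w = round(math.sqrt(target_area * aspect))
        h = round(math.sqrt(target_area / aspect))
        if 0 < w <= width and 0 < h <= height:
            top = rng.randint(0, height - h)
            left = rng.randint(0, width - w)
            return top, left, h, w
    # Simplified fallback: central square crop.
    side = min(height, width)
    return (height - side) // 2, (width - side) // 2, side, side

rng = random.Random(0)
fractions = []
for _ in range(1000):
    top, left, h, w = sample_crop(375, 500, rng=rng)
    fractions.append(h * w / (375 * 500))
print(f"median area fraction: {sorted(fractions)[len(fractions) // 2]:.2f}")
```

Since the area fraction is drawn uniformly from [0.08, 1.0], a large share of crops cover well under the full image, which is the mechanism by which parts of a large central object can fall outside the crop.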
| Model | Shape Bias | Shape Accuracy | Texture Accuracy | ImageNet Top-1 Acc. |
Overall, we found that shape bias was higher for center-crop than for random-crop models (Table 2), consistent with our hypothesis. Similarly, shape accuracy was higher for center-crop models. The direction for texture accuracy depended on the model: it was higher for random-crop than center-crop preprocessing for AlexNet and VGG16, but higher for center-crop than random-crop for ResNet-50 and Inception-ResNet v2.
For Inception-ResNet v2, we measured shape and texture accuracy over training and found consistent dynamics across preprocessing settings. Texture accuracy peaked early (random-crop: 8 epochs; center-crop: 6 epochs) while shape accuracy peaked later (random-crop: 50 epochs; center-crop: 48 epochs). The center-crop model had higher shape bias and shape accuracy for all epochs evaluated.
7.2 Do hyperparameters that maximize validation accuracy also maximize shape bias?
Extracting optimal performance from neural networks requires tuning hyperparameters for the specific architecture and dataset. However, the hyperparameters that optimize performance on a held-out validation set drawn from the same distribution do not necessarily optimize for either shape or texture bias. In order to determine whether there were consistent patterns in the relationship between hyperparameters and shape or texture accuracy, we performed a hyperparameter sweep across a grid of learning rate and weight decay settings. We trained ResNet-50 [tf_resnet50] networks on the 16 ImageNet superclasses used by Geirhos et al. [geirhos2018imagenet], taking 1000 images from each superclass, and computed mean per-class accuracy on the corresponding ImageNet validation subset. We trained networks for 40,000 steps at a batch size of 256 using SGD with momentum of 0.9 with a cosine decay learning rate schedule and standard random-crop preprocessing, and averaged results across 3 runs.
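For reference, a cosine decay schedule over the 40,000 training steps can be sketched as follows. The base learning rate shown is a placeholder (the sweep varies it), and the absence of warmup and the decay to exactly zero are assumptions, since the text does not specify them:

```python
import math

def cosine_lr(step, total_steps=40_000, base_lr=0.1):
    """Learning rate at `step` under cosine decay from base_lr to 0."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for step in (0, 10_000, 20_000, 30_000, 40_000):
    print(step, round(cosine_lr(step), 4))
```

In a grid sweep, each learning-rate setting would rescale this whole curve, so a higher setting raises the rate throughout training rather than only at the start.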
As shown in Figure 6, higher values of learning rate and weight decay were associated with greater shape accuracy and shape bias, whereas lower learning rates were associated with greater texture accuracy. We observed the highest shape accuracy at the highest learning rate where the network could be reliably trained and the highest texture accuracy at the lowest learning rate tested. Mean per-class accuracy on 16-class ImageNet was sensitive to the value of the product of the weight decay and learning rate, but relatively insensitive to the value of the learning rate itself.
In Section 3, we reported experiments suggesting that, across datasets representing three different conceptions of shape and texture, CNNs are able to learn to use shape features at least as easily as texture features. Why, then, do studies like [geirhos2018imagenet, baker2018deep] consistently find texture biases in ImageNet-trained models? We found that training objective, model architecture, data preprocessing, and hyperparameter choices all make distinct contributions to the level of texture bias in a model.
Among two state-of-the-art architectures, AlexNet exhibited a higher shape bias than ResNet-50 v2 for each training objective we investigated. In future work, it will be worthwhile to pinpoint the source of this difference.
Despite the fact that these models make texture-biased classification decisions, the shape information of ambiguous images is still decodable from their hidden representations. Our experiments suggest that the models’ classification layers progressively downweight shape information.
Largely independently of model architecture, training objective affects a model’s level of texture bias. However, of the self-supervised objectives we investigated, none resulted in strongly (> 50%) shape-biased models, suggesting that current popular self-supervised objectives do not encourage reliance on shape over textural features. Further, though one might expect that generative models would develop more shape-biased representations, BigBiGAN exhibited a texture bias (31.9%). Still, we hypothesize that other generative models, for example ones that learn to decompose scenes into semantic constituents, or RL models that learn to interact with objects, may exhibit a stronger shape bias than the models investigated here.
Our finding in Section 6 that, in high-performing supervised ImageNet models, shape bias and shape accuracy are significantly positively correlated with ImageNet performance, presents a puzzle: if making classification decisions based on shape is useful for ImageNet performance, why do models not learn to better exploit this feature?
An important caveat to our investigations is that many of our experiments used the Geirhos Style Transfer dataset [geirhos2018imagenet]. Since the neural style transfer model used to create this dataset itself includes an ImageNet-trained CNN (VGG19), this dataset is not wholly independent of the models we use it to evaluate. In future work, it will be useful to develop a naturalistic dataset generated independently of ImageNet-trained CNNs.
Going forward, an engineering challenge will be to develop architectures that are more shape-biased. The correlation of shape bias and shape accuracy with ImageNet performance suggests this is worthwhile, and our learning dynamics experiments suggest that it should be possible for CNNs to better learn shape information. While Geirhos and colleagues have shown that training on a style-transfer-augmented dataset increases ImageNet performance for a ResNet-50 architecture [geirhos2018imagenet], building model architectures that do not require data augmentation to learn shape-biased representations remains an appealing goal. Rather than simply maximizing shape bias, which can be achieved by minimizing texture accuracy, an ideal shape-biased model would make use of both the shape and texture information present in images.
We thank Jay McClelland, Andrew Lampinen, Akshay Jagadeesh, and Chengxu Zhuang for useful conversations, and Guodong Zhang and Lala Li for comments on an earlier version of the manuscript. KLH was supported by NSF GRFP grant DGE-1656518.
Appendix A Supplemental Figures
Appendix B Supplemental Tables
| Objective | Shape Bias | Shape Accuracy | Texture Accuracy | ImageNet Top-1 Acc. |
| Objective | Shape Match | Texture Match | Both Match | Other |
Appendix C Methods
C.1 Learning experiments
C.1.1 Dataset considerations
On their own, none of these datasets are unproblematic representations of texture. For the Geirhos Style Transfer dataset, for example, human subjects were unable to attain good performance on the style classification task (mean accuracy = 14.2%, chance = 6.25%; analysis of data from [geirhos2018imagenet] human experiment in which subjects were given texture-biased instructions, originally presented in Fig 10b of [geirhos2018imagenet] plotted by shape class; data obtained from [geirhosrepo]). Further, the performance of the style transfer algorithm on individual images introduces another source of variability, and the fact that style transfer itself relies on ImageNet-trained CNN features means that the data were not generated independently of the models being evaluated. The Navon stimuli, meanwhile, strongly deviate from the statistics of natural images. Finally, the noise textures from ImageNet-C arguably deviate the farthest from what people generally mean by the term. We hope that presenting results for all three datasets will dilute any idiosyncrasies of the datasets individually. In future work, we hope to create new datasets that combine the controllability of the Navon stimuli with the naturalism of the Geirhos Style Transfer and ImageNet-C datasets.
C.1.2 Dataset splits
Geirhos Style-Transfer dataset. We created 5 cross-validation splits of the data, using each cv split for both classification tasks. To create a given split, we held out a single shape exemplar and a single texture exemplar, and confirmed that no whole shape or texture classes were held out. During the texture task, then, a model was required to generalize a given texture across exemplars of that texture; during the shape task, it had to generalize a given shape across exemplars of that shape. The mean validation size over cv splits was 483 items (40.3% of the data). Although the dataset contains 80 images where shape and texture match, [geirhos2018imagenet] excluded these when computing shape and texture bias, and we exclude these from our experiments.
Navon dataset. For the Navon dataset, we created 5 cv splits independently for each task. For the shape task, we held out 3 texture classes (e.g. the letters “T”, “U”, “E”), and for the texture task, we held out 3 shape classes. The validation size was 375 items (11.5% of the data).
ImageNet-C dataset. We split each version of the dataset separately for the shape and texture tasks. For the shape task, we held out 2 texture classes (e.g. “snow”, “fog”); for the texture task, we held out 2 shape classes (e.g. the wnids “n01632777”, “n03188531”). The validation size was 9,500 items (10.5% of the data).
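The class-held-out splits described for the Navon and ImageNet-C datasets can be sketched as follows. This is an illustration only, not the code used for the experiments: the function name `held_out_class_split`, the representation of items as `(shape, texture)` tuples, and the seeded random choice of held-out classes are all assumptions.

```python
import random


def held_out_class_split(items, task, n_held_out, seed=0):
    """Split (shape, texture) items by holding out classes of the *other* factor.

    For the shape task, texture classes are held out, so validation items show
    textures never seen in training; for the texture task, shapes are held out.
    """
    if task not in ("shape", "texture"):
        raise ValueError("task must be 'shape' or 'texture'")
    # Position of the factor whose classes are held out (hypothetical layout).
    held_idx = 1 if task == "shape" else 0
    classes = sorted({item[held_idx] for item in items})
    rng = random.Random(seed)
    held_out = set(rng.sample(classes, n_held_out))
    train = [it for it in items if it[held_idx] not in held_out]
    val = [it for it in items if it[held_idx] in held_out]
    return train, val, held_out
```

If items are spread evenly across classes, holding out 3 of 26 classes gives a validation fraction of 3/26 ≈ 11.5%, and holding out 2 of 19 gives 2/19 ≈ 10.5%, which agrees with the validation sizes reported above.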
We trained AlexNet models with the output layer modified to reflect the number of classes present in the dataset at hand (Geirhos Style Transfer: 16, Navon: 26, ImageNet-C: 19). We additionally reduced the widths of the fully connected layers in proportion to the reduction in number of output classes vs. ImageNet.
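As a rough illustration of the proportional width reduction, the sketch below scales a fully connected layer width by the ratio of dataset classes to ImageNet classes. The base width of 4096 and the 1000 ImageNet classes are the standard AlexNet values; rounding to the nearest integer and the helper name `scaled_fc_width` are assumptions, since the exact rounding rule is not stated.

```python
def scaled_fc_width(num_classes, base_width=4096, base_classes=1000):
    """Scale a fully connected layer width in proportion to the class count.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    return max(1, round(base_width * num_classes / base_classes))


# Class counts taken from the text above.
widths = {
    "Geirhos Style Transfer": scaled_fc_width(16),
    "Navon": scaled_fc_width(26),
    "ImageNet-C": scaled_fc_width(19),
}
```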
We preprocessed training and validation images by normalizing the pixel values by the mean and standard deviation of the subset of data used for training. For the Geirhos Style-Transfer and ImageNet-C datasets, we randomly horizontally flipped each training image with p = 0.5 during training.
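A minimal sketch of this preprocessing is given below. It assumes single-channel images stored as nested lists and a single scalar mean and standard deviation; whether the statistics were computed per channel is not stated here, and the function names are hypothetical.

```python
import random
from statistics import fmean, pstdev


def fit_normalizer(train_images):
    """Compute mean and standard deviation over the training subset only."""
    pixels = [p for img in train_images for row in img for p in row]
    mean, std = fmean(pixels), pstdev(pixels)
    if std == 0:
        raise ValueError("training pixels are constant; cannot normalize")
    return mean, std


def normalize(img, mean, std):
    """Apply the training-set statistics to a training or validation image."""
    return [[(p - mean) / std for p in row] for row in img]


def random_hflip(img, rng, p=0.5):
    """Mirror the image left-to-right with probability p (training only)."""
    if rng.random() < p:
        return [row[::-1] for row in img]
    return img
```

Fitting the statistics on the training subset and reusing them for validation keeps information from held-out items out of the preprocessing step.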
C.2 Self-supervised representation experiments
C.2.1 Self-supervised training
We trained AlexNet models from scratch using a modified version of the code provided by Kolesnikov et al. [kolesnikov2019revisiting]. As the base network, we used the AlexNet implementation from TensorFlow-Slim (https://github.com/tensorflow/models/tree/master/research/slim). For consistency with the PyTorch AlexNet, we modified the first convolutional layer of the TensorFlow-Slim network to use padding and trained at 224 × 224 pixel resolution. Unlike Gidaris et al. [gidaris2018unsupervised], we did not use batch normalization. We trained all AlexNet models for 90 epochs using SGD with momentum of 0.9 at a batch size of 512 examples, with a weight decay of and an initial learning rate of 0.02. We decayed the learning rate by a factor of 10 at epochs 30 and 60. For all models, we used preprocessing consisting of random crops sampled as random proportions of the original image size and random flips.
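The step learning rate schedule described above can be written compactly. The sketch below assumes 0-indexed epochs and that each decay takes effect at the start of the milestone epoch; neither detail is specified in the text, and `step_lr` is a hypothetical helper rather than part of the training code.

```python
def step_lr(epoch, base_lr=0.02, milestones=(30, 60), gamma=0.1):
    """Learning rate for a given epoch under step decay.

    The rate is multiplied by `gamma` once for every milestone already reached.
    """
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** n_decays


# Learning rate for each of the 90 training epochs.
schedule = [step_lr(e) for e in range(90)]
```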
C.2.2 Training supervised classifiers on self-supervised representations
We trained all classifiers using SGD with momentum of 0.9 with data augmentation consisting of random flips and random crops obtained by resizing the image to 256 pixels on its shortest side and cropping 224 × 224 regions. This less aggressive form of cropping was used for training classifiers on top of self-supervised representations in previous work [gidaris2018unsupervised, kolesnikov2019revisiting, donahue2019large], and we found it to be essential to reproduce their results.
For fair comparison between supervised and self-supervised models, Table 1 presents supervised AlexNet results where the network was first trained with aggressive random crops sampled as random proportions of the original size and random flips, and then the convolutional layers were frozen and the fully connected layers were retrained with the less aggressive cropping strategy described above, thus replicating the training procedure for the self-supervised AlexNet models. However, the model obtained by retraining the fully connected layers performed very similarly to the original model. Both obtained ImageNet top-1 accuracies of 57.0%, and shape bias was also nearly identical (original model: 30.6%; model with retrained fully connected layers: 29.9%).
Logistic regression on ResNet representations. We trained for 520 epochs at a batch size of 2048 and an initial learning rate of 0.8 without weight decay, decaying the learning rate by a factor of 10 at 480 and 500 epochs.
AlexNet MLP training. When retraining the MLP at the end of AlexNet networks, we trained for 90 epochs at a batch size of 512 with an initial learning rate of 0.02, decayed by a factor of 10 at 30 and 90 epochs. We optimized weight decay by choosing the best value out of on a held-out set of 50,046 examples, and then trained on the full ImageNet dataset. Optimal values for weight decay were for the supervised model, for rotation, and for the exemplar loss.
Logistic regression on AlexNet pool3 layer. We trained for 600 epochs, decaying the learning rate by a factor of 10 at 300, 400, and 500 epochs. As for AlexNet MLP training, we optimized weight decay on a held-out validation set. The optimal values did not change.
C.2.3 Statistical modeling
We performed statistical modeling of the effects of self-supervised loss and architecture using logistic regression. We modeled the logit of the probability of correct shape/texture classification of each example with each network as a linear combination of effects of architecture, loss, the individual example, and an intercept term. This model is a generalization of repeated measures ANOVA where the dependent variable is binary. We fit the model using iteratively reweighted least squares using statsmodels [seabold2010statsmodels]. We excluded examples that all networks classified correctly or incorrectly; these do not affect the values of parameters corresponding to architecture or loss, but cause per-example parameters to diverge during model fitting. Coefficients provided in the paper are maximum likelihood estimates with Wald confidence intervals computed based on the corresponding standard errors from the Fisher information matrix.
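Written out, the model described above has the following form. The notation is introduced here for illustration and does not appear in the original text: $y_{ij}$ indicates whether network $j$ classified example $i$ correctly, $a(j)$ and $\ell(j)$ denote the architecture and loss of network $j$, $\mu$ is the intercept, and $\gamma_i$ is the per-example effect. The confidence level $1 - \delta$ is left generic because it is not stated in this section.

```latex
\begin{align}
  \operatorname{logit} \Pr(y_{ij} = 1)
    &= \mu + \alpha_{a(j)} + \beta_{\ell(j)} + \gamma_i, \\
  \text{Wald interval for a coefficient } \theta_k:\quad
    &\hat{\theta}_k \pm z_{1-\delta/2}\,
      \sqrt{\bigl[\mathcal{I}(\hat{\theta})^{-1}\bigr]_{kk}},
\end{align}
```

Here $\mathcal{I}(\hat{\theta})$ is the Fisher information matrix evaluated at the maximum likelihood estimate. For an example that every network classifies correctly (or incorrectly), the likelihood keeps increasing as $\gamma_i \to +\infty$ (or $-\infty$), which is why such examples are excluded before fitting.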
C.3 Evaluation of shape bias, shape accuracy, and texture accuracy
To evaluate shape and texture accuracy and shape bias in ImageNet-trained models, following [geirhos2018imagenet], we presented models with full, uncropped images from the Geirhos Style Transfer dataset, collected the class probabilities returned by the model, and mapped these to the 16 superclasses defined by [geirhos2018imagenet] by summing over the probabilities for the ImageNet classes belonging to each superclass. For networks in Section 7.2, which were trained to classify only the 16 superclasses, we simply took the class to which the network assigned the highest probability. Shape accuracy was the percentage of the time a model correctly predicted probe items’ shapes, texture accuracy was the percentage of the time the model correctly predicted probe items’ textures, and shape bias was the percentage of the time the model predicted shape for trials on which either shape or texture prediction was correct.
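The evaluation procedure above can be sketched in a few lines. This is an illustrative sketch rather than the evaluation code itself: the function names and the dictionary-based representation of class probabilities are assumptions, and ImageNet classes that belong to none of the 16 superclasses are simply ignored.

```python
def superclass_probs(class_probs, class_to_super):
    """Sum fine-grained class probabilities within each superclass."""
    totals = {}
    for cls, p in class_probs.items():
        sup = class_to_super.get(cls)
        if sup is not None:  # classes outside the superclasses are skipped
            totals[sup] = totals.get(sup, 0.0) + p
    return totals


def predict_superclass(class_probs, class_to_super):
    """Return the superclass with the largest summed probability."""
    totals = superclass_probs(class_probs, class_to_super)
    return max(totals, key=totals.get)


def shape_texture_metrics(predictions, shapes, textures):
    """Return (shape accuracy, texture accuracy, shape bias) as fractions.

    Assumes cue-conflict items, i.e. shape != texture for every item.
    """
    n = len(predictions)
    shape_hits = sum(p == s for p, s in zip(predictions, shapes))
    texture_hits = sum(p == t for p, t in zip(predictions, textures))
    decided = shape_hits + texture_hits
    shape_bias = shape_hits / decided if decided else float("nan")
    return shape_hits / n, texture_hits / n, shape_bias
```

Because shape bias is computed only over trials where the prediction matches either the shape or the texture label, it can be high even when shape accuracy itself is low.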
C.4 Architecture experiments
C.4.1 Training settings for comparison of ImageNet architectures
We trained at a batch size of 4096 using SGD with Nesterov momentum of 0.9 and weight decay of and performed evaluation using an exponential moving average of the training weights computed with decay factor 0.9999. The learning rate schedule consisted of 10 epochs of linear warmup to a maximum learning rate of 1.6, followed by exponential decay at a rate of 0.975 per epoch. For all conditions we randomly horizontally flipped images and performed standard Inception-style color augmentation.
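The learning rate schedule and weight averaging described above can be sketched as follows. Several details are assumptions made for this illustration: the warmup is taken to start from zero, the decay is counted from the end of warmup, and the moving average is updated once per training step. The function names are hypothetical.

```python
def warmup_exp_lr(epoch, peak_lr=1.6, warmup_epochs=10, decay_rate=0.975):
    """Learning rate at a (possibly fractional) epoch.

    Linear warmup from 0 to `peak_lr`, then exponential decay per epoch.
    """
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    return peak_lr * decay_rate ** (epoch - warmup_epochs)


def ema_update(shadow, weights, decay=0.9999):
    """One moving-average step: shadow <- decay * shadow + (1 - decay) * weights."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]
```

Evaluation would then use the `shadow` weights rather than the raw training weights.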
C.5 Data augmentation experiments
For AlexNet, VGG16, and ResNet-50 models, we used implementations available through torchvision (https://github.com/pytorch/vision). We trained these models for 90 epochs using SGD with momentum of 0.9 at a batch size of 64 and with weight decay of . For AlexNet and VGG16, we used an initial learning rate of 0.0025; for ResNet-50, we used an initial learning rate of 0.025. For all models, the learning rate was decayed by a factor of 10 at epochs 30 and 60. For all conditions we randomly horizontally flipped training images. We evaluated shape bias and shape and texture accuracy at the point over the training period corresponding to maximum classification accuracy on the validation set.
For Inception-ResNet v2, we used the implementation from TensorFlow-Slim (https://github.com/tensorflow/models/tree/master/research/slim), trained as described in Section C.4.1. To evaluate shape and texture accuracy, we used the checkpoint that achieved the highest accuracy on the ImageNet validation set, at 122 epochs with random crops and 58 epochs without random crops.
Our results for AlexNet and VGG16 differ slightly from those in Geirhos et al. [geirhos2018imagenet], which reported shape biases of AlexNet (42.9%) and VGG16 (17.2%) models implemented in Caffe. Since publication of their paper, they have reported the shape biases for PyTorch implementations of these models, with the random-crop preprocessing we have described, obtaining 25.3% for AlexNet, 9.2% for VGG16, and 22.1% for ResNet-50 [geirhosrepo]. Using the pretrained models available through PyTorch’s model zoo, which uses random-crop preprocessing, we obtained comparable results to theirs: 26.9% for AlexNet, 10% for VGG16, and 22.1% for ResNet-50; slight differences may be due to differences in random initialization. These models were trained with a batch size of 256, but the results we report in Table 2 and Figure A.3 for our models trained at batch size 64 are within a few percentage points of the larger-batch-size models.