Code for our paper "Informative Dropout for Robust Representation Learning: A Shape-bias Perspective" (ICML 2020)
Convolutional Neural Networks (CNNs) are known to rely more on local texture than on global shape when making decisions. Recent work also indicates a close relationship between a CNN's texture bias and its robustness against distribution shift, adversarial perturbation, random corruption, etc. In this work, we attempt to improve various kinds of robustness universally by alleviating CNNs' texture bias. With inspiration from the human visual system, we propose a light-weight model-agnostic method, namely Informative Dropout (InfoDrop), to improve interpretability and reduce texture bias. Specifically, we discriminate texture from shape based on local self-information in an image, and adopt a Dropout-like algorithm to decorrelate the model output from the local texture. Through extensive experiments, we observe enhanced robustness under various scenarios (domain generalization, few-shot classification, image corruption, and adversarial perturbation). To the best of our knowledge, this work is one of the earliest attempts to improve different kinds of robustness in a unified model, shedding new light on the relationship between shape bias and robustness, as well as on new approaches to trustworthy machine learning algorithms. Code is available at https://github.com/bfshi/InfoDrop.
Despite its impressive performance in a broad range of visual tasks, the Convolutional Neural Network (CNN) is surprisingly vulnerable compared with the human visual system. For example, features learned by CNNs have trouble generalizing across shifted distributions between training and test data (Chen et al., 2019; Wang et al., 2019a). Random image corruptions can also considerably degrade their performance (Hendrycks and Dietterich, 2019). CNNs are also nearly defenseless against well-designed image perturbations (Szegedy et al., 2013). This is in contrast to the human visual system, which is robust to domain gap, noisy input, etc. (Biederman, 1987; Bisanz et al., 2012; Geirhos et al., 2017).
Another intriguing property of CNN is its ‘texture bias’, namely its bias towards texture instead of shape. Despite the earlier belief that CNNs extract more abstract shapes and structures layer by layer as humans do (Kriegeskorte, 2015; LeCun et al., 2015), recent works reveal their reliance on local texture when making decisions (Geirhos et al., 2019; Brendel and Bethge, 2019). For instance, given an image with a cat’s shape filled with an elephant’s skin texture, a CNN tends to classify it as an elephant instead of a cat (Geirhos et al., 2019).
Supported by some recent works, there seems to be a surprisingly close relationship between CNN’s robustness and texture-bias. For example, Zhang and Zhu (2019) find that adversarially trained CNNs are innately less texture-biased. There are also a few attempts to tackle a specific task by training a less texture-biased model. Carlucci et al. (2019) propose to improve robustness against domain gap by training on jigsaw puzzles, which relies more on global structure information. Geirhos et al. (2019) find that shape-biased CNNs trained on stylized images are more robust to random image distortions. Up to this point, one may naturally wonder:
Is texture-bias a common reason for CNN’s different kinds of non-robustness against distribution shift, adversarial perturbation, image corruption, etc.?
To explore the answer, this work aims at improving various kinds of robustness universally by alleviating CNN’s texture bias and enhancing shape-bias. Some approaches to train shape-biased CNNs have been proposed recently. However, they are either susceptible to complex patterns (see Fig. 1(b)) (Radenovic et al., 2018), or have high computational complexity and require auxiliary tasks (Geirhos et al., 2019; Wang et al., 2019a; Carlucci et al., 2019; Wang et al., 2019b). In this work, we propose a light-weight model-agnostic method, namely Informative Dropout (InfoDrop). The inspiration comes from earlier works on saliency detection and human eye movements: humans tend to look at regions with high self-information, i.e., regions that are more ‘surprising’ given their surrounding regions (Bruce and Tsotsos, 2006, 2009). In other words, people tend to pay more attention to regions that look different from neighboring regions. In our case, patterns like flat regions or high-frequency textures tend to repeat themselves in the neighboring region, and are thus less informative. On the other hand, visual primitives (e.g. edges, corners) are more unique and thus more informative within their neighborhood. Fig. 1(c) provides a visualization of the information distribution in natural images. Note that both shape and important characteristics (e.g. eyes, stripes) are accentuated, while texture (e.g. hair) is relatively repressed.
To this end, InfoDrop is proposed to reduce texture-bias by decorrelating each layer’s output with less informative input regions. Specifically, we adopt a Dropout-like algorithm (Srivastava et al., 2014): for input regions with less information, we zero out the corresponding output neurons with higher probability. In this way, reliance on textures can be reduced and the model is trained to be more biased towards shapes. By removing InfoDrop after training, we further demonstrate that the model is internally shape-biased even without InfoDrop during inference. The shape-bias property is exhibited through different experiments, both qualitatively and quantitatively.
To evaluate the robustness of InfoDrop, we conduct extensive experiments in four different tasks: domain generalization, few-shot classification, robustness against random corruption, and adversarial robustness. Results show a consistent improvement in different kinds of robustness over various baselines, demonstrating the effectiveness and versatility of our method. We also demonstrate that InfoDrop can be combined with other algorithms (e.g. adversarial training) to further enhance the robustness.
With inspiration from the human visual system, we propose InfoDrop, an effective albeit simple plug-in method to reduce the general texture bias of any CNN-based model.
As shown by extensive experiments, InfoDrop achieves consistently non-trivial improvement over multiple baselines in a wide variety of robustness settings. Furthermore, InfoDrop can be incorporated together with other algorithms to obtain higher robustness.
To the best of our knowledge, this work is one of the earliest attempts to improve different kinds of robustness in a unified model. This sheds new light on the relationship between CNN’s texture-bias and non-robustness, also on new approaches to building trustworthy machine learning algorithms.
An important feature of intelligence is its ability to generalize knowledge across tasks, domains and categories (Csurka, 2017). However, CNNs still struggle when different kinds of distribution shifts exist between training and test data. For instance, in few-shot classification, where large class gap is the main challenge, complex algorithms make little improvement upon simple baselines (Chen et al., 2019; Huang et al., 2020; Dhillon et al., 2020). CNNs also have trouble with transferring knowledge across different domains, especially when data is unavailable in the target domain as in the task of domain generalization (Khosla et al., 2012; Li et al., 2017, 2018b; Carlucci et al., 2019). In this work, we evaluate our method’s robustness against distribution shift on tasks of few-shot classification and domain generalization.
CNNs are also sensitive to small perturbations and corruptions in images, which can be easily dealt with by humans (Azulay and Weiss, 2019). Hendrycks and Dietterich (2019) benchmark CNN’s robustness against 18 types of random corruption, demonstrating its vulnerability. It is also shown that well-designed perturbation, namely adversarial perturbation, can severely degrade the performance of CNNs (Szegedy et al., 2013). We evaluate the robustness of our approach against both random corruption and adversarial perturbation, with other methods towards model robustness as baseline, e.g., adversarial training (Madry et al., 2018; Zhang et al., 2019).
Despite the recent impressive performance of CNNs in various vision tasks, the visual processing mechanism behind them remains controversial. One widely accepted hypothesis is that CNNs extract low-level primitives (e.g. edges, corners) in lower layers and try to combine them into complex shapes in higher layers (Kriegeskorte, 2015; LeCun et al., 2015). This hypothesis is supported by a number of empirical findings, from both computational (Zeiler and Fergus, 2014) and psychological (Ritter et al., 2017) perspectives. However, recent work argues that local texture is sufficient for CNNs to perform correct classification (Brendel and Bethge, 2019). Shape or contour information, on the other hand, seems hard for CNNs to understand (Ballester and Araujo, 2016). CNNs also fail at transferring between images with similar shapes yet distinct textures (Geirhos et al., 2019). These findings indicate an alternative explanation for the success of CNNs: local texture is what CNNs base their decisions on.
More and more work indicates a close relationship between CNNs’ non-robustness and texture-bias. Zhang and Zhu (2019) find that adversarially trained networks are less texture-biased. Geirhos et al. (2019) show that shape-biased models trained with stylized images are more robust against image distortion. Carlucci et al. (2019) propose to boost domain generalization by training to solve jigsaw puzzles, which relies more on global structure. Wang et al. (2019a) propose to penalize CNN’s local predictive power to reduce the domain gap induced by image background. With the same objective, Wang et al. (2019b) propose to project out superficial statistics in feature space. However, none of the work has discussed the relationship between texture-bias and different types of non-robustness in a unified model.
It is known that human eyes tend to fixate on specific regions (saliency) rather than scan the whole image they see (Yarbus, 2013). The mechanism behind this kind of bias has attracted lots of interest. Itti et al. (1998) reveal the importance of center-surround contrast of units in the human visual system. Hou and Zhang (2007) detect saliency using residual contrast in the spectral domain. Other works propose to use Shannon entropy to measure saliency and predict fixation (Fritz et al., 2004; Renninger et al., 2005). In Bruce and Tsotsos (2006), self-information is proposed to better model saliency.
In addition, shape-bias is also found to be critical in the human visual system. A large amount of evidence shows that shape is the most important single cue for human vision learning and processing (Landau et al., 1988). For example, young children tend to extend object names based on shape, rather than size, color or material (Diesendruck and Bloom, 2003). The shape bias of human vision, together with its bias towards self-information, further motivates our proposed method.
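Let $x \in \mathbb{R}^{c^0 \times h^0 \times w^0}$ denote an image with $c^0$ channels and spatial shape $h^0 \times w^0$. For a CNN, we denote the input of the $l$-th convolutional layer by $x^{l-1} \in \mathbb{R}^{c^{l-1} \times h^{l-1} \times w^{l-1}}$ and its output by $z^l \in \mathbb{R}^{c^l \times h^l \times w^l}$. Note that $x^0$ equals the input image $x$. Assume the $l$-th layer has a convolutional kernel $W^l \in \mathbb{R}^{c^l \times c^{l-1} \times k \times k}$ and bias $b^l \in \mathbb{R}^{c^l}$, where $k$ is the kernel size. Then for the $j$-th element of the $c$-th output channel, $z^l_{c,j}$ ($1 \le c \le c^l$, $1 \le j \le h^l w^l$), we have $z^l_{c,j} = f(W^l_c \cdot p^{l-1}_j + b^l_c)$, where $p^{l-1}_j \in \mathbb{R}^{c^{l-1} \times k \times k}$ is the $j$-th patch in $x^{l-1}$, $W^l_c$ and $b^l_c$ are the kernel and bias for the $c$-th output channel, $\cdot$ indicates the inner product and $f$ is an entry-wise activation function (e.g. ReLU). Throughout this paper, $\|\cdot\|$ denotes the Euclidean norm.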
Now we develop our information-based Dropout method for alleviating texture-bias. As discussed in Section 1, regions of textures tend to contain low self-information. To this end, we propose to reduce texture-bias by decorrelating each layer’s output with low-information regions in the input. Specifically, we adopt a Dropout-like approach for the purpose. In traditional Dropout (Srivastava et al., 2014), a multiplicative Bernoulli noise is introduced to help prevent overfitting, where each neuron is zeroed out with equal probability. In order to suppress texture-bias, we propose to zero out an output neuron with higher probability if the input patch contains less information, and vice versa. Specifically, we model the drop coefficient $r(z^l_{c,j})$ of the $j$-th neuron in the output’s $c$-th channel with a Boltzmann distribution:

$$r(z^l_{c,j}) \propto \exp\left(-\frac{\mathcal{I}(p^{l-1}_j)}{T}\right), \tag{1}$$

where $p^{l-1}_j$ is the patch in the input related to the computation of $z^l_{c,j}$, $\mathcal{I}$ denotes self-information and $T$ is the temperature. When the value of $\mathcal{I}(p^{l-1}_j)$ is low, the corresponding neuron is likely to be dropped, and the network tends to rely less on $p^{l-1}_j$. Here the temperature $T$ serves as a ‘soft threshold’ of information. When $T$ is small, the threshold lowers, and only patches with the least information (e.g. a patch in a solid-colored region) will be dropped. When $T$ goes to infinity, all neurons will be dropped with equal probability, and the whole algorithm becomes regular Dropout.
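The effect of the temperature can be illustrated with a minimal sketch, assuming the Boltzmann form described above (drop probability proportional to the exponential of negative self-information divided by temperature). The function name `drop_probabilities` and the three self-information values are hypothetical and chosen only for illustration; they are not taken from the released code.

```python
import math

def drop_probabilities(info, temperature):
    """Boltzmann weights: p_j is proportional to exp(-info_j / temperature)."""
    # Subtracting the minimum keeps the exponentials in a safe range and
    # does not change the normalized probabilities.
    base = min(info)
    weights = [math.exp(-(value - base) / temperature) for value in info]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical self-information of three patches:
# a flat region, a repetitive texture, and an edge.
info = [0.1, 1.0, 4.0]

for T in (0.1, 1.0, 1e6):
    probs = drop_probabilities(info, T)
    print(T, [round(p, 3) for p in probs])
```

With a small temperature almost all of the drop probability falls on the least informative patch, while a very large temperature gives nearly uniform probabilities, which is the regular-Dropout limit.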
First we discuss how to estimate $\mathcal{I}(p^{l-1}_j)$. The definition of information dates back to Shannon’s work (Shannon, 1948), from which we borrow the concept of self-information to describe the information of a patch:

$$\mathcal{I}(p^{l-1}_j) = -\log q(p^{l-1}_j), \tag{2}$$

where $q$ is the distribution from which $p^{l-1}_j$ is sampled, if we see $p^{l-1}_j$ as a realization of a random variable. As a simple case, we can assume that all patches in the neighborhood of $p^{l-1}_j$ are different realizations of the same random variable, i.e., they are all sampled from the same distribution $q$. In this case, if $p^{l-1}_j$ contains more “texture” than “shape”, its pattern will repeat itself within a local region, resulting in a high likelihood $q(p^{l-1}_j)$ and hence low self-information, so it should be zeroed out with high probability.
To approximate $q$, we assume that $p^{l-1}_j$ and other patches in its neighbourhood come from the same distribution $q$. Here the neighbourhood $\mathcal{N}^{l-1}_j$ means a local region centered at $p^{l-1}_j$ with Manhattan radius $R$, i.e., the neighborhood contains $(2R+1)^2$ patches. Then, with neighboring patches as samples (see footnote 1), we approximate $q$ with its kernel density estimator, i.e.

$$\hat{q}(p) = \frac{1}{(2R+1)^2} \sum_{p' \in \mathcal{N}^{l-1}_j} K(p, p'), \tag{3}$$

where $K$ is the kernel function. Here we use a Gaussian kernel, i.e., $K(p, p') = \frac{1}{\sqrt{2\pi}h} \exp\left(-\|p - p'\|^2 / 2h^2\right)$, where $h$ is the bandwidth. Then the information of $p^{l-1}_j$ is estimated by

$$\hat{\mathcal{I}}(p^{l-1}_j) = -\log \left\{ \frac{1}{(2R+1)^2} \sum_{p' \in \mathcal{N}^{l-1}_j} \frac{1}{\sqrt{2\pi}h} \exp\left(-\frac{\|p^{l-1}_j - p'\|^2}{2h^2}\right) \right\}. \tag{4}$$

Footnote 1: Here all the patches in the neighborhood are used. Nonetheless, one can instead use only a random subset of the patches for an unbiased estimate, which reduces the computational load, especially when the radius of the neighborhood is large. From our observation, this barely affects the performance.
As one can observe, the more different is from neighbouring patches, the more information it contains. For regions of solid color or high-frequency texture, similar patterns tend to repeat in the neighborhood, and thus little information is presented. Local shapes, on the other hand, are more unique in their surroundings and thus more informative.
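The estimate can be sketched in a few lines, assuming patches are flattened into lists of floats and the neighborhood (which includes the patch itself) is passed in explicitly. The helper name `self_information` and the toy patches are hypothetical; the kernel's normalizing constant is dropped because it only shifts every estimate by the same amount.

```python
import math

def self_information(patch, neighbors, bandwidth):
    """Estimate -log q(patch) with a Gaussian kernel density estimate."""
    density = 0.0
    for other in neighbors:
        sq_dist = sum((a - b) ** 2 for a, b in zip(patch, other))
        density += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    density /= len(neighbors)
    # neighbors is assumed to contain the patch itself, so density > 0.
    return -math.log(density)

# A patch in a flat region: all nine neighbors look identical.
flat = [0.5, 0.5, 0.5, 0.5]
print(self_information(flat, [flat] * 9, bandwidth=1.0))

# An edge-like patch that differs strongly from its eight other neighbors.
edge = [0.0, 0.0, 1.0, 1.0]
other = [10.0, 10.0, 10.0, 10.0]
print(self_information(edge, [edge] + [other] * 8, bandwidth=1.0))
```

The repeated patch receives zero information in this sketch, while the patch that matches only itself receives roughly the logarithm of the neighborhood size, the largest value this estimator can produce.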
Then we discuss how the dropout process works. A direct way is to sample neurons in the output with probabilities given by Eq. 1, and set them to zero. During training, for the $c$-th channel of the $l$-th layer’s output $z^l_c$, we randomly choose neurons to drop by running weighted multinomial sampling with replacement $r_0 h^l w^l$ times (see footnote 2), where $r_0$ is a hyper-parameter controlling the number of dropped neurons. The algorithm is shown in Alg. 1.

Footnote 2: Here we choose sampling with replacement over sampling without replacement because the former runs faster in practice. Hence $r_0$ can be any positive real number due to collisions of samples, and the actual dropout rate (expected ratio of sampled neurons) will be lower than $r_0$.
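The sampling step and the collision effect can be sketched as follows, assuming the drop probabilities for one channel are already available as a list. The function `sample_drop_mask` and its parameter names are hypothetical, and rounding the number of draws to an integer is an assumption of this sketch.

```python
import random

def sample_drop_mask(probs, r0, rng):
    """Draw round(r0 * n) indices with replacement and return a keep mask."""
    n = len(probs)
    num_draws = int(round(r0 * n))
    # Repeated draws of the same index collapse into one dropped neuron.
    dropped = set(rng.choices(range(n), weights=probs, k=num_draws))
    return [0.0 if j in dropped else 1.0 for j in range(n)]

rng = random.Random(0)
n = 1000
mask = sample_drop_mask([1.0 / n] * n, r0=0.5, rng=rng)
drop_rate = mask.count(0.0) / n
print(drop_rate)  # at most r0, since collisions waste some draws
```

The mask would then be multiplied element-wise with the channel's output. With uniform probabilities, the expected fraction of dropped neurons is 1 - (1 - 1/n)^(r0 n), which is close to 1 - exp(-r0) for large n and therefore below r0.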
Note that when training with InfoDrop on, we are intentionally filtering out texture to make the model learn to recognize shape. However, during inference, we expect to see a genuinely shape-biased model which can filter out texture by itself without InfoDrop’s help. To check if our model has obtained this “internal” shape-bias, one way is to directly remove the InfoDrop blocks during inference. However, there may be statistical mismatch (e.g. in batch normalization) between clean images and images processed by InfoDrop. To this end, we take inspiration from Geirhos et al. (2019) and propose to finetune the network on clean images with InfoDrop removed, as an extra step after InfoDrop training. In this way, we can safely remove InfoDrop during testing, and examine whether our network has truly learned shape-bias.
There are two parts of computational cost in InfoDrop: (i) calculation of self-information for input patches, and (ii) manipulation of each output element. For self-information calculation, there are $h^{l-1} w^{l-1}$ input patches, each with size $c^{l-1} \times k \times k$. Note that the kernel size $k$ and the scale of the neighborhood $R$ are constants. This means a time complexity of $O(c^{l-1} h^{l-1} w^{l-1})$ for part (i). As for part (ii), both sampling and the element-wise product need $O(c^l h^l w^l)$. Note that the spatial shape often stays unchanged through convolution. Therefore, the time complexity of InfoDrop is $O((c^{l-1} + c^l) h^l w^l)$, which is little overhead compared with $O(c^{l-1} c^l h^l w^l)$ for the convolutional operation.
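As a rough worked example under assumed sizes (the channel counts are illustrative and not taken from the paper), consider a layer with $c^{l-1} = 64$ input channels and $c^l = 64$ output channels whose spatial shape is unchanged by the convolution. Ignoring the constants contributed by the kernel size and the neighborhood radius, the ratio of the two costs is:

```latex
\frac{(c^{l-1} + c^{l})\, h^{l} w^{l}}{c^{l-1} c^{l}\, h^{l} w^{l}}
= \frac{1}{c^{l}} + \frac{1}{c^{l-1}}
= \frac{1}{64} + \frac{1}{64}
= \frac{1}{32} \approx 3\%.
```

The ratio shrinks as layers get wider. In practice the hidden constants matter, since each patch is compared against every patch in its neighborhood.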
We conduct extensive experiments for further understanding properties of InfoDrop and its benefits over standard CNN-based models. First we discuss the shape-bias property of InfoDrop in Sec. 4.1. Then in Sec. 4.2 we evaluate robustness of InfoDrop through four different tasks, viz. domain generalization, few-shot classification, robustness against random corruption and adversarial robustness, and also compare with other shape-biased approaches. In Sec. 4.3, we conduct ablation studies for further analysis. The balance between shape and texture is discussed in Sec. 4.4. Please refer to Appendix for specific experimental settings.
We conduct several experiments, both qualitatively and quantitatively, to analyze the shape-bias property of InfoDrop. Due to limited space, we refer readers to Appendix for more visualization and detailed experimental settings.
A Frequency Perspective We first analyze the shape-bias property of self-information by visualizing how it responds to local regions with different spatial frequency. To obtain the average frequency of a local region, we apply the Discrete Cosine Transform (DCT) (Ahmed et al., 1974) to the local patch to get the power spectrum, which is then used to weight each frequency level when computing the average frequency. We repeat the process for each position and get the frequency map (Fig. 2(b)). We also calculate each position’s self-information (Fig. 2(c)). As one can observe, visual primitives such as edges and corners (green boxes) have medium frequency, but are highlighted most by self-information. High-frequency textures (red boxes), although highlighted in the frequency map, contain relatively low information because they repeat themselves at high frequency. Flat regions (yellow boxes) are filtered out by both the frequency map and the information map. This is consistent with our previous discussions.
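A power-weighted average frequency of a patch can be sketched as below. The details are assumptions made for illustration: an unnormalized 2-D DCT-II, the frequency level of coefficient (u, v) taken as u + v, and the helper names `dct2` and `average_frequency`. The exact definition used for Fig. 2 may differ.

```python
import math

def dct2(patch):
    """Unnormalized 2-D DCT-II of a square patch given as a list of rows."""
    n = len(patch)
    return [[sum(patch[x][y]
                 * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                 * math.cos(math.pi * (2 * y + 1) * v / (2 * n))
                 for x in range(n) for y in range(n))
             for v in range(n)]
            for u in range(n)]

def average_frequency(patch):
    """Average of the frequency levels u + v, weighted by spectral power."""
    n = len(patch)
    coeffs = dct2(patch)
    power = [[coeffs[u][v] ** 2 for v in range(n)] for u in range(n)]
    total = sum(map(sum, power))
    if total == 0.0:
        return 0.0  # an all-zero patch carries no frequency content
    weighted = sum(power[u][v] * (u + v) for u in range(n) for v in range(n))
    return weighted / total

flat = [[1.0] * 4 for _ in range(4)]
edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
checker = [[float((x + y) % 2) for y in range(4)] for x in range(4)]
for name, patch in (("flat", flat), ("edge", edge), ("checker", checker)):
    print(name, average_frequency(patch))
```

Under these assumptions the flat patch scores essentially zero, the step edge lands in between, and the checkerboard texture scores highest, matching the ordering described in the text.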
Saliency Map of CNN To verify the shape-bias InfoDrop brings to CNNs, we visualize gradients of the model output w.r.t. input pixels, which serve as a “saliency map” of the network. Specifically we use SmoothGrad (Smilkov et al., 2017) to calculate the saliency map $S(x)$,

$$S(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial f(x_i)}{\partial x_i},$$

where $x_i = x + \mathcal{N}(0, \sigma^2)$ is the original image with i.i.d. Gaussian noise, and $f$ is the network. An example is shown in Fig. 3. We can see that InfoDrop is more human-aligned and sensitive to the shapes of objects, while the saliency map of the regular CNN is noisier and less shape-biased, lacking interpretability.
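The averaging in SmoothGrad can be sketched independently of any deep learning framework, assuming a gradient function is supplied by the caller. The name `smoothgrad`, its parameters, and the toy quadratic model below are hypothetical; with a real network the gradient would come from automatic differentiation.

```python
import random

def smoothgrad(grad_fn, x, sigma, n_samples, rng):
    """Average grad_fn over copies of x perturbed by i.i.d. Gaussian noise."""
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        grad = grad_fn(noisy)
        total = [t + g for t, g in zip(total, grad)]
    return [t / n_samples for t in total]

# Toy model f(x) = sum of x_i^2, whose gradient is 2 * x.
def toy_grad(v):
    return [2.0 * t for t in v]

saliency = smoothgrad(toy_grad, [1.0, -2.0, 0.5], sigma=0.1,
                      n_samples=50, rng=random.Random(0))
print(saliency)
```

For this toy model the averaged gradient stays close to the noise-free gradient; for a real CNN the averaging mainly suppresses high-frequency noise in the raw gradient map.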
Patch Shuffling We also evaluate the shape-bias of InfoDrop through recognizing images whose shape information is ruined but texture is retained. Following Zhang and Zhu (2019), we achieve this goal by dividing images into $k \times k$ patches and randomly shuffling them. Through patch shuffling, global structure is ruined while local texture in each patch is left untouched. We train our model on clean images and test on the patch-shuffled test set. We set different values of $k$ and results are listed in Table 1. Note that $k = 1$ means no shuffling is used. As $k$ goes up, global structures are severely ruined, causing a rapid decline in InfoDrop’s performance. However, the regular CNN is barely influenced since most texture information is preserved. This also indicates that a CNN with InfoDrop is more biased towards shape information.
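Patch shuffling itself is simple to sketch. The version below assumes a square single-channel image stored as a list of rows whose side length is divisible by the number of patches per side; the function name `patch_shuffle` is hypothetical.

```python
import random

def patch_shuffle(image, k, rng):
    """Split a square image into k x k patches and randomly permute them."""
    size = len(image)
    if size % k != 0:
        raise ValueError("image side must be divisible by k")
    step = size // k
    # Top-left corners of all patches in row-major order.
    corners = [(r, c) for r in range(0, size, step)
               for c in range(0, size, step)]
    sources = corners[:]
    rng.shuffle(sources)
    out = [[None] * size for _ in range(size)]
    for (dr, dc), (sr, sc) in zip(corners, sources):
        for i in range(step):
            for j in range(step):
                # Pixels inside a patch keep their relative positions.
                out[dr + i][dc + j] = image[sr + i][sc + j]
    return out

image = [[r * 4 + c for c in range(4)] for r in range(4)]
for row in patch_shuffle(image, 2, random.Random(0)):
    print(row)
```

With one patch per side the image is returned unchanged, and for larger values each patch keeps its local content while the global layout is destroyed.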
To understand the features extracted by InfoDrop, we conduct ablations in the task of style transfer. Recently, Huang and Belongie (2017) proposed the AdaIN algorithm to render a content image with the style of another image (style image). Specifically, features of both content and style images are first extracted by an encoder, and then the mean and variance of the content feature are aligned with those of the style feature. The transferred image is then decoded from the aligned content feature. In our experiment, we apply InfoDrop in the encoder and observe changes in the rendered image. By doing so, we expect to see that only the edging style of the content image is rendered by that of the style image, and the texture style is preserved. This is verified by the results in Fig. 4. Taking the first row as an example, we can see that the baseline method mainly changes the tone of the whole image. In contrast, InfoDrop inherits the style of red edging and sketching, and applies it to the shape of the content image, indicating that InfoDrop is more shape-biased in both content and style images.
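The statistic alignment at the core of AdaIN can be sketched for a single feature channel, assuming the channel is flattened into a list of floats. The function name `adain` and the small stabilizing constant `eps` are assumptions of this sketch; a full implementation applies the same operation to every channel of the encoder output.

```python
import math

def adain(content, style, eps=1e-5):
    """Shift and scale one content channel to match the style channel's stats."""
    def stats(values):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        return mean, math.sqrt(var + eps)

    c_mean, c_std = stats(content)
    s_mean, s_std = stats(style)
    # Normalize the content channel, then re-scale with the style statistics.
    return [s_std * (v - c_mean) / c_std + s_mean for v in content]

content = [0.0, 1.0, 2.0, 3.0]
style = [10.0, 20.0, 30.0, 40.0]
print(adain(content, style))
```

The output keeps the spatial pattern of the content channel but takes on the mean and spread of the style channel.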
In this section, we first evaluate various kinds of robustness (against distribution shift, image corruption and adversarial perturbation) of InfoDrop through four different tasks (Sec. 4.2.1 to Sec. 4.2.4). Since InfoDrop can be applied to any CNN-based model, and extensive exploration of more complicated base models is beyond the main scope of our studies in this section, we only use simple architectures (e.g. ResNet (He et al., 2016)) and baseline algorithms, and observe incremental results when InfoDrop is applied. Then we compare InfoDrop with other approaches towards shape-bias (Sec. 4.2.5). Due to limited space, detailed experimental configuration and additional results are deferred to the Appendix.
Due to the natural data variance induced by time, location, weather, etc., the ability to generalize across different domains is important for visual models. To this end, the task of domain adaptation is proposed, where labeled data from the source domain and unlabeled data from the target domain are provided (Shimodaira, 2000). Prior work mainly focuses on diminishing the distribution shift in feature space between source and target domains (Gretton et al., 2007, 2009; Long et al., 2015). A more challenging task, namely domain generalization, is later proposed, where data from the target domain is unavailable during training. Previous solutions include learning invariant features (Muandet et al., 2013), or utilizing auxiliary tasks (Carlucci et al., 2019).
In our experiment, we use the naive algorithm as baseline: training a classification model on source domain, and testing on target domain. Following the literature (Carlucci et al., 2019), we use PACS (Li et al., 2017) as dataset, which consists of four domains, viz. photo, art, cartoon and sketch.
Results on single-source domain generalization are shown in Table 2. Here we report the relative improvement of InfoDrop over baseline. For the absolute accuracies, please refer to Appendix. Compared with baseline, InfoDrop boosts performances in multiple settings, especially with sketch as the source or target domain. This also reflects the shape-bias of InfoDrop, considering that sketches mainly consist of shape information. It is also worth noticing that our model can keep the performance on the source domain after InfoDrop is applied.
We also obtain results on multi-source domain generalization. Table 3 shows results on each domain after training on the other three domains. When trained with InfoDrop, the model is more robust to the distribution shift between different domains, and obtains consistent improvements over all target domains. Moreover, the vanilla baseline with InfoDrop is already better than or comparable with other state-of-the-art methods on each target domain.
Current CNNs rely on huge amounts of labeled data to learn powerful representations for downstream tasks. However, the learned representations may generalize poorly to unseen objects and scenes. This is in contrast to the human visual system, which is able to quickly grasp the features of an unseen object given only a few examples. To this end, the task of few-shot classification is proposed, where a model needs to recognize classes unseen during training with limited examples. The main challenge here is the huge class-wise distribution shift. Following the literature, we use ‘$N$-way $K$-shot classification’ to refer to the setting where test data come from $N$ novel classes, each with $K$ examples provided.
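The episodic setting can be made concrete with a small sampler. This is a sketch under assumptions: the data is a mapping from class label to a list of examples, and `sample_episode`, `n_query`, and the toy data are hypothetical names introduced for illustration rather than part of any benchmark code.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample an n_way-way k_shot-shot episode with n_query queries per class."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw without replacement so support and query never overlap.
        items = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query

# Toy novel-class pool: 7 classes with 20 example ids each.
data = {name: list(range(20)) for name in "abcdefg"}
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=3,
                                rng=random.Random(0))
print(len(support), len(query))
```

The model sees only the support examples for the novel classes and is scored on the query examples of the same episode.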
Following the setting in Chen et al. (2019), we evaluate InfoDrop on two popular datasets: CUB (Wah et al., 2011) and mini-ImageNet (Ravi and Larochelle, 2017). We also test our model in the cross-domain scenario (Chen et al., 2019), where mini-ImageNet is used for training and CUB for testing. We denote this setting by mini-ImageNet → CUB. For a full comparison, we test models trained both with and without data augmentation. For baseline algorithms, we follow Chen et al. (2019) and adopt three common approaches, viz. ProtoNet (Snell et al., 2017), MatchingNet (Vinyals et al., 2016) and RelationNet (Sung et al., 2018).
|MatchingNet||71.18 ± 0.70||57.81 ± 0.88|
|+ InfoDrop||72.32 ± 0.69||57.88 ± 0.91|
|ProtoNet||67.13 ± 0.74||51.62 ± 0.90|
|+ InfoDrop||70.94 ± 0.72||52.40 ± 0.90|
|RelationNet||69.85 ± 0.75||56.71 ± 1.01|
|+ InfoDrop||73.72 ± 0.71||59.21 ± 0.98|
First we use ProtoNet as the baseline and evaluate our method under different settings (Table 4). Under almost all the settings, InfoDrop brings a non-trivial improvement in performance. One may notice that improvements on mini-ImageNet are larger than on CUB, which is reasonable due to the larger distribution shift to overcome in mini-ImageNet (Chen et al., 2019). As another observation, the improvements on 5-shot classification are larger than on 1-shot. This implies that despite the robustness of shape features, they may not be as discriminative as texture features, hence requiring more examples for recognition. As a consequence, we may still need some texture to learn a discriminative and robust model (Sec. 4.4). Also, note that for the baseline method, data augmentation sometimes damages performance, possibly because augmentation leads to overfitting on the base classes. However, similar behavior is not observed with InfoDrop.
Then we check whether InfoDrop can bring a consistent improvement on different baselines. As shown in Table 5, InfoDrop improves robustness on all three baseline methods. Note that InfoDrop benefits RelationNet the most, possibly because its relation head learns a better similarity metric between complex shapes.
It is essential for visual models to give stable predictions under various kinds of corruptions (e.g. weather, blur, noise), especially in safety-critical situations. However, current CNNs are vulnerable to random corruptions and hardly generalize to different kinds of corruptions when trained on a specific one (Dodge and Karam, 2017). Recently, Geirhos et al. (2019) find that a consistently improved robustness against different corruptions can be achieved by training a shape-biased model. In Hendrycks and Dietterich (2019), benchmarks of model robustness are provided on 18 common types of corruption. In our experiments, we apply the same corruption functions to the Caltech-256 dataset (Griffin et al., 2007) to test the robustness of InfoDrop. For comparison, we also test the robustness of adversarially trained networks with and without InfoDrop. Adversarial training is known to improve robustness to noise and blur corruptions, while degrading performance on some others (e.g. fog, contrast) (Gilmer et al., 2019). Results are shown in Table 12. Due to limited space, we only show 12 types of corruptions here. Full comparisons can be found in the Appendix. Clearly, InfoDrop improves the baseline’s robustness against most corruptions (e.g. noise, weather, digital), although no noisy data is used for training. This also implies the potential of InfoDrop to generalize to other untested types of corruptions. Nonetheless, the performance may further degrade under blurring, which is reasonable because blurring brings more distortion of shapes while the others mainly corrupt texture information. It is also notable that InfoDrop can be combined with adversarial training to obtain even better robustness with little overhead.
Besides random corruptions, CNNs are also vulnerable to carefully designed imperceptible perturbations, namely adversarial perturbations (Szegedy et al., 2013). This leads to another crucial challenge for current CNN-based models. Most work on adversarial robustness is based on adversarial training (Madry et al., 2018). To evaluate the adversarial robustness of InfoDrop, we conduct ablations on both baseline and adversarially trained models. Following the literature, we use CIFAR-10 (Krizhevsky et al., 2009), a widely reported benchmark. For attacking, we use 20 runs of PGD (Madry et al., 2018) with constrained norm in both adversarial training and testing. As shown in Table 7, InfoDrop can improve the robustness of baseline models under low-norm attack, but it still fails when the perturbation is large. Moreover, InfoDrop can be combined with adversarial training and provide extra robustness. Under the norm , InfoDrop brings an improvement of 1% accuracy.
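For readers unfamiliar with PGD, the core loop can be sketched as below. This is an illustration under assumptions rather than the attack configuration used in the experiments: it assumes a max-norm constraint, takes a caller-supplied loss gradient, and omits the random start and pixel-range clipping; the name `pgd_linf` and the toy linear loss are hypothetical.

```python
def pgd_linf(x, grad_fn, eps, step, n_steps):
    """Projected gradient ascent on the loss inside a max-norm ball around x."""
    adv = list(x)
    for _ in range(n_steps):
        grad = grad_fn(adv)
        # Move each coordinate by a fixed step in the sign of its gradient.
        adv = [a + step * ((g > 0) - (g < 0)) for a, g in zip(adv, grad)]
        # Project back so every coordinate stays within eps of the original.
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
    return adv

# Toy linear loss whose gradient w.r.t. the input is constant.
weights = [1.0, -3.0]
x = [0.2, 0.8]
adv = pgd_linf(x, lambda v: weights, eps=0.03, step=0.01, n_steps=20)
print(adv)
```

For this linear loss the attack simply pushes each coordinate to the edge of the allowed ball in the direction of the gradient sign.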
Some approaches have also been proposed recently to train a shape-biased model. For example, Geirhos et al. (2019) propose to train the network on extra images with various texture styles in order to learn the shared shape features. Wang et al. (2019b) propose to use Gray-level Co-occurrence Matrix (Lam, 1996) as an indicator of texture, and decompose the feature from it. Other attempts include using different auxiliary tasks (Wang et al., 2019a; Carlucci et al., 2019).
|IN + SIN||72.80||40.04||58.70|
|IN + SIN||74.51||38.38||42.61|
Here we compare InfoDrop with the approach in Geirhos et al. (2019), which pretrains the network on ImageNet (IN) as well as Stylized-ImageNet (SIN). For comparison with other shape-biased methods, please refer to Appendix. Specifically, we evaluate the performances on single-source domain generalization. We compare InfoDrop (pretrained only on IN) with a ResNet50 pretrained on both IN and SIN. Results are shown in Table 8. We can see that both methods can bring an improvement in the model robustness. Particularly, pretraining on SIN can largely increase the accuracy on Sketch domain, which is probably because SIN already contains images with sketch style. Remarkably, InfoDrop can improve the robustness consistently without seeing any target domain examples beforehand.
In this section we mainly discuss how different configurations or hyperparameters impact the performance of InfoDrop. We start with the role of the temperature $T$ in Eq. 1. Intuitively, a lower temperature means more conservative filtering, i.e., only patches with the least information (e.g. constant-valued regions) are dropped, while most shape and texture are preserved. An infinite temperature, however, will wipe out differences between shape and texture and act in a purely random way as regular Dropout. What we intend is somewhere in between, where the method can distinguish shape from texture and filter out the latter. As verification, we conduct ablations on 5-way 1-shot classification on mini-ImageNet → CUB. As shown in Table 9, accuracy is highest at an intermediate temperature; a higher or lower $T$ degrades the performance. This means that to be more robust, the model needs to filter out textures while preserving shape information, which is consistent with our analysis.
Next we discuss which layers InfoDrop should be applied to. Technically, it can be integrated into any convolutional layer. But since InfoDrop extracts local self-information and locates important primitives, it is intuitively a local algorithm and should be applied to the lower layers of a CNN. In our experiments, we apply InfoDrop to the first few residual blocks of ResNet18 and vary their number, where one setting means InfoDrop is applied only to the first convolutional layer before all residual blocks. Results are shown in Table 10. For higher layers, the extracted features are more abstract and dropping them may degrade performance.
In previous sections we have demonstrated how shape-bias can benefit CNN’s robustness under different scenarios. This raises another question: how biased should our model be? For example, does a visual model still work well if it only perceives shape information? The answer may be “no”, considering that texture information plays a different but also important role in the human visual system (e.g. multi-modal perception (Sann and Streri, 2007)). It is also verified in experiments on deep models (Xiao et al., 2019) that shape itself does not suffice for high-quality visual recognition. Intuitively, there should exist an optimal “bias level” so that the model can be robust enough and meanwhile recognize objects with a proper precision, and this optimal level may vary from task to task.
To verify this, we conduct experiments on domain generalization. Specifically, we tune the temperature to train models with different levels of shape-bias. To quantify the shape-bias, we use the classification error on patch-shuffled images as an indicator, since a larger shape-bias generally leads to a higher classification error on patch-shuffled images. We use photo as the source domain and test on art, cartoon and sketch. As shown in Fig. 5, performance on every target domain first rises and then falls as the shape-bias keeps increasing. Moreover, different target domains prefer different optimal bias levels. This implies that current CNNs are overly texture-biased, and that we need to reach a “sweet spot” between shape and texture.
Footnote 3: Another question is what the proper relationship between shape and texture should be. Should they act as two separate cues in parallel, or hierarchically, where shape first provides a quick, coarse recognition and details are then observed through texture? We leave this for further exploration.
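Patch shuffling, used above as the shape-bias indicator, can be sketched as follows. This is an illustrative version operating on a 2D image stored as a list of rows; the function name `patch_shuffle` and the `grid` parameter are assumptions, and the grid size used in the experiments is not stated here.

```python
import random

def patch_shuffle(image, grid, rng=None):
    """Split a 2D image (list of rows) into grid x grid patches and permute them."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image size must be divisible by grid")
    ph, pw = height // grid, width // grid
    # Cut the image into patches in row-major order.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid) for c in range(grid)
    ]
    rng.shuffle(patches)
    # Reassemble the permuted patches into a full image.
    out = []
    for r in range(grid):
        for y in range(ph):
            row = []
            for c in range(grid):
                row.extend(patches[r * grid + c][y])
            out.append(row)
    return out
```

Shuffling destroys global shape while leaving local texture statistics inside each patch intact, so a model that still classifies shuffled images well is relying mostly on texture.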
In this work, we aim at universally improving various kinds of robustness of CNN by alleviating its texture-bias. To reduce texture-bias, we get our inspiration from the human visual system and propose Informative Dropout, an effective model-agnostic algorithm. We detect texture and shape by the local self-information in an image, and use a Dropout-like algorithm to decorrelate the model output from the local texture. Through extensive experiments we observe improved shape-bias as well as various kinds of robustness. Furthermore, we find our method can be incorporated with other algorithms (e.g. adversarial training) and achieve higher robustness. Through this work, we shed some light on the relationship between CNN’s shape-bias and robustness, as well as new approaches to trustworthy machine learning algorithms.
Prof. Yadong Mu is partly supported by National Key R&D Program of China (2018AAA0100702) and Beijing Natural Science Foundation (Z190001).
Dr. Zhanxing Zhu is supported by National Natural Science Foundation of China (No.61806009 and 61932001), PKU-Baidu Funding 2019BD005 and Beijing Academy of Artificial Intelligence (BAAI).
Dinghuai Zhang is supported by the Elite Undergraduate Training Program of Applied Math of the School of Mathematical Sciences at Peking University. The authors are thankful to Tianyuan Zhang, Dejia Xu, Yiwen Guo and the anonymous reviewers for the insightful discussions and useful suggestions.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2229–2238.
A study and comparison of human and deep learning recognition performance under visual distortions. In 2017 26th International Conference on Computer Communication and Networks (ICCCN), pp. 1–7.
Several shape-biased methods have recently been proposed to learn robust representations under different domains (Carlucci et al., 2019; Wang et al., 2019a, b). For a full comparison with these state-of-the-art methods, we also test performance of InfoDrop on domain generalization with AlexNet (Krizhevsky et al., 2012) as backbone. We follow the setting in Wang et al. (2019a). Results are shown in Table 11. On all four domains, InfoDrop with vanilla AlexNet as baseline is already better than or comparable to other state-of-the-art methods. Note that among these methods, JiGen, HEX and PAR are methods which explicitly train a shape-biased model.
Complete results of robustness against image corruption are shown in Table 12. As baseline, we use both vanilla CNN and adversarially trained CNN. Then we apply InfoDrop and report the improved results. As shown in the table, on both baselines InfoDrop can improve robustness against most corruptions non-trivially.
Additional visualization results of saliency map of InfoDrop are plotted in Fig. 6. For comparison, saliency map of vanilla CNN is also displayed. Obviously, saliency of InfoDrop is more biased towards global structure, thus more human-aligned and interpretable.
We further visualize self-information on Stylized-ImageNet (Geirhos et al., 2019). For comparison, we also show the results of edge detection. As shown in Fig. 7 (Top), the original image is stylized with different artworks. As a result (Middle), edge detection is strongly influenced by the texture of each style, sometimes even severely corrupting the image content. In contrast, the distribution of self-information stays mostly the same across stylized images, accentuating global structure while suppressing local texture.
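A self-information map of this kind could be estimated roughly as in the sketch below. It assumes that the self-information of a patch is the negative log of a Gaussian kernel density estimate computed over neighbouring patches within a given radius, which is what the hyperparameter names bandwidth and radius suggest. The exact estimator, the patch size and all default values are assumptions for illustration only.

```python
import math

def self_information(image, radius=2, bandwidth=1.0, patch=1):
    """Estimate -log q(p) per pixel from a kernel density over nearby patches.

    `image` is a 2D grayscale image given as a list of rows. Pixels closer
    than radius + patch to the border are left at 0.0.
    """
    h, w = len(image), len(image[0])
    margin = radius + patch

    def get_patch(y, x):
        # Flattened (2 * patch + 1)^2 neighbourhood centred at (y, x).
        return [image[y + dy][x + dx]
                for dy in range(-patch, patch + 1)
                for dx in range(-patch, patch + 1)]

    info = [[0.0] * w for _ in range(h)]
    for y in range(margin, h - margin):
        for x in range(margin, w - margin):
            center = get_patch(y, x)
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue
                    other = get_patch(y + dy, x + dx)
                    dist2 = sum((a - b) ** 2 for a, b in zip(center, other))
                    # Gaussian kernel: similar neighbours raise the density.
                    total += math.exp(-dist2 / (2 * bandwidth ** 2))
                    count += 1
            # A small constant avoids log(0) for isolated patches.
            info[y][x] = -math.log(total / count + 1e-12)
    return info
```

Patches that repeat in their neighbourhood (flat regions, regular texture) receive a high density and therefore low self-information, whereas patches on a boundary differ from most of their neighbours and score higher.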
Of all hyper-parameters, we find the drop coefficient and the temperature the most important for model performance. For all tasks, we search over a range of values for each of them, while the bandwidth and radius are kept fixed throughout the experiments. For ResNet18 (He et al., 2016), we apply InfoDrop in both the first convolutional layer and the first residual block, or only in the first layer under some settings. All hyper-parameters are selected according to results on the validation set. We use PyTorch (Paszke et al., 2019) for implementation and train all models on a single NVIDIA Tesla P100 GPU.
We use PACS (Li et al., 2017) as our dataset for domain generalization. PACS consists of four domains (photo, art painting, cartoon and sketch), each containing 7 categories (dog, elephant, giraffe, guitar, horse, house and person). The dataset is created by intersecting classes in Caltech-256 (Griffin et al., 2007), Sketchy (Sangkloy et al., 2016), TU-Berlin (Eitz et al., 2012) and Google Images. Dataset can be downloaded from http://sketchx.eecs.qmul.ac.uk/. Following protocol in Li et al. (2017), we split the images from training domains to 9 (train) : 1 (val) and test on the whole target domain. We use a simple data augmentation protocol by randomly cropping the images to 80-100% of original sizes and randomly apply horizontal flipping.
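The split and augmentation protocol above can be sketched as follows. Reading "80-100% of original sizes" as a fraction of the image area with the aspect ratio kept fixed is an assumption, and the function names `split_train_val` and `sample_augmentation` are hypothetical.

```python
import math
import random

def split_train_val(items, val_fraction=0.1, seed=0):
    """Shuffle once and hold out a fraction of the items for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def sample_augmentation(width, height, rng, min_scale=0.8, max_scale=1.0):
    """Sample a crop box covering 80-100% of the image area and a flip flag."""
    # Taking the square root turns an area fraction into a side-length factor.
    side = math.sqrt(rng.uniform(min_scale, max_scale))
    crop_w = max(1, int(width * side))
    crop_h = max(1, int(height * side))
    left = rng.randint(0, width - crop_w)
    top = rng.randint(0, height - crop_h)
    flip = rng.random() < 0.5  # horizontal flip with probability 0.5
    return left, top, crop_w, crop_h, flip
```

The returned box and flag would then be applied by whatever image library is in use; the sketch deliberately stops short of any library-specific call.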
We use ResNet18 (He et al., 2016) as our backbone. Models are trained with the SGD solver for 100 epochs with batch size 128. The learning rate is set to 0.001 and reduced to 0.0001 after 80 epochs. Bandwidth and radius are fixed at and , respectively. For photo as source domain, we set and . For art or cartoon as source domain, we set and . For sketch as source domain, we set and .
We mainly use mini-ImageNet (Ravi and Larochelle, 2017) and CUB (Wah et al., 2011) as datasets for few-shot classification. Download links for both datasets can be found in the repository https://github.com/wyharveychen/CloserLookFewShot.
mini-ImageNet contains a subset of 100 classes from the full ImageNet dataset (Deng et al., 2009), with 600 images per class. Following the setting of previous work (Ravi and Larochelle, 2017), we randomly divide the 100 classes into 64 training classes, 16 validation classes and 20 novel classes.
CUB (abbreviation for CUB-200-2011) dataset contains 200 classes with 11788 images. We divide it into 100 base classes, 50 validation classes and 50 novel classes following Hilliard et al. (2018).
We also test our models in the cross-domain scenario, namely mini-ImageNet → CUB, where the base classes come from mini-ImageNet and the 50 validation and 50 novel classes come from CUB.
Following Chen et al. (2019), we apply data augmentation including random crop, horizontal flip and color jitter.
We use a 4-layer convolutional neural network (Conv-4) as our backbone, following Snell et al. (2017). All methods are trained from scratch and use the Adam optimizer with initial learning rate . In the meta-training stage, we train for 60000 episodes for 5-way 5-shot classification and for 80000 episodes for 5-way 1-shot classification, both without data augmentation. When data augmentation is applied, we add an extra 20000 episodes to the meta-training stage. In each episode, we sample 5 classes to form a 5-way classification task. For each class, we pick k labeled instances for the support set and 16 instances for the query set of a k-shot task. Drop coefficient , temperature , bandwidth and radius are fixed at , , and , respectively. InfoDrop is applied in the first two convolutional layers of the Conv-4 network, which we use as the backbone throughout all experiments.
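The episode construction described above (5 classes, k support instances and 16 query instances per class) can be sketched as below. The mapping `class_to_items` and the function name `sample_episode` are hypothetical; the sketch only illustrates the sampling logic, not the training loop.

```python
import random

def sample_episode(class_to_items, n_way=5, k_shot=1, n_query=16, rng=None):
    """Sample one N-way k-shot episode as (support, query) lists of (item, label)."""
    rng = rng or random.Random()
    # Sort the class names so that a seeded generator gives reproducible episodes.
    classes = rng.sample(sorted(class_to_items), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw support and query together so that they never overlap.
        picked = rng.sample(class_to_items[cls], k_shot + n_query)
        support += [(item, label) for item in picked[:k_shot]]
        query += [(item, label) for item in picked[k_shot:]]
    return support, query
```

Labels are re-indexed from 0 to 4 inside each episode, which is the usual convention when the classifier is rebuilt per episode.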
In the fine-tuning or meta-testing stage for all methods, we average the results over 600 experiments. In each experiment, we randomly sample 5 classes from novel classes, and in each class, we also pick k instances for the support set and 16 for the query set. For other settings, we follow the protocol in Chen et al. (2019).
For clean images, we use Caltech-256 (Griffin et al., 2007) as the dataset. It consists of 257 object categories containing a total of 30,607 high-resolution images. The dataset can be downloaded from http://www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html. We manually hold out 20% of the images as the test set. Rescaling and random cropping are used as data augmentation, following the protocol in He et al. (2016).
For generation of corrupted images, we use the library provided in Hendrycks and Dietterich (2019). Original code for corruption generation can be found in https://github.com/hendrycks/robustness/tree/master/ImageNet-C/imagenet_c. The repository contains 18 types of corruptions: ‘gaussian noise’, ‘shot noise’, ‘impulse noise’, ‘defocus blur’, ‘motion blur’, ‘zoom blur’, ‘snow’, ‘frost’, ‘fog’, ‘brightness’, ‘contrast’, ‘elastic transform’, ‘pixelate’, ‘jpeg compression’, ‘speckle noise’, ‘gaussian blur’, ‘spatter’, ‘saturate’. The repository provides 5 different levels of corruption severity. In our experiments, we use the highest level, i.e., level-5 severity.
We train all models for 10 epochs. We use SGD with learning rate 0.01 for 5 epochs, 0.001 for 3 epochs and 0.0001 for 2 epochs. Through all experiments, we only apply InfoDrop to the first convolutional layer before all residual blocks of ResNet18. Bandwidth and radius are fixed at and , respectively. For InfoDrop applied on baseline model, we set and . For InfoDrop applied together with adversarial training, we set and . For adversarial training, we use 20 runs of PGD attack (Madry et al., 2018) with norm of . Here we use a relatively small norm to simulate the situation where severity of corruption may exceed the norm of adversarial training. Note that we mainly evaluate InfoDrop’s incremental effect on baseline and adversarial methods, while not directly comparing InfoDrop with adversarial training.
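For readers unfamiliar with PGD, the attack loop can be sketched on a toy model. The example below attacks a logistic-regression classifier rather than ResNet18, reads "20 runs" as 20 iterations, assumes an L-infinity constraint with inputs in [0, 1], and omits the random start often used with PGD. The values of `epsilon` and `step_size` are left to the caller, since the values used in the experiments are not given here.

```python
import math

def pgd_linf(x, y, weights, bias, epsilon, step_size, n_steps=20):
    """L-infinity PGD against a logistic-regression classifier (y in {0, 1})."""
    adv = list(x)
    for _ in range(n_steps):
        logit = sum(w * a for w, a in zip(weights, adv)) + bias
        prob = 1.0 / (1.0 + math.exp(-logit))
        # Gradient of the cross-entropy loss w.r.t. the input is (prob - y) * w.
        grad = [(prob - y) * w for w in weights]
        for i, g in enumerate(grad):
            step = step_size * ((g > 0) - (g < 0))  # step_size * sign(g)
            # Ascend the loss, project into the epsilon-ball, then clip to [0, 1].
            moved = min(max(adv[i] + step, x[i] - epsilon), x[i] + epsilon)
            adv[i] = min(max(moved, 0.0), 1.0)
    return adv
```

Adversarial training then replaces each clean input with the output of such an attack before the usual gradient step on the model parameters.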
For evaluation of adversarial robustness, we use two datasets separately, viz. Caltech-256 and CIFAR10. For Caltech-256, as in B.3.1, we manually split 20% of images as the test set and use rescaling and random cropping for data augmentation. For CIFAR10, we adopt the protocol in Zhang et al. (2019).
For experiments on Caltech-256, we train all models for 10 epochs. We use SGD with learning rate 0.01 for 5 epochs, 0.001 for 3 epochs and 0.0001 for 2 epochs. We apply InfoDrop to the first convolutional layer and first residual block of ResNet18. Bandwidth and radius are fixed at and , respectively. For InfoDrop applied on baseline model, we set and . For InfoDrop applied together with adversarial training, we set and . For adversarial training, we use 20 runs of PGD attack (Madry et al., 2018).
For experiments on CIFAR10, we follow the protocol in Zhang et al. (2019). We train models for 105 epochs, as is common practice. The learning rate is set to initially, and is divided by 10 at epochs 79, 90 and 100. We use a batch size of 256, a weight decay of and a momentum of 0.9 for both algorithms. For adversarial attacks, we use 20 runs of PGD with norm of and step size of . We apply InfoDrop only on the first convolutional layer of ResNet18. We set , , , .
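The step-wise learning-rate schedules used in these experiments can be written as one small helper. The function name `step_lr` is hypothetical, whether epochs are counted from 0 or 1 is an assumption (0 here), and the base learning rate in the CIFAR10 example is a placeholder because the actual value is not given in this section.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate at a 0-indexed epoch, multiplied by gamma at each milestone."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

# PACS schedule from the domain generalization setup: 0.001, then 0.0001 after 80 epochs.
print(step_lr(0, 0.001, [80]), step_lr(80, 0.001, [80]))

# CIFAR10 milestones from the text; base_lr=1.0 is a placeholder, not the paper's value.
print([step_lr(e, 1.0, [79, 90, 100]) for e in (0, 79, 90, 100)])
```

In PyTorch, the same behavior is usually obtained with a multi-step scheduler rather than a hand-written function.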
In the plotting of CNN’s saliency map and experiments of patch shuffling, we use photo-domain in PACS as our dataset and adopt the same settings as in domain generalization. In style transfer, we use pretrained ResNet18 and finetune on content and style images from the repository https://github.com/xunhuang1995/AdaIN-style.