Robustness to occlusions is an important property of image recognition systems: a robust image classifier should be able to solve the problem even if only a portion of the object of interest is visible in the image. However, the image classification datasets commonly used to train high-performance models such as deep neural networks are strongly affected by the so-called “photographer bias”. Among other things, this bias means that the main subject of these pictures tends to be centred and clearly visible. As a consequence, learning a model on such data may result in “lazy” networks that focus too much on easily recognizable details (such as the face of a cat) and cannot understand other, more subtle cues (such as the cat’s body) that may be important in harder scenarios.
A few authors have proposed to address this issue by augmenting the training data via simulated occlusions. While details change depending on the specific method, the general idea is that, if part of the image is not visible at training time, then the network should be stimulated to learn to recognize all available evidence, thus avoiding over-reliance on the most obvious evidence. However, the success of these techniques has been mixed. [21, 16] showed improvements in the ability of the network to localize objects but not in the original task of object recognition. Others  demonstrated better classification performance on simpler datasets such as CIFAR10  but not on larger, more complex ones such as ImageNet  (as confirmed in our experiments and in ).
A hypothesis for this behaviour is that training using occlusion augmentation improves the robustness of the model to occlusions, but that this does not correspond to a test-time performance improvement because the test set does not, in fact, contain occlusions.
In this paper, we show that this is not the case. The issue can be solved, and a performance improvement observed consistently, provided that the augmentation is incorporated properly into the training procedure. We make three main contributions: (1) We demonstrate that augmenting image batches with several versions of the same image, in the spirit of batch augmentation , allows occlusion augmentation to consistently outperform the baselines for CNN architectures that are sufficiently powerful (e.g., ResNet50). (2) We present a detailed analysis of why occlusion augmentation has not yielded improvements in the past. (3) We conduct a thorough investigation of how to optimally tune occlusion augmentation, showing differences as a function of the model architecture. For example, we demonstrate that more powerful models (e.g., ResNet50 ) can handle, and benefit from, significantly more substantial occlusions during training than weaker ones (e.g., AlexNet ).
2 Related work
Occlusions have been successfully used for model interpretability and weak localization. A few attribution methods have used fixed , stochastic , and optimized  occlusions to diagnose “where” a network is “looking” in the input for evidence for its prediction. A few works have demonstrated that applying random  or optimized  occlusions to the input or intermediate activations  can improve weakly supervised localization (but not necessarily image classification) by forcing a classification network to be robust to occlusions and thus rely on other parts of an object besides its most discriminative parts.
Cutout  and Hide-and-Seek  both introduce stochastic input-level occlusions: Cutout “drops” (i.e., zeros out) randomly positioned squares (Figure 2), while Hide-and-Seek divides an image into a square grid and “drops” grid patches independently (Figure 1). Hide-and-Seek  highlights its improvements in weak localization at the expense of classification performance on ImageNet . Although Cutout  improves performance on CIFAR10 and CIFAR100 ,  reported that it did not improve classification performance on ImageNet.
Other regularization methods related to occlusions are techniques inspired by Dropout , which “drop” parts of intermediate activation tensors, such as DropPath, Scheduled DropPath , Spatial Dropout , and DropBlock . Whereas Dropout  drops a single voxel from the 3D activation tensor of a given input, DropPath  drops a whole branch of a network, and Spatial Dropout  drops a whole slice of the 3D activation tensor associated with a filter. DropBlock  can be viewed as an extension of Cutout  applied to intermediate activations: contiguous blocks in each activation slice associated with a filter are dropped. These techniques, particularly DropBlock , yield modest but consistent improvements in ImageNet  classification performance; however, they all require architectural changes and, in the case of Scheduled DropPath  and DropBlock , a training schedule specific to their modules. Label smoothing  is another related regularization technique, in which noise is added to the training labels.
Recently, batch augmentation  was introduced as a way to strengthen existing data augmentation techniques by including multiple copies of the same image (i.e., copying an original batch  times) and applying data augmentation to each of the copies. When coupled with Cutout , batch augmentation significantly improved performance on small datasets like CIFAR10 and CIFAR100 .
Similar to , which uses CAM (class activation maps) , we explore using the heatmaps produced by attribution methods to occlude images during training. We focus on the gradient-based saliency method introduced in , which is closely related to Grad-CAM  and the linear approximation  at a specific layer.  shows that their method, when used to aggressively occlude images during training, outperforms other baselines, including Grad-CAM . While  focused on dataset compression and willingly sacrificed task performance, we are interested in using occlusions to improve task performance.
We introduce a simple paradigm for using occlusions effectively as data augmentation. For every image , we generate a pattern of occlusion using one of the methods described below. Then, for a given batch of images , we copy the batch. We apply the set of occlusions, , to one copy of the batch, leaving the other copy unoccluded, and train jointly with one combined batch: . We occlude a pixel by replacing it with the mean colour (i.e., setting it to zero after mean normalization). Our joint training is inspired by batch augmentation .
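As a concrete illustration, the batch-duplication step can be sketched as follows (a minimal numpy sketch; the function and array names are ours, not the paper's, and a real pipeline would operate on framework tensors):

```python
import numpy as np

def occlude(img, mask, fill=0.0):
    """Replace masked pixels with `fill` (zero = the mean colour after
    mean normalization). img is (C, H, W); mask is a boolean (H, W) array."""
    out = img.copy()
    out[:, mask] = fill
    return out

def make_joint_batch(batch, masks):
    """Duplicate a batch and occlude one copy, leaving the other untouched.

    batch: (N, C, H, W) array; masks: sequence of N boolean (H, W) arrays.
    Returns a (2N, C, H, W) joint batch: originals followed by occluded copies.
    """
    occluded = np.stack([occlude(img, m) for img, m in zip(batch, masks)])
    return np.concatenate([batch, occluded], axis=0)
```

The loss is then computed over the combined batch as usual, so each image contributes one occluded and one unoccluded view to the same gradient step.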
We first consider two existing ways of generating occlusions stochastically: Hide-and-Seek  and Cutout . Hide-and-Seek (H&S) divides an image into a grid and drops patches in the grid independently with probability , where  denotes the probability of preserving the original patch (Figure 1). Cutout (CO) drops  square patches of side length  (the original Cutout paper  considers only  patch); the centers of these patches are placed uniformly at random on the whole image, thereby allowing some patches to “overflow” off the image, as done in  (Figure 2).
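The two stochastic mask generators can be sketched as below (our own numpy illustration; the default `grid`, `p_keep`, and `size` values are placeholders, not the paper's settings):

```python
import numpy as np

def hide_and_seek_mask(h, w, grid=7, p_keep=0.5, rng=None):
    """Boolean (h, w) occlusion mask (True = drop). Each cell of a grid x grid
    partition is dropped independently with probability 1 - p_keep.
    Assumes h and w are divisible by grid."""
    rng = rng or np.random.default_rng()
    drop = rng.random((grid, grid)) >= p_keep
    # Upsample the per-cell decision to pixel resolution.
    return np.kron(drop, np.ones((h // grid, w // grid), dtype=bool)).astype(bool)

def cutout_mask(h, w, size=56, n_patches=1, rng=None):
    """Boolean (h, w) occlusion mask (True = drop). Square patch centers are
    sampled uniformly over the image, so patches may overflow the boundary
    (the overflow is simply clipped) and may overlap one another."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), dtype=bool)
    for _ in range(n_patches):
        cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
        mask[max(cy - size // 2, 0):cy + size // 2,
             max(cx - size // 2, 0):cx + size // 2] = True
    return mask
```

Note that the Hide-and-Seek mask makes the expected occluded fraction exactly 1 - p_keep, whereas clipping and overlap make the Cutout fraction harder to pin down, which is the point made in the comparison below.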
To analyse whether jointly training with an image occluded and unoccluded in the same batch is necessary, we introduce another hyperparameter, . When using Hide-and-Seek occlusions without joint training (i.e., every image is in the batch exactly once), we show the full image with probability . Otherwise, we show an image occluded with Hide-and-Seek-style dropout, in which a patch is preserved with probability . When , every image is potentially occluded; when , all images are unoccluded (i.e., standard training).
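The non-joint variant amounts to a single per-image coin flip, sketched here (our own illustration; `p_full` stands for the unoccluded-image probability introduced above):

```python
import numpy as np

def maybe_occlude(img, p_full, p_keep, grid=7, rng=None):
    """Non-joint variant: each image appears once. With probability p_full the
    image is shown unoccluded; otherwise Hide-and-Seek-style dropout is applied,
    preserving each grid cell with probability p_keep."""
    rng = rng or np.random.default_rng()
    if rng.random() < p_full:
        return img
    c, h, w = img.shape
    drop = rng.random((grid, grid)) >= p_keep
    mask = np.kron(drop, np.ones((h // grid, w // grid), dtype=bool)).astype(bool)
    out = img.copy()
    out[:, mask] = 0.0
    return out
```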
When comparing these two types of stochastic methods, Hide-and-Seek allows us to more easily and precisely define the amount of occlusion being applied on average. This is because Cutout occlusions are allowed to flow over image boundaries and can overlap with one another in the case of patches being cut out. Pairing Hide-and-Seek with standard data augmentation (i.e., random cropping and resizing) simulates dynamic occlusions while its disjoint grid makes it easy to reason about the occlusions being applied. Nevertheless, Cutout is more comparable to the next type of occlusions we consider: saliency-based occlusions.
We also consider generating occlusions based on saliency. Given a saliency heatmap, we extract the occlusion patch that is most salient among all candidate patches. In this way, we use saliency heatmaps to guide occlusion locations rather than sampling the locations randomly. This allows us to compare fairly against Cutout , as we consider occlusions of the same size.
In our experiments, we use ’s gradient-based saliency method, which we summarize here (Figure 3; see  for more details). For a given layer $l$, a saliency heatmap is generated by computing the Frobenius norm of the outer product of layer $l$’s activation and gradient vectors, $a_{uv}$ and $g_{uv}$, at every spatial location $(u, v)$:

$$ S_{uv} = \left\lVert a_{uv}\, g_{uv}^{\top} \right\rVert_F = \lVert a_{uv} \rVert \, \lVert g_{uv} \rVert . $$
Intuitively, ’s saliency method precisely characterizes the contribution of every spatial location to the gradient of a hypothetical, subsequent convolution weight tensor initialized with the identity. We chose to use  because it generates high-quality, dense saliency maps at any network depth. In contrast, Grad-CAM  only works at the last convolutional layer.
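To make the per-location computation concrete, here is a small numpy sketch (our own illustration, not the authors’ code), using the fact that the Frobenius norm of a rank-1 outer product factorizes into the product of the two vector norms:

```python
import numpy as np

def saliency_map(activation, gradient):
    """Per-location saliency for (C, H, W) activation and gradient tensors:
    the Frobenius norm of the outer product of the activation and gradient
    vectors at each spatial location, which equals a product of norms."""
    a = np.linalg.norm(activation, axis=0)  # (H, W) activation norms
    g = np.linalg.norm(gradient, axis=0)    # (H, W) gradient norms
    return a * g
```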
For every image, we compute a saliency map with respect to the ground-truth label and upsample it to the original image resolution .
We then find the most salient square patch with side length  of the upsampled saliency map (in practice, we do this by convolving the saliency map with a  convolutional filter filled with 1s and stride ). Finally, we add a small amount of jitter to the extracted patches. Unlike Cutout, we do not allow our patch to overflow the image boundaries (i.e., it will always be fully contained in the image).
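The patch-extraction step can be sketched as follows (our own numpy illustration; we use an integral image in place of an explicit convolution with a filter of ones, which yields the same window sums):

```python
import numpy as np

def most_salient_patch(saliency, size, jitter=0, rng=None):
    """Return the top-left corner (y, x) of the square patch of side `size`
    with the largest total saliency. Window sums are computed with an integral
    image, equivalent to convolving with a size x size filter of ones.
    After optional jitter, the corner is clipped so the patch never overflows."""
    h, w = saliency.shape
    ii = np.pad(saliency, ((1, 0), (1, 0))).cumsum(0).cumsum(1)  # integral image
    sums = (ii[size:, size:] - ii[:-size, size:]
            - ii[size:, :-size] + ii[:-size, :-size])
    y, x = np.unravel_index(np.argmax(sums), sums.shape)
    if jitter:
        rng = rng or np.random.default_rng()
        y += int(rng.integers(-jitter, jitter + 1))
        x += int(rng.integers(-jitter, jitter + 1))
    return int(min(max(y, 0), h - size)), int(min(max(x, 0), w - size))
```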
4.1 Implementation details
All models were trained for 100 epochs with the learning rate decayed by  every 30 epochs (i.e., at 30, 60, and 90 epochs). The initial learning rate for ResNet50  and VGG16-BN was ; for AlexNet  and VGG16  it was  (chosen for the non-batch-normalization models by grid search over the learning rates , , , , ). All models used an original batch size of 256; jointly trained models used an actual batch size of 512, in which the original batch is duplicated and one copy is occluded. The actual batch was split across 8 GPUs. The original batch was preprocessed using standard data augmentation (the default PyTorch ImageNet preprocessing: https://github.com/pytorch/examples/tree/master/imagenet): random cropping to , horizontal flipping, and data normalization to .
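A step-decay schedule of this shape can be written as below (a hypothetical sketch: the paper's decay factor is elided above, so we expose it as a parameter; 0.1 is a common choice):

```python
def step_lr(base_lr, epoch, step=30, gamma=0.1):
    """Step decay: the learning rate is multiplied by `gamma` once every
    `step` epochs (here at epochs 30, 60, and 90 of a 100-epoch run)."""
    return base_lr * gamma ** (epoch // step)
```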
When jointly training, the standard data augmentation (i.e., random cropping, etc.) occurs before the batch is duplicated, so the images are identical except for the regions that are occluded, i.e., . This differs from batch augmentation, in which images are preprocessed independently rather than identically.
For non-joint training baselines, we trained networks in the usual fashion without occlusions. We introduced another set of baselines to account for the possible effect of doubling training time via joint training. The joint training baseline refers to networks that have been trained without occlusions but with duplicated batches, that is, every image appears exactly twice in the batch.
4.2 Stochastic occlusions
We first trained networks using Hide-and-Seek  occlusions. For these experiments, we divided images into  grids () and preserved patches with . With a cropped image size of  and a grid size of , the size of the Hide-and-Seek patches was  ().
Based on our Hide-and-Seek results, we then trained selected networks (ResNet50 and AlexNet) using Cutout-style occlusions. Here, we occluded an image with  square patches of side length .
For both kinds of occlusions, we trained networks jointly, that is, every batch was doubled and one copy was preserved as is (i.e., with full images) while occlusions were applied to the other copy. At evaluation time, no occlusions are applied.
| ||ResNet50 top-1||ResNet50 top-5||VGG16-BN top-1||VGG16-BN top-5||VGG16 top-1||VGG16 top-5||AlexNet top-1||AlexNet top-5|
|baseline (w/o joint)||76.40||93.10||74.11||91.81||71.75||90.45||56.39||79.19|
|baseline (w/ joint)||76.40||93.03||74.95||92.31||72.37||90.91||57.34||79.72|
Table 1 reports ImageNet top-1 and top-5 accuracy for various networks when trained jointly with Hide-and-Seek occlusions, while Table 2 and Table 3 report results for ResNet50 and AlexNet respectively when trained jointly with Cutout occlusions.
Table 1 shows that ResNet50 improves significantly ( in top-1 and  in top-5 for the optimal ) when jointly trained with H&S occlusions. Furthermore, ResNet50 consistently beats the baseline (10 of 10 results improve) regardless of the  hyperparameter. However, for all other networks, the best improvements are negligible:  in top-1 and  in top-5 for VGG16-BN, VGG16, and AlexNet respectively. Consistent with the results reported in , the difference between the joint and non-joint baselines in Table 1 appears roughly correlated with network performance: ResNet50 shows no difference, while the others demonstrate significant improvement with joint training (top-1 baselines improve by  for VGG16-BN, VGG16, and AlexNet respectively).
Thus, we focus our attention on ResNet50 for the Cutout experiments. Table 2 shows that Cutout with joint training on ResNet50 nearly always improves on the baseline (23 of 25 results improve), regardless of the size and number of patches occluded ( and ). The best result improves by  for top-1 and  for top-5 over baselines, with the top-1 improvement being substantially higher with the best Cutout hyper-parameters () than with the best Hide-and-Seek ones ().
In contrast, Table 3 shows that Cutout with joint training on AlexNet rarely improves on the joint baseline (only 1 of 25 results improves; we include this table for comparison with saliency-based occlusions in Section 4.4).
Taken together, these results suggest that, for complex datasets like ImageNet, a suitably powerful architecture like ResNet50 is likely necessary to benefit from occlusion augmentation.
Occlusions as a stethoscope for model capacity.
The results for both kinds of stochastic occlusions (Table 1 and Table 2) peak in performance at the best hyper-parameters and then roughly monotonically decrease from that point. Thus, training with occlusions is beneficial from a model-understanding perspective, as it provides a way to identify and quantify an architecture’s upper bound for handling occlusions at evaluation time. For Hide-and-Seek (Table 1), we see different optimal values of :  for ResNet50 and VGG16-BN,  for VGG16, and  for AlexNet. This suggests that AlexNet can only handle a small amount of occlusion (images occluded up to 10% on average), while VGG16-BN and ResNet50 are capable of handling images that have been occluded up to 50% on average, when trained properly with occlusions (ResNet50 and VGG16-BN may be able to handle more than 50%, but this was not tested).
Figure 4 compares a ResNet50 non-joint baseline against a ResNet50 trained jointly with the best Hide-and-Seek configuration, using ’s saliency method on layer3.0.conv1. Here, we visualize saliency maps for a few examples in which the occlusion-augmented network was correct and the baseline was wrong. Qualitatively, we observe differences between the models’ predictions in their visualizations: in the suit image, the augmented network focuses on the tie while the baseline is attracted to the man’s gaze and elbow. Similarly, for the ball player, the baseline mistakenly focuses on the bottom edge of the image. In line with previous work [16, 21, 20], we also observe that visualizations of the augmented network tend to cover the object surface more than those of the baseline model.
4.3 Joint vs. non-joint training
We then thoroughly tested whether joint training is necessary for occlusion augmentation to be effective. We trained networks with Hide-and-Seek occlusions without joint training by introducing another hyper-parameter, , that determines whether an image is left completely unoccluded (see Section 3 for more details). We trained these networks with  and . We then compared those networks with our baselines and our jointly trained networks from Table 1. If joint training is not strictly necessary, we would expect the non-jointly trained networks to beat the baselines.
Figure 5 shows that this is not the case. Overwhelmingly, the non-jointly trained networks (green lines) perform worse than our baselines (dotted lines). While we might expect this when , that is, when images are always occluded and the training domain may thus be too different from the test domain, it is surprising that even when showing full images half of the time (), we do not see an improvement. This suggests that seeing an image occluded and unoccluded in the same batch is necessary for occlusion augmentation to work well. Our findings are consistent with ’s observation that Cutout did not improve ImageNet classification performance.
We also briefly explored finetuning models on full images after they have been trained on exclusively occluded images but did not see an improvement over baselines.
Averaged over 2 runs except where marked * (denotes 1 run); mean standard deviation  with range .
|Best from Tbl 1||75.02||92.38||72.41||90.86||57.36||79.65|
|Best from Tbl 1||77.02||93.45|
4.4 Saliency-based occlusions
For AlexNet, VGG16, and VGG16-BN, we train networks with occlusions based on ’s saliency maps at the following layers (post-ReLU but pre-pooling): conv3, conv4, and conv5 (for VGG16(-BN), convX refers to the last convolutional layer in the X-th block). For ResNet50, we train networks on saliency maps from the max pool before the first block and from the very first convolutional layers in the first, second, and third blocks, before batch normalization. Given a saliency heatmap, we extract a Cutout-like  patch that covers the most salient part of the image. We then jitter the patch uniformly by  pixels.
Table 4 and Table 5 show that the best results from training jointly with saliency-based occlusions are, for all networks except ResNet50, consistently better (albeit by a small margin) than the best results from training jointly with stochastic Hide-and-Seek occlusions. Most notably, a much smaller amount of saliency-based occlusion is needed to yield improvements comparable to Hide-and-Seek occlusions (i.e., for VGG16-BN, occluding  of an image using saliency is comparable to occluding  on average using Hide-and-Seek). This is likely because the saliency-based occlusions should cover the most “important” parts of an image. Our saliency-based occlusion of side length  is roughly comparable to Hide-and-Seek with a  grid () and , that is, with on average only one patch occluded. It is also directly comparable with Cutout with the same hyper-parameters (; see Table 2 and Table 3 for Cutout on ResNet50 and AlexNet respectively).
The slim differences between results from different layers suggest that occlusions based on ’s saliency method are reasonably robust to the choice of layer. Saliency-based occlusion also yields a lower mean standard deviation of  compared to  for Hide-and-Seek occlusions, owing to the significantly less stochastic nature of saliency-based occlusion augmentation.
One limitation of our current approach is that we extract only one maximal patch, which limits the size of our occlusions; larger occlusions would be needed to match the effects of the best parameterizations of the stochastic methods. This limitation is likely why the results from saliency-based occlusions on ResNet50 do not beat the best stochastic occlusion results, since a larger amount of occlusion is needed there (for Hide-and-Seek, ; for Cutout,  for ).
| ||top-1 w/o joint (%)||top-1 w/ joint (%)||top-5 w/o joint (%)||top-5 w/ joint (%)|
|baseline from Table 1||76.40||76.40||93.10||93.03|
|Dropout  ()||76.34||76.41||93.02||93.10|
|Spatial Dropout  ()||75.95||76.31||92.77||93.04|
|DropBlock  ( & )||75.88||76.33||92.77||92.98|
|Label smoothing  (0.1)||76.64||76.26||93.25||93.11|
|Best H&S () from Table 1 (ours)||–||77.02||–||93.45|
|Best CO () from Table 2 (ours)||–||77.25||–||93.48|
4.5 Comparison with other regularization methods
We compare our method with variants of Dropout  and primarily follow ’s protocol (see Section 2 for more details). For Dropout , Spatial Dropout , and DropBlock , we follow ’s procedure and add dropout modules after every convolutional layer in the third and fourth block of ResNet50. For DropBlock, we also add its module to the skip connections in those blocks. For Dropout and Spatial Dropout, we train ResNet50 networks without joint training using , while for DropBlock, we use ( is analogous to ). We also compare against label smoothing  with fixed .
We deviate from  in that we train for 100 epochs using a 30–60–90 epoch learning-rate decay schedule (vs. their 300 epochs using a 100–200–265 schedule) to compare fairly with our method. We do not use a schedule to ease in the amount of dropout for DropBlock, as  reported that DropBlock without scheduling still yielded significant boosts over their ResNet50 baseline. We expected that these two changes would decrease the improvements observed in , but that those improvements would still persist.
Table 6 shows that all the variants of Dropout methods under-performed our ResNet50 non-joint training baseline, suggesting that they are sensitive to and require the custom longer training schedule used in  in order to be effective (see  for results with the longer training schedule). Label smoothing also under-performed our occlusion augmentation training.
|JT bsl (Tbl 1)||JT CO (Tbl 2)||BA bs||BA CO ()||BA CO ()|
|Batch Augment||Dataset Augment||Joint Training|
4.6 Comparison with Batch Augmentation 
Lastly, we compare our joint training paradigm with batch augmentation . The key difference is that, for joint training, all standard pre-processing occurs before image duplication, whereas for batch augmentation it occurs after duplication. Thus, transformations from pre-processing are identical across copies in joint training but independent (i.e., different) in batch augmentation. In all our previous results, we used joint training ( copies).
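The ordering difference can be sketched as follows (a minimal illustration with stand-in `augment` and `occlude` callables of our own devising; the point is only where the duplication happens relative to pre-processing):

```python
import numpy as np

def joint_copies(img, augment, occlude, rng):
    """Joint training: augment once, then duplicate. The two copies are
    identical everywhere except inside the occluded region."""
    base = augment(img, rng)
    return base, occlude(base, rng)

def batch_aug_copies(img, augment, occlude, rng):
    """Batch augmentation: duplicate first, then augment each copy
    independently, so the copies differ everywhere."""
    return augment(img, rng), occlude(augment(img, rng), rng)
```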
Batch augmentation vs. joint training.
Table 7 shows results when we use batch augmentation to include stochastic Cutout occlusions during training, with fixed CO hyper-parameters . The results for  in Table 7 improve upon and are comparable to our joint-training Cutout results for  in Table 2:  (top-1) and  (top-5) for batch-augmented Cutout ( denotes the probability that an image copy is left unoccluded) vs.  (top-1) and  (top-5) for joint training. However, batch augmentation also significantly improves its respective baseline; thus, the relative improvements of batch-augmented Cutout are smaller than those of jointly trained Cutout: for , the gain is quite slim for top-1 (and non-existent for top-5) when using  copies.
No full images needed.
Most notably, Cutout with  achieves similar performance to that with . This suggests that one can train a network on images that are always occluded (i.e., without ever seeing a full, natural image) and still achieve better inference-time performance on full images than standard training methods.
Table 8 shows results when training baseline ResNet50 models with batch augmentation, dataset augmentation, and joint training. Dataset augmentation iterates through the training set  times (i.e., the copies appear in distinct mini-batches), while batch augmentation places the copies of an image in the same mini-batch. These results verify  by showing the necessity of having the copies of an image in the same mini-batch.
We show an effective paradigm for using occlusion augmentation to improve ImageNet classification performance. The primary insight from our work is that using some variant of batch augmentation  is necessary to gain this improvement. This suggests that further research on what is being learned during joint training, and more broadly batch augmentation , is warranted. We also demonstrate that training-time occlusions can be a way to understand a model’s upper bound for robustness to occlusions generally. There is likely room to improve on our work here, particularly in exploring further the potential of batch augmentation , in developing better saliency-based approaches to occlusion augmentation, and in elucidating further the interaction between, and impact of, dataset and model complexity for effective occlusion augmentation. Further research could also be done on other kinds of occlusions, such as blur, random noise, or even ignoring regions . In conclusion, in contrast to other regularization techniques that require architectural changes, we present a simple paradigm for making occlusions effective on ImageNet for sufficiently capable models (e.g., ResNet50) that can easily be added to existing training paradigms.

Acknowledgements. We are grateful for support from the Open Philanthropy Project (R.F.). We also thank Sylvestre-Alvise Rebuffi for helpful discussions and for sharing his code, as well as Chris Olah and the OpenAI Clarity team for organizing insightful discussions on interpretability research.
- (2017) Improved regularization of convolutional neural networks with cutout. arXiv.
- (2017) Interpretable explanations of black boxes by meaningful perturbation. In Proc. ICCV.
- (2018) DropBlock: a regularization method for convolutional networks. In Proc. NeurIPS.
- (2016) Deep residual learning for image recognition. In Proc. CVPR.
- (2019) Augment your batch: better training with larger batches. arXiv.
- (2014) The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html.
- (2012) ImageNet classification with deep convolutional neural networks. In Proc. NeurIPS.
- (2017) FractalNet: ultra-deep neural networks without residuals. In Proc. ICLR.
- (2018) Image inpainting for irregular holes using partial convolutions. In Proc. ECCV.
- (2018) The building blocks of interpretability. Distill.
- (2018) RISE: randomized input sampling for explanation of black-box models. In Proc. BMVC.
- Pay attention to this: finding the pixels that matter for training.
- (2015) ImageNet large scale visual recognition challenge. IJCV.
- (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proc. ICCV.
- (2015) Very deep convolutional networks for large-scale image recognition. In Proc. ICLR.
- (2017) Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization. In Proc. ICCV.
- (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR.
- (2016) Rethinking the inception architecture for computer vision. In Proc. CVPR.
- (2015) Efficient object localization using convolutional networks. In Proc. CVPR.
- (2017) A-Fast-RCNN: hard positive generation via adversary for object detection. In Proc. CVPR.
- (2017) Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In Proc. CVPR, pp. 1568–1576.
- (2014) Visualizing and understanding convolutional networks. In Proc. ECCV.
- (2016) Learning deep features for discriminative localization. In Proc. CVPR.
- (2018) Learning transferable architectures for scalable image recognition. In Proc. CVPR.