We propose a novel data augmentation method, GridMask, in this paper. It utilizes information removal to achieve state-of-the-art results in a variety of computer vision tasks. We analyze the requirements of information dropping. Then we show the limitations of existing information dropping algorithms and propose our structured method, which is simple yet very effective. It is based on the deletion of regions of the input image. Our extensive experiments show that our method outperforms the latest AutoAugment, which is far more computationally expensive due to its use of reinforcement learning to find the best policies. On the ImageNet dataset for recognition, the COCO2017 dataset for object detection, and the Cityscapes dataset for semantic segmentation, our method notably improves performance over baselines. These extensive experiments demonstrate the effectiveness and generality of the new method.
Deep convolutional neural networks (CNNs) have achieved great success in many computer vision tasks in recent years, including image classification [11, 16, 19, 10, 20, 22], object detection [7, 8, 15, 9], and semantic segmentation [13, 2, 27]. A CNN has millions of parameters, so training it demands a lot of data. Otherwise, serious over-fitting could arise. Data augmentation is a very important technique for generating more useful data from existing data when training practical and general CNNs.
Existing data augmentation methods can be roughly divided into three categories: spatial transformation, color distortion, and information dropping [4, 29, 17]. Spatial transformation involves a set of basic data augmentation solutions, such as random scale, crop, flip, and random rotation, which are widely used in model training. Color distortion, which includes changing brightness, hue, etc., is also used in several models. These two methods aim at transforming the training data to better simulate real-world data by changing some channels of information.
Information deletion has recently been widely employed for its effectiveness and/or efficiency. It includes random erasing, cutout, and hide-and-seek (HaS). It is common knowledge that by deleting a level of information in the image, CNNs can learn information that was originally less sensitive or important and enlarge the perception field, resulting in a notable increase in the robustness of the model.
Avoiding both excessive deletion and excessive reservation of continuous regions is the core requirement for information dropping methods. Intriguingly, we found that a successful information dropping method should achieve a reasonable balance between deleting and reserving regional information in the image. Intuitively, the reason is twofold.
On the one hand, excessively deleting one or a few regions may remove the object and its context entirely. The remaining information is then not enough for classification, and the image acts more like noisy data. On the other hand, excessively preserving regions could leave some objects untouched. Such trivial images may reduce the network's robustness. Designing a simple method that reduces the chance of both problems is therefore essential.
Existing information dropping algorithms have different chances of achieving a reasonable balance between deletion and reservation of continuous regions. Both cutout and random erasing delete only one continuous region of the image. The resulting imbalance between the two conditions is obvious because the deleted region is a single area. It has a good chance of covering the whole object or none of it, depending on its size and location. The approach of HaS is to divide the picture evenly into small squares and delete them randomly. It is more effective, but still stands a considerable chance of continuously deleting or reserving regions. Some unsuccessful examples of existing methods are shown in Fig. 1. Statistical and more specific quantitative analysis is provided in Sec. 3.
Contrary to previous methods, we surprisingly observe that a very easy strategy balances these two conditions statistically better: using structured dropping regions, such as deleting uniformly distributed square regions. Our proposed information removal method, named GridMask, thus builds on structured dropping. Its structure is really simple, as shown in Fig. 2, making it easy, fast, and flexible to implement and to incorporate into all existing CNN models.
Our GridMask neither removes one big continuous region like Cutout, nor randomly selects squares like hide-and-seek. The deleted region is only a set of spatially uniformly distributed squares. In this structure, by controlling the density and size of the deleted regions, we have a statistically higher chance of achieving a good balance between the two conditions, as shown in Fig. 5. As a result, we improve many state-of-the-art CNN baseline models by a good margin using our very simple GridMask at an extremely low computation budget.
To demonstrate the effectiveness of GridMask, extensive experiments are designed and conducted, as shown in Table 1. In the image classification task on ImageNet, GridMask improves the accuracy of ResNet50 from 76.5% to 77.9%, which is much more effective than Cutout and HaS, which reach 77.1% and 77.2%. Our result is also better than that of AutoAugment (77.6%), which is a combination of several existing policies found through reinforcement learning. Note that our method is just one simple policy, which can also be incorporated into AutoAugment.
Further, on the COCO2017 dataset for the object detection task, GridMask increases mAP of Faster-RCNN-50-FPN from 37.4% to 39.2%. On the semantic segmentation task using the challenging dataset Cityscapes, our method improves mIoU of PSPNet50 from 77.3% to 78.1%. They all demonstrate the surprisingly high effectiveness and generality on a large variety of tasks and training data. Our code is available at https://github.com/akuxcw/GridMask.
Regularization is an important technique for training neural networks. In recent years, various regularization techniques have been proposed. Dropout is effective and is mainly used in fully connected layers. DropConnect is very similar to dropout, except that it does not drop the output value, but instead the input value of some nodes. In addition, adaptive dropout, stochastic pooling, DropPath, shake-shake regularization, and DropBlock were also proposed. These methods add noise to a few parameters in the training process according to different rules, so as to avoid over-fitting the training data and to improve the models' generalization ability. Besides, Mixup and CutMix use multi-image information during the training process. By modifying the input images, labels, and loss functions, these methods can fuse information from multiple images and achieve good results.
Data augmentation is also an effective form of regularization. Compared with other methods, data augmentation has many advantages. For example, it only needs to operate on the input data, instead of changing the network structure. Data augmentation is also easy to apply to many tasks, while other loss- or label-based methods may need extra design. The basic policies of data augmentation include random flip, random crop, etc., which are commonly used on CNNs. Beyond the basic strategies, the inception-preprocess is more advanced, with random disturbance of the color of the input image. Recently, AutoAugment improved on the inception-preprocess by using reinforcement learning to search existing policies for the optimal combination. Besides, some recently proposed methods based on information dropping have also achieved good results, such as random erasing, hide-and-seek, and cutout. These methods delete information from input images through certain policies. They usually work well on small datasets, while their effect on large datasets is limited.
Our method also belongs to information dropping augmentation. Compared with previous methods, ours can achieve consistently better results on various datasets, outperforming all previous unsupervised strategies, including the optimal combination proposed by AutoAugment. Our method can serve as a new baseline policy for data augmentation.
GridMask is a simple, general, and efficient strategy. Given an input image, our algorithm randomly removes some of its pixels. Unlike other methods, the region that our algorithm removes is neither a single continuous region nor a set of random pixels as in dropout. Instead, our algorithm removes a region made of disconnected pixel sets, as shown in Fig. 3.
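We express our setting as

$$\tilde{x} = x \times M,$$

where $x \in \mathbb{R}^{H \times W \times C}$ represents the input image, $M \in \{0, 1\}^{H \times W}$ is the binary mask that stores pixels to be removed, and $\tilde{x} \in \mathbb{R}^{H \times W \times C}$ is the result produced by our algorithm. For the binary mask $M$, if $M_{i,j} = 1$ we keep pixel $(i, j)$ in the input image; otherwise we remove it. Our algorithm is applied after the image normalization operation.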
The shape of $M$ looks like a grid, as shown in Fig. 3. We use four numbers $(r, d, \delta_x, \delta_y)$ to represent a unique $M$. Every mask is formed by tiling the units as shown in Fig. 4. $r$ is the ratio of the shorter gray edge in a unit. $d$ is the length of one unit. $\delta_x$ and $\delta_y$ are the distances between the first intact unit and the boundary of the image.
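As a minimal sketch of the mask construction and masking step described above, the snippet below builds a single-channel binary mask from the four mask parameters (ratio, unit length, and the two offsets) and multiplies it into an image. The function names `grid_mask` and `apply_mask`, the placement of the kept strips at the start of each unit, and the rounding of the strip width are assumptions made for illustration; this is not the released implementation.

```python
def grid_mask(height, width, r, d, delta_x, delta_y):
    """Return a height x width list of 0/1 values (1 = keep, 0 = remove)."""
    keep = int(round(r * d))  # assumed width of the kept strip inside one unit
    mask = []
    for i in range(height):
        v = (i - delta_y) % d  # vertical position inside the current unit
        row = []
        for j in range(width):
            u = (j - delta_x) % d  # horizontal position inside the current unit
            # A pixel is removed only when it lies in the dropped square of its unit.
            row.append(1 if (u < keep or v < keep) else 0)
        mask.append(row)
    return mask


def apply_mask(image, mask):
    """Element-wise product of a single-channel image and a binary mask."""
    return [[p * m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


# Toy example: a constant 20 x 20 image and a mask with 10-pixel units.
image = [[2.0] * 20 for _ in range(20)]
mask = grid_mask(20, 20, r=0.6, d=10, delta_x=0, delta_y=0)
masked = apply_mask(image, mask)
```

For a multi-channel image, the same mask would be applied to every channel.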
Next, we talk about the choices of these four parameters.
$r$ determines the keep ratio of an input image. We define the keep ratio $k$ of a given mask $M$ as

$$k = \frac{\mathrm{sum}(M)}{H \times W},$$

which is the proportion of the input image that is reserved. The keep ratio is a very important parameter for controlling the algorithm. With a large keep ratio, the CNN may still suffer from over-fitting. If it is too small, we could lose excessive information, causing under-fitting. There is a close relation between $k$ and $r$. Ignoring incomplete units in a mask, we get

$$k = 1 - (1 - r)^2 = 2r - r^2.$$

The keep ratio is fixed following common practice. We perform extensive experiments to verify the choice of $r$ in Section 4.1.3.
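The relation between the keep ratio and the ratio parameter can be checked numerically. The sketch below assumes that the kept strip of each unit has width r·d, so that the dropped square covers a fraction (1 - r)^2 of a full unit; under that assumption the closed form 1 - (1 - r)^2 should match a direct pixel count. The helper names are hypothetical.

```python
def keep_ratio_closed_form(r):
    """Keep ratio of a full unit, assuming the dropped square has side (1 - r) * d."""
    return 1.0 - (1.0 - r) ** 2


def keep_ratio_counted(r, d):
    """Count kept pixels in one d x d unit and divide by the unit area."""
    keep = int(round(r * d))  # assumed width of the kept strip
    kept = sum(1 for v in range(d) for u in range(d) if u < keep or v < keep)
    return kept / (d * d)


for r in (0.4, 0.6, 0.8):
    print(r, keep_ratio_closed_form(r), keep_ratio_counted(r, d=10))
```

The two values agree whenever r·d is an integer; otherwise rounding the strip width introduces a small discrepancy.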
The length of one unit $d$ does not affect the keep ratio, but it decides the size of one dropped square. When $r$ is fixed, the side length of one dropped square is proportional to $d$: the larger $d$ is, the larger the dropped square becomes. The keep ratio is constant during training, yet we still add randomness to enlarge the variety of images, and $d$ is suitable for this purpose. We randomly select $d$ from a range as

$$d = \mathrm{random}(d_{\min}, d_{\max}).$$

It is easy to conclude that a smaller $d$ can avoid most failure cases. But some recent works [4, 6] show that dropping a very small region is useless for convolutional operations, in accordance with our experimental results given later in Section 4.1.3.
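$\delta_x$ and $\delta_y$ can shift the mask given $r$ and $d$, making the mask cover all possible situations. So we randomly choose $\delta_x$ and $\delta_y$ as

$$\delta_x, \delta_y = \mathrm{random}(0, d - 1).$$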
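The random choices of the unit length and the two offsets can be sketched as follows. The function name `sample_grid_params`, the use of integer sampling that is inclusive on both ends, and the example range of 8 to 16 are assumptions for illustration only.

```python
import random


def sample_grid_params(d_min, d_max, rng=random):
    """Draw a unit length d and two offsets in [0, d - 1]."""
    d = rng.randint(d_min, d_max)    # randint is inclusive on both ends
    delta_x = rng.randint(0, d - 1)  # horizontal shift of the first intact unit
    delta_y = rng.randint(0, d - 1)  # vertical shift of the first intact unit
    return d, delta_x, delta_y


# Illustrative range only; the values used in the experiments are not restated here.
rng = random.Random(0)
samples = [sample_grid_params(8, 16, rng) for _ in range(1000)]
```

Drawing the offsets after the unit length keeps every possible alignment of the grid equally likely for a given unit length.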
Here we statistically show the probability of producing unsuccessful data augmentation. Basically, a good balance between deletion and reservation of information is the key. Our preliminary analysis shows that our method has a lower chance of yielding failure cases than Cutout and HaS.
We simulate the condition in real datasets and calculate the probability of failure cases for different methods when varying lengths of removal squares. All images are resized to and the object in an image is with size within . The keep ratio is set to a typical value of 0.75  for all methods. We assume all methods randomly choose the length of removal squares within , where is the value of the -axis. Random erasing is very similar to Cutout, so we only test cutout. And we expand Cutout to multi-region Cutout for a better performance, which means randomly dropping squares until reaching the keep ratio. If 99 percent of an object is removed or reserved, we call it a failure case. We simulate 100,000 images and the probability of failure case for every method is summarized in Fig. 5.
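The failure criterion used in this simulation can be sketched in a few lines. The snippet below treats an object as an axis-aligned box and flags a failure when at least 99 percent of its pixels are kept or at least 99 percent are removed. The helper names, the 32 x 32 toy image, the box position, and the three hand-built masks are assumptions for illustration and do not reproduce the simulation settings.

```python
def kept_fraction(mask, top, left, height, width):
    """Fraction of the object's pixels that survive the mask."""
    kept = sum(mask[i][j]
               for i in range(top, top + height)
               for j in range(left, left + width))
    return kept / (height * width)


def is_failure(mask, top, left, height, width, threshold=0.99):
    """True if the object is almost fully reserved or almost fully removed."""
    frac = kept_fraction(mask, top, left, height, width)
    return frac >= threshold or frac <= 1.0 - threshold


SIZE = 32
obj = (8, 8, 16, 16)  # top, left, height, width of a toy object

# One removed square that covers the whole object (Cutout-style).
cutout_hit = [[0 if 8 <= i < 24 and 8 <= j < 24 else 1 for j in range(SIZE)]
              for i in range(SIZE)]
# One removed square that misses the object entirely.
cutout_miss = [[0 if i < 8 and j < 8 else 1 for j in range(SIZE)]
               for i in range(SIZE)]
# Grid-style mask: 8-pixel units, each with a 4 x 4 dropped square.
grid = [[1 if (i % 8 < 4 or j % 8 < 4) else 0 for j in range(SIZE)]
        for i in range(SIZE)]

results = {name: is_failure(m, *obj)
           for name, m in (("cutout_hit", cutout_hit),
                           ("cutout_miss", cutout_miss),
                           ("grid", grid))}
```

Averaging `is_failure` over many randomly placed objects and randomly drawn masks would give a Monte Carlo estimate of the failure probability for each method.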
Compared with other algorithms, our method always has the best performance. As the length increases, the superiority of our method becomes increasingly obvious. This observation allows us to choose generally larger square sizes to augment data effectively.
We use two ways to apply GridMask to network training in practice. One is to set a constant probability $p$, so that GridMask is applied to every input image with a chance of $p$. The other is to increase the probability of GridMask linearly with the training epochs until an upper bound $P$ is reached. We empirically verify that the second way is better for most experiments.
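The second, linearly increasing scheme can be sketched as a small schedule function. The names `gridmask_probability` and `should_apply` are hypothetical; the example values (a ramp over 240 epochs up to 0.8) are the ImageNet settings reported later in the text.

```python
import random


def gridmask_probability(epoch, ramp_epochs, p_max):
    """Probability of applying GridMask: linear ramp from 0 to p_max, then constant."""
    return p_max * min(1.0, epoch / ramp_epochs)


def should_apply(epoch, ramp_epochs, p_max, rng=random):
    """Decide for one image whether GridMask is applied at this epoch."""
    return rng.random() < gridmask_probability(epoch, ramp_epochs, p_max)


# Ramp from 0 to 0.8 over the first 240 epochs, then hold until epoch 300.
schedule = [gridmask_probability(e, 240, 0.8) for e in (0, 120, 240, 300)]
```

The constant-probability scheme is the special case in which the probability is held at a fixed value from the first epoch.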
We conduct extensive experiments on several major computer vision tasks including image classification, semantic segmentation, and object detection. Our augmentation method improves the baseline on all these important tasks by a large margin.
Figure 6: Class activation maps (CAM) for six example ImageNet images: Handkerchief, Tusker, Cellphone, Pencil case, Cardigan, and Fountain pen.
ImageNet-1K is the most challenging dataset for image classification. To demonstrate the strong capability of our proposed augmentation, we conduct experiments on it.
We experiment with a wide range of differently sized models, from ResNet50 to ResNet152. We train them with our augmentation on ImageNet from scratch for 300 epochs. The learning rate is set to 0.1 and decayed 10-fold at epochs 100, 200, and 265. We train all our models on 8 GPUs with a batch size of 256 (32 per GPU). For the baseline augmentation, we follow common practice. We first randomly crop a patch from the original image and then resize the patch to the target size (224 × 224). Finally, the patch is horizontally flipped with a probability of 0.5.
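The step learning-rate schedule described above can be written as a small function. This is a sketch under one assumption: each decay takes effect at the start of the listed epoch, with epochs counted from 0. The function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(100, 200, 265), factor=0.1):
    """Step schedule: multiply the rate by `factor` at every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Learning rate at a few representative epochs of the 300-epoch run.
rates = {e: learning_rate(e) for e in (0, 99, 100, 200, 265, 299)}
```

In a framework-based training loop the same behaviour is usually obtained from a built-in multi-step scheduler rather than a hand-written function.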
For our method, we only use GridMask along with the baseline augmentation. We choose , and we linearly increase the probability of GridMask from 0 to 0.8 over the training epochs until the 240th epoch, and then keep it fixed until epoch 300. The mask is also rotated before use. It is worth noting that we do not use any color augmentation, yet the strategy still consistently achieves better results, as summarized in Table 2.
In terms of accuracy, our method improves many different models, from ResNet50 to ResNet152. Our method raises ResNet50, ResNet101, and ResNet152 from 76.5%, 78.0%, and 78.3% to 77.9%, 79.1%, and 79.7%, respectively, an increase of 1.4%, 1.1%, and 1.4%. It also shows that the strategy suits models of various scales without careful hand tuning.
|ResNet50 + Cutout ||77.1|
|ResNet50 + HaS ||77.2|
|ResNet50 + AutoAugment ||77.6|
|ResNet50 + GridMask (Our Impl.)||77.9|
|ResNet101 + GridMask (Our Impl.)||79.1|
|ResNet152 + GridMask (Our Impl.)||79.7|
Cutout  also does not distort color and only drops information. Its performance on ImageNet improves ResNet50 by 0.6% (from 76.5% to 77.1%). Our method drops information in a more effective structure, improving ResNet50 by 1.4% on ImageNet.
HaS  is the previous SOTA information dropping method, which is better than Cutout. It uses smaller removal squares (between sizes 16 and 56). When the squares get larger, the result becomes worse in the experiments reported in the original paper. Our setting, contrarily, produces better results even when removal squares are large. It is because we handle the aforementioned failure cases better.
AutoAugment is a SOTA data augmentation method. It uses reinforcement learning, at a cost of tens of thousands of GPU hours, to search for a combination of existing augmentation policies. It thus performs reasonably better than previous strategies and improves the accuracy of ResNet50 by 1.1%. Our method, by simply dropping part of the information of the input image in a regular way, even exceeds the effect of AutoAugment. Our method is extremely easy, uses only one data augmentation policy, and still performs better than this type of exhaustive combination of various data augmentation policies. The effectiveness and generality are well demonstrated.
To analyze what the model trained with our GridMask learns, we compute class activation mapping (CAM) for a ResNet50 model trained with our policy on ImageNet. We also show the CAM for models trained with the baseline augmentation and AutoAugment for comparison. Intriguingly, we observe common properties between our method and AutoAugment. Compared to the baseline method, both AutoAugment and ours tend to focus on larger, spatially distributed regions. This indicates that successful augmentation makes the system attend to large and salient representations, which can quickly improve the generalization ability of models. The figure also demonstrates that the model trained with our strategy attends to the most structurally comprehensive regions. The third image is an example, where the two cellphones are both important. The baseline method focuses only on the right phone, and AutoAugment pays attention to the left one. In contrast, our method notices both cellphones and helps recognition with this more accurate set of information.
|Model||Accuracy (%)|
|ResNet18||95.28|
|+ Random erasing||95.32|
|+ Cutout||96.25|
|+ HaS||96.10|
|+ AutoAugment||96.07|
|+ GridMask (Ours)||96.54|
|+ AutoAugment & Cutout||96.51|
|+ AutoAugment & GridMask (Ours)||96.64|
|WideResNet-28-10||96.13|
|+ Random erasing *||96.92|
|+ Cutout||97.04|
|+ HaS||96.94|
|+ AutoAugment||97.01|
|+ GridMask (Ours)||97.24|
|+ AutoAugment & Cutout||97.39|
|+ AutoAugment & GridMask (Ours)||97.48|
|ShakeShake-26-32||96.43|
|+ Random erasing||96.46|
|+ Cutout||96.96|
|+ HaS||96.89|
|+ AutoAugment||96.96|
|+ GridMask (Ours)||97.20|
|+ AutoAugment & Cutout||97.36|
|+ AutoAugment & GridMask (Ours)||97.42|
The CIFAR10 dataset has 50,000 training images and 10,000 testing images. CIFAR10 has 10 classes, each has 5,000 training images and 1,000 testing images.
We summarize the results on CIFAR10 in Table 3. We follow the training settings of [4, 5], except using more training epochs for ResNet-18 and WideResNet-28-10. We use the same hyperparameters to train all methods. For the baseline augmentation, we first pad the input to 40 × 40 and randomly crop a patch of size 32 × 32. Depending on the model, the patch is horizontally flipped or not. Other augmentation methods are added after the baseline augmentation. We use , and the same scheduling method as described in Section 4.1.1. One thing to note is that, in the AutoAugment paper, the authors train their policies together with Cutout. For the sake of fairness, we add Cutout and AutoAugment separately in our experiments. Some results are our reimplementation with the same training strategy as ours, and we achieve results similar to those reported in the original papers. We train every network three times and report the mean accuracy.
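The baseline CIFAR10 augmentation (pad to 40 × 40, random 32 × 32 crop, optional horizontal flip) can be sketched as below. The function name `pad_crop_flip`, the zero padding, the single-channel image, and the `flip_prob` parameter are assumptions for illustration; setting `flip_prob=0.0` corresponds to the models that do not flip.

```python
import random


def pad_crop_flip(image, pad=4, crop=32, flip_prob=0.5, rng=random):
    """Zero-pad a square single-channel image, crop a random patch, optionally flip it."""
    size = len(image)
    padded_size = size + 2 * pad
    padded = [[0.0] * padded_size for _ in range(padded_size)]
    for i in range(size):
        for j in range(size):
            padded[i + pad][j + pad] = image[i][j]
    # Top-left corner of the crop, chosen uniformly among all valid positions.
    top = rng.randint(0, padded_size - crop)
    left = rng.randint(0, padded_size - crop)
    patch = [row[left:left + crop] for row in padded[top:top + crop]]
    if rng.random() < flip_prob:
        patch = [row[::-1] for row in patch]  # horizontal flip
    return patch


image = [[1.0] * 32 for _ in range(32)]
patch = pad_crop_flip(image, rng=random.Random(0))
```

Any additional augmentation, including an information dropping mask, would then be applied to `patch`.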
The table indicates that our GridMask improves many baseline models by a large margin. With GridMask, we improve the accuracy of ResNet18 from 95.28% to 96.54% (+1.26%), which surpasses previous information dropping methods significantly. Also, our result is better than AutoAugment. For other models, our method can improve the accuracy of WideResNet28-10 and ShakeShake-26-32 from 96.13% and 96.43% to 97.24% (+1.11%), and 97.20% (+0.88%), which is still better than other augmentation methods. Combined with AutoAugment, we achieve SOTA result on these models.
In this section, we train models with GridMask under different parameters and show variations of GridMask.
We experiment with setting $r$ to 0.4, 0.5, 0.6, 0.7, and 0.8 on ImageNet with ResNet50. The result is summarized in Fig. 7(a). According to the result, we choose the most effective value of $r$ for different models on the ImageNet dataset. We also experiment with $r$ being 0.2, 0.3, 0.4, 0.5, and 0.6 on CIFAR10 with ResNet18. The result is shown in Fig. 7(b), and we choose the most effective value of $r$ on CIFAR10 in the same way. These experiments show that the selected $r$ becomes larger on more complex datasets (such as ImageNet). Put differently, we should keep more information on complex datasets to avoid under-fitting, and delete more on simple datasets to reduce over-fitting. This finding is consistent with intuition.
|Range of $d$||Accuracy (%)|
We experiment with setting different ranges of $d$ on ImageNet, and the results are summarized in Table 4. When the range of $d$ is concentrated in a small interval, the accuracy is low. When $d$ is too small, the result is even worse. We set $d$ in the optimal range. These experiments verify our previous analysis that different values of $d$ can bring varying effects to networks, and that the diversity of $d$ can increase the robustness of the network.
The first variation is reversed GridMask, which means we keep what we drop in GridMask, and drop what we keep in GridMask. According to our analysis, the reversed GridMask should yield similar performance on different challenging datasets, because a good balance of the two conditions in GridMask should be similarly good for reversed GridMask. We try different values of $r$ for reversed GridMask. The result is listed in Table 5. The reversed GridMask performs better than other augmentation methods.
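Reversing a mask amounts to swapping its kept and dropped pixels. The following sketch uses a hypothetical helper `reverse_mask` and a small hand-built 8 × 8 grid mask (4-pixel units, 2-pixel kept strips) chosen only for illustration.

```python
def reverse_mask(mask):
    """Swap kept (1) and dropped (0) pixels of a binary mask."""
    return [[1 - m for m in row] for row in mask]


# Toy grid mask: 4-pixel units, each with a 2 x 2 dropped square.
mask = [[1 if (i % 4 < 2 or j % 4 < 2) else 0 for j in range(8)]
        for i in range(8)]
reversed_mask = reverse_mask(mask)
```

In the reversed mask only the isolated squares are kept, so its keep ratio is one minus the keep ratio of the original mask.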
|+ RevGridMask ()||77.42|
|+ RevGridMask ()||77.74|
|+ RevGridMask ()||77.55|
|+ RevGridMask ()||96.18|
|+ RevGridMask ()||96.46|
|+ RevGridMask ()||96.33|
Another variation of GridMask is random GridMask. In GridMask, we can regard the mask as composed of many units, and we drop a block in every unit. This forms our structured information dropping. A natural variation is to break the structure and drop the block in each unit only with a certain probability. The result is summarized in Table 6. Using random dropping decreases the final accuracy, so structured information dropping is more effective.
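A minimal sketch of this random variant is given below: each unit decides independently whether its block is dropped. The function name `random_grid_mask`, the parameter `drop_prob`, the omission of the two offsets, and the placement of the block at the end of each unit are assumptions for illustration.

```python
import random


def random_grid_mask(height, width, r, d, drop_prob, rng=random):
    """Grid mask in which each unit drops its block only with probability drop_prob."""
    keep = int(round(r * d))  # assumed width of the kept strip inside one unit
    units_y = (height + d - 1) // d
    units_x = (width + d - 1) // d
    # One independent decision per unit.
    dropped = [[rng.random() < drop_prob for _ in range(units_x)]
               for _ in range(units_y)]
    mask = []
    for i in range(height):
        row = []
        for j in range(width):
            in_block = (i % d >= keep) and (j % d >= keep)
            row.append(0 if in_block and dropped[i // d][j // d] else 1)
        mask.append(row)
    return mask


structured = random_grid_mask(20, 20, r=0.6, d=10, drop_prob=1.0)
untouched = random_grid_mask(20, 20, r=0.6, d=10, drop_prob=0.0)
partial = random_grid_mask(20, 20, r=0.6, d=10, drop_prob=0.5, rng=random.Random(0))
```

With `drop_prob=1.0` the mask reduces to the structured grid, and with `drop_prob=0.0` nothing is removed.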
In this section, we use our GridMask policy to train object detectors on the COCO dataset, to show that our method is a generic augmentation policy. We use Faster-RCNN as our baseline model with an open-source PyTorch implementation. All models are initialized with ImageNet pre-trained weights and are then fine-tuned for some epochs on the COCO2017 dataset.
The baseline augmentation includes randomly deforming the brightness, contrast, saturation, and hue of the input image. The image is then randomly scaled into a certain range. After that, a horizontal flip is randomly applied to the scaled image. Finally, the image is normalized to around zero. Our GridMask is used after the baseline augmentation.
We use the same hyperparameters as described in , except for the training epochs. We first double the original training epochs for both baseline and our GridMask. Then, we increase the training time for both methods further – but the baseline models face a serious over-fitting problem and tend to decrease after training epochs. But models trained with our GridMask yield better results. This demonstrates that our method can handle the over-fitting problem generally and essentially.
The result of our method with different hyperparameters is shown in Table 7. We choose following previous experience. Experiments with different probabilities $p$ are conducted on Faster-RCNN-50-FPN. Over a large range of $p$ from 0.3 to 0.9, we achieve consistently strong results, which only fluctuate between 38.0% and 38.3%. This further demonstrates the stability of our method. When the probability is 0.7, we obtain the best result, which increases mAP by 0.9%. When we further increase the training epochs, we get a higher result of 39.2%, which improves on the baseline by 1.8%. For Faster-RCNN-X101-32x8d-FPN, we increase the mAP from 41.2% to 42.6%, a gain of 1.4%.
|Model||mAP (%)||AP50 (%)||AP75 (%)|
|+ GridMask ()||38.2||60.0||41.4|
|+ GridMask ()||38.1||60.1||41.2|
|+ GridMask ()||38.3||60.4||41.7|
|+ GridMask ()||38.0||60.1||41.2|
|+ GridMask ()||39.2||60.8||42.2|
|+ GridMask ()||42.6||65.0||46.5|
Semantic segmentation is a challenging task in computer vision, which densely predicts the semantic category of every pixel in an image. To demonstrate the universality of our GridMask, we also conduct experiments on the challenging Cityscapes dataset.
We use PSPNet  as our baseline model, which achieved SOTA results for semantic segmentation. We use the same hyperparameters as suggested in , except for the training epochs. We train for longer epochs following common practice. We do not increase the training epochs for the baseline model, because training longer will cause serious over-fitting problems and decrease the accuracy of the baseline model. All models are initialized by the same ImageNet pre-trained weights and then fine-tuned on the Cityscapes dataset.
The baseline model already uses strong augmentation policies, including randomly scaling the image from 0.5 to 1.0, randomly rotating the image in degrees, applying random Gaussian blur, random horizontal flipping, and randomly cropping a patch from the image. The strong baseline augmentation greatly raises the difficulty of adding another augmentation policy. Surprisingly, we still achieve a better result after adding our GridMask along with the baseline augmentation. We summarize the result in Table 8.
Data augmentation is only one of the regularization methods. The grid shape is not only useful in data augmentation but also works in other settings. Inspired by , we combine our grid shape with Mixup. Training ResNet50 with this method on ImageNet, we also obtain SOTA results compared with other regularization methods, as shown in Table 9.
We have proposed a simple, general, and effective policy for data augmentation, which is based on information dropping. It deletes uniformly distributed areas, which together form a grid shape. Using this shape to delete information is more effective than choosing completely random locations. It has achieved remarkable improvement across different tasks and models. On the ImageNet dataset, it increases the baseline by 1.4%. In the task of COCO2017 object detection, we improve the baseline by 1.8%, and in the task of Cityscapes semantic segmentation, we boost the baseline by 0.9%. This effect is consistently stronger than that of other information deletion based data augmentation methods. Further, our method can serve as a new baseline policy in future data augmentation search algorithms.
Our method is one successful way of using structured information dropping, and we believe there are more excellent structures to be found. We hope this study of information dropping methods inspires future work on understanding the importance of designing effective structures, which may even help reinforcement-learning-based policy search to improve.
Learning deep features for discriminative localization. In CVPR.