
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

by   Sangdoo Yun, et al.

Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers. They have proved to be effective for guiding the model to attend to less discriminative parts of objects (e.g., the leg as opposed to the head of a person), thereby letting the network generalize better and have better object localization capabilities. On the other hand, current methods for regional dropout remove informative pixels from training images by overlaying a patch of either black pixels or random noise. Such removal is not desirable because it leads to information loss and inefficiency during training. We therefore propose the CutMix augmentation strategy: patches are cut and pasted among training images, and the ground truth labels are mixed proportionally to the area of the patches. By making efficient use of training pixels and retaining the regularization effect of regional dropout, CutMix consistently outperforms the state-of-the-art augmentation strategies on CIFAR and ImageNet classification tasks, as well as on the ImageNet weakly-supervised localization task. Moreover, unlike previous augmentation methods, our CutMix-trained ImageNet classifier, when used as a pretrained model, yields consistent performance gains on Pascal detection and MS-COCO image captioning benchmarks. We also show that CutMix improves model robustness against input corruptions and out-of-distribution detection performance.



1 Introduction

Deep convolutional neural networks (CNNs) have shown promising performances on various computer vision problems such as image classification [30, 19, 11], object detection [29, 23], semantic segmentation [1, 24], and video analysis [27, 31]. To further improve the training efficiency and performance, a number of training strategies have been proposed, including data augmentation [19] and regularization techniques [33, 16, 37].

In particular, to prevent a CNN from focusing too much on a small set of intermediate activations or on a small region of the input images, random feature removal regularizations have been proposed. Examples include dropout [33] for randomly dropping hidden activations and regional dropout [2, 49, 32, 7] for erasing random regions of the input. Researchers have shown that feature removal strategies improve generalization and localization by letting a model attend not only to the most discriminative parts of objects, but also to the entire object region [32, 7].

| | ResNet-50 | Mixup [46] | Cutout [2] | CutMix |
|---|---|---|---|---|
| Label | Dog 1.0 | Dog 0.5, Cat 0.5 | Dog 1.0 | Dog 0.6, Cat 0.4 |
| ImageNet Cls (%) | 76.3 (+0.0) | 77.4 (+1.1) | 77.1 (+0.8) | 78.4 (+2.1) |
| ImageNet Loc (%) | 46.3 (+0.0) | 45.8 (-0.5) | 46.7 (+0.4) | 47.3 (+1.0) |
| Pascal VOC Det (mAP) | 75.6 (+0.0) | 73.9 (-1.7) | 75.1 (-0.5) | 76.7 (+1.1) |

Table 1: Overview of the results of Mixup, Cutout, and our CutMix on ImageNet classification, ImageNet localization, and Pascal VOC 07 detection (transfer learning with SSD [23] finetuning) tasks. Note that CutMix improves the performance on various tasks.

While regional dropout strategies have shown improvements of classification and localization performances to a certain degree, deleted regions are usually zeroed-out [2, 32] or filled with random noise [49], greatly reducing the proportion of informative pixels on training images. We recognize this as a severe conceptual limitation as CNNs are generally data hungry [26]. How can we maximally utilize the deleted regions, while taking advantage of better generalization and localization using regional dropout?

We address the above question by introducing an augmentation strategy CutMix. Instead of simply removing pixels, we replace the removed regions with a patch from another image (See Table 1). The ground truth labels are also mixed proportionally to the number of pixels of combined images. CutMix now enjoys the property that there is no uninformative pixel during training, making training efficient, while retaining the advantages of regional dropout to attend to non-discriminative parts of objects. The added patches further enhance localization ability by requiring the model to identify the object from a partial view. The training and inference budgets remain the same.

CutMix shares similarity with Mixup [46], which mixes two samples by interpolating both the images and labels. Mixup has been found to improve classification, but the interpolated sample tends to be unnatural (see the mixed image in Table 1). CutMix overcomes this problem by replacing the image region with an image patch from another training image.

Table 1 gives an overview of Mixup [46], Cutout [2], and CutMix on image classification, weakly supervised localization, and transfer learning to object detection methods. Although Mixup and Cutout enhance the ImageNet classification accuracies, they suffer from performance degradation on ImageNet localization and object detection tasks. On the other hand, CutMix consistently achieves significant enhancements across three tasks, proving its superior classification and localization ability beyond the baseline and other augmentation methods.

We present extensive evaluations of CutMix on various CNN architectures, datasets, and tasks. Summarizing the key results, CutMix significantly improves the accuracy of a baseline classifier on CIFAR-100 and obtains state-of-the-art top-1 error (Table 5). On ImageNet [30], applying CutMix to ResNet-50 and ResNet-101 [11] reduces the top-1 error by 2.08 and 1.70 percentage points, respectively (Tables 3 and 4). On the localization front, CutMix improves weakly-supervised object localization (WSOL) accuracy on CUB200-2011 [42] and ImageNet [30] by 5.40 and 0.95 percentage points, respectively (Table 9). The superior localization capability is further evidenced by fine-tuning a detector and an image caption generator on CutMix-ImageNet-pretrained models; CutMix pretraining improves detection performance on Pascal VOC [5] by about 1 mAP and image captioning performance on MS-COCO [22] by at least 2 BLEU points (Table 10). CutMix also enhances model robustness and alleviates the over-confidence issue [12, 21] of deep networks.

2 Related Works

Regional dropout: Methods [2, 49, 32] that remove random regions in images have been proposed to enhance the generalization and localization performances of CNNs. CutMix is similar to those methods; the critical difference is that the removed regions are filled with patches from other images. On the feature level, DropBlock [7] has generalized regional dropout to the feature space and has shown enhanced generalizability as well. CutMix can also be performed in the feature space, as we will see in the experiments.

Synthesizing training data: Some works have explored synthesizing training data for further generalizability. Generating new training samples [6] by stylizing ImageNet [30] has guided the model to focus more on shape than texture, leading to better classification and object detection performances. CutMix also generates new samples by cutting and pasting patches within mini-batches, leading to performance boosts in many computer vision tasks; the main advantage of CutMix is that the additional cost for sample generation is negligible. For object detection, object insertion methods [4, 3] have been proposed as a way to synthesize objects in the background. These methods differ from CutMix because they aim to represent a single object well, while CutMix generates combined samples that may contain multiple objects.

Mixup: CutMix shares similarity with Mixup [46, 39] in that both combine two samples, where the ground truth label of the new sample is given by the mixture of one-hot labels. As we will see in the experiments, Mixup samples are locally ambiguous and unnatural, and therefore confuse the model, especially for localization. Recently, Mixup variants [40, 34, 9] have been proposed; they perform feature-level interpolation and other types of transformations. These works, however, generally lack a deep analysis, in particular of the localization ability and transfer-learned performances.

Tricks for training deep networks: Efficient training of deep networks is one of the most important problems in the research community, as they require a great amount of compute and data. Methods such as weight decay, dropout [33], and Batch Normalization [17] are widely used to train more generalizable deep networks. Recently, methods adding noise to internal features [16, 7, 44] or adding extra paths to the architecture [14, 13] have been proposed. CutMix is complementary to the above methods because it operates on the data level, without changing internal representations or architecture.

3 CutMix

We describe the CutMix algorithm in detail.

3.1 Algorithm

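Let $x \in \mathbb{R}^{W \times H \times C}$ and $y$ denote a training image and its label, respectively. The goal of CutMix is to generate a new training sample $(\tilde{x}, \tilde{y})$ by combining two training samples $(x_A, y_A)$ and $(x_B, y_B)$. The generated training sample $(\tilde{x}, \tilde{y})$ is then used to train the model with its original loss function. To this end, we define the combining operation as

$$\tilde{x} = \mathbf{M} \odot x_A + (\mathbf{1} - \mathbf{M}) \odot x_B, \qquad \tilde{y} = \lambda y_A + (1 - \lambda) y_B, \tag{1}$$

where $\mathbf{M} \in \{0, 1\}^{W \times H}$ denotes a binary mask indicating where to drop out and fill in from two images, $\mathbf{1}$ is a binary mask filled with ones, and $\odot$ is element-wise multiplication. Like Mixup [46], the combination ratio $\lambda$ between two data points is sampled from the beta distribution $\mathrm{Beta}(\alpha, \alpha)$. In all our experiments, we set $\alpha$ to $1$; that is, $\lambda$ is sampled from the uniform distribution on $(0, 1)$. Note that the major difference is that CutMix replaces an image region with a patch from another training image and can generate more locally natural images than Mixup.

To sample the binary mask $\mathbf{M}$, we first sample the bounding box coordinates $\mathbf{B} = (r_x, r_y, r_w, r_h)$ indicating the cropping regions on $x_A$ and $x_B$. The region $\mathbf{B}$ in $x_A$ is dropped out and filled with the patch cropped at $\mathbf{B}$ of $x_B$.

In our experiments, we sample rectangular masks $\mathbf{M}$ whose aspect ratio is proportional to the original image. The box coordinates are uniformly sampled according to:

$$r_x \sim \mathrm{Unif}(0, W), \quad r_w = W \sqrt{1 - \lambda}, \qquad r_y \sim \mathrm{Unif}(0, H), \quad r_h = H \sqrt{1 - \lambda}, \tag{2}$$

making the cropped area ratio $\frac{r_w r_h}{W H} = 1 - \lambda$. With the cropping region, the binary mask $\mathbf{M} \in \{0, 1\}^{W \times H}$ is decided by filling with $0$ within the bounding box $\mathbf{B}$, otherwise $1$.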
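The box sampling and patch replacement described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `sample_box` and `cutmix_pair` are hypothetical, images are represented as nested lists of height by width, the sampled point is treated as the box centre, and the box is clipped to the image borders (an assumption), after which the mixing ratio is recomputed from the area that was actually pasted.

```python
import math
import random


def sample_box(width, height, lam, rng=random):
    """Sample a box (x1, y1, x2, y2) covering roughly a (1 - lam) fraction of the image."""
    box_w = int(width * math.sqrt(1.0 - lam))
    box_h = int(height * math.sqrt(1.0 - lam))
    # Box centre drawn uniformly over the image.
    cx = rng.randrange(width)
    cy = rng.randrange(height)
    # Assumption: clip the box to the image borders.
    x1 = max(cx - box_w // 2, 0)
    y1 = max(cy - box_h // 2, 0)
    x2 = min(cx + box_w // 2, width)
    y2 = min(cy + box_h // 2, height)
    return x1, y1, x2, y2


def cutmix_pair(img_a, img_b, lam, rng=random):
    """Paste a box from img_b into a copy of img_a; images are H x W nested lists."""
    height, width = len(img_a), len(img_a[0])
    x1, y1, x2, y2 = sample_box(width, height, lam, rng)
    mixed = [row[:] for row in img_a]
    for y in range(y1, y2):
        mixed[y][x1:x2] = img_b[y][x1:x2]
    # Recompute the ratio from the pasted area, since clipping may shrink the box.
    lam_adj = 1.0 - (x2 - x1) * (y2 - y1) / (width * height)
    return mixed, lam_adj
```

The returned `lam_adj` is the weight of the label of `img_a` in the mixed label; the label of `img_b` receives `1 - lam_adj`.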

Since CutMix is simple to implement and, like the existing data augmentation techniques used in [35, 15], incurs negligible computational overhead, we can efficiently use it to train any network architecture. In each training iteration, a CutMix-ed sample is generated by combining two randomly selected training samples in a mini-batch according to Equation (1). Code-level details are presented in Appendix A.
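Because the mixed label is a convex combination of two one-hot labels, training with the usual cross-entropy loss on the mixed label is equivalent to weighting the two per-class losses by the mixing ratio. The sketch below illustrates this identity for a single sample; the helper names are hypothetical and the code assumes a softmax classifier with integer class labels.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a (possibly soft) target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)


def mixed_label(label_a, label_b, lam, num_classes):
    """Soft label lam * onehot(label_a) + (1 - lam) * onehot(label_b)."""
    target = [0.0] * num_classes
    target[label_a] += lam
    target[label_b] += 1.0 - lam
    return target


def cutmix_loss(logits, label_a, label_b, lam):
    """Weighted sum of the two per-label cross-entropy losses."""
    probs = softmax(logits)
    return (lam * -math.log(probs[label_a])
            + (1.0 - lam) * -math.log(probs[label_b]))
```

Both `cross_entropy(softmax(logits), mixed_label(...))` and `cutmix_loss(...)` give the same value, so either form can be used in a training loop.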

3.2 Discussion

What does the model learn with CutMix? We have motivated CutMix such that full object regions are considered for classification, as Cutout is designed for, while ensuring that two objects are recognized from partial views in a single image to increase training efficiency. To verify that CutMix indeed learns to recognize two objects from their respective partial views in the augmented samples, we visually compare the activation maps for CutMix against Cutout [2] and Mixup [46]. Figure 1 shows example augmentation inputs as well as the corresponding class activation maps (CAM) [50] for the two classes present, Saint Bernard and Miniature Poodle. We use a vanilla ResNet-50 model (the ImageNet-pretrained ResNet-50 provided by PyTorch [28]) to obtain the CAMs, so that the effect of the augmentation method alone can be seen clearly.

Figure 1: Class activation mapping (CAM) [50] visualizations on ‘Saint Bernard’ and ‘Miniature Poodle’ samples using various augmentation techniques. From top to bottom rows, we show the original images, input augmented image, CAM for class ‘Saint Bernard’, and CAM for class ‘Miniature Poodle’, respectively. Note that CutMix can take advantage of the mixed region on image, but Cutout cannot.

| | Mixup | Cutout | CutMix |
|---|---|---|---|
| Usage of full image region | ✓ | ✗ | ✓ |
| Regional dropout | ✗ | ✓ | ✓ |
| Mixed image & label | ✓ | ✗ | ✓ |

Table 2: Comparison among Mixup, Cutout, and CutMix.

We can observe that Cutout successfully lets a model focus on less discriminative parts of the object. For example, the model focuses on the belly of the Saint Bernard in the Cutout-ed sample. We also observe, however, that Cutout makes less efficient use of training data due to uninformative pixels. Mixup, on the other hand, makes full use of pixels, but introduces unnatural artifacts. The CAM for Mixup, as a result, shows the confusion of the model in choosing cues for recognition. We hypothesize that such confusion leads to its suboptimal performance in classification and localization, as we will see in Section 4. CutMix improves upon Cutout by being able to localize the two object classes accurately, whereas Cutout can only deal with one object in a single image. We summarize the comparison among Mixup, Cutout, and CutMix in Table 2.
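The three operations compared above differ only in how pixels (and labels) from two images are combined. The following toy sketch makes the contrast concrete on small nested-list images; the function names and the fixed box are illustrative assumptions rather than anything specified in the paper.

```python
def mixup(img_a, img_b, lam):
    """Blend every pixel: lam * A + (1 - lam) * B."""
    return [[lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]


def in_box(x, y, box):
    """True if pixel (x, y) lies inside box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return x1 <= x < x2 and y1 <= y < y2


def cutout(img, box):
    """Zero the pixels inside box; the label stays unchanged."""
    return [[0.0 if in_box(x, y, box) else v for x, v in enumerate(row)]
            for y, row in enumerate(img)]


def cutmix(img_a, img_b, box):
    """Replace the pixels of A inside box with the same region of B."""
    return [[img_b[y][x] if in_box(x, y, box) else v for x, v in enumerate(row)]
            for y, row in enumerate(img_a)]


def kept_fraction(img, box):
    """Fraction of the image still coming from A after pasting box from B."""
    x1, y1, x2, y2 = box
    return 1.0 - (x2 - x1) * (y2 - y1) / (len(img) * len(img[0]))
```

With a 4 x 4 "dog" image and a 2 x 2 box pasted from a "cat" image, `kept_fraction` gives 0.75, so the CutMix label would be 0.75 dog and 0.25 cat, whereas Cutout keeps the label at dog 1.0 and Mixup uses the blending ratio directly.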

Analysis on validation error: We analyze the effect of CutMix on stabilizing the training of deep networks. We compare the top-1 validation error during the training with CutMix against the baseline. We train ResNet-50 [11] for ImageNet Classification, and PyramidNet-200 [10] for CIFAR-100 Classification. Figure 2 shows the results.

We observe, first of all, that CutMix achieves lower validation errors than the baseline at the end of training. After the halfway point of training, where the learning rate is further reduced, the baseline suffers from overfitting, with increasing validation error. CutMix, on the other hand, shows a steady decrease in validation error, indicating its ability to reduce overfitting by guiding the training with diverse samples.

4 Experiments

In this section, we evaluate CutMix for its capability to improve localizability as well as generalizability of a trained model on multiple tasks. We first study the effect of CutMix on image classification (Section 4.1) and weakly supervised object localization (Section 4.2). Next, we show the transferability of a model pretrained with CutMix when it is fine-tuned for object detection and image captioning tasks (Section 4.3). We also show that CutMix can improve model robustness and alleviate the over-confidence issue in Section 4.4. Throughout the experiments, we verify that CutMix outperforms other state-of-the-art regularization methods on the above tasks, and we further analyze the inner mechanisms behind such superiority.

All experiments were implemented and evaluated on the NAVER Smart Machine Learning (NSML) [18] platform with PyTorch [28]. Code will be released in the near future.

Figure 2: Top-1 test error plot for CIFAR-100 (left) and ImageNet (right) classification. CutMix avoids the overfitting problem and achieves lower test errors than the baseline at the end of training.

4.1 Image Classification

4.1.1 ImageNet Classification

We evaluate on the ImageNet-1K benchmark [30], a dataset containing over 1M training images and 50K validation images of 1K categories. For fair comparison, we use the standard augmentation setting for the ImageNet dataset, such as resizing, cropping, and flipping, as also done in [10, 7, 15, 36]. We found that regularization methods including Stochastic Depth [16], Cutout [2], Mixup [46], and our CutMix require a greater number of training epochs until convergence. Therefore, we trained all the models for $300$ epochs with an initial learning rate of $0.1$, decayed by a factor of $0.1$ at epochs $75$, $150$, and $225$. The batch size is set to $256$. The mixture hyper-parameter $\alpha$ for CutMix is set to $1$.
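A step-decay learning-rate schedule of the kind described above can be written as a small pure function. This is only a sketch: the function name `step_lr` is hypothetical, and the default arguments (base rate 0.1, tenfold decay at epochs 75, 150 and 225) are illustrative assumptions that should be adjusted to the actual training setup.

```python
def step_lr(epoch, base_lr=0.1, milestones=(75, 150, 225), gamma=0.1):
    """Learning rate at a given epoch: multiply by gamma at each milestone passed."""
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** num_decays
```

In a PyTorch training loop, the same behaviour is usually obtained with a multi-step scheduler rather than a hand-written function.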

Model # Params Top-1 Err (%) Top-5 Err (%)
ResNet-152* 60.3 M 21.69 5.94
ResNet-101 + SE Layer* [14] 49.4 M 20.94 5.50
ResNet-101 + GE Layer* [13] 58.4 M 20.74 5.29
ResNet-50 + SE Layer* [14] 28.1 M 22.12 5.99
ResNet-50 + GE Layer* [13] 33.7 M 21.88 5.80
ResNet-50 (Baseline) 25.6 M 23.68 7.05
ResNet-50 + Cutout [2] 25.6 M 22.93 6.66
ResNet-50 + StochDepth [16] 25.6 M 22.46 6.27
ResNet-50 + Mixup [46] 25.6 M 22.58 6.40
ResNet-50 + Manifold Mixup [40] 25.6 M 22.50 6.21
ResNet-50 + DropBlock* [7] 25.6 M 21.87 5.98
ResNet-50 + Feature CutMix 25.6 M 21.80 6.06
ResNet-50 + CutMix 25.6 M 21.60 5.90
Table 3: ImageNet classification results based on ResNet-50 model. ‘*’ denotes results reported in the original papers.

We briefly describe the settings for the baseline augmentation schemes. We set the dropping rate of residual blocks to $0.25$ for the best performance of Stochastic Depth [16]. The mask size for Cutout [2] is set to $112 \times 112$, and the location for dropping out is uniformly sampled. The performance of DropBlock [7] is taken from the original paper; the difference from our setting is the number of training epochs, which is set to $270$. Manifold Mixup [40] applies the Mixup operation on a randomly chosen internal feature map. The hyper-parameter $\alpha$ for Mixup and Manifold Mixup was tested with 0.5 and 1.0, and we selected 1.0, which shows better performance. Conceptually, it is also possible to extend CutMix to feature-level augmentation. To test this, we propose the “Feature CutMix” scheme, which applies the CutMix operation at a randomly chosen layer per minibatch, as Manifold Mixup does.

Model # Params Top-1 Err (%) Top-5 Err (%)
ResNet-101 (Baseline) [11] 44.6 M 21.87 6.29
ResNet-101 + CutMix 44.6 M 20.17 5.24
ResNeXt-101 (Baseline) [43] 44.1 M 21.18 5.57
ResNeXt-101 + CutMix 44.1 M 19.47 5.03
Table 4: Impact of CutMix on ImageNet classification for ResNet-101 and ResNext-101.

Comparison against baseline augmentations: Results are given in Table 3. We observe that our CutMix method achieves the best result, 21.60% top-1 error, among the considered augmentation strategies. CutMix outperforms Cutout and Mixup, the two closest approaches to ours, by 1.33 and 0.98 percentage points, respectively. On the feature level as well, we find CutMix preferable to Mixup, with top-1 errors of 21.80% (Feature CutMix) and 22.50% (Manifold Mixup), respectively.

Comparison against architectural improvements: We have also compared the improvements due to CutMix against those due to architectural changes (greater depth or additional modules). Per Table 3, CutMix reduces the top-1 error of ResNet-50 by 2.08 percentage points, while increased depth (ResNet-50 → ResNet-152) gives 1.99 points, and the SE [14] and GE [13] modules give 1.56 and 1.80 points, respectively. The improvement due to CutMix is more impressive, since it does not require additional parameters or more computation per SGD update (as architectural changes do). CutMix is a competitive data augmentation scheme that requires minimal additional cost.

CutMix for Deeper Models: We have explored the performance of CutMix for the deeper networks, ResNet-101 [11] and ResNeXt-101 (32×4d) [43], on ImageNet. As seen in Table 4, CutMix reduces the top-1 error by 1.70 and 1.71 percentage points, respectively.

4.1.2 CIFAR Classification

Here we describe the results on CIFAR classification. We set the mini-batch size to $64$ and the number of training epochs to $300$. The learning rate was initially set to $0.25$ and decayed by a factor of $0.1$ at epochs $150$ and $225$. To ensure the effectiveness of the proposed method, we used a very strong baseline, PyramidNet-200 [10] with widening factor $\tilde{\alpha} = 240$ and $26.8$ M parameters, which shows state-of-the-art performance on CIFAR-100 (top-1 error of $16.45\%$).

| PyramidNet-200 ($\tilde{\alpha}$=240) (# params: 26.8 M) | Top-1 Err (%) | Top-5 Err (%) |
|---|---|---|
| Baseline | 16.45 | 3.69 |
| + StochDepth [16] | 15.86 | 3.33 |
| + Label smoothing ($\epsilon$=0.1) [37] | 16.73 | 3.37 |
| + Cutout [2] | 16.53 | 3.65 |
| + Cutout + Label smoothing ($\epsilon$=0.1) | 15.61 | 3.88 |
| + DropBlock [7] | 15.73 | 3.26 |
| + DropBlock + Label smoothing ($\epsilon$=0.1) | 15.16 | 3.86 |
| + Mixup ($\alpha$=0.5) [46] | 15.78 | 4.04 |
| + Mixup ($\alpha$=1.0) [46] | 15.63 | 3.99 |
| + Manifold Mixup ($\alpha$=1.0) [40] | 16.14 | 4.07 |
| + Cutout + Mixup ($\alpha$=1.0) | 15.46 | 3.42 |
| + Cutout + Manifold Mixup ($\alpha$=1.0) | 15.09 | 3.35 |
| + ShakeDrop [44] | 15.08 | 2.72 |
| + CutMix | 14.23 | 2.75 |
| + CutMix + ShakeDrop [44] | 13.81 | 2.29 |

Table 5: Comparison of state-of-the-art regularization methods on CIFAR-100.

Table 5 shows the performance comparison with other state-of-the-art data augmentation and regularization methods. All experiments were conducted three times, and the averaged performance is reported.

Hyper-parameter settings: We set the hole size of Cutout [2] to $16 \times 16$. For DropBlock [7], keep_prob and block_size are set to $0.9$ and $4$, respectively. The drop rate for Stochastic Depth [16] is set to 0.25. For Mixup [46], we tested the hyper-parameter $\alpha$ with 0.5 and 1.0. For Manifold Mixup [40], we applied the Mixup operation at a randomly chosen layer per minibatch.

| Model | # Params | Top-1 Err (%) | Top-5 Err (%) |
|---|---|---|---|
| PyramidNet-110 [10] | 1.7 M | 19.85 | 4.66 |
| PyramidNet-110 + CutMix | 1.7 M | 17.97 | 3.83 |
| ResNet-110 [11] | 1.1 M | 23.14 | 5.95 |
| ResNet-110 + CutMix | 1.1 M | 20.11 | 4.43 |

Table 6: Impact of CutMix on lighter architectures on CIFAR-100.

| PyramidNet-200 ($\tilde{\alpha}$=240) | Top-1 Error (%) |
|---|---|
| Baseline | 3.85 |
| + Cutout | 3.10 |
| + Mixup ($\alpha$=1.0) | 3.09 |
| + Manifold Mixup ($\alpha$=1.0) | 3.15 |
| + CutMix | 2.88 |

Table 7: Impact of CutMix on CIFAR-10.

Combination of regularization methods: Going one step beyond validating each regularization method individually, we also tested combinations of the various methods. We found that neither Cutout [2] nor label smoothing [37] improved the accuracy when adopted independently, but they were effective when used simultaneously. DropBlock [7], the generalization of Cutout to feature maps, was also more effective when combined with label smoothing. We observe that Mixup [46] and Manifold Mixup [40] both achieved higher accuracy when Cutout was applied to the input images. The combination of Cutout and Mixup tends to generate locally separated and mixed samples, since the Cutout-ed region has less ambiguity than in vanilla Mixup. Thus, the success of combining Cutout and Mixup suggests that mixing in a cut-and-paste manner is better than interpolation, as we conjectured.

Consequently, we achieved 14.23% top-1 error on CIFAR-100 classification, which is 2.22 percentage points lower than the baseline error rate of 16.45%. We also note that we achieved a new state-of-the-art performance of 13.81% by combining CutMix with ShakeDrop [44], a regularization technique that adds noise to the feature space.

CutMix for various models: Table 6 shows CutMix can also significantly improve over the weaker baselines, such as PyramidNet-110 [10] and ResNet-110.

CutMix for CIFAR-10: We evaluated CutMix on the CIFAR-10 dataset using the same baseline and training setting as for CIFAR-100. The results are given in Table 7. CutMix reduces the top-1 error of the baseline by 0.97 percentage points and outperforms Mixup and Cutout.

4.1.3 Ablation Studies

Figure 3: Impact of $\alpha$ and CutMix layer depth on top-1 error on CIFAR-100.

We conducted an ablation study on the CIFAR-100 dataset using the same experimental settings as in Section 4.1.2. We evaluated CutMix while varying the parameter $\alpha$; the results are given in the left graph of Figure 3. For all tested values of $\alpha$, we achieved better results than the baseline ($16.45\%$), and the best performance was achieved when $\alpha = 1.0$.

The performance of feature-level CutMix is given in the right graph of Figure 3. We changed the location at which CutMix is applied from the image level to the feature level. We denote the layer index as follows: 0 = image level, 1 = after the first conv-bn, 2 = after layer1, 3 = after layer2, 4 = after layer3. CutMix achieved the best performance when it was applied to the input. Still, feature-level CutMix, except in the layer3 case, improves the accuracy over the baseline ($16.45\%$).

| PyramidNet-200 ($\tilde{\alpha}$=240) (# params: 26.8 M) | Top-1 Error (%) | Top-5 Error (%) |
|---|---|---|
| Baseline | 16.45 | 3.69 |
| Proposed (CutMix) | 14.23 | 2.75 |
| Center Gaussian CutMix | 15.95 | 3.40 |
| Fixed-size CutMix | 14.97 | 3.15 |
| One-hot CutMix | 15.89 | 3.32 |
| Scheduled CutMix | 14.72 | 3.17 |

Table 8: Performance of CutMix variants on CIFAR-100.

Table 8 shows the performance of CutMix over various configurations. ‘Center Gaussian CutMix’ samples the box position $(r_x, r_y)$ of Equation (2) from a Gaussian distribution whose mean is the center of the image, instead of a uniform distribution. ‘Fixed-size CutMix’ fixes the size of the cropping region $(r_w, r_h)$ at $16 \times 16$, so that the combination ratio $\lambda$ is always $0.75$. ‘Scheduled CutMix’ changes the probability of applying CutMix during training, as [7, 16] do; we increased the probability linearly from $0$ to $1$ as the training epochs progressed. ‘One-hot CutMix’ denotes the case where the label is not combined as in Equation (1), but set to the single label that occupies the larger portion of the image. The results show that adding a center prior (Center Gaussian CutMix) and fixing the size of the cropping region (Fixed-size CutMix) both lead to performance degradation. Scheduled CutMix shows slightly worse performance than CutMix. One-hot CutMix shows much worse performance than CutMix, highlighting the effect of the combined label.
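Two of the variants above reduce to very small rules, sketched here for concreteness. The function names are hypothetical; the linear schedule endpoints are exposed as parameters, with defaults of 0 and 1 chosen as an illustrative assumption, and ties in the one-hot rule are resolved in favour of the first image (also an assumption).

```python
def scheduled_probability(epoch, total_epochs, start=0.0, end=1.0):
    """Linearly interpolate the probability of applying CutMix over training."""
    if total_epochs <= 1:
        return end
    frac = epoch / (total_epochs - 1)
    return start + (end - start) * frac


def one_hot_cutmix_label(label_a, label_b, lam):
    """Pick the single label whose image occupies the larger area.

    lam is the area fraction retained from image A.
    """
    return label_a if lam >= 0.5 else label_b
```

In a training loop, one would draw a uniform random number per mini-batch and apply CutMix only when it falls below `scheduled_probability(epoch, total_epochs)`.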

4.2 Weakly Supervised Object Localization

| Method | CUB200-2011 Loc Acc (%) | ImageNet Loc Acc (%) |
|---|---|---|
| ResNet-50 | 49.41 | 46.30 |
| Mixup [46] | 49.30 | 45.84 |
| Cutout [2] | 52.78 | 46.69 |
| CutMix | 54.81 | 47.25 |

Table 9: Weakly supervised object localization results on CUB200-2011 and ImageNet.

| Backbone network | ImageNet Cls Top-1 Error (%) | Detection: SSD [23] (mAP) | Detection: Faster-RCNN [29] (mAP) | Image Captioning: NIC [41] (BLEU-1) | Image Captioning: NIC [41] (BLEU-4) |
|---|---|---|---|---|---|
| ResNet-50 (Baseline) | 23.68 | 76.7 (+0.0) | 75.6 (+0.0) | 61.4 (+0.0) | 22.9 (+0.0) |
| Mixup-trained | 22.58 | 76.6 (-0.1) | 73.9 (-1.7) | 61.6 (+0.2) | 23.2 (+0.3) |
| Cutout-trained | 22.93 | 76.8 (+0.1) | 75.0 (-0.6) | 63.0 (+1.6) | 24.0 (+1.1) |
| CutMix-trained | 21.60 | 77.6 (+0.9) | 76.7 (+1.1) | 64.2 (+2.8) | 24.9 (+2.0) |

Table 10: Impact of CutMix on transfer learning of pretrained model to other tasks, object detection and image captioning.

The weakly supervised object localization (WSOL) task aims to train a classifier to localize target objects using only class labels. To localize the target well, it is important to make CNNs look at the full object region rather than focus on a discriminative part of the target. That is, learning spatially distributed representations is the key to improving performance on the WSOL task. Thus, here we measure how well CutMix learns spatially distributed representations compared with other baselines by conducting the WSOL task. We followed the training and evaluation strategy of the existing WSOL methods [32, 47, 48]. ResNet-50 is used as the base model. The model is initialized from an ImageNet-pretrained model before training, and is modified to enlarge the final convolutional feature map size to $14 \times 14$ from $7 \times 7$. Then, the model is fine-tuned on the CUB200-2011 [42] and ImageNet-1K [30] datasets using only class labels. At evaluation, we use class activation mappings (CAM) [50] to estimate the bounding box of an object. The quantitative and qualitative results are given in Table 9 and Figure 4, respectively. The implementation details are in Appendix B.
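To make the evaluation step concrete, the sketch below turns a CAM into a bounding box by thresholding and scores it against a ground-truth box with intersection over union (IoU). It is a simplified illustration under stated assumptions: the function names are hypothetical, the threshold ratio of 0.15 is a placeholder rather than the value used in the paper, and the box is drawn around all above-threshold cells instead of the largest connected component that WSOL pipelines typically use.

```python
def cam_to_box(cam, threshold_ratio=0.15):
    """Tightest box (x1, y1, x2, y2) around CAM cells >= threshold_ratio * max."""
    peak = max(max(row) for row in cam)
    cutoff = threshold_ratio * peak
    ys = [y for y, row in enumerate(cam) for v in row if v >= cutoff]
    xs = [x for row in cam for x, v in enumerate(row) if v >= cutoff]
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

A prediction is commonly counted as correctly localized when the predicted class is right and the IoU with the ground-truth box is at least 0.5.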

Comparison against Mixup and Cutout: CutMix outperforms Mixup [46] by 5.51 and 1.41 percentage points on the CUB200-2011 and ImageNet datasets, respectively. We observe that Mixup degrades the localization accuracy relative to the baseline and tends to focus on a small region of the image, as shown in Figure 4. As we hypothesized in Section 3.2, Mixup samples are ambiguous, so a CNN trained with those samples tends to focus on the most discriminative part for classification, which degrades its localization ability. Although Cutout [2] improves the localization accuracy over the baseline, CutMix still outperforms Cutout by 2.03 and 0.56 percentage points on the CUB200-2011 and ImageNet datasets, respectively.

Furthermore, CutMix achieved comparable localization accuracy on CUB200-2011 and ImageNet dataset compared with state-of-the-art methods [50, 47, 48] that focus on learning spatially distributed representations.

Figure 4: Qualitative comparison of the baseline (ResNet-50), Mixup, Cutout, and CutMix for weakly supervised object localization task on CUB-200-2011 dataset. Ground truth and the estimated results are denoted as red and green, respectively.

4.3 Transfer Learning of Pretrained Model

In this section, we check the generalization ability of CutMix by transferring task from image classification to other computer vision tasks such as object detection and image captioning, which require the localization ability of the learned CNN feature. For each task, we replace the backbone network of the original model with various ImageNet-pretrained models using Mixup [46], Cutout [2], and our CutMix. Then the model is finetuned for each task and we validate the performance improvement of CutMix over the original backbone and other baselines. ResNet-50 model is used as a baseline.

Transferring to Pascal VOC object detection: Two popular detection models, SSD [23] and Faster RCNN [29], are used for the experiment. Originally the two methods used VGG-16 as the backbone network, but we changed it to ResNet-50. The ResNet-50 backbone is initialized with various ImageNet-pretrained models, fine-tuned on Pascal VOC 2007 and 2012 [5] trainval data, and evaluated on Pascal VOC 2007 test data using the mAP metric. We follow the fine-tuning strategy of the original methods [23, 29]; the implementation details are in Appendix C. The results are presented in Table 10. The backbone models pretrained with Cutout and Mixup failed to improve the performance on the object detection task over the original model. However, the backbone pretrained with CutMix clearly improves the performance of both SSD and Faster-RCNN. This shows that our method can train CNNs whose generalization ability carries over to object detection.

(a) Analysis for occluded samples
(b) Analysis for in-between class samples
Figure 5: Robustness experiments on ImageNet validation set.

Transferring to MS-COCO image captioning: We used Neural Image Caption (NIC) [41] as the base model for the image captioning experiment. We changed the backbone network of the encoder from GoogLeNet [41] to ResNet-50. The backbone network is initialized with ImageNet-pretrained models, and then we trained and evaluated NIC on the MS-COCO dataset [22]. The implementation details and other evaluation metrics (METEOR, CIDEr, etc.) are in Appendix D. Table 10 shows the resulting performance for each pretrained model. CutMix outperforms Mixup and Cutout in both BLEU1 and BLEU4 metrics.

Note that simply replacing backbone network with our CutMix-pretrained model gives clear performance gains for object detection and image captioning tasks.

4.4 Robustness and Uncertainty

Many studies have shown that deep models are easily fooled by small and unrecognizable perturbations to the input image, known as adversarial attacks [8, 38]. One straightforward way to enhance robustness and uncertainty estimates is input augmentation by generating unseen samples [25]. We evaluate the robustness and uncertainty improvements obtained by input augmentation methods, including Mixup, Cutout, and CutMix, compared to the baseline.

Robustness: We evaluate the robustness of the trained models to adversarial samples, occluded samples, and in-between class samples. We use ImageNet-pretrained ResNet-50 models with the same setting as in Section 4.1.1.

Fast Gradient Sign Method (FGSM) [8] is used to generate adversarial perturbations and we assume that the adversary has full information of the models, which is called white-box attack. We report top-1 accuracy after attack on ImageNet validation set in Table 11. CutMix significantly improves the robustness to adversarial attacks compared to other augmentation methods.
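FGSM perturbs the input by a fixed step in the direction of the sign of the loss gradient with respect to the input. The sketch below shows the update on a toy binary logistic model, where the input gradient has a closed form; it is not the ResNet-50 setup used in the experiments, and the function names and the `epsilon` value passed by the caller are illustrative assumptions.

```python
import math


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def logistic_loss(w, x, y):
    """Loss log(1 + exp(-y * w.x)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-margin))


def input_gradient(w, x, y):
    """Gradient of logistic_loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def fgsm(w, x, y, epsilon):
    """One-step attack: x + epsilon * sign(grad_x loss)."""
    grad = input_gradient(w, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]
```

Each coordinate of the adversarial input differs from the original by at most `epsilon`, and for this model the loss on the perturbed input is never lower than on the clean one.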

For the occlusion experiments, we generate occluded samples in two ways: center occlusion, by filling a center hole with zeros, and boundary occlusion, by filling everything outside the hole with zeros. In Figure 5(a), we measure the top-1 error over a range of hole sizes. In both occlusion scenarios, Cutout and CutMix achieve significant improvements in robustness, while Mixup barely improves robustness to occlusion. Interestingly, CutMix achieves performance almost comparable to Cutout even though, unlike Cutout, CutMix did not observe occluded samples during the training stage.

Finally, we evaluate the top-1 error of Mixup and CutMix in-between samples. The probability of predicting neither of the two classes as the combination ratio varies is illustrated in Figure 5(b). We randomly select in-between samples from the ImageNet validation set. In both in-between class sample experiments, Mixup and CutMix improve the performance of the network, while the improvement from Cutout is almost negligible. As in the occlusion experiments, CutMix even improves the robustness to the unseen Mixup in-between class samples.

                 Baseline   Mixup   Cutout   CutMix
Top-1 Acc (%)    8.2        24.4    11.5     31.0
Table 11: Top-1 accuracy after FGSM white-box attack on ImageNet validation set.

Method     TNR at TPR 95%   AUROC          Detection Acc.
Baseline   26.3 (+0)        87.3 (+0)      82.0 (+0)
Mixup      11.8 (-14.5)     49.3 (-38.0)   60.9 (-21.0)
Cutout     18.8 (-7.5)      68.7 (-18.6)   71.3 (-10.7)
CutMix     69.0 (+42.7)     94.4 (+7.1)    89.1 (+7.1)
Table 12: Out-of-distribution (OOD) detection results with CIFAR-100 trained models. Results are averaged on seven datasets. All numbers are in percents; higher is better.

Uncertainty: We measure the performance of the out-of-distribution (OOD) detectors proposed by [12], which determine whether a sample is in- or out-of-distribution by score thresholding. We use PyramidNet-200 trained on CIFAR-100 with the same settings as in Section 4.1.2. In Table 12, we report OOD detection performance averaged over seven out-of-distribution datasets from [12, 21], including TinyImageNet, LSUN [45], uniform noise, Gaussian noise, etc. More results are given in Appendix E. Note that Mixup and Cutout seriously impair the baseline performance; in other words, Mixup and Cutout augmentations aggravate the overconfidence issue of the base network. Meanwhile, our proposed CutMix significantly outperforms the baseline.

5 Conclusion

In this work, we introduced CutMix for training CNNs with strong classification and localization ability. CutMix is simple, easy to implement, and has no computational overhead, yet it is surprisingly effective on various tasks. On ImageNet classification, applying CutMix to ResNet-50 and ResNet-101 improves top-1 accuracy for both networks. On CIFAR classification, CutMix also significantly improves the baseline performance and achieves state-of-the-art top-1 error. On weakly supervised object localization (WSOL), CutMix enhances the localization accuracy and achieves localization performance comparable to state-of-the-art WSOL methods without using any WSOL techniques. Furthermore, simply using a CutMix-ImageNet-pretrained model as the initial backbone for object detection and image captioning brings overall performance improvements. Lastly, CutMix improves performance on robustness and uncertainty benchmarks compared to the other augmentation methods.


We are grateful to Clova AI members for valuable discussions, and to Ziad Al-Halah for proofreading the manuscript.


Appendix A CutMix Algorithm

1: for each iteration do
2:     input, target = get_minibatch(dataset)    ▷ input is N×C×W×H size tensor, target is N×K size tensor.
3:     if mode == training then
4:         input_s, target_s = shuffle_minibatch(input, target)    ▷ CutMix starts here.
5:         lambda = Unif(0,1)
6:         r_x = Unif(0,W)
7:         r_y = Unif(0,H)
8:         r_w = W * Sqrt(1 - lambda)
9:         r_h = H * Sqrt(1 - lambda)
10:         x1 = Round(Clip(r_x - r_w / 2, min=0))
11:         x2 = Round(Clip(r_x + r_w / 2, max=W))
12:         y1 = Round(Clip(r_y - r_h / 2, min=0))
13:         y2 = Round(Clip(r_y + r_h / 2, max=H))
14:         input[:, :, x1:x2, y1:y2] = input_s[:, :, x1:x2, y1:y2]
15:         target = lambda * target + (1 - lambda) * target_s    ▷ CutMix ends.
16:     end if
17:     output = model_forward(input)
18:     loss = compute_loss(output, target)
19:     model_update()
20: end for
Algorithm A1 Pseudo-code of CutMix

We present the code-level description of the CutMix algorithm in Algorithm A1. N, C, and K denote the minibatch size, the number of input channels, and the number of classes, respectively; W and H denote the width and height of the input image. First, CutMix shuffles the order of the minibatch input and target along the first axis of the tensors. Then lambda and the cropping region (x1, x2, y1, y2) are sampled. Next, we mix input and input_s by replacing the cropping region of input with the same region of input_s. The target label is mixed as well, by linear interpolation with the same lambda.

Note that CutMix is easy to implement in a few lines (from line 4 to line 15), which makes it a very practical algorithm with significant impact on a wide range of tasks.
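As an illustration of the pseudo-code, the following is a minimal pure-Python sketch of CutMix for a single pair of samples rather than a shuffled minibatch. The function names `cutmix_box` and `cutmix_pair`, the nested-list image layout `[W][H]` (single channel), and the one-hot label lists are assumptions made for this example and are not part of the original implementation. As in the pseudo-code, the label keeps the sampled lambda even when clipping shrinks the pasted box.

```python
import math
import random


def cutmix_box(width, height, lam, rng=random):
    """Sample the box (x1, x2, y1, y2) as in the CutMix pseudo-code."""
    r_x = rng.uniform(0, width)            # box center, x
    r_y = rng.uniform(0, height)           # box center, y
    r_w = width * math.sqrt(1.0 - lam)     # box width before clipping
    r_h = height * math.sqrt(1.0 - lam)    # box height before clipping
    x1 = int(round(max(r_x - r_w / 2, 0)))
    x2 = int(round(min(r_x + r_w / 2, width)))
    y1 = int(round(max(r_y - r_h / 2, 0)))
    y2 = int(round(min(r_y + r_h / 2, height)))
    return x1, x2, y1, y2


def cutmix_pair(image_a, label_a, image_b, label_b, lam, rng=random):
    """Paste a box from image_b into a copy of image_a and mix the labels.

    Images are nested lists indexed [x][y] (hypothetical layout for this
    sketch); labels are one-hot lists of equal length.
    """
    width, height = len(image_a), len(image_a[0])
    x1, x2, y1, y2 = cutmix_box(width, height, lam, rng)
    mixed = [column[:] for column in image_a]   # copy, do not modify image_a
    for x in range(x1, x2):
        for y in range(y1, y2):
            mixed[x][y] = image_b[x][y]
    target = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, target, (x1, x2, y1, y2)


if __name__ == "__main__":
    rng = random.Random(0)
    img_a = [[0] * 8 for _ in range(8)]
    img_b = [[1] * 8 for _ in range(8)]
    lam = rng.random()
    mixed, target, box = cutmix_pair(img_a, [1.0, 0.0], img_b, [0.0, 1.0], lam, rng)
    print(box, target)
```

In a real training loop the same box would be applied to every sample of the minibatch and its shuffled copy, as lines 4 to 15 of the pseudo-code do with tensor slicing.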

Appendix B Weakly-supervised Object Localization

We describe the training and evaluation procedure in detail.

Network modification: Weakly-supervised object localization (WSOL) basically follows the same training strategy as image classification, and training starts from an ImageNet-pretrained model. Compared with the base network structure (ResNet-50 [11]), WSOL uses a feature map with a larger spatial size than that of the original ResNet-50. To enlarge the spatial size, we modify the base network's final residual block (layer4) to have no stride, whereas the original block is strided. Since the network is modified and the target dataset could be different from ImageNet [30], the last fully-connected layer is randomly initialized with a final output dimension of 200 and 1000 for CUB200-2011 [42] and ImageNet, respectively.

Input image transformation: For fair comparison, apart from Mixup, Cutout, and CutMix, we used the same data augmentation strategy as the state-of-the-art WSOL methods [32, 47]. In training, the input image is resized and randomly cropped images are used to train the network. In testing, the input image is resized, cropped at the center, and used to validate the network, which is called the single-crop strategy.

Estimating bounding box: We utilize class activation mapping (CAM) [50] to estimate the bounding box of an object. First, we compute the CAM of an image. Next, we decide the foreground region of the image by binarizing the CAM with a specific threshold: regions with intensity over the threshold are set to 1, and all others to 0. The threshold is defined as a specific rate of the maximum intensity of the CAM, and the same rate is used for all our experiments. From the binarized foreground map, the tightest box that covers the largest connected region is selected as the bounding box for WSOL.
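The bounding-box estimation above can be sketched in a few lines of pure Python. This is an illustrative sketch, not the authors' implementation: the function name `cam_to_bbox`, the nested-list CAM layout, and the use of 4-connectivity for the connected regions are assumptions, and the threshold rate is left as a parameter because its value is not given here.

```python
from collections import deque


def cam_to_bbox(cam, rate):
    """Return (row_min, col_min, row_max, col_max) of the largest
    connected foreground region of a CAM, or None if there is none.

    cam  : nested list [rows][cols] of activation intensities
    rate : threshold as a fraction of the maximum CAM intensity
    """
    rows, cols = len(cam), len(cam[0])
    threshold = rate * max(max(row) for row in cam)
    # Binarize: 1 where intensity is over the threshold, else 0.
    mask = [[1 if v > threshold else 0 for v in row] for row in cam]

    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first search over 4-connected foreground pixels
            # (the connectivity choice is an assumption of this sketch).
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                cr, cc = queue.popleft()
                region.append((cr, cc))
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(region) > len(best):
                best = region

    if not best:
        return None
    region_rows = [p[0] for p in best]
    region_cols = [p[1] for p in best]
    # Tightest box covering the largest connected region.
    return min(region_rows), min(region_cols), max(region_rows), max(region_cols)
```

The returned box is in CAM coordinates; in practice it would be rescaled to the input image resolution before comparison with the ground truth.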

Evaluation metric: To measure the localization accuracy of models, we report top-1 localization accuracy (Loc), which is used in the ImageNet localization challenge [30]. Under top-1 localization accuracy, an estimation counts as correct only if the intersection-over-union (IoU) between the estimated bounding box and the ground-truth position exceeds the IoU threshold and, at the same time, the estimated class label is correct. Otherwise, the estimation is counted as wrong.
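The metric can be written down compactly. The sketch below is illustrative: the function names, the `(x1, y1, x2, y2)` box convention, and the tuple layout of `samples` are assumptions for this example, and the IoU threshold is passed as a parameter rather than hard-coded.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top1_loc_correct(pred_box, gt_box, pred_label, gt_label, iou_threshold):
    """Correct only if the class is right AND the IoU exceeds the threshold."""
    return pred_label == gt_label and iou(pred_box, gt_box) > iou_threshold


def top1_loc_accuracy(samples, iou_threshold):
    """samples: list of (pred_box, gt_box, pred_label, gt_label) tuples."""
    hits = sum(top1_loc_correct(*s, iou_threshold) for s in samples)
    return hits / len(samples)
```

Because both conditions must hold, top-1 localization accuracy can never exceed top-1 classification accuracy on the same set of images.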

B.1 CUB200-2011

The CUB-200-2011 dataset [42] contains over 11 K images with 200 categories of birds. We set the number of training epochs to . The learning rate for the last fully-connected layer and the other layers were set to and , respectively. The learning rate is decayed by a factor of at every epochs. We used the SGD optimizer, and the minibatch size, momentum, and weight decay were set to , , and . Table A1 shows that our model, ResNet-50 + CutMix, achieves 54.81% localization accuracy on the CUB200-2011 dataset, which outperforms other state-of-the-art WSOL methods [50, 47, 48].

B.2 ImageNet dataset

ImageNet-1K [30] is a large-scale dataset for general objects consisting of 1.3 M training samples and 50 K validation samples. We set the number of training epochs to . The learning rate for the last fully-connected layer and the other layers were set to and , respectively. The learning rate is decayed by a factor of at every epochs. We used the SGD optimizer, and the minibatch size, momentum, and weight decay were set to , , and . In Table A1, our model, ResNet-50 + CutMix, achieves 47.25% localization accuracy on the ImageNet-1K dataset, which is comparable to other state-of-the-art WSOL methods [50, 47, 48].

Method                      CUB200-2011      ImageNet-1K
                            Top-1 Loc (%)    Top-1 Loc (%)
VGG16 + CAM* [50]           -                42.80
VGG16 + ACoL* [47]          45.92            45.83
InceptionV3 + SPG* [48]     46.64            48.60
ResNet-50                   49.41            46.30
ResNet-50 + Mixup           49.30            45.84
ResNet-50 + Cutout          52.78            46.69
ResNet-50 + CutMix          54.81            47.25
Table A1: Weakly supervised object localization results on CUB200-2011 and ImageNet-1K dataset. * denotes results reported in the original papers.

Appendix C Transfer Learning to Object Detection

We evaluate the models on the Pascal VOC 2007 detection benchmark [5] with 5 K test images over 20 object categories. For training, we use both VOC2007 and VOC2012 trainval (VOC07+12).

Finetuning on SSD [23]: The input image is resized to 300×300 (SSD300) and we used the basic training strategy of the original paper, such as data augmentation, prior boxes, and extra layers. Since the backbone network is changed from VGG16 to ResNet-50, the pooling location conv4_3 of VGG16 is replaced by the output of layer2 of ResNet-50. For training, we set the batch size, learning rate, and training iterations to , , and K, respectively. The learning rate is decayed by a factor of at K and K iterations.

Finetuning on Faster-RCNN [29]: Faster-RCNN has a fully-convolutional structure, so we only change the backbone from VGG16 to ResNet-50. The batch size, learning rate, and training iterations are set to , , and K. The learning rate is decayed by a factor of at K iterations.

Appendix D Transfer Learning to Image Captioning

ResNet-50 (Baseline) 61.4 43.8 31.4 22.9 22.8 44.7 71.2
ResNet-50 + Mixup 61.6 44.1 31.6 23.2 22.9 47.9 72.2
ResNet-50 + Cutout 63.0 45.3 32.6 24.0 22.6 48.2 74.1
ResNet-50 + CutMix 64.2 46.3 33.6 24.9 23.1 49.0 77.6
Table A2: Image captioning results on MS-COCO dataset.

The MS-COCO dataset [22] contains K trainval images and K test images. Starting from the base model NIC [41], the backbone model is changed from GoogLeNet to ResNet-50. For training, we set the batch size, learning rate, and training epochs to , , and , respectively. For evaluation, the beam size is set to for all the experiments. Image captioning results with various metrics are shown in Table A2.

Appendix E Robustness and Uncertainty

In this section, we describe the details of the experimental setting and evaluation methods.

E.1 Robustness

We evaluate the model robustness to adversarial perturbations, occlusion and in-between samples using ImageNet trained models. For the base models, we use ResNet-50 structure and follow the settings in Section 4.1.1. For comparison, we use ResNet-50 trained without any additional regularization or augmentation techniques, ResNet-50 trained by Mixup strategy, ResNet-50 trained by Cutout strategy and ResNet-50 trained by our proposed CutMix strategy.

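Fast Gradient Sign Method (FGSM): We employ the Fast Gradient Sign Method (FGSM) [8] to generate adversarial samples. For a given image $x$, ground-truth label $y$ and noise size $\epsilon$, FGSM generates an adversarial sample $\hat{x}$ as follows:

$$\hat{x} = x + \epsilon \, \mathrm{sign}\left(\nabla_x L(\theta, x, y)\right),$$

where $L(\theta, x, y)$ denotes a loss function of the model with parameters $\theta$, for example, the cross-entropy function. In our experiments, we use a single fixed noise scale $\epsilon$.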
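To make the FGSM step concrete, here is a small self-contained sketch. It is not the setup used in the experiments: instead of ResNet-50, it uses a toy logistic-regression model (weights, bias, and the function names are hypothetical) because its input gradient of the cross-entropy loss has the closed form (p - y) * w, which keeps the example dependency-free.

```python
import math


def sign(value):
    """Return -1, 0 or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def cross_entropy(x, y, weights, bias):
    """Binary cross-entropy of a logistic model on one sample (y is 0 or 1)."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def fgsm_logistic(x, y, weights, bias, epsilon):
    """One FGSM step: move x by epsilon along the sign of the loss gradient."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    p = 1.0 / (1.0 + math.exp(-logit))
    # Gradient of the cross-entropy loss with respect to the input.
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


if __name__ == "__main__":
    w, b = [2.0, -1.0], 0.0
    x, y = [0.5, -0.2], 1
    x_adv = fgsm_logistic(x, y, w, b, epsilon=0.1)
    print(x_adv, cross_entropy(x, y, w, b), cross_entropy(x_adv, y, w, b))
```

For a deep network the only change is that the input gradient comes from backpropagation instead of a closed form; the perturbation rule itself is identical.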

Occlusion: For a given hole size, we make a square hole of that width and height in the center of the image. For center-occluded samples, we zero out the inside of the hole, and for boundary-occluded samples, we zero out the outside of the hole. In our experiments, we test the top-1 ImageNet validation accuracy of the models while varying the hole size.
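The two occlusion modes can be sketched as follows. The function name `occlude`, the `mode` strings, and the nested-list `[rows][cols]` single-channel image layout are assumptions of this example.

```python
def occlude(image, hole_size, mode):
    """Return an occluded copy of a [rows][cols] image.

    mode == "center"   : zero out the inside of a centered square hole
    mode == "boundary" : zero out everything outside that hole
    """
    if mode not in ("center", "boundary"):
        raise ValueError("mode must be 'center' or 'boundary'")
    height, width = len(image), len(image[0])
    top = (height - hole_size) // 2
    left = (width - hole_size) // 2
    occluded = []
    for r in range(height):
        row = []
        for c in range(width):
            inside = top <= r < top + hole_size and left <= c < left + hole_size
            keep = (not inside) if mode == "center" else inside
            row.append(image[r][c] if keep else 0)
        occluded.append(row)
    return occluded
```

With a hole size of zero, center occlusion leaves the image unchanged while boundary occlusion removes everything; with a hole as large as the image, the roles are reversed.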

In-between class samples: To generate in-between class samples, we first sample pairs of images from the ImageNet validation set. To generate Mixup samples, we linearly interpolate the two images of the selected pair with a combination ratio. We report the top-1 accuracy on the Mixup samples while varying the combination ratio. To generate CutMix in-between samples, we employ a center mask instead of a random mask, following the hole generation process used in the occlusion experiments. We evaluate the top-1 accuracy on the CutMix samples while varying the hole size.
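Both kinds of in-between samples can be generated with a few lines. The sketch below is illustrative only: the function names, the nested-list `[rows][cols]` image layout, and the convention that the ratio `lam` weights the first image are assumptions. The helper `predicts_neither` mirrors the "neither of the two classes" criterion used in the robustness analysis.

```python
def mixup_sample(image_a, image_b, lam):
    """Pixel-wise linear interpolation: lam * a + (1 - lam) * b."""
    return [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


def center_cutmix_sample(image_a, image_b, hole_size):
    """Paste a centered square patch of image_b into a copy of image_a."""
    height, width = len(image_a), len(image_a[0])
    top = (height - hole_size) // 2
    left = (width - hole_size) // 2
    mixed = [row[:] for row in image_a]
    for r in range(top, top + hole_size):
        for c in range(left, left + hole_size):
            mixed[r][c] = image_b[r][c]
    return mixed


def predicts_neither(pred_label, label_a, label_b):
    """True if the prediction matches neither source class."""
    return pred_label not in (label_a, label_b)
```

Sweeping `lam` or `hole_size` over a grid and averaging `predicts_neither` over all pairs gives one curve per model.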

Method     TNR at TPR 95%    AUROC           Detection Acc.    TNR at TPR 95%    AUROC           Detection Acc.
           TinyImageNet (crop)                                 TinyImageNet (resize)
Baseline   43.0 (0.0)        88.9 (0.0)      81.3 (0.0)        29.8 (0.0)        84.2 (0.0)      77.0 (0.0)
Mixup      22.6 (-20.4)      71.6 (-17.3)    69.8 (-11.5)      12.3 (-17.5)      56.8 (-27.4)    61.0 (-16.0)
Cutout     30.5 (-12.5)      85.6 (-3.3)     79.0 (-2.3)       22.0 (-7.8)       82.8 (-1.4)     77.1 (+0.1)
CutMix     57.1 (+14.1)      92.4 (+3.5)     85.0 (+3.7)       55.4 (+25.6)      91.9 (+7.7)     84.5 (+7.5)
           LSUN (crop)                                         LSUN (resize)
Baseline   34.6 (0.0)        86.5 (0.0)      79.5 (0.0)        34.3 (0.0)        86.4 (0.0)      79.0 (0.0)
Mixup      22.9 (-11.7)      76.3 (-10.2)    72.3 (-7.2)       13.0 (-21.3)      59.0 (-27.4)    61.8 (-17.2)
Cutout     33.2 (-1.4)       85.7 (-0.8)     78.5 (-1.0)       23.7 (-10.6)      84.0 (-2.4)     78.4 (-0.6)
CutMix     47.6 (+13.0)      90.3 (+3.8)     82.8 (+3.3)       62.8 (+28.5)      93.7 (+7.3)     86.7 (+7.7)
           iSUN
Baseline   32.0 (0.0)        85.1 (0.0)      77.8 (0.0)
Mixup      11.8 (-20.2)      57.0 (-28.1)    61.0 (-16.8)
Cutout     22.2 (-9.8)       82.8 (-2.3)     76.8 (-1.0)
CutMix     60.1 (+28.1)      93.0 (+7.9)     85.7 (+7.9)
           Uniform                                             Gaussian
Baseline   0.0 (0.0)         89.2 (0.0)      89.2 (0.0)        10.4 (0.0)        90.7 (0.0)      89.9 (0.0)
Mixup      0.0 (0.0)         0.8 (-88.4)     50.0 (-39.2)      0.0 (-10.4)       23.4 (-67.3)    50.5 (-39.4)
Cutout     0.0 (0.0)         35.6 (-53.6)    59.1 (-30.1)      0.0 (-10.4)       24.3 (-66.4)    50.0 (-39.9)
CutMix     100.0 (+100.0)    99.8 (+10.6)    99.7 (+10.5)      100.0 (+89.6)     99.7 (+9.0)     99.0 (+9.1)
Table A3: Out-of-distribution (OOD) detection results on TinyImageNet, LSUN, iSUN, Gaussian noise and Uniform noise using CIFAR-100 trained models. All numbers are in percents; higher is better.

E.2 Uncertainty

Deep neural networks are often overconfident in their predictions. For example, deep neural networks produce high confidence scores even for random noise [12]. One standard benchmark for evaluating the overconfidence of a network is out-of-distribution (OOD) detection, proposed by [12]. The authors proposed a threshold-based detector that solves the binary classification task of separating in-distribution from out-of-distribution samples using the prediction of the given network. Recently, a number of studies have been proposed to enhance the performance of the baseline detector [21, 20], but in this paper we follow only the baseline detector algorithm, without any input enhancement or temperature scaling [21].

Setup: We compare the OOD detector performance using CIFAR-100 trained models described in Section 4.1.2. For comparison, we use PyramidNet-200 model without any regularization method, PyramidNet-200 model with Mixup, PyramidNet-200 model with Cutout and PyramidNet-200 model with our proposed CutMix.

Evaluation Metrics and Out-of-distributions: In this work, we follow the experimental setting used in [12, 21]. To measure the performance of the OOD detector, we report the true negative rate (TNR) at 95% true positive rate (TPR), the area under the receiver operating characteristic curve (AUROC) and the detection accuracy of each OOD detector. We use seven datasets for out-of-distribution: TinyImageNet (crop), TinyImageNet (resize), LSUN [45] (crop), LSUN (resize), iSUN, Uniform noise and Gaussian noise.
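The three metrics can be computed directly from detector scores. The sketch below makes several assumptions that are not stated in the text: a higher score means "more in-distribution" (for example, the maximum softmax probability), in-distribution samples are the positive class, and detection accuracy is taken as the best value of 0.5 * (TPR + TNR) over all thresholds, i.e. with equal class priors. The function names are hypothetical.

```python
import math


def tnr_at_tpr(in_scores, out_scores, tpr=0.95):
    """TNR on OOD samples at the threshold that keeps at least `tpr`
    of the in-distribution samples (score >= threshold)."""
    ranked = sorted(in_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))
    threshold = ranked[k - 1]
    return sum(s < threshold for s in out_scores) / len(out_scores)


def auroc(in_scores, out_scores):
    """Probability that a random in-distribution sample scores higher
    than a random OOD sample; ties count as one half."""
    wins = 0.0
    for s_in in in_scores:
        for s_out in out_scores:
            if s_in > s_out:
                wins += 1.0
            elif s_in == s_out:
                wins += 0.5
    return wins / (len(in_scores) * len(out_scores))


def detection_accuracy(in_scores, out_scores):
    """Best balanced accuracy 0.5 * (TPR + TNR) over all thresholds
    (assumed definition for this sketch)."""
    best = 0.0
    for threshold in sorted(set(in_scores) | set(out_scores)):
        tpr = sum(s >= threshold for s in in_scores) / len(in_scores)
        tnr = sum(s < threshold for s in out_scores) / len(out_scores)
        best = max(best, 0.5 * (tpr + tnr))
    return best
```

The pairwise AUROC loop is quadratic in the number of samples, which is fine for a sketch; a rank-based computation would be preferable for full test sets.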

Results: We report OOD detector performance on the seven OOD datasets in Table A3. Overall, CutMix outperforms the baseline, Mixup and Cutout. Moreover, we find that even though Mixup and Cutout outperform the baseline in classification performance, they largely degrade the baseline detector performance. In particular, for Uniform noise and Gaussian noise, Mixup and Cutout seriously impair the baseline performance while CutMix dramatically improves it. From these experiments, we observe that our proposed CutMix enhances the OOD detector performance, while Mixup and Cutout produce more overconfident predictions on OOD samples than the baseline.