Pure Noise to the Rescue of Insufficient Data: Improving Imbalanced Classification by Training on Random Noise Images

12/16/2021
by   Shiran Zada, et al.

Despite remarkable progress on visual recognition tasks, deep neural-nets still struggle to generalize well when training data is scarce or highly imbalanced, rendering them extremely vulnerable to real-world examples. In this paper, we present a surprisingly simple yet highly effective method to mitigate this limitation: using pure noise images as additional training data. Unlike the common use of additive noise or adversarial noise for data augmentation, we propose an entirely different perspective by directly training on pure random noise images. We present a new Distribution-Aware Routing Batch Normalization layer (DAR-BN), which enables training on pure noise images in addition to natural images within the same network. This encourages generalization and suppresses overfitting. Our proposed method significantly improves imbalanced classification performance, obtaining state-of-the-art results on a large variety of long-tailed image classification datasets (CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, Places-LT, and CelebA-5). Furthermore, our method is extremely simple and easy to use as a general new augmentation tool (on top of existing augmentations), and can be incorporated in any training scheme. It does not require any specialized data generation or training procedures, thus keeping training fast and efficient.


1 Introduction

Large-scale annotated datasets play a vital role in the success of deep neural networks for visual recognition tasks. While popular benchmark datasets are usually well-balanced (e.g., CIFAR [krizhevsky2009learning], Places [zhou2017places], ImageNet [deng2009imagenet]), data in the real world often follows a long-tail distribution. Namely, most of the data belongs to several majority classes, while the rest is spread across a large number of minority classes [buda2018systematic, reed2001pareto, liu2019large]. Training on such imbalanced datasets results in models that are biased towards majority classes, demonstrating poor generalization on minority classes. There are two common approaches to compensate for class imbalance during training: (i) re-weighting the loss term so that prediction errors on minority samples are given higher penalties [huang2016learning, cui2019class, hong2021disentangling], and (ii) resampling the dataset to re-balance the class distribution during training [chawla2002smote, kim2020m2m, mullick2019generative]. The latter can be done by under-sampling majority classes [drummond2003c4], or by over-sampling minority classes [shen2016relay, haixiang2017learning, kang2019decoupling].

However, re-weighting methods typically suffer from overfitting the minority classes [kim2020m2m]. Resampling techniques also suffer from well-known limitations: under-sampling majority classes may impair classification accuracy due to loss of information, while over-sampling leads to overfitting on minority classes [buda2018systematic]. Several methods have been proposed to alleviate these limitations, including augmentation-based methods [chu2020feature, mullick2019generative, liu2020deep], learning from majority classes [kim2020m2m] and evolutionary under-sampling [galar2013eusboost].

The data scarcity in minority classes thus poses a very challenging problem [kim2020m2m]. This is especially true in highly imbalanced datasets, where minority classes contain very few samples (e.g., 5 images per class, vs. thousands of images per class in majority classes). In such cases, overfitting is almost inevitable, even with extensive data augmentation, since the ability to produce a significant variety of new observations from just a few samples is extremely limited.

Figure 1: Method overview.  (Left) OPeN re-balances an imbalanced dataset with pure-noise images, in addition to oversampled natural images. In OPeN, we replace the standard Batch Normalization layer with DAR-BN. (Right) "Distribution-Aware Routing BN" (DAR-BN) handles the distribution gap between natural images and pure-noise images by normalizing them separately. The affine parameters, learned on the natural input only, are used to correctly scale and shift the noise input.
Figure 2: Re-balanced dataset with OPeN. To balance the dataset, the original images (blue) are oversampled (green) together with additional pure random noise images (orange). The amount of pure noise added to each class is inversely proportional to its size.

In this work, we directly address this problem by taking a new perspective on data re-balancing for imbalanced classification. Unlike traditional resampling approaches, we do not restrict ourselves to training strictly on existing images and their augmentations, thus bypassing this limitation. Specifically, we propose generating pure random noise images and using them as additional training data (especially for minority classes). We show that training on pure noise images can suppress overfitting and encourage generalization, leading to state-of-the-art results on commonly used imbalanced classification benchmarks (Sec. 4). We further provide an intuitive explanation as to why this counter-intuitive approach works in practice (Sec. 3.3). To facilitate learning on pure noise images, which lie outside the distribution of natural images, we present a new batch normalization layer called Distribution-Aware Routing Batch Normalization (DAR-BN). Unlike standard Batch Normalization (BN) [ioffe2015batch], which assumes that all inputs are drawn from the same distribution, DAR-BN is specifically designed to mitigate the distribution shift between two different input domains (namely, natural images and pure noise images).

We note that many previous works have used noise as a form of data augmentation to improve the accuracy and robustness of deep learning models [koziarski2017image, lopes2019improving]. These methods, however, show limited improvement when the training data is scarce [koziarski2017image]. This is due to the fact that applying small doses of additive/multiplicative noise to existing images produces samples in close vicinity to the original ones, thus limiting the data variability. Adding large amounts of noise, on the other hand, degrades models' performance due to the large distribution shift from natural images. In contrast, in our method, OPeN (Oversampling with Pure Noise Images), the model is explicitly trained on pure noise images that are far off the natural-image manifold, while explicitly handling the distribution shift, thus promoting generalization.

Our contributions are therefore several fold:


  • State-of-the-art results on multiple imbalanced classification benchmarks (CIFAR-10-LT [cao2019learning], CIFAR-100-LT [cao2019learning], ImageNet-LT [liu2019large], Places-LT [liu2019large], CelebA-5 [kim2020m2m]).

  • To the best of our knowledge, we are the first to successfully use pure-noise images for training deep image recognition models. We provide extensive empirical evidence of its improved generalization capabilities (along with intuition for why it works).

  • We introduce a new distribution-aware normalization layer (DAR-BN) that can bridge the distribution gap between different input domains of a neural network. While in this work we use DAR-BN to bridge the gap between real and pure-noise images, it can be applied as a general BN layer for handling any pair of different input domains.

  • Our method is extremely simple to use as a general new augmentation tool (on top of existing augmentations), and can be incorporated in any training scheme. It does not require any specialized data generation or training procedures, thus keeping training fast and efficient.

2 Related Work

Imbalanced Classification:

Data resampling: Most data-based approaches for imbalanced classification aim to re-balance the dataset such that minority and majority classes are equally represented during training. This can be achieved by either over-sampling minority classes [chawla2002smote, wang2014hybrid, shen2016relay] or under-sampling majority classes [drummond2003c4, liu2008exploratory, galar2013eusboost]. More recent works address class re-balancing using GANs [mullick2019generative] and semi-supervised learning [wei2021crest]. An oversampling framework related to our work is M2m [kim2020m2m], in which majority samples are "transferred" to minority classes using adversarial noise perturbations. Our OPeN framework also belongs to the data-resampling category, which is therefore most relevant to our work; however, it re-balances the data by adding pure random noise images as additional training samples rather than using additive noise augmentations. We also note that our method does not require any optimized data-creation procedure or auxiliary classifier, which allows a simple and efficient training process.

Loss re-weighting: aims to compensate for data imbalance by adjusting the loss function, e.g., by assigning minority samples higher loss weights than majority samples [buda2018systematic, cui2019class, ren2018learning]. Most recently, BALMS [ren2020balanced] and LADE [hong2021disentangling] both suggested calibrating the predicted logits according to a prior distribution, by adjusting the softmax function and by adding a regularization term to the loss, respectively.

Margin loss: uses a loss function that pushes the decision boundary further away from minority-class samples [zhang2017range, dong2018imbalanced]. For example, [cao2019learning] presented the Label Distribution Aware Margin (LDAM) loss, which is combined with a deferred re-weighting (DRW) training schedule for improved results.

Decoupled training: A recent line of work shows that separating the feature-representation learning from the final classification task can be beneficial for imbalanced classification [kang2019decoupling, zhou2020bbn, wang2021contrastive]. E.g., the recently proposed MiSLAS method [zhong2021improving] suggested using a shifted batch normalization layer between the two stages of the decoupling framework, in addition to calibrating the final model predictions using a label-aware smoothing scheme.

Noise-Based Augmentation:
Augmenting training data with additive or multiplicative noise has long been in use for training visual recognition models [holmstrom1992using, bengio2011deep, ding2016convolutional]. The main motivation behind such augmentation techniques is to improve the model's robustness to noisy inputs and to force it not to fixate on specific input features by randomly "occluding" parts of them [lopes2019improving]. While demonstrating some success in reducing overfitting [zur2009noise], these methods usually provide limited improvement to deep models, as they tend to overfit to the specific type of noise used during training [yin2015noisy].
Another group of methods that uses additive noise is adversarial training, which aims to "fool" a deep model by perturbing images with small, optimized noise [goodfellow2014explaining, kurakin2016adversarial]. Of particular relevance to our work, M2m [kim2020m2m] suggests using adversarial noise to "transfer" images from majority classes to minority classes in an imbalanced classification setting. Similarly, AdvProp [xie2020adversarial] suggests utilizing adversarial examples for improving accuracy and robustness in a general (balanced) classification setting. They try to bridge the distribution gap between two types of inputs (real and adversarial images), for which they use an auxiliary batch normalization layer. In our work, however, the training data is enriched using pure noise images rather than adversarial examples. Additionally, AdvProp learns two completely separate sets of batch-norm parameters, while in our proposed DAR-BN the affine parameters are learned based only on real images, and are then applied to both data sources.

Normalization Layers:
Since the introduction of batch normalization [ioffe2015batch], various extensions have been proposed to further improve normalization within deep networks, including layer-norm [ba2016layer], instance-norm [ulyanov2016instance], and group-norm [wu2018group]. Common to all these layers is that they normalize activation maps based on a single set of statistics (i.e., mean and variance) for the entire training set. While this may work well when all data samples come from the same underlying distribution, it is sub-optimal when the data is multi-modal or originates from several different domains [xie2020adversarial]. Several recent works related to ours have addressed this issue: adaptive instance normalization [huang2017arbitrary] was introduced for style transfer by adjusting the statistics of content and style inputs. Similarly, [li2018adaptive, xie2020adversarial] propose mitigating the domain shift by keeping separate sets of normalization terms for different domains. In our proposed DAR-BN layer we also use a different set of mean and variance statistics for real and noise images, but then use shared affine parameters to jointly scale them after normalization.

3 Imbalanced Classification using OPeN

Figs. 1 and 2 provide a schematic overview of our approach for imbalanced image classification, which is detailed next. Let $D$ be a long-tailed imbalanced dataset containing $N$ classes $\{c_1, \dots, c_N\}$, where each class $c_i$ consists of $n_i$ training samples. For simplicity we assume the classes are sorted in descending order of size, i.e., $n_1 \geq n_2 \geq \dots \geq n_N$. In some cases, the ratio between the largest and the smallest class is a factor of 1000. While the training set is class-imbalanced, the test set is class-balanced; therefore, classification of minority classes (with only a few samples) is of equal importance to that of majority classes. To compensate for the lack of training data in minority classes, we adopt an oversampling approach that levels the number of samples in each class. However, in contrast to common oversampling techniques [chawla2002smote, kim2020m2m], whose training images are solely based on the original ones (i.e., their duplications and augmentations), we propose to also use pure random Gaussian noise images as additional training samples. As shown in Fig. 2, for each class $c_i$ we balance the data by adding $n_1 - n_i$ new training images (where $c_1$ is the largest class), out of which a fraction determined by the noise ratio (Sec. 3.1) are pure noise images, and the rest are real (oversampled) images. During training, we feed the network with mixed batches containing both natural images (with augmentations) and pure noise images. In Sec. 3.1 we further elaborate on this process. Since pure noise images lie outside the distribution of natural images, we normalize them separately using a new distribution-aware normalization layer, explained in Sec. 3.2. Finally, Sec. 3.3 provides intuition for why this improves the accuracy of imbalanced classification, especially on minority classes.

3.1 Oversampling with Pure Noise Images (OPeN)

We define the representation ratio of each class $c_i$ in $D$ as $r_i = n_i / n_1$. By definition, minority classes have a smaller representation ratio than majority classes. Since standard oversampling results in overfitting of minority classes [buda2018systematic], we replace part of the oversampled images with pure random noise images, with the following probability:

P[\text{replace } x \text{ with noise}] \;=\; \delta \cdot \left(1 - r_{y(x)}\right)    (1)

where $y(x)$ is the associated class label of image $x$, $r_{y(x)}$ is the representation ratio of class $y(x)$, and the replacing noise image is randomly sampled from a normal distribution using the mean and variance of all images in the dataset (Eqs. 2-4 below). $\delta \in [0, 1]$ is a hyper-parameter defining the ratio between pure noise images and natural images. Each class in the dataset has a different number of samples, hence is prone to overfitting to a different extent. Eq. 1 adjusts the number of noise images added per class accordingly: a lower $r_{y(x)}$ results in a higher probability of replacing a sample from class $y(x)$ with a pure random noise image, and vice versa for a larger $r_{y(x)}$.
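As a concrete illustration of Eq. 1, consider the CIFAR-10-LT (IR=100) class sizes from Table 1, with a noise ratio of $\delta = 1/3$ assumed here purely for illustration:

r_{\text{smallest}} = \frac{50}{5000} = 0.01 \;\Rightarrow\; P[\text{noise}] = \tfrac{1}{3}(1 - 0.01) \approx 0.33, \qquad r_{\text{largest}} = 1 \;\Rightarrow\; P[\text{noise}] = 0.

Thus roughly a third of the oversampled slots of the smallest class are filled with pure noise images, while the largest class receives none.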

The pure random noise images are generated as follows. Let $D$ denote the set of training images in the dataset. We first compute the mean and standard deviation of each color channel $k$ over all pixels of all images in $D$:

\mu_k = \operatorname{mean}_{x \in D}(x_k), \qquad \sigma_k = \operatorname{std}_{x \in D}(x_k)    (2)

Noise images are then sampled (i.i.d. per pixel) from the following normal distribution, and clipped to the feasible domain $[0, 1]$:

\tilde{x}_k \sim \mathcal{N}\left(\mu_k, \sigma_k^2\right)    (3)
\tilde{x} \leftarrow \operatorname{clip}\left(\tilde{x}, 0, 1\right)    (4)
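The following is a minimal NumPy sketch of the noise-image generation described by Eqs. 2-4. The function names and the assumption that images are float arrays of shape (H, W, 3) with values in [0, 1] are ours, for illustration only.

import numpy as np

def channel_statistics(images):
    """Per-channel mean and std over all pixels of all training images (Eq. 2).
    `images` is assumed to have shape (num_images, H, W, 3), values in [0, 1]."""
    mu = images.mean(axis=(0, 1, 2))      # shape (3,)
    sigma = images.std(axis=(0, 1, 2))    # shape (3,)
    return mu, sigma

def sample_noise_image(mu, sigma, height, width, rng):
    """Sample one pure noise image i.i.d. per pixel (Eq. 3), clipped to [0, 1] (Eq. 4)."""
    noise = rng.normal(loc=mu, scale=sigma, size=(height, width, 3))
    return np.clip(noise, 0.0, 1.0)

# Example usage (with a training-image array `train_images` of shape (N, 32, 32, 3)):
# mu, sigma = channel_statistics(train_images)
# noise_img = sample_noise_image(mu, sigma, 32, 32, np.random.default_rng(0))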

At every epoch, we randomly sample new noise images, as this helps the network avoid overfitting to specific noise images. A pseudo-code for OPeN is shown in Algorithm 1.

Input: (i) Imbalanced dataset: D; (ii) noise ratio: δ; (iii) dataset statistics: μ, σ
Initialize:
    B ← Balanced loader for D using oversampling
Sample a batch {(x_i, y_i)} from B
for all (x_i, y_i) do
      p_i ← δ · (1 − r_{y_i})        ▷ Compute probability of replacing x_i with noise (Eq. 1)
      u_i ∼ Uniform[0, 1]
      if u_i < p_i then
          x_i ← sample from N(μ, σ²)    ▷ Eq. 3 (the label y_i is kept)
          x_i ← clip(x_i, 0, 1)         ▷ Eq. 4
      end if
end for
return the batch {(x_i, y_i)}
Algorithm 1 Oversampling with Pure Random Noise (OPeN)
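Below is a minimal Python sketch of the batch-level replacement step of Algorithm 1, reusing the sample_noise_image helper sketched above. The data structures (a list of (image, label) pairs and a dict of per-class representation ratios) are illustrative assumptions, not the authors' implementation.

def open_rebalance_batch(batch, repr_ratio, delta, mu, sigma, rng):
    """Replace oversampled images with pure noise images according to Eq. 1.

    batch      : list of (image, label) pairs drawn from a class-balanced (oversampling) loader
    repr_ratio : dict mapping class label -> n_i / n_1 (representation ratio)
    delta      : noise-ratio hyper-parameter in [0, 1]
    mu, sigma  : per-channel dataset statistics (Eq. 2)
    """
    out = []
    for image, label in batch:
        p_noise = delta * (1.0 - repr_ratio[label])           # Eq. 1
        if rng.random() < p_noise:
            h, w, _ = image.shape
            image = sample_noise_image(mu, sigma, h, w, rng)  # Eqs. 3-4; the label is kept
        out.append((image, label))
    return out

Because this runs on every batch, fresh noise images are drawn throughout training, matching the per-epoch re-sampling described above.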

3.2 Distribution-Aware Routing Batch Norm (DAR-BN)

Standard Batch Normalization [ioffe2015batch] is designed to handle the change in the distribution of inputs to layers in deep neural networks, also known as internal covariate shift. However, it assumes that all input samples are taken from the same or similar distributions. Therefore, when inputs originate from several different distributions, BN fails to properly normalize them [he2019data, xie2020adversarial, xie2019intriguing]. In our framework, OPeN uses pure random noise as additional training examples, which are clearly out of the distribution of natural images. As a result, the layer's input consists of activation maps obtained both from natural images (where the similar-distribution assumption holds) and from pure noise images (whose distribution is very different from that of the natural images in the train or test datasets). We experimentally observe that using noise images with the standard BN layer leads to a significant degradation in classification results (see Sec. 5), even below the baseline of not using noise at all. This may further suggest why this simple idea (of adding pure noise images as additional training examples) has not been previously proposed as a general tool to improve generalization of deep neural networks. To handle the significant distribution gap between random noise images and natural images, we introduce a new normalization layer called "DAR-BN": Distribution-Aware Routing Batch Normalization.
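The failure mode of standard BN on such mixed batches can be seen in a toy NumPy calculation; the activation statistics below are made up purely to illustrate the effect.

import numpy as np

rng = np.random.default_rng(0)
nat = rng.normal(loc=2.0, scale=0.5, size=8000)    # hypothetical natural-image activations
noise = rng.normal(loc=0.0, scale=3.0, size=2000)  # hypothetical pure-noise activations
mixed = np.concatenate([nat, noise])

# Standard BN normalizes the natural activations with the *mixed* batch statistics:
nat_std_bn = (nat - mixed.mean()) / mixed.std()
print(nat_std_bn.mean(), nat_std_bn.std())   # far from (0, 1): natural activations get skewed

# DAR-BN normalizes each split with its own statistics (Eqs. 10-11 below):
nat_dar_bn = (nat - nat.mean()) / nat.std()
print(nat_dar_bn.mean(), nat_dar_bn.std())   # approximately (0, 1)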

We start by revisiting standard Batch Normalization (BN) [ioffe2015batch], and then explain how we extend it to our proposed DAR-BN. Let $x \in \mathbb{R}^{B \times S \times C}$ denote an input to the normalization layer, where $B$ is the batch size, $S$ is the spatial dimension size, and $C$ is the number of channels. The BN layer acts on each channel independently by first normalizing the input across the spatial and the batch dimensions, then applying an affine layer with trainable parameters. Formally, for each channel $c$:

\hat{x}_c = \frac{x_c - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}    (5)
y_c = \gamma_c \, \hat{x}_c + \beta_c    (6)

where $x_c$ is an input channel (i.e., $x_c \in \mathbb{R}^{B \times S}$), $\mu_c, \sigma_c^2$ are the per-channel batch mean and variance, and $\gamma_c, \beta_c$ are trainable parameters per channel. At inference time, the input is normalized using the running mean and running variance that were computed during training using an exponential moving average (EMA) of the batch statistics:

\mu_c^{run} \leftarrow (1 - \alpha)\, \mu_c^{run} + \alpha\, \mu_c    (7)
\left(\sigma_c^{run}\right)^2 \leftarrow (1 - \alpha)\, \left(\sigma_c^{run}\right)^2 + \alpha\, \sigma_c^2    (8)

where $\alpha$ is the momentum parameter. Then, at test time the data is normalized by the running mean and variance, i.e., we replace Eq. 5 with:

\hat{x}_c = \frac{x_c - \mu_c^{run}}{\sqrt{\left(\sigma_c^{run}\right)^2 + \epsilon}}    (9)

To handle the significant distribution shift between random noise images and natural images, we propose DAR-BN, an extension of the standard BN layer. To this goal, DAR-BN normalizes the noise activation maps and the natural activation maps separately. Specifically, assume $x = x^{nat} \cup x^{noise}$, where $x^{nat}, x^{noise}$ are activation maps of natural images and pure noise images in the batch, respectively. DAR-BN replaces Eq. 5 with:

\hat{x}_c^{nat} = \frac{x_c^{nat} - \mu_c^{nat}}{\sqrt{\left(\sigma_c^{nat}\right)^2 + \epsilon}}    (10)
\hat{x}_c^{noise} = \frac{x_c^{noise} - \mu_c^{noise}}{\sqrt{\left(\sigma_c^{noise}\right)^2 + \epsilon}}    (11)

Then, motivated by AdaBN [li2016revisiting] (which is designed to handle the covariate shift in domain adaptation/transfer learning), DAR-BN uses the affine parameters learned from the natural activation maps in order to scale and shift the noise activation maps. Specifically, DAR-BN replaces Eq. 6 with:

y_c^{nat} = \gamma_c \, \hat{x}_c^{nat} + \beta_c    (12)
y_c^{noise} = \operatorname{stopgrad}(\gamma_c) \, \hat{x}_c^{noise} + \operatorname{stopgrad}(\beta_c)    (13)

Equation 13 is applied while the parameters $\gamma_c, \beta_c$ remain fixed, such that no update is applied to these parameters in the back-propagation step due to the operation in Eq. 13 (i.e., gradients from the noise activation maps do not affect $\gamma_c, \beta_c$). Finally, since at test time inputs are sampled only from the natural-image domain, DAR-BN updates the batch statistics using only activation maps of natural images. Accordingly, Eqs. 7 and 8 are replaced with:

\mu_c^{run} \leftarrow (1 - \alpha)\, \mu_c^{run} + \alpha\, \mu_c^{nat}    (14)
\left(\sigma_c^{run}\right)^2 \leftarrow (1 - \alpha)\, \left(\sigma_c^{run}\right)^2 + \alpha\, \left(\sigma_c^{nat}\right)^2    (15)

A pseudo-code of DAR-BN is found in Algorithm 2. Note that DAR-BN bridges the gap between different input domains. While here we used it to bridge the gap between real and pure-noise images, DAR-BN can be used as a general BN layer for handling any pair of different input domains.

Dataset | # of classes | Imbalance ratio (IR) | Largest class size | Smallest class size | # of samples
CIFAR-10-LT [cao2019learning] | 10 | {50, 100} | 5,000 | {100, 50} | {13,996, 12,406}
CIFAR-100-LT [cao2019learning] | 100 | {50, 100} | 500 | {10, 5} | {12,608, 10,847}
ImageNet-LT [liu2019large] | 1,000 | 256 | 1,280 | 5 | 115,846
Places-LT [liu2019large] | 365 | 996 | 4,980 | 5 | 62,500
CelebA-5 [kim2020m2m] | 5 | 10.7 | 2,423 | 227 | 6,651
Table 1: Long-tailed datasets. Summary of the long-tailed datasets we used for evaluation. (see Sec. 4 below for a detailed explanation)
Input: (i) Batch of activation maps (per channel) x = {x_1, …, x_B}, where x_i is the channel activation map of example i; (ii) indicator function isNoise(·) satisfying: isNoise(x_i) = 1 iff x_i is an activation map of a pure noise image.
Initialize: γ, β, μ_run, σ²_run
splits ← {natural, noise}
for all s in splits do
      x^s ← {x_i : isNoise(x_i) = [s = noise]}        ▷ (split) Split the batch
      μ_s ← mean(x^s)
      σ²_s ← var(x^s)
      x̂^s ← (x^s − μ_s) / √(σ²_s + ε)                 ▷ Normalize each split separately (Eqs. 10-11)
      if s = noise then
           ▷ Do not update γ, β, as well as the batch statistics
          with no gradient update:
               y^s ← γ · x̂^s + β                       ▷ Eq. 13
      else
           ▷ Update statistics according to the natural split
          μ_run ← (1 − α) · μ_run + α · μ_s            ▷ Eq. 14
          σ²_run ← (1 − α) · σ²_run + α · σ²_s         ▷ Eq. 15
          y^s ← γ · x̂^s + β                            ▷ Eq. 12
      end if
end for
return y = y^natural ∪ y^noise
Algorithm 2 Distribution-Aware Routing BN (DAR-BN)
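For concreteness, here is a hedged PyTorch sketch of DAR-BN for 2D feature maps, written from the description in Eqs. 10-15 and Algorithm 2. It is not the authors' released implementation; the per-sample boolean is_noise mask and the module interface are our assumptions, and it assumes each training batch contains at least one natural image.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DARBN2d(nn.Module):
    """Distribution-Aware Routing BN (sketch): normalize natural and noise activations
    separately, share the affine parameters (learned from natural activations only),
    and update running statistics from natural activations only."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps, self.momentum = eps, momentum
        self.weight = nn.Parameter(torch.ones(num_features))   # gamma
        self.bias = nn.Parameter(torch.zeros(num_features))    # beta
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x, is_noise=None):
        # x: (B, C, H, W); is_noise: (B,) boolean mask marking pure-noise samples.
        if (not self.training) or is_noise is None or not bool(is_noise.any()):
            # Test time, or a batch without noise images: behave like standard BN (Eqs. 5-9).
            return F.batch_norm(x, self.running_mean, self.running_var,
                                self.weight, self.bias, self.training,
                                self.momentum, self.eps)

        out = torch.empty_like(x)

        # Natural split: standard BN, which also updates the running statistics
        # from natural activations only (Eqs. 10, 12, 14, 15).
        out[~is_noise] = F.batch_norm(x[~is_noise], self.running_mean, self.running_var,
                                      self.weight, self.bias, True,
                                      self.momentum, self.eps)

        # Noise split: its own batch statistics (Eq. 11), and the shared affine
        # parameters with stopped gradients (Eq. 13); running stats are NOT updated.
        noi = x[is_noise]
        mu = noi.mean(dim=(0, 2, 3), keepdim=True)
        var = noi.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
        noi_hat = (noi - mu) / torch.sqrt(var + self.eps)
        gamma = self.weight.detach()[None, :, None, None]
        beta = self.bias.detach()[None, :, None, None]
        out[is_noise] = gamma * noi_hat + beta
        return out

In practice, one would replace each nn.BatchNorm2d in the backbone with such a layer and thread the per-sample is_noise mask through the forward pass; at test time the layer reduces to standard BN with running statistics accumulated from natural images only.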

3.3 Underlying Intuition

Training directly on random noise images may seem counter-intuitive. However, we claim it provides a unique regularization effect that can significantly improve generalization of minority classes.

Consider the average batch during training in the imbalanced classification setting described above. Each class is represented in the batch according to its relative size in the training set, i.e., its representation ratio $r_i$. When backpropagating, the total gradient can be decomposed into the sum of $N$ individual components, one per class. When applying conventional oversampling (i.e., using duplications and augmentations of original images), the gradient components of minority classes will increase in magnitude, since they are now over-represented. However, their direction will remain relatively unchanged, since oversampled images are usually similar to the original ones, thus limiting generalization for these classes. This relates to another well-known drawback of such oversampling methods, which tend to perform poorly when the number of samples in minority classes is very small, since the ability to synthesize new and varied samples for those classes is extremely limited [kim2020m2m].
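This per-class decomposition of the batch gradient can be written explicitly (notation ours), for a batch $\mathcal{B}$, per-sample loss $\ell$, and network $f_\theta$:

\nabla_\theta \mathcal{L}_{\mathcal{B}} \;=\; \sum_{c=1}^{N} \; \underbrace{\sum_{(x_i, y_i) \in \mathcal{B}:\, y_i = c} \nabla_\theta\, \ell\big(f_\theta(x_i),\, c\big)}_{\text{gradient component of class } c}

Oversampling adds more terms to a minority class's component (larger magnitude), whereas replacing some of those terms with pure noise images additionally randomizes the component's direction.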

In contrast, using the proposed OPeN resampling scheme alleviates both of these problems: (i) From a training point of view, oversampling with pure noise images also increases the magnitude of minority gradient components, but at the same time adds stochasticity to their direction. This stochasticity has a regularization effect on the training process, whose strength is inversely proportional to the class size. This way, overfitting of minority classes can be suppressed, and generalization is encouraged. (ii) By using random noise images, generation of new training samples is not limited by the variety of existing samples in the data. This way, we bypass the limitation posed by the small number of minority samples, and explicitly teach the network to handle inputs that are significantly out of its training-set distribution. In particular, the network learns to expect much higher variability and uncertainty in the test images of minority classes. Indeed, at test time, as our experiments suggest, this translates into increased generalization performance. We note that many previous works have used noise as a form of data augmentation. These methods, however, show slight improvement when training data is scarce [koziarski2017image], mostly since applying small doses of noise produces new images that are in close vicinity to original ones, thus providing limited data variability.

One can also understand how our method mitigates data imbalance from another perspective. Since noise inputs are completely random and are class independent, they in fact carry no information except for the class labels we assign to them. Consequently, a key effect of using noise images is on the prior class probabilities learnt by the network. Since in the proposed re-sampling scheme more noise images are assigned to minority classes, we hypothesize that the network learns to implicitly encode these prior probabilities and correct its predictions accordingly.

4 Experiments: Imbalanced Classification

We evaluate our method on five benchmark datasets for imbalanced classification: CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, Places-LT, and CelebA-5. We follow the evaluation protocol used in [liu2019large, kim2020m2m] for imbalanced classification tasks: The model is trained on the class-imbalanced training set, but then evaluated on a balanced class distribution test set. Our results (summarized in Table 2) exhibit state-of-the-art performance on all these datasets. Below is a detailed description of our experiments and results.

Imbalanced (Long-Tail) Datasets:

The Imbalance Ratio (IR) of a long-tailed dataset is defined as $\mathrm{IR} = n_{\max} / n_{\min}$, where $n_{\max}$ and $n_{\min}$ are the number of training images in its largest and smallest class, respectively. The five long-tailed datasets we used are described below, and their information is summarized in Table 1.
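For example, plugging the CIFAR-10-LT (IR=100) class sizes from Table 1 into this definition gives $\mathrm{IR} = n_{\max}/n_{\min} = 5000/50 = 100$.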


  • CIFAR-10-LT & CIFAR-100-LT [cao2019learning]. The full CIFAR-10/100 datasets [krizhevsky2009learning] consist of 50,000 training images, and 10,000 test images (split uniformly into 10/100 classes, respectively). Their long-tailed versions, CIFAR-10/100-LT [cao2019learning], were constructed by an exponential decay sampling of the number of training images per class, while their corresponding test sets remain unchanged (i.e., uniform class distribution). We evaluate our method on the challenging dataset settings (IR=50, IR=100).

  • ImageNet-LT & Places-LT [liu2019large]. Both datasets are long-tailed subsets of the original large-scale (balanced) ImageNet [deng2009imagenet] and Places [zhou2017places]. The imbalanced datasets were constructed by sampling the original datasets following the Pareto distribution [reed2001pareto] with power value α. The resulting long-tailed datasets have a smallest class of size 5, and a largest class of size 1,280 for ImageNet-LT (IR=256) and 4,980 for Places-LT (IR=996).

  • CelebA-5 [kim2020m2m]. The standard CelebA dataset [liu2015faceattributes] consists of face images with 40 binary attributes per image. The CelebA-5 dataset was proposed in [mullick2019generative] by selecting samples from non-overlapping hair-color attributes (blonde, black, bald, brown, gray). Naturally, the resulting dataset is imbalanced (with IR=10.7; see Table 1), as human hair colors are not uniformly distributed. The images were then resized to a fixed lower resolution. Kim et al. [kim2020m2m] constructed a smaller version of the imbalanced dataset by sampling each class with a ratio of 1:20, preserving the IR.

Methods | CIFAR-10-LT IR=100 | CIFAR-10-LT IR=50 | CIFAR-100-LT IR=100 | CIFAR-100-LT IR=50 | ImageNet-LT | Places-LT | CelebA-5
Empirical Risk Minimization (ERM) | ±0.2 | ±0.4 | ±0.5 | ±0.4 | 51.1 | 29.9 | 78.6±0.1
Oversampling | ±0.4 | ±0.4 | ±0.3 | ±0.2 | 49.0 | 38.1 | 76.4±0.2
LDAM-DRW [cao2019learning] | 77.1 | 81.1 | 42.1 | 46.7 | - | - | -
M2m [kim2020m2m] | 79.1±0.2 | - | 43.5±0.2 | - | 43.7 | - | 75.9±1.1
Balanced Meta-Softmax (BALMS) [ren2020balanced] | - | - | - | - | 41.8 | 38.7 | -
LADE [hong2021disentangling] | - | - | 45.4 | 50.5 | 53.0 | 38.8 | -
MiSLAS [zhong2021improving] | 82.1 | 85.7 | 47.0 | 52.3 | 52.7 | 40.4 | -
OPeN (ours) | 84.6±0.2 | 87.9±0.2 | 51.5±0.4 | 56.3±0.4 | 55.1 | 40.5 | 79.7±0.2
ERM + AutoAugment | ±0.3 | ±0.2 | ±0.4 | ±0.4 | 52.2 | 29.2 | 79.3±0.5
BALMS [ren2020balanced] + AutoAugment | 84.9 | - | 50.8 | - | - | - | -
OPeN (ours) + AutoAugment | 86.1±0.1 | 89.2±0.2 | 54.2±0.5 | 59.8±0.5 | 56.1 | 39.6 | 80.9±0.4
Table 2: Results & Comparison on imbalanced benchmark datasets. Mean accuracy over all classes per dataset. OPeN outperforms all previous methods, obtaining state-of-the-art results on all datasets. Since AutoAugment was optimized on the full balanced CIFAR-10 and ImageNet datasets, we split the table into two parts (see Sec. 4.2): methods in the top part do not use AutoAugment, while methods in the bottom part use AutoAugment. OPeN achieves the highest results in both cases. Results for prior methods are borrowed directly from the original papers or from [zhong2021improving]. Missing results indicate datasets not evaluated in the cited papers.

4.1 Experimental Setup

Architectures & training: For the CIFAR-10/100-LT datasets, we use WideResNet-28-10 [zagoruyko2016wide] as our default architecture and train for 200 epochs. For ImageNet-LT we use ResNeXt-50 [xie2017aggregated] with a cosine classifier and train for 220 epochs. For CelebA-5 [kim2020m2m], we use the same architecture and training parameters as for CIFAR (trained here for 90 epochs). For Places-LT, we follow the procedure of [hong2021disentangling, ren2020balanced, kang2019decoupling], which uses an ImageNet-pretrained ResNet-152; we fine-tune it for an additional 30 epochs. As in [cao2019learning, kim2020m2m], we too defer our method (OPeN) to the last phase of training. This allows the network to learn an initial representation of the data with natural images only. Only when the learning rate decays (which is when the model is exposed to overfitting) do we add the oversampling + pure-noise images. For ImageNet-LT, CIFAR-10-LT, and CIFAR-100-LT, we defer OPeN to the last 40 epochs. For Places-LT and CelebA-5, OPeN is applied in the last 15 and 30 epochs, respectively. For more details on the training parameters and the image augmentations employed, please see Appendix B.
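A minimal sketch of how this deferred schedule can be wired into a training loop, reusing the open_rebalance_batch helper from Sec. 3.1; the loader and train_step arguments are placeholders for whatever training pipeline is used.

def train_with_deferred_open(model, standard_loader, balanced_loader, train_step,
                             repr_ratio, delta, mu, sigma, rng,
                             num_epochs=200, defer_epochs=40):
    """Plain ERM batches first; OPeN re-balanced batches for the last `defer_epochs`
    epochs (40 for CIFAR-10/100-LT and ImageNet-LT, 15 for Places-LT, 30 for CelebA-5)."""
    for epoch in range(num_epochs):
        use_open = epoch >= num_epochs - defer_epochs
        loader = balanced_loader if use_open else standard_loader
        for batch in loader:
            if use_open:
                batch = open_rebalance_batch(batch, repr_ratio, delta, mu, sigma, rng)
            train_step(model, batch)   # forward (through DAR-BN layers), loss, backward, step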

Baseline methods. On top of comparing to recent leading methods [cao2019learning, hong2021disentangling, ren2020balanced, zhong2021improving, kim2020m2m], we also compare, under the same training parameters, augmentations, and architectures, to the following baselines: (i) Empirical Risk Minimization (ERM): training without any re-balancing scheme; (ii) Oversampling: re-balancing the dataset by oversampling minority classes with augmentations; (iii) ERM + AutoAugment (AA) [cubuk2019autoaugment].

Noise ratio (δ): The noise ratio δ was determined using the CIFAR validation set; we then simply applied the same value to all other datasets.

Randomization: Since small datasets tend to present high variance, for CelebA-5, CIFAR-10-LT, and CIFAR-100-LT we repeat each experiment 4 times, reporting the mean and standard error. For the rest of the datasets, for fair comparison, we use the same randomization seed across all experiments.

4.2 Results

Table 2 shows the results for imbalanced image classification. OPeN obtains SOTA results on all benchmark datasets. For example, on CIFAR-10-LT and CIFAR-100-LT with an imbalance ratio of 100, OPeN outperforms the previous SOTA method, MiSLAS [zhong2021improving], by 2.5% and 4.5%, respectively. On ImageNet-LT, OPeN outperforms the previous SOTA method, LADE [hong2021disentangling], by 2.1%. On CelebA-5, OPeN surpasses the previous SOTA method, M2m [kim2020m2m], by 3.8%. On Places-LT, OPeN achieves comparable results (0.1% better) to the previous SOTA, MiSLAS [zhong2021improving].

The above results were obtained without using AutoAugment (AA) [cubuk2019autoaugment]. Using AutoAugment when training on extremely small/long-tailed subsets of CIFAR and ImageNet is unfair [azuri2021generative], since AutoAugment was optimized using the entire large and balanced CIFAR-10 and ImageNet datasets. However, since BALMS [ren2020balanced] reports results only with AutoAugment, we evaluated our method in that setting as well. OPeN with AA outperforms BALMS by 1.1% and 3.4% on CIFAR-10-LT and CIFAR-100-LT, respectively. We further note that OPeN obtains state-of-the-art results even without using AutoAugment. To further support the claim that using AutoAugment is problematic for evaluations on ImageNet-LT and CIFAR-10-LT, we note that AutoAugment has a negative effect on Places-LT, for which it was not optimized (compared to the same method without AA).

Generalization of minority classes. Besides improving the mean accuracy (reported in Table 2), finer exploration reveals that most of this overall improvement stems from a dramatic improvement in classification accuracy of minority classes, while preserving the accuracy of majority classes. Specifically, OPeN improves the accuracy of the 20 smallest classes of CIFAR100-LT (with IR=100, where minority classes have 5-12 samples) by 13.9% above baseline ERM training, from mean accuracy of 11.6% to 25.5%. OPeN also outperforms the baseline deferred oversampling [cao2019learning] (without noise images) by 4.3% on the same subset of minor classes. On CIFAR10-LT, OPeN improves generalization of the two smallest classes by 6.3% compared to deferred oversampling, and by 15.6% above ERM training. These findings provide empirical evidence to our hypothesis that adding pure noise to minority classes (as opposed to only augmenting the existing training images) significantly diminishes the overfitting problem and increases the generalization capabilities. Please see Appendix A for more detailed evaluations.

5 Ablation Studies & Observations

In this section we explore the added-value of training on pure-noise images under various different settings, and the importance of using DAR-BN for batch normalization.

Data augmentation.

In this ablation study, we explore the added value of pure noise when using OPeN with different types of data augmentation methods. We evaluate several augmentation techniques of increasing strength on CIFAR-10-LT: (i) random horizontal flip, followed by random crop with padding of four pixels; (ii) Cutout [devries2017improved] (which zeros out a random fixed-size window in the image); (iii) SimCLR [chen2020simple] (which includes, in addition to the horizontal flip and crop, also color distortion and Gaussian blur) followed by Cutout; (iv) AutoAugment [cubuk2019autoaugment] (which is optimized on the entire balanced CIFAR-10 and ImageNet datasets, and is considered to be a highly powerful augmentation).

Fig. 3 shows that OPeN provides a significant improvement over all four augmentation types, even when the optimal augmentation for that dataset (AutoAugment [cubuk2019autoaugment]) is used. This further supports our hypothesis that training on out-of-distribution pure noise images has a significant added value in suppressing overfitting, beyond augmentation of existing training images.

The impact of DAR-BN. Our distribution-aware normalization layer (DAR-BN) is an essential component for the success of our method, since it helps bridge the distribution gap between pure random noise images and natural images. Xie et al. [xie2019intriguing, xie2020adversarial] already observed that natural images and adversarial images are drawn from two different domains. They addressed this using an "Auxiliary BN" layer, which routes adversarial examples and clean images into two separate standard BN layers, with two separate learnable sets of affine parameters. In contrast, in DAR-BN we use only one set of trainable parameters, which is learned from activation maps of natural images only, and use it to scale and shift both the natural activation maps and the activation maps of the pure noise. This difference is important, since the test data in our case contains only natural images and no pure noise images.

Table 3 compares the effect of plugging each of 3 different BN layers into OPeN: (i) Standard BN [ioffe2015batch], (ii) Auxiliary BN [xie2020adversarial], and (iii) DAR-BN (ours). Results show that DAR-BN outperforms other BN layers (surpassing standard BN by 3.2% and 2.3% on CIFAR-10/100-LT, respectively).

Figure 3: Ablation study: the added value of pure noise with respect to various augmentation methods. Mean accuracy on CIFAR-10-LT with IR=100. We compare OPeN to the ERM baseline and to deferred oversampling [cao2019learning] (with the same training parameters).

Pure Noise Images – a General Useful Augmentation. Our method and experiments are primarily focused on imbalanced classification. However, we observed that adding pure noise images is often effective as a general data enrichment method, which complements existing augmentation methods, even in standard balanced datasets. To use it as such, we simply add a fixed number of pure noise images to each class (e.g., some pre-defined percentage of the class size), and train the network using DAR-BN as described in Algorithm 2. We note that since our method does not modify existing training images, it can be easily applied in addition to any other augmentation technique.
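A minimal sketch of this balanced-dataset variant, reusing the sample_noise_image helper from Sec. 3.1: a fixed fraction of pure noise images is appended to every class. The class-indexed dictionary layout and the 25% default are illustrative assumptions (the experiment below uses a 1:4 noise-to-real ratio per batch).

import numpy as np

def add_fixed_noise_per_class(images_by_class, mu, sigma, rng, noise_fraction=0.25):
    """Append `noise_fraction * class_size` pure noise images to each class of a
    balanced dataset, and return a flat list of (image, label) pairs."""
    augmented = []
    for label, images in images_by_class.items():
        augmented.extend((img, label) for img in images)   # keep the real images
        h, w, _ = images[0].shape
        for _ in range(int(noise_fraction * len(images))):
            augmented.append((sample_noise_image(mu, sigma, h, w, rng), label))
    return augmented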

While we did not perform extensive evaluations of this, we exemplify the potential power of training on pure-noise images (with DAR-BN) as an additional useful data augmentation method on the two full (balanced) CIFAR datasets [krizhevsky2009learning]. To examine the power of this "complementary augmentation", we measure its added value on top of successful and commonly used augmentation techniques: (i) Baseline augmentation: using random horizontal flip and random cropping (with a crop size of 32 and padding of 4); (ii) AutoAugment [cubuk2019autoaugment] using the corresponding dataset policy; (iii) Our method: adding pure noise images (normalized with DAR-BN) in addition to AutoAugment.

We perform our experiments on the full (balanced) CIFAR-10 and CIFAR-100, using a Wide-ResNet-28-10 architecture [zagoruyko2016wide]. All models were trained for 200 epochs using the Adam optimizer and a standard cross-entropy loss. Noise images were sampled from a Gaussian distribution with the mean and variance of the corresponding training set, and with a noise-to-real ratio of 1:4 in each batch. Our proposed method (AutoAugment complemented with pure-noise images) achieves the best classification accuracy on both datasets. Specifically:


  • Improvement over the baseline augmentation:
          +2.38% on CIFAR-10,    +6.34% on CIFAR-100.

  • Improvement over AutoAugment:
          +0.9% on CIFAR-10,      +1.5% on CIFAR-100.

These results suggest that properly utilizing pure noise images (with our proposed DAR-BN), may serve as an additional useful augmentation method in general, without any elaborated data creation schemes. It has the potential to further improve classification accuracy, even when used on top of highly sophisticated augmentation methods such as AutoAugment (which was optimized for these specific datasets). Extensively verifying this observation on a large variety of datasets, architectures, and augmentation methods, is part of our future work.

Norm Layer | CIFAR-10-LT | CIFAR-100-LT
Standard BN [ioffe2015batch] | ±0.70 | ±0.54
Auxiliary BN [xie2020adversarial] | ±0.16 | ±0.06
DAR-BN (ours) | 84.64±0.16 | 51.50±0.44
Table 3: Ablation study: Comparing different Batch-Norm layers. Mean accuracy on CIFAR-10/100-LT with IR=100. Each type of BN is plugged into OPeN (with same training parameters). DAR-BN outperforms the other normalization layers.

6 Conclusion

We present a new framework (OPeN) for imbalanced image classification: re-balance the training set by using pure noise images as additional training samples, along with a special distribution-aware normalization layer (DAR-BN). Our method achieves SOTA results on a large variety of imbalanced classification benchmarks. In particular, it significantly improves generalization of tiny classes with very few training images. Our method is extremely simple to use, and can be incorporated in any training scheme. While we developed DAR-BN to bridge the distribution gap between real and pure-noise images, it may potentially serve as a new BN layer for bridging the gap between other pairs of different input domains in neural-nets. Our work may open up new research directions for harnessing noise, as well as other types of out-of-distribution data, both for imbalanced classification, and for data enrichment in general.

7 Acknowledgements

This project received funding from the D. Dan and Betty Kahn Foundation.

References

Appendix A Generalization as a Function of the Class Size

This section provides more details on the improvement provided by OPeN to the generalization of small (minority) classes (extending the evaluation in Sec. 4.2). To this goal, we perform a finer evaluation over the classes in CIFAR-10-LT and CIFAR-100-LT datasets. We divide the classes (according to their sample size) into five non-overlapping groups of equal size, i.e., each group consists of 20% of the classes. For example, for CIFAR-100-LT, Group #1 consists of the twenty smallest classes in the training set, while Group #5 consists of the twenty largest classes. Similarly, for CIFAR-10-LT, each group consists of two classes.

Figure 4 shows the classification results using OPeN compared to the ERM baseline and deferred oversampling [cao2019learning], according to the class division described above. OPeN provides a significant improvement over both methods on minority classes. We specifically note that on CIFAR-10-LT, OPeN improves the accuracy over ERM by 15.6% for Group #1 (the two smallest classes), by 8.6% for Group #2, and by 4.6% for Group #3, while degrading the accuracy of Groups #4 and #5 by less than 2%. This shows that using OPeN, besides improving the overall accuracy (discussed in Sec. 4.2), results in a more balanced classifier. These results support our claim that OPeN bypasses the limitation of relying solely on augmented versions of existing images, by employing out-of-distribution random images as additional training examples.

Appendix B Additional Details of Experimental Setup

This section details the experimental setup of the experiments performed in Sec. 4.1, including the network architectures we used, the training parameters, and the employed image augmentations.

B.1 Architectures & Training

  • For the CIFAR-10-LT and CIFAR-100-LT datasets [cao2019learning], we use WideResNet-28-10 [zagoruyko2016wide] as our default architecture. Following [kim2020m2m, cao2019learning], we train for 200 epochs with a cross-entropy loss using an SGD optimizer with momentum 0.9 and weight decay of 2e-4. We use a step learning-rate schedule, decaying the learning rate by a factor of 0.01 at epochs 160 and 180.

  • For CelebA-5 [kim2020m2m], we use the same architecture and training parameters as for CIFAR, but train here for 90 epochs and decay the learning rate by a factor of 0.1 at epochs 30 and 60.

  • For ImageNet-LT [liu2019large], we use ResNeXt-50 [xie2017aggregated] with a cosine classifier [gidaris2018dynamic] and train for 220 epochs using an SGD optimizer with momentum 0.9 and weight decay 5e-4. We use a step learning-rate schedule with an initial learning rate of 5e-2, decayed by a factor of 0.1 at epochs 160 and 170 (a configuration sketch of this setup is given after this list).

  • For Places-LT [liu2019large], we follow the procedure of [hong2021disentangling, ren2020balanced, kang2019decoupling], which uses a ResNet-152 pre-trained on ImageNet as a feature extractor. We use a randomly initialized cosine classifier on top of the backbone, and train the entire network end-to-end for an additional 30 epochs using an SGD optimizer with momentum 0.9 and weight decay 5e-4. The initial learning rate is set to 5e-2 for the classifier and 1e-3 for the backbone, and the learning rate is then decayed by a factor of 0.1 at epochs 10 and 15.
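For concreteness, here is a PyTorch sketch of the ImageNet-LT optimization setup listed above (SGD with momentum 0.9, weight decay 5e-4, initial learning rate 5e-2, decayed by 0.1 at epochs 160 and 170, 220 epochs in total); model and train_one_epoch are placeholders.

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=5e-2,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[160, 170], gamma=0.1)

for epoch in range(220):
    train_one_epoch(model, optimizer)   # placeholder: one pass over the (OPeN-deferred) loader
    scheduler.step()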

As in [cao2019learning, kim2020m2m], we too defer our method (OPeN) to the last phase of training. This allows the network to learn an initial representation of the data with natural images only. Only when the learning rate decays (which is when the model is exposed to overfitting) do we add the oversampling + pure-noise images. For ImageNet-LT, CIFAR-10-LT, and CIFAR-100-LT, we defer OPeN to the last 40 epochs. For Places-LT, OPeN is applied in the last 15 epochs (i.e., starting at epoch 15). For CelebA-5, OPeN is applied in the last 30 epochs (i.e., starting at epoch 60).

B.2 Image Augmentations

In each experiment, the following data augmentations were used:

  • For the datasets with small images (i.e., CIFAR-10-LT [cao2019learning], CIFAR-100-LT [cao2019learning] and CelebA-5 [kim2020m2m]), we use random horizontal flip followed by random crop with padding of four pixels, then apply Cutout [devries2017improved] (which zeros out a random window in the image) and SimCLR [chen2020simple] (which includes ColorJitter, random Grayscale and random GaussianBlur).

  • For the datasets with higher-resolution images (i.e., ImageNet-LT [liu2019large] and Places-LT [liu2019large]), we apply a random resized crop (with default parameters) to the target resolution, followed by SimCLR augmentations and random rotation.

When AutoAugment [cubuk2019autoaugment] is employed (see Sec. 4), it replaces all of the above augmentations. We use the CIFAR-10 policy for the CIFAR-10-LT, CIFAR-100-LT, and CelebA-5 datasets, and the ImageNet policy for the ImageNet-LT and Places-LT datasets.