Influence-Balanced Loss for Imbalanced Visual Classification

10/06/2021
by   Seulki Park, et al.
Seoul National University

In this paper, we propose a balancing training method to address problems in imbalanced data learning. To this end, we derive a new loss used in the balancing training phase that alleviates the influence of samples that cause an overfitted decision boundary. The proposed loss efficiently improves the performance of any type of imbalanced learning method. In experiments on multiple benchmark datasets, we demonstrate the validity of our method and reveal that the proposed loss outperforms the state-of-the-art cost-sensitive loss methods. Furthermore, since our loss is not restricted to a specific task, model, or training method, it can be easily used in combination with other recent re-sampling, meta-learning, and cost-sensitive learning methods for class-imbalance problems.

1 Introduction

Despite the remarkable success of deep neural networks (DNNs), many areas of computer vision still suffer from highly imbalanced datasets. Real-world data often exhibit skewed distributions [ref:data_coco, ref:data_iNat, ref:data_pascal_voc, data:liu_yu_cvpr2019, data:uci], in which the number of samples per class differs greatly. This imbalance between classes can be problematic, since a model trained on such imbalanced data tends to overfit the dominant (majority) classes [ref:japkowicz_stehphen_2002, ref:He_Garcia_2009, ref:buda_mazurowski_2018]. That is, while the overall performance appears satisfactory, the model performs poorly on the minority classes. To overcome the class imbalance problem, extensive research has recently been conducted to improve the generalization performance by reducing the overwhelming influence of the dominant classes on the model.

The research on imbalanced learning can be divided into three approaches: the data-level approach, the cost-sensitive re-weighting approach, and the meta-learning approach. The data-level approach aims to directly balance the training data distributions via re-sampling (i.e., under-sampling or over-sampling) [ref:resample_chawla_smote02, ref:resample_hulse_icml07] or by generating synthetic samples [ref:sankha_gamo_iccv2019]. Meanwhile, the cost-sensitive re-weighting approach aims to design new loss functions that re-weight samples by considering their importance [ref:wang_hebert_nips17, ref:huang_tang_cvpr16, ref:lin_focal_loss_iccv17]. Finally, the meta-learning approach enhances the performance of the data-level and/or cost-sensitive re-weighting approaches via meta-learning [ref:shu_metaweightnet_neurips2019, ref:liu_mesa_neurips20, ref:ren_bms_neurips2020]. Most recent data-level approaches require a heavy computational burden. Moreover, under-sampling can lose some valuable information, and over-sampling or data generation can cause overfitting on certain repetitive samples. The meta-learning approach requires additional unbiased data [ref:shu_metaweightnet_neurips2019] or a meta-sampler [ref:ren_bms_neurips2020], which is computationally expensive in practice. Therefore, our work focuses on the cost-sensitive re-weighting approach to design a new loss function that is simple but efficient.

The cost-sensitive re-weighting approach assigns class penalties to shift the decision boundary in a way that reduces the bias induced by the data imbalance. For this purpose, the most commonly adopted method is to re-weight samples inversely to the number of training samples in each class, thereby assigning more weight to the minority classes [ref:huang_tang_cvpr16, ref:wang_hebert_nips17, ref:cui_belongie_cvpr19]. These methods focus only on the global class distribution and assign the same fixed weight to all samples belonging to the same class. However, not all samples in a dataset play an equal role in determining the model parameters [cook_influence]. That is, some samples have a greater influence on forming the decision boundary. Hence, each sample needs to be re-weighted differently according to its impact on the model.

Recently, numerous studies have designed sample-wise loss functions that consider each sample individually [ref:dong_zhu_iccv2017, ref:lin_focal_loss_iccv17, ref:malisiewicz_iccv2011]. Specifically, these methods down-weight well-classified samples and assign more weight to hard examples, which yield high errors. This re-weighting can be problematic because the high capacity of DNNs is sufficient to eventually memorize the whole training data [ref:zhang_vinyals_iclr17, ref:arpit_memoriz_icml2017]. This implies that the DNN becomes overfitted to hard samples, which are located in the overlapping region between the majority and minority classes. In imbalanced data, most hard samples are majority samples, and these samples force the decision boundary to become complex and shift toward the minority region.

To address the aforementioned problem, in this paper, we propose a cost-sensitive method that down-weights samples that cause overfitting of a DNN trained on highly imbalanced data. To this end, we derive a formula that measures how much each sample contributes to the complex and biased decision boundary. To derive the formula, we utilize the influence function [cook_influence], which has been widely used in robust statistics. Using the derived formula, we design a novel loss function, called the influence-balanced (IB) loss, that adaptively assigns different weights to samples according to their influence on the decision boundary. Specifically, we re-weight the loss proportionally to the inverse of the influence of each sample. Our method is divided into two phases: standard training and fine-tuning for influence balancing. During the fine-tuning phase, the proposed IB loss alleviates the influence of the samples that cause overfitting of the decision boundary.

Through extensive experiments on multiple benchmark data sets, we demonstrate the validity of our method, and show that the proposed method outperforms the state-of-the-art cost-sensitive re-weighting methods. Furthermore, since our IB loss is not restricted to a specific task, model, or training method, it can be easily utilized in combination with other recent data-level algorithms and hybrid methods for class-imbalance problems.

The main contributions of this paper are as follows:

  • We discover that existing loss-based re-weighting methods can lead the decision boundary of DNNs to eventually overfit to the majority classes.

  • We design a novel influence-balanced loss function to re-weight samples more effectively in such a way that the overfitting of the decision boundary can be alleviated.

  • We demonstrate that simply substituting our proposed loss for the standard cross-entropy loss significantly improves the generalization performance on highly imbalanced data.

2 Related Work

2.1 Class Imbalance Learning

To solve the imbalanced learning problem, numerous studies have been conducted. The research can be divided into three approaches: data-level, cost-sensitive re-weighting, and meta-learning approaches.

Data-level approach. The data-level approach aims to directly balance the training data distributions by re-sampling (e.g., under-sampling the majority classes or over-sampling the minority classes) [ref:resample_chawla_smote02, ref:resample_hulse_icml07] or generating synthetic samples [ref:sankha_gamo_iccv2019]. However, under-sampling can lose some valuable information, and it is not applicable when the data imbalance between classes is significant. Although over-sampling or data generation could be effective, these methods are susceptible to overfitting to certain repetitive samples, and often require a longer training time.

Re-weighting approach. Cost-sensitive re-weighting methods assign different weights to samples to adjust their importance. Commonly used methods re-weight samples inversely proportional to the class frequency [ref:huang_tang_cvpr16, ref:wang_hebert_nips17] or to the square root of the class frequency [ref:Mahajan_Weiss_eccv2018]. Instead of heuristically using the number of samples per class, Cui et al. [ref:cui_belongie_cvpr19] proposed using the effective number of samples. While these methods can successfully assign more weight to the minority samples, they assign the same weight to all samples belonging to the same class, regardless of each sample's importance.

To assign a different weight to each sample according to its importance to the model, numerous methods were proposed that re-weight samples based on their difficulties or losses [ref:lin_focal_loss_iccv17, ref:dong_zhu_iccv2017, ref:malisiewicz_iccv2011]. That is, these methods down-weight well-classified samples and assign more weight to hard examples. Such re-weighting methods might cause DNNs to overfit to the hard examples, since the high capacity of DNNs is sufficient to memorize the training data in the end [ref:arpit_memoriz_icml2017]. In class-imbalanced data, the hard examples are likely generated from the majority classes, so the minority samples are assigned smaller weights. Therefore, we need a more elaborate means of re-weighting samples that can alleviate the overfitting to the majority samples. Meanwhile, Cao et al. [ref:cao_ldam_neurips2019] proposed the label-distribution-aware margin loss, which regularizes the margins to address the overfitting to the minority classes.

Meta-learning approach. Recently, the meta-learning-based approach [ref:shu_metaweightnet_neurips2019, ref:liu_mesa_neurips20, ref:ren_bms_neurips2020] has emerged to enhance the performance of the other two approaches. Shu et al. [ref:shu_metaweightnet_neurips2019] proposed a meta-learning process to learn a weighting function, while Liu et al. [ref:liu_mesa_neurips20] proposed a re-sampling method that combines the advantages of ensemble learning and meta-learning. Furthermore, Ren et al. [ref:ren_bms_neurips2020] proposed a meta-sampler and a balanced softmax that accommodates the shift between the training and test distributions. Although these methods can achieve satisfactory performance, they are somewhat difficult to implement in practice. For example, meta-weight-net [ref:shu_metaweightnet_neurips2019] requires additional unbiased data for learning, and the meta-sampler in [ref:ren_bms_neurips2020] is computationally expensive in practice. On the other hand, our proposed loss is simple to implement because it does not require a hyperparameter, a specially designed architecture, or additional learning for data re-sampling. Therefore, it is easy to use in collaboration with other methods.

2.2 Influence Function

The influence function, which has been studied for decades in robust statistics [ref:hampel1986robust, cook_influence], was proposed to measure how influential a training sample is to a model. Recently, attempts have been made to use the influence function in deep neural networks [ref:FANN_2003, ref:Koh_Liang_2017]. For example, Koh and Liang [ref:Koh_Liang_2017] employed the influence function to understand DNNs. While the influence function has primarily been used as a diagnostic tool after a model has been trained, our work is the first attempt to apply it to a learning scheme: we design the influence-balanced loss by utilizing the influence function during training.

3 Method

To address the imbalanced data learning problem, our idea is to re-weight samples by their influences on a decision boundary to create a more generalized decision boundary. First, we present the key idea of our proposed method in Section 3.1. For the background, we briefly review the influence function in Section 3.2 and then derive the IB loss in Sections 3.3, 3.4, and 3.5. Finally, the training scheme is presented in Section 3.6.

(a) Original decision boundary.
(b) Proposed method.
Figure 1: Illustration of the key concept of our approach. The red and blue marks belong to the minority and majority classes, respectively, in binary classification. (a) The black border line represents an initial decision boundary formed on an imbalanced dataset. The black samples have greater influence on the decision boundary than do the blue samples, since the decision boundary would substantially change without the black samples. (b) Our proposed method aims to down-weight the samples (light blue samples) that have a large influence on the overfitted decision boundary (dotted line) to create a smoother decision boundary (the red line).

3.1 Key Idea of Proposed Method

In this section, we explain how re-weighting samples according to their influence can help to form a well-generalized decision boundary on class-imbalanced data. It is well known that the high capacity of DNNs is sufficient to eventually memorize the entire training data [ref:zhang_vinyals_iclr17, ref:arpit_memoriz_icml2017]. This implies that a DNN can be overfitted to samples that are located in the overlapping region between the majority and minority classes, as illustrated in Figure 1 (a). In imbalanced data, many majority samples intrude among the sparse minority samples and become dominant in the overlapping area, thereby forcing the decision boundary to become complex and shift toward the minority region.

Furthermore, the black samples in Figure 1 (a) have a stronger influence on forming the decision boundary, as they support the decision boundary, which substantially changes when the samples are removed. Thus, it can be said that the dominant samples with high influence are likely to create a complex and biased decision boundary. As illustrated in Figure 1 (b), by down-weighting the highly influential samples, the decision boundary can be smoothed via fine-tuning. To this end, we derive an influence-balanced (IB) loss by employing the influence function [cook_influence], which measures the training sample’s influence on the model.

3.2 Influence Function

The influence function [cook_influence] allows us to estimate the change in the model parameters when a sample is removed, without actually removing the data and retraining the model. Let $f(x; w)$ denote a model parameterized by $w$ with training data $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the $i$-th training sample and $y_i$ is its label. Given the empirical risk $R(w) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i; w))$, the optimal parameter after initial training is defined by $w^{*} = \arg\min_{w} R(w)$.

During the fine-tuning phase, to address the imbalance issue, we re-weight the loss proportionally to the inverse of the influence of a sample. The influence of a point can be approximated by the parameter change when the distribution of the training data at that point is slightly modified. The new parameter obtained by removing a training point $(x, y)$ is $w^{*}_{-(x,y)} = \arg\min_{w} \frac{1}{n}\sum_{(x_i, y_i) \neq (x, y)} L(y_i, f(x_i; w))$. Then, under the assumption that $R(w)$ is strictly convex for $w$ in the vicinity of $w^{*}$, we can utilize the influence function in [ref:FANN_2003, ref:Koh_Liang_2017] to re-weight the sample-wise loss during the fine-tuning phase. The influence function is given by

$\mathcal{I}(x, y) = -H^{-1} \nabla_{w} L(y, f(x; w^{*})),$   (1)

where $H = \frac{1}{n}\sum_{i=1}^{n} \nabla_{w}^{2} L(y_i, f(x_i; w^{*}))$ is the Hessian and is positive definite by the assumption that $R(w)$ is strictly convex in a local convex basin around the optimal point $w^{*}$.
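Equation (1) follows the standard up-weighting argument of [ref:Koh_Liang_2017]: briefly,

$w^{*}_{\epsilon,(x,y)} = \arg\min_{w}\; R(w) + \epsilon\, L(y, f(x; w)), \qquad \left.\frac{d\, w^{*}_{\epsilon,(x,y)}}{d\epsilon}\right|_{\epsilon=0} = -H^{-1}\nabla_{w} L(y, f(x; w^{*})) = \mathcal{I}(x, y),$

and removing $(x, y)$ corresponds to $\epsilon = -\frac{1}{n}$, so that $w^{*}_{-(x,y)} - w^{*} \approx -\frac{1}{n}\,\mathcal{I}(x, y)$.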

3.3 Influence-Balanced Weighting Factor

From $\mathcal{I}(x, y)$ in (1), we derive the IB loss. Since $\mathcal{I}(x, y)$ is a vector whose computation requires the inverse Hessian, it is nearly impossible to use directly. Therefore, we solve this problem by modifying $\mathcal{I}(x, y)$ into a simple but effective influence-balanced weighting factor. First, since we need only the relative influence of the training samples, not the absolute values, we can simply ignore the inverse Hessian in $\mathcal{I}(x, y)$, because the inverse Hessian is commonly multiplied for all the training samples. Then, we design the IB weighting factor as follows:

$\mathrm{IB}(x, y) = \|\nabla_{w} L(y, f(x; w))\|_{1}.$   (2)

Equation (2) turns out to be the magnitude of the gradient vector. Anand et al. [ref:anand_ranka_1993] revealed that the net error gradient vector is dominated by the major classes in the class imbalance problem. Hence, re-weighting samples by the magnitude of the gradient vector can successfully down-weight samples from the dominant classes. In the Experiments section, we justify the choice of the $L_1$ norm. In the following section, we show how the IB weighting factor is combined with the actual loss.

3.4 Influence-Balanced Loss

When using the softmax cross-entropy loss, Equation (2) can be further simplified. The cross-entropy loss is denoted by $L(y, f(x; w)) = -\sum_{k=1}^{K} y_{k}\log f_{k}(x; w)$, where $y = [y_{1}, \ldots, y_{K}]$ is the one-hot ground truth and $f_{k}(x; w)$ is the $k$-th output of the model $f(x; w)$, with $K$ total classes. Since we are interested in the overfitting on the decision boundary of the model, we focus on the change in the last fully connected (FC) layer of a deep neural network. Let $h$ be the hidden feature vector that is the input to the FC layer, and let the output be denoted by $f(x; w) = \sigma(W^{\top} h)$, where $\sigma$ is the softmax function. The weight matrix of the FC layer is denoted by $W$.

Then, the gradient of the loss w.r.t. the element $w_{kl}$ of $W$ is computed as

$\frac{\partial L}{\partial w_{kl}} = (f_{k}(x; w) - y_{k})\, h_{l}.$   (3)

The same result is obtained for the cross-entropy loss with a sigmoid function or for a mean squared error (MSE) loss in regression. Then, the IB weighting factor in (2) is given by

$\|\nabla_{w} L(y, f(x; w))\|_{1} = \|f(x; w) - y\|_{1} \cdot \|h\|_{1},$   (4)

whose inverse can be used as the re-weighting factor to down-weight an influential sample during fine-tuning, adjusting the decision boundary and thereby enhancing imbalanced data learning. Finally, the influence-balanced loss is given by

$L_{\mathrm{IB}}(y, f(x; w)) = \frac{L(y, f(x; w))}{\|f(x; w) - y\|_{1} \cdot \|h\|_{1}}.$   (5)

The proposed influence-balanced term constrains the decision boundary so that it does not overfit to influential majority samples (see Figure 1(b)).
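To make Eqs. (4) and (5) concrete, the per-sample computation can be sketched in PyTorch as below. This is a minimal illustrative sketch: the function name, tensor shapes, and the small constant added to the denominator (the $\epsilon$ discussed in Implementation Details) are assumptions here, not taken from the official released code.

import torch
import torch.nn.functional as F

def ib_loss_sketch(logits, features, targets, epsilon=1e-3):
    """Influence-balanced loss of Eq. (5), computed per sample.

    logits   : (B, K) outputs of the last FC layer for a mini-batch
    features : (B, L) hidden vectors h that feed the FC layer
    targets  : (B,)   integer class labels
    """
    probs = F.softmax(logits, dim=1)                      # f(x; w)
    onehot = F.one_hot(targets, probs.size(1)).float()    # y
    # IB weighting factor ||f(x; w) - y||_1 * ||h||_1, Eq. (4)
    ib = (probs - onehot).abs().sum(dim=1) * features.abs().sum(dim=1)
    ce = F.cross_entropy(logits, targets, reduction="none")
    # Divide each per-sample loss by its influence estimate, Eq. (5)
    return (ce / (ib + epsilon)).mean()

Whether the weighting factor should be detached from the computation graph is a design choice left open here; the sketch keeps it attached.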

Input : training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$.
Output : influence-balanced model $f(x; w)$.
Phase 1: Normal training
Initialize the model with random parameters $w$.
for $t = 1$ to $T_{0}$ do
       sample a mini-batch $B$ from $D$
       $L(B; w) = \frac{1}{|B|}\sum_{(x, y) \in B} L(y, f(x; w))$
       update $w \leftarrow w - \eta \nabla_{w} L(B; w)$
end for
Phase 2: Fine-tuning for influence balancing
for $t = T_{0}$ to $T$ do
       sample a mini-batch $B$ from $D$
       $L_{\mathrm{IB}}(B; w) = \frac{1}{|B|}\sum_{(x, y) \in B} \lambda_{k}\,\frac{L(y, f(x; w))}{\|f(x; w) - y\|_{1}\,\|h\|_{1}}$
       update $w \leftarrow w - \eta \nabla_{w} L_{\mathrm{IB}}(B; w)$
end for
Algorithm 1 Influence-Balanced Training

3.5 Influence-Balanced Class-wise Re-weighting

Moreover, we add a class-wise re-weighting term to the IB loss in (5) as

$L_{\mathrm{IB}}(y, f(x; w)) = \lambda_{k}\,\frac{L(y, f(x; w))}{\|f(x; w) - y\|_{1} \cdot \|h\|_{1}},$   (6)

where $\lambda_{k} \propto n_{k}^{-1}$. Here, $n_{k}$ is the number of samples in the $k$-th class in the training dataset, and normalization is performed to make the class-wise weights have a similar scale for every class. A hyper-parameter $\alpha$ is introduced for adjustment.

The class-wise re-weighting yields the following two effects. First, $\lambda_{k}$ mitigates the bias of the decision boundary arising from the overall imbalanced distribution by slowing down the loss minimization for the majority classes. Second, $\lambda_{k}$ further controls the sample-wise re-weighting depending on the class to which a highly influential sample belongs. That is, if the sample belongs to a majority class, $\lambda_{k}$ further down-weights the sample, because the decision boundary is likely to be overfitted by majority samples. Meanwhile, if the sample belongs to a minority class, the down-weighting becomes smaller than that applied to a majority sample and the loss is not reduced as much, because the large influence of a minority sample is natural given the data scarcity.
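A possible form of the class-wise factor is sketched below; the particular normalization constant is an assumption of this sketch.

import torch

def class_weights_sketch(samples_per_class, alpha=1.0):
    """Class-wise factor lambda_k proportional to 1/n_k (Sec. 3.5)."""
    inv = 1.0 / torch.as_tensor(samples_per_class, dtype=torch.float32)
    return alpha * inv / inv.sum()   # assumed normalization over classes

In use, the per-sample IB loss from the earlier sketch would be multiplied by the returned weights, indexed by each sample's class label, before averaging.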

3.6 Influence-Balanced Training Scheme

The influence-balanced training process comprises two phases: normal training and fine-tuning for balancing. We refer to $T_{0}$ as the transition time from normal training to fine-tuning. During the normal training phase, the network is trained following any training scheme for the first $T_{0}$ epochs. Meanwhile, during the fine-tuning phase, the influence-balanced loss is applied to mitigate the overfitting of the decision boundary caused by the influential (noisy) majority samples. Since the IB loss in the fine-tuning phase alleviates this overfitting, it is advantageous to set $T_{0}$ to the epoch at which the model has begun to converge to a local (or global) minimum. In general, we recommend setting $T_{0}$ to half of the total training schedule. We present the performance change according to the number of normal-training epochs in the Experiments section. Notably, our training does not require an additional training scheme or a specially designed architecture; thus, it can easily be utilized in any task suffering from imbalanced data. The pseudo-code of the training procedure is presented in Algorithm 1.
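The two-phase schedule of Algorithm 1 can be written as an ordinary training loop. The sketch below assumes a model that returns both the logits and the pre-FC feature vector, and it reuses ib_loss_sketch from the earlier sketch; the default epoch counts follow the CIFAR setup described in the Experiments section.

import torch.nn.functional as F

def train_influence_balanced(model, loader, optimizer, total_epochs=200, t0=100):
    """Phase 1 (epoch < t0): cross-entropy; Phase 2 (epoch >= t0): IB loss."""
    for epoch in range(total_epochs):
        for x, y in loader:
            logits, features = model(x)   # assumes the model also exposes h
            if epoch < t0:
                loss = F.cross_entropy(logits, y)            # normal training
            else:
                loss = ib_loss_sketch(logits, features, y)   # influence-balanced fine-tuning
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()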

4 Experiments

4.1 Experimental Settings

Datasets.

We verified the effectiveness of our method on four commonly used benchmark datasets: CIFAR-10 and CIFAR-100 [ref:data_cifar], Tiny ImageNet [data:tinyimagenet], and iNaturalist 2018 [ref:data_iNat]. The CIFAR-10 and CIFAR-100 datasets consist of 50,000 training images and 10,000 test images with 10 and 100 classes, respectively. Meanwhile, Tiny ImageNet contains 200 classes for training, in which each class has 500 images; its test set contains 10,000 images. Since CIFAR and Tiny ImageNet are evenly distributed, we made these datasets imbalanced following [ref:cui_belongie_cvpr19, ref:buda_mazurowski_2018]. We investigate two common types of imbalance: (i) long-tailed imbalance [ref:cui_belongie_cvpr19] and (ii) step imbalance [ref:buda_mazurowski_2018]. In long-tailed imbalance, the number of training samples for each class decreases exponentially from the largest majority class to the smallest minority class. To construct the long-tailed datasets, the number of selected samples in the $k$-th class was set to $n_{k}\mu^{k}$, where $n_{k}$ is the original number of samples in the $k$-th class and $\mu \in (0, 1)$. Meanwhile, in step imbalance, the classes are divided into two groups: the majority class group and the minority class group. Every class within a group contains the same number of samples, and a class in the majority group has many more samples than a class in the minority group. For evaluation, we used the original test set. The imbalance ratio is defined by $\rho = \max_{k}\{n_{k}\} / \min_{k}\{n_{k}\}$ and thus represents the degree of imbalance in the dataset. We evaluated the performance of our method under various imbalance ratios from 10 to 200.
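For reference, the per-class counts used above can be generated as follows. This is a small sketch (the actual subsampling code is not reproduced here), but with $n_{\max} = 5000$, 10 classes, and $\rho = 50$ the long-tailed counts match the numbers listed in Table 3.

def long_tailed_counts(n_max=5000, num_classes=10, rho=50):
    """Counts decay exponentially from n_max down to n_max / rho."""
    return [int(n_max * (1.0 / rho) ** (k / (num_classes - 1)))
            for k in range(num_classes)]

def step_counts(n_max=5000, num_classes=10, rho=50, num_minority=5):
    """Half of the classes keep n_max samples; the rest keep n_max / rho."""
    return [n_max] * (num_classes - num_minority) + [int(n_max / rho)] * num_minority

# long_tailed_counts() -> [5000, 3237, 2096, 1357, 878, 568, 368, 238, 154, 100]
# step_counts()        -> [5000, 5000, 5000, 5000, 5000, 100, 100, 100, 100, 100]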

The iNaturalist 2018 dataset is a large-scale real-world dataset containing 437,513 training images and 24,426 test images with 8,142 classes. iNaturalist 2018 exhibits long-tailed imbalance, whose imbalance ratio is 500. We used the official training and test splits in our experiments.

Baselines. We compared our algorithm with the following cost-sensitive loss methods: (1) Our baseline model, which is trained on the standard cross-entropy loss. Comparing our model with this baseline enables us to clearly understand how much our training scheme has improved the performance; (2) focal loss [ref:lin_focal_loss_iccv17], which increases the relative loss for hard samples and down-weights well-classified samples; (3) CB loss [ref:cui_belongie_cvpr19], which re-weights the loss inversely proportional to the effective number of samples; (4) LDAM loss [ref:cao_ldam_neurips2019], which regularizes the minority classes to have larger margins.

Since our IB loss can be easily combined with other methods, we employ two further variants. First, IB + CB uses the effective number of samples from the CB loss as the class-wise weighting, instead of the inverse class frequency used in IB. Second, IB + Focal uses the focal loss during the fine-tuning phase, instead of the cross-entropy loss. We demonstrate that such combinations with other methods can further improve the performance.
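For the IB + Focal variant, a natural way to swap the per-sample term is sketched below; the focusing parameter gamma and this particular formulation are assumptions of the sketch rather than the exact variant used in the experiments.

import torch
import torch.nn.functional as F

def focal_term(logits, targets, gamma=2.0):
    """Per-sample focal term that can replace the cross-entropy term in ib_loss_sketch."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                  # model probability of the true class
    return (1.0 - pt) ** gamma * ce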

Implementation Details.

We used PyTorch [ref:Pytorch] to implement and train all the models in this paper, and we used the ResNet architecture [ref:resnet_cvpr2016] for all datasets. For the CIFAR datasets, we used a randomly initialized ResNet-32. The networks were trained for 200 epochs with stochastic gradient descent (SGD) (momentum = 0.9). Following the training strategy in [ref:cui_belongie_cvpr19, ref:cao_ldam_neurips2019], the initial learning rate was set to 0.1 and then decayed by 0.01 at the 160th epoch and again at the 180th epoch. Furthermore, we used a linear warm-up of the learning rate [ref:goyal_he_corr2017] in the first five epochs. Since our method uses a two-phase training schedule, we trained for the first 100 epochs with the standard cross-entropy loss and then fine-tuned the networks with the IB loss for the next 100 epochs. We trained the models for CIFAR on a single NVIDIA GTX 1080Ti with a batch size of 128. For Tiny ImageNet, we employed ResNet-18 and used stochastic gradient descent with a momentum of 0.9 and weight decay for training. The networks were initially trained for 50 epochs and then fine-tuned for the subsequent 50 epochs with the IB loss. The initial learning rate was set to 0.1 and was dropped by a factor of 0.1 after 50 and 90 epochs. For iNaturalist 2018, we trained ResNet-50 with four GTX 1080Ti GPUs. The networks were initially trained for 50 epochs and then fine-tuned for the subsequent 150 epochs with the IB loss. The initial learning rate was set to 0.01 and was decreased by a factor of 0.1 after 30 and 180 epochs.
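A minimal sketch of the CIFAR optimizer and learning-rate schedule described above (the warm-up implementation and any value not stated in the text are assumptions):

import torch

def make_cifar_optimizer(model):
    """SGD with momentum 0.9; lr starts at 0.1, decayed by 0.01 at epochs 160 and 180."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # A linear warm-up over the first five epochs would scale the lr before this schedule applies.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[160, 180], gamma=0.01)
    return optimizer, scheduler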

As a simple but important implementation trick, we added a small constant $\epsilon$ to the IB weighting factor to prevent numerical instability in the inversion when the influence approaches zero. We discuss the effect of the hyperparameter $\epsilon$ in the following section.

4.2 Analysis

To validate the proposed method, we conducted extensive experiments.

Is influence meaningful for re-weighting? First, to confirm whether influence can act as a meaningful clue for re-weighting in class imbalance learning, we compared the influences between a balanced dataset and an imbalanced dataset. For the imbalanced case, we used the long-tailed version of CIFAR-10 with an imbalance ratio of $\rho = 100$, in which the largest class, ‘plane’ (i.e., class index 0), contains 5,000 samples, while the smallest class, ‘truck’ (i.e., class index 9), contains only 50 samples. We trained ResNet-32 with the standard cross-entropy loss for 200 epochs, as described in Implementation Details, on both the balanced (original) and imbalanced CIFAR-10. We plotted the influences of both classes in Figure 2, scaling the influences to between 0 and 1 for each dataset. Since the minority class contains only 50 samples, we selected the 50 samples with the highest influence for comparison. As illustrated in Figure 2, there was little difference in the distributions of the influences between the classes in the balanced dataset. However, in the imbalanced dataset, the minority samples had significantly less influence on the model than the majority samples. This result corroborates that majority samples contribute greatly to forming the decision boundary, and re-weighting by their influences can improve the generalization of the model.

Figure 2: Comparison of influences between the balanced and imbalanced datasets. We plotted the influences of samples on ResNet-32 trained on the original CIFAR-10 and on the imbalanced version of CIFAR-10. The solid and dashed lines represent the influences for the imbalanced data and the balanced data, respectively. While there is little difference in the balanced dataset, the influence of the dominant class is much greater than that of the minority class in the imbalanced dataset.

Magnitude of Influence. In Section 3.3, we used the $L_1$ norm to compute the magnitude of the influences. We investigated performance variations depending on the vector norm used to compute the magnitude of the gradient vector: $\|\cdot\|_{1}$, $\|\cdot\|_{2}$, and $\|\cdot\|_{\infty}$. As indicated in Table 1, the $L_1$ norm, which provides a distinctive change of influence around the equilibrium point, exhibits the best classification accuracy on CIFAR-10 and CIFAR-100 with multiple imbalance ratios.

CIFAR-10 CIFAR-100
Imbalance ratio ($\rho$)
$\|\cdot\|_{1}$ 78.41 85.80 40.85 52.85
$\|\cdot\|_{2}$ 75.67 84.35 36.41 50.95
$\|\cdot\|_{\infty}$ 77.23 84.30 37.48 50.99
Table 1: Comparison of norms. Using the $L_1$ norm yields the best performance.
Figure 3: Influence-balanced training scheme. We varied the number of normal-training epochs, $T_{0}$, to determine the best transition time from normal training to influence-balanced fine-tuning. We achieved the best performance when setting the transition time to the point at which the training loss converges.

Timing for starting fine-tuning for balancing. Our training scheme is divided into two phases: normal training and fine-tuning for balancing. This requires determining the transition time $T_{0}$ between normal training and fine-tuning for balancing. Hence, we investigated how much the transition time affects the performance and determined the best transition time. For this, we experimented on the long-tailed version of CIFAR-10 with two imbalance ratios. In Figure 3, the $x$-axis represents the number of training epochs for the normal training phase. We varied the transition time $T_{0}$ from 0 to 120 while the total number of training epochs was fixed at 200. The solid lines represent the classification accuracy achieved by the models for each training schedule. To analyze the relationship between the convergence of the normal training phase and the transition timing, we also plotted the standard cross-entropy loss without adopting the IB loss over the whole training (dashed lines).

From Figure 3, it can be observed that the proposed method demonstrates robust performance regardless of the choice of the transition time $T_{0}$. Yet, transitioning to fine-tuning after the 100th epoch, when the training loss has converged, yields the best performance. Since the influence function is derived in the context of loss minimization [ref:Koh_Liang_2017], it is reasonable to begin the fine-tuning phase after the learning converges.

Effects of $\epsilon$. As mentioned in Implementation Details, for all datasets, we added the hyperparameter $\epsilon$ to the IB weighting factor to prevent numerical instability. To analyze its effect, we conducted experiments with the following denominators for the IB loss (5): (a) $\mathrm{IB} + 10^{-8}$, (b) $\mathrm{IB} + 10^{-3}$, (c) $\mathrm{IB} + 10^{-2}$, and (d) $10^{-3}$ only, i.e., without the IB weighting factor. We repeated the experiments three times with different random seeds on the long-tailed CIFAR-10. As presented in Table 2, setting $\epsilon$ to $10^{-3}$ yields the best performance; thus, we set $\epsilon = 10^{-3}$ in all the experiments. In contrast, when we did not use the IB weighting factor, the accuracy decreased greatly.

Denominator (a) IB + 1e-8 (b) IB + 1e-3 (c) IB + 1e-2 (d) 1e-3
Accuracy (%)
Table 2: Effects of $\epsilon$.
Imbalanced CIFAR-10
Class plane car bird cat deer dog frog horse ship truck
Long-Tailed ($\rho = 50$)
#Training samples 5000 3237 2096 1357 878 568 368 238 154 100
Baseline (CE) 97.4 98.0 84.0 80.3 78.8 68.4 76.1 64.5 57.0 52.0
Focal [ref:lin_focal_loss_iccv17] 91.6 95.1 73.1 59.2 67.8 67.2 84.2 77.3 83.9 61.8
CB  [ref:cui_belongie_cvpr19] 92.9 96.3 79.2 75.1 82.4 69.9 75.0 69.1 73.6 66.8
LDAM [ref:cao_ldam_neurips2019] 96.9 98.5 82.9 74.7 82.8 69.0 78.5 69.9 65.3 66.0
LDAM-DRW [ref:cao_ldam_neurips2019] 94.8 97.8 82.6 72.3 85.3 73.0 82.0 76.7 75.8 72.4
IB 92.2 96.2 81.3 66.6 85.7 76.4 81.7 75.9 79.9 81.1
IB + CB 93.8 97.2 78.1 64.8 84.8 74.2 86.4 79.7 79.5 76.9
IB + Focal 90.9 96.1 81.7 69.0 82.0 75.7 85.2 77.5 80.2 76.8
Step-Imbalance ($\rho = 50$)
#Training samples 5000 5000 5000 5000 5000 100 100 100 100 100
Baseline (CE) 95.9 99.2 91.5 91.9 95.5 24.8 40.2 46.7 52.7 55.1
Focal [ref:lin_focal_loss_iccv17] 96.3 93.9 91.2 90.5 95.7 20.0 46.7 48.8 56.1 57.6
CB  [ref:cui_belongie_cvpr19] 87.4 96.3 76.8 77.0 85.7 34.6 61.5 56.5 68.7 63.8
LDAM [ref:cao_ldam_neurips2019] 96.4 98.5 91.1 90.2 94.6 28.3 50.3 57.0 56.2 64.4
LDAM-DRW [ref:cao_ldam_neurips2019] 94.5 97.2 88.0 84.5 94.3 50.4 69.9 71.4 74.6 76.0
IB 94.0 97.7 86.7 83.2 93.8 56.9 71.0 75.1 76.5 81.7
IB + CB 91.8 95.7 86.6 79.4 93.6 62.8 77.2 72.3 74.2 87.3
IB + Focal 91.2 96.4 83.3 77.1 92.0 64.8 78.0 74.4 83.5 83.1
Table 3: Class-wise classification accuracy (%) of ResNet-32 on the imbalanced CIFAR-10 dataset. The number of test samples is the same (1,000) for every class. The best results are marked in bold.

4.3 Comparison of Class-Wise Accuracy

In this section, to validate that the performance improvement has actually resulted from the minority classes, not from the majority classes, we report the class-wise accuracy on both the long-tailed and the step-imbalanced CIFAR-10. We compare the proposed method with the state-of-the-art cost-sensitive loss methods. Since previous studies do not report the class-wise accuracy on the imbalanced CIFAR-10, we implemented the baseline methods  [ref:lin_focal_loss_iccv17, ref:cui_belongie_cvpr19, ref:cao_ldam_neurips2019]. For the implementation of LDAM [ref:cao_ldam_neurips2019], we used their official implementation code to reproduce the results.

The overall results are reported in Table 3. As presented in Table 3, existing methods exhibit severe performance degradation in the minority classes. That is, the reported improvements from the existing methods were attributed to the majority classes, not the minority classes. In contrast, the proposed IB loss exhibited a significant improvement in all the minority classes.

It is noteworthy that the improvement of the focal loss [ref:lin_focal_loss_iccv17] was not significant, especially on the step-imbalanced CIFAR-10. We argue that this demonstrates that most hard examples in highly imbalanced data are majority samples and that those samples force the decision boundary to be overfitted. In contrast, our proposed influence-balanced re-weighting can mitigate the influences of the majority samples that cause overfitting. As a result, it achieves robust and superior performance for the minority classes with a very small number of samples.

Although using the influence-balanced loss alone can achieve significant enhancement for the classification of the minority classes, it is beneficial to combine it with other methods. For example, the results indicate that applying the influence-balanced loss with the focal loss can encourage the network to learn ‘good’ hard samples, while down-weighting the influential ones that induce overfitting.

Imbalanced CIFAR-10 Imbalanced CIFAR-100
Imbalance ratio ($\rho$) 200 100 50 20 10 200 100 50 20 10
Long-Tailed
Baseline (CE) 66.28 70.87 78.22 82.43 86.49 33.54 38.05 43.71 51.21 56.96
Focal [ref:lin_focal_loss_iccv17] 65.29 70.38 76.71 82.76 86.66 35.62 38.41 44.32 51.95 55.78
CB  [ref:cui_belongie_cvpr19] 68.89 74.57 79.27 84.36 87.49 36.23 39.60 45.32 52.59 57.99
LDAM [ref:cao_ldam_neurips2019] - 73.35 - - 86.96 - 39.6 - - 56.91
LDAM-DRW [ref:cao_ldam_neurips2019] - 77.03 - - 88.16 - 42.04 - - 57.99
IB 73.96 78.26 81.70 85.8 88.25 37.31 42.14 46.22 52.63 57.13
IB + CB 73.69 78.04 81.54 85.42 88.09 37.06 41.31 46.16 52.74 56.78
IB + Focal 75.05 79.76 81.51 85.31 88.04 38.23 42.06 47.49 53.28 58.20
Step-Imbalance
Baseline (CE) 56.97 64.81 69.35 79.71 84.16 38.29 39.27 41.65 48.55 54.13
LDAM [ref:cao_ldam_neurips2019] - 66.58 - - 85.00 - 39.58 - - 56.27
LDAM-DRW [ref:cao_ldam_neurips2019] - 76.92 - - 87.81 - 45.36 - - 59.46
IB 72.15 76.53 81.66 85.41 87.72 39.66 45.39 48.93 53.57 57.96
IB + CB 69.96 75.97 82.09 85.27 88.01 39.69 45.27 48.80 53.42 57.86
IB + Focal 74.12 77.97 82.38 85.68 87.90 40.39 44.96 48.92 54.53 59.54
Table 4: Classification accuracy (%) of ResNet-32 on the imbalanced CIFAR-10 and CIFAR-100 datasets. Some LDAM results are copied from the original paper [ref:cao_ldam_neurips2019] or from the experiments in CB [ref:cui_belongie_cvpr19]; "-" indicates results that were not reported. The best results are marked in bold.

4.4 Comparison with State-of-the-Art

Experimental results on CIFAR. The overall classification accuracy is provided in Table 4. The model performance is reported on the balanced test set, in the same way as for the other methods. The results indicate that adopting the proposed influence-balanced loss significantly improves the generalization performance and outperforms the recent cost-sensitive loss methods. On multiple benchmark settings, using the IB loss alone achieves the best performance. This suggests that balancing the influence of the samples responsible for overfitting the decision boundary is effective for the robustness of the model. When combined with other methods [ref:cui_belongie_cvpr19, ref:lin_focal_loss_iccv17], we could further improve the accuracy on multiple datasets. This indicates that our proposed method of down-weighting influential samples that induce overfitting can benefit other methods as well.

Experimental results on Tiny ImageNet. We evaluated our method on Tiny ImageNet. While we performed the experiments for the other baselines, the results of LDAM were copied from their original paper. As presented in Table 5, IB loss outperforms other baselines on Tiny ImageNet as well.

Experimental results on iNaturalist 2018. We evaluated our method on the large-scale real-world dataset iNaturalist 2018 and compared it with the state-of-the-art loss-based methods. Table 6 reveals that simply balancing the influence of samples through the loss achieves a considerable improvement.

5 Conclusion

In this paper, we proposed a novel influence-balanced loss to address the overfitting to the majority classes in class imbalance problems. A model trained on class-imbalanced data is susceptible to overfitting due to the high capacity of DNNs and the scarcity of samples in certain classes. Therefore, as learning progresses, existing methods are likely to produce undesirable results, such as assigning higher weights to samples from the majority classes. Unlike existing methods, the IB loss can robustly assign weights because it directly focuses on each sample's influence on the model. We conducted experiments demonstrating that our method improves generalization performance under class imbalance. In addition, our method is easy to implement and integrate into existing methods. In the future, we plan to extend our method by incorporating data-level methods or other recent meta-learning methods.

Long-Tailed Step-Imbalance
Imbalance ratio ($\rho$) 100 10 100 10
Baseline (CE) 38.52 36.62 36.74 51.11
Focal [ref:lin_focal_loss_iccv17] 38.95 54.02 38.24 41.77
CB [ref:cui_belongie_cvpr19] 41.37 54.82 37.35 54.3
LDAM* [ref:cao_ldam_neurips2019] 37.47 52.78 39.37 52.57
IB 42.65 57.22 41.13 54.83
Table 5: Classification accuracy (%) of ResNet-18 on Tiny ImageNet. Results marked with * are copied from the original paper.
iNaturalist 2018
Method top1 top5
Baseline (CE) 57.30 79.48
Focal [ref:lin_focal_loss_iccv17] 58.03 78.65
CB [ref:cui_belongie_cvpr19] 61.12 81.03
LDAM [ref:cao_ldam_neurips2019] 64.58 83.52
IB 65.39 84.98
Table 6: Classification accuracy (%) of ResNet-50 on iNaturalist 2018.

Acknowledgement

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. B0101-15-0266, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis) and (2017-0-00306, Multimodal sensor-based intelligent systems for outdoor surveillance robots).

References