Balanced Meta-Softmax for Long-Tailed Visual Recognition

Deep classifiers have achieved great success in visual recognition. However, real-world data is long-tailed by nature, leading to the mismatch between training and testing distributions. In this paper, we show that Softmax function, though used in most classification tasks, gives a biased gradient estimation under the long-tailed setup. This paper presents Balanced Softmax, an elegant unbiased extension of Softmax, to accommodate the label distribution shift between training and testing. Theoretically, we derive the generalization bound for multiclass Softmax regression and show our loss minimizes the bound. In addition, we introduce Balanced Meta-Softmax, applying a complementary Meta Sampler to estimate the optimal class sample rate and further improve long-tailed learning. In our experiments, we demonstrate that Balanced Meta-Softmax outperforms state-of-the-art long-tailed classification solutions on both visual recognition and instance segmentation tasks.


1 Introduction

Most real-world data is long-tailed by nature: a few high-frequency classes (head classes) contribute most of the observations, while a large number of low-frequency classes (tail classes) are under-represented. In the instance segmentation dataset LVIS (Gupta et al., 2019), for example, the number of instances in the banana class can be thousands of times larger than that in the bait class. In practice, the number of samples per class generally decreases from head to tail classes exponentially. Under the power law, the tails can be undesirably heavy. A model that minimizes empirical risk on a long-tailed training dataset often underperforms on a class-balanced test dataset. As datasets scale up, the long-tailed nature poses critical difficulties to many vision tasks, e.g., visual recognition and instance segmentation.

An intuitive solution to long-tailed tasks is to re-balance the data distribution. Most state-of-the-art (SOTA) methods use class-balanced sampling or loss re-weighting to "simulate" a balanced training set (Byrd and Lipton, 2018; Wang et al., 2017). However, they may under-represent the head classes or have gradient issues during optimization. Cao et al. (2019) introduced the Label-Distribution-Aware Margin Loss (LDAM) from the perspective of the generalization error bound. Given fewer training samples, a tail class should have a higher generalization error bound during optimization. Nevertheless, LDAM is derived from the hinge loss under a binary classification setup and is not suitable for multi-class classification.

We propose Balanced Meta-Softmax (BALMS) for long-tailed visual recognition. We first show that the Softmax function is intrinsically biased under the long-tailed scenario. We derive a Balanced Softmax function from the probabilistic perspective that explicitly models the test-time label distribution shift. Theoretically, we found that optimizing for the Balanced Softmax cross-entropy loss is equivalent to minimizing the generalization error bound. Balanced Softmax generally improves long-tailed classification performance on datasets with moderate imbalance ratios, e.g., CIFAR-10-LT (Krizhevsky, 2009) with a maximum imbalance factor of 200. However, for datasets with an extremely large imbalance factor, e.g., LVIS (Gupta et al., 2019) with an imbalance factor of 26,148, the optimization process becomes difficult. Complementary to the loss function, we introduce the Meta Sampler, which learns to re-sample for achieving high validation accuracy by meta-learning. The combination of Balanced Softmax and Meta Sampler could efficiently address long-tailed classification tasks with high imbalance factors.

We evaluate BALMS on both long-tailed image classification and instance segmentation on five commonly used datasets: CIFAR-10-LT (Krizhevsky, 2009), CIFAR-100-LT (Krizhevsky, 2009), ImageNet-LT (Liu et al., 2019), Places-LT (Zhou et al., 2017) and LVIS (Gupta et al., 2019). On all datasets, BALMS outperforms state-of-the-art methods. In particular, BALMS outperforms all SOTA methods by a large margin on LVIS, which has an extremely high imbalance factor.

We summarize our contributions as follows: 1) we theoretically analyze the incapability of Softmax function in long-tailed tasks; 2) we introduce Balanced Softmax function that explicitly considers the generalization error bound during optimization; 3) we present the Meta Sampler, a meta-learning based re-sampling strategy for long-tailed learning.

2 Related Works

Data Re-Balancing. Pioneering works focus on re-balancing during training. Specifically, re-sampling strategies (Byrd and Lipton, 2018; Kubat and Matwin, 1997; Chawla et al., 2002; Han et al., 2005; He and Garcia, 2009; Shen et al., 2016; Buda et al., 2018; Barandela et al., 2009) try to restore the true distributions from the imbalanced training data. Re-weighting, i.e., cost-sensitive learning (Wang et al., 2017; Huang et al., 2016, 2019; Khan et al., 2015; Mikolov et al., 2013), assigns a cost weight to the loss of each class. However, it is argued that over-sampling inherently overfits the tail classes and under-sampling under-represents the rich variations of head classes. Meanwhile, re-weighting tends to cause unstable training, especially when the class imbalance is severe, because very large weights produce abnormally large gradients.

Loss Function Engineering. Tan et al. (2020) pointed out that randomly dropping some scores of tail classes in the Softmax function can effectively help, by balancing the positive gradients and negative gradients flowing through the score outputs. Cao et al. (2019) showed that the generalization error bound could be minimized by increasing the margins of tail classes. Khan et al. (Hayat et al., 2019) modify the loss function based on Bayesian uncertainty. Li et al. (2019) propose two novel loss functions to balance the gradient flow. Nevertheless, as we show in this paper, the Softmax function, which is commonly adopted in these methods, is biased under long-tailed scenarios. Our proposed method with Balanced Softmax addresses the issue and significantly improves the standard loss-based methods.

Meta-Learning. Many approaches (Jamal et al., 2020; Ren et al., 2018; Shu et al., 2019) have been proposed to tackle the long-tailed issue with meta-learning. Many of them focus on optimizing the per-sample weight as a learnable parameter, which appears as a hyper-parameter in the re-weighting approach. This group of methods requires a clean and unbiased dataset as a meta set, i.e., a development set, which is usually a fixed subset of the training images, and uses bi-level optimization to estimate the weight parameter.

Decoupled Training. A few recent works (Kang et al., 2020; Zhou et al., 2019) point out that decoupled training, a simple yet effective solution, can significantly improve generalization on long-tailed datasets: the classifier is the only under-performing component when training on imbalanced datasets. However, in our experiments, we found this technique inadequate for datasets with extremely high imbalance factors, e.g., LVIS (Gupta et al., 2019). Interestingly, we also observed that decoupled training is complementary to our proposed BALMS, and combining them results in additional improvements.

3 Balanced Meta-Softmax

The major challenge for long-tailed visual recognition is the mismatch between training data distribution and test data distribution. Let $\mathcal{X} = \{x_i, y_i\}, i \in \{1, \cdots, n\}$ be the balanced testing set, where $x_i$ denotes a data point and $y_i$ denotes its label. Let $k$ be the number of classes, $n_j$ be the number of samples in class $j$, where $\sum_{j=1}^{k} n_j = n$. Similarly, we denote the long-tailed training set as $\hat{\mathcal{X}} = \{\hat{x}_i, \hat{y}_i\}, i \in \{1, \cdots, n\}$. Normally, we have $\forall i,\ p(\hat{y}_i) \neq p(y_i)$. Specifically, for a tail class $j$, $p(\hat{y}_j) \ll p(y_j)$, which makes the generalization under long-tailed scenarios extremely challenging.

We introduce Balanced Meta-Softmax (BALMS) for long-tailed visual recognition. It has two components: 1) a Balanced Softmax function that accommodates the label distribution shift between training and testing; 2) a Meta Sampler that learns to re-sample the training set by meta-learning. We denote a feature extractor function as $f$ and a linear classifier's weight as $\theta$.

3.1 Balanced Softmax

Label Distribution Shift.

We begin by revisiting the multi-class Softmax regression, where we are generally interested in estimating the conditional probability $p(y|x)$, which can be modeled as a multinomial distribution $\phi$:

$$\phi = \phi_1^{\mathbb{1}\{y=1\}} \phi_2^{\mathbb{1}\{y=2\}} \cdots \phi_k^{\mathbb{1}\{y=k\}}; \quad \phi_j = \frac{e^{\eta_j}}{\sum_{i=1}^{k} e^{\eta_i}}; \quad \sum_{j=1}^{k} \phi_j = 1 \tag{1}$$

where $\mathbb{1}(\cdot)$ is the indicator function and the Softmax function maps a model's class-$j$ output $\eta_j = \theta_j^T f(x)$ to the conditional probability $\phi_j$.

From the perspective of Bayesian inference, $\phi_j$ can also be interpreted as:

$$\phi_j = p(y=j|x) = \frac{p(x|y=j)\, p(y=j)}{p(x)} \tag{2}$$

where $p(y=j)$ is of particular interest under the class-imbalanced setting. Assuming that all instances in the training dataset and the test dataset are generated from the same process $p(x|y=j)$, there could still be a discrepancy between training and testing given different label distribution $p(y=j)$ and evidence $p(x)$. With a slight abuse of notation, we re-define $\phi$ to be the conditional distribution on the balanced test set and define $\hat{\phi}$ to be the conditional probability on the imbalanced training set. As a result, standard Softmax provides a biased estimation for $\phi$.
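As a small numeric illustration of this label distribution shift, the sketch below applies Bayes' rule to a single input under two different label priors. The likelihoods and priors are made-up numbers and `posterior` is a hypothetical helper, not part of the paper's code; the point is only that the same class-conditional likelihoods can yield different predictions once the label prior changes.

```python
def posterior(likelihoods, priors):
    """Bayes' rule: p(y=j|x) = p(x|y=j) p(y=j) / p(x)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)                      # p(x) under the given priors
    return [j / evidence for j in joint]

likelihoods = [0.2, 0.6]        # hypothetical p(x|y=j) for one input x
train_priors = [0.95, 0.05]     # long-tailed training label distribution
test_priors = [0.5, 0.5]        # balanced test label distribution

print(posterior(likelihoods, train_priors))   # head class dominates
print(posterior(likelihoods, test_priors))    # tail class is more likely
```

With these numbers the training-time posterior favors the head class even though the input is three times more likely under the tail class, which is the bias the section describes.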

Balanced Softmax. To eliminate the discrepancy between the posterior distributions of training and testing, we introduce the Balanced Softmax. We use the same model outputs $\eta$ to parameterize two conditional probabilities: $\phi$ for testing and $\hat{\phi}$ for training.

Theorem 1.

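Assume $\phi$ to be the desired conditional probability of the balanced data set, with the form $\phi_j = p(y=j|x) = \frac{p(x|y=j)}{p(x)} \frac{1}{k}$, and $\hat{\phi}$ to be the desired conditional probability of the imbalanced training set, with the form $\hat{\phi}_j = \hat{p}(y=j|x) = \frac{p(x|y=j)}{\hat{p}(x)} \frac{n_j}{\sum_{i=1}^{k} n_i}$. If $\phi$ is expressed by the standard Softmax function of model output $\eta$, then $\hat{\phi}$ can be expressed as

$$\hat{\phi}_j = \frac{n_j e^{\eta_j}}{\sum_{i=1}^{k} n_i e^{\eta_i}}. \tag{3}$$

We use the exponential family parameterization to prove Theorem 1. The proof can be found in the supplementary materials. Theorem 1 essentially shows that applying the following Balanced Softmax function can naturally accommodate the label distribution shifts between the training and test sets. We define the Balanced Softmax function as

$$\hat{l}(\theta) = -\log(\hat{\phi}_y) = -\log\left(\frac{n_y e^{\eta_y}}{\sum_{i=1}^{k} n_i e^{\eta_i}}\right). \tag{4}$$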
We further investigate the improvement brought by the Balanced Softmax in the following sections.
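A minimal sketch of the Balanced Softmax loss follows, assuming the form $-\log\big(n_y e^{\eta_y} / \sum_i n_i e^{\eta_i}\big)$ with $n_i$ the training count of class $i$. It uses only the Python standard library; the function names, logits and class counts are illustrative assumptions rather than the authors' implementation. Multiplying $e^{\eta_i}$ by $n_i$ is implemented as adding $\log n_i$ to each logit before a numerically stable log-softmax.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def softmax_loss(logits, label):
    """Standard Softmax cross-entropy: -log(phi_y)."""
    return -log_softmax(logits)[label]

def balanced_softmax_loss(logits, label, class_counts):
    """Balanced Softmax loss: -log(n_y e^{eta_y} / sum_i n_i e^{eta_i}).

    Adding log(n_i) to each logit is algebraically the same as
    multiplying e^{eta_i} by n_i inside the Softmax.
    """
    shifted = [eta + math.log(n) for eta, n in zip(logits, class_counts)]
    return -log_softmax(shifted)[label]

logits = [2.0, 0.5, -1.0]      # hypothetical model outputs eta
counts = [5000, 500, 50]       # hypothetical per-class training counts n_j

for y in range(3):
    print(y, softmax_loss(logits, y), balanced_softmax_loss(logits, y, counts))
```

Under this reading, the shift is applied only when computing the training loss; at test time the unshifted logits are passed through the standard Softmax, which corresponds to $\phi$ on the balanced test distribution.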

Many vision tasks, e.g., instance segmentation, might use multiple binary logistic regressions instead of a multi-class Softmax regression. By virtue of Bayesian law, a similar strategy can be applied to the multiple binary logistic regressions. The detailed derivation is left in the supplementary materials.

Generalization Error Bound

A generalization error bound gives an upper bound on a model's test error, given its training error. With dramatically fewer training samples, the tail classes have much higher generalization bounds than the head classes, which makes high classification accuracy on tail classes unlikely. In this section, we show that optimizing Eqn. 4 is equivalent to minimizing the generalization upper bound.

Margin theory provides a bound based on the margins (Kakade et al., 2009). Margin bounds usually negatively correlate to the magnitude of the margin, i.e., larger margin leads to lower generalization error. Consequently, given a constraint on the sum of margins of all classes, there would be a trade-off between minority classes and majority classes (Cao et al., 2019).

Locating such an optimal margin for multi-class classification is non-trivial. The bound investigated in (Cao et al., 2019) was established for binary classification using hinge loss. Here, we try to develop the margin bound for the multi-class Softmax regression. Given the previously defined $\phi$ and $\hat{\phi}$, we derive $\hat{l}(\theta)$ by minimizing the margin bound. Margin bound commonly bounds the 0-1 error:

$$err_{0,1} = \Pr\left[\theta_y^T f(x) < \max_{i \neq y} \theta_i^T f(x)\right]. \tag{5}$$

However, directly using the 0-1 error as the loss function is not ideal for optimization. Instead, negative log likelihood (NLL) is generally considered more suitable. With continuous relaxation of Eqn. 5, we have

$$err(t) = \Pr\left[t < \log\Big(1 + \sum_{i \neq y} e^{\theta_i^T f(x) - \theta_y^T f(x)}\Big)\right] = \Pr\left[l_y(\theta) > t\right], \tag{6}$$

where $t \geq 0$ is any threshold, and $l_y(\theta)$ is a standard negative log-likelihood with Softmax. This new error is still a counter, but describes how likely the test loss will be larger than a given threshold. Naturally, we define our margin for class $j$ to be

$$\gamma_j = t - \max_{(x, y) \in S_j} l_j(\theta), \tag{7}$$

where $S_j$ is the set of all class $j$ samples. If we force a large margin $\gamma_j$ during training, i.e., force the training loss to be much lower than $t$, then $err(t)$ will be reduced. Theorem 2 in (Kakade et al., 2009) can then be directly generalized as

Theorem 2.

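Let $t \geq 0$ be any threshold, for all $\gamma_j > 0$, with probability at least $1 - \delta$, we have

$$err_{bal}(t) \lesssim \frac{1}{k} \sum_{j=1}^{k} \left( \frac{1}{\gamma_j} \sqrt{\frac{C}{n_j}} + \frac{\log n_j}{\sqrt{n_j}} \right); \quad \gamma_j^* = \frac{\beta\, n_j^{-1/4}}{\sum_{i=1}^{k} n_i^{-1/4}}, \tag{8}$$

where $err_{bal}(t)$ is the error on the balanced test set, $\lesssim$ is used to hide constant terms and $C$ is some measure on complexity. With a constraint on $\sum_{j=1}^{k} \gamma_j = \beta$, Cauchy-Schwarz inequality gives us the optimal $\gamma_j^*$.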
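One way to see where the optimal margins come from is the short derivation below. It is a sketch under the assumption that only the margin-dependent term of the bound, $\sum_j \frac{1}{\gamma_j}\sqrt{C/n_j}$, is minimized subject to $\sum_j \gamma_j = \beta$; the shorthand $a_j$ is introduced here for convenience and does not appear in the paper.

```latex
\begin{aligned}
&\text{Let } a_j = \sqrt{C/n_j}. \text{ By the Cauchy-Schwarz inequality,}\\
&\Big(\sum_{j=1}^{k} \frac{a_j}{\gamma_j}\Big)\Big(\sum_{j=1}^{k} \gamma_j\Big)
  \;\ge\; \Big(\sum_{j=1}^{k} \sqrt{a_j}\Big)^{2},\\
&\text{with equality iff } \frac{a_j}{\gamma_j} \propto \gamma_j,
  \text{ i.e. } \gamma_j \propto \sqrt{a_j} \propto n_j^{-1/4}.\\
&\text{Imposing } \sum_{j=1}^{k} \gamma_j = \beta \text{ gives }
  \gamma_j^{*} = \frac{\beta\, n_j^{-1/4}}{\sum_{i=1}^{k} n_i^{-1/4}}.
\end{aligned}
```

Since the right-hand side of the inequality does not depend on the margins, the left-hand sum is smallest exactly when equality holds, so classes with fewer samples receive larger margins.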

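The optimal $\gamma^*$ suggests that we need larger $\gamma$ for the classes with fewer samples. In other words, to achieve the optimal generalization ability, we need to focus on minimizing the training loss of the tail classes. Therefore, for each class $j$, the desired training loss $\hat{l}_j^*(\theta)$ is

$$\hat{l}_j^*(\theta) = l_j(\theta) + \gamma_j^*, \quad \text{where} \quad l_j(\theta) = -\log(\phi_j). \tag{9}$$

Corollary 2.1.

$\hat{l}_j^*(\theta) = l_j(\theta) + \gamma_j^*$ can be approximated by $\hat{l}_j(\theta)$ when:

$$\hat{l}_j(\theta) = -\log(\hat{\phi}_j); \quad \hat{\phi}_j = \frac{e^{\eta_j - \log \gamma_j^*}}{\sum_{i=1}^{k} e^{\eta_i - \log \gamma_i^*}} = \frac{n_j^{1/4} e^{\eta_j}}{\sum_{i=1}^{k} n_i^{1/4} e^{\eta_i}}. \tag{10}$$

We provide a sketch of proof to the corollary in the supplementary materials. Notice that compared to Eqn. 4, we have an additional constant $1/4$ in the exponent of $n_j$. We empirically find that setting $1/4$ to $1$ leads to the optimal results, which may suggest that Eqn. 8 is not necessarily tight. To this point, the label distribution shift and generalization bound of multi-class Softmax regression lead us to the same loss form: Eqn. 4.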
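To make the relation between the two loss forms concrete, the sketch below writes both as one function with an exponent on the class counts: under our reading, an exponent of 1/4 corresponds to the bound-derived variant and an exponent of 1 to Eqn. 4. The `exponent` parameter, the function names and the example numbers are assumptions made for illustration only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def scaled_balanced_softmax_loss(logits, label, class_counts, exponent):
    """-log( n_y^e * exp(eta_y) / sum_i n_i^e * exp(eta_i) ) for exponent e.

    exponent = 0    -> standard Softmax cross-entropy
    exponent = 0.25 -> the n^(1/4) variant (assumed reading of the corollary)
    exponent = 1    -> Balanced Softmax
    """
    shifted = [eta + exponent * math.log(n)
               for eta, n in zip(logits, class_counts)]
    return -log_softmax(shifted)[label]

logits = [2.0, 0.5, -1.0]      # hypothetical model outputs
counts = [5000, 500, 50]       # hypothetical per-class training counts
tail = 2                       # index of the rarest class

for e in (0.0, 0.25, 1.0):
    print(e, scaled_balanced_softmax_loss(logits, tail, counts, e))
```

For the rarest class the loss grows with the exponent, so the exponent controls how strongly training is pushed toward fitting tail samples.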

3.2 Meta Sampler

Although the Balanced Softmax theoretically optimizes the generalization error bound, given larger datasets with extremely imbalanced data distribution, the optimization is still challenging.

The class-balanced sampler (CBS) is used to tweak the mini-batch sampling process and fine-tune the classifier in the decoupled training setup (Kang et al., 2020; Zhou et al., 2019). It potentially helps to simplify the optimization landscape by choosing class-balanced samples. However, in our experiments, we found that naively combining CBS with Balanced Softmax worsens the performance.

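We first theoretically analyze the cause of the performance drop. When the linear classifier's weight $\theta_j$ for class $j$ has converged, i.e., $\sum_{s=1}^{B} \frac{\partial L^{(s)}}{\partial \theta_j} = 0$, we have:

$$\sum_{s=1}^{B} \frac{\partial L^{(s)}}{\partial \theta_j} = \sum_{s=1}^{B/k} f(x_{y=j}^{(s)}) \big(1 - \hat{\phi}_j^{(s)}\big) - \sum_{i \neq j}^{k} \sum_{s=1}^{B/k} f(x_{y=i}^{(s)})\, \hat{\phi}_j^{(s)} = 0, \tag{11}$$

where $B$ is the batch size and $k$ is the number of classes. Samples per class have been ensured to be $B/k$ by CBS. When the classification loss converges to 0, the conditional probability of the correct class $\hat{\phi}_y$ is expected to be close to 1. For any positive sample $x^+$ and negative sample $x^-$ of class $j$, we have $\hat{\phi}_j(x^+) \approx \phi_j(x^+)$ and $\hat{\phi}_j(x^-) \approx \frac{n_j}{n_i} \phi_j(x^-)$, when $\hat{\phi}_y \to 1$. Eqn. 11 can be rewritten as

$$\frac{1}{n_j^2}\, \mathbb{E}_{(x^+, y=j) \sim D_{train}} \big[f(x^+)(1 - \phi_j)\big] - \sum_{i \neq j}^{k} \frac{1}{n_i^2}\, \mathbb{E}_{(x^-, y=i) \sim D_{train}} \big[f(x^-)\, \phi_j\big] \approx 0, \tag{12}$$

where $D_{train}$ is the training set. The formal derivation of Eqn. 12 is in the supplementary materials.

Compared to the inverse loss weight, i.e., $1/n_j$ for class $j$, combining Balanced Softmax with CBS leads to the over-balance problem, i.e., $1/n_j^2$ for class $j$, which deviates from the optimal distribution.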
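The size of the over-balance effect can be illustrated numerically. The sketch below assumes, as our reading of the comparison above, that plain inverse weighting scales the contribution of class $j$ as $1/n_j$ while Balanced Softmax combined with CBS scales it as $1/n_j^2$; the class counts and the helper name are hypothetical.

```python
def normalized_weights(class_counts, power):
    """Per-class weights proportional to n_j^(-power), scaled to sum to 1."""
    raw = [n ** -power for n in class_counts]
    total = sum(raw)
    return [r / total for r in raw]

counts = [5000, 500, 50]                    # hypothetical head-to-tail class sizes

inverse = normalized_weights(counts, 1)     # 1 / n_j
squared = normalized_weights(counts, 2)     # 1 / n_j^2 (over-balanced)

# Ratio of tail-class weight to head-class weight under each scheme.
print(inverse[-1] / inverse[0])
print(squared[-1] / squared[0])
```

With a head-to-tail count ratio of 100, the squared scheme emphasizes the tail class roughly 100 times more than inverse weighting already does, which matches the intuition of over-correcting for imbalance.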

Meta Sampler

To simplify the gradient descent process with CBS, we introduce a learnable version of it based on meta-learning, which is named Meta Sampler. We first define the empirical loss by sampling from dataset $D$ as $L_D(\theta) = \mathbb{E}_{(x, y) \sim D}[l(\theta)]$ for standard Softmax, and $\hat{L}_D(\theta) = \mathbb{E}_{(x, y) \sim D}[\hat{l}(\theta)]$ for Balanced Softmax, where $\hat{l}(\theta)$ is defined previously in Eqn. 4.

To estimate the optimal sample rates for different classes, we adopt a bi-level meta-learning strategy: we update the parameter $\psi$ of sample distribution $\pi_\psi$ in the inner loop and update the classifier parameters $\theta$ in the outer loop,

$$\pi_\psi^* = \arg\min_{\psi} L_{D_{meta}}\big(\theta^*(\pi_\psi)\big) \quad \text{s.t.} \quad \theta^*(\pi_\psi) = \arg\min_{\theta} \hat{L}_{D_{q(x, y; \pi_\psi)}}(\theta), \tag{13}$$

where $\pi_\psi^j = p(y = j; \psi)$ is the sample rate for class $j$, $D_{q(x, y; \pi_\psi)}$ is the training set with class sample distribution $\pi_\psi$, and $D_{meta}$ is a meta set we introduce to supervise the outer loop optimization. We create the meta set by class-balanced sampling from the training set $D_{train}$. Empirically, we found it sufficient for inner loop optimization. An intuition for this bi-level optimization strategy is that we want to learn the best sample distribution parameter $\psi$ such that the network, parameterized by $\theta$, outputs the best performance on the meta dataset $D_{meta}$ when trained by samples from $\pi_\psi$.

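We first compute the per-instance sample rate $\rho_i = \pi_\psi^{c(i)} / \sum_{i=1}^{n} \pi_\psi^{c(i)}$, where $c(i)$ denotes the label class for instance $i$ and $n$ is the total number of training samples, and sample a training batch $B_\psi$ from a parameterized multinomial distribution $\rho$. Then we optimize the model in a meta-learning setup by

  1. sampling a mini-batch $B_\psi$ given distribution $\pi_\psi$ and performing one step of gradient descent to get a surrogate model parameterized by $\tilde{\theta}$, via $\tilde{\theta} \leftarrow \theta - \nabla_\theta \hat{L}_{B_\psi}(\theta)$;

  2. computing $L_{D_{meta}}(\tilde{\theta})$ of the surrogate model on the meta dataset $D_{meta}$ and optimizing the sample distribution parameter by $\psi \leftarrow \psi - \nabla_\psi L_{D_{meta}}(\tilde{\theta})$, using cross-entropy loss with standard Softmax;

  3. updating the model parameter $\theta \leftarrow \theta - \nabla_\theta \hat{L}_{B_\psi}(\theta)$ with Balanced Softmax.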
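The sampling part of step 1 can be sketched in plain Python. The example assumes that the class-level rates are a Softmax over learnable per-class parameters (an assumed parameterization, named `psi` here) and that each instance inherits the rate of its class before normalization; all function names and the toy label list are hypothetical. The gradient updates of steps 1 to 3 need an automatic differentiation framework and are not covered.

```python
import math
import random

def class_sample_rates(psi):
    """Class-level rates as a Softmax over psi (an assumed parameterization)."""
    m = max(psi)
    exps = [math.exp(p - m) for p in psi]
    z = sum(exps)
    return [e / z for e in exps]

def instance_sample_rates(labels, pi):
    """Per-instance rate: pi[c(i)] normalized over all training instances."""
    raw = [pi[c] for c in labels]
    total = sum(raw)
    return [r / total for r in raw]

def class_mass(labels, rho, num_classes):
    """Total sampling probability that falls on each class."""
    mass = [0.0] * num_classes
    for c, r in zip(labels, rho):
        mass[c] += r
    return mass

labels = [0] * 90 + [1] * 9 + [2] * 1       # toy long-tailed label list
counts = [90, 9, 1]

uniform_pi = class_sample_rates([0.0, 0.0, 0.0])
inverse_pi = class_sample_rates([-math.log(n) for n in counts])

rho = instance_sample_rates(labels, inverse_pi)
rng = random.Random(0)
batch = rng.choices(range(len(labels)), weights=rho, k=8)  # instance indices

print(class_mass(labels, instance_sample_rates(labels, uniform_pi), 3))
print(class_mass(labels, rho, 3))
print(batch)
```

Under this formula equal class rates reproduce instance-balanced sampling, because each class is still represented in proportion to its size, while rates proportional to $1/n_j$ reproduce class-balanced sampling. A learned distribution can sit anywhere between or beyond these two.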

However, sampling from a discrete distribution is not differentiable by nature. To allow end-to-end training for the sampling process, when forming the mini-batch $B_\psi$, we apply the Gumbel-Softmax reparameterization trick (Jang et al., 2017). A detailed explanation can be found in the supplementary materials.
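For readers unfamiliar with the trick, a minimal standard-library sketch of a single Gumbel-Softmax draw is given below. It follows the usual formulation, softmax((log p_i + g_i) / temperature) with g_i drawn from Gumbel(0, 1); the function name, the temperature value and the probabilities are illustrative assumptions, and how the paper wires this into batch construction is described only in its supplementary materials.

```python
import math
import random

def gumbel_softmax_sample(probs, temperature, rng):
    """Relaxed one-hot sample: softmax((log p_i + g_i) / temperature)."""
    scores = []
    for p in probs:
        u = rng.random()                        # uniform in [0, 1)
        u = min(max(u, 1e-12), 1.0 - 1e-12)     # keep both logarithms finite
        g = -math.log(-math.log(u))             # Gumbel(0, 1) noise
        scores.append((math.log(p) + g) / temperature)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

rng = random.Random(0)
probs = [0.7, 0.2, 0.1]                         # hypothetical sample rates
print(gumbel_softmax_sample(probs, 0.5, rng))
```

The output is a point on the probability simplex rather than a hard index, so gradients can flow back to the probabilities in an autodiff framework. The position of the largest entry is distributed according to `probs` (the Gumbel-max property), and lower temperatures push the vector closer to one-hot.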

If we apply Meta Sampler with standard cross-entropy loss with Softmax, the convergence would be relatively slow when the class-imbalance factor is high. In addition, it might overfit the tail class and underfit the head class. In BALMS, the Balanced Softmax alleviates this issue because it naturally balances the distribution, as shown in Eqn. 12. Thus, the Meta Sampler could output a relatively more balanced sample distribution. We provide additional discussions in the experiments.

4 Experiments

4.1 Experimental Setup

Datasets. We perform experiments on long-tailed image classification datasets, including CIFAR-10-LT (Krizhevsky, 2009), CIFAR-100-LT (Krizhevsky, 2009), ImageNet-LT (Liu et al., 2019) and Places-LT (Zhou et al., 2017) and one long-tailed instance segmentation dataset, LVIS (Gupta et al., 2019). We define the imbalance factor of a dataset as the number of training instances in the largest class divided by that of the smallest. Details of datasets are in Table 1.

| Dataset | #Classes | Imbalance Factor |
|---|---|---|
| CIFAR-10-LT (Krizhevsky, 2009) | 10 | 10-200 |
| CIFAR-100-LT (Krizhevsky, 2009) | 100 | 10-200 |
| ImageNet-LT (Liu et al., 2019) | 1,000 | 256 |
| Places-LT (Zhou et al., 2017) | 365 | 996 |
| LVIS (Gupta et al., 2019) | 1,230 | 26,148 |

Table 1: Details of long-tailed datasets. For both CIFAR-10 and CIFAR-100, we report results with different imbalance factors.
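The imbalance factor defined above is easy to compute from per-class counts. The sketch below also builds a long-tailed class-size profile with exponential decay, a common way of constructing CIFAR-LT style splits; whether the paper uses exactly this profile is not stated in this section, so the generator, its parameter names and the example sizes should be read as assumptions.

```python
def long_tailed_counts(n_max, num_classes, imbalance_factor):
    """Exponentially decaying class sizes from n_max down to n_max / imbalance_factor."""
    counts = []
    for j in range(num_classes):
        frac = j / (num_classes - 1)            # 0 for the head, 1 for the tail
        counts.append(max(1, round(n_max * imbalance_factor ** -frac)))
    return counts

def imbalance_factor(counts):
    """Largest class size divided by smallest class size."""
    return max(counts) / min(counts)

counts = long_tailed_counts(n_max=5000, num_classes=10, imbalance_factor=200)
print(counts)
print(imbalance_factor(counts))
```

Here `n_max=5000` matches the per-class size of the original CIFAR-10 training set, and the factor of 200 corresponds to the most imbalanced CIFAR setting in Table 1.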

Evaluation Setup. For classification tasks, after training on the long-tailed dataset, we evaluate the models on the corresponding balanced test/validation dataset and report top-1 accuracy. We also report accuracy on three splits of the set of classes: Many-shot (more than 100 images), Medium-shot (20 to 100 images), and Few-shot (less than 20 images). Since results on small datasets, i.e., CIFAR-10/100-LT, tend to show large variances, we report the mean and standard error over 3 repeated experiments. We show details of long-tailed dataset generation in the supplementary materials. For LVIS, we use the official training and testing splits. Average Precision (AP) in COCO style (Lin et al., 2014) for both bounding box and instance mask is reported. Our implementation details can be found in the supplementary materials.
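The three-split evaluation can be sketched as follows, using the thresholds quoted above (more than 100, 20 to 100, fewer than 20 training images). The function names and the toy numbers are hypothetical; accuracy is pooled over the test samples of the classes in each split, which on a class-balanced test set equals the mean per-class accuracy.

```python
def shot_split(train_count):
    """Assign a class to a split based on its number of training images."""
    if train_count > 100:
        return "many"
    if train_count >= 20:
        return "medium"
    return "few"

def split_accuracy(train_counts, correct, total):
    """Top-1 accuracy per split and overall, from per-class test statistics."""
    sums = {}
    for n, c, t in zip(train_counts, correct, total):
        for key in (shot_split(n), "overall"):
            hit, seen = sums.get(key, (0, 0))
            sums[key] = (hit + c, seen + t)
    return {key: hit / seen for key, (hit, seen) in sums.items()}

train_counts = [500, 60, 20, 5]     # toy per-class training set sizes
correct = [45, 30, 20, 10]          # toy per-class correct predictions
total = [50, 50, 50, 50]            # balanced test set: 50 images per class

print(split_accuracy(train_counts, correct, total))
```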

Figure 1: Experiment on CIFAR-100-LT, with one panel each for Imbalance Factor = 10, Imbalance Factor = 100 and Imbalance Factor = 200. The x-axis is the class labels with decreasing training samples and the y-axis is the marginal likelihood $p(y)$. Balanced Softmax is more stable under a high imbalance factor compared to the Softmax baseline and the SOTA method, Equalization Loss (EQL).

4.2 Long-Tailed Image Classification

We present the results for long-tailed image classification in Table 2 and Table 3. On all datasets, BALMS achieves SOTA performance compared with all end-to-end training and decoupled training methods. In particular, we notice that BALMS demonstrates a clear advantage under two cases: 1) When the imbalance factor is high. For example, on CIFAR-10 with an imbalance factor of 200, BALMS is higher than the SOTA method, LWS (Kang et al., 2020), by 3.4%. 2) When the dataset is large. BALMS achieves comparable performance with cRT on ImageNet-LT, which is a relatively small dataset, but it significantly outperforms cRT on a larger dataset, Places-LT.

In addition, we study the robustness of the proposed Balanced Softmax compared to standard Softmax and a SOTA loss function for long-tailed problems, EQL (Tan et al., 2020). We visualize the marginal likelihood $p(y)$ obtained with each loss under different imbalance factors in Fig. 1. Balanced Softmax clearly gives a smoother and more balanced likelihood under different imbalance factors.

| Method | CIFAR-10-LT, IF = 200 | CIFAR-10-LT, IF = 100 | CIFAR-10-LT, IF = 10 | CIFAR-100-LT, IF = 200 | CIFAR-100-LT, IF = 100 | CIFAR-100-LT, IF = 10 |
|---|---|---|---|---|---|---|
| End-to-end training | | | | | | |
| Softmax | 71.2 ± 0.3 | 77.4 ± 0.8 | 90.0 ± 0.2 | 41.0 ± 0.3 | 45.3 ± 0.3 | 61.9 ± 0.1 |
| CBW | 72.5 ± 0.2 | 78.6 ± 0.6 | 90.1 ± 0.2 | 36.7 ± 0.2 | 42.3 ± 0.8 | 61.4 ± 0.3 |
| CBS | 68.3 ± 0.3 | 77.8 ± 2.2 | 90.2 ± 0.2 | 37.8 ± 0.1 | 42.6 ± 0.4 | 61.2 ± 0.3 |
| Focal Loss (Lin et al., 2017) | 71.8 ± 2.1 | 77.1 ± 0.2 | 90.3 ± 0.2 | 40.2 ± 0.5 | 43.8 ± 0.1 | 60.0 ± 0.6 |
| Class Balance Loss (Cui et al., 2019) | 72.6 ± 1.8 | 78.2 ± 1.1 | 89.9 ± 0.3 | 39.9 ± 0.1 | 44.6 ± 0.4 | 59.8 ± 1.1 |
| LDAM (Cao et al., 2019) | 71.2 ± 0.3 | 77.2 ± 0.2 | 90.2 ± 0.3 | 41.0 ± 0.3 | 45.4 ± 0.1 | 62.0 ± 0.3 |
| Equalization Loss (Tan et al., 2020) | 72.8 ± 0.2 | 76.7 ± 0.1 | 89.9 ± 0.3 | 43.3 ± 0.1 | 47.3 ± 0.1 | 59.7 ± 0.3 |
| Decoupled training | | | | | | |
| cRT (Kang et al., 2020) | 76.6 ± 0.2 | 82.0 ± 0.2 | 91.0 ± 0.0 | 44.5 ± 0.1 | 50.0 ± 0.2 | 63.3 ± 0.1 |
| LWS (Kang et al., 2020) | 78.1 ± 0.0 | 83.7 ± 0.0 | 91.1 ± 0.0 | 45.3 ± 0.1 | 50.5 ± 0.1 | 63.4 ± 0.1 |
| BALMS | 81.5 ± 0.0 | 84.9 ± 0.1 | 91.3 ± 0.1 | 45.5 ± 0.1 | 50.8 ± 0.0 | 63.0 ± 0.1 |

Table 2: Top-1 accuracy for CIFAR-10/100-LT (IF: imbalance factor). Softmax denotes the standard cross-entropy loss with Softmax, CBW denotes class-balanced weighting and CBS denotes class-balanced sampling. BALMS generally outperforms SOTA methods, especially when the imbalance factor is high. Note that for other methods, we reproduce higher accuracy than reported in the original papers.

| Method | ImageNet-LT Many | ImageNet-LT Medium | ImageNet-LT Few | ImageNet-LT Overall | Places-LT Many | Places-LT Medium | Places-LT Few | Places-LT Overall |
|---|---|---|---|---|---|---|---|---|
| End-to-end training | | | | | | | | |
| Lifted Loss (Song et al., 2016) | 35.8 | 30.4 | 17.9 | 30.8 | 41.1 | 35.4 | 24 | 35.2 |
| Focal Loss (Lin et al., 2017) | 36.4 | 29.9 | 16 | 30.5 | 41.1 | 34.8 | 22.4 | 34.6 |
| Range Loss (Zhang et al., 2017) | 35.8 | 30.3 | 17.6 | 30.7 | 41.1 | 35.4 | 23.2 | 35.1 |
| OLTR (Liu et al., 2019) | 43.2 | 35.1 | 18.5 | 35.6 | 44.7 | 37.0 | 25.3 | 35.9 |
| Equalization Loss (Tan et al., 2020) | - | - | - | 36.4 | - | - | - | - |
| Decoupled training | | | | | | | | |
| cRT (Kang et al., 2020) | - | - | - | 41.8 | 42.0 | 37.6 | 24.9 | 36.7 |
| LWS (Kang et al., 2020) | - | - | - | 41.4 | 40.6 | 39.1 | 28.6 | 37.6 |
| BALMS | 50.3 | 39.5 | 25.3 | 41.8 | 41.2 | 39.8 | 31.6 | 38.7 |

Table 3: Top-1 accuracy on ImageNet-LT and Places-LT. We present results with ResNet-10 (Liu et al., 2019) for ImageNet-LT and ImageNet pre-trained ResNet-152 for Places-LT. Baseline results are taken from the original papers. BALMS generally outperforms the SOTA models.

4.3 Long-Tailed Instance Segmentation

The LVIS dataset is one of the most challenging datasets in the vision community. As shown in Table 1, the dataset has a much higher imbalance factor than the rest (26,148 vs. less than 1,000) and contains many very few-shot classes. Compared to the image classification datasets, which are relatively small and have lower imbalance factors, the LVIS dataset gives a more reliable evaluation of the performance of long-tailed learning methods.

Since one image might contain multiple instances from several categories, we use Meta Reweighter instead of Meta Sampler here. As shown in Table 4, BALMS achieves the best results among all the approaches and outperforms others by a large margin, especially on rare classes, where BALMS achieves an average precision of 19.6 while the best of the rest is 14.6. The results suggest that with the Balanced Softmax function and learnable Meta Reweighter, BALMS is able to give more balanced gradients and tackle extremely imbalanced long-tailed tasks.

In particular, LVIS is composed of images of complex daily scenes with natural long-tailed categories. To this end, we believe BALMS is applicable to real-world long-tailed visual recognition challenges.

| Method | $AP_m$ | $AP_f$ | $AP_c$ | $AP_r$ | $AP_b$ |
|---|---|---|---|---|---|
| Softmax | 23.7 | 27.3 | 24.0 | 13.6 | 24.0 |
| Sigmoid | 23.6 | 27.3 | 24.0 | 12.7 | 24.0 |
| Focal Loss (Lin et al., 2017) | 23.4 | 27.5 | 23.5 | 12.8 | 23.8 |
| Class Balance Loss (Cui et al., 2019) | 23.3 | 27.3 | 23.8 | 11.4 | 23.9 |
| LDAM (Cao et al., 2019) | 24.1 | 26.3 | 25.3 | 14.6 | 24.5 |
| LWS (Kang et al., 2020) | 23.8 | 26.8 | 24.4 | 14.4 | 24.1 |
| Equalization Loss (Tan et al., 2020) | 25.2 | 26.6 | 27.3 | 14.6 | 25.7 |
| BALMS | 27.0 | 27.5 | 28.9 | 19.6 | 27.6 |

Table 4: Results for the LVIS dataset. $AP_m$ denotes Average Precision of masks. $AP_b$ denotes Average Precision of bounding boxes. $AP_f$, $AP_c$ and $AP_r$ denote Average Precision of masks on frequent classes, common classes and rare classes. BALMS significantly outperforms SOTA models given the high imbalance factor in LVIS. Other methods are reproduced with higher AP than reported if given.

| Method | CIFAR-10-LT, IF = 200 | CIFAR-10-LT, IF = 100 | CIFAR-10-LT, IF = 10 | CIFAR-100-LT, IF = 200 | CIFAR-100-LT, IF = 100 | CIFAR-100-LT, IF = 10 |
|---|---|---|---|---|---|---|
| End-to-end training | | | | | | |
| (1) Softmax | 71.2 ± 0.3 | 77.4 ± 0.8 | 90.0 ± 0.2 | 41.0 ± 0.3 | 45.3 ± 0.3 | 61.9 ± 0.1 |
| (2) Balanced Softmax$^{1/4}$ | 71.6 ± 0.7 | 78.4 ± 0.9 | 90.5 ± 0.1 | 41.9 ± 0.2 | 46.4 ± 0.7 | 62.6 ± 0.3 |
| (3) Balanced Softmax | 79.0 ± 0.8 | 83.1 ± 0.4 | 90.9 ± 0.4 | 45.9 ± 0.3 | 50.3 ± 0.3 | 63.1 ± 0.2 |
| Decoupled training | | | | | | |
| (4) Balanced Softmax$^{1/4}$ + DT | 72.2 ± 0.1 | 79.1 ± 0.2 | 90.2 ± 0.0 | 42.3 ± 0.0 | 46.1 ± 0.1 | 62.5 ± 0.1 |
| (5) Balanced Softmax$^{1/4}$ + DT + MS | 76.2 ± 0.4 | 81.4 ± 0.1 | 91.0 ± 0.1 | 44.1 ± 0.2 | 49.2 ± 0.1 | 62.8 ± 0.2 |
| (6) Balanced Softmax + DT | 78.6 ± 0.1 | 83.7 ± 0.1 | 91.2 ± 0.0 | 45.1 ± 0.0 | 50.4 ± 0.0 | 63.4 ± 0.0 |
| (7) Balanced Softmax + CBS + DT | 80.6 ± 0.1 | 84.8 ± 0.0 | 91.2 ± 0.1 | 42.0 ± 0.0 | 47.4 ± 0.2 | 62.3 ± 0.0 |
| (8) DT + MS | 73.6 ± 0.2 | 79.9 ± 0.4 | 90.9 ± 0.1 | 44.2 ± 0.1 | 49.2 ± 0.1 | 63.0 ± 0.0 |
| (9) Balanced Softmax + DT + MR | 79.2 ± 0.0 | 84.1 ± 0.0 | 91.2 ± 0.1 | 45.3 ± 0.3 | 50.8 ± 0.0 | 63.5 ± 0.1 |
| (10) BALMS | 81.5 ± 0.0 | 84.9 ± 0.1 | 91.3 ± 0.1 | 45.5 ± 0.1 | 50.8 ± 0.0 | 63.0 ± 0.1 |

Table 5: Component analysis on CIFAR-10/100-LT (IF: imbalance factor). DT: decoupled training. CBS: class-balanced sampling. MS: Meta Sampler. MR: Meta Reweighter. Balanced Softmax$^{1/4}$: the loss variant in Eqn. 10. Balanced Softmax and Meta Sampler both contribute to the final performance.

4.4 Component Analysis

We conduct an extensive component analysis on CIFAR-10/100-LT dataset to further understand the effect of each proposed component of BALMS. The results are presented in Table 5.

Balanced Softmax. Comparing (1), (2) with (3), and (5), (8) with (10), we observe that Balanced Softmax gives a clear improvement to the overall performance, under both end-to-end training and decoupled training setups. It successfully accommodates the distribution shift between training and testing. In particular, we observe that Balanced Softmax$^{1/4}$, which we derive in Eqn. 10, cannot yield ideal results compared to our proposed Balanced Softmax in Eqn. 4.

Meta Sampler. From (6), (7), (9) and (10), we observe that Meta Sampler generally improves performance compared with no Meta Sampler and with variants of Meta Sampler. We notice that the performance gain is larger with a higher imbalance factor, which is consistent with our observation in the LVIS experiments. In (9) and (10), Meta Sampler generally outperforms Meta Reweighter, which suggests that the discrete sampling process gives a more efficient optimization process. Comparing (7) and (10), we can see that Meta Sampler addresses the over-balancing issue discussed in Section 3.2.

Decoupled Training. Comparing (2) with (4) and (3) with (6), we see that the decoupled training scheme and Balanced Softmax are two orthogonal components, and we can benefit from both at the same time.

5 Conclusion

We have introduced BALMS for long-tailed visual recognition tasks. BALMS tackles the distribution shift between training and testing by combining meta-learning with generalization error bound theory: it optimizes a Balanced Softmax function, which theoretically minimizes the generalization error bound, and it improves the optimization on large long-tailed datasets by learning an effective Meta Sampler. BALMS generally outperforms SOTA methods on 4 image classification datasets and 1 instance segmentation dataset by a large margin, especially when the imbalance factor is high.

However, Meta Sampler is computationally expensive in practice and the optimization on large datasets is slow. In addition, the Balanced Softmax function only approximately guarantees a generalization error bound. Future work may extend current framework to a wider range of tasks, e.g., machine translation, and correspondingly design tighter bounds and computationally efficient meta-learning algorithms.

Broader Impact

Due to the Zipfian distribution of categories in real life, algorithms and models with exceptional performance on research benchmarks may not remain powerful in the real world. BALMS, as a light-weight method, only adds minimal computational cost during training and is compatible with most of the existing works for visual recognition. As a result, BALMS could help bridge the gap between research benchmarks and industrial applications for visual recognition.

However, there can be some potential negative effects. As BALMS empowers deep classifiers with stronger recognition capability on long-tailed distributions, the application of such a classification algorithm can be further extended to more real-life scenarios. We should be cautious about the misuse of the proposed method. Depending on the scenario, it might have a negative effect on democratic privacy, e.g., in person re-identification (ReID), detection, etc.


  • R. Barandela, E. Rangel, J. S. Sanchez, and F. J. Ferri (2009) Restricted decontamination for the imbalanced training sample problem.

    Iberoamerican Congress on Pattern Recognition

    21 (9), pp. 1263–1284.
    Cited by: §2.
  • M. Buda, A. Maki, and M. A. Mazurowski (2018)

    A systematic study of the class imbalance problem in convolutional neural networks

    Neural Networks 106, pp. 249–259. Cited by: §2.
  • J. Byrd and Z. C. Lipton (2018)

    What is the effect of importance weighting in deep learning?

    arXiv preprint arXiv:1812.03372. Cited by: §1, §2.
  • K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma (2019) Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 1567–1578. Cited by: A.3 Proof to Theorem 2, §1, §2, §3.1, §3.1, Table 2, Table 4.
  • N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique.

    Journal of Artificial Intelligence Research

    16, pp. 321–357.
    Cited by: §2.
  • Y. Cui, M. Jia, T. Lin, Y. Song, and S. J. Belongie (2019) Class-balanced loss based on effective number of samples.

    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    , pp. 9260–9269.
    Cited by: C.3 Training Details, D.2 Long-tailed Datasets Generation, Table 2, Table 4.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: D.2 Long-tailed Datasets Generation.
  • E. Grefenstette, B. Amos, D. Yarats, P. M. Htut, A. Molchanov, F. Meier, D. Kiela, K. Cho, and S. Chintala (2019) Generalized inner loop meta-learning. arXiv preprint arXiv:1910.01727. Cited by: C.2 Software.
  • A. Gupta, P. Dollar, and R. Girshick (2019) LVIS: a dataset for large vocabulary instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: C.3 Training Details, D.2 Long-tailed Datasets Generation, Table 6, §1, §1, §1, §2, §4.1, Table 1.
  • H. Han, W. Wang, and B. Mao (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing. Cited by: §2.
  • M. Hayat, S. Khan, S. W. Zamir, J. Shen, and L. Shao (2019) Gaussian affinity for max-margin class imbalanced learning. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • H. He and E. A. Garcia (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21 (9), pp. 1263–1284. Cited by: §2.
  • C. Huang, Y. Li, C. C. Loy, and X. Tang (2016) Learning deep representation for imbalanced classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5375–5384. Cited by: §2.
  • C. Huang, Y. Li, C. C. Loy, and X. Tang (2019) Deep imbalanced learning for face recognition and attribute prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
  • M. A. Jamal, M. Brown, M. Yang, L. Wang, and B. Gong (2020) Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. ArXiv abs/2003.10780. Cited by: §2.
  • E. Jang, S. Gu, and B. Poole (2017) Categorical reparametrization with gumbel-softmax. In Proceedings International Conference on Learning Representations 2017, Cited by: B.1 Meta Sampler, §3.2.
  • S. M. Kakade, K. Sridharan, and A. Tewari (2009) On the complexity of linear prediction: risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, pp. 793–800. Cited by: A.3 Proof to Theorem 2, A.3 Proof to Theorem 2, §3.1, §3.1.
  • B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis (2020) Decoupling representation and classifier for long-tailed recognition. International Conference on Learning Representations abs/1910.09217. Cited by: C.3 Training Details, C.3 Training Details, C.3 Training Details, C.3 Training Details, §2, §3.2, §4.2, Table 2, Table 3, Table 4.
  • S. Khan, M. Bennamoun, F. Sohel, and R. Togneri (2015) Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems PP. Cited by: §2.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations. Cited by: C.3 Training Details, C.3 Training Details.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report. Cited by: Table 6, §1, §1, §4.1, Table 1.
  • M. Kubat and S. Matwin (1997) Addressing the curse of imbalanced training sets: one-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186. Cited by: §2.
  • B. Li, Y. Liu, and X. Wang (2019) Gradient harmonized single-stage detector. In AAAI Conference on Artificial Intelligence, Cited by: §2.
  • T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2999–3007. Cited by: Table 2, Table 3, Table 4.
  • T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. ArXiv abs/1405.0312. Cited by: §4.1.
  • Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu (2019) Large-scale long-tailed recognition in an open world. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2532–2541. Cited by: D.2 Long-tailed Datasets Generation, Table 6, §1, §4.1, Table 1, Table 3.
  • I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations, Cited by: C.3 Training Details, C.3 Training Details.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 3111–3119. Cited by: §2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: C.2 Software.
  • M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In ICML, Cited by: §2.
  • L. Shen, Z. Lin, and Q. Huang (2016) Relay backpropagation for effective learning of deep convolutional neural networks. In European conference on computer vision, pp. 467–482. Cited by: §2.
  • J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng (2019) Meta-weight-net: learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems, pp. 1917–1928. Cited by: §2.
  • H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Computer Vision and Pattern Recognition (CVPR), Cited by: Table 3.
  • J. Tan, C. Wang, B. Li, Q. Li, W. Ouyang, C. Yin, and J. Yan (2020) Equalization loss for long-tailed object recognition. ArXiv abs/2003.05176. Cited by: C.3 Training Details, §2, §4.2, Table 2, Table 3, Table 4.
  • Y. Wang, D. Ramanan, and M. Hebert (2017) Learning to model the tail. In Advances in Neural Information Processing Systems, pp. 7029–7039. Cited by: §1, §2.
  • X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao (2017) Range loss for deep face recognition with long-tailed training data. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5419–5428. Cited by: Table 3.
  • B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Table 6, §1, §4.1, Table 1.
  • B. Zhou, Q. Cui, X. Wei, and Z. Chen (2019) BBN: bilateral-branch network with cumulative learning for long-tailed visual recognition. ArXiv abs/1912.02413. Cited by: §2, §3.2.

Appendix A: Proofs and Derivations

A.1 Proof to Theorem 1

Observe that the conditional probability of a categorical distribution can be parameterized as an exponential family. This gives the standard Softmax function as the inverse parameter mapping

$$\phi_j = \frac{e^{\eta_j}}{\sum_{i=1}^{k} e^{\eta_i}} \tag{14}$$


and also the canonical link function:

$$\eta_j = \log\frac{\phi_j}{\phi_k} \tag{15}$$


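We begin by adding the term $-\log(\phi_j/\hat{\phi}_j)$ to both sides of Eqn. 15,

$$\eta_j - \log\frac{\phi_j}{\hat{\phi}_j} = \log\frac{\phi_j}{\phi_k} - \log\frac{\phi_j}{\hat{\phi}_j} = \log\frac{\hat{\phi}_j}{\phi_k} \tag{16}$$

Exponentiating and rearranging gives

$$\phi_k \, e^{\eta_j - \log(\phi_j/\hat{\phi}_j)} = \hat{\phi}_j \tag{17}$$

Summing over all classes and using $\sum_{i=1}^{k}\hat{\phi}_i = 1$,

$$\phi_k \sum_{i=1}^{k} e^{\eta_i - \log(\phi_i/\hat{\phi}_i)} = 1 \tag{18}$$

$$\phi_k = \frac{1}{\sum_{i=1}^{k} e^{\eta_i - \log(\phi_i/\hat{\phi}_i)}} \tag{19}$$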




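Substituting Eqn. 19 back into Eqn. 17, we have

$$\hat{\phi}_j = \frac{e^{\eta_j - \log(\phi_j/\hat{\phi}_j)}}{\sum_{i=1}^{k} e^{\eta_i - \log(\phi_i/\hat{\phi}_i)}} \tag{20}$$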


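Recall that

$$\phi_j = p(y=j \mid x) = \frac{p(x \mid y=j)}{p(x)} \cdot \frac{1}{k}, \qquad \hat{\phi}_j = \hat{p}(y=j \mid x) = \frac{p(x \mid y=j)}{\hat{p}(x)} \cdot \frac{n_j}{n} \tag{21}$$

where $n = \sum_{i=1}^{k} n_i$ is the total number of training samples; then

$$\log\frac{\phi_j}{\hat{\phi}_j} = \log\frac{n}{k\,n_j} + \log\frac{\hat{p}(x)}{p(x)} \tag{22}$$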




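Finally, substituting Eqn. 22 back into Eqn. 20, the factors that do not depend on the class index cancel between numerator and denominator:

$$\hat{\phi}_j = \frac{e^{\eta_j - \log\frac{n}{k n_j} - \log\frac{\hat{p}(x)}{p(x)}}}{\sum_{i=1}^{k} e^{\eta_i - \log\frac{n}{k n_i} - \log\frac{\hat{p}(x)}{p(x)}}} = \frac{n_j e^{\eta_j}}{\sum_{i=1}^{k} n_i e^{\eta_i}} \tag{23}$$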
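As a numerical sanity check of the result this proof arrives at, which we take to be $\hat{\phi}_j = n_j e^{\eta_j} / \sum_i n_i e^{\eta_i}$ with $n_j$ the number of training samples in class $j$, the following minimal Python sketch computes the quantity in two ways: as a Softmax over logits shifted by $\log n_j$, and by rescaling the balanced posterior $\phi_j$ with the training prior $n_j/n$ and renormalizing. The function names, logits, and class counts are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable Softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def balanced_softmax(logits, class_counts):
    """n_j * exp(eta_j) / sum_i n_i * exp(eta_i), computed as Softmax(eta + log n)."""
    shifted = [z + math.log(n) for z, n in zip(logits, class_counts)]
    return softmax(shifted)


def reweighted_posterior(logits, class_counts):
    """Same quantity via Bayes' rule: scale the balanced posterior by the
    training prior n_j / n, then renormalize."""
    phi = softmax(logits)
    n = sum(class_counts)
    unnormalized = [p * (c / n) for p, c in zip(phi, class_counts)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


# Hypothetical model outputs and long-tailed class counts.
eta = [2.0, 0.5, -1.0]
counts = [1000, 100, 10]

phi_hat = balanced_softmax(eta, counts)
phi_hat_bayes = reweighted_posterior(eta, counts)
```

Both routes agree up to floating-point error, and with equal class counts the shift is a constant, so the expression reduces to the standard Softmax.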


A.2 Derivation for the Multiple Binary Logistic Regression variant

Definition. Multiple Binary Logistic Regression performs multi-class classification with one binary logistic regression per class. As in Softmax regression, the predicted label is the class with the maximum model output,


The only difference is that $\phi_j$ is expressed by a logistic function of $\eta_j$


and the loss function sums up binary classification loss on all classes
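To make the definition concrete, here is a minimal Python sketch of this variant under standard assumptions: each class $j$ has its own logistic output $\phi_j = 1/(1+e^{-\eta_j})$, the loss for a sample is the sum of the per-class binary cross-entropy terms, and the prediction is the class with the largest output. The function names and example logits are hypothetical, and the sketch shows only the plain (unbalanced) form, not the balanced variant derived in this section.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def softplus(z):
    """log(1 + exp(z)), computed stably."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def predict(logits):
    """Predicted label: index of the maximum model output."""
    return max(range(len(logits)), key=lambda j: logits[j])


def multi_binary_logistic_loss(logits, label):
    """Sum of binary cross-entropy terms over all classes.

    -log(sigmoid(z)) = softplus(-z) for the positive class and
    -log(1 - sigmoid(z)) = softplus(z) for every other class.
    """
    loss = 0.0
    for j, z in enumerate(logits):
        loss += softplus(-z) if j == label else softplus(z)
    return loss


# Hypothetical model outputs for a 3-class problem; the true label is 0.
logits = [2.0, -1.0, 0.5]
predicted = predict(logits)
loss = multi_binary_logistic_loss(logits, label=0)
```

Unlike Softmax, the per-class outputs here are not constrained to sum to one, which is why the derivation treats each class as a separate binary problem.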




Setup. By virtue of Bayes' rule, $\phi_j$ and $\hat{\phi}_j$ can be decomposed as


and, for $1-\phi_j$ and $1-\hat{\phi}_j$,


Derivation. Again, we introduce the exponential family parameterization and have the following link function for


Substituting the decompositions in Eqn. 28 and Eqn. 29 into the link function above gives


Simplify the above equation


Substitute the in to the equation above




Finally, we have


A.3 Proof to Theorem 2

Setup. Firstly, we define as,


where and are previously defined in the main paper. However, does not have a specific semantic meaning, as it is defined only to stay consistent with the notation of Kakade et al. [2009].

Let be the 0-1 loss on example from class


and be the 0-1 margin loss on example from class


Let denote the empirical variant of .

Proof. For any and with probability at least , for all , and , Theorem 2 in Kakade et al. [2009] directly gives us


where and denotes the empirical Rademacher complexity of function family . By applying the analysis of Cao et al. [2019] to the empirical Rademacher complexity and taking a union bound over all classes, we obtain the generalization error bound for the loss on a balanced test set




is a low-order term of . To minimize the generalization error bound Eqn. 41, we essentially need to minimize


By constraining the sum of as , we can directly apply the Cauchy-Schwarz inequality to solve for the optimal
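The constrained minimization can be checked numerically. The sketch below rests on an assumption, since the expression itself is not shown in this excerpt: we take the quantity to minimize to be $\sum_j 1/(\gamma_j \sqrt{n_j})$, with per-class margins $\gamma_j$ constrained to sum to a constant $\beta$. Under that assumption, the Cauchy-Schwarz inequality gives the lower bound $(\sum_j n_j^{-1/4})^2 / \beta$, attained when $\gamma_j \propto n_j^{-1/4}$. The class counts and function names are hypothetical.

```python
import random


def bound_term(gammas, counts):
    """Assumed objective: sum_j 1 / (gamma_j * sqrt(n_j))."""
    return sum(1.0 / (g * n ** 0.5) for g, n in zip(gammas, counts))


def optimal_margins(counts, beta):
    """Margins proportional to n_j^(-1/4), rescaled to sum to beta."""
    weights = [n ** -0.25 for n in counts]
    total = sum(weights)
    return [beta * w / total for w in weights]


# Hypothetical long-tailed class counts and margin budget.
counts = [5000, 500, 50, 5]
beta = 1.0

gamma_star = optimal_margins(counts, beta)
best = bound_term(gamma_star, counts)

# Cauchy-Schwarz lower bound on the objective for any feasible margins.
cs_lower_bound = sum(n ** -0.25 for n in counts) ** 2 / beta

# Compare against random feasible margins (positive, summing to beta).
rng = random.Random(0)
random_values = []
for _ in range(1000):
    raw = [rng.uniform(0.01, 1.0) for _ in counts]
    scale = beta / sum(raw)
    random_values.append(bound_term([r * scale for r in raw], counts))
```

In this sketch the closed-form margins attain the Cauchy-Schwarz lower bound, and no randomly drawn feasible margin vector does better; rarer classes receive larger margins.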


A.4 Proof to Corollary 2.1

Preliminary. Notice that cannot be achieved, because and implies


The equation above contradicts the definition that the sum of should be exactly equal to 1. To resolve the contradiction, we introduce a term , such that


To justify the new term , we recall the definition of error


If we tweak the threshold with the term


As is not a function of , the value of will not be affected by the tweak. Thus, instead of looking for that minimizes the generalization bound for , we are in fact looking for that minimizes the generalization bound for

Proof. In this section, we show that in the corollary is an approximation of .


where for some in between and , is close to a constant when the model converges. Although the approximation holds under some constraints, we show that it approximately minimizes the generalization bound derived in the last section.

A.5 Derivation for Eqn. 12

Gradient for positive samples: