BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition

12/05/2019 ∙ by Boyan Zhou, et al. ∙ 0

Our work focuses on tackling the challenging but natural visual recognition task of long-tailed data distribution (i.e., a few classes occupy most of the data, while most classes have very few samples). In the literature, class re-balancing strategies (e.g., re-weighting and re-sampling) are the prominent and effective methods proposed to alleviate the extreme imbalance in long-tailed problems. In this paper, we first discover that these re-balancing methods achieve satisfactory recognition accuracy because they significantly promote the classifier learning of deep networks. However, at the same time, they unexpectedly damage the representative ability of the learned deep features to some extent. Therefore, we propose a unified Bilateral-Branch Network (BBN) to take care of both representation learning and classifier learning simultaneously, where each branch performs its own duty separately. In particular, our BBN model is further equipped with a novel cumulative learning strategy, which is designed to first learn the universal patterns and then pay attention to the tail data gradually. Extensive experiments on four benchmark datasets, including the large-scale iNaturalist ones, show that the proposed BBN significantly outperforms state-of-the-art methods. Furthermore, validation experiments demonstrate both our preliminary discovery and the effectiveness of the tailored designs in BBN for long-tailed problems. Our method won first place in the iNaturalist 2019 large-scale species classification competition, and our code is open-source and available at https://github.com/Megvii-Nanjing/BBN

1 Introduction

Figure 1: Real-world large-scale datasets often display the phenomenon of long-tailed distributions. The extreme imbalance poses tremendous challenges for classification accuracy, especially for the tail classes. Class re-balancing strategies can yield better classification performance for long-tailed problems. In this paper, we reveal that the mechanism of these strategies is to significantly promote classifier learning, but that they unexpectedly damage the representative ability of the learned deep features to some extent. As conceptually demonstrated, after re-balancing, the decision boundary (i.e., black solid arc) tends to accurately classify the tail data (i.e., red squares). However, the intra-class distribution of each class becomes more separable. Quantitative results are presented in Figure 2, and more analyses can be found in the supplementary materials.
Figure 2: Top-1 error rates of different manners for representation learning and classifier learning on two long-tailed datasets, CIFAR-100-IR50 and CIFAR-10-IR50 [ldam]. “CE” (Cross-Entropy), “RW” (Re-Weighting) and “RS” (Re-Sampling) are the learning manners used. As observed, when fixing the representation (comparing error rates of three blocks in the vertical direction), the error rates of classifiers trained with RW/RS are reasonably lower than with CE. In contrast, when fixing the classifier (comparing error rates of three blocks in the horizontal direction), the representations trained with CE surprisingly get lower error rates than those trained with RW/RS. Experimental details can be found in Section 3.

With the advent of research on deep Convolutional Neural Networks (CNNs), the performance of image classification has witnessed incredible progress. The success is undoubtedly inseparable from available and high-quality large-scale datasets, e.g., ImageNet ILSVRC 2012 [imagenet], MS COCO [coco] and Places Database [zhou2017places]. In contrast with these visual recognition datasets, which exhibit roughly uniform distributions of class labels, real-world datasets always have skewed distributions with a long tail [kendall1948advanced, van2017devil], i.e., a few classes (a.k.a. head classes) occupy most of the data, while most classes (a.k.a. tail classes) have very few samples, cf. Figure 1. Moreover, more and more long-tailed datasets reflecting these realistic challenges have been constructed and released by the computer vision community in very recent years, e.g., iNaturalist [cui2018large], LVIS [Gupta2019LVIS] and RPC [wei2019rpc]. When dealing with such visual data, deep learning methods struggle to achieve outstanding recognition accuracy due to both the data-hungry limitation of deep models and the extreme class imbalance of long-tailed data distributions.

In the literature, the prominent and effective methods for handling long-tailed problems are class re-balancing strategies, which are proposed to alleviate the extreme imbalance of the training data. Generally, class re-balancing methods are roughly categorized into two groups, i.e., re-sampling [shen2016relay, buda2018systematic, japkowicz2002class, he2009learning, byrd2019effect, drummond2003c4, more2016survey, chawla2002smote] and cost-sensitive re-weighting [huang2016learning, wang2017learning, cb-focal, ren18l2rw]. These methods adjust the network training by re-sampling the examples or re-weighting the losses of examples within mini-batches, so that the training distribution is, in expectation, closer to the test distribution. Thus, class re-balancing directly influences how the classifier weights of deep networks are updated, i.e., it promotes classifier learning. That is the reason why re-balancing can achieve satisfactory recognition accuracy on long-tailed data.

Figure 3: Framework of our Bilateral-Branch Network (BBN). It consists of three key components: 1) the conventional learning branch takes input data from a uniform sampler and is responsible for learning universal patterns of the original distribution; 2) the re-balancing branch takes inputs from a reversed sampler and is designed for modeling the tail data; 3) the output feature vectors $f_c$ (conventional learning branch) and $f_r$ (re-balancing branch) are aggregated by our cumulative learning strategy for computing training losses. “GAP” is short for global average pooling.

However, although re-balancing methods yield good final predictions, we argue that these methods still have adverse effects, i.e., they also unexpectedly damage the representative ability of the learned deep features (i.e., the representation learning) to some extent. Concretely, re-sampling risks over-fitting the tail data (by over-sampling) and under-fitting the whole data distribution (by under-sampling) when data imbalance is extreme. Re-weighting distorts the original distribution by directly changing or even inverting the frequency with which the data are presented.

As a preliminary of our work, we conduct validation experiments to justify the argument above. Specifically, to figure out how re-balancing strategies work, we divide the training process of deep networks into two stages, i.e., we conduct representation learning and classifier learning separately. In the former stage for representation learning, we employ plain training (conventional cross-entropy), re-weighting and re-sampling as three learning manners to obtain their corresponding learned representations. Then, in the latter stage for classifier learning, we first fix the parameters of representation learning (i.e., backbone layers) converged in the former stage and then retrain the classifiers of these networks (i.e., fully-connected layers) from scratch, also with the three aforementioned learning manners. In Figure 2, the prediction error rates on two benchmark long-tailed datasets [ldam], i.e., CIFAR-100-IR50 and CIFAR-10-IR50, are reported. When the representation learning manner is fixed, re-balancing methods achieve lower error rates, indicating that they promote classifier learning. On the other hand, when the classifier learning manner is fixed, plain training on the original imbalanced data brings better results owing to its better features. The worse results of re-balancing methods in this setting indicate that they hurt feature learning.

Therefore, in this paper, to improve the recognition performance on long-tailed problems as fully as possible, we propose a unified Bilateral-Branch Network (BBN) model to take care of both representation learning and classifier learning simultaneously. As shown in Figure 3, our BBN model consists of two branches, termed the “conventional learning branch” and the “re-balancing branch”. In general, the two branches of BBN separately perform their own duties of representation learning and classifier learning, respectively. As the name suggests, the conventional learning branch, equipped with the typical uniform sampler w.r.t. the original data distribution, is responsible for learning universal patterns for recognition, while the re-balancing branch, coupled with a reversed sampler, is designed to model the tail data. After that, the predicted outputs of these bilateral branches are aggregated in the cumulative learning part by an adaptive trade-off parameter $\alpha$. $\alpha$ is automatically generated by the “Adaptor” according to the number of training epochs, which adjusts the whole BBN model to first learn the universal features from the original distribution and then pay attention to the tail data gradually. More importantly, $\alpha$ further controls the parameter updating of each branch, which, for example, avoids damaging the learned universal features when emphasizing the tail data in the later periods of training.

In experiments, empirical results on four benchmark long-tailed datasets show that our model obviously outperforms existing state-of-the-art methods. Moreover, extensive validation experiments and ablation studies can prove the aforementioned preliminary discovery and also validate the effectiveness of our tailored designs for long-tailed problems.

The main contributions of this paper are as follows:

  • We explore the mechanism of the prominent class re-balancing methods for long-tailed problems, and further discover that these methods can significantly promote classifier learning and meanwhile will affect the representation learning w.r.t. the original data distribution.

  • We propose a unified Bilateral-Branch Network (BBN) model to take care of both representation learning and classifier learning for exhaustively boosting long-tailed recognition. Also, a novel cumulative learning strategy is developed for adjusting the bilateral learnings and coupled with our BBN model’s training.

  • We evaluate our model on four benchmark long-tailed visual recognition datasets, and our proposed model consistently achieves superior performance over previous competing approaches.

2 Related work

Class re-balancing strategies: Re-sampling methods as one of the most important class re-balancing strategies could be divided into two types: 1) Over-sampling by simply repeating data for minority classes [shen2016relay, buda2018systematic, byrd2019effect] and 2) under-sampling by abandoning data for dominant classes [japkowicz2002class, buda2018systematic, he2009learning]. But sometimes, with re-sampling, duplicated tailed samples might lead to over-fitting upon minority classes [chawla2002smote, cb-focal], while discarding precious data will certainly impair the generalization ability of deep networks.

Re-weighting methods are another series of prominent class re-balancing strategies, which usually allocate large weights to training samples of tail classes in loss functions [huang2016learning, wang2017learning]. However, re-weighting is not capable of handling the large-scale, real-world scenarios of long-tailed data and tends to cause optimization difficulty [mikolov2013distributed]. Consequently, Cui et al. [cb-focal] proposed to adopt the effective number of samples instead of proportional frequency. Thereafter, Cao et al. [ldam] explored the margins of the training examples and designed a label-distribution-aware loss to encourage larger margins for minority classes.

In addition, recently, some two-stage fine-tuning strategies [ldam, cui2018large, ouyang2016factors] were developed to modify re-balancing for effectively handling long-tail problems. Specifically, they separated the training process into two single stages. In the first stage, they trained networks as usual on the original imbalanced data and only utilized re-balancing at the second stage to fine-tune the network with a small learning rate.

Beyond that, other methods of different learning paradigms were also proposed to deal with long-tailed problems, e.g., metric learning [zhang2017range, huang2016learning], meta-learning [Liu_2019_CVPR], and knowledge transfer learning [wang2017learning, Zhong_2019_CVPR], which are not within the scope of this paper.

Mixup: Mixup [mixup] is a general data augmentation algorithm, i.e., it convexly combines random pairs of training images and their associated labels to generate additional samples when training deep networks. Manifold mixup [manifoldmixup] conducts mixup operations on random pairs of samples in the manifold feature space for augmentation. The mixing ratios in mixup are sampled from the $\beta$-distribution to increase the randomness of augmentation. Although mixup is clearly far from our unified end-to-end trainable model, in experiments we still compare with a series of mixup algorithms to validate our effectiveness.

3 How class re-balancing strategies work?

In this section, we attempt to figure out the working mechanism of these class re-balancing methods. More concretely, we divide a deep classification model into two essential parts: 1) the feature extractor (i.e., frontal base/backbone network) and 2) the classifier (i.e., last fully-connected layers). Accordingly, the learning process of a deep classification network could be separated into representation learning and classifier learning. Since class re-balancing strategies could boost the classification accuracy by altering the training data distribution closer to test and paying more attention to the tail classes, we propose a conjecture that the way these strategies work is to promote classifier learning significantly but might damage the universal representative ability of the learned deep features due to distorting original distributions.

In order to justify our conjecture, we design a two-stage experimental fashion to separately learn representations and classifiers of deep models. Concretely, in the first stage, we train a classification network with plain training (i.e., cross-entropy) or re-balancing methods (i.e., re-weighting/re-sampling) as learning manners. Then, we obtain different kinds of feature extractors corresponding to these learning manners. When it comes to the second stage, we fix the parameters of the feature extractors learned in the former stage, and retrain classifiers from scratch with the aforementioned learning manners again. In principle, we design these experiments to fairly compare the quality of representations and classifiers learned by different manners by following the control variates method.

The CIFAR [cifar] datasets are a collection of images that are commonly used to assess computer vision approaches. Previous work [cb-focal, ldam] created long-tailed versions of CIFAR datasets with different imbalance ratios, i.e., the number of the most frequent class divided by the least frequent class, to evaluate the performance. In this section, following [ldam], we also use long-tailed CIFAR-10/CIFAR-100 as the test beds.

As shown in Figure 2, we conduct several contrast experiments to validate our conjecture on CIFAR-100-IR50 (long-tailed CIFAR-100 with imbalance ratio 50). As aforementioned, we separate the whole network into two parts: feature extractor and classifier. Then, we apply three manners to the feature learning and the classifier learning, respectively, according to our two-stage training fashion. Thus, we obtain nine groups of results, one for each combination of the following three manners: (1) Cross-Entropy (CE): We train the networks as usual on the original imbalanced data with the conventional cross-entropy loss. (2) Re-Sampling (RS): We first sample a class uniformly and then collect an example from that class by sampling with replacement. By repeating this process, a balanced mini-batch is obtained. (3) Re-Weighting (RW): We re-weight all the samples by the inverse of the sample size of their classes. The error rate is evaluated on the validation set. From Figure 2, we make observations from two perspectives:

  • Classifiers: When we apply the same representation learning manner (comparing error rates of three blocks in the vertical direction), it can be reasonably found that RW/RS always achieve lower classification error rates than CE, which owes to their re-balancing operations adjusting the classifier weights updating to match test distributions.

  • Representations: When applying the same classifier learning manner (comparing error rates of three blocks in the horizontal direction), it is a bit of surprise to see that error rates of CE blocks are consistently lower than error rates of RW/RS blocks. The findings indicate that training with CE achieves better classification results since it obtains better features. The worse results of RW/RS reveal that they lead to inferior discriminative ability of the learned deep features.

Furthermore, as shown in Figure 2 (left), by employing CE for the representation learning and RS for the classifier learning, we achieve the lowest error rate on the validation set of CIFAR-100-IR50. Additionally, to evaluate the generalization ability of the representations produced by the three manners, we use models pre-trained on CIFAR-100-IR50 as feature extractors to obtain the representations of CIFAR-10-IR50, and then perform the same classifier learning experiments as above. As shown in Figure 2 (right), the same observations hold on CIFAR-10-IR50, even though the feature extractor is trained on another long-tailed dataset.
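The RS and RW manners described in this section can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation; the function names and the toy label list are hypothetical, and the re-weighting uses exactly the "inverse of the class sample size" rule stated above without any further normalization.

```python
import random
from collections import Counter, defaultdict

def balanced_batch(labels, batch_size, rng):
    """Re-sampling (RS): pick a class uniformly, then one of its examples with replacement."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = sorted(by_class)
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(batch_size)]

def inverse_frequency_weights(labels):
    """Re-weighting (RW): per-sample loss weight = 1 / (size of the sample's class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Toy long-tailed label list (hypothetical): 90 head samples (class 0), 10 tail samples (class 1).
labels = [0] * 90 + [1] * 10
rng = random.Random(0)
batch = balanced_batch(labels, batch_size=1000, rng=rng)
tail_share = sum(labels[i] == 1 for i in batch) / len(batch)  # close to 0.5 instead of 0.1
weights = inverse_frequency_weights(labels)
```

With these weights, every class contributes the same total weight to the loss, which is the sense in which both manners move training closer to a balanced test distribution.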

4 Methodology

4.1 Overall framework

As shown in Figure 3, our BBN consists of three main components. Concretely, we design two branches for representation learning and classifier learning, termed “conventional learning branch” and “re-balancing branch”, respectively. Both branches use the same residual network structure [he2016deep] and share all the weights except for the last residual block. Let $x$ denote a training sample and $y \in \{1, 2, \ldots, C\}$ its corresponding label, where $C$ is the number of classes. For the bilateral branches, we apply uniform and reversed samplers to each of them and obtain two samples $(x_c, y_c)$ and $(x_r, y_r)$ as the input data, where $(x_c, y_c)$ is for the conventional learning branch and $(x_r, y_r)$ is for the re-balancing branch. Then, the two samples are fed into their own corresponding branch to acquire the feature vectors $f_c$ and $f_r$ by global average pooling.

Furthermore, we also design a specific cumulative learning strategy for shifting the learning “attention” between the two branches in the training phase. Concretely, by controlling the weights for $f_c$ and $f_r$ with an adaptive trade-off parameter $\alpha$, the weighted feature vectors $\alpha f_c$ and $(1-\alpha) f_r$ are sent into the classifiers $W_c$ and $W_r$, respectively, and the outputs are integrated by element-wise addition. The output logits are formulated as:

$$z = \alpha W_c^{\top} f_c + (1 - \alpha) W_r^{\top} f_r, \tag{1}$$

where $z$ is the predicted output, i.e., $z = [z_1, z_2, \ldots, z_C]^{\top}$. For each class $i \in \{1, 2, \ldots, C\}$, the softmax function calculates the probability of the class by:

$$\hat{p}_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}. \tag{2}$$

Then, we denote $E(\cdot, \cdot)$ as the cross-entropy loss function and the output probability distribution as $\hat{p} = [\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_C]^{\top}$. Thus, the weighted cross-entropy classification loss of our BBN model is:

$$\mathcal{L} = \alpha E(\hat{p}, y_c) + (1 - \alpha) E(\hat{p}, y_r), \tag{3}$$

and the whole network is end-to-end trainable.
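As a worked illustration of the aggregation described in this subsection, the sketch below mixes the two branch outputs with a trade-off weight, applies the softmax, and forms the weighted cross-entropy over the two labels. It is a plain-Python sketch under stated assumptions rather than the authors' code: the names `f_c`, `f_r`, `W_c`, `W_r`, `y_c`, `y_r` and `alpha` are illustrative labels for the two branch features, the two classifiers, the two sample labels and the trade-off parameter, and the toy dimensions (2 features, 3 classes) are arbitrary.

```python
import math

def linear(W, f):
    """Compute W^T f for W stored as a list of D rows, each with C columns."""
    num_classes = len(W[0])
    return [sum(W[d][c] * f[d] for d in range(len(f))) for c in range(num_classes)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def bbn_loss(f_c, f_r, W_c, W_r, y_c, y_r, alpha):
    """Mix the two branch logits with alpha, then mix the two cross-entropy terms with alpha."""
    z_c = linear(W_c, f_c)
    z_r = linear(W_r, f_r)
    z = [alpha * a + (1.0 - alpha) * b for a, b in zip(z_c, z_r)]
    p = softmax(z)
    loss = -alpha * math.log(p[y_c]) - (1.0 - alpha) * math.log(p[y_r])
    return z, p, loss

# Toy example (hypothetical values): 2-dimensional features, 3 classes.
f_c, f_r = [1.0, 2.0], [0.5, -1.0]
W_c = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
W_r = [[0.3, 0.0, -0.2], [0.1, 0.2, 0.0]]
z, p, loss = bbn_loss(f_c, f_r, W_c, W_r, y_c=0, y_r=2, alpha=0.7)
```

Setting `alpha=1.0` reduces the computation to ordinary cross-entropy training of the conventional learning branch, and `alpha=0.0` to training of the re-balancing branch alone.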

4.2 Proposed bilateral-branch structure

In this section, we elaborate the details of our unified bilateral-branch structure shown in Figure 3. As aforementioned, the proposed conventional learning branch and re-balancing branch do perform their own duty (i.e., representation learning and classifier learning, respectively). There are two unique designs for these branches.

Data samplers. The input data for the conventional learning branch comes from a uniform sampler, where each sample in the training dataset is sampled only once with equal probability in a training epoch. The uniform sampler retains the characteristics of the original distribution, and therefore benefits representation learning. The re-balancing branch, in contrast, aims to alleviate the extreme imbalance and particularly improve the classification performance on tail classes [van2017devil]; its input data comes from a reversed sampler. For the reversed sampler, the sampling probability of each class is proportional to the reciprocal of its sample size, i.e., the more samples a class has, the smaller its sampling probability. Formally, let $N_i$ denote the number of samples of class $i$ and $N_{\max}$ the maximum sample number over all classes. There are three sub-procedures to construct the reversed sampler: 1) Calculate the sampling probability $P_i$ for class $i$ according to the number of samples as:

$$P_i = \frac{w_i}{\sum_{j=1}^{C} w_j}, \tag{4}$$

where $w_i = N_{\max} / N_i$ and $C$ is the number of classes; 2) Randomly sample a class $i$ according to $P_i$; 3) Uniformly pick a sample from class $i$ with replacement. By repeating this reversed sampling process, the training data of a mini-batch is obtained.
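The three sub-procedures of the reversed sampler can be sketched as follows. This is a minimal plain-Python illustration, assuming per-class weights equal to the largest class size divided by the class size (consistent with the statement that the sampling probability is proportional to the reciprocal of the class size); the function names and the toy class sizes are hypothetical.

```python
import random

def reversed_class_probs(class_counts):
    """Step 1: class probability proportional to 1 / class size."""
    n_max = max(class_counts)
    weights = [n_max / n for n in class_counts]
    total = sum(weights)
    return [w / total for w in weights]

def reversed_sample(indices_by_class, probs, rng):
    """Steps 2 and 3: draw a class from probs, then one of its samples uniformly (with replacement)."""
    cls = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return cls, rng.choice(indices_by_class[cls])

# Toy long-tailed dataset (hypothetical sizes): head, medium and tail class.
class_counts = [500, 50, 5]
probs = reversed_class_probs(class_counts)  # weights 1, 10, 100 normalized by 111

indices_by_class, start = [], 0
for n in class_counts:
    indices_by_class.append(list(range(start, start + n)))
    start += n

rng = random.Random(0)
batch = [reversed_sample(indices_by_class, probs, rng) for _ in range(8)]
```

In this toy setting the tail class is drawn about 100 times more often than the head class, which inverts the original 100:1 imbalance rather than merely flattening it.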

Weights sharing. In BBN, both branches economically share the same residual network structure, as illustrated in Figure 3. We use ResNets [he2016deep] as our backbone network, e.g., ResNet-32 and ResNet-50. In detail, the two branch networks share the same weights except for the last residual block. There are two benefits of sharing weights: on the one hand, the well-learned representation of the conventional learning branch can benefit the learning of the re-balancing branch; on the other hand, sharing weights largely reduces computational complexity in the inference phase.

4.3 Proposed cumulative learning strategy

The cumulative learning strategy is proposed to shift the learning focus between the bilateral branches by controlling the weights for the features produced by the two branches and for the classification loss $\mathcal{L}$. It is designed to first learn the universal patterns and then pay attention to the tail data gradually. In the training phase, the feature $f_c$ of the conventional learning branch is multiplied by $\alpha$ and the feature $f_r$ of the re-balancing branch is multiplied by $1 - \alpha$, where $\alpha$ is automatically generated according to the training epoch. Concretely, the number of total training epochs is denoted as $T_{\max}$ and the current epoch is $T$. $\alpha$ is calculated by:

$$\alpha = 1 - \left( \frac{T}{T_{\max}} \right)^{2}, \tag{5}$$

where $\alpha$ gradually decreases as the training epochs increase.
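The epoch-dependent trade-off can be illustrated with a short sketch. It assumes the parabolic decay form, i.e., one minus the squared fraction of training completed (the "parabolic decay" adaptor of Table 4); the function name, the 100-epoch budget and the convention that epochs are counted from 0 are assumptions made for illustration only.

```python
def parabolic_decay_alpha(epoch, total_epochs):
    """Trade-off weight of the conventional branch: 1 - (epoch / total_epochs)^2."""
    ratio = epoch / total_epochs
    return 1.0 - ratio ** 2

total_epochs = 100  # hypothetical training length, not the paper's setting
schedule = [parabolic_decay_alpha(t, total_epochs) for t in range(total_epochs + 1)]

# The weight stays high for most of training and drops quickly near the end:
# halfway through, the conventional branch still receives weight 0.75.
halfway = parabolic_decay_alpha(50, total_epochs)
```

The slow initial decay keeps the emphasis on the conventional learning branch during the early epochs, and the re-balancing branch only dominates late in training.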

| Method | CIFAR-10, IR 100 | CIFAR-10, IR 50 | CIFAR-10, IR 10 | CIFAR-100, IR 100 | CIFAR-100, IR 50 | CIFAR-100, IR 10 |
|---|---|---|---|---|---|---|
| CE | 29.64 | 25.19 | 13.61 | 61.68 | 56.15 | 44.29 |
| Focal [focalloss] | 29.62 | 23.28 | 13.34 | 61.59 | 55.68 | 44.22 |
| Mixup [mixup] | 26.94 | 22.18 | 12.90 | 60.46 | 55.01 | 41.98 |
| Manifold Mixup [manifoldmixup] | 27.04 | 22.05 | 12.97 | 61.75 | 56.91 | 43.45 |
| Manifold Mixup (two samplers) | 26.90 | 20.79 | 13.17 | 63.19 | 57.95 | 43.54 |
| CE-DRW [ldam] | 23.66 | 20.03 | 12.44 | 58.49 | 54.71 | 41.88 |
| CE-DRS [ldam] | 24.39 | 20.19 | 12.62 | 58.39 | 54.52 | 41.89 |
| CB-Focal [cb-focal] | 25.43 | 20.73 | 12.90 | 60.40 | 54.83 | 42.01 |
| LDAM-DRW [ldam] | 22.97 | 18.97 | 11.84 | 57.96 | 53.38 | 41.29 |
| Our BBN | 20.18 | 17.82 | 11.68 | 57.44 | 52.98 | 40.88 |

Table 1: Top-1 error rates (%) of ResNet-32 on long-tailed CIFAR-10 and CIFAR-100 for imbalance ratios (IR) 100, 50 and 10. The best result in each column is obtained by our BBN.

Intuitively, we design the adapting strategy for the trade-off parameter $\alpha$ based on the motivation that discriminative feature representations are the foundation for learning robust classifiers. Although representation learning and classifier learning deserve equal attention, the learning focus of our BBN should gradually change from feature representations to classifiers, which improves long-tailed recognition accuracy as fully as possible. As $\alpha$ decreases, the main emphasis of BBN turns from the conventional learning branch to the re-balancing branch. Different from two-stage fine-tuning strategies [ldam, cui2018large, ouyang2016factors], our $\alpha$ ensures that both branches, which serve different goals, are constantly updated throughout the whole training process, which avoids adverse effects on one goal while training for the other.

In experiments, we also provide the qualitative results of this intuition by comparing different kinds of adaptors, cf. Section 5.5.2.

4.4 Inference phase

During inference, the test sample is fed into both branches and two features $f'_c$ and $f'_r$ are obtained. Because both branches are equally important, we simply fix the trade-off parameter $\alpha$ to $0.5$ in the test phase. Then, the equally weighted features are fed to their corresponding classifiers (i.e., $W_c$ and $W_r$) to obtain two prediction logits. Finally, both logits are aggregated by element-wise addition to return the classification results.
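The inference rule above amounts to an equally weighted sum of the two branch predictions followed by an arg-max. The sketch below illustrates this in plain Python; the names and toy numbers are hypothetical and the classifiers are stored as lists of rows, so it is an illustration of the rule rather than the released implementation.

```python
def branch_logits(W, f):
    """W is a list of D rows with C columns; returns the C logits W^T f."""
    return [sum(W[d][c] * f[d] for d in range(len(f))) for c in range(len(W[0]))]

def bbn_predict(f_c, f_r, W_c, W_r, alpha=0.5):
    """Weight both features equally, classify each, and add the logits element-wise."""
    z_c = branch_logits(W_c, [alpha * v for v in f_c])
    z_r = branch_logits(W_r, [(1.0 - alpha) * v for v in f_r])
    z = [a + b for a, b in zip(z_c, z_r)]
    return max(range(len(z)), key=z.__getitem__), z

# Toy test-time features and classifiers (hypothetical values).
f_c, f_r = [1.0, 2.0], [0.5, -1.0]
W_c = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
W_r = [[0.3, 0.0, -0.2], [0.1, 0.2, 0.0]]
pred, logits = bbn_predict(f_c, f_r, W_c, W_r)  # pred is the index of the largest summed logit
```

Because the same test sample passes through both branches and most layers are shared, only the last residual block and the classifier are evaluated twice.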

5 Experiments

5.1 Datasets and empirical settings

Long-tailed CIFAR-10 and CIFAR-100. Both CIFAR-10 and CIFAR-100 contain 60,000 images, 50,000 for training and 10,000 for validation, with 10 and 100 categories, respectively. For fair comparisons, we use the same long-tailed versions of the CIFAR datasets as those used in [ldam], with controllable degrees of data imbalance. We use an imbalance factor $\beta$ to describe the severity of the long-tail problem, defined as the number of training samples of the most frequent class divided by that of the least frequent class, i.e., $\beta = N_{\max} / N_{\min}$. The imbalance factors we use in experiments are 10, 50 and 100.

iNaturalist 2017 and iNaturalist 2018. The iNaturalist species classification datasets are large-scale real-world datasets that suffer from extremely imbalanced label distributions. The 2017 version of iNaturalist contains 579,184 images with 5,089 categories and the 2018 version is composed of 437,513 images from 8,142 categories. Note that, besides the extreme imbalance, the iNaturalist datasets also face the fine-grained problem [wei2019deep, Zhao2017]. In this paper, the official splits of training and validation images are utilized for fair comparisons.

5.2 Implementation details

Implementation details on CIFAR. For long-tailed CIFAR-10 and CIFAR-100 datasets, we follow the data augmentation strategies proposed in [he2016deep]: randomly crop a patch from the original image or its horizontal flip with pixels padded on each side. We train the ResNet-32 [he2016deep] as our backbone network for all experiments by standard mini-batch stochastic gradient descent (SGD) with momentum of , weight decay of . We train all the models on a single NVIDIA 1080Ti GPU for epochs with batch size of . The initial learning rate is set to and the first five epochs are trained with a linear warm-up [goyal2017accurate] learning rate schedule. The learning rate is decayed at the and epochs by for our BBN, respectively.

Implementation details on iNaturalist. For fair comparisons, we utilize ResNet-50 [he2016deep] as our backbone network in all experiments on iNaturalist 2017 and iNaturalist 2018. We follow the same training strategy in [goyal2017accurate] with batch size of on four GPUs of NVIDIA 1080Ti. We firstly resize the image by setting the shorter side to pixels and then take a crop from it or its horizontal flip. During training, we decay the learning rate at the and epoch by for our BBN, respectively.

5.3 Comparison methods

In experiments, we compare our BBN model with three groups of methods:

  • Baseline methods. We employ plain training with cross-entropy loss and focal loss [focalloss] as our baselines. Note that we also conduct experiments with a series of mixup algorithms [mixup, manifoldmixup] for comparison.

  • Two-stage fine-tuning strategies. To prove the effectiveness of our cumulative learning strategy, we also compare with the two-stage fine-tuning strategy proposed in previous state-of-the-art [ldam]. We train networks with cross-entropy (CE) on imbalanced data in the first stage, and then conduct class re-balancing training in the second stage. “CE-DRW” and “CE-DRS” refer to the two-stage baselines using re-weighting and re-sampling at the second stage.

  • State-of-the-art methods. For state-of-the-art methods, we compare with the recently proposed LDAM [ldam] and CB-Focal [cb-focal] which achieve good classification accuracy on these four aforementioned long-tailed datasets.

| Method | iNaturalist 2018 | iNaturalist 2017 |
|---|---|---|
| CE | 42.84 | 45.38 |
| CE-DRW [ldam] | 36.27 | 40.48 |
| CE-DRS [ldam] | 36.44 | 40.12 |
| CB-Focal [cb-focal] | 38.88 | 41.92 |
| LDAM-DRW* [ldam] | 32.00 | - |
| LDAM-DRW [ldam] | 35.42 | 39.49 |
| LDAM-DRW [ldam] () | 33.88 | 38.19 |
| Our BBN | 33.71 | 36.61 |
| Our BBN () | 30.38 | 34.25 |

Table 2: Top-1 error rates (%) of ResNet-50 on large-scale long-tailed datasets iNaturalist 2018 and iNaturalist 2017. Our method outperforms the previous state-of-the-art methods by a large margin, especially with scheduler. “*” indicates the original result reported in that paper.

5.4 Main results

5.4.1 Experimental results on long-tailed CIFAR

We conduct extensive experiments on long-tailed CIFAR datasets with three different imbalance ratios: 10, 50 and 100. Table 1 reports the error rates of various methods. Our BBN consistently achieves the best results across all the datasets when compared with the other methods, including the two-stage fine-tuning strategies (i.e., CE-DRW/CE-DRS), the series of mixup algorithms (i.e., mixup, manifold mixup and manifold mixup with the same two samplers as ours), and also the previous state-of-the-art methods (i.e., CB-Focal [cb-focal] and LDAM-DRW [ldam]).

In particular, for long-tailed CIFAR-10 with imbalance ratio 100 (an extreme imbalance case), we get a 20.18% error rate, which is 2.79% lower than that of LDAM-DRW [ldam]. Additionally, Table 1 shows that the two-stage fine-tuning strategies (i.e., CE-DRW/CE-DRS) are effective, since they obtain comparable or even better results than most of the other previous methods.

5.4.2 Experimental results on iNaturalist

Table 2 shows the results on two large-scale long-tailed datasets, i.e., iNaturalist 2018 and iNaturalist 2017. As shown in that table, the two-stage fine-tuning strategies (i.e., CE-DRW/CE-DRS) also perform well, which is consistent with the observations on long-tailed CIFAR. Compared with other methods, on iNaturalist, our BBN still outperforms competing approaches and baselines. Besides, since iNaturalist is large-scale, we also conduct network training with the scheduler. Meanwhile, for fair comparisons, we further evaluate the previous state-of-the-art LDAM-DRW [ldam] with the training scheduler. With the scheduler, our BBN achieves significantly better results than BBN without the scheduler. Additionally, compared with LDAM-DRW (), we achieve +3.50% and +3.94% improvements on iNaturalist 2018 and iNaturalist 2017, respectively. In addition, even without the scheduler, our BBN still gets the best results among the methods we trained. Note that we conducted the LDAM [ldam] experiments with the source code provided by its authors, but failed to reproduce the results reported in that paper.

5.5 Ablation studies

5.5.1 Different samplers for the re-balancing branch

To better understand our proposed BBN model, we conduct experiments on different samplers used in the re-balancing branch. We present the error rates of models trained with different samplers in Table 3. For clarity, the uniform sampler maintains the original long-tailed distribution. The balanced sampler assigns the same sampling probability to all classes and constructs mini-batches that obey a balanced label distribution. As shown in that table, the reversed sampler (our proposal) achieves considerably better performance than the uniform and balanced samplers, which indicates that the re-balancing branch of BBN should pay more attention to the tail classes by using the reversed sampler.

Sampler | Error rate (%)
Uniform sampler | 21.31
Balanced sampler | 21.06
Reversed sampler (Ours) | 17.82
Table 3: Ablation studies for different samplers for the re-balancing branch of BBN on long-tailed CIFAR-10-IR50.
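
To make the difference between the three samplers concrete, the following is a minimal sketch of the per-class sampling probabilities. The uniform and balanced cases follow the description above. The reversed case assumes that a class is weighted by the largest class size divided by its own size, which is our reading and is not spelled out in this section. The function names `class_sampling_probs` and `draw` are hypothetical.

```python
import random
from collections import Counter


def class_sampling_probs(labels, mode):
    """Return {class: probability that a drawn sample belongs to that class}."""
    counts = Counter(labels)
    if mode == "uniform":
        # Every instance is equally likely, so the long-tailed
        # class frequencies are preserved.
        weights = {c: float(n) for c, n in counts.items()}
    elif mode == "balanced":
        # Every class is equally likely, regardless of its size.
        weights = {c: 1.0 for c in counts}
    elif mode == "reversed":
        # Assumption: weight inversely proportional to class size,
        # so tail classes are drawn more often than head classes.
        n_max = max(counts.values())
        weights = {c: n_max / n for c, n in counts.items()}
    else:
        raise ValueError(f"unknown sampler mode: {mode}")
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def draw(labels, mode, rng=random):
    """Draw one sample index: pick a class, then an instance within it."""
    probs = class_sampling_probs(labels, mode)
    classes = sorted(probs)
    chosen = rng.choices(classes, weights=[probs[c] for c in classes], k=1)[0]
    members = [i for i, y in enumerate(labels) if y == chosen]
    return rng.choice(members)
```

For a toy dataset with eight head samples and two tail samples, this sketch draws the tail class with probability 0.2 under the uniform sampler, 0.5 under the balanced sampler, and 0.8 under the reversed sampler.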

5.5.2 Different cumulative learning strategy

To facilitate the understanding of our proposed cumulative learning strategy, we explore several different strategies to generate the adaptive trade-off parameter $\alpha$ on CIFAR-10-IR50. Specifically, we test both progress-relevant and progress-irrelevant strategies, cf. Table 4. For clarity, progress-relevant strategies adjust $\alpha$ with the number of training epochs, e.g., linear decay, cosine decay, etc. Progress-irrelevant strategies include using equal weights or sampling $\alpha$ from a discrete distribution (e.g., the $\beta$-distribution).

Adaptor | Error rate (%)
Equal weight | 21.56
$\beta$-distribution | 21.75
Parabolic increment | 22.70
Linear decay | 18.55
Cosine decay | 18.04
Parabolic decay (Ours) | 17.82
Table 4: Ablation studies of different adaptor strategies of BBN on long-tailed CIFAR-10-IR50.

As shown in Table 4, the decay strategies (i.e., linear decay, cosine decay and our parabolic decay) for generating $\alpha$ yield better results than the other strategies (i.e., equal weight, $\beta$-distribution and parabolic increment). These observations support our motivation that the conventional learning branch should be learned first, followed by the re-balancing branch. Among these strategies, the best way of generating $\alpha$ is the proposed parabolic decay approach. In addition, the parabolic increment, where re-balancing is attended to before conventional learning, performs the worst, which validates our proposal from another perspective.
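
The progress-relevant adaptors compared above can be sketched as simple functions of the training progress. The closed forms below are illustrative assumptions, since this section names the strategies but does not restate their formulas. Only the direction of each schedule (decay versus increment) is taken from the text. The function name `adaptor` is hypothetical, and the trade-off value is taken to be the weight given to the conventional learning branch.

```python
import math


def adaptor(strategy, epoch, max_epoch):
    """Return the trade-off value for the conventional learning branch.

    The formulas are assumed forms for illustration, not a restatement
    of the equations in the paper.
    """
    r = epoch / max_epoch  # training progress in [0, 1]
    if strategy == "equal_weight":
        return 0.5                        # progress-irrelevant
    if strategy == "linear_decay":
        return 1.0 - r
    if strategy == "cosine_decay":
        return math.cos(r * math.pi / 2)
    if strategy == "parabolic_decay":
        return 1.0 - r ** 2               # stays near 1 early, drops late
    if strategy == "parabolic_increment":
        return r ** 2                     # re-balancing is emphasized first
    raise ValueError(f"unknown strategy: {strategy}")
```

Under these assumed forms, all three decay strategies start at 1 and end at 0, and they differ only in how long the conventional branch dominates. Halfway through training, the parabolic decay still gives 0.75 to the conventional branch, whereas the linear decay gives 0.5.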

5.6 Validation experiments of our proposals

5.6.1 Evaluations of feature quality

It is shown in Section 3 that learning with vanilla CE on the original data distribution can obtain good feature representations. In this subsection, we further explore the representation quality of our proposed BBN by following the empirical settings in Section 3. Concretely, given a BBN model trained on CIFAR-100-IR50, we first fix the representation learning parameters of the two branches. Then, we separately retrain the corresponding classifiers of the two branches from scratch, also on CIFAR-100-IR50. Finally, classification error rates are tested on these two branches independently.

As shown in Table 5, the feature representations obtained by the conventional learning branch of BBN (“BBN-CB”) achieve comparable performance with CE, which indicates that our proposed BBN largely preserves the representation capacity learned from the original long-tailed dataset. Note that the re-balancing branch of BBN (“BBN-RB”) also performs better than RW/RS, which is possibly owing to the parameter-sharing design of our model.

Representation learning manner | Error rate (%)
CE | 58.62
RW | 63.17
RS | 63.71
BBN-CB | 58.89
BBN-RB | 61.09
Table 5: Feature quality evaluation for different learning manners.

5.6.2 Visualization of classifier weights

Denote $W = \{w_1, w_2, \ldots, w_C\}$ as the set of classifiers for all $C$ classes, where $w_i$ indicates the weight vector for class $i$. Previous work [guo2017one] has shown that the $\ell_2$-norm of $w_i$ for different classes can demonstrate the preference of a classifier, i.e., the classifier $w_i$ with the largest $\ell_2$-norm tends to judge an example as belonging to its class $i$. Following [guo2017one], we visualize the $\ell_2$-norm of these classifiers.

Figure 4: $\ell_2$-norm of classifier weights for different learning manners. Specifically, “BBN-ALL” indicates the $\ell_2$-norm of the combination of the classifier weights of the two branches in our model. The value in the legend is the standard deviation of the $\ell_2$-norm for ten classes.

As shown in Figure 4, we visualize the $\ell_2$-norm of the classifiers of ten classes trained on CIFAR-10-IR50. For our BBN, we visualize the classifier weights of the conventional learning branch (“BBN-CB”), the classifier weights of the re-balancing branch (“BBN-RB”), and their combined classifier weights (“BBN-ALL”). Additionally, the visualization results for classifiers trained with the learning manners in Section 3, i.e., CE, RW and RS, are also provided.

The $\ell_2$-norms of the ten classes’ classifiers for our proposed model (i.e., “BBN-ALL”) are basically equal, and their standard deviation is the smallest. For the classifiers trained by other learning manners, the distribution of the $\ell_2$-norm of CE is consistent with the long-tailed distribution. The $\ell_2$-norm distribution of RW/RS looks somewhat flat, but their standard deviations are larger than ours. This offers an explanation of why our BBN can outperform these methods. Additionally, analyzing the two branches of our model separately, its conventional learning branch (“BBN-CB”) has an $\ell_2$-norm distribution similar to that of CE, which is consistent with its duty of focusing on universal feature learning. The $\ell_2$-norm distribution of the re-balancing branch (“BBN-RB”) is reversed w.r.t. the original long-tailed distribution, which reveals that it is able to model the tail.
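
The quantities discussed above, i.e., the per-class norm of the classifier weights and its spread across classes, can be computed as in the following minimal sketch. It assumes the Euclidean norm and the population standard deviation. The text does not state which standard deviation estimator is used, so that choice is an assumption, and the function names are hypothetical.

```python
import math
import statistics


def class_weight_norms(weight_vectors):
    """Euclidean norm of each class's classifier weight vector."""
    return [math.sqrt(sum(w * w for w in vec)) for vec in weight_vectors]


def norm_spread(weight_vectors):
    """Standard deviation of the per-class norms.

    Assumption: population standard deviation. A small value means the
    classifier shows no strong preference for particular classes.
    """
    return statistics.pstdev(class_weight_norms(weight_vectors))
```

A classifier whose per-class norms are all equal has a spread of zero, which corresponds to the nearly flat profile described for “BBN-ALL”.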

6 Conclusions

In this paper, to study long-tailed problems, we explored how class re-balancing strategies influence the representation learning and classifier learning of deep networks, and revealed that they can promote classifier learning significantly but also damage representation learning to some extent. Motivated by this, we proposed a Bilateral-Branch Network (BBN) with a specific cumulative learning strategy to take care of both representation learning and classifier learning, in order to improve the recognition performance on long-tailed tasks. Through extensive experiments, we showed that our BBN achieves the best results on long-tailed benchmarks, including the large-scale iNaturalist. In the future, we will attempt to tackle long-tailed detection problems with our BBN model.

Supplementary Materials

In the supplementary materials, we provide more experimental results and analyses of our proposed BBN model, including:

  1. Additional experiments of different manners for representation and classifier learning (cf. Section 3 and Figure 2 of the paper) on large-scale datasets iNaturalist 2017 and iNaturalist 2018;

  2. Effects of re-balancing strategies on the compactness of learned features;

  3. Comparisons between the BBN model and ensemble methods;

  4. Coordinate graph of different kinds of adaptor strategies for generating $\alpha$;

  5. Learning algorithm of our proposed BBN model.

Appendix A Additional experiments of different manners for representation and classifier learning (cf. Section 3 and Figure 2 of the paper) on large-scale datasets iNaturalist 2017 and iNaturalist 2018

In this section, following Section 3 of our paper, we conduct experiments on large-scale datasets, i.e., iNaturalist 2017 [van2018inaturalist] and iNaturalist 2018, to further justify our conjecture (i.e., the working mechanism of these class re-balancing strategies is to promote classifier learning significantly, but they might damage the universal representative ability of the learned deep features by distorting the original distributions). Specifically, the representation learning stages are conducted on iNaturalist 2017. Then, to also evaluate the generalization ability of the learned representations, the classifier learning stages are performed not only on iNaturalist 2017 but also on iNaturalist 2018.

As shown in Figure 5, we again make observations from two perspectives on these large-scale long-tailed datasets:

  • Classifiers: When we apply the same representation learning manner (comparing error rates of three blocks in the vertical direction), RW/RS always achieve lower classification error rates than CE, which is owing to their re-balancing operations adjusting the updates of the classifier weights to match the test distributions.

  • Representations: When applying the same classifier learning manner (comparing error rates of three blocks in the horizontal direction), it is somewhat surprising to see that the error rates of CE blocks are consistently lower than the error rates of RW/RS blocks. The findings indicate that training with CE achieves better classification results since it obtains better features. The worse results of RW/RS reveal that they lead to inferior discriminative ability of the learned deep features.

These observations are consistent with those on long-tailed CIFAR datasets, which can further demonstrate our discovery of Section 3 in the paper.

Figure 5: Top-1 error rates of different manners for representation learning and classifier learning on two large-scale long-tailed datasets iNaturalist 2017 and iNaturalist 2018. “CE” (Cross-Entropy), “RW” (Re-Weighting) and “RS” (Re-Sampling) are the conducted learning manners.

Appendix B Effects of re-balancing strategies on the compactness of learned features

To further support our conjecture that re-balancing strategies could damage the universal representations, we measure the compactness of intra-class representations on CIFAR-10-IR50 [cifar] for verification.

Concretely, for each class, we first calculate a centroid vector by averaging the representations of this class. Then, the distances between these representations and their centroid are computed and averaged as a measurement of the compactness of intra-class representations. If the averaged distance of a class is small, it implies that the representations of this class gather closely in the representation space. We normalize the representations by their $\ell_2$-norm in the training stage to avoid the impact of feature scales. We report results based on representations learned with cross-entropy (CE), re-weighting (RW) and re-sampling (RS), respectively.
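
The measurement described above can be sketched as follows. This is a minimal illustration that assumes the Euclidean distance (the text only says "distances") and assumes the input features have already been normalized. The function name `intra_class_compactness` is hypothetical.

```python
import math


def intra_class_compactness(features, labels):
    """Return {class: mean distance between its features and their centroid}.

    Assumptions: Euclidean distance, and features already normalized so
    that feature scale does not affect the measurement.
    """
    by_class = {}
    for feat, label in zip(features, labels):
        by_class.setdefault(label, []).append(feat)

    compactness = {}
    for label, feats in by_class.items():
        dim = len(feats[0])
        # Centroid: element-wise average of all features of this class.
        centroid = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        # Average distance to the centroid (smaller means more compact).
        dists = [math.dist(f, centroid) for f in feats]
        compactness[label] = sum(dists) / len(dists)
    return compactness
```

A class whose features all coincide yields a value of zero, and the value grows as the features of the class scatter away from their centroid.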

As shown in Figure 6, the averaged distances of re-balancing strategies are clearly larger than those of conventional training, especially for the head classes. That is to say, the compactness of the features learned with re-balancing strategies is significantly worse than that of conventional training. These observations further validate the statements in Figure 1 in the paper (i.e., for re-balancing strategies, “the intra-class distribution of each class becomes more separable”) and also the discovery of Section 3 in the paper (i.e., re-balancing “might damage the universal representative ability of the learned deep features to some extent”).

Figure 6: Histogram of the measurement of the compactness of intra-class representations on the CIFAR-10-IR50 dataset. Especially for head classes, representations trained with CE gather more closely than those trained with RW/RS, since the representations of each class are closer to their centroid. The vertical axis is the averaged distance between the learned features of each class and their corresponding centroid (the smaller, the better).

Appendix C Comparisons between the BBN model and ensemble methods

In the following, we compare our BBN model with ensemble methods to demonstrate the effectiveness of our proposed model. Results on CIFAR-10-IR50 [cifar], CIFAR-100-IR50 [cifar], iNaturalist 2017 [van2018inaturalist] and iNaturalist 2018 are provided in Table 6 for comprehensiveness.

Ensemble techniques are frequently used to boost performance in machine learning tasks. We train three classification models with the uniform data sampler, the balanced data sampler and the reversed data sampler, respectively. To mimic our bi-branch network design and for fair comparison, we provide classification error rates of (1) an ensemble of models learned with the uniform sampler and the balanced sampler, as well as (2) another ensemble of models learned with the uniform sampler and the reversed sampler.

As shown in Table 6, our BBN model achieves consistently lower error rates than the ensemble models on all datasets. Additionally, compared to ensemble models, our proposed BBN model yields better performance with a limited increase in network parameters thanks to its weight-sharing design (cf. Ln. 486-496 in the paper).

Methods | CIFAR-10-IR50 | CIFAR-100-IR50 | iNaturalist 2017 | iNaturalist 2018
Uniform sampler + Balanced sampler | 19.41 | 55.10 | 39.53 | 36.20
Uniform sampler + Reversed sampler | 19.38 | 54.93 | 40.02 | 36.66
BBN (Ours) | 17.82 | 52.98 | 36.61 | 33.74
Table 6: Classification error rates of our proposed BBN model and ensemble methods.
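
For reference, a two-model ensemble of the kind compared above can be sketched as follows. How the ensemble members are combined is not stated in this section, so averaging the predicted class probabilities is an assumption made here for illustration, and the function names are hypothetical.

```python
def ensemble_predict(probs_a, probs_b):
    """Predict a class by averaging two models' class probabilities.

    Assumption: the ensemble combines members by simple averaging.
    """
    avg = [(a + b) / 2 for a, b in zip(probs_a, probs_b)]
    return max(range(len(avg)), key=avg.__getitem__)


def error_rate(predictions, targets):
    """Classification error rate in percent."""
    wrong = sum(p != t for p, t in zip(predictions, targets))
    return 100.0 * wrong / len(targets)
```

Note that such an ensemble keeps two full networks, whereas BBN shares weights between its two branches, which is the parameter saving referred to above.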

Appendix D Coordinate graph of different kinds of adaptor strategies for generating $\alpha$

As shown in Figure 7, we provide a coordinate graph to present how $\alpha$ varies with the progress of network training. The adaptor strategies shown in the figure are the same as those in Table 4 of the paper, except for the $\beta$-distribution, which is omitted because of its randomness.

Figure 7: Different kinds of adaptor strategies for generating $\alpha$. The horizontal axis indicates the current epoch ratio and the vertical axis denotes the value of $\alpha$. (Best viewed in color)

Appendix E Learning algorithm of our proposed BBN model

In the following, we provide the detailed learning algorithm of our proposed BBN. In Algorithm 1, for each training epoch $T$, we first assign a value to $\alpha$ by the adaptor proposed in Eq. (5) of the paper. Then, we draw training samples with the uniform sampler and the reversed sampler, respectively. Feeding the samples into our network, we obtain two independent feature vectors $f_c$ and $f_r$. Subsequently, we calculate the output logits $z$ and the prediction probability $\hat{p}$ according to Eq. (1) and Eq. (2) in the paper. Finally, the classification loss function is calculated based on Eq. (3) in the paper, and we update the model parameters by optimizing this loss function.

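Require: Training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$; $\mathrm{UniformSampler}(\cdot)$ denotes obtaining a sample from $\mathcal{D}$ chosen by a uniform sampler; $\mathrm{ReversedSampler}(\cdot)$ denotes obtaining a sample from $\mathcal{D}$ chosen by a reversed sampler; $\mathcal{F}_{cnn}(\cdot; \theta)$ denotes extracting the feature representation from a CNN; $\theta_c$ and $\theta_r$ denote the model parameters of the conventional learning and re-balancing branch, respectively; $W_c$ and $W_r$ present the weights of classifiers (i.e., the last fully connected layer) of the conventional learning and re-balancing branch, respectively.

1:  for $T = 1$ to $T_{max}$ do
2:     $\alpha \leftarrow 1 - (T / T_{max})^2$
3:     $(x_c, y_c) \leftarrow \mathrm{UniformSampler}(\mathcal{D})$
4:     $(x_r, y_r) \leftarrow \mathrm{ReversedSampler}(\mathcal{D})$
5:     $f_c \leftarrow \mathcal{F}_{cnn}(x_c; \theta_c)$
6:     $f_r \leftarrow \mathcal{F}_{cnn}(x_r; \theta_r)$
7:     $z \leftarrow \alpha W_c^{\top} f_c + (1 - \alpha) W_r^{\top} f_r$
8:     $\hat{p} \leftarrow \mathrm{Softmax}(z)$
9:     $\mathcal{L} \leftarrow \alpha E(\hat{p}, y_c) + (1 - \alpha) E(\hat{p}, y_r)$
10:     Update model parameters by minimizing $\mathcal{L}$
11:  end for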
Algorithm 1 Learning algorithm of our proposed BBN
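
The forward computation of one iteration, as described in the paragraph preceding Algorithm 1, can be sketched in plain Python as follows. This is a minimal illustration for a single sample per branch, and not the released implementation. It assumes that the logits of the two branches are mixed with the trade-off value and its complement, and that the loss mixes the two cross-entropy terms in the same way. Since Eq. (1) to Eq. (3) are not reproduced in this section, that form is an assumption. Feature extraction by the CNN is omitted, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(class_weights, feature):
    """One logit per class: dot product of each class weight vector and the feature."""
    return [sum(w * f for w, f in zip(row, feature)) for row in class_weights]


def bbn_loss(f_c, y_c, f_r, y_r, w_c, w_r, alpha):
    """Loss for one pair of samples (assumed form, see lead-in).

    f_c, y_c: feature and label from the uniform sampler (conventional branch)
    f_r, y_r: feature and label from the reversed sampler (re-balancing branch)
    w_c, w_r: classifier weights of the two branches, one vector per class
    alpha:    trade-off value produced by the adaptor
    """
    z_c = linear(w_c, f_c)
    z_r = linear(w_r, f_r)
    # Mix the logits of the two branches.
    z = [alpha * a + (1.0 - alpha) * b for a, b in zip(z_c, z_r)]
    p = softmax(z)
    # Mix the cross-entropy terms of the two labels with the same weights.
    return -(alpha * math.log(p[y_c]) + (1.0 - alpha) * math.log(p[y_r]))
```

With the trade-off value equal to 1, the sketch reduces to the ordinary cross-entropy of the conventional branch, and with 0 it reduces to that of the re-balancing branch, which matches the intended shift of attention from head to tail data during training.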

References