Our work focuses on tackling the challenging but natural visual recognition task of long-tailed data distribution (i.e., a few classes occupy most of the data, while most classes have very few samples). In the literature, class re-balancing strategies (e.g., re-weighting and re-sampling) are the prominent and effective methods proposed to alleviate the extreme imbalance in long-tailed problems. In this paper, we first discover that these re-balancing methods achieve satisfactory recognition accuracy because they significantly promote the classifier learning of deep networks. However, at the same time, they unexpectedly damage the representative ability of the learned deep features to some extent. Therefore, we propose a unified Bilateral-Branch Network (BBN) to take care of both representation learning and classifier learning simultaneously, where each branch performs its own duty separately. In particular, our BBN model is further equipped with a novel cumulative learning strategy, which is designed to first learn the universal patterns and then gradually pay attention to the tail data. Extensive experiments on four benchmark datasets, including the large-scale iNaturalist ones, show that the proposed BBN significantly outperforms state-of-the-art methods. Furthermore, validation experiments support both our preliminary discovery and the effectiveness of the tailored designs in BBN for long-tailed problems. Our method won first place in the iNaturalist 2019 large-scale species classification competition, and our code is open-source and available at https://github.com/MegviiNanjing/BBN
With the advent of research on deep Convolutional Neural Networks (CNNs), the performance of image classification has witnessed incredible progress. This success is inseparable from the availability of high-quality large-scale datasets, e.g., ImageNet ILSVRC 2012 [imagenet], MS COCO [coco] and Places Database [zhou2017places], etc. In contrast with these visual recognition datasets, which exhibit roughly uniform distributions of class labels, real-world datasets always have skewed distributions with a long tail [kendall1948advanced, van2017devil], i.e., a few classes (a.k.a. head classes) occupy most of the data, while most classes (a.k.a. tail classes) have very few samples, cf. Figure 1. Moreover, more and more long-tailed datasets reflecting these realistic challenges have been constructed and released by the computer vision community in recent years, e.g., iNaturalist [cui2018large], LVIS [Gupta2019LVIS] and RPC [wei2019rpc]. When dealing with such visual data, deep learning methods struggle to achieve outstanding recognition accuracy due to both the data-hungry limitation of deep models and the extreme class imbalance of long-tailed data distributions.
In the literature, the prominent and effective methods for handling long-tailed problems are class re-balancing strategies, which are proposed to alleviate the extreme imbalance of the training data. Generally, class re-balancing methods are roughly categorized into two groups, i.e., re-sampling [shen2016relay, buda2018systematic, japkowicz2002class, he2009learning, byrd2019effect, drummond2003c4, more2016survey, chawla2002smote] and cost-sensitive re-weighting [huang2016learning, wang2017learning, cbfocal, ren18l2rw]. These methods adjust the network training by re-sampling the examples or re-weighting the losses of examples within mini-batches, so that training is, in expectation, closer to the test distribution. Thus, class re-balancing directly influences the updating of the classifier weights of deep networks, i.e., it promotes classifier learning. That is the reason why re-balancing can achieve satisfactory recognition accuracy on long-tailed data.
However, although re-balancing methods yield good final predictions, we argue that these methods still have adverse effects, i.e., they also unexpectedly damage the representative ability of the learned deep features (i.e., the representation learning) to some extent. Concretely, when data imbalance is extreme, re-sampling risks over-fitting the tail data (by over-sampling) and under-fitting the whole data distribution (by under-sampling). Re-weighting, in turn, distorts the original distribution by directly changing or even inverting the frequency with which data are presented.
As a preliminary of our work, we justify the aforementioned argument by conducting validation experiments. Specifically, to figure out how re-balancing strategies work, we divide the training process of deep networks into two stages, i.e., we conduct representation learning and classifier learning separately. In the former stage for representation learning, we employ plain training (conventional cross-entropy), re-weighting and re-sampling as three learning manners to obtain their corresponding learned representations. Then, in the latter stage for classifier learning, we first fix the parameters of representation learning (i.e., backbone layers) converged at the former stage and then retrain the classifiers of these networks (i.e., fully-connected layers) from scratch, also with the three aforementioned learning manners. In Figure 5, the prediction error rates on two benchmark long-tailed datasets [ldam], i.e., CIFAR-100-IR50 and CIFAR-10-IR50, are reported. When the representation learning manner is fixed, re-balancing methods achieve lower error rates, indicating that they promote classifier learning. On the other hand, when the classifier learning manner is fixed, plain training on the original imbalanced data brings better results owing to its better features. The worse results of re-balancing methods in this setting indicate that they hurt feature learning.
Therefore, in this paper, for exhaustively improving the recognition performance on long-tailed problems, we propose a unified Bilateral-Branch Network (BBN) model to take care of both representation learning and classifier learning simultaneously. As shown in Figure 3, our BBN model consists of two branches, termed the “conventional learning branch” and the “re-balancing branch”. In general, each branch of BBN separately performs its own duty for representation learning and classifier learning, respectively. As the name suggests, the conventional learning branch, equipped with the typical uniform sampler w.r.t. the original data distribution, is responsible for learning universal patterns for recognition, while the re-balancing branch, coupled with a reversed sampler, is designed to model the tail data. After that, the predicted outputs of these bilateral branches are aggregated in the cumulative learning part by an adaptive trade-off parameter $\alpha$. $\alpha$ is automatically generated by the “Adaptor” according to the number of training epochs, which adjusts the whole BBN model to first learn the universal features from the original distribution and then gradually pay attention to the tail data. More importantly, $\alpha$ further controls the parameter updating of each branch, which, for example, avoids damaging the learned universal features when emphasizing the tail data in the later periods of training.

In experiments, empirical results on four benchmark long-tailed datasets show that our model clearly outperforms existing state-of-the-art methods. Moreover, extensive validation experiments and ablation studies support the aforementioned preliminary discovery and also validate the effectiveness of our tailored designs for long-tailed problems.
The main contributions of this paper are as follows:
- We explore the mechanism of the prominent class re-balancing methods for long-tailed problems, and further discover that these methods significantly promote classifier learning while adversely affecting the representation learning w.r.t. the original data distribution.
- We propose a unified Bilateral-Branch Network (BBN) model to take care of both representation learning and classifier learning for exhaustively boosting long-tailed recognition. Also, a novel cumulative learning strategy is developed for adjusting the bilateral learning and is coupled with the training of our BBN model.
- We evaluate our model on four benchmark long-tailed visual recognition datasets, and our proposed model consistently achieves superior performance over previous competing approaches.
Class re-balancing strategies: Re-sampling methods, as one of the most important class re-balancing strategies, can be divided into two types: 1) over-sampling by simply repeating data for minority classes [shen2016relay, buda2018systematic, byrd2019effect] and 2) under-sampling by abandoning data for dominant classes [japkowicz2002class, buda2018systematic, he2009learning]. However, with re-sampling, duplicated tail samples might lead to over-fitting on minority classes [chawla2002smote, cbfocal], while discarding precious data will certainly impair the generalization ability of deep networks.
Re-weighting methods are another series of prominent class re-balancing strategies, which usually allocate large weights to training samples of tail classes in loss functions [huang2016learning, wang2017learning]. However, re-weighting is not capable of handling the large-scale, real-world scenarios of long-tailed data and tends to cause optimization difficulty [mikolov2013distributed]. Consequently, Cui et al. [cbfocal] proposed to adopt the effective number of samples instead of proportional frequency. Thereafter, Cao et al. [ldam] explored the margins of the training examples and designed a label-distribution-aware loss to encourage larger margins for minority classes.

In addition, some two-stage fine-tuning strategies [ldam, cui2018large, ouyang2016factors] were recently developed to modify re-balancing for effectively handling long-tailed problems. Specifically, they separated the training process into two stages. In the first stage, they trained networks as usual on the original imbalanced data, and only in the second stage did they utilize re-balancing to fine-tune the network with a small learning rate.
Beyond that, other methods of different learning paradigms were also proposed to deal with long-tailed problems, e.g., metric learning [zhang2017range, huang2016learning], meta-learning [Liu_2019_CVPR], and knowledge transfer learning [wang2017learning, Zhong_2019_CVPR], which are not within the scope of this paper.

Mixup: Mixup [mixup] is a general data augmentation algorithm, i.e., it convexly combines random pairs of training images and their associated labels to generate additional samples when training deep networks. Also, manifold mixup [manifoldmixup] conducts mixup operations on random pairs of samples in the manifold feature space for augmentation. The mixing ratios in mixup are sampled from the Beta distribution to increase the randomness of augmentation. Although mixup is clearly far from our unified end-to-end trainable model, in experiments we still compare with a series of mixup algorithms to validate the effectiveness of our method.
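For readers unfamiliar with mixup, the convex combination described above can be sketched as follows. This is a minimal illustration rather than the reference implementation of [mixup]: the function name `mixup`, the toy inputs and the Beta(1, 1) parameters are assumptions made only for this example.

```python
import random

def mixup(x_a, y_a, x_b, y_b, lam):
    """Convexly combine two inputs and their one-hot labels with ratio lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

rng = random.Random(0)
# Mixing ratio drawn from a Beta distribution; Beta(1, 1) is a hypothetical choice.
lam = rng.betavariate(1.0, 1.0)

# Two toy "images" (flattened to 3 values) with one-hot labels over 2 classes.
x_mix, y_mix = mixup([0.0, 1.0, 2.0], [1.0, 0.0], [4.0, 3.0, 2.0], [0.0, 1.0], lam)
print(lam, x_mix, y_mix)
```

The mixed label remains a valid probability vector because the two mixing coefficients sum to one.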
In this section, we attempt to figure out the working mechanism of these class re-balancing methods. More concretely, we divide a deep classification model into two essential parts: 1) the feature extractor (i.e., the frontal base/backbone network) and 2) the classifier (i.e., the last fully-connected layers). Accordingly, the learning process of a deep classification network can be separated into representation learning and classifier learning. Since class re-balancing strategies boost classification accuracy by altering the training data distribution to be closer to the test distribution and paying more attention to the tail classes, we conjecture that these strategies work by significantly promoting classifier learning, but might damage the universal representative ability of the learned deep features by distorting the original distribution.
In order to justify our conjecture, we design a two-stage experimental fashion to separately learn the representations and classifiers of deep models. Concretely, in the first stage, we train a classification network with plain training (i.e., cross-entropy) or re-balancing methods (i.e., re-weighting/re-sampling) as learning manners. We thereby obtain different feature extractors corresponding to these learning manners. In the second stage, we fix the parameters of the feature extractors learned in the former stage, and retrain classifiers from scratch, again with the aforementioned learning manners. In principle, these experiments follow the control-variable method (only one factor is changed at a time), so that the quality of the representations and classifiers learned by different manners can be compared fairly.
The CIFAR [cifar] datasets are a collection of images that are commonly used to assess computer vision approaches. Previous work [cbfocal, ldam] created long-tailed versions of CIFAR datasets with different imbalance ratios, i.e., the number of training samples of the most frequent class divided by that of the least frequent class, to evaluate the performance. In this section, following [ldam], we also use long-tailed CIFAR-10/CIFAR-100 as the test beds.
As shown in Figure 5, we conduct several contrast experiments to validate our conjecture on CIFAR-100-IR50 (long-tailed CIFAR-100 with imbalance ratio 50). As aforementioned, we separate the whole network into two parts: the feature extractor and the classifier. Then, following our two-stage training fashion, we apply each of three learning manners to the feature learning and to the classifier learning, which yields nine groups of results, one for each combination:
(1) Cross-Entropy (CE): We train the networks as usual on the original imbalanced data with the conventional cross-entropy loss.
(2) Re-Sampling (RS): We first sample a class uniformly and then collect an example from that class by sampling with replacement. By repeating this process, a balanced mini-batch is obtained.
(3) Re-Weighting (RW): We re-weight all the samples by the inverse of the sample size of their classes.
The error rate is evaluated on the validation set. As shown in Figure 5, we make observations from two perspectives:
- Classifiers: When we apply the same representation learning manner (comparing error rates of three blocks in the vertical direction), RW/RS always achieve lower classification error rates than CE, because their re-balancing operations adjust the updating of the classifier weights to match the test distribution.
- Representations: When we apply the same classifier learning manner (comparing error rates of three blocks in the horizontal direction), it is somewhat surprising that the error rates of CE blocks are consistently lower than those of RW/RS blocks. This indicates that training with CE achieves better classification results because it obtains better features. The worse results of RW/RS reveal that they lead to inferior discriminative ability of the learned deep features.
Furthermore, as shown in Figure 5 (left), by employing CE for representation learning and RS for classifier learning, we achieve the lowest error rate on the validation set of CIFAR-100-IR50. Additionally, to evaluate the generalization ability of the representations produced by the three manners, we utilize models pre-trained on CIFAR-100-IR50 as feature extractors to obtain the representations of CIFAR-10-IR50, and then perform the same classifier learning experiments as above. As shown in Figure 5 (right), identical observations hold on CIFAR-10-IR50, even though the feature extractor is trained on another long-tailed dataset.
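The RS and RW learning manners used in these experiments can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the function names, the toy label list and the batch size are hypothetical.

```python
import random
from collections import Counter

def class_balanced_batch(labels, batch_size, rng):
    """RS: sample a class uniformly, then one of its examples with replacement."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    classes = sorted(by_class)
    batch = []
    for _ in range(batch_size):
        cls = rng.choice(classes)                # uniform over classes
        batch.append(rng.choice(by_class[cls]))  # uniform within the class
    return batch

def inverse_frequency_weights(labels):
    """RW: per-sample loss weight equal to 1 / (size of the sample's class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Toy long-tailed label list (hypothetical): class 0 is the head, class 2 the tail.
labels = [0] * 90 + [1] * 9 + [2] * 1
rng = random.Random(0)

batch = class_balanced_batch(labels, 3000, rng)
batch_counts = Counter(labels[i] for i in batch)
weights = inverse_frequency_weights(labels)

print(batch_counts)                           # roughly equal counts per class
print(weights[0], weights[90], weights[99])   # head, middle and tail weights
```

With RS, the single tail example is repeated about as often as all 90 head examples together are drawn, which illustrates the over-fitting risk discussed earlier. With RW, the weights of each class sum to the same total.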
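As shown in Figure 3, our BBN consists of three main components. Concretely, we design two branches for representation learning and classifier learning, termed “conventional learning branch” and “re-balancing branch”, respectively. Both branches use the same residual network structure [he2016deep] and share all the weights except for the last residual block. Let $\mathbf{x}$ denote a training sample and $y \in \{1, 2, \ldots, C\}$ its corresponding label, where $C$ is the number of classes. For the bilateral branches, we apply uniform and reversed samplers to each of them and obtain two samples $(\mathbf{x}_c, y_c)$ and $(\mathbf{x}_r, y_r)$ as the input data, where $(\mathbf{x}_c, y_c)$ is for the conventional learning branch and $(\mathbf{x}_r, y_r)$ is for the re-balancing branch. Then, the two samples are fed into their own corresponding branches to acquire the feature vectors $\mathbf{f}_c \in \mathbb{R}^D$ and $\mathbf{f}_r \in \mathbb{R}^D$ by global average pooling.
Furthermore, we also design a specific cumulative learning strategy for shifting the learning “attention” between the two branches in the training phase. Concretely, by controlling the weights for $\mathbf{f}_c$ and $\mathbf{f}_r$ with an adaptive trade-off parameter $\alpha$, the weighted feature vectors $\alpha \mathbf{f}_c$ and $(1-\alpha) \mathbf{f}_r$ are sent into the classifiers $\mathbf{W}_c \in \mathbb{R}^{D \times C}$ and $\mathbf{W}_r \in \mathbb{R}^{D \times C}$ respectively, and the outputs are integrated by element-wise addition. The output logits are formulated as:
$$\mathbf{z} = \alpha \mathbf{W}_c^{\top} \mathbf{f}_c + (1-\alpha) \mathbf{W}_r^{\top} \mathbf{f}_r, \tag{1}$$
where $\mathbf{z} \in \mathbb{R}^C$ is the predicted output, i.e., $[z_1, z_2, \ldots, z_C]^{\top}$. For each class $i \in \{1, 2, \ldots, C\}$, the softmax function calculates the probability of the class by:
$$\hat{p}_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}. \tag{2}$$
Then, we denote $E(\cdot, \cdot)$ as the cross-entropy loss function and the output probability distribution as $\hat{\mathbf{p}} = [\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_C]^{\top}$. Thus, the weighted cross-entropy classification loss of our BBN model is:
$$\mathcal{L} = \alpha E(\hat{\mathbf{p}}, y_c) + (1-\alpha) E(\hat{\mathbf{p}}, y_r), \tag{3}$$
and the whole network is end-to-end trainable.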
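The aggregation of the two branches and the weighted loss can be illustrated with a small, dependency-free sketch. It assumes the reading of Eqs. (1)-(3) in which the trade-off parameter `alpha` weights the conventional branch and `1 - alpha` weights the re-balancing branch. The function names, the toy feature vectors and the toy classifier matrices are hypothetical, and labels are 0-indexed here for convenience.

```python
import math

def matvec_t(W, f):
    """Compute W^T f, with W stored as D rows of C columns."""
    num_classes = len(W[0])
    return [sum(W[d][c] * f[d] for d in range(len(f))) for c in range(num_classes)]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def bbn_logits(f_c, f_r, W_c, W_r, alpha):
    """Element-wise sum of the two weighted classifier outputs."""
    z_c = matvec_t(W_c, f_c)
    z_r = matvec_t(W_r, f_r)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(z_c, z_r)]

def bbn_loss(p_hat, y_c, y_r, alpha):
    """Weighted cross-entropy against the labels of both sampled inputs."""
    return -alpha * math.log(p_hat[y_c]) - (1.0 - alpha) * math.log(p_hat[y_r])

# Toy setting (hypothetical): D = 2 feature dimensions, C = 3 classes.
f_c, f_r = [1.0, 0.0], [0.0, 1.0]
W_c = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_r = [[0.0, 0.0, 1.0], [0.0, 0.0, 3.0]]
alpha = 0.75

z = bbn_logits(f_c, f_r, W_c, W_r, alpha)
p_hat = softmax(z)
loss = bbn_loss(p_hat, y_c=0, y_r=2, alpha=alpha)
print(z, p_hat, loss)
```

Since the classifiers are linear, weighting the features before the classifier or weighting the classifier outputs afterwards gives the same logits, which is why the sketch scales the outputs.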
In this section, we elaborate the details of our unified bilateral-branch structure shown in Figure 3. As aforementioned, the proposed conventional learning branch and re-balancing branch each perform their own duty (i.e., representation learning and classifier learning, respectively). There are two unique designs for these branches.
Data samplers. The input data for the conventional learning branch comes from a uniform sampler, where each sample in the training dataset is sampled only once with equal probability in a training epoch. The uniform sampler retains the characteristics of the original distribution, and therefore benefits representation learning. In contrast, the re-balancing branch, whose input data comes from a reversed sampler, aims to alleviate the extreme imbalance and further improve the classification performance on tail classes [van2017devil]. For the reversed sampler, the sampling probability of each class is proportional to the reciprocal of its sample size, i.e., the more samples a class has, the smaller its sampling probability. Formally, let $N_i$ denote the number of samples of class $i$ and $N_{max}$ the maximum sample number over all classes. There are three sub-procedures to construct the reversed sampler: 1) Calculate the sampling probability $P_i$ for class $i$ according to the number of samples as:
$$P_i = \frac{w_i}{\sum_{j=1}^{C} w_j}, \tag{4}$$
where $w_i = N_{max} / N_i$; 2) Randomly sample a class according to $P_i$; 3) Uniformly pick a sample from that class with replacement. By repeating this reversed sampling process, the training data of a mini-batch is obtained.
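The three sub-procedures of the reversed sampler can be sketched as follows, assuming that the class sampling probability is the normalized ratio of the largest class size to each class size. The function names and the toy class sizes are hypothetical.

```python
import random

def reversed_class_probs(class_counts):
    """Step 1: class probabilities proportional to N_max / N_i."""
    n_max = max(class_counts)
    weights = [n_max / n for n in class_counts]
    total = sum(weights)
    return [w / total for w in weights]

def reversed_sample(indices_by_class, probs, rng):
    """Steps 2 and 3: draw a class from probs, then one of its samples uniformly."""
    cls = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return cls, rng.choice(indices_by_class[cls])

# Toy long-tailed dataset (hypothetical): 100, 10 and 1 samples per class.
class_counts = [100, 10, 1]
indices_by_class = [list(range(0, 100)), list(range(100, 110)), [110]]

probs = reversed_class_probs(class_counts)
rng = random.Random(0)
mini_batch = [reversed_sample(indices_by_class, probs, rng) for _ in range(8)]

print(probs)       # the tail class receives the largest probability
print(mini_batch)  # (class, sample index) pairs, drawn with replacement
```

In this toy case the class frequencies are inverted rather than merely balanced: the tail class is drawn about 100 times as often as the head class.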
Weights sharing. In BBN, both branches economically share the same residual network structure, as illustrated in Figure 3. We use ResNets [he2016deep] as our backbone network, e.g., ResNet-32 and ResNet-50. In detail, the two branch networks share the same weights except for the last residual block. There are two benefits of sharing weights: on the one hand, the well-learned representation of the conventional learning branch can benefit the learning of the re-balancing branch; on the other hand, sharing weights largely reduces the computational complexity in the inference phase.
The cumulative learning strategy is proposed to shift the learning focus between the bilateral branches by controlling the weights for the features produced by the two branches and the classification loss $\mathcal{L}$. It is designed to first learn the universal patterns and then gradually pay attention to the tail data. In the training phase, the feature $\mathbf{f}_c$ of the conventional learning branch is multiplied by $\alpha$ and the feature $\mathbf{f}_r$ of the re-balancing branch is multiplied by $1-\alpha$, where $\alpha$ is automatically generated according to the training epoch. Concretely, the total number of training epochs is denoted as $T_{max}$ and the current epoch is $T$. $\alpha$ is calculated by:
$$\alpha = 1 - \left(\frac{T}{T_{max}}\right)^2, \tag{5}$$
where $\alpha$ gradually decreases as the training epochs increase.
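The schedule of the trade-off parameter can be sketched in a few lines, assuming the parabolic form in which the parameter equals one minus the squared fraction of completed epochs. The function name, the total of 100 epochs and the convention of counting the current epoch from 0 are assumptions made for illustration.

```python
def parabolic_decay_alpha(epoch, total_epochs):
    """Trade-off weight of the conventional branch at a given epoch."""
    return 1.0 - (epoch / total_epochs) ** 2

total_epochs = 100  # hypothetical training length
for epoch in (0, 25, 50, 75, 100):
    alpha = parabolic_decay_alpha(epoch, total_epochs)
    # alpha weights the conventional branch, 1 - alpha the re-balancing branch.
    print(epoch, alpha, 1.0 - alpha)
```

Under these assumptions, the conventional branch still carries a weight of 0.75 halfway through training, so the shift towards the re-balancing branch happens mostly in the later epochs.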
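Table 1: Error rates (%) on long-tailed CIFAR-10 and long-tailed CIFAR-100 for imbalance ratios (IR) 100, 50 and 10.

| Method | CIFAR-10, IR 100 | CIFAR-10, IR 50 | CIFAR-10, IR 10 | CIFAR-100, IR 100 | CIFAR-100, IR 50 | CIFAR-100, IR 10 |
| --- | --- | --- | --- | --- | --- | --- |
| CE | 29.64 | 25.19 | 13.61 | 61.68 | 56.15 | 44.29 |
| Focal [focalloss] | 29.62 | 23.28 | 13.34 | 61.59 | 55.68 | 44.22 |
| Mixup [mixup] | 26.94 | 22.18 | 12.90 | 60.46 | 55.01 | 41.98 |
| Manifold Mixup [manifoldmixup] | 27.04 | 22.05 | 12.97 | 61.75 | 56.91 | 43.45 |
| Manifold Mixup (two samplers) | 26.90 | 20.79 | 13.17 | 63.19 | 57.95 | 43.54 |
| CE-DRW [ldam] | 23.66 | 20.03 | 12.44 | 58.49 | 54.71 | 41.88 |
| CE-DRS [ldam] | 24.39 | 20.19 | 12.62 | 58.39 | 54.52 | 41.89 |
| CB-Focal [cbfocal] | 25.43 | 20.73 | 12.90 | 60.40 | 54.83 | 42.01 |
| LDAM-DRW [ldam] | 22.97 | 18.97 | 11.84 | 57.96 | 53.38 | 41.29 |
| Our BBN | 20.18 | 17.82 | 11.68 | 57.44 | 52.98 | 40.88 |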
Intuitively, we design the adapting strategy for $\alpha$ based on the motivation that discriminative feature representations are the foundation for learning robust classifiers. Although representation learning and classifier learning deserve equal attention, the learning focus of our BBN should gradually change from feature representations to classifiers, which can exhaustively improve long-tailed recognition accuracy. As $\alpha$ decreases, the main emphasis of BBN turns from the conventional learning branch to the re-balancing branch. Different from two-stage fine-tuning strategies [ldam, cui2018large, ouyang2016factors], our $\alpha$ ensures that both branches, which serve different goals, are constantly updated throughout the whole training process, which avoids adverse effects on one goal while training for the other.
In experiments, we also provide results supporting this intuition by comparing different kinds of adaptors, cf. Section 5.5.2.
During inference, the test sample is fed into both branches and two features $\mathbf{f}'_c$ and $\mathbf{f}'_r$ are obtained. Because both branches are equally important, we simply fix $\alpha$ to $0.5$ in the test phase. Then, the equally weighted features are fed to their corresponding classifiers (i.e., $\mathbf{W}_c$ and $\mathbf{W}_r$) to obtain two prediction logits. Finally, both logits are aggregated by element-wise addition to return the classification results.
Long-tailed CIFAR-10 and CIFAR-100. Both CIFAR-10 and CIFAR-100 contain 60,000 images, 50,000 for training and 10,000 for validation, with 10 and 100 categories, respectively. For fair comparisons, we use the same long-tailed versions of the CIFAR datasets as those used in [ldam], with controllable degrees of data imbalance. We use an imbalance factor $\beta$ to describe the severity of the long-tail problem, defined by the numbers of training samples of the most frequent class and the least frequent class, e.g., $\beta = N_{max} / N_{min}$. The imbalance factors we use in experiments are 10, 50 and 100.
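To make the imbalance factor concrete, the sketch below computes it from a list of class sizes. The text does not state how the per-class sizes of the long-tailed CIFAR versions are chosen, so the exponentially decaying profile in `exponential_profile` is only an assumed construction for illustration, and both function names are hypothetical. The 5,000 training images per class follow from the 50,000 training images and 10 classes of CIFAR-10.

```python
def imbalance_factor(class_counts):
    """Ratio between the largest and the smallest class size."""
    return max(class_counts) / min(class_counts)

def exponential_profile(n_max, num_classes, beta):
    """Assumed profile: class sizes decay geometrically from n_max to n_max / beta."""
    return [
        int(n_max * (1.0 / beta) ** (i / (num_classes - 1)))
        for i in range(num_classes)
    ]

# CIFAR-10 has 5,000 training images per class; target imbalance factor 100.
counts = exponential_profile(n_max=5000, num_classes=10, beta=100)
print(counts)
print(imbalance_factor(counts))
```

Because of integer truncation, the realized imbalance factor can deviate slightly from the target value.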
iNaturalist 2017 and iNaturalist 2018. The iNaturalist species classification datasets are large-scale real-world datasets that suffer from extremely imbalanced label distributions. The 2017 version of iNaturalist contains 579,184 images in 5,089 categories, and the 2018 version is composed of 437,513 images from 8,142 categories. Note that, besides the extreme imbalance, the iNaturalist datasets also face the fine-grained problem [wei2019deep, Zhao2017]. In this paper, the official splits of training and validation images are utilized for fair comparisons.
Implementation details on CIFAR. For long-tailed CIFAR-10 and CIFAR-100 datasets, we follow the data augmentation strategies proposed in [he2016deep]: randomly crop a patch from the original image or its horizontal flip with pixels padded on each side. We train the ResNet-32 [he2016deep] as our backbone network for all experiments by standard mini-batch stochastic gradient descent (SGD) with momentum of , weight decay of . We train all the models on a single NVIDIA 1080Ti GPU for epochs with batch size of . The initial learning rate is set to and the first five epochs are trained with the linear warm-up [goyal2017accurate] learning rate schedule. The learning rate is decayed at the and epochs by for our BBN, respectively.

Implementation details on iNaturalist. For fair comparisons, we utilize ResNet-50 [he2016deep] as our backbone network in all experiments on iNaturalist 2017 and iNaturalist 2018. We follow the same training strategy as in [goyal2017accurate] with batch size of on four NVIDIA 1080Ti GPUs. We first resize the image by setting the shorter side to pixels and then take a crop from it or its horizontal flip. During training, we decay the learning rate at the and epoch by for our BBN, respectively.
In experiments, we compare our BBN model with three groups of methods:
- Baseline methods. We employ plain training with the cross-entropy loss and the focal loss [focalloss] as our baselines. Note that we also conduct experiments with a series of mixup algorithms [mixup, manifoldmixup] for comparison.
- Two-stage fine-tuning strategies. To validate the effectiveness of our cumulative learning strategy, we also compare with the two-stage fine-tuning strategy proposed in the previous state-of-the-art [ldam]. We train networks with cross-entropy (CE) on imbalanced data in the first stage, and then conduct class re-balancing training in the second stage. “CE-DRW” and “CE-DRS” refer to the two-stage baselines using re-weighting and re-sampling in the second stage, respectively.
- State-of-the-art methods. For state-of-the-art methods, we compare with the recently proposed LDAM [ldam] and CB-Focal [cbfocal], which achieve good classification accuracy on the four aforementioned long-tailed datasets.
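Table 2: Error rates (%) on iNaturalist 2018 and iNaturalist 2017.

| Method | iNaturalist 2018 | iNaturalist 2017 |
| --- | --- | --- |
| CE | 42.84 | 45.38 |
| CE-DRW [ldam] | 36.27 | 40.48 |
| CE-DRS [ldam] | 36.44 | 40.12 |
| CB-Focal [cbfocal] | 38.88 | 41.92 |
| LDAM-DRW* [ldam] | 32.00 | – |
| LDAM-DRW [ldam] | 35.42 | 39.49 |
| LDAM-DRW [ldam] () | 33.88 | 38.19 |
| Our BBN | 33.71 | 36.61 |
| Our BBN () | 30.38 | 34.25 |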
We conduct extensive experiments on long-tailed CIFAR datasets with three different imbalance ratios: 10, 50 and 100. Table 1 reports the error rates of various methods. Our BBN consistently achieves the best results across all the datasets compared with the other methods, including the two-stage fine-tuning strategies (i.e., CE-DRW/CE-DRS), the series of mixup algorithms (i.e., mixup, manifold mixup and manifold mixup with the same two samplers as ours), and the previous state-of-the-art methods (i.e., CB-Focal [cbfocal] and LDAM-DRW [ldam]).
In particular, for long-tailed CIFAR-10 with imbalance ratio 100 (an extreme imbalance case), we obtain a 20.18% error rate, which is 2.79% (absolute) lower than that of LDAM-DRW [ldam]. Additionally, that table shows that the two-stage fine-tuning strategies (i.e., CE-DRW/CE-DRS) are effective, since they obtain comparable or even better results than several state-of-the-art methods.
Table 2 shows the results on two large-scale long-tailed datasets, i.e., iNaturalist 2018 and iNaturalist 2017. As shown in that table, the two-stage fine-tuning strategies (i.e., CE-DRW/CE-DRS) also perform well, which is consistent with the observations on long-tailed CIFAR. On iNaturalist, our BBN still outperforms competing approaches and baselines. Besides, since iNaturalist is large-scale, we also conduct network training with the scheduler. Meanwhile, for fair comparisons, we further evaluate the previous state-of-the-art LDAM-DRW [ldam] with the same training scheduler. With the scheduler, our BBN achieves significantly better results than BBN without it. Additionally, compared with LDAM-DRW trained with the scheduler, we achieve improvements of 3.50% and 3.94% in error rate on iNaturalist 2018 and iNaturalist 2017, respectively. Even without the scheduler, our BBN still obtains the lowest error rates among the results obtained in our experiments. Note that we conducted the LDAM [ldam] experiments with the source code provided by its authors, but failed to reproduce the results reported in that paper.
To better understand our proposed BBN model, we conduct experiments on different samplers utilized in the re-balancing branch. We present the error rates of models trained with different samplers in Table 3. For clarity, the uniform sampler maintains the original long-tailed distribution. The balanced sampler assigns the same sampling probability to all classes, and constructs mini-batches that obey a balanced label distribution. As shown in that table, the reversed sampler (our proposal) achieves considerably better performance than the uniform and balanced samplers, which indicates that the re-balancing branch of BBN should pay more attention to the tail classes by using the reversed sampler.
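Table 3: Error rates (%) of BBN with different samplers in the re-balancing branch.

| Sampler | Error rate |
| --- | --- |
| Uniform sampler | 21.31 |
| Balanced sampler | 21.06 |
| Reversed sampler (Ours) | 17.82 |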
To facilitate the understanding of our proposed cumulative learning strategy, we explore several different strategies to generate the adaptive trade-off parameter $\alpha$ on CIFAR-10-IR50. Specifically, we test both progress-relevant and progress-irrelevant strategies, cf. Table 4. For clarity, progress-relevant strategies adjust $\alpha$ with the number of training epochs, e.g., linear decay, cosine decay, etc. Progress-irrelevant strategies include using an equal weight or sampling $\alpha$ from a distribution that does not depend on the epoch (e.g., the $\beta$-distribution).
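Table 4: Error rates (%) of BBN with different adaptor strategies on CIFAR-10-IR50.

| Adaptor | Error rate |
| --- | --- |
| Equal weight | 21.56 |
| $\beta$-distribution | 21.75 |
| Parabolic increment | 22.70 |
| Linear decay | 18.55 |
| Cosine decay | 18.04 |
| Parabolic decay (Ours) | 17.82 |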
As shown in Table 4, the decay strategies (i.e., linear decay, cosine decay and our parabolic decay) for generating $\alpha$ yield better results than the other strategies (i.e., equal weight, $\beta$-distribution and parabolic increment). These observations support our motivation that the conventional learning branch should be learned first and the re-balancing branch afterwards. Among these strategies, the best way of generating $\alpha$ is the proposed parabolic decay. In addition, the parabolic increment, where re-balancing is attended to before conventional learning, performs the worst, which validates our proposal from another perspective.
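The adaptor strategies compared above can be written as small schedule functions. Apart from the parabolic decay, the exact functional forms are not given in the text, so the linear, cosine and parabolic-increment formulas below, as well as the Beta(0.2, 0.2) parameters of the random adaptor, are assumptions chosen only to match the names of the strategies. All function names are hypothetical.

```python
import math
import random

def equal_weight(t, t_max):
    return 0.5

def linear_decay(t, t_max):
    return 1.0 - t / t_max                        # assumed form

def cosine_decay(t, t_max):
    return math.cos(0.5 * math.pi * t / t_max)    # assumed form

def parabolic_decay(t, t_max):
    return 1.0 - (t / t_max) ** 2

def parabolic_increment(t, t_max):
    return (t / t_max) ** 2                       # assumed form

def beta_sampled(rng, a=0.2, b=0.2):
    # Progress-irrelevant: drawn afresh, independent of the epoch (assumed parameters).
    return rng.betavariate(a, b)

t_max = 100  # hypothetical training length
rng = random.Random(0)
for t in (0, 50, 100):
    print(
        t,
        equal_weight(t, t_max),
        linear_decay(t, t_max),
        round(cosine_decay(t, t_max), 4),
        parabolic_decay(t, t_max),
        parabolic_increment(t, t_max),
        round(beta_sampled(rng), 4),
    )
```

Under these assumed forms, the parabolic decay keeps the weight of the conventional learning branch at or above that of the linear decay at every epoch, i.e., it delays the shift towards the re-balancing branch.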
It is shown in Section 3 that learning with vanilla CE on the original data distribution obtains good feature representations. In this subsection, we further explore the representation quality of our proposed BBN by following the empirical settings in Section 3. Concretely, given a BBN model trained on CIFAR-100-IR50, we first fix the representation learning parameters of the two branches. Then, we separately retrain the corresponding classifiers of the two branches from scratch, also on CIFAR-100-IR50. Finally, the classification error rates of the two branches are tested independently.
As shown in Table 5, the feature representations obtained by the conventional learning branch of BBN (“BBN-CB”) achieve performance comparable to CE, which indicates that our proposed BBN largely preserves the representation capacity learned from the original long-tailed dataset. Note that the re-balancing branch of BBN (“BBN-RB”) also performs better than RW/RS, which is possibly owing to the parameter sharing design of our model.
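Table 5: Error rates (%) on CIFAR-100-IR50 for different representation learning manners.

| Representation learning manner | Error rate |
| --- | --- |
| CE | 58.62 |
| RW | 63.17 |
| RS | 63.71 |
| BBN-CB | 58.89 |
| BBN-RB | 61.09 |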
Denote by $W = \{w_1, w_2, \ldots, w_C\}$ the set of classifiers for all $C$ classes, where $w_i$ indicates the weight vector for class $i$. Previous work [guo2017one] has shown that the value of the $\ell_2$-norm $\|w_i\|_2$ for different classes can demonstrate the preference of a classifier, i.e., the classifier $w_i$ with the largest $\ell_2$-norm tends to judge an example as belonging to its class $i$. Following [guo2017one], we visualize the $\ell_2$-norm of these classifiers.
As shown in Figure 4, we visualize the $\ell_2$-norm of the classifiers of the ten classes trained on CIFAR-10-IR50. For our BBN, we visualize the classifier weights of the conventional learning branch (“BBN-CB”), the classifier weights of the rebalancing branch (“BBN-RB”), and their combined classifier weights (“BBN-ALL”). Additionally, visualization results for classifiers trained with the learning manners of Section 3, i.e., CE, RW and RS, are also provided.
The $\ell_2$-norms of the ten classes’ classifiers for our proposed model (i.e., “BBN-ALL”) are basically equal, and their standard deviation is the smallest. For the classifiers trained by the other learning manners, the distribution of the $\ell_2$-norm of CE is consistent with the long-tailed distribution. The $\ell_2$-norm distribution of RW/RS looks somewhat flat, but their standard deviations are larger than ours. This offers an explanation of why our BBN can outperform these methods. Additionally, analyzing the two branches of our model separately, the conventional learning branch (“BBN-CB”) has an $\ell_2$-norm distribution similar to that of CE, which confirms that its duty is universal feature learning. The $\ell_2$-norm distribution of the rebalancing branch (“BBN-RB”) is reversed w.r.t. the original long-tailed distribution, which reveals that it is able to model the tail.
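The quantities compared above can be computed with a few lines of code. The sketch below is only an illustration: the helper names `l2_norms` and `norm_spread` are hypothetical, the classifier is assumed to be stored as one weight row per class, and the population standard deviation is used, since the text does not say which variant was reported.

```python
import math
import statistics


def l2_norms(weight_rows):
    """Return the l2-norm of each per-class weight vector."""
    return [math.sqrt(sum(w * w for w in row)) for row in weight_rows]


def norm_spread(weight_rows):
    """Standard deviation of the per-class norms (population variant, assumed)."""
    return statistics.pstdev(l2_norms(weight_rows))


if __name__ == "__main__":
    # Toy classifier with two classes: a "head" class with a large norm
    # and a "tail" class with a small norm.
    toy_weights = [[3.0, 4.0], [0.0, 1.0]]
    print(l2_norms(toy_weights))     # per-class norms
    print(norm_spread(toy_weights))  # larger value = less balanced classifier
```

A smaller spread means the classifier has no strong preference among classes, which is the property attributed to “BBN-ALL” in the discussion of Figure 4.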
In this paper, to study long-tailed problems, we explored how class rebalancing strategies influence the representation learning and classifier learning of deep networks, and revealed that they can promote classifier learning significantly but also damage representation learning to some extent. Motivated by this, we proposed a Bilateral-Branch Network (BBN) with a specific cumulative learning strategy that takes care of both representation learning and classifier learning to improve the recognition performance on long-tailed tasks. Through extensive experiments, we showed that our BBN achieves the best results on long-tailed benchmarks, including the large-scale iNaturalist datasets. In the future, we will attempt to tackle long-tailed detection problems with our BBN model.
In the supplementary materials, we provide more experimental results and analyses of our proposed BBN model, including:
Additional experiments on different manners for representation and classifier learning (cf. Section 3 and Figure 2 of the paper) on the large-scale datasets iNaturalist 2017 and iNaturalist 2018;
Effects of rebalancing strategies on the compactness of learned features;
Comparisons between the BBN model and ensemble methods;
Coordinate graph of the different adaptor strategies for generating $\alpha$;
Learning algorithm of our proposed BBN model.
In this section, following Section 3 of our paper, we conduct experiments on large-scale datasets, i.e., iNaturalist 2017 [van2018inaturalist] and iNaturalist 2018, to further justify our conjecture, i.e., that the working mechanism of these class rebalancing strategies is to promote classifier learning significantly, although they might damage the universal representative ability of the learned deep features by distorting the original distributions. Specifically, the representation learning stages are conducted on iNaturalist 2017. Then, to also evaluate the generalization ability of the learned representations, the classifier learning stages are performed not only on iNaturalist 2017 but also on iNaturalist 2018.
As shown in Figure 5, the same observations can be made from two perspectives on these large-scale long-tailed datasets:
Classifiers: When we apply the same representation learning manner (comparing the error rates of the three blocks in the vertical direction), RW/RS always achieve lower classification error rates than CE, which owes to their rebalancing operations adjusting the classifier weight updates to match the test distributions.
Representations: When we apply the same classifier learning manner (comparing the error rates of the three blocks in the horizontal direction), it is somewhat surprising that the error rates of the CE blocks are consistently lower than those of the RW/RS blocks. These findings indicate that training with CE achieves better classification results because it obtains better features. The worse results of RW/RS reveal that they lead to an inferior discriminative ability of the learned deep features.
These observations are consistent with those on the long-tailed CIFAR datasets, which further supports our discovery in Section 3 of the paper.
To further support our conjecture that rebalancing strategies could damage the universal representations, we measure the compactness of intra-class representations on CIFAR-10-IR50 [cifar] for verification.
Concretely, for each class, we first calculate a centroid vector by averaging the representations of this class. Then, the distances between these representations and their centroid are computed and averaged as a measurement of the compactness of intra-class representations. If the averaged distance of a class is small, the representations of this class gather closely in the representation space. We normalize the $\ell_2$-norm of representations to 1 in the training stage to avoid the impact of feature scales. We report results based on representations learned with cross-entropy (CE), reweighting (RW) and resampling (RS), respectively.
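The measurement described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `intra_class_compactness` is hypothetical, Euclidean distance is assumed because the text does not name the metric, and the features are normalized to unit $\ell_2$-norm at measurement time here, whereas the text applies the normalization during training.

```python
import math
from collections import defaultdict


def l2_normalize(vector):
    """Scale a feature vector to unit l2-norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def intra_class_compactness(features, labels):
    """Map each class to the mean distance between its features and their centroid.

    Euclidean distance is an assumption; a smaller value means a more
    compact class in the representation space.
    """
    by_class = defaultdict(list)
    for feature, label in zip(features, labels):
        by_class[label].append(l2_normalize(feature))

    compactness = {}
    for label, feats in by_class.items():
        dim = len(feats[0])
        # Centroid: element-wise average of the class representations.
        centroid = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        distances = [math.dist(f, centroid) for f in feats]
        compactness[label] = sum(distances) / len(distances)
    return compactness


if __name__ == "__main__":
    toy_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
    toy_labels = [0, 0, 1, 1]
    print(intra_class_compactness(toy_features, toy_labels))
```

In the toy example, class 0 is perfectly compact (both features coincide), while class 1 is spread out, so its averaged distance is larger.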
As shown in Figure 6, the averaged distances of the rebalancing strategies are clearly larger than those of conventional training, especially for the head classes. That is to say, the compactness of the features learned with rebalancing strategies is significantly worse than with conventional training. These observations further validate the statements in Figure 1 of the paper (i.e., for rebalancing strategies, “the intra-class distribution of each class becomes more separable”) and also the discovery of Section 3 in the paper (i.e., rebalancing “might damage the universal representative ability of the learned deep features to some extent”).
In the following, we compare our BBN model with ensemble methods to demonstrate the effectiveness of our proposed model. Results on CIFAR-10-IR50 [cifar], CIFAR-100-IR50 [cifar], iNaturalist 2017 [van2018inaturalist] and iNaturalist 2018 are provided in Table 6 for comprehensiveness.
Ensemble techniques are frequently used to boost performance in machine learning tasks. We train three classification models with a uniform data sampler, a balanced data sampler and a reversed data sampler, respectively. To mimic our bi-branch network design and ensure fair comparisons, we provide classification error rates of (1) an ensemble of the models learned with the uniform sampler and the balanced sampler, as well as (2) another ensemble of the models learned with the uniform sampler and the reversed sampler.
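For readers unfamiliar with two-model ensembles, the sketch below shows one common way to combine two classifiers. It is not taken from the paper: the text does not state how the ensemble predictions are combined, so averaging the softmax probabilities is an assumption, and the function name `ensemble_predict` is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(logits_a, logits_b):
    """Average the class probabilities of two models (assumed combination rule).

    Returns the predicted class index and the averaged probabilities.
    """
    probs_a = softmax(logits_a)
    probs_b = softmax(logits_b)
    averaged = [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]
    predicted = max(range(len(averaged)), key=averaged.__getitem__)
    return predicted, averaged


if __name__ == "__main__":
    # Model A (e.g., trained with a uniform sampler) prefers class 0,
    # model B (e.g., trained with a reversed sampler) prefers class 2.
    prediction, probabilities = ensemble_predict([2.0, 0.0, 0.0], [0.0, 0.0, 3.0])
    print(prediction, probabilities)
```

Such an ensemble keeps two full sets of network parameters, one per model, which is the cost that the weight-sharing design of BBN is meant to reduce.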
As shown in Table 6, our BBN model achieves consistently lower error rates than the ensemble models on all datasets. Additionally, compared to the ensemble models, our proposed BBN model yields better performance with only a limited increase in network parameters, thanks to its weight-sharing design (cf. Ln. 486-496 in the paper).
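| Methods | CIFAR-10-IR50 | CIFAR-100-IR50 | iNaturalist 2017 | iNaturalist 2018 |
|---|---|---|---|---|
| Uniform sampler + Balanced sampler | 19.41 | 55.10 | 39.53 | 36.20 |
| Uniform sampler + Reversed sampler | 19.38 | 54.93 | 40.02 | 36.66 |
| BBN (Ours) | 17.82 | 52.98 | 36.61 | 33.74 |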
As shown in Figure 7, we provide a coordinate graph to present how $\alpha$ varies as network training progresses. The adaptor strategies shown in the figure are the same as those in Table 4 of the paper, except for the β-distribution, which is omitted because of its randomness.
In the following, we provide the detailed learning algorithm of our proposed BBN. In Algorithm 1, for each training epoch $T$, we first assign a value to $\alpha$ by the adaptor proposed in Eq. (5) of the paper. Then, we sample two sets of training samples by the uniform sampler and the reversed sampler, respectively. Feeding these samples into our network, we obtain two independent feature vectors $f_c$ and $f_r$. Subsequently, we calculate the output logits $z$ and the predicted probability $\hat{p}$ according to Eq. (1) and Eq. (2) in the paper. Finally, the classification loss is calculated based on Eq. (3) in the paper, and we update the model parameters by optimizing this loss function.
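The loss computation of one such training step can be sketched as below for a single pair of samples. Eq. (1) to Eq. (3) are not reproduced in this section, so the forms used here are assumptions: the logits of the two branches are mixed with weights $\alpha$ and $1-\alpha$, the probability is a softmax over the mixed logits, and the loss is the $\alpha$-weighted sum of the cross-entropies w.r.t. the two labels. All names (`bbn_loss`, `w_c`, `w_r`, `y_c`, `y_r`) are hypothetical, and the parameter update is left out.

```python
import math


def matvec(weight_rows, feature):
    """Per-class logits: dot product of each class weight row with the feature."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight_rows]


def bbn_loss(f_c, f_r, w_c, w_r, y_c, y_r, alpha):
    """Loss for one sample pair under the assumed mixing forms.

    f_c, f_r : features from the conventional and rebalancing branches
    w_c, w_r : classifier weights of the two branches (one row per class)
    y_c, y_r : labels of the uniformly and reversely sampled examples
    alpha    : adaptor value in [0, 1] for the current epoch
    """
    z_c = matvec(w_c, f_c)
    z_r = matvec(w_r, f_r)
    # Mixed logits (assumed form of Eq. (1)).
    z = [alpha * a + (1.0 - alpha) * b for a, b in zip(z_c, z_r)]
    # Log-softmax over the mixed logits (assumed form of Eq. (2)).
    peak = max(z)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in z))
    log_p = [v - log_total for v in z]
    # Weighted cross-entropy over both labels (assumed form of Eq. (3)).
    return -(alpha * log_p[y_c] + (1.0 - alpha) * log_p[y_r])


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    for alpha in (1.0, 0.5, 0.0):
        loss = bbn_loss([2.0, 0.0], [0.0, 2.0], identity, identity, 0, 1, alpha)
        print(alpha, round(loss, 4))
```

With $\alpha = 1$ the loss depends only on the conventional learning branch, and with $\alpha = 0$ only on the rebalancing branch, which matches the intended shift of attention over the course of training.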