For many real-world visual datasets, visual concepts often occur with a long-tailed distribution; that is, some "head" classes have abundant examples, while only a few samples are available for "tail" classes [liu2019large, khan2019striking, cui2019class]. Such an imbalanced data distribution poses a great challenge for training a deep neural network, since under standard stochastic gradient descent (SGD) [bottou2010large] the network tends to ignore the tail classes: the chance of sampling from a given tail class can be much lower than that of sampling from head classes.
A straightforward solution to this issue seems to be increasing the chance of sampling from tail classes by balancing the sampling ratio per class; for example, one can force the data within a mini-batch to be evenly sampled from each class. However, this naive solution has the side-effect of making head classes under-represented and increases the risk of over-fitting the deep neural network. There is therefore a dilemma in deciding whether or not to balance the training samples from each class. Existing approaches tackle this dilemma by carefully designing re-sampling strategies [haixiang2017learning, chawla2002smote, tahir2009multiple], class-dependent cost functions [huang2016learning, mahajan2018exploring], or feature regularisation approaches [zhang2017range, liu2019large]. In this work, we propose an embarrassingly simple approach without any bells and whistles, yet, surprisingly, this simple approach achieves state-of-the-art performance.
Our solution is based on the fact that a deep neural network can be decomposed into a feature extractor part and a classifier part, and these two parts can be trained with different strategies. Specifically, to prevent the training process from being dominated by head classes, we train the whole network with a class-balanced sampling (CBS) strategy. To avoid the risk of over-fitting, we also create an auxiliary training task for the feature extractor part: learning a classifier under the regular random sampling (RRS) scheme. In this way, the feature extractor, which contains most of the parameters of a deep network, is trained with both sampling strategies and can therefore take full advantage of all the training data. In addition to using different sampling strategies to create the auxiliary task, we also explore using self-supervised learning as an additional auxiliary task to further enhance representation learning, and we show that this leads to promising results. In summary, the main contributions of our method are as follows:
We propose a simple-yet-effective learning approach to address the dilemma of balancing the head and tail classes for long-tailed visual recognition.
We propose to utilise self-supervised learning as an additional auxiliary task to improve the generalisation of image features. To the best of our knowledge, this is the first work that applies self-supervised learning to long-tailed visual recognition.
We conduct comprehensive experiments on two long-tailed datasets to evaluate the effectiveness of the proposed approach. The experimental results demonstrate that our method outperforms state-of-the-art solutions.
2 Related Work
2.1 Imbalanced Classification
We roughly group the related work on imbalanced image classification into four categories: re-sampling of training data, learning discriminative features, cost-sensitive learning and transfer learning.
Re-sampling alleviates the negative effect of a skewed distribution by artificially balancing the samples among classes. In general, two types of strategies are commonly used, namely over-sampling and under-sampling [haixiang2017learning]. Over-sampling augments the minority classes, either by duplicating samples or by generating synthetic data via interpolation [chawla2002smote]. The under-sampling scheme, on the other hand, achieves balance by discarding part of the majority classes [tahir2009multiple]. In [wang2019dynamic], a curriculum learning approach is proposed to dynamically adjust the sampling from imbalanced to balanced with hard example mining.
Learning Discriminative Features. Instead of reshaping the data distribution, some methods tackle imbalanced image classification by learning discriminative features. Metric learning approaches accompanied by hard mining, including the pair-wise contrastive loss [sun2014deep], triplet loss [schroff2015facenet] and quintuplet loss [huang2016learning], can be used to explore the sample relationships within an input batch. Range loss [zhang2017range] and centre loss [wen2016discriminative] learn a discriminative feature space by constraining class prototypes. In [hayatgaussian], the authors propose a max-margin loss that considers both classification and clustering performance. Most recently, Liu et al. [liu2019large] propose to utilise a memory network to train class prototypes and further enhance image features; their method achieves state-of-the-art performance on long-tailed open-world recognition.
Cost-Sensitive & Transfer Learning. Cost-sensitive learning addresses the imbalanced distribution by adjusting the misclassification cost of each class with respect to its sample frequency. Typically, the inverse of the class frequency [huang2016learning] or a smoothed version of it [mahajan2018exploring] is used to re-weight the loss function. In [khan2019striking], the authors propose to eliminate the decision-boundary bias by incorporating Bayesian uncertainty estimation, while in [cui2019class], the effective number of samples is calculated to construct a balanced loss. Another line of work focuses on transferring knowledge [ouyang2016factors, DBLP:conf/cvpr/GidarisK18] from head to tail classes through different learning algorithms, such as meta-learning [wang2017learning] and unequal training [zhong2019unequal].
2.2 Auxiliary & Self-Supervised Learning
Auxiliary learning is designed to assist a primary task by simultaneously optimising relevant auxiliary tasks. It has been adopted to benefit various tasks, such as speech recognition [liebel2018auxiliary], image classification [liu2019self] and depth estimation [mahjourian2018unsupervised]. Our work can also be regarded as an auxiliary learning approach, where an auxiliary classification task is introduced to alleviate over-fitting.
Self-supervised learning is a type of unsupervised feature learning that defines proxy tasks to inject self-supervision signals into representation learning. The image itself contains abundant structural information to be explored [kolesnikov2019revisiting], such as predicting low-level visual cues or the relative spatial locations of patches. In [gidaris2019boosting, su2019does], self-supervised learning is employed to assist few-shot image classification. Our work shares a similar motivation: self-supervised learning can improve the generalisation of image features.
3 Method
The extremely imbalanced data distribution poses a challenge for training a deep network, which boils down to a dilemma in balancing the training of head and tail classes. In this section, we first describe the dilemma that motivates our method; then we present our simple-yet-effective solution, followed by a further extension.
3.1 A Dilemma of Balancing Head & Tail Training
Formally, let $\mathcal{X}=\{x_i\}_{i=1}^{N}$ be the set of images and $\mathcal{Y}=\{y_i\}_{i=1}^{N}$ be the corresponding label set, where $y_i \in \{1, \dots, C\}$. For a multi-class classification problem, the standard training objective takes the following form:

$$\min_{f} \; \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i), y_i\big), \qquad (1)$$

where $\ell$ is the loss function and $f$ is the to-be-learned classifier. In a long-tailed training set, the number of images per class $n_c$ varies from abundant (more than thousands) to rare (a few shots). In this case, the total loss is dominated by losses from classes with many samples. In the context of deep learning, Eq. 1 is usually optimised with stochastic gradient descent (SGD). During each training iteration, a batch of samples randomly drawn from the whole training set is fed into the neural network. The chance of sampling a tail-class sample can be meagre because of its low proportion in the training set. Consequently, the loss incurred by tail-class samples is usually ignored during training. The blue curve in Fig. 2 shows the training loss of one tail class under regular random sampling; over the course of training, we cannot see a significant decrease in its value.
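The severity of this effect can be made concrete with a back-of-the-envelope calculation (a sketch; the batch size of 128 is an assumed illustrative value, not taken from the paper):

```python
def expected_tail_samples(batch_size, class_count, total_count):
    # Under regular random sampling, each batch slot is an independent
    # uniform draw, so a class with class_count images contributes on
    # average batch_size * class_count / total_count samples per batch.
    return batch_size * class_count / total_count

# A 5-image tail class in the 115,846-image ImageNet-LT training set
# appears in a 128-sample batch far less than once on average.
v = expected_tail_samples(128, 5, 115846)
```

For such a class, fewer than 0.006 samples are expected per mini-batch, i.e., the class is absent from the vast majority of training iterations.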
A straightforward solution to this issue is to adjust the chance of sampling a tail-class sample. Inspired by [shen2016relay], we can adopt a class-balanced sampling (CBS) strategy: each mini-batch consists of data from $k$ classes, with those classes randomly drawn from the full list of classes; for each sampled class, we then randomly sample the same number ($m$) of images. The sampling details are shown in Algorithm 1. With this strategy, the classifier can focus more on tail classes. The red curve in Fig. 2 shows the impact of this strategy on the tail class: the training loss for the same tail class now decreases significantly during training.
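A class-balanced sampler of this kind could be sketched as follows (an assumed minimal implementation for illustration, not necessarily the paper's Algorithm 1; the function name and the hyper-parameters k and m are illustrative):

```python
import random
from collections import defaultdict

def class_balanced_batch(labels, k, m, rng=random):
    """Draw one mini-batch of k classes x m samples each.

    labels: per-image class labels for the training set.
    Returns image indices; classes are drawn uniformly, so tail classes
    are as likely to appear as head classes.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = rng.sample(list(by_class), k)  # uniformly pick k classes
    batch = []
    for c in classes:
        pool = by_class[c]
        # sample with replacement when a tail class has fewer than m images
        picks = rng.choices(pool, k=m) if len(pool) < m else rng.sample(pool, m)
        batch.extend(picks)
    return batch
```

Because classes are drawn uniformly rather than proportionally to their frequency, a 5-image tail class and a 1,000-image head class contribute equally often, which is exactly what makes the tail-class loss curve in Fig. 2 decrease.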
However, the class-balanced sampling strategy comes at the cost of ill-fitting. To train a deep neural network, sufficient (and diverse) samples are necessary to guarantee good generalisation. Under class-balanced sampling, the impact of tail classes becomes more prominent: compared to regular random sampling, tail classes have a stronger influence on learning all the parameters of the network. This is risky, since the number of samples in tail classes is too small to train so many parameters. Moreover, by over-sampling tail classes, CBS in effect under-samples head classes, which makes them under-represented and further increases the risk of over-fitting the network. The red curve in Fig. 3 shows the impact of the class-balanced strategy on one head class: CBS leads to a larger training loss than regular random sampling.
In summary, extremely imbalanced data distribution leads to a dilemma in training a network: increasing the influence of tail classes during training, e.g. through class-balanced sampling, is necessary since otherwise the tail classes will be ignored. However, on the other hand, it increases the risk of over-fitting a deep neural network.
3.2 An Embarrassingly Simple Solution
Existing methods tackle this dilemma by carefully choosing the sampling ratio per class [haixiang2017learning, chawla2002smote, tahir2009multiple], designing class-dependent cost functions [huang2016learning, mahajan2018exploring], or devising feature regularisation schemes [zhang2017range, liu2019large].
In this paper, we propose a much simpler approach to solve this dilemma. Our solution is based on the fact that a deep neural network can be decomposed into a feature extractor and a classifier. This decomposition allows us to adopt different training strategies for the two parts: the classifier is trained with the loss introduced by the CBS scheme, while the feature extractor is trained with the losses introduced by both the CBS scheme and a regular random sampling (RRS) scheme.
Specifically, our method is realised by constructing an auxiliary classifier in addition to the original (primary) one. Both classifiers are attached to the same feature extractor and trained jointly, as shown in Figure 1. The key difference between them is that the auxiliary classifier is trained with regular random sampling rather than class-balanced sampling. In this design, the primary classifier is affected only by CBS training, while the feature extractor learns from the losses of both the CBS and RRS schemes. Therefore, the head-class information compromised by CBS can be recovered through the gradients back-propagated from the auxiliary classifier. In this sense, the feature extractor can take full advantage of the dataset, and the over-fitting issue can be alleviated. Note that the feature extractor contains the majority of the model parameters, while the classifiers involve far fewer. Thus, if the feature extractor does not over-fit the training data, the entire model is unlikely to over-fit.
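The two-branch design can be sketched with a toy NumPy example (a stand-in for intuition only: the real feature extractor is a deep ResNet, not a linear map, and all weight names here are illustrative assumptions):

```python
import numpy as np

def softmax_xent(logits, y):
    """Mean cross-entropy of softmax predictions against integer labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(16, 8))   # shared feature extractor (toy linear map)
W_cbs = rng.normal(size=(8, 4))    # primary classifier, fed class-balanced batches
W_rrs = rng.normal(size=(8, 4))    # auxiliary classifier, fed regular batches

def joint_loss(x_cbs, y_cbs, x_rrs, y_rrs, alpha=1.0, beta=1.0):
    # Both batches pass through the SAME extractor W_phi, so gradients from
    # the RRS branch also shape the shared features and recover the
    # head-class information that CBS under-samples.
    feat_cbs = x_cbs @ W_phi
    feat_rrs = x_rrs @ W_phi
    return alpha * softmax_xent(feat_cbs @ W_cbs, y_cbs) \
         + beta * softmax_xent(feat_rrs @ W_rrs, y_rrs)
```

Only the primary head (here `W_cbs`) would be kept at test time; the auxiliary head exists solely to route full-dataset gradients into the shared extractor during training.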
Because of the auxiliary training task, head classes tend to have a stronger influence on the training of the feature extractor. One may wonder whether this brings the same side-effects as training the entire classification model with regular random sampling. We argue that the side-effects should not be severe, since the feature extractor is more robust to mismatch between the training target and the deployment target; e.g., a feature extractor trained on one set of classes can often be reused for classifying other sets of classes.
After training, the auxiliary classifier is discarded and only the primary classifier is used for classification. Therefore, our method uses no extra parameters at test time.
3.3 Extension: Exploring A Better Feature Representation
The key idea of our approach is to use an auxiliary training task to learn a robust feature representation, which overcomes the side-effect of CBS. In the discussion above, the auxiliary classifier is trained with the standard loss function and the regular random sampling strategy. In this section, we propose to use self-supervised learning to further enhance the feature representation. Self-supervised learning was initially proposed for learning feature representations from large-scale unlabelled datasets. Since class information is unavailable, self-supervised learning usually relies on a pretext task as the surrogate training objective; optimising towards a carefully chosen pretext objective can result in a representation that benefits downstream tasks.
At first glance, it seems unnecessary to use self-supervised learning in a fully-supervised setting, since we already have ground-truth class labels. However, this view may not be accurate, for the following reasons: (1) The tail classes alone do not have sufficient training samples to properly train the deep network, and in our approach the feature extractor is (mainly) trained on samples from the head classes. In other words, our method is essentially built on the assumption that features (mainly) trained for head classes can also be useful for tail classes. (2) The primary goal of the traditional supervised loss is to encourage a feature representation that supports a good separation of samples from different classes. Although we empirically observe that features trained for separating samples from a class set A can generalise to a class set B, it is unclear whether supervised training towards A is optimal for achieving good performance on B or on A+B. (3) Self-supervised training is another feature learning method known to produce feature representations with cross-class generalisation capability, and it does not rely on the definition of visual classes. We postulate that it may be complementary to supervised training in representation learning, and we expect that using both training strategies will lead to better cross-class generalisation.
Motivated by the above considerations, we attach a third classifier to the shared feature extractor and employ a self-supervised training task as an additional auxiliary task. Specifically, we use the self-supervised method proposed in [gidaris2018unsupervised], where each input image is randomly rotated by one of four angles and the pretext task is to predict the rotation angle given the rotated image.
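Constructing this pretext task amounts to rotating each image by a random multiple of 90 degrees and using the rotation index as the label; a minimal sketch (function and variable names are illustrative, and a 2-D array stands in for an image):

```python
import numpy as np

def rotation_pretext(images, rng):
    """Build the rotation pretext task in the style of [gidaris2018unsupervised]:
    rotate each image by a random multiple of 90 degrees; the rotation index
    (0, 1, 2, 3 for 0/90/180/270 degrees) becomes the self-supervised label."""
    labels = rng.integers(0, 4, size=len(images))
    rotated = [np.rot90(img, k=int(k)) for img, k in zip(images, labels)]
    return rotated, labels
```

The self-supervised head is then trained with the usual cross-entropy loss to predict these 4-way rotation labels, with no human annotation involved.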
3.4 Training & Prediction
Since we address the dilemma of balancing head and tail classes with an auxiliary learning framework, we describe the full training and prediction procedure as follows: ResNet models are employed as the feature extractor, which is shared among tasks, while a fully-connected layer is used as the classifier of each task (the primary CBS task, the RRS auxiliary task and the self-supervised auxiliary task). The standard unweighted cross-entropy loss is used in each task to measure the difference between predictions and targets.
In each forward pass, two mini-batches are drawn by the class-balanced and the regular sampler respectively and then concatenated into one batch. The primary classification task takes the class-balanced mini-batch as input to make predictions, while the regularly sampled one is sent to the RRS auxiliary task. For the self-supervised auxiliary task, rotations are applied to the whole concatenated batch.
The final loss of the deep network is computed as the weighted sum of the per-task losses:

$$L = \alpha L_{cbs} + \beta L_{rrs} + \gamma L_{ss}, \qquad (2)$$

where $L_{cbs}$, $L_{rrs}$ and $L_{ss}$ are the losses of the primary CBS task, the RRS auxiliary task and the self-supervised auxiliary task, and $\alpha$, $\beta$ and $\gamma$ are the corresponding weights.
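In code, the loss combination is a one-liner (a sketch; the 0.5 : 1 : 1 default mirrors the weight ratio reported in the experiments, and the function name is illustrative):

```python
def total_loss(loss_cbs, loss_rrs, loss_ss, weights=(0.5, 1.0, 1.0)):
    """Weighted sum of the primary (CBS), RRS-auxiliary and self-supervised
    task losses; 0.5:1:1 is the ratio used in the paper's extended model."""
    a, b, g = weights
    return a * loss_cbs + b * loss_rrs + g * loss_ss
```

Note that the RRS weight exceeds the CBS weight, reflecting the finding (Sec. 4.5) that RRS should play the major role in training the feature representation.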
At test time, only the feature extractor and the classifier of the primary task are retained for prediction. Therefore, compared to a standard classification network, no extra parameters are used in the test model.
4 Experiments
4.1 Datasets
We present experimental results in this section and investigate the effectiveness of the proposed model through extensive ablation studies. The proposed model is evaluated on the long-tailed versions (-LT) of two benchmark datasets: ImageNet-LT and Places-LT [liu2019large]. Both LT datasets are constructed by sampling from the original datasets (ImageNet-2012 [deng2009imagenet] and Places-2 [zhou2016places]) under a Pareto distribution with the power value $\alpha$=6. ImageNet-LT contains 185,846 images from 1,000 classes, among which 115,846/20,000/50,000 images are used for training/validation/test; the number of images per class ranges from a minimum of 5 to a maximum of 1,280. Places-LT has 106,300 images from 365 categories, with training, validation and test splits of 62,500/7,300/36,500 images; its imbalance is more severe than ImageNet-LT's, with class sizes ranging from 5 to 4,980. For both datasets, the test sets are balanced.
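For intuition, a class-size profile of this long-tailed shape can be sketched as follows (a hypothetical power-law construction for illustration only; the exact sampling procedure of [liu2019large] may differ):

```python
import numpy as np

def long_tailed_counts(num_classes, n_max, n_min, alpha=6.0):
    """Sketch of a Pareto-style long-tailed class-size profile: sizes decay
    as a power law from n_max (head) down to n_min (tail). The decay
    schedule below is an assumption, not the paper's construction."""
    ranks = np.arange(num_classes, dtype=float)
    decay = (1.0 + ranks) ** (-1.0 / alpha)   # hypothetical power-law decay
    counts = n_min + (n_max - n_min) * (decay - decay[-1]) / (decay[0] - decay[-1])
    return np.round(counts).astype(int)
```

With `num_classes=1000`, `n_max=1280` and `n_min=5` (the ImageNet-LT statistics quoted above), this yields a monotonically decaying profile from 1,280 images for the largest class down to 5 for the smallest.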
Top-1 accuracy (%):

|Method|Many-shot|Medium-shot|Low-shot|Overall|
|---|---|---|---|---|
|Lifted Loss [DBLP:conf/cvpr/SongXJS16]|35.8|30.4|17.9|30.8|
|Focal Loss [DBLP:conf/iccv/LinGGHD17]|36.4|29.9|16.0|30.5|
|Range Loss [DBLP:conf/iccv/ZhangFWLQ17]|35.8|30.3|17.6|30.7|
4.2 Evaluation Metrics
Following the evaluation protocol of [liu2019large], both overall top-1 classification accuracy and shot-wise accuracy are computed. The overall accuracy is computed over all classes, while for shot-wise accuracy the test set is split into three subsets: many-shot (classes with more than 100 training images), medium-shot (20-100 training images) and low-shot (fewer than 20 training images). The shot-wise accuracy monitors the behaviour of the proposed model on different portions of the imbalanced distribution.
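This protocol is straightforward to compute; a sketch (function name illustrative; the 100/20 thresholds follow the protocol of [liu2019large]):

```python
def shot_wise_accuracy(train_counts, y_true, y_pred):
    """Overall and per-split top-1 accuracy. train_counts maps each class to
    its number of TRAINING images; splits are many-shot (>100), medium-shot
    (20-100) and low-shot (<20), evaluated on the balanced test set."""
    def split_of(c):
        n = train_counts[c]
        return "many" if n > 100 else ("low" if n < 20 else "medium")
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        for key in ("overall", split_of(t)):
            totals[key] = totals.get(key, 0) + 1
            hits[key] = hits.get(key, 0) + (t == p)
    return {k: hits[k] / totals[k] for k in totals}
```

Note that the split of a test sample is decided by its ground-truth class's training frequency, so the same predictions yield one overall number and up to three split-wise numbers.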
4.3 Compared Methods
We compare our models with state-of-the-art methods, as well as two straightforward baselines:
RRS-Only: This baseline is equivalent to regular supervised training for image classification, with batches drawn by regular random sampling as inputs.
CBS-Only: This baseline applies CBS to train the whole network, without any auxiliary task.
Ours: CBS+RRS† († indicates an auxiliary task) is the basic version of our method, which uses the regular-random-sampling branch to construct the auxiliary learning task. CBS+RRS†+SS† is the extended method, with both regular random sampling and self-supervised learning (SS) as auxiliary tasks.
SOTA: Various state-of-the-art methods from different categories are used for comparison. In particular, we compare against methods based on metric learning, including lifted structure loss [DBLP:conf/cvpr/SongXJS16], triplet loss [DBLP:journals/corr/HofferA14] and proxy static loss [movshovitz2017no]; methods based on hard example mining and feature regularisation, including focal loss [DBLP:conf/iccv/LinGGHD17], range loss [DBLP:conf/iccv/ZhangFWLQ17], few-shot learning without forgetting (FSLwF) [DBLP:conf/cvpr/GidarisK18] and memory enhancement (MemoryNet) [liu2019large]; as well as cost-sensitive loss [haixiang2017learning]. For fair comparison, we adopt the same feature extractor as [liu2019large] in all compared models, i.e., ResNet backbones followed by average pooling and a 512-dim FC layer. More details can be found in the supplementary materials.
4.4 Experimental Results
Three groups of experiments are conducted. To make a fair comparison with the results reported in [liu2019large], a ResNet-10 trained from scratch and a pre-trained ResNet-152 with frozen convolutional features are used for ImageNet-LT and Places-LT respectively. We also train a ResNet-50 from scratch on Places-LT. We use the same batch size for both samplers and ensure that each class drawn by the class-balanced sampler contributes the same number of images. The loss weights are set to 0.5:1:1 (CBS : RRS : self-supervised) in our extended model. SGD with an initial learning rate of 0.1 is used for optimisation, and the learning rate decays by a factor of ten every ten epochs. The experimental results are shown in Tables 1-3.
Top-1 accuracy (%):

|Method|Many-shot|Medium-shot|Low-shot|Overall|
|---|---|---|---|---|
|Triplet Loss [DBLP:journals/corr/HofferA14]|23.5|24.7|8.9|20.9|
|Focal Loss [DBLP:conf/iccv/LinGGHD17]|23.4|23.9|8.5|20.5|
|Proxy Loss [movshovitz2017no]|23.1|24.0|8.2|20.4|
Top-1 accuracy (%):

|Method|Many-shot|Medium-shot|Low-shot|Overall|
|---|---|---|---|---|
|Lifted Loss [DBLP:conf/cvpr/SongXJS16]|41.1|35.4|24.0|35.2|
|Focal Loss [DBLP:conf/iccv/LinGGHD17]|41.1|34.8|22.4|34.6|
|Range Loss [DBLP:conf/iccv/ZhangFWLQ17]|41.1|35.4|23.2|35.1|
[Table: top-1 accuracy (%) on ImageNet-LT (backbone: ResNet-10) and Places-LT (backbone: ResNet-50).]
First, we discuss the models trained from scratch in Tabs. 1 and 2. As expected, standard supervised training (RRS-Only) achieves high accuracy on the many-shot split, yet very low accuracy on the other two splits. CBS-Only, at the other extreme, achieves reasonable performance on the medium-shot and low-shot splits but performs poorly on many-shot. This result validates that directly applying CBS makes the head classes under-represented. Our CBS+RRS method significantly improves the accuracy on the medium- and low-shot splits compared to the RRS-Only baseline. Compared to CBS-Only, it effectively prevents the under-representation of head classes, resulting in superior performance on the many-shot split. Comparing CBS+RRS against other existing methods, we find that the proposed method achieves the best overall performance. For example, we outperform the second-best method, MemoryNet [liu2019large], by around 2% in overall accuracy, even though the latter uses a more complicated algorithm design. CBS+RRS also performs well on every split, which clearly shows that the proposed method is effective at balancing the head and tail classes.
To demonstrate the per-class performance gains of the proposed CBS+RRS over the two baselines RRS-Only and CBS-Only, we compute the class-wise accuracy gain, defined as $\Delta = \text{ACC}_{ours} - \text{ACC}_{base}$, where $\text{ACC}_{ours}$ and $\text{ACC}_{base}$ are the prediction accuracies of the proposed method and the compared baseline respectively. The result on ImageNet-LT is shown in Fig. 4 (the performance gains on Places-LT can be found in the supplementary materials). The left y-axis indicates the accuracy gain and the right axis the class frequency. The bar charts correspond to the comparisons against RRS-Only (blue) and CBS-Only (red), while the green curve is the distribution sorted by class frequency. It is clear that CBS+RRS improves performance for most classes: especially head classes when compared against CBS-Only, and tail classes when compared against RRS-Only.
Furthermore, Tabs. 1 and 2 also show that adding the self-supervised auxiliary task brings additional improvements. This is an interesting observation, since it somewhat contradicts the common belief that self-supervised losses are only useful for learning feature representations in the unsupervised setting.
As for the models with pre-trained backbones in Tab. 3, the proposed CBS+RRS still outperforms the other methods, but by a smaller margin than for models trained from scratch. The likely reason lies in the pre-training on ImageNet: with pre-training, a good feature representation is already available before fine-tuning on the target dataset, which in effect reduces the number of to-be-learned parameters and alleviates the risk of over-fitting. Also, since the convolutional features are frozen, we omit self-supervised learning as the additional auxiliary task in this setting.
[Table: top-1 accuracy (%) on ImageNet-LT (backbone: ResNet-10) and Places-LT (backbone: ResNet-50).]
[Table: top-1 accuracy (%) on ImageNet-LT (backbone: ResNet-10) and Places-LT (backbone: ResNet-50).]
4.5 Ablation Study
In this section, we conduct extensive ablation studies to investigate the effectiveness of the proposed auxiliary tasks. To observe results without the influence of pre-trained weights from additional datasets, all models are trained from scratch unless otherwise stated.
Auxiliary Learning vs Stage-Wise Training.
We first investigate the necessity of jointly training the primary and auxiliary tasks. To verify this, we compare the proposed method against stage-wise training, which first trains the network with the regular random sampling strategy and then fine-tunes it with the CBS strategy. We denote this baseline as ftRRS+CBS and report comparisons in Tab. 4.
As seen, this stage-wise training strategy achieves improved accuracy on the medium- and low-shot splits compared to RRS-Only. This improvement can be largely attributed to the CBS strategy adopted at the fine-tuning stage. Compared to CBS-Only, ftRRS+CBS also achieves a better result on the many-shot split, possibly because the head-class information has already been incorporated during the first training stage. However, compared with CBS+RRS, the stage-wise strategy still yields inferior results, which clearly shows the benefit of joint training.
Sampling-Strategy Weight Ratio.
Our method needs to set weights for the different loss terms. In this section, we explore the impact of different CBS:RRS weight ratios in the CBS+RRS model. We test ratios ranging from 0:1 to 1:0, where 0:1 is equivalent to the RRS-Only baseline and 1:0 to the CBS-Only baseline. As shown in Tab. 5, the model is in general not very sensitive to the ratio as long as the two tasks are jointly trained. The highest overall performance is achieved at a ratio of 0.5:1, which suggests that RRS should play the major role in training the feature representation. This observation supports our motivation for introducing the RRS auxiliary branch.
Self-Supervised Learning or Data Augmentation Strategy?
Given its simplicity and effectiveness, rotation-based self-supervised learning [gidaris2018unsupervised] is adopted as the auxiliary task to enhance feature learning. One may suspect that the benefit of introducing this task is essentially that of a data augmentation strategy. To investigate this, we report the performance of directly using rotations for data augmentation: specifically, we feed the four rotated versions of the images into the CBS+RRS model to ensure the same inputs as in CBS+RRS+SS. As Tab. 6 shows, rotation-based data augmentation does not bring any significant improvement over the baseline. This validates that the benefit of rotation-based self-supervised learning cannot be explained simply as rotation augmentation.
5 Conclusion
In this paper, we addressed long-tailed recognition by fitting it into a simple-yet-effective auxiliary learning framework. We analysed the dilemma of balancing the training of head and tail classes: the class-balanced sampling strategy is adopted in the primary task to tackle the unfair training of tail classes, while regular random sampling is used in the auxiliary task to prevent ill-fitting during feature learning. To further enhance the feature representation, self-supervised learning is explored as an additional auxiliary task. Comparisons against state-of-the-art methods and extensive ablation studies verify the effectiveness of the proposed models.