
Flexible Sampling for Long-tailed Skin Lesion Classification

by   Lie Ju, et al.
Monash University

Most medical tasks naturally exhibit a long-tailed distribution due to complex patient-level conditions and the existence of rare diseases. Existing long-tailed learning methods usually treat each class equally to re-balance the long-tailed distribution. However, considering that some challenging classes may present diverse intra-class distributions, re-balancing all classes equally may lead to a significant performance drop. To address this, we propose a curriculum learning-based framework called Flexible Sampling for the long-tailed skin lesion classification task. Specifically, we initially sample a subset of training data as anchor points based on the individual class prototypes. These anchor points are then used to pre-train an inference model to evaluate the per-class learning difficulty. Finally, a curriculum sampling module dynamically queries new samples from the remaining training samples with a learning difficulty-aware sampling probability. We evaluated our model against several state-of-the-art methods on the ISIC dataset. Results under two long-tailed settings demonstrate the superiority of our proposed training strategy, which achieves a new benchmark for long-tailed skin lesion classification.





1 Introduction

In the real world, since patient conditions are complex and there is a wide range of disease categories [4], most medical datasets tend to exhibit long-tailed distribution characteristics [10]. For instance, in dermatology, skin lesions can be divided into different subtypes according to their particular clinical characteristics [5]. On the popular ISIC dataset [9], the ratio between the majority and minority classes exceeds one hundred, a highly imbalanced class distribution. Deep classification models trained on such a long-tailed distribution may not recognize the minority classes well, since only a few samples are available [18]. Improving the recognition accuracy of these rare diseases is highly desirable for practical applications.

Most existing methods for long-tailed learning can be roughly divided into three main categories: information augmentation, module improvement [22] and class-balancing. Information augmentation-based methods [21, 18] focus on building a shared memory between head and tail classes, but fail under extremely imbalanced conditions since head samples still dominate mini-batch sampling. Module improvement-based methods [20, 10] require multiple experts for the final decision and introduce higher computational costs. Class-balancing methods [12, 23] attempt to re-balance the class distribution and have become the mainstream approach for various long-tailed classification tasks. However, most existing methods ignore or underestimate differences in class-level learning difficulty. In other words, we believe it is crucial to dynamically handle different classes according to their learning difficulties.

Figure 1: An illustration of learning difficulty. RS denotes re-sampling. Sub-figure (a) shows two kinds of lesions, melanoma and vascular lesion, from the majority and minority classes respectively. Sub-figure (b) shows that naively over-sampling the minority classes achieves only marginal performance gains on VASC but seriously hurts the recognition accuracy on MEL.

In Fig. 1, we illustrate our motivation with a pilot experiment on ISIC. Melanoma (MEL) and vascular lesion (VASC) are two representative skin lesions from the majority and minority classes, respectively. The former class has 2,560 training images while the latter contains only 96. Most samples from VASC resemble one another and present a pink color; intuitively, this kind of lesion is easier for deep models to classify since its intra-class diversity is relatively small. By contrast, although MEL has more training samples, its varied shapes and textures make it harder to learn. Fig. 1(b) shows the performance of two models trained with different objectives for the class-balancing constraint (cross-entropy (CE) loss vs. re-sampling (RS)). Over-sampling the tail classes with less intra-class variety yields only marginal performance gains on VASC but causes a significant performance drop on MEL, i.e., a severe marginal effect.

From this perspective, we propose a novel curriculum learning-based framework named Flexible Sampling that considers the varied learning difficulties of different classes. Specifically, we first sample a subset of data from the original distribution as anchor points, which are nearest to their corresponding class prototypes. Then, the subset is used to pre-train a less-biased inference model to estimate the learning difficulties of different classes, which are further utilized as class-level sampling probabilities. Finally, a curriculum sampling module dynamically queries the most suitable samples from the remaining data via uncertainty estimation. In this way, the model is trained on easy examples first and hard samples later, preventing unsuitable guidance at the early training stage.


Overall, our contributions can be summarized as follows: (1) We advocate that training based on learning difficulty filters out sub-optimal guidance at the early stage, which is crucial for the long-tailed skin lesion classification task. (2) By considering the varied class distribution, the proposed flexible sampling framework dynamically adjusts sampling probabilities according to the class-level learning status, which leads to better performance. (3) Extensive experimental results demonstrate that our proposed model achieves new state-of-the-art long-tailed classification performance and outperforms other public methods on the ISIC dataset.

2 Methodology

The overall pipeline of our proposed flexible sampling framework is shown in Fig. 2. First, we use self-supervised learning to train a model that obtains balanced representations from the original long-tailed distribution. Then, we sample a subset of the training data using the class prototypes (mean class features) as anchor points. Next, this subset is used to train an inference model that estimates learning difficulties. Finally, a novel curriculum sampling module dynamically queries new instances from the unsampled pool according to the current learning status, with an uncertainty-based selection strategy used to find the most suitable instances to query.

2.1 SSL Pre-training for Balanced Representations

Training with semantic labels as the supervision signal on a long-tailed distribution can introduce significant bias, since the majority classes occupy the dominant portion of the feature space [12]. A straightforward solution is to leverage self-supervised learning (SSL) to obtain category-distribution-agnostic features. Different from supervised learning methods, SSL techniques do not require annotations but learn representations by maximizing instance-wise discriminativeness [13, 11]. The contrastive learning loss [19] adopted in this study can be formulated as:


$$\mathcal{L}_{con} = -\log \frac{\exp(z \cdot z^{+}/\tau)}{\exp(z \cdot z^{+}/\tau) + \sum_{z^{-}} \exp(z \cdot z^{-}/\tau)},$$

where $z^{+}$ is a positive sample for $z$ generated by data augmentation, $z^{-}$ is a negative sample drawn from any other instance, and $\tau$ is the temperature factor. SSL learns a stably good feature space that is robust to the underlying distribution of a dataset, even a strongly biased long-tailed one [11].
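As an illustration, this instance-wise contrastive objective for a single anchor might look as follows, assuming dot-product similarity on L2-normalized embeddings (function and variable names are ours, not the paper's):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_i, z_pos, z_negs, temperature=0.07):
    """InfoNCE-style contrastive loss for one anchor embedding.

    z_i:    (d,)   anchor embedding
    z_pos:  (d,)   embedding of an augmented view of the same image
    z_negs: (K, d) embeddings drawn from other images
    """
    z_i = F.normalize(z_i, dim=0)
    z_pos = F.normalize(z_pos, dim=0)
    z_negs = F.normalize(z_negs, dim=1)
    pos = torch.exp(torch.dot(z_i, z_pos) / temperature)   # similarity to the positive
    neg = torch.exp(z_negs @ z_i / temperature).sum()      # similarities to all negatives
    return -torch.log(pos / (pos + neg))
```

In practice this is averaged over all anchors in a mini-batch, with the other batch elements serving as negatives.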

Figure 2: The overview of our proposed framework.

2.2 Sample Anchor Points using the Prototype

To sample anchor points for training the inference model, we first compute the mean feature of each class as its prototype. Concretely, after pre-training a CNN model with feature extractor $f(\cdot)$ and classifier $g(\cdot)$, we obtain features $\{v_1, \dots, v_n\}$ for the $n$ instances of the original distribution by forward passing them through $f(\cdot)$. Thus, for sorted class indices $c \in \{1, \dots, C\}$, the prototype of class $c$ is calculated by averaging all features from that class:

$$\mu_c = \frac{1}{n_c} \sum_{y_i = c} v_i,$$

where $n_c$ denotes the number of instances of class $c$. Then, we find the representative samples that have the lowest distances to the prototypes by calculating the Euclidean distance:

$$d_i = \lVert v_i - \mu_{y_i} \rVert_2.$$

By sorting the distances $[d_1, \dots, d_n]$, we select the top-$k$ lowest-distance samples per class as the class anchor points to form the subset. For instance, given a long-tailed distribution following a Pareto distribution with $C$ classes and imbalance ratio $r$, the number of samples for class $c$ can be calculated as $n_c = n_1 \times r^{-\frac{c-1}{C-1}}$. Then, we use a scaling hyper-parameter to set the imbalance ratio of the anchor points, sampling a less-imbalanced subset with a reduced ratio.

Using the prototype for sampling anchor points offers two advantages: 1) training the inference model with representative anchor points provides a better initialization than using randomly selected samples; 2) the re-sampled subset has a relatively more balanced distribution, which helps reduce confirmation bias when predicting uncertainty on the unsampled instances.
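The prototype computation and anchor-point selection above can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def select_anchor_points(features, labels, n_anchor_per_class):
    """Pick, per class, the samples closest to the class prototype.

    features: (N, d) array of SSL features
    labels:   (N,)   integer class labels
    n_anchor_per_class: dict mapping class index -> number of anchors to keep
    Returns indices into `features` forming the anchor subset.
    """
    anchors = []
    for c, k in n_anchor_per_class.items():
        idx = np.flatnonzero(labels == c)
        prototype = features[idx].mean(axis=0)                     # mean class feature
        dists = np.linalg.norm(features[idx] - prototype, axis=1)  # Euclidean distance
        anchors.extend(idx[np.argsort(dists)[:k]])                 # k nearest to prototype
    return np.array(anchors)
```

Passing a per-class budget that is less imbalanced than the original class counts yields the re-balanced anchor subset described above.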

2.3 Curriculum Sampling Module

2.3.1 Class-wise Sampling Probability

After obtaining the anchor points, we use them to train the inference model for querying new samples from the unsampled instances in the training pool. As claimed above, learning difficulties vary, and some tail classes with less intra-class variety can converge well at an early stage; over-sampling such classes brings obvious marginal effects and harms other classes. Hence, a more flexible sampling strategy is required, one that estimates the learning difficulty of each class. Intuitively, this can be achieved by evaluating the competence of the current model. Thus, we calculate the validation accuracy of each class after every epoch to derive the sampling probability:

$$p_c^{(e)} = 1 - a_c^{(e)},$$

where $e$ denotes the epoch and $a_c^{(e)}$ is the corresponding validation accuracy on class $c$. Then, $p_c^{(e)}$ is normalized as $\hat{p}_c^{(e)} = p_c^{(e)} / \sum_{i=1}^{C} p_i^{(e)}$, so that $\sum_c \hat{p}_c^{(e)} = 1$. It should be noted that when limited samples are available for validation, the validation accuracy can be replaced by the training accuracy with only a one-time forward pass. Results evaluated on the validation set provide a more unbiased reference with less risk of over-fitting, since the validation set follows a different distribution from the training set.
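A minimal sketch of a difficulty-aware class sampling probability, assuming the probability grows with per-class difficulty (one minus validation accuracy), which matches the motivation of not over-sampling early-converging classes (the function name and the small stabilizing constant are ours):

```python
import numpy as np

def sampling_probability(val_acc):
    """Map per-class validation accuracy to a normalized sampling probability.

    Classes the model already handles well get a small probability, so easy,
    early-converging tail classes are not over-sampled.
    """
    # learning difficulty per class; tiny constant avoids an all-zero vector
    difficulty = 1.0 - np.asarray(val_acc, dtype=float) + 1e-8
    return difficulty / difficulty.sum()  # normalize to sum to 1
```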

2.3.2 Instance-wise Sample Selection

With the learning difficulty-aware sampling probability at the categorical level, we still need to determine which instances should be sampled, for more efficient querying in an instance difficulty-aware manner. Here, we propose to leverage uncertainty for the selection of unlabeled samples. Samples with higher uncertainty have larger entropy, since CNN models tend to give less-confident predictions on hard samples [15]. Training on high-uncertainty samples can enrich the representation space with discriminative features. We compute the mutual information between the predictions and the model posterior as the uncertainty measurement [8]:

$$\mathbb{I}(y; \theta \mid x, \mathcal{D}) = \mathbb{H}[y \mid x, \mathcal{D}] - \mathbb{E}_{p(\theta \mid \mathcal{D})}\big[\mathbb{H}[y \mid x, \theta]\big],$$

where $\mathbb{H}[y \mid x, \mathcal{D}]$ is the entropy of the prediction for a given instance $x$ and the second term denotes the expectation of the entropy of the model prediction over the posterior of the model parameters. A high uncertainty value implies that the posterior draws disagree among themselves [14]. Then, with the instances of each class sorted by uncertainty, the number of newly sampled instances per class is allocated in proportion to its class-level sampling probability. Finally, we merge the anchor points and the newly-sampled instances into the training loop for continual training.
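A sketch of this mutual-information (BALD-style) score, estimated from T stochastic forward passes such as MC dropout; the Monte-Carlo estimator and function name are our illustration, the paper only specifies the criterion [8, 14]:

```python
import numpy as np

def bald_mutual_info(mc_probs):
    """BALD acquisition: I(y; θ | x) = H[E_θ p(y|x,θ)] − E_θ H[p(y|x,θ)].

    mc_probs: (T, N, C) class probabilities from T stochastic forward passes
              (e.g. MC dropout) for N candidate instances.
    Returns an (N,) array; larger values mean the posterior draws disagree
    more, i.e. the instance is more informative to query.
    """
    eps = 1e-12  # numerical guard for log(0)
    mean_p = mc_probs.mean(axis=0)                          # (N, C) mean prediction
    h_mean = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)  # entropy of mean prediction
    mean_h = -(mc_probs * np.log(mc_probs + eps)).sum(axis=-1).mean(axis=0)
    return h_mean - mean_h                                  # predictive minus expected entropy
```

When the draws agree the two entropy terms cancel; when they disagree the entropy of the averaged prediction exceeds the average per-draw entropy, yielding a positive score.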

3 Experiments

3.1 Datasets

We evaluate the competitive baselines and our method on two dermatology datasets derived from ISIC [9] with different numbers of classes. ISIC-2019-LT is a long-tailed version constructed from the ISIC-2019 Challenge, which involves 8 diagnostic categories. We follow [18] and sample a subset from a Pareto distribution: with $C$ classes and imbalance ratio $r$, the number of samples for class $c$ is calculated as $n_c = n_1 \times r^{-\frac{c-1}{C-1}}$. For this setting, we set $r \in \{100, 200, 500\}$. We select 50 and 100 images from the remaining samples as the validation and test sets, respectively. The second dataset is augmented from ISIC-2019 to increase the length of the tail, with 6 extra classes (14 in total) added from the ISIC Archive, and is named ISIC-Archive-LT (please see our supplementary document for more details). This dataset is more challenging, with a longer tail and a larger imbalance ratio. It is divided into train:val:test = 7:1:2.
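For illustration, the Pareto-style class counts can be generated as below; the head-class size of 1,000 in the usage example is hypothetical, and the exponent form follows the standard construction in [18]:

```python
def pareto_class_sizes(n_head, n_classes, ratio):
    """Per-class counts n_c = n_head * ratio^(-(c-1)/(C-1)) for c = 1..C.

    n_head:    number of samples in the largest (head) class
    n_classes: total number of classes C
    ratio:     imbalance ratio r between head and tail class sizes
    """
    return [round(n_head * ratio ** (-(c - 1) / (n_classes - 1)))
            for c in range(1, n_classes + 1)]
```

With a hypothetical head class of 1,000 images, 8 classes and r = 100, the counts decay geometrically from 1,000 down to 10.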

3.2 Implementation Details

All images were resized to 224×224 pixels. We used ResNet-18 and ResNet-50 [7] as the backbones for ISIC-2019-LT and ISIC-Archive-LT, respectively. For SSL pre-training, we used the Adam optimizer with a learning rate of $2\times10^{-4}$ and a batch size of 16. For the classification task, we used the Adam optimizer with a learning rate of $3\times10^{-4}$ and a batch size of 256. Augmentation techniques such as random crop, flip and colour jitter were applied. An early-stopping monitor with a patience of 20 was used when there was no further improvement on the validation set, within a total of 100 epochs.

For our proposed method, we warmed up the model with plain cross-entropy loss for 30 epochs. The scaling value for sampling anchor points was set to 0.1. We queried 10% of the samples in the unsampled pool whenever there was no further improvement with a patience of 10; the optimizer was then re-initialized with a learning rate of $3\times10^{-4}$. For a fair comparison, we kept all basic hyper-parameters the same across the comparative methods. All presented results are the average of 5 trial runs. All experiments were implemented with PyTorch 1.8.1 on a single NVIDIA RTX 3090.

3.3 Comparison Study

The comparative methods selected in this study are briefly introduced as follows: (1) plain class-balancing re-sampling (RS) and re-weighting (RW) strategies; (2) information augmentation: MixUp [21] and OLTR [18]; (3) modified re-weighting loss functions: Focal Loss [16], Class-Balancing (CB) Loss [2] and label-distribution-aware margin (LDAM) Loss [1]; (4) curriculum learning-based methods: CurriculumNet [6] and the EarLy-exiting Framework (ELF) [3].

3.3.1 Performance on ISIC-2019-LT

The overall results of the comparative methods and our proposed framework on ISIC-2019-LT are summarized in Table 1 using the mean and standard deviation (Std.) of top-1 accuracy. Here are some findings: (1) The re-sampling strategy obtained little improvement under extremely imbalanced conditions, e.g., r = 500; repeatedly sampling relatively easy-to-learn classes brings obvious marginal effects. (2) CB Loss showed superior performance with 4.56%, 4.35% and 5.05% improvements over the cross-entropy baseline, since it tends to sample more effective samples. (3) The proposed flexible sampling outperformed the other methods with 63.85%, 59.38% and 50.99% top-1 accuracy over the three imbalance ratios.

Methods | Top-1 @ Ratio 100 | Top-1 @ Ratio 200 | Top-1 @ Ratio 500
CE | 55.09 (4.10) +0.00 | 53.40 (0.82) +0.00 | 44.40 (1.80) +0.00
RS | 61.05 (1.26) +4.96 | 55.10 (1.29) +1.70 | 46.82 (1.33) +2.82
RW | 61.07 (1.73) +5.98 | 58.03 (1.24) +4.63 | 50.55 (1.19) +6.15
MixUp [21] | 57.25 (0.71) +2.16 | 52.05 (1.71) -1.35 | 42.78 (1.28) -1.62
OLTR [18] | 60.95 (2.09) +5.86 | 56.77 (1.99) +3.37 | 48.75 (1.52) +4.35
Focal Loss [16] | 60.62 (1.04) +5.53 | 53.88 (1.19) +0.48 | 44.38 (1.44) -0.02
CB Loss [2] | 59.65 (1.52) +4.56 | 57.75 (1.39) +4.35 | 49.45 (1.11) +5.05
LDAM [1] | 58.62 (1.46) +3.53 | 54.92 (1.24) +1.52 | 44.90 (2.60) +0.50
CN [6] | 56.41 (1.62) +1.32 | 53.55 (0.96) +0.15 | 44.92 (0.31) +0.52
ELF [3] | 61.94 (1.53) +6.85 | 54.60 (1.78) +1.20 | 43.19 (0.31) -1.21
Ours | 63.85 (2.10) +8.76 | 59.38 (1.90) +5.98 | 50.99 (2.02) +6.59
Table 1: Comparative results under three imbalance ratios {100, 200, 500}. Each cell shows top-1 accuracy, (Std.), and the difference from the CE baseline.
Methods | Head | Medium | Tail | All
CE | 71.23 (1.33) | 48.73 (4.24) | 36.94 (7.17) | 52.30 (4.25)
RS | 70.26 (1.21) | 55.98 (1.63) | 37.14 (5.44) | 54.46 (2.76)
RW | 59.70 (2.84) | 61.53 (3.69) | 63.64 (3.11) | 61.62 (3.21)
CB Loss | 65.27 (5.73) | 56.97 (6.64) | 66.74 (2.94) | 62.89 (5.10)
ELF | 68.81 (3.69) | 50.17 (4.12) | 60.77 (4.99) | 59.92 (4.27)
Ours | 68.26 (2.01) | 57.72 (2.98) | 67.01 (4.12) | 64.33 (3.04)
Table 2: Comparative results on ISIC-Archive-LT (top-1 accuracy, Std.).

3.3.2 Performance on ISIC-Archive-LT

For the ISIC-Archive-LT dataset with its imbalanced test set, we report accuracy on a shot-based division (head, medium, tail) and the average of the three, as in most existing works [18, 22]. The overall results are shown in Table 2. RS did not obtain satisfactory results: under extremely imbalanced conditions, the same training samples from tail classes can occur repeatedly in a mini-batch, resulting in learning from an incomplete distribution. Compared with CB Loss, which sacrifices head-class performance, our method evenly improved the overall performance (+12.03%) on the medium (+8.99%) and tail classes (+30.07%) with only a small performance loss on the head classes (-2.97%). In Fig. 3 we present the performance of several selected classes under three competitors. Notably, flexible sampling can also improve head classes with higher learning difficulty, e.g., MEL, while RW severely hurts recognition performance, especially for high-accuracy classes, e.g., BCC.

Figure 3: The performance on several classes from head/tail classes on three methods.

3.4 Ablation Study

To investigate what makes our proposed method effective, we conduct ablation studies and report the results in Table 3. We first evaluate the choice of pre-training method. Using SSL pre-training with anchor points clearly boosts the accuracy of the inference model, where Random denotes random sample selection and Edge Points denotes samples far from the prototypes. We then validate the effect of different querying strategies: compared with random querying, uncertainty-based mutual information introduces samples more efficiently in an instance-aware manner.

Pre-train | Sample Selection | Acc. | + Random Query | + Mutual Information Query
CE | Anchor Points | 58.09 (0.95) | - | -
SSL | Random | 56.19 (2.20) | 57.77 (3.09) | 60.99 (2.41)
SSL | Edge Points | 52.62 (1.76) | 56.95 (1.04) | 58.63 (1.07)
SSL | Anchor Points | 59.32 (1.58) | 60.26 (1.73) | 63.85 (2.10)
Table 3: Ablation study results on ISIC-2019-LT (r = 100).

4 Conclusion

In this work, we present a flexible sampling framework that takes into consideration the learning difficulties of different classes. We advocate training on a relatively balanced and representative subset first, then dynamically adjusting the sampling probabilities according to the learning status. Comprehensive experiments demonstrate the effectiveness of the proposed method. In future work, more medical datasets and a multi-label setting can be explored.


  • [1] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma (2019) Learning imbalanced datasets with label-distribution-aware margin loss. Advances in neural information processing systems 32. Cited by: §3.3, Table 1.
  • [2] Y. Cui, M. Jia, T. Lin, Y. Song, and S. Belongie (2019) Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277. Cited by: §3.3, Table 1.
  • [3] R. Duggal, S. Freitas, S. Dhamnani, D. H. Chau, and J. Sun (2020) Elf: an early-exiting framework for long-tailed classification. arXiv preprint arXiv:2006.11979. Cited by: §3.3, Table 1.
  • [4] N. England and N. Improvement (2016) Diagnostic imaging dataset statistical release. London: Department of Health 421. Cited by: §1.
  • [5] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639), pp. 115–118. Cited by: §1.
  • [6] S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott, and D. Huang (2018) Curriculumnet: weakly supervised learning from large-scale web images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135–150. Cited by: §3.3, Table 1.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.
  • [8] N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel (2011) Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745. Cited by: §2.3.2.
  • [9] ISIC (2021) ISIC archive. Cited by: §1, §3.1.
  • [10] L. Ju, X. Wang, L. Wang, T. Liu, X. Zhao, T. Drummond, D. Mahapatra, and Z. Ge (2021) Relational subsets knowledge distillation for long-tailed retinal diseases recognition. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Cham, pp. 3–12. External Links: ISBN 978-3-030-87237-3 Cited by: §1, §1.
  • [11] B. Kang, Y. Li, S. Xie, Z. Yuan, and J. Feng (2020) Exploring balanced feature spaces for representation learning. In International Conference on Learning Representations, Cited by: §2.1.
  • [12] B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis (2019) Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217. Cited by: §1, §2.1.
  • [13] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. Advances in Neural Information Processing Systems 33, pp. 18661–18673. Cited by: §2.1.
  • [14] A. Kirsch, J. Van Amersfoort, and Y. Gal (2019) Batchbald: efficient and diverse batch acquisition for deep bayesian active learning. Advances in neural information processing systems 32. Cited by: §2.3.2.
  • [15] R. Leipnik (1959) Entropy and the uncertainty principle. Information and Control 2 (1), pp. 64–79. Cited by: §2.3.2.
  • [16] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §3.3, Table 1.
  • [17] F. Liu, S. Ge, and X. Wu (2021) Competence-based multimodal curriculum learning for medical report generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3001–3012. Cited by: §1.
  • [18] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu (2019) Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2537–2546. Cited by: §1, §1, §3.1, §3.3.2, §3.3, Table 1.
  • [19] A. Van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv e-prints, pp. arXiv–1807. Cited by: §2.1.
  • [20] X. Wang, L. Lian, Z. Miao, Z. Liu, and S. X. Yu (2020) Long-tailed recognition by routing diverse distribution-aware experts. arXiv preprint arXiv:2010.01809. Cited by: §1.
  • [21] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §1, §3.3, Table 1.
  • [22] Y. Zhang, B. Kang, B. Hooi, S. Yan, and J. Feng (2021) Deep long-tailed learning: a survey. arXiv preprint arXiv:2110.04596. Cited by: §1, §3.3.2.
  • [23] B. Zhou, Q. Cui, X. Wei, and Z. Chen (2020) Bbn: bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9719–9728. Cited by: §1.