Alleviating the Incompatibility between Cross Entropy Loss and Episode Training for Few-shot Skin Disease Classification

04/21/2020 ∙ by Wei Zhu, et al.

Skin disease classification from images is crucial to dermatological diagnosis. However, identifying skin lesions involves a variety of aspects in terms of size, color, shape, and texture. To make matters worse, many categories contain only very few samples, posing great challenges to conventional machine learning algorithms and even human experts. Inspired by the recent success of Few-Shot Learning (FSL) in natural image classification, we propose to apply FSL to skin disease identification to address the extreme scarcity of training samples. However, directly applying FSL to this task does not work well in practice, and we find that the problem can be largely attributed to the incompatibility between Cross Entropy (CE) and episode training, which are both commonly used in FSL. Based on a detailed analysis, we propose the Query-Relative (QR) loss, which proves superior to CE under episode training and is closely related to recently proposed mutual information estimation. Moreover, we further strengthen the proposed QR loss with a novel adaptive hard margin strategy. Comprehensive experiments validate the effectiveness of the proposed FSL scheme and the possibility of diagnosing rare skin diseases with a few labeled samples.




1 Introduction

As a key step in dermatological diagnosis, skin disease classification is challenging because annotations are extremely scarce for a large number of categories. The complexity of the skin disease taxonomy demands a great deal of expertise. In addition, diagnosis is often subjective and inaccurate even when performed by human experts, which motivates research on computer-aided diagnosis [10, 14]. Motivated by the unprecedented success of deep neural networks (DNNs), many researchers have turned to deep learning to handle this task [2, 7, 8]. For example, Esteva et al. adopt GoogleNet Inception V3 [16] to train a large-scale skin disease classification network [2]. Liao et al. jointly train skin lesion and body location classifiers using a multi-task network [8]. However, since DNN-based methods usually require a significant number of training samples for each category, categories with only a few samples are often discarded [7]. This reduces the applicability of DNN-based methods, especially for the diagnosis of infrequent skin diseases.

Shi et al. propose to adopt active learning to reduce the annotation cost [12], but still need up to 50% of the samples to be labeled to train their model. Alternatively, Few-Shot Learning (FSL) is usually leveraged to address such tasks with only a few training samples [13, 15, 5, 6]. By assuming the availability of a large-scale auxiliary training set, one can learn generalized patterns and knowledge that facilitate learning for unseen tasks. Formally, for each few-shot task, we are provided with a support set $\mathcal{S}$, a query set $\mathcal{Q}$, and an auxiliary set $\mathcal{A}$, where the support set $\mathcal{S}$ contains $C$ different categories and each category has $K$ training samples, i.e., $C$-way $K$-shot, and $\mathcal{Q}$ contains unlabeled query data. Instead of conventional minibatch training, FSL is typically trained with the episode training mechanism [13]. At each training iteration, we generate an episode by drawing samples from $C$ different categories of the auxiliary set $\mathcal{A}$, with $K$ samples in each category serving as support samples and the others as query samples. As a crucial step, we randomly shuffle the labels of all categories from episode to episode. The episode training mechanism benefits FSL in at least two ways. First, it enables FSL to be trained under scenarios similar to the testing tasks. Second, the labels are randomly shuffled during episode training, which enables the model to learn category-agnostic representations that generalize better.
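The episode construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sample_episode`, the dictionary layout of the auxiliary set, and the toy category and image names are all assumptions made for the example.

```python
import random

def sample_episode(aux_set, n_way, k_shot, n_query, rng):
    """Draw one n_way-way k_shot-shot episode from a dict: category -> samples."""
    # Only categories with enough samples for support plus query are eligible.
    eligible = [c for c, xs in aux_set.items() if len(xs) >= k_shot + n_query]
    # rng.sample returns the chosen categories in random order, so the episode
    # label (0..n_way-1) given to a category changes from episode to episode.
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for label, category in enumerate(classes):
        picked = rng.sample(aux_set[category], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy auxiliary set: 8 hypothetical categories with 12 image ids each.
aux = {f"category_{i}": [f"img_{i}_{j}" for j in range(12)] for i in range(8)}
rng = random.Random(0)
support, query = sample_episode(aux, n_way=5, k_shot=5, n_query=5, rng=rng)
print(len(support), len(query))  # 25 25
```

Because the episode label is just the position of a category in the random draw, the same category generally receives different labels in different episodes, which is the label shuffling the text refers to.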

Generally, FSL employs the Cross Entropy (CE) loss as the classification objective. Although CE is useful for conventional classification, we find that it is somewhat incompatible with the episode mechanism. Well-designed FSL methods trained with CE can even perform significantly worse than the baseline methods [1]. As we will see, CE classifies the query samples individually and relies heavily on well-trained category-wise representations, a.k.a. proxies in proxy-based metric learning methods, which share a similar formulation with CE [9, 11]. A proxy is a category-wise aggregation of labeled support samples, e.g., the center used in Prototypical Network (PN). However, accurate proxies can only be obtained from a large-scale, unified labeled dataset under the conventional minibatch training mechanism. This is hardly achievable under the episode training mechanism, since we are only provided with a few training samples with randomly shuffled labels in each iteration.

To alleviate the problem, we propose a Query-Relative (QR) loss, which works much better with the episode training mechanism than CE for FSL.

We highlight our main contributions as follows:

  • Based on an analysis of the CE loss and the episode mechanism, we propose a Query-Relative (QR) loss to better utilize the cross-sample information and avoid possibly sub-optimal aggregation of negative support samples, which significantly boosts the FSL performance;

  • We develop an adaptive hard margin method for the QR loss to further penalize the categories with more erroneous similarity connections;

  • We evaluate our methods against a benchmark FSL suite [1], and the experiments strongly validate our analysis and the proposed methods.

Figure 1: Block diagram of few-shot learning-based skin disease classification and the difference between QR and CE loss. CE considers queries individually, while QR takes the relation across samples into consideration. Moreover, CE aggregates the support samples into proxies with possible information loss, while QR allows the model to fully exploit the information of negative support samples guided by the training objective.

2 Methodology

2.1 Discussions on FSL

The Cross Entropy (CE) loss is often used jointly with the episode mechanism to solve FSL tasks. It can be generally formulated as


Here, $q_1, \dots, q_N \in \mathbb{R}^{d}$ are the query embeddings, and $w_1, \dots, w_C \in \mathbb{R}^{d}$ are the representations for the support categories, where $N$, $C$, and $d$ denote the number of queries, support categories, and feature dimensions, respectively. $s(w_j, q_i)$ denotes the similarity between the support category proxy $w_j$ and the query sample $q_i$. Different FSL methods use different similarity measurements and different ways of aggregating the support samples into category proxies. For example, PN (Prototypical Network) uses the center of the support samples from the $j$-th category as $w_j$ and the Euclidean distance as $s$, Matching Net employs an FCE (Fully Context Embedding) layer to encode the support samples and chooses cosine similarity for $s$, and MAML implements $s$ as a Fully Connected (FC) layer where the $j$-th weight vector of the FC layer corresponds to $w_j$. To unify these methods, we normalize $w_j$ and $q_i$, which has been shown to lead to better performance in the recent literature [20]. Eq. (1) can then be rewritten in terms of the inner products $w_j^\top q_i$.
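To make the role of the category proxies concrete, the following is a minimal sketch of CE inside one episode. It assumes the standard softmax cross entropy over query-to-proxy similarities, PN-style proxies (the normalized mean of each category's support embeddings), and an inner product between normalized vectors as the similarity. The function names, the toy embeddings, and the absence of a temperature are assumptions of this sketch, not details taken from the paper.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def prototypes(support):
    """support: list of (embedding, label). Returns label -> normalized mean."""
    sums, counts = {}, {}
    for emb, label in support:
        acc = sums.setdefault(label, [0.0] * len(emb))
        for k, x in enumerate(emb):
            acc[k] += x
        counts[label] = counts.get(label, 0) + 1
    return {c: normalize([x / counts[c] for x in acc]) for c, acc in sums.items()}

def episode_cross_entropy(support, query):
    """Mean softmax cross entropy of queries against the category proxies."""
    protos = prototypes(support)
    total = 0.0
    for emb, label in query:
        q = normalize(emb)
        # Inner product of normalized vectors, i.e. cosine similarity.
        logits = {c: sum(a * b for a, b in zip(w, q)) for c, w in protos.items()}
        m = max(logits.values())  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += log_z - logits[label]
    return total / len(query)

# Toy 2-way 2-shot episode with 2-dimensional embeddings.
support = [([1.0, 0.1], 0), ([0.9, 0.0], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
query = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
loss_correct = episode_cross_entropy(support, query)
loss_swapped = episode_cross_entropy(support, [(e, 1 - y) for e, y in query])
print(loss_correct, loss_swapped)
```

Each query contributes its own term and only ever sees the aggregated proxies, never the individual support samples or the other queries.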

According to Eq. (1), CE classifies the query samples individually and relies completely on the category-wise representations to train the model. For a conventional classification task trained with minibatch SGD, this mechanism prompts the category proxies to capture high-level representative features of each category by exposing them to a large and balanced dataset. Unfortunately, this is not the case for FSL because of the episode training mechanism. Episode training is important for FSL since it enables the model to learn generalized, class-agnostic representations and matches the training scenario to the testing scenario, but it is also a double-edged sword: it makes the category proxies inevitably biased and inaccurate. There are two reasons. First, the proxies are learned from only a few samples in each episode, e.g., 1 and 5 for 1-shot and 5-shot respectively, and with so few training samples it is difficult to learn to aggregate the support samples into proxies without losing useful information. Second, the labels are randomly shuffled for each episode, which prevents the proxies from being trained consistently across episodes. Therefore, the proxies cannot be fully relied on under the episode mechanism, and training the model with the CE loss eventually degrades FSL performance. This sub-optimal performance has been observed and experimentally validated by several recent benchmark papers for natural images, where well-designed baselines achieve similar or even better performance than CE-trained FSL counterparts [1, 17]. Similar results are also observed in the skin disease tasks according to our experiments in Section 3.

2.2 Query-Relative Loss

We alleviate the above problem from two aspects. First, instead of classifying each query separately, we unify all samples into a joint objective so that they can share information across samples. Second, we avoid using negative category proxies, which are aggregated from the negative support samples with a manually designed strategy (e.g., the center of the support samples in PN); the information in the support samples can then be largely preserved and extracted under the guidance of the training objective. To this end, we propose the Query-Relative (QR) loss as follows


where $\mathcal{P}_c$ denotes the set of positive query samples that belong to the $c$-th category, $\mathcal{N}_c$ denotes the set of negative query and support samples that are not from the $c$-th category, and $|\cdot|$ denotes the number of samples in a set.
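The two sets described above can be built directly from the episode labels. The sketch below only constructs the sets and does not reproduce the QR loss itself; the function name `build_sets` and the toy sample names are hypothetical.

```python
def build_sets(support, query, n_way):
    """For each category c, return (positive queries, negative queries + negative supports)."""
    sets = {}
    for c in range(n_way):
        positives = [x for x, y in query if y == c]
        negatives = [x for x, y in query if y != c] + [x for x, y in support if y != c]
        sets[c] = (positives, negatives)
    return sets

# Toy 3-way 1-shot episode with 2 queries per category.
support = [(f"s{c}", c) for c in range(3)]
query = [(f"q{c}_{i}", c) for c in range(3) for i in range(2)]
sets = build_sets(support, query, n_way=3)
positives, negatives = sets[0]
print(positives)
print(negatives)
```

In this toy episode every category has 2 positive queries and 6 negatives (4 negative queries plus 2 negative support samples), so the negative support samples enter the objective individually rather than through an aggregated proxy.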

We then present an analysis of how our objective improves on CE in the two aforementioned aspects. First of all, Eq. (2) implicitly utilizes the cross-sample information to re-weight each sample. Specifically, taking the derivative w.r.t. and , we have


Here, we only focus on the absolute value of the gradient. According to Eq. (3), will induce a large gradient and will be punished if (i) is small; (ii) is smaller than where ; or (iii) is small so that we could focus on intra-class relation. Moreover, a large will provide with tolerance to some extent, which allows our model to focus on reducing the large similarity of . Similar analysis can be performed with based on Eq. (4), and we omit the detail here. Therefore, in contrast to CE which deals with each sample separately, QR allows the query and support samples to share information across each other and category-wisely re-weights their importance.

Second, note that the negative set of each category contains not only the negative query samples but also the support samples from other categories. This avoids the information loss caused by the possibly sub-optimal support sample aggregation and allows the model to learn to utilize the negative support samples directly by the objective.

It turns out that the QR loss is closely related to Deep Mutual Information (MI) maximization recently proposed by [4]. Without loss of generality, following [4], the JSD-based (Jensen-Shannon Divergence) MI estimator between and can be formulated as


Here we use the fact that is convex and Jensen’s inequality. We can thus derive that the QR loss is actually a lower bound of the JSD MI. The reason why we do not directly optimize is that the re-weighting mechanism of does not take both and into consideration for each . We experimentally verify the superiority of our formulation in Sec. 3.
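For reference, the generic JSD-based estimator of [4] has the form below. The variables $X$, $Y$ and the critic $T$ are generic placeholders; how they are instantiated with query and support similarities in this paper is not shown here. Reading the convex function mentioned above as the softplus is an assumption of this note.

```latex
\hat{I}^{\mathrm{JSD}}(X; Y)
  = \mathbb{E}_{p(x, y)}\big[ -\mathrm{sp}\big(-T(x, y)\big) \big]
  - \mathbb{E}_{p(x)\,p(y')}\big[ \mathrm{sp}\big(T(x, y')\big) \big],
\qquad
\mathrm{sp}(z) = \log\big(1 + e^{z}\big).

% Jensen's inequality for the convex softplus:
\mathrm{sp}\big(\mathbb{E}[z]\big) \le \mathbb{E}\big[\mathrm{sp}(z)\big].
```

The first expectation is taken over positive (jointly drawn) pairs and the second over negative (independently drawn) pairs, so each term is averaged on its own.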

2.3 Adaptive Hard Margin

The adaptive hard margin is built upon the fact that the cosine similarity between uniformly distributed normalized samples approaches a distribution concentrated around zero in high dimensions [19] and is thus likely to be close to zero. Therefore, should be at least larger than and should be at least smaller than , where and denote the average of with and average of with , respectively. Based on this observation, we propose a QR loss with online Adaptive Hard Margin which can be written as


Basically, Eq. (6) imposes extra punishment on categories with more positive samples whose similarities are smaller than random or negative samples, and negative samples whose similarities are larger than random or positive samples.
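The observation behind the margin, that the cosine similarity between random normalized vectors concentrates around zero as the dimension grows, can be checked numerically. This is a small simulation under assumed settings: the dimensions 4 and 512, the number of pairs, and the seed are arbitrary choices for illustration.

```python
import math
import random

def random_unit_vector(dim, rng):
    # A normalized Gaussian vector is uniformly distributed on the unit sphere.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_abs_cosine(dim, n_pairs, rng):
    """Average |cosine similarity| over n_pairs random pairs of unit vectors."""
    total = 0.0
    for _ in range(n_pairs):
        a = random_unit_vector(dim, rng)
        b = random_unit_vector(dim, rng)
        total += abs(sum(x * y for x, y in zip(a, b)))
    return total / n_pairs

rng = random.Random(0)
low_dim = mean_abs_cosine(dim=4, n_pairs=500, rng=rng)
high_dim = mean_abs_cosine(dim=512, n_pairs=500, rng=rng)
print(f"mean |cos|: dim=4 -> {low_dim:.3f}, dim=512 -> {high_dim:.3f}")
```

In the high-dimensional case the typical magnitude of the similarity is much smaller, which is why zero can serve as a reference value for "random" similarity.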

3 Experiments

3.1 Datasets

We collect the dermatology images from the Dermnet atlas website. To perform few-shot learning, we discard categories with fewer than 10 samples, which are required for the 5-way 5-shot setting. Finally, we obtain images in total belonging to different categories. The largest category “seborrheic keratoses ruff” contains images and the smallest categories contain 10 samples. Detailed statistics of the data can be found in the supplemental material. The data is manually split into categories for training, for validation, and for testing, respectively. Moreover, to better simulate the scenario of few-shot learning, we deliberately choose categories with more than samples ( categories in total) as the training data.

3.2 Benchmark Methods and Experimental Settings

We benchmark the dataset on an FSL suite proposed by [1]. The suite contains 2 strong baseline methods (denoted as baseline and baseline++ following [1]) and 4 FSL methods including Relation Net [15], Model-Agnostic Meta-Learning (MAML) [3], Matching Net (MN) [18], and Prototypical Net (PN) [13]. The baseline methods are carefully designed and outperform FSL methods in some cases. We refer readers to [1] for details. The four FSL methods are regarded as the state-of-the-art FSL baselines in recent benchmark literature [17, 1], and we train them with CE as our baselines except for the Relation Net, which is trained with Mean Square Error (MSE) Loss following the original paper. We apply the proposed QR loss to MN and PN since these two methods have proven to have superior and stable performance in natural image classification [1]. The model trained with JSD-based MI maximization Eq. (5) is denoted as JSD MI, and models trained with the proposed QR loss Eq. (2) and QR loss with adaptive hard margin Eq. (6) are denoted as QR and QR+M, respectively.

For the network structure, we follow the commonly adopted FSL settings [15, 1]. The feature embedding network used in this paper is a convolutional neural network with four convolutional blocks, each containing a convolutional layer with 64 filters of size , a batch normalization layer, a max-pooling layer, and a Leaky ReLU layer. For the experimental settings, the episodic training mechanism is applied to all FSL models, and episodes are constructed in total during training for all methods. For validation and testing, 600 episodes are randomly constructed from the validation and test set, respectively. We conduct 5-way 1-shot and 5-way 5-shot classification tasks on the collected Dermnet dataset, and 5 query samples are provided for each category within each episode for both training and testing. For optimization, we adopt the Adam algorithm with a learning rate of . Experiments are run five times and we report the performance on the test set corresponding to the best validation results. The average Accuracy, Precision, and F1 score with 95% confidence interval are reported.
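As an illustration of how an average score with a 95% confidence interval over test episodes can be computed, here is a minimal sketch. It uses the normal approximation (mean ± 1.96 standard errors), which is an assumption, since the text does not state how the interval is obtained. The per-episode accuracies are simulated, not taken from the paper.

```python
import math
import random
import statistics

def mean_with_ci95(values):
    """Return (mean, half-width of the 95% interval) under a normal approximation."""
    mean = statistics.mean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width

rng = random.Random(0)
# Simulated accuracies for 600 episodes of 25 queries (5-way, 5 queries each),
# where each query is classified correctly with probability 0.5.
episode_acc = [sum(rng.random() < 0.5 for _ in range(25)) / 25 for _ in range(600)]
mean, half = mean_with_ci95(episode_acc)
print(f"{100 * mean:.2f} +/- {100 * half:.2f} %")
```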

Methods         5-way 1-shot                   5-way 5-shot
                ACC%    Precision%  F1%        ACC%    Precision%  F1%
Baseline        39.89   40.57       37.16      59.87   62.37       58.19
Baseline++      42.47   43.70       40.34      63.37   65.80       61.75
MAML            45.95   44.82       42.18      66.93   69.24       64.92
Relation Net    45.50   46.36       44.00      62.53   64.90       62.26
MN              44.59   44.96       41.52      61.21   63.15       58.29
MN+JSD MI       43.28   43.26       40.00      58.99   60.23       55.78
MN+QR           48.01   48.87       44.30      67.09   69.18       64.53
MN+QR+M         49.29   49.95       45.64      66.83   69.10       64.25
MN+QR*          48.66   48.86       44.98      -       -           -
MN+QR+M*        49.76   49.52       46.01      -       -           -
PN              46.77   46.82       43.58      62.06   63.39       59.50
PN+JSD MI       47.55   47.90       44.33      61.15   61.74       58.34
PN+QR           49.85   49.53       46.34      70.38   72.13       68.50
PN+QR+M         52.41   53.21       49.52      71.99   74.23       70.30
PN+QR*          50.62   50.83       47.16      -       -           -
PN+QR+M*        53.30   53.69       50.45      -       -           -
Table 1: Experimental results on the Dermnet skin disease classification dataset. * denotes that the model is trained with 9 query samples per episode. - denotes that the setting is not applicable. M denotes our methods with an adaptive hard margin.

3.3 Result Analysis

The experimental results are reported in Table 1, and we draw several observations from them. First, the baseline methods with minibatch training and CE loss perform reasonably well in practice, and the FSL methods trained with CE loss have comparable or only slightly better performance. In contrast, FSL methods trained with the proposed QR loss significantly outperform both the baseline methods and the FSL methods trained with CE. For Matching Net, our QR loss improves accuracy over the CE loss by 3.42% and 5.88% for the 5-way 1-shot and 5-way 5-shot tasks, respectively. Significant improvements are also observed for PN, where our QR loss outperforms CE by 3.08% and 8.32% for 5-way 1-shot and 5-way 5-shot, respectively. The improvements are obtained by fully utilizing the cross-sample information and avoiding the information loss caused by manually designed support sample aggregation during training. Second, we compare the QR loss with JSD MI. Although the formulations are similar, QR is significantly better than JSD MI. We attribute this to the fact that JSD MI does not mutually utilize the information in the positive and negative sets. Third, the adaptive hard margin boosts the performance of the models trained with QR in most settings. For example, it improves PN trained with QR by 2.56% and 1.61% for 5-way 1-shot and 5-way 5-shot, respectively; the one exception is MN on the 5-way 5-shot task, where accuracy drops slightly from 67.09% to 66.83%. Finally, our method can be further boosted by increasing the number of query samples per episode (rows marked * in Table 1). Overall, our proposed FSL methods classify skin diseases with only a few available training samples and make it possible to diagnose rare diseases using modern neural networks.
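The accuracy differences quoted in this analysis can be recomputed from Table 1. The short script below copies the accuracy columns of the relevant rows and subtracts them; the dictionary layout and the helper name `gain` are only for illustration.

```python
# (5-way 1-shot ACC%, 5-way 5-shot ACC%) copied from Table 1.
table1_acc = {
    "MN":      (44.59, 61.21),
    "MN+QR":   (48.01, 67.09),
    "MN+QR+M": (49.29, 66.83),
    "PN":      (46.77, 62.06),
    "PN+QR":   (49.85, 70.38),
    "PN+QR+M": (52.41, 71.99),
}

def gain(method, reference):
    """Accuracy difference (method minus reference) for both settings."""
    return tuple(round(a - b, 2) for a, b in zip(table1_acc[method], table1_acc[reference]))

qr_gain = {m: gain(f"{m}+QR", m) for m in ("MN", "PN")}
margin_gain = {m: gain(f"{m}+QR+M", f"{m}+QR") for m in ("MN", "PN")}
print("QR over CE:", qr_gain)
print("Margin over QR:", margin_gain)
```

The gains of QR over CE are positive in all four cases, while the gain from the margin is positive in three of the four cases and slightly negative for MN in the 5-way 5-shot setting.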

3.4 Influence of the number of shots and ways

For simplicity, we only conduct experiments on PN with various ways and shots and report the accuracy. As shown in Tables 2 and 3, QR has clear advantages over CE when more samples are available per episode, suggesting that QR can better utilize the cross-sample information.

# shots   1       2       3       4       5
CE        46.77   54.04   57.15   59.65   62.06
QR        49.85   62.37   66.87   68.95   70.38
Table 2: 5-way different-shot. (ACC%)

# ways    2       3       5       10      20
CE        69.80   59.42   46.77   35.78   24.30
QR        72.02   61.57   49.85   40.37   31.31
Table 3: Different-way 1-shot. (ACC%)
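To see how the advantage of QR over CE changes with the episode size, the accuracy gaps can be computed from the values in Tables 2 and 3. This is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
shots = [1, 2, 3, 4, 5]
ce_by_shot = [46.77, 54.04, 57.15, 59.65, 62.06]
qr_by_shot = [49.85, 62.37, 66.87, 68.95, 70.38]

ways = [2, 3, 5, 10, 20]
ce_by_way = [69.80, 59.42, 46.77, 35.78, 24.30]
qr_by_way = [72.02, 61.57, 49.85, 40.37, 31.31]

def gaps(qr, ce):
    """QR accuracy minus CE accuracy, in percentage points."""
    return [round(q - c, 2) for q, c in zip(qr, ce)]

shot_gap = dict(zip(shots, gaps(qr_by_shot, ce_by_shot)))
way_gap = dict(zip(ways, gaps(qr_by_way, ce_by_way)))
print("gap by shots:", shot_gap)
print("gap by ways: ", way_gap)
```

The gap is smallest in the 1-shot setting and largest in the 20-way setting, which matches the statement that QR benefits from having more samples in an episode.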

4 Conclusions

We propose to apply Few-Shot Learning to the classification of rare skin diseases. We find that existing FSL methods do not perform significantly better than the baseline methods. Through careful analysis, we believe the problem should be largely attributed to the incompatibility between the episode training mechanism and the cross entropy loss. Therefore, we propose a novel QR loss for FSL that makes full use of the information across samples and also allows the model to learn to extract information from the support samples guided by the training objective. With the proposed QR loss, the state-of-the-art FSL methods perform consistently better than the same methods trained with the conventional CE loss. Our work demonstrates the promise of diagnosing rare skin diseases with one or a few labeled samples. In the future, we will investigate extensions to other medical classification problems or even natural image classification.


References

  • [1] W. Chen, Y. Liu, Z. Kira, Y. Wang, and J. Huang (2019) A closer look at few-shot classification. In International Conference on Learning Representations.
  • [2] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639), pp. 115.
  • [3] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
  • [4] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
  • [5] W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, and J. Luo (2019) Revisiting local descriptor based image-to-class measure for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7260–7268.
  • [6] W. Li, J. Xu, J. Huo, L. Wang, Y. Gao, and J. Luo (2019) Distribution consistency based covariance metric networks for few-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8642–8649.
  • [7] H. Liao, Y. Li, and J. Luo (2016) Skin disease classification versus skin lesion characterization: achieving robust diagnosis using multi-label deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 355–360.
  • [8] H. Liao and J. Luo (2018) A deep multi-task learning approach to skin lesion classification. arXiv preprint arXiv:1812.03527.
  • [9] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368.
  • [10] D. A. Okuboyejo, O. O. Olugbara, and S. A. Odunaike (2013) Automating skin disease diagnosis using image classification. In Proceedings of the World Congress on Engineering and Computer Science, Vol. 2, pp. 850–854.
  • [11] Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin (2019) SoftTriple loss: deep metric learning without triplet sampling. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6450–6458.
  • [12] X. Shi, Q. Dou, C. Xue, J. Qin, H. Chen, and P. Heng. An active learning approach for reducing annotation cost in skin lesion analysis. H. Suk, M. Liu, P. Yan, and C. Lian (Eds.).
  • [13] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
  • [14] R. Sumithra, M. Suhil, and D. Guru (2015) Segmentation and classification of skin lesions for disease diagnosis. Procedia Computer Science 45, pp. 76–85.
  • [15] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H.S. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
  • [17] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P. Manzagol, et al. (2019) Meta-dataset: a dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096.
  • [18] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638.
  • [19] C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848.
  • [20] H. Ye, H. Chen, D. Zhan, and W. Chao (2020) Identifying and compensating for feature deviation in imbalanced deep learning. arXiv preprint arXiv:2001.01385.