Learning and Exploiting Interclass Visual Correlations for Medical Image Classification

by Dong Wei, et al.

Deep neural network-based medical image classifications often use "hard" labels for training, where the probability of the correct category is 1 and those of others are 0. However, these hard targets can make the networks over-confident about their predictions and prone to overfitting the training data, affecting model generalization and adaptation. Studies have shown that label smoothing and softening can improve classification performance. Nevertheless, existing approaches are either non-data-driven or limited in applicability. In this paper, we present the Class-Correlation Learning Network (CCL-Net) to learn interclass visual correlations from given training data, and produce soft labels to help with classification tasks. Instead of letting the network directly learn the desired correlations, we propose to learn them implicitly via distance metric learning of class-specific embeddings with a lightweight plugin CCL block. An intuitive loss based on a geometrical explanation of correlation is designed for bolstering learning of the interclass correlations. We further present end-to-end training of the proposed CCL block as a plugin head together with the classification backbone while generating soft labels on the fly. Our experimental results on the International Skin Imaging Collaboration 2018 dataset demonstrate effective learning of the interclass correlations from training data, as well as consistent improvements in performance upon several widely used modern network structures with the CCL block.



1 Introduction

Computer-aided diagnosis (CAD) has important applications in medical image analysis, such as disease diagnosis and grading [doi2005CAD, doi2007CAD]. Benefiting from the progress of deep learning techniques, automated CAD methods have advanced remarkably in recent years, and are now dominated by learning-based classification methods using deep neural networks [ker2017survey, litjens2017survey, shen2017survey]. Notably, these methods mostly use “hard” labels as their targets for learning, where the probability for the correct category is 1 and those for others are 0. However, these hard targets may adversely affect model generalization and adaptation, as the networks become over-confident about their predictions when trained to produce extreme values of 0 or 1, and prone to overfitting the training data [szegedy2016rethinking].

Studies showed that label smoothing regularization (LSR) can improve classification performance to a limited extent [muller2019whenDoesLSHelp, szegedy2016rethinking], although the smooth labels obtained by uniformly/proportionally distributing probabilities do not represent genuine interclass relations in most circumstances. Gao et al. [gao2017DLDL] proposed to convert the image label to a discrete label distribution and use deep convolutional networks to learn from ground truth label distributions. However, the ground truth distributions were constructed from empirically defined rules rather than learned from data. Chen et al. [chen2019GCN] modeled the interlabel dependencies as a correlation matrix for graph convolutional network based multilabel classification, by mining pairwise cooccurrence patterns of labels. Although data-driven, this approach is not applicable to single-label scenarios, in which labels do not occur together. Arguably, softened labels reflecting real data characteristics improve upon the hard labels by more effectively utilizing available training data, as the samples from one specific category can help with training of similar categories [chen2019GCN, gao2017DLDL]. For medical images, there exist rich interclass visual correlations, which have great potential yet remain largely unexploited for this purpose.

In this paper, we present the Class-Correlation Learning Network (CCL-Net) to learn interclass visual correlations from given training data, and utilize the learned correlations to produce soft labels to help with the classification tasks (Fig. 1). Instead of directly letting the network learn the desired correlations, we propose implicit learning of the correlations via distance metric learning [sohn2016DML, zhe2019DML] of class-specific embeddings [mikolov2013word2vec] with a lightweight plugin CCL block. A new loss is designed for bolstering learning of the interclass correlations. We further present end-to-end training of the proposed CCL block together with the classification backbone, while enhancing the classification ability of the latter with soft labels generated on the fly. In summary, our contributions are threefold:

  • We propose a CCL block for data-driven interclass correlation learning and label softening, based on distance metric learning of class embeddings. This block is conceptually simple, lightweight, and can be readily plugged as a CCL head into any mainstream backbone network for classification.

  • We design an intuitive new loss based on a geometrical explanation of correlation to help with the CCL, and present integrated end-to-end training of the plugged CCL head together with the backbone network.

  • We conduct thorough experiments on the International Skin Imaging Collaboration (ISIC) 2018 dataset. Results demonstrate effective data-driven CCL, and consistent performance improvements upon widely used modern network structures utilizing the learned soft label distributions.

2 Method

2.0.1 Preliminaries

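Before presenting our proposed CCL block, let us first review the generic deep learning pipeline for single-label multiclass classification problems as preliminaries. As shown in Fig. 1(b), an input $x$ first goes through a feature extracting function $f_1$ parameterized as a network with parameters $\theta_1$, producing a feature vector $\mathbf{f}$ for classification: $\mathbf{f} = f_1(x \mid \theta_1)$, where $\mathbf{f} \in \mathbb{R}^{n_1 \times 1}$. Next, $\mathbf{f}$ is fed to a generalized fully-connected (fc) layer $f_{\mathrm{fc}}$ (parameterized with $\theta_{\mathrm{fc}}$) followed by the softmax function, obtaining the predicted classification probability distribution $\mathbf{q}^{\mathrm{cls}}$ for $x$: $\mathbf{q}^{\mathrm{cls}} = \mathrm{softmax}\big(f_{\mathrm{fc}}(\mathbf{f} \mid \theta_{\mathrm{fc}})\big)$, where $\mathbf{q}^{\mathrm{cls}} = [q^{\mathrm{cls}}_1, \ldots, q^{\mathrm{cls}}_K]^T$, $K$ is the number of classes, and $\sum_k q^{\mathrm{cls}}_k = 1$. To supervise the training of $f_1$ and $f_{\mathrm{fc}}$ (and the learning of $\theta_1$ and $\theta_{\mathrm{fc}}$ correspondingly), for each training sample $x$, a label $y \in \{1, \ldots, K\}$ is given. This label is then converted to the one-hot distribution [szegedy2016rethinking]: $\mathbf{y} = [\delta_{1,y}, \ldots, \delta_{K,y}]^T$, where $\delta_{k,y}$ is the Dirac delta function, which equals 1 if $k = y$ and 0 otherwise. After that, the cross entropy loss $l_{\mathrm{CE}}$ can be used to compute the loss between $\mathbf{q}^{\mathrm{cls}}$ and $\mathbf{y}$: $l_{\mathrm{CE}}(\mathbf{q}^{\mathrm{cls}}, \mathbf{y}) = -\sum_k \delta_{k,y} \log q^{\mathrm{cls}}_k$. Lastly, the loss is optimized by an optimizer (e.g., Adam [kingma2014adam]), and $\theta_1$ and $\theta_{\mathrm{fc}}$ are updated by gradient descent algorithms.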
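The hard-label pipeline above can be sketched in a few lines of plain Python. This is only an illustration: the logits are made-up numbers standing in for the output of the fc layer, labels are 0-based here, and the helper names (`softmax`, `one_hot`, `cross_entropy`) are hypothetical rather than taken from the authors' code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, num_classes):
    """One-hot target: 1 at the labelled class (0-based), 0 elsewhere."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def cross_entropy(q, target, eps=1e-12):
    """Cross entropy -sum_k target_k * log(q_k); eps guards against log(0)."""
    return -sum(t * math.log(qk + eps) for t, qk in zip(target, q) if t > 0.0)

# Hypothetical fc-layer outputs for a K = 3 problem (illustrative values only).
logits = [2.0, 0.5, -1.0]
q_cls = softmax(logits)          # predicted distribution
y = one_hot(0, 3)                # hard label for class 0
loss = cross_entropy(q_cls, y)   # reduces to -log(q_cls[0]) for a one-hot target
```

With a one-hot target, only the predicted probability of the correct class enters the loss, which is why the network is pushed towards outputs of exactly 0 or 1.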

Figure 1: Diagram of the proposed CCL-Net. (a) Soft label distributions ($\mathbf{p}^{\mathrm{SL}}$) are learned from given training data and hard labels $y$. (b) Generic deep learning pipeline for classification. (c) Structure of the proposed CCL block. This lightweight block can be plugged into any classification backbone network as a head and trained end-to-end together, and boost performance by providing additional supervision signals with $\mathbf{p}^{\mathrm{SL}}$.

As mentioned earlier, the “hard” labels may adversely affect model generalization as the networks become over-confident about their predictions. In this sense, LSR [szegedy2016rethinking] is a popular technique to make the network less confident by smoothing the one-hot label distribution $\mathbf{y}$ to become $\mathbf{y}^{\mathrm{LSR}}$, where $y^{\mathrm{LSR}}_k = (1 - \epsilon)\,\delta_{k,y} + \epsilon\, u_k$, $\epsilon$ is a weight, and $\mathbf{u} = [u_1, \ldots, u_K]^T$ is a distribution over class labels for which the uniform [muller2019whenDoesLSHelp] or a priori [muller2019whenDoesLSHelp, szegedy2016rethinking] distributions are proposed. Then, the cross entropy is computed with $\mathbf{y}^{\mathrm{LSR}}$ instead of $\mathbf{y}$.
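As a small illustration of LSR, the sketch below builds smoothed targets with either a uniform or an a priori distribution. The function name `smooth_labels` and the weight of 0.1 are illustrative assumptions; the class counts are those of the ISIC 2018 dataset reported in the experiments section, in the order MEL, NV, BCC, AKIEC, BKL, DF, VASC.

```python
def smooth_labels(label, num_classes, epsilon, u=None):
    """Return (1 - epsilon) * one_hot(label) + epsilon * u; u defaults to uniform."""
    if u is None:
        u = [1.0 / num_classes] * num_classes
    return [(1.0 - epsilon) * (1.0 if k == label else 0.0) + epsilon * u[k]
            for k in range(num_classes)]

# Uniform smoothing for a 7-class problem (epsilon chosen only for illustration).
y_uniform = smooth_labels(label=5, num_classes=7, epsilon=0.1)

# A priori smoothing: u is the empirical class frequency.
counts = [1113, 6705, 514, 327, 1099, 115, 142]
prior = [c / sum(counts) for c in counts]
y_prior = smooth_labels(label=5, num_classes=7, epsilon=0.1, u=prior)
```

In both cases the redistributed mass ignores visual similarity: uniform smoothing treats all other classes alike, and a priori smoothing sends most of the mass to the dominant class (NV here) regardless of the true class.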

2.0.2 Learning Interclass Visual Correlations for Label Softening

In most circumstances, LSR cannot reflect the genuine interclass relations underlying the given training data. Intuitively, the probability redistribution should be biased towards visually similar classes, so that samples from these classes can boost training of each other. For this purpose, we propose to learn the underlying visual correlations among classes from the training data and produce soft label distributions that more authentically reflect intrinsic data properties. Rather than learning the desired correlations directly, we learn them implicitly by learning interrelated yet discriminative class embeddings via distance metric learning. Both the concepts of feature embeddings and deep metric learning have proven useful in the literature (e.g., [mikolov2013word2vec, sohn2016DML, zhe2019DML]). To the best of our knowledge, however, combining them for data-driven learning of interclass visual correlations and label softening has not been done before.

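The structure of the CCL block is shown in Fig. 1(c), which consists of a lightweight embedding function $f_2$ (parameterized with $\theta_2$), a dictionary $\theta_{\mathrm{dict}}$, a distance metric function $d(\cdot,\cdot)$, and two loss functions. Given the feature vector $\mathbf{f}$ extracted by $f_1$, the embedding function projects $\mathbf{f}$ into the embedding space: $\mathbf{e} = f_2(\mathbf{f} \mid \theta_2)$, where $\mathbf{e} \in \mathbb{R}^{n_2 \times 1}$. The dictionary maintains all the class-specific embeddings: $\theta_{\mathrm{dict}} = \{\mathbf{c}_k\}_{k=1}^{K}$. Using $d(\cdot,\cdot)$, the distance between the input embedding and every class embedding can be calculated by $d_k = d(\mathbf{e}, \mathbf{c}_k)$. In this work, we use $d(\mathbf{e}_1, \mathbf{e}_2) = \left\| \mathbf{e}_1/\|\mathbf{e}_1\| - \mathbf{e}_2/\|\mathbf{e}_2\| \right\|$, where $\|\cdot\|$ is the L2 norm. Letting $\mathbf{d} = [d_1, \ldots, d_K]^T$, we can predict another classification probability distribution $\mathbf{q}^{\mathrm{CCL}}$ based on the distance metric: $\mathbf{q}^{\mathrm{CCL}} = \mathrm{softmax}(-\mathbf{d})$, and a cross entropy loss $l_{\mathrm{CE}}(\mathbf{q}^{\mathrm{CCL}}, \mathbf{y})$ can be computed accordingly. To enforce interrelations among the class embeddings, we propose the class correlation loss $l_{\mathrm{CC}}$:

$$l_{\mathrm{CC}}(\theta_{\mathrm{dict}}) = \frac{1}{K^2} \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} \big| d(\mathbf{c}_{k_1}, \mathbf{c}_{k_2}) - b \big|_{+},$$

where $|\cdot|_{+}$ is the Rectified Linear Unit (ReLU), and $b$ is a margin parameter. Intuitively, $l_{\mathrm{CC}}$ enforces the class embeddings to be no further than a distance $b$ from each other in the embedding space, encouraging correlations among them. Rather than attempting to tune the exact value of $b$, we resort to the geometrical meaning of correlation between two vectors: if the angle between two vectors is smaller than 90°, they are considered correlated. Or equivalently for L2-normed vectors, if the Euclidean distance between them is smaller than $\sqrt{2}$, they are considered correlated. Hence, we set $b = \sqrt{2}$. Then, the total loss function for the CCL block is defined as $\mathcal{L}_{\mathrm{CCL}} = l_{\mathrm{CE}}(\mathbf{q}^{\mathrm{CCL}}, \mathbf{y}) + \alpha_{\mathrm{CC}}\, l_{\mathrm{CC}}(\theta_{\mathrm{dict}})$, where $\alpha_{\mathrm{CC}}$ is a weight. Thus, $\theta_2$ and $\theta_{\mathrm{dict}}$ are updated by optimizing $\mathcal{L}_{\mathrm{CCL}}$.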
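The equivalence between the 90° angle criterion and a Euclidean distance threshold of $\sqrt{2}$ follows from a short calculation for unit-length vectors. The symbols $\mathbf{a}$, $\mathbf{b}$, and $\varphi$ below are introduced only for this worked step:

```latex
% a, b: L2-normalized vectors (||a|| = ||b|| = 1); \varphi: angle between them.
\begin{aligned}
\|\mathbf{a} - \mathbf{b}\|^2
  &= \|\mathbf{a}\|^2 + \|\mathbf{b}\|^2 - 2\,\mathbf{a}^{T}\mathbf{b}
   = 2 - 2\cos\varphi, \\
\varphi < 90^{\circ}
  &\iff \cos\varphi > 0
   \iff \|\mathbf{a} - \mathbf{b}\|^2 < 2
   \iff \|\mathbf{a} - \mathbf{b}\| < \sqrt{2}.
\end{aligned}
```

Orthogonal unit vectors sit exactly at distance $\sqrt{2}$, which is the boundary case between correlated and uncorrelated in this geometric reading.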

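After training, we define the soft label distribution $\mathbf{p}^{\mathrm{SL}}_k$ for class $k$ as:

$$\mathbf{p}^{\mathrm{SL}}_k = \mathrm{softmax}\Big( -\big[ d(\mathbf{c}_1, \mathbf{c}_k), \ldots, d(\mathbf{c}_K, \mathbf{c}_k) \big]^T \Big),$$

where $d(\cdot,\cdot)$ is the distance metric of the CCL block and $\mathbf{c}_1, \ldots, \mathbf{c}_K$ are the learned class embeddings.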
It is worth noting that by this definition of soft label distributions, the correct class always has the largest probability, which is a desired property especially at the start of training. Next, we describe our end-to-end training scheme of the CCL block together with a backbone classification network.
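To make the construction concrete, here is a minimal plain-Python sketch of the normalized distance, a hinge-style class correlation loss with margin $\sqrt{2}$, and the resulting soft labels. The toy 2-D embeddings, the function names, and the averaging of the loss over all class pairs are assumptions made for illustration, not details confirmed by the text.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def distance(e1, e2):
    """Euclidean distance between the L2-normalized versions of e1 and e2."""
    a, b = l2_normalize(e1), l2_normalize(e2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

def class_correlation_loss(class_embeddings, margin=math.sqrt(2.0)):
    """Hinge penalty on pairwise distances above the margin (averaging is assumed)."""
    k = len(class_embeddings)
    total = sum(max(0.0, distance(ci, cj) - margin)
                for ci in class_embeddings for cj in class_embeddings)
    return total / (k * k)

def soft_labels(class_embeddings):
    """Row j: softmax over negative distances from class j to every class."""
    return [softmax([-distance(cj, ck) for ck in class_embeddings])
            for cj in class_embeddings]

# Hypothetical class embeddings: classes 0 and 1 are close, class 2 is far away.
embeddings = [[1.0, 0.0], [0.9, 0.3], [-1.0, 0.2]]
p_sl = soft_labels(embeddings)
l_cc = class_correlation_loss(embeddings)
```

Because a class is at distance 0 from itself, the diagonal entry of each row is always the largest, matching the property noted above. In this toy setup, class 0 gives more probability to its neighbor class 1 than to the distant class 2, and the loss is positive because class 2 lies further than $\sqrt{2}$ from the other two.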

2.0.3 Integrated End-to-End Training with Classification Backbone

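As a lightweight module, the proposed CCL block can be plugged into any mainstream classification backbone network (as long as a feature vector can be pooled for an input image) and trained together in an end-to-end manner. To utilize the learned soft label distributions, we introduce a Kullback-Leibler divergence (KL div.) loss $l_{\mathrm{KL}}(\mathbf{p}^{\mathrm{SL}}_{y} \,\|\, \mathbf{q}^{\mathrm{cls}}) = \sum_k p^{\mathrm{SL}}_{y,k} \log \big( p^{\mathrm{SL}}_{y,k} / q^{\mathrm{cls}}_k \big)$ in the backbone network (Fig. 1), where $p^{\mathrm{SL}}_{y,k}$ is the $k$-th element of the soft label distribution of the true class $y$, and the total loss function for classification becomes

$$\mathcal{L}_{\mathrm{cls}} = l_{\mathrm{CE}}(\mathbf{q}^{\mathrm{cls}}, \mathbf{y}) + l_{\mathrm{KL}}(\mathbf{p}^{\mathrm{SL}}_{y} \,\|\, \mathbf{q}^{\mathrm{cls}}).$$

Considering that $l_{\mathrm{CC}}$ tries to keep the class embeddings within a certain distance of each other, it is somewhat adversarial to the goal of the backbone network, which tries to push them away from each other as much as possible. In such cases, alternating training schemes are usually employed [goodfellow2014generative], and we follow this practice. Briefly, in each training iteration, the backbone network is first updated with the CCL head frozen, and then it is frozen to update the CCL head; more details about the training scheme are provided in Algorithm 1. After training, the prediction is made according to $\mathbf{q}^{\mathrm{cls}}$: $\hat{y} = \arg\max_k q^{\mathrm{cls}}_k$.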
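The backbone's combined objective can be illustrated with a short plain-Python sketch. The prediction and soft label below are made-up numbers, the helper names are hypothetical, and the unweighted sum of the two terms is an assumption of this sketch.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_k p_k * log(p_k / q_k), skipping zero-mass entries of p."""
    return sum(pk * math.log(pk / (qk + eps)) for pk, qk in zip(p, q) if pk > 0.0)

def cross_entropy_hard(q, label, eps=1e-12):
    """Cross entropy against a one-hot target reduces to -log q[label]."""
    return -math.log(q[label] + eps)

def classification_loss(q_cls, label, soft_label):
    """Hard-label cross entropy plus KL divergence to the soft label (assumed unweighted)."""
    return cross_entropy_hard(q_cls, label) + kl_divergence(soft_label, q_cls)

q_cls = [0.7, 0.2, 0.1]        # hypothetical backbone prediction
soft_label = [0.6, 0.3, 0.1]   # hypothetical soft label of the true class 0
loss = classification_loss(q_cls, 0, soft_label)
```

The KL term vanishes only when the prediction matches the soft label exactly, so it pulls the backbone away from fully one-hot outputs while the cross entropy term still rewards the correct class.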

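Input: Training images $\{x\}$ and labels $\{y\}$
Output: Learned network parameters $\{\theta_1, \theta_{\mathrm{fc}}, \theta_2, \theta_{\mathrm{dict}}\}$
1:  Initialize $\theta_1, \theta_{\mathrm{fc}}, \theta_2, \theta_{\mathrm{dict}}$
2:  for number of training epochs do
3:     for number of minibatches do
4:        Compute soft label distributions $\{\mathbf{p}^{\mathrm{SL}}_k\}$ from $\theta_{\mathrm{dict}}$
5:        Sample minibatch of $m$ images $\{x^{(i)}\}$, compute $\{\mathbf{f}^{(i)}\}$
6:        Update $\theta_1$ and $\theta_{\mathrm{fc}}$ by stochastic gradient descent: $\nabla_{\theta_1, \theta_{\mathrm{fc}}} \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}_{\mathrm{cls}}$
7:        Update $\theta_2$ and $\theta_{\mathrm{dict}}$ by stochastic gradient descent: $\nabla_{\theta_2, \theta_{\mathrm{dict}}} \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}_{\mathrm{CCL}}$
8:     end for
9:  end for
Algorithm 1 End-to-end training of the proposed CCL-Net.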

During training, we notice that the learned soft label distributions sometimes start to linger around a fixed value for the correct classes and distribute about evenly across other classes after certain epochs; in other words, they approximately collapse into LSR with uniform distribution (the proof is provided in the supplementary material). This is because of the strong capability of deep neural networks in fitting any data [zhang2016understanding]. As $l_{\mathrm{CC}}$ forces the class embeddings to be no further than $b$ from each other in the embedding space, a network of sufficient capacity has the potential to make them exactly the distance $b$ away from each other (which is overfitting) when well trained. To prevent this from happening, we use the average correct-class probability $\bar{p}^{\mathrm{SL}} = \frac{1}{K} \sum_k p^{\mathrm{SL}}_{k,k}$ as a measure of the total softness of the label set (the lower the softer, as the correct classes distribute more probabilities to other classes), and consider that the CCL head has converged if $\bar{p}^{\mathrm{SL}}$ does not drop for 10 consecutive epochs. In such case, $\theta_{\mathrm{dict}}$ is frozen, whereas the rest of the CCL-Net keeps updating.
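The convergence rule can be sketched as a small monitor that tracks the softness measure epoch by epoch. The class name `SoftnessMonitor` is hypothetical, and reading "does not drop" as "does not fall below the lowest value seen so far" is an assumption of this sketch; comparing against the previous epoch would be another plausible reading.

```python
def average_correct_class_probability(soft_label_matrix):
    """Mean of the diagonal of the K x K soft label matrix (lower means softer)."""
    k = len(soft_label_matrix)
    return sum(soft_label_matrix[i][i] for i in range(k)) / k

class SoftnessMonitor:
    """Signals that the class embeddings should be frozen once the softness
    measure has not dropped for `patience` consecutive epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.epochs_without_drop = 0
        self.frozen = False

    def update(self, value):
        """Record one epoch's softness value; return True once frozen."""
        if self.frozen:
            return True
        if value < self.best:
            self.best = value
            self.epochs_without_drop = 0
        else:
            self.epochs_without_drop += 1
            if self.epochs_without_drop >= self.patience:
                self.frozen = True
        return self.frozen

# Toy run with a short patience so the behaviour is visible.
monitor = SoftnessMonitor(patience=3)
history = [monitor.update(v) for v in [0.9, 0.8, 0.7, 0.7, 0.71, 0.7]]
```

In a training loop, one would call `update` once per epoch with the current average correct-class probability and stop updating the class embeddings as soon as it returns `True`, while the backbone and the embedding function keep training.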

3 Experiments

3.0.1 Dataset and Evaluation Metrics

The ISIC 2018 dataset is provided by the Skin Lesion Analysis Toward Melanoma Detection 2018 challenge [codella2019ISIC], for prediction of seven disease categories with dermoscopic images, including: melanoma (MEL), melanocytic nevus (NV), basal cell carcinoma (BCC), actinic keratosis/Bowen’s disease (AKIEC), benign keratosis (BKL), dermatofibroma (DF), and vascular lesion (VASC) (example images are provided in Fig. 2). It comprises 10,015 dermoscopic images, including 1,113 MEL, 6,705 NV, 514 BCC, 327 AKIEC, 1,099 BKL, 115 DF, and 142 VASC images. We randomly split the data into a training and a validation set (80:20 split) while keeping the original interclass ratios, and report evaluation results on the validation set. The employed evaluation metrics include accuracy, Cohen’s kappa coefficient [cohen1960coefficient], and unweighted means of F1 score and Jaccard similarity coefficient.
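For reference, the four metrics can all be derived from a confusion matrix. The sketch below is a plain-Python illustration using the standard definitions (macro-averaging for F1 and Jaccard), with hypothetical function names and toy labels; it is not the evaluation code used for the reported results.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def evaluate(y_true, y_pred, num_classes):
    """Accuracy, Cohen's kappa, and unweighted (macro) means of F1 and Jaccard."""
    cm = confusion_matrix(y_true, y_pred, num_classes)
    n = len(y_true)
    accuracy = sum(cm[k][k] for k in range(num_classes)) / n
    f1_scores, jaccards, expected = [], [], 0.0
    for k in range(num_classes):
        tp = cm[k][k]
        n_true = sum(cm[k])                                  # samples of class k
        n_pred = sum(cm[i][k] for i in range(num_classes))   # predictions of class k
        fp, fn = n_pred - tp, n_true - tp
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
        jaccards.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        expected += (n_true / n) * (n_pred / n)              # chance agreement
    kappa = (accuracy - expected) / (1.0 - expected) if expected < 1.0 else 0.0
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "f1": sum(f1_scores) / num_classes,
        "jaccard": sum(jaccards) / num_classes,
    }

# Toy labels for a 3-class problem (illustrative values only).
metrics = evaluate([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
```

Because F1 and Jaccard are averaged per class without weighting, the rare classes (DF, VASC) count as much as the dominant NV class, and kappa discounts the agreement expected by chance under the class imbalance.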

Figure 2: Official example images of the seven diseases in the ISIC 2018 disease classification dataset [codella2019ISIC].

3.0.2 Implementation

In this work, we use commonly adopted, straightforward training schemes to demonstrate effectiveness of the proposed CCL-Net in data-driven learning of interclass visual correlations and improving classification performance, rather than sophisticated training strategies or heavy ensembles. Specifically, we adopt the stochastic gradient descent optimizer with a momentum of 0.9 and weight decay, with the backbone learning rate initialized to 0.1 for all experiments. The learning rate for the CCL head (denoted by $lr_{\mathrm{CCL}}$) is initialized to 0.0005 for all experiments, except when we study the impact of varying $lr_{\mathrm{CCL}}$. Following [he2016resnet], we multiply the initial learning rates by 0.1 twice during training such that the learning process can saturate at a higher limit. (The exact learning-rate-changing epochs, as well as the total number of training epochs, vary for different backbones due to different network capacities.) A minibatch of 128 images is used. The input images are resized to have a short side of 256 pixels while maintaining original aspect ratios. Online data augmentations including random cropping, horizontal and vertical flipping, and color jittering are employed for training. For testing, a single central crop of size 224 × 224 pixels is used as input. Gradient clipping is employed for stable training.

The weight $\alpha_{\mathrm{CC}}$ of the class correlation loss is set to 10 empirically. The experiments are implemented using the PyTorch package. A single Tesla P40 GPU is used for model training and testing. For the embedding function $f_2$, we use three fc layers of width 1024, 1024, and 512 (i.e., $n_2 = 512$), with batch normalization and ReLU in between.

Backbone                                 Method           Accuracy  F1 score  Kappa   Jaccard
ResNet-18 [he2016resnet]                 Baseline         0.8347    0.7073    0.6775  0.5655
                                         LSR-u1           0.8422    0.7220    0.6886  0.5839
                                         LSR-u5           0.8437    0.7211    0.6908  0.5837
                                         LSR-a1           0.7883    0.6046    0.5016  0.4547
                                         LSR-a5           0.7029    0.2020    0.1530  0.1470
                                         CCL-Net (ours)   0.8502    0.7227    0.6986  0.5842
EfficientNet-B0 [tan2019efficientnet]    Baseline         0.8333    0.7190    0.6696  0.5728
                                         LSR-u1           0.8382    0.7014    0.6736  0.5573
                                         LSR-u5           0.8432    0.7262    0.6884  0.5852
                                         LSR-a1           0.8038    0.6542    0.5526  0.5058
                                         LSR-a5           0.7189    0.2968    0.2295  0.2031
                                         CCL-Net (ours)   0.8482    0.7390    0.6969  0.6006
MobileNetV2 [sandler2018mobilenetv2]     Baseline         0.8308    0.6922    0.6637  0.5524
                                         LSR-u1           0.8248    0.6791    0.6547  0.5400
                                         LSR-u5           0.8253    0.6604    0.6539  0.5281
                                         LSR-a1           0.8068    0.6432    0.5631  0.4922
                                         LSR-a5           0.7114    0.2306    0.2037  0.1655
                                         CCL-Net (ours)   0.8342    0.7050    0.6718  0.5648
  • LSR settings: -u1 and -u5 use the uniform distribution with two different values of the weight $\epsilon$; -a1 and -a5 use the a priori distribution with the same two values.

Table 1: Experimental results on the ISIC18 dataset, including comparisons with the baseline networks and LSR [szegedy2016rethinking]. Higher is better for all evaluation metrics.

3.0.3 Comparison with Baselines and LSR

We quantitatively compare our proposed CCL-Net with baseline networks built on the same backbones. Specifically, we experiment with three widely used backbone networks: ResNet-18 [he2016resnet], MobileNetV2 [sandler2018mobilenetv2], and EfficientNet-B0 [tan2019efficientnet]. In addition, we compare our CCL-Net with the popular LSR [szegedy2016rethinking] under different combinations of the weight $\epsilon$ and the distribution $\mathbf{u}$: two values of $\epsilon$, each paired with the uniform and the a priori $\mathbf{u}$, resulting in a total of four settings (one of the two $\epsilon$ values is included because, for this specific problem, the learned soft label distributions would eventually approximate uniform LSR with this value if the class embeddings are not frozen after convergence). The results are listed in Table 1. As we can see, the proposed CCL-Net achieves the best performance on all evaluation metrics for all backbone networks, including Cohen’s kappa [cohen1960coefficient], which is more appropriate for imbalanced data than accuracy. These results demonstrate the effectiveness of utilizing the learned soft label distributions in improving classification performance of the backbone networks. We also note that moderate improvements are achieved by the LSR settings with uniform $\mathbf{u}$ on two of the three backbone networks (ResNet-18 and EfficientNet-B0), indicating that this simple strategy has some effect. Nonetheless, these improvements are outweighed by those achieved by our CCL-Net. In addition to its superior performance over LSR, another advantage of the CCL-Net is that it can intuitively reflect intrinsic interclass correlations underlying the given training data, at minimal extra overhead. Lastly, it is worth noting that the LSR settings with a priori $\mathbf{u}$ decrease all evaluation metrics from the baseline performances, suggesting that LSR with a priori distributions is inappropriate for significantly imbalanced data.

$lr_{\mathrm{CCL}}$             0.0001  0.0005  0.001   0.005   0.01    0.05
Epochs to converge              Never   137     76      23      12      5
$\bar{p}^{\mathrm{SL}}$         0.48    0.41    0.38    0.35    0.33    0.30
Accuracy                        0.8422  0.8502  0.8452  0.8442  0.8422  0.8417
Kappa                           0.6863  0.6986  0.6972  0.6902  0.6856  0.6884
Table 2: Properties of the CCL by varying the learning rate of the CCL head ($lr_{\mathrm{CCL}}$). ResNet-18 [he2016resnet] backbone is used.

Figure 3: Visualization of the learned soft label distributions using different $lr_{\mathrm{CCL}}$, where each row of a matrix represents the soft label distribution of a class.

3.0.4 Analysis of Interclass Correlations Learned with CCL-Net

Next, we investigate properties of the interclass visual correlations learned by the proposed CCL-Net, by varying the value of $lr_{\mathrm{CCL}}$. Specifically, we examine the epochs and label softness when the CCL head converges, as well as the final accuracies and kappa coefficients. Besides $\bar{p}^{\mathrm{SL}}$, the overall softness of the set of soft label distributions can also be intuitively perceived by visualizing all $\mathbf{p}^{\mathrm{SL}}_k$ together as a correlation matrix. Note that this matrix does not have to be symmetric, since the softmax operation is separately conducted for each class. Table 2 presents the results, and Fig. 3 shows three correlation matrices using different $lr_{\mathrm{CCL}}$. Interestingly, we can observe that as $lr_{\mathrm{CCL}}$ increases, the CCL head converges faster with higher softness. The same trend can be observed in Fig. 3, where the class probabilities become more spread with the increase of $lr_{\mathrm{CCL}}$. Notably, when $lr_{\mathrm{CCL}} = 0.0001$, the CCL head does not converge in the given epochs and the resulting label distributions are not as soft. This indicates that when $lr_{\mathrm{CCL}}$ is too small, the CCL head cannot effectively learn the interclass correlations. Meanwhile, the best performance is achieved when $lr_{\mathrm{CCL}} = 0.0005$ in terms of both accuracy and kappa, instead of other higher values. This suggests that very quick convergence may also be suboptimal, probably because the prematurely frozen class embeddings are learned from the less representative feature vectors in the early stage of training. In summary, $lr_{\mathrm{CCL}}$ is a crucial parameter for the CCL-Net, though it is not difficult to tune based on our experience.

4 Conclusion

In this work, we presented CCL-Net for data-driven interclass visual correlation learning and label softening. Rather than directly learning the desired correlations, CCL-Net implicitly learns them via distance-based metric learning of class-specific embeddings, and constructs soft label distributions from learned correlations by performing softmax on pairwise distances between class embeddings. Experimental results showed that the learned soft label distributions not only reflected intrinsic interrelations underlying given training data, but also boosted classification performance upon various baseline networks. In addition, CCL-Net outperformed the popular LSR technique. We plan to better utilize the learned soft labels and extend the work for multilabel problems in the future.

4.0.1 Acknowledgments.

This work was funded by the Key Area Research and Development Program of Guangdong Province, China (No. 2018B010111001), National Key Research and Development Project (2018YFC2000702), and Science and Technology Program of Shenzhen, China (No. ZDSYS201802021814180).