1 Introduction
Computer-aided diagnosis (CAD) has important applications in medical image analysis, such as disease diagnosis and grading [doi2005CAD, doi2007CAD]. Benefiting from the progress of deep learning techniques, automated CAD methods have advanced remarkably in recent years, and are now dominated by learning-based classification methods using deep neural networks [ker2017survey, litjens2017survey, shen2017survey]. Notably, these methods mostly use “hard” labels as their targets for learning, where the probability for the correct category is 1 and those for others are 0. However, these hard targets may adversely affect model generalization and adaptation, as networks trained to produce extreme values of 0 or 1 become overconfident about their predictions and prone to overfitting the training data [szegedy2016rethinking].
Studies showed that label smoothing regularization (LSR) can improve classification performance to a limited extent [muller2019whenDoesLSHelp, szegedy2016rethinking], although the smooth labels obtained by uniformly/proportionally distributing probabilities do not represent genuine interclass relations in most circumstances. Gao et al. [gao2017DLDL] proposed to convert the image label to a discrete label distribution and use deep convolutional networks to learn from ground truth label distributions. However, the ground truth distributions were constructed based on empirically defined rules rather than derived from data. Chen et al. [chen2019GCN] modeled the inter-label dependencies as a correlation matrix for graph convolutional network based multi-label classification, by mining pairwise co-occurrence patterns of labels. Although data-driven, this approach is not applicable to single-label scenarios in which labels do not occur together, restricting its application. Arguably, softened labels reflecting real data characteristics improve upon the hard labels by more effectively utilizing available training data, as the samples from one specific category can help with training of similar categories [chen2019GCN, gao2017DLDL]. For medical images, there exist rich interclass visual correlations, which have great potential yet remain largely unexploited for this purpose.
In this paper, we present the Class-Correlation Learning Network (CCLNet) to learn interclass visual correlations from given training data, and utilize the learned correlations to produce soft labels to help with the classification tasks (Fig. 1). Instead of directly letting the network learn the desired correlations, we propose implicit learning of the correlations via distance metric learning [sohn2016DML, zhe2019DML] of class-specific embeddings [mikolov2013word2vec] with a lightweight plug-in CCL block. A new loss is designed for bolstering learning of the interclass correlations. We further present end-to-end training of the proposed CCL block together with the classification backbone, while enhancing the classification ability of the latter with soft labels generated on the fly. In summary, our contributions are threefold:

We propose a CCL block for data-driven interclass correlation learning and label softening, based on distance metric learning of class embeddings. This block is conceptually simple, lightweight, and can be readily plugged as a CCL head into any mainstream backbone network for classification.

We design an intuitive new loss based on a geometrical explanation of correlation to help with the CCL, and present integrated end-to-end training of the plugged CCL head together with the backbone network.

We conduct thorough experiments on the International Skin Imaging Collaboration (ISIC) 2018 dataset. Results demonstrate effective data-driven CCL, and consistent performance improvements upon widely used modern network structures utilizing the learned soft label distributions.
2 Method
2.0.1 Preliminaries
Before presenting our proposed CCL block, let us first review the generic deep learning pipeline for single-label multi-class classification problems as preliminaries. As shown in Fig. 1(b), an input $x$ first goes through a feature extracting function $f_1$ parameterized as a network with parameters $\theta_1$, producing a feature vector $\mathbf{f}$ for classification: $\mathbf{f} = f_1(x \mid \theta_1)$, where $\mathbf{f} \in \mathbb{R}^{n_1}$. Next, $\mathbf{f}$ is fed to a generalized fully-connected (fc) layer $f_{fc}$ (parameterized with $\theta_{fc}$) followed by the softmax function, obtaining the predicted classification probability distribution $\mathbf{q}^{cls}$ for $x$: $\mathbf{q}^{cls} = \mathrm{softmax}(f_{fc}(\mathbf{f} \mid \theta_{fc}))$, where $\mathbf{q}^{cls} = [q^{cls}_1, \ldots, q^{cls}_K]^T$, $K$ is the number of classes, and $\sum_k q^{cls}_k = 1$. To supervise the training of $f_1$ and $f_{fc}$ (and the learning of $\theta_1$ and $\theta_{fc}$ correspondingly), for each training sample $x$, a label $y \in \{1, \ldots, K\}$ is given. This label is then converted to the one-hot distribution [szegedy2016rethinking]: $\mathbf{y} = [\delta_{1,y}, \ldots, \delta_{K,y}]^T$, where $\delta_{k,y}$ is the Dirac delta function, which equals 1 if $k = y$ and 0 otherwise. After that, the cross entropy loss $l_{CE}$ can be used to compute the loss between $\mathbf{q}^{cls}$ and $\mathbf{y}$: $l_{CE}(\mathbf{q}^{cls}, \mathbf{y}) = -\sum_k \delta_{k,y} \log q^{cls}_k$. Lastly, the loss is optimized by an optimizer (e.g., Adam [kingma2014adam]), and $\theta_1$ and $\theta_{fc}$ are updated by gradient descent algorithms.
As mentioned earlier, the “hard” labels may adversely affect model generalization as the networks become overconfident about their predictions. In this sense, LSR [szegedy2016rethinking] is a popular technique to make the network less confident by smoothing the one-hot label distribution $\mathbf{y}$ to become $\mathbf{y}^{LSR} = [\delta'_{1,y}, \ldots, \delta'_{K,y}]^T$, where $\delta'_{k,y} = (1 - \epsilon)\,\delta_{k,y} + \epsilon\, u(k)$, $\epsilon$ is a weight, and $u(k)$ is a distribution over class labels for which the uniform [muller2019whenDoesLSHelp] or a priori [muller2019whenDoesLSHelp, szegedy2016rethinking] distributions are proposed. Then, the cross entropy is computed with $\mathbf{y}^{LSR}$ instead of $\mathbf{y}$.
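The hard-label and LSR targets reviewed above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the function names, the example logits, and the zero-based class indices are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, num_classes):
    """Hard target: probability 1 for the labelled class, 0 elsewhere."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def smooth_labels(label, num_classes, epsilon, prior=None):
    """LSR target: (1 - epsilon) * one_hot + epsilon * u.

    `prior` is the distribution u over classes; it defaults to uniform.
    """
    u = prior if prior is not None else [1.0 / num_classes] * num_classes
    hard = one_hot(label, num_classes)
    return [(1.0 - epsilon) * h + epsilon * p for h, p in zip(hard, u)]


def cross_entropy(target, predicted):
    """Cross entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)


# Hypothetical fc-layer outputs for K = 3 classes (zero-based class indices).
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
hard_loss = cross_entropy(one_hot(0, 3), probs)
soft_loss = cross_entropy(smooth_labels(0, 3, epsilon=0.1), probs)
```

With the smoothed target, part of the loss comes from the non-target classes, so a prediction that puts nearly all probability on one class is no longer the minimizer.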
2.0.2 Learning Interclass Visual Correlations for Label Softening
In most circumstances, LSR cannot reflect the genuine interclass relations underlying the given training data. Intuitively, the probability redistribution should be biased towards visually similar classes, so that samples from these classes can boost training of each other. For this purpose, we propose to learn the underlying visual correlations among classes from the training data and produce soft label distributions that more authentically reflect intrinsic data properties. Rather than learning the desired correlations directly, we learn them implicitly by learning interrelated yet discriminative class embeddings via distance metric learning. Both the concepts of feature embeddings and deep metric learning have proven useful in the literature (e.g., [mikolov2013word2vec, sohn2016DML, zhe2019DML]). To the best of our knowledge, however, combining them for data-driven learning of interclass visual correlations and label softening has not been done before.
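The structure of the CCL block is shown in Fig. 1(c), which consists of a lightweight embedding function $f_2$ (parameterized with $\theta_2$), a dictionary $\mathbf{C}$, a distance metric function $f_d$, and two loss functions. Given the feature vector $\mathbf{f}$ extracted by the feature extracting function, the embedding function projects $\mathbf{f}$ into the embedding space: $\mathbf{e} = f_2(\mathbf{f} \mid \theta_2)$, where $\mathbf{e} \in \mathbb{R}^{n_2}$. The dictionary maintains all the class-specific embeddings: $\mathbf{C} = \{\mathbf{c}_k\}_{k=1}^{K}$, where $K$ is the number of classes. Using $f_d$, the distance between the input embedding and every class embedding can be calculated by $d_k = f_d(\mathbf{e}, \mathbf{c}_k)$. In this work, we use $f_d(\mathbf{e}, \mathbf{c}_k) = \left\| \mathbf{e}/\|\mathbf{e}\| - \mathbf{c}_k/\|\mathbf{c}_k\| \right\|$, where $\|\cdot\|$ is the L2 norm. Let $\mathbf{d} = [d_1, \ldots, d_K]^T$, we can predict another classification probability distribution based on the distance metric: $\mathbf{q}^{CCL} = \mathrm{softmax}(-\mathbf{d})$, and a cross entropy loss $l_{CE}(\mathbf{q}^{CCL}, \mathbf{y})$ with respect to the one-hot label distribution $\mathbf{y}$ can be computed accordingly. To enforce interrelations among the class embeddings, we propose the class correlation loss $l_{CC}$:
$$ l_{CC}(\mathbf{C}) = \frac{1}{K^2} \sum_{j=1}^{K} \sum_{k=1}^{K} \left[ f_d(\mathbf{c}_j, \mathbf{c}_k) - b \right]_+ \qquad (1) $$
where $[\cdot]_+$ is the Rectified Linear Unit (ReLU), and $b$ is a margin parameter. Intuitively, $l_{CC}$ enforces the class embeddings to be no further than a distance $b$ from each other in the embedding space, encouraging correlations among them. Rather than attempting to tune the exact value of $b$, we resort to the geometrical meaning of correlation between two vectors: if the angle between two vectors is smaller than $90^\circ$, they are considered correlated. Equivalently, for L2-normed vectors, if the Euclidean distance between them is smaller than $\sqrt{2}$, they are considered correlated. Hence, we set $b = \sqrt{2}$. Then, the total loss function for the CCL block is defined as $l_{CCL} = l_{CE}(\mathbf{q}^{CCL}, \mathbf{y}) + \alpha_{CC}\, l_{CC}(\mathbf{C})$, where $\alpha_{CC}$ is a weight. Thus, $\theta_2$ and $\mathbf{C}$ are updated by optimizing $l_{CCL}$.
After training, we define the soft label distribution $\mathbf{q}^{SL}_k$ for class $k$ as:
$$ \mathbf{q}^{SL}_k = \mathrm{softmax}\left( -\left[ f_d(\mathbf{c}_1, \mathbf{c}_k), \ldots, f_d(\mathbf{c}_K, \mathbf{c}_k) \right]^T \right) \qquad (2) $$
It is worth noting that by this definition of soft label distributions, the correct class always has the largest probability (the distance of a class embedding to itself is zero), which is a desired property especially at the start of training. Next, we describe our end-to-end training scheme of the CCL block together with a backbone classification network.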
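To make the geometry concrete, the following Python sketch computes a margin-based class correlation penalty and distance-based soft labels from a small set of class embeddings. It is a hedged illustration rather than the authors' code: averaging the penalty over all K × K pairs, taking the softmax over negated distances, and the 2-D example embeddings are all assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def distance(a, b):
    """Euclidean distance between the L2-normalised versions of a and b."""
    ua, ub = l2_normalize(a), l2_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ua, ub)))


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Unit vectors at a right angle are sqrt(2) apart, hence the margin.
MARGIN = math.sqrt(2.0)


def class_correlation_loss(embeddings, margin=MARGIN):
    """Mean hinge penalty on pairwise distances that exceed the margin.

    Averaging over all K * K ordered pairs is an assumption.
    """
    k = len(embeddings)
    total = sum(
        max(0.0, distance(ci, cj) - margin)
        for ci in embeddings
        for cj in embeddings
    )
    return total / (k * k)


def soft_labels(embeddings):
    """Row k: softmax over negated distances from class k to every class."""
    return [
        softmax([-distance(ck, cj) for cj in embeddings])
        for ck in embeddings
    ]


# Hypothetical 2-D class embeddings: classes 0 and 1 point in similar
# directions, class 2 points the opposite way.
embeddings = [[1.0, 0.0], [0.8, 0.6], [-1.0, 0.0]]
loss = class_correlation_loss(embeddings)
labels = soft_labels(embeddings)
```

In this toy example, the soft label of class 0 gives more probability to the nearby class 1 than to the distant class 2, and only the pairs involving class 2 contribute to the penalty.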
2.0.3 Integrated End-to-End Training with Classification Backbone
As a lightweight module, the proposed CCL block can be plugged into any mainstream classification backbone network, as long as a feature vector can be pooled for an input image, and trained together in an end-to-end manner. To utilize the learned soft label distributions, we introduce a Kullback-Leibler divergence (KL div.) loss $l_{KL}$ in the backbone network (Fig. 1), and the total loss function for classification becomes
$$ l_{cls} = l_{CE}(\mathbf{q}^{cls}, \mathbf{y}) + l_{KL}(\mathbf{q}^{SL}_y \,\|\, \mathbf{q}^{cls}) \qquad (3) $$
where $\mathbf{q}^{SL}_y$ is the soft label distribution of the labelled class $y$. Considering that the class correlation loss $l_{CC}$ tries to keep the class embeddings within a certain distance of each other, it is somewhat adversarial to the goal of the backbone network, which tries to push them away from each other as much as possible. In such cases, alternating training schemes are usually employed [goodfellow2014generative], and we follow this practice. Briefly, in each training iteration, the backbone network is first updated with the CCL head frozen, and then the backbone is frozen while the CCL head is updated; more details about the training scheme are provided in Algorithm 1. After training, the prediction is made according to $\mathbf{q}^{cls}$: $\hat{y} = \arg\max_k q^{cls}_k$.
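The classification loss that combines the hard-label cross entropy with a KL divergence term towards the soft label can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the direction KL(soft label || prediction), the unweighted sum of the two terms, and all example distributions are hypothetical choices, not confirmed details of the method.

```python
import math


def cross_entropy_hard(label, predicted):
    """Cross entropy against a one-hot target (zero-based class index)."""
    return -math.log(predicted[label])


def kl_divergence(target, predicted):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log(t / p) for t, p in zip(target, predicted) if t > 0.0
    )


def classification_loss(label, soft_label, predicted):
    """Hard-label cross entropy plus KL divergence to the soft label."""
    return cross_entropy_hard(label, predicted) + kl_divergence(
        soft_label, predicted
    )


# Hypothetical soft label for class 0 and two candidate predictions.
soft_label = [0.60, 0.32, 0.08]
overconfident = [0.98, 0.01, 0.01]
calibrated = [0.62, 0.30, 0.08]

kl_overconfident = kl_divergence(soft_label, overconfident)
kl_calibrated = kl_divergence(soft_label, calibrated)
```

The KL term is far larger for the overconfident prediction than for the one that follows the soft label, which is how the soft labels discourage extreme outputs in this sketch.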
During training, we notice that the learned soft label distributions sometimes start to linger around a fixed value for the correct classes and distribute about evenly across other classes after certain epochs; in other words, they approximately collapse into LSR with uniform distribution (the proof is provided in the supplementary material). This is because of the strong capability of deep neural networks in fitting any data [zhang2016understanding]. As the class correlation loss forces the class embeddings to be no further than the margin from each other in the embedding space, a network of sufficient capacity has the potential to make them exactly the margin distance away from each other (which is overfitting) when well trained. To prevent this from happening, we use the average correct-class probability as a measure of the total softness of the label set (the lower the softer, as the correct classes distribute more probabilities to other classes), and consider that the CCL head has converged if this probability does not drop for 10 consecutive epochs. In such a case, the class embeddings are frozen, whereas the rest of the CCLNet keeps updating.
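The convergence criterion described above can be sketched as a small patience counter in Python. The class and function names are hypothetical, and reading "does not drop" as "does not fall below the lowest value seen so far" is an assumption; comparing against the previous epoch would be an equally plausible reading.

```python
def average_correct_class_probability(soft_labels):
    """Mean of the diagonal of the K x K soft label matrix."""
    diagonal = [row[k] for k, row in enumerate(soft_labels)]
    return sum(diagonal) / len(diagonal)


class SoftnessMonitor:
    """Signals convergence once softness stops decreasing for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def update(self, softness):
        """Record one epoch; return True when the class embeddings should be frozen."""
        if softness < self.best:
            self.best = softness
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Hypothetical per-epoch softness values, with a short patience for brevity.
monitor = SoftnessMonitor(patience=3)
history = [0.90, 0.80, 0.80, 0.81, 0.80]
decisions = [monitor.update(value) for value in history]
```

Here the monitor reports convergence at the fifth epoch, after three consecutive epochs without a new minimum.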
3 Experiments
3.0.1 Dataset and Evaluation Metrics
The ISIC 2018 dataset is provided by the Skin Lesion Analysis Toward Melanoma Detection 2018 challenge [codella2019ISIC], for prediction of seven disease categories with dermoscopic images, including: melanoma (MEL), melanocytic nevus (NV), basal cell carcinoma (BCC), actinic keratosis/Bowen’s disease (AKIEC), benign keratosis (BKL), dermatofibroma (DF), and vascular lesion (VASC) (example images are provided in Fig. 2). It comprises 10,015 dermoscopic images, including 1,113 MEL, 6,705 NV, 514 BCC, 327 AKIEC, 1,099 BKL, 115 DF, and 142 VASC images. We randomly split the data into a training and a validation set (80:20 split) while keeping the original interclass ratios, and report evaluation results on the validation set. The employed evaluation metrics include accuracy, Cohen’s kappa coefficient [cohen1960coefficient], and unweighted means of F1 score and Jaccard similarity coefficient.
3.0.2 Implementation
In this work, we use commonly adopted, straightforward training schemes to demonstrate the effectiveness of the proposed CCLNet in data-driven learning of interclass visual correlations and improving classification performance, rather than sophisticated training strategies or heavy ensembles. Specifically, we adopt the stochastic gradient descent optimizer with a momentum of 0.9, weight decay of , and the backbone learning rate initialized to 0.1 for all experiments. The learning rate for the CCL head is initialized to 0.0005 for all experiments, except when we study the impact of varying it. Following [he2016resnet], we multiply the initial learning rates by 0.1 twice during training such that the learning process can saturate at a higher limit (the exact learning-rate-changing epochs as well as the total number of training epochs vary for different backbones due to different network capacities). A mini-batch of 128 images is used. The input images are resized to have a short side of 256 pixels while maintaining original aspect ratios. Online data augmentations including random cropping, horizontal and vertical flipping, and color jittering are employed for training. For testing, a single central crop of size 224×224 pixels is used as input. Gradient clipping is employed for stable training.
The loss weight of the CCL block is set to 10 empirically. The experiments are implemented using the PyTorch package. A single Tesla P40 GPU is used for model training and testing. For the embedding function, we use three fc layers of width 1024, 1024, and 512 (i.e., the embedding dimension is 512), with batch normalization and ReLU in between.
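The step-wise learning rate decay mentioned above (multiplying the initial rate by 0.1 twice) can be sketched in plain Python. The milestone epochs below are hypothetical, since the text only states that they vary across backbones.

```python
def step_learning_rate(initial_lr, epoch, milestones, factor=0.1):
    """Learning rate after applying `factor` once per milestone reached."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return initial_lr * factor ** drops


# Hypothetical milestone epochs; the actual values depend on the backbone.
MILESTONES = (100, 150)

backbone_schedule = [
    step_learning_rate(0.1, epoch, MILESTONES) for epoch in (0, 100, 150)
]
head_schedule = [
    step_learning_rate(0.0005, epoch, MILESTONES) for epoch in (0, 100, 150)
]
```

In a PyTorch setup the same behaviour is typically obtained with a multi-step scheduler attached to the optimizer; the function above only makes the arithmetic explicit.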
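Table 1: Classification performance of the baselines, LSR, and CCLNet with three backbones on the ISIC 2018 validation set.
Backbone | Method | Accuracy | F1 score | Kappa | Jaccard
ResNet18 [he2016resnet] | Baseline | 0.8347 | 0.7073 | 0.6775 | 0.5655
ResNet18 | LSRu1 | 0.8422 | 0.7220 | 0.6886 | 0.5839
ResNet18 | LSRu5 | 0.8437 | 0.7211 | 0.6908 | 0.5837
ResNet18 | LSRa1 | 0.7883 | 0.6046 | 0.5016 | 0.4547
ResNet18 | LSRa5 | 0.7029 | 0.2020 | 0.1530 | 0.1470
ResNet18 | CCLNet (ours) | 0.8502 | 0.7227 | 0.6986 | 0.5842
EfficientNetB0 [tan2019efficientnet] | Baseline | 0.8333 | 0.7190 | 0.6696 | 0.5728
EfficientNetB0 | LSRu1 | 0.8382 | 0.7014 | 0.6736 | 0.5573
EfficientNetB0 | LSRu5 | 0.8432 | 0.7262 | 0.6884 | 0.5852
EfficientNetB0 | LSRa1 | 0.8038 | 0.6542 | 0.5526 | 0.5058
EfficientNetB0 | LSRa5 | 0.7189 | 0.2968 | 0.2295 | 0.2031
EfficientNetB0 | CCLNet (ours) | 0.8482 | 0.7390 | 0.6969 | 0.6006
MobileNetV2 [sandler2018mobilenetv2] | Baseline | 0.8308 | 0.6922 | 0.6637 | 0.5524
MobileNetV2 | LSRu1 | 0.8248 | 0.6791 | 0.6547 | 0.5400
MobileNetV2 | LSRu5 | 0.8253 | 0.6604 | 0.6539 | 0.5281
MobileNetV2 | LSRa1 | 0.8068 | 0.6432 | 0.5631 | 0.4922
MobileNetV2 | LSRa5 | 0.7114 | 0.2306 | 0.2037 | 0.1655
MobileNetV2 | CCLNet (ours) | 0.8342 | 0.7050 | 0.6718 | 0.5648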

LSR settings: u1: uniform, ; u5: uniform, ; a1: a priori, ; a5: a priori, .
3.0.3 Comparison with Baselines and LSR
We quantitatively compare our proposed CCLNet with various baseline networks using the same backbones. Specifically, we experiment with three widely used backbone networks: ResNet18 [he2016resnet], MobileNetV2 [sandler2018mobilenetv2], and EfficientNetB0 [tan2019efficientnet]. In addition, we compare our CCLNet with the popular LSR [szegedy2016rethinking] with different combinations of the smoothing weight and the label distribution: two weight values, each with the uniform and the a priori distribution, resulting in a total of four settings (one of the weight values is chosen because, for this specific problem, the learned soft label distributions would eventually approximate uniform LSR with this value if the class embeddings are not frozen after convergence). The results are presented in Table 1. As we can see, the proposed CCLNet achieves the best performance on all evaluation metrics for all backbone networks, including Cohen’s kappa [cohen1960coefficient], which is more appropriate for imbalanced data than accuracy. These results demonstrate the effectiveness of utilizing the learned soft label distributions in improving classification performance of the backbone networks. We also note that moderate improvements are achieved by the LSR settings with the uniform distribution on two of the three backbone networks, indicating some benefit of this simple strategy. Nonetheless, these improvements are smaller than those achieved by our CCLNet. In addition to outperforming LSR, another advantage of the CCLNet is that it can intuitively reflect intrinsic interclass correlations underlying the given training data, at minimal extra overhead. Lastly, it is worth noting that the LSR settings with the a priori distribution decrease all evaluation metrics from the baseline performances, suggesting that LSR with a priori distributions is inappropriate for significantly imbalanced data.
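The remark that Cohen's kappa is more appropriate than accuracy for imbalanced data can be illustrated with a short Python sketch that computes the four reported metrics from a confusion matrix. The helper names and the toy labels are assumptions for illustration; the unweighted (macro) averaging follows the description of the metrics in the dataset section.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def evaluate(y_true, y_pred, num_classes):
    """Accuracy, Cohen's kappa, and unweighted mean F1 and Jaccard."""
    m = confusion_matrix(y_true, y_pred, num_classes)
    n = len(y_true)
    classes = range(num_classes)
    true_counts = [sum(row) for row in m]
    pred_counts = [sum(m[j][k] for j in classes) for k in classes]

    accuracy = sum(m[k][k] for k in classes) / n
    # Agreement expected by chance, from the marginal class frequencies.
    expected = sum(t * p for t, p in zip(true_counts, pred_counts)) / (n * n)
    kappa = (accuracy - expected) / (1.0 - expected) if expected < 1.0 else 0.0

    f1_scores, jaccards = [], []
    for k in classes:
        tp = m[k][k]
        fp = pred_counts[k] - tp
        fn = true_counts[k] - tp
        union = tp + fp + fn
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if union else 0.0)
        jaccards.append(tp / union if union else 0.0)

    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "f1": sum(f1_scores) / num_classes,
        "jaccard": sum(jaccards) / num_classes,
    }


# Toy imbalanced example: a classifier that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
scores = evaluate(y_true, y_pred, num_classes=2)
```

In this toy case the accuracy is 0.9 although the minority class is never detected, while kappa is 0 and the unweighted F1 and Jaccard scores are pulled down by the missed class.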
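Table 2: Convergence epoch, label softness, and classification performance for different CCL head learning rates.
CCL head learning rate | 0.0001 | 0.0005 | 0.001 | 0.005 | 0.01 | 0.05
Epochs to convergence | Never | 137 | 76 | 23 | 12 | 5
Average correct-class probability | 0.48 | 0.41 | 0.38 | 0.35 | 0.33 | 0.30
Accuracy | 0.8422 | 0.8502 | 0.8452 | 0.8442 | 0.8422 | 0.8417
Kappa | 0.6863 | 0.6986 | 0.6972 | 0.6902 | 0.6856 | 0.6884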
3.0.4 Analysis of Interclass Correlations Learned with CCLNet
Next, we investigate properties of the interclass visual correlations learned by the proposed CCLNet, by varying the learning rate of the CCL head. Specifically, we examine the epochs and label softness when the CCL head converges, as well as the final accuracies and kappa coefficients. Besides the average correct-class probability, the overall softness of the set of soft label distributions can also be intuitively perceived by visualizing all of them together as a correlation matrix. Note that this matrix does not have to be symmetric, since the softmax operation is separately conducted for each class. Table 2 presents the results, and Fig. 3 shows three correlation matrices obtained with different learning rates. Interestingly, we can observe that as the learning rate increases, the CCL head converges faster with higher softness. The same trend can be observed in Fig. 3, where the class probabilities become more spread out as the learning rate increases. Notably, when the learning rate is 0.0001, the CCL head does not converge in the given epochs and the resulting label distributions are not as soft. This indicates that when the learning rate is too small, the CCL head cannot effectively learn the interclass correlations. Meanwhile, the best performance in terms of both accuracy and kappa is achieved when the learning rate is 0.0005, rather than at higher values. This suggests that very quick convergence may also be suboptimal, probably because the prematurely frozen class embeddings are learned from the less representative feature vectors in the early stage of training. In summary, the learning rate of the CCL head is a crucial parameter for the CCLNet, though it is not difficult to tune based on our experience.
4 Conclusion
In this work, we presented CCLNet for data-driven interclass visual correlation learning and label softening. Rather than directly learning the desired correlations, CCLNet implicitly learns them via distance-based metric learning of class-specific embeddings, and constructs soft label distributions from learned correlations by performing softmax on pairwise distances between class embeddings. Experimental results showed that the learned soft label distributions not only reflected intrinsic interrelations underlying given training data, but also boosted classification performance upon various baseline networks. In addition, CCLNet outperformed the popular LSR technique. We plan to better utilize the learned soft labels and extend the work to multi-label problems in the future.
4.0.1 Acknowledgments.
This work was funded by the Key Area Research and Development Program of Guangdong Province, China (No. 2018B010111001), National Key Research and Development Project (2018YFC2000702), and Science and Technology Program of Shenzhen, China (No. ZDSYS201802021814180).