Skin cancer, especially melanoma, is one of the most dangerous types of cancer, accounting for over 87,000 cases (Rogers et al., 2015) and over 9,000 fatalities (Siegel et al., 2016) in the United States alone. While it is difficult to treat in its last stage, patients diagnosed early have a high chance of being cured. Therefore, there have been multiple efforts to detect the disease in its early stages (Masood and Ali Al-Jumaily, 2013; Curiel-Lewandrowski et al., 2019). One of the most promising technologies is the dermatoscope, which generates high-resolution images of a skin lesion and allows dermatologists to examine the lesion regions carefully (Celebi et al., 2019). However, dermoscopy requires extensive training, which is expensive and time-consuming, and the examination itself remains error-prone and might not be widely available. Therefore, it is important to develop automatic systems that detect abnormal skin lesions and aid dermatologists during diagnosis.
For this purpose, the International Skin Imaging Collaboration (ISIC) hosted challenges on automatic melanoma detection from dermoscopic images (ISIC-2018, ISIC-2017) (Codella et al., 2019, 2018). In particular, we focus on predicting the locations of dermoscopic attributes in an image, since this not only detects the anomalous regions but also provides an explanation that dermatologists can verify and use for further diagnosis. The challenge focused on five dermoscopic attributes: streaks, globules, pigment network, negative network, and milia-like cysts. We provide an example of such attributes in Figure 1. Due to the highly competitive nature of the competition, most state-of-the-art methods in the ISIC challenges are ensembles of different ImageNet pre-trained backbones. For instance, in ISIC 2018, the first-place method (Koohbanani et al., 2018) employed a mixture of four pre-trained networks, ResNet152 (He et al., 2016), DenseNet169 (Huang et al., 2017), Xception (Chollet, 2017), and ResNetV2 (Szegedy et al., 2017), in the encoding part of U-Net (Ronneberger et al., 2015a) segmentation models and then performed transfer learning on the ISIC 2018 training set. Although Koohbanani et al. (2018) achieved state-of-the-art performance, their method is not attractive in practice for two reasons. First, it requires a massive amount of memory for the pre-trained sub-models, which does not scale. Second, recent works (Raghu et al., 2019; Cheplygina, 2019; Nguyen et al., 2020) showed that transfer learning from ImageNet might be suboptimal in many scenarios because medical images differ substantially from ImageNet data. Moreover, most medical datasets, including ISIC-2018 and ISIC-2017, suffer from a lack of training data, and the number of instances per attribute is imbalanced, as depicted in Table 1.
To address the challenges mentioned above, we propose Task Agnostic Transfer Learning (TATL), an efficient framework to detect skin attributes in dermoscopic images. TATL's design is inspired by how dermatologists diagnose in practice: first identify an abnormal region, then inspect it more closely. Unlike previous works that try to segment each attribute on the image directly, TATL introduces an Attribute-Agnostic Segmenter that first detects anomalous regions in an image, regardless of their attributes. Then, TATL transfers this segmenter's knowledge to a set of attribute-specific segmenters (Target-Segmenters), each of which detects one specific attribute. Notably, the Attribute-Agnostic Segmenter is task-agnostic because it only identifies abnormal areas and is trained on data from all attributes. Therefore, TATL alleviates the lack of training samples by training this segmenter first. Additionally, by transferring the segmenter's knowledge to the Target-Segmenters, TATL allows knowledge sharing among them, enhancing the generalization and stability of the system. Furthermore, we provide theoretical insight showing that TATL works by bridging the gap between the target task's data and the source dataset. In particular, the attribute-specific segmenters are initialized from TATL's Union-Segmenter (the Attribute-Agnostic Segmenter), which has a smaller domain gap to the target task than initializations from ImageNet. This analysis sheds light on the strong empirical performance of TATL.
In summary, our contributions are three-fold: (1) We propose TATL, a novel strategy for skin attribute detection. Extensive experiments on ISIC 2018 validate the effectiveness of TATL against state-of-the-art methods while requiring only 1/30 of the parameters. Furthermore, TATL significantly improves over several skin diagnosis baselines pre-trained on ImageNet, especially for attributes with only a few images. (2) We provide theoretical insights explaining the success of TATL: it reduces the domain gap incurred when shifting from natural color images (ImageNet) to the medical domain (skin data). (3) TATL provides rich and informative outputs to aid doctors in making a further diagnosis, as it was designed to mimic how dermatologists examine patients. Moreover, the interaction between doctors and machines can be integrated during training to further improve the quality of treatment.
2 Related Work
2.1 Transfer Learning for Medical Image Analysis
Medical image analysis is a vital research area with a significant impact on practice. However, most medical image datasets are limited in size and often suffer from class imbalance. Therefore, a popular strategy is transfer learning, which uses a pre-trained ImageNet model as initialization on top of which additional components are built. Transfer learning is the basis of many existing methods (Abràmoff et al., 2016; De Fauw et al., 2018; Gulshan et al., 2016; Rajpurkar et al., 2017) and is the norm for practitioners. However, recent studies (Cheplygina, 2019; He et al., 2020) conducted large-scale analyses of the benefit of this strategy and concluded that transfer learning is not consistently better than random initialization. One reason is that medical images are vastly different from the ImageNet dataset, so the pre-trained features may not be helpful for the task at hand. Another reason is that medical data are often imbalanced and scarce due to data privacy. For example, for the skin attribute datasets used in our work, Table 1 shows the distribution of each skin attribute in the ISIC 2017 and ISIC 2018 datasets. Remarkably, the rarest class (streaks) makes up only 7.98% (113 images) and 3.86% (100 images) of the training data, respectively. In comparison, the most common class (pigment network) accounts for 79.03% (1119 images) and 58.67% (1522 images) of the total samples. Our TATL addresses the lack of training data by transferring knowledge from the Union-Segmenter. Moreover, we apply the one-class-one-model strategy (Buda et al., 2018), so each Target-Segmenter in TATL detects only one attribute, which alleviates the data imbalance.
2.2 Self-Supervised Learning In Computer Vision
Self-supervised learning, first mentioned in Schmidhuber (1990), refers to a technique of creating additional training tasks in which the label is itself a part of the data (images) rather than a set of separate labels (annotations). This strategy has been a successful pre-training technique in various vision applications, including image colorization (Vondrick et al., 2018; Larsson et al., 2016; Zhang et al., 2016), image inpainting (Pathak et al., 2016; Chen et al., 2020), and video representation (Misra et al., 2016; Lee et al., 2017). In self-supervised learning, a newly created task for pre-training is called a "pretext task", and the main tasks used for fine-tuning are called "downstream tasks". Various strategies have been proposed to construct auxiliary tasks based on temporal correspondence (Li et al., 2019; Wang et al., 2019a), cross-modal consistency (Wang et al., 2019b), and instance discrimination with contrastive learning (Wu et al., 2018). Recently, in the medical domain, He et al. (2020) successfully applied contrastive self-supervised learning to diagnosing COVID-19 from CT scans in order to reduce the risk of overfitting.
In our TATL, one can regard training a Union-Segmenter and using it to initialize the attribute-specific segmenters as a special case of self-supervised learning. Here, training the Union-Segmenter plays the role of the pretext task, while detecting the individual attributes with the Target-Segmenters constitutes the downstream tasks. However, our work differs from He et al. (2020) because TATL does not require additional training data beyond the task of interest. Furthermore, an outstanding property of the proposed approach is that the pretext task closely supports the downstream tasks: if the pretext task of recognizing abnormal regions is solved well, it will likely facilitate detecting the attributes of those regions. Finally, TATL also outputs the abnormal regions found by the pretext task, which is meaningful to end-users as it helps dermatologists double-check their diagnoses.
This section aims to formalize our problem setting and outline the dermatologists’ practice to diagnose skin attributes, which later motivates our method.
3.1 Problem Statement and Background
We consider the skin attribute detection problem on a target dataset consisting of training images and their corresponding masks. The detector, parameterized by W, can be initialized from a model pre-trained on another dataset, which we call the source dataset. Each training sample in the target domain consists of an image and a set of labels, one binary mask per attribute, where each mask indicates the skin region associated with that attribute. In this work, we consider five different attributes: Globules, Milia, Negative, Pigment Network, and Streaks. It is worth noting that a sample may not exhibit all the attributes; the label for a missing attribute is the empty mask. The network is trained by minimizing the empirical risk, which aggregates a per-attribute loss over all training samples.
In the empirical risk, the network produces a binary mask prediction for each sample and each attribute. For each attribute, we use the Dice and Jaccard loss functions to penalise the deviation between the prediction and the ground truth. To compute these losses, the prediction and the ground truth are first reshaped into vectors; the summations and multiplications are performed element-wise. In the experiments, a weighting coefficient balances the importance of the two loss functions, and a small constant is added to avoid division by zero.
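As an illustration of the two losses, the following is a minimal sketch in plain Python operating on flattened masks. The exact placement of the smoothing constant `eps`, its value, and the convex combination controlled by the hypothetical weight `lam` are assumptions made for this example; they are not taken from the original formulas.

```python
from typing import Sequence


def dice_loss(pred: Sequence[float], target: Sequence[float], eps: float = 1e-7) -> float:
    """Soft Dice loss on flattened masks (assumed form, eps avoids 0/0)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def jaccard_loss(pred: Sequence[float], target: Sequence[float], eps: float = 1e-7) -> float:
    """Soft Jaccard (IoU) loss on flattened masks (assumed form)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - intersection
    return 1.0 - (intersection + eps) / (union + eps)


def combined_loss(pred: Sequence[float], target: Sequence[float],
                  lam: float = 0.5, eps: float = 1e-7) -> float:
    """Weighted sum of both losses; `lam` is a hypothetical balancing weight."""
    return lam * dice_loss(pred, target, eps) + (1.0 - lam) * jaccard_loss(pred, target, eps)
```

With this form, an empty ground-truth mask that is predicted as empty yields a loss of zero, which matters here because samples lacking an attribute are labelled with the empty mask.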
3.2 Inspirations from Dermatologists’ Behaviours
In Figure 2, we provide a pipeline that simulates the conventional diagnosis process for a patient (Kawahara and Hamarneh, 2019). In the first step, dermatologists identify lesion regions by eliminating irrelevant background and rescaling these regions to a higher resolution for better visualization (Stage 1). Next, they spot any abnormal and clinically relevant sub-areas on the lesion (Stage 2). Finally, doctors make their decision by taking into account various features of these sub-areas, based on their textures and colors compared to neighboring regions (Stage 3). We argue that identifying lesion and abnormal regions is crucial since it provides focal points for later steps. However, existing techniques do not follow this pipeline and instead try to segment the attributes on the image directly. This motivates us to develop a skin attribute detection framework that closely follows the three-step procedure depicted in Figure 2.
We realize the diagnosis procedure in a single framework named Task Agnostic Transfer Learning (TATL). First, TATL employs a segmenter to separate the lesion regions from the normal skin regions. Then, mimicking the second step of the procedure, TATL trains an Attribute-Agnostic Segmenter to detect all abnormal areas in the images, regardless of their attributes. Finally, following the last step of the procedure, the Attribute-Agnostic Segmenter's parameters are used to initialize the Target-Segmenters (Tar-S), each of which is trained to detect only one particular attribute.
TATL not only closely resembles how dermatologists diagnose, but also enjoys two additional benefits over conventional approaches. First, TATL provides additional information about the locations of the abnormal regions, which can be helpful for dermatologists. Such areas reveal variations and commonalities of relevant lesions, thereby reducing subjective errors in the evaluation process. Second, adapting weights trained on abnormal regions to a specific attribute can guide the network to pay attention to features shared across diverse attributes, making the trained systems more robust and stable.
We now detail our TATL framework and discuss the theoretical properties, which will shed light on its impressive empirical performance.
4.1 Task Agnostic Transfer Learning for Skin lesion Attribute Detection
The Encoder - Decoder Architecture
The core component of our TATL framework is the encoder-decoder architecture, which takes an image/mask as input and produces a mask as output. While we employ several encoder-decoder networks in our method, they all share the same design, as follows. The encoder can be any object classification architecture, such as ResNet152 (He et al., 2016). We only keep the feature extraction layers of these networks and discard the layers used for classification. The backbone network is then divided into several stages according to its architecture. We also employ a decoder network to up-sample the encoder's features back to the original input's dimensions. In particular, to match the encoder's stages, each decoder stage consists of an up-sampling layer and a sequence of convolutional blocks, where each block has two convolution filters with activation functions in between. Each stage in the decoder receives a feature map from its immediately preceding layer and a corresponding feature map from the matching encoder stage. The two inputs are combined by either an addition or a concatenation operation, corresponding to the settings of LinkNet (Chaurasia and Culurciello, 2017) and U-Net (Ronneberger et al., 2015b), respectively.
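The difference between the two ways of combining decoder and encoder features can be sketched as follows. This is a toy illustration only: feature maps are represented as lists of channels, each channel a flat list of values, and the function names are hypothetical rather than part of any library.

```python
from typing import List

FeatureMap = List[List[float]]  # list of channels, each a flattened spatial map


def merge_add(decoder_feat: FeatureMap, encoder_feat: FeatureMap) -> FeatureMap:
    """Link-shape style skip connection: element-wise sum, channel count unchanged."""
    if len(decoder_feat) != len(encoder_feat):
        raise ValueError("addition requires the same number of channels")
    return [[d + e for d, e in zip(d_ch, e_ch)]
            for d_ch, e_ch in zip(decoder_feat, encoder_feat)]


def merge_concat(decoder_feat: FeatureMap, encoder_feat: FeatureMap) -> FeatureMap:
    """U-shape style skip connection: stack channels, channel count is the sum."""
    return [list(ch) for ch in decoder_feat] + [list(ch) for ch in encoder_feat]
```

Addition keeps the number of channels fixed, so the following convolutions stay small, whereas concatenation increases the channel count and therefore the number of parameters in the next block; this is one plausible reason the Link-shape variant is lighter.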
The TATL Framework
Our TATL framework consists of three kinds of encoder-decoder networks. The first network, Segment-Net, segments and upscales the lesion regions in the original image. The second network, the Attribute-Agnostic Segmenter, takes the lesion regions as input and learns to segment the abnormal regions, i.e., regions that contain any of the five attributes of interest. Finally, for each attribute, a corresponding network, the Target-Segmenter, is trained to segment that attribute's regions. The Target-Segmenter's decoder also rescales the final mask to match the original image's dimensions. In total, our TATL framework therefore consists of seven networks: one Segment-Net, one Attribute-Agnostic Segmenter, and five Target-Segmenters corresponding to the five attributes. Each network uses either the Link-shape or the U-shape architecture with EfficientNet as the main backbone, because it is lightweight compared to other architectures (please refer to Table 1 in the Appendix).
We now introduce some notation to detail our method. Our notation distinguishes three kinds of networks: the Segment-Net, the Attribute-Agnostic Segmenter, and the Target-Segmenter for each attribute in Stage 3 (Figure 2). The parameters of each network comprise the weights of its encoder layers and the weights of its decoder layers. To train the Segment-Net, we use a set of images and their corresponding lesion-region masks.
4.2 The Attribute-Agnostic Segmenter
In Stage 1, we use the Segment-Net on the lesion dataset to eliminate extraneous skin-based regions and only keep the lesion regions. The Segment-Net’s output is then upscaled to a higher resolution to provide better visual features for later steps.
The second stage focuses on training the Attribute-Agnostic Segmenter. We first define the Attribute-Agnostic region as a region that contains at least one of the attributes of interest. From this, we define an intermediate Attribute-Agnostic dataset in which every image is paired with a binary mask whose value is 1 wherever a pixel belongs to any of the attributes (Pretext Task). Note that, given an image and its set of attribute masks, this mask is the union of all the attribute masks and can be easily constructed by applying the bitwise OR operator across them. This dataset is used to train the Attribute-Agnostic Segmenter so that it can detect abnormal regions belonging to any of the attributes. It is important to note that the union mask covers all attributes; therefore, the Attribute-Agnostic Segmenter does not suffer from the lack of training data for minor classes.
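The construction of the attribute-agnostic mask can be sketched as below. Masks are represented as nested lists of 0/1 values for simplicity, and the function name and the dictionary keys are hypothetical; an array library would normally be used for real images.

```python
from typing import Dict, List

Mask = List[List[int]]  # binary mask, 1 marks a pixel belonging to the attribute


def union_mask(attribute_masks: Dict[str, Mask]) -> Mask:
    """Pixel-wise bitwise OR over all attribute masks of one image."""
    masks = list(attribute_masks.values())
    if not masks:
        raise ValueError("at least one attribute mask is required")
    height, width = len(masks[0]), len(masks[0][0])
    union = [[0] * width for _ in range(height)]
    for mask in masks:
        for r in range(height):
            for c in range(width):
                union[r][c] |= mask[r][c]
    return union


example = {
    "streaks": [[1, 0], [0, 0]],
    "globules": [[0, 0], [0, 1]],
    "milia": [[0, 0], [0, 0]],  # attribute absent: empty mask
}
agnostic_mask = union_mask(example)
```

Because an absent attribute contributes an all-zero mask, it leaves the union unchanged, so every image with at least one annotated attribute yields a non-empty training target for the pretext task.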
4.3 The Target-Segmenters
Given the parameters learned by the Attribute-Agnostic Segmenter, we can proceed to Stage 3 and train a segmenter for each of the attributes (Downstream Task). In contrast to previous approaches, we initialize the parameters of the Target-Segmenter for each attribute from the Attribute-Agnostic Segmenter's parameters. Lastly, the set of Target-Segmenters is trained to segment the individual attributes.
Having a dedicated network for each attribute helps alleviate the problem of imbalanced training data. Moreover, we explore two strategies for training the Target-Segmenters, which differ in whether they enforce knowledge sharing across attributes. First, we freeze all the encoders (TATL-Freeze), so that features are shared across attributes because every encoder keeps its initialization from the Attribute-Agnostic Segmenter. Second, we allow both the encoder and the decoder to be updated (TATL-Non Freeze), which lets each Target-Segmenter adapt to its dedicated attribute. We summarize all these steps in Algorithm 1 and Figure 3.
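The two training strategies can be sketched with a toy parameter container. Everything here is illustrative: `Segmenter` is a hypothetical stand-in for an encoder-decoder network, weights are plain lists, and copying the decoder weights in addition to the encoder weights is an assumption of this sketch.

```python
import copy
from dataclasses import dataclass
from typing import Dict, List

Weights = Dict[str, List[float]]


@dataclass
class Segmenter:
    """Toy stand-in for an encoder-decoder network (hypothetical structure)."""
    encoder: Weights
    decoder: Weights
    encoder_trainable: bool = True


def init_target_segmenters(agnostic: Segmenter, attributes: List[str],
                           freeze_encoder: bool) -> Dict[str, Segmenter]:
    """Create one Target-Segmenter per attribute from the agnostic weights."""
    return {
        name: Segmenter(
            encoder=copy.deepcopy(agnostic.encoder),  # independent copy per attribute
            decoder=copy.deepcopy(agnostic.decoder),  # assumption: decoder is copied too
            encoder_trainable=not freeze_encoder,
        )
        for name in attributes
    }


ATTRIBUTES = ["globules", "milia", "negative", "pigment_network", "streaks"]
agnostic = Segmenter(encoder={"stage1": [0.1, -0.2]}, decoder={"up1": [0.3]})
tatl_freeze = init_target_segmenters(agnostic, ATTRIBUTES, freeze_encoder=True)
tatl_non_freeze = init_target_segmenters(agnostic, ATTRIBUTES, freeze_encoder=False)
```

In a deep learning framework, the `encoder_trainable` flag would correspond to excluding the encoder parameters from gradient updates; the deep copies ensure that fine-tuning one Target-Segmenter does not alter the shared starting point of the others.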
4.4 Theoretical Insights
This section provides theoretical insights to justify our approach using recent results from data-dependent stability of optimization schemes. First of all, we will start with some definitions and notations.
We consider an example space and its members. In the supervised learning case, the example space is the product of an input space and the output space of the learning problem. We assume that training and testing examples are sampled i.i.d. from a probability distribution over the example space, and we denote the training set accordingly. Given a hypothesis parameter space, a learning algorithm is a map from the training data to a hypothesis in that space.
Kuzborskij and Lampert (2018) established a data-dependent notion of algorithmic stability for Stochastic Gradient Descent (SGD), given a training set, a sequence of step sizes, a sequence of random sample indices, and an initialization weight. In this setting, SGD starts from the initialization weight and, at each step, moves the current weights against the gradient of the loss on the randomly indexed training example, scaled by the current step size.
Here, the loss function measures the difference between predicted and true values for given parameters on a sample. The stability function depends on the data-generating distribution and on the initialization point of SGD.
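To characterize a randomized learning algorithm, we define its "on-average stability".
(On-Average stability). A randomized algorithm is on-average stable, with a given stability function, if replacing any single training example changes its expected loss on a test example by no more than that stability function. The expectation is taken over the algorithm's randomness, over the training set S, and over the test example, and the perturbed training set is a copy of S with the i-th example replaced by an example drawn from the same distribution.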
We now have the following theorem (Kuzborskij and Lampert, 2018):
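Let the learning algorithm be on-average stable with a given stability function. Then the expected difference between its risk and its empirical risk is bounded by that stability function, where the risk is the expected loss of the learned hypothesis under the data distribution and the empirical risk is its average loss over the training set.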
In words, the generalization of a learning algorithm to unseen data drawn from the same distribution is controlled by its on-average stability, which depends on the initial weights. In the following, we examine a model's generalization performance through the lens of its training algorithm's stability.
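For reference, the definition and the theorem can be written compactly as below. The notation is an assumption of this restatement, chosen to follow Kuzborskij and Lampert (2018): $\mathcal{A}_S$ is the hypothesis returned on training set $S$ of size $m$, $S^{(i)}$ is the copy of $S$ with the $i$-th example replaced by $z$, $f$ is the loss, $w_1$ is the initialization, and $\epsilon(\mathcal{D}, w_1)$ is the stability function.

```latex
% On-average stability (assumed notation)
\sup_{i \in [m]} \;
\mathbb{E}_{\mathcal{A}} \, \mathbb{E}_{S, z}
\left[ f(\mathcal{A}_S, z) - f(\mathcal{A}_{S^{(i)}}, z) \right]
\le \epsilon(\mathcal{D}, w_1)

% Generalization bound for an on-average stable algorithm
\mathbb{E}_{S} \, \mathbb{E}_{\mathcal{A}}
\left[ R(\mathcal{A}_S) - \widehat{R}_S(\mathcal{A}_S) \right]
\le \epsilon(\mathcal{D}, w_1),
\qquad
R(w) = \mathbb{E}_{z \sim \mathcal{D}} \left[ f(w, z) \right],
\quad
\widehat{R}_S(w) = \frac{1}{m} \sum_{i=1}^{m} f(w, z_i)
```

Read this way, any initialization $w_1$ that makes $\epsilon(\mathcal{D}, w_1)$ smaller directly tightens the gap between training and test performance, which is the property the comparison of initializations relies on.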
In the transfer learning setting, Theorem 1 provides a tool to understand the model's generalization on the target domain, given that it is initialized from one of the models pre-trained on a set of source domains (source tasks). Specifically, we suppose that the target task is characterized by a joint probability distribution and a training set, and that a set of source hypotheses has been trained on different source tasks. In this paper, we consider two distinct source cases: weights trained on ImageNet and weights obtained by our approach of learning the attribute-agnostic mask. We will show that, in most cases, the convergence rate with our initialization is faster than with the ImageNet initialization, which, via Theorem 1, implies that a neural network trained with TATL could generalize better than one initialized from conventional ImageNet pre-training. To this end, we need a further proposition from Kuzborskij and Lampert (2018).
Assume that the loss function is non-convex, has a Lipschitz Hessian, and is smooth, and that the step sizes satisfy a suitable decay condition. Then, with high probability, the on-average stability of the SGD scheme is bounded by a quantity determined by the empirical risk at the initialization and by the spectral norm of the Hessian of the loss at the initialization. Intuitively, Theorem 1 and Proposition 1 suggest that an initialization's generalization error depends on two factors: (i) how well it performs on the target domain without any training, which is characterized by its empirical risk; and (ii) the loss function's curvature around this initialization, which is characterized by the spectral norm of the Hessian, estimated empirically over the training samples. This result provides an intuitive explanation of why TATL provides a more favourable initialization than traditional ImageNet pre-trained models. In particular, we will explain why TATL, which initializes the Target-Segmenters from the Attribute-Agnostic Segmenter, can achieve better performance than initializing the segmenters from ImageNet pre-trained models.
Pre-trained ImageNet models are unlikely to perform well on medical images due to the large discrepancy between the two domains. Therefore, such models often have a higher empirical error on the target domain and usually lie in high-curvature regions. On the other hand, TATL uses an initialization from the Attribute-Agnostic Segmenter, which is pre-trained on self-generated data of the target task. Note that the Attribute-Agnostic Segmenter can detect any of the attributes and therefore enjoys a lower empirical risk than ImageNet models. Moreover, by construction, the Attribute-Agnostic Segmenter's parameters lie in a region close to the local minima of the individual attribute detectors, which has lower curvature. Consequently, TATL exploits knowledge of the target task to form an initialization that is likely to attain lower empirical error and curvature, which translates into a tighter generalization error bound than initializing from pre-trained ImageNet models. We empirically verify this result by comparing the values of the bound in Eq. (6) for different initialization strategies in Figure 4.
5 Experiments and Results
|Dataset||Measure||Streaks||Negative Network||Globules||Milia-like Cysts||Pigment Network||Total|
|ISIC - 2017||Number||113||122||NA||475||1119||1416|
|ISIC - 2018||Number||100||190||602||681||1522||2594|
We conduct experiments on two well-known datasets for skin attribute detection: the ISIC 2017 (https://challenge.isic-archive.com/landing/2017) and ISIC 2018 (https://challenge2018.isic-archive.com/) Task 2 datasets. Table 1 provides a summary of the two datasets. It is worth noting that the ISIC 2017 dataset only contains four classes (Streaks, Negative Network, Milia, and Pigment Network), while ISIC 2018 introduces the new class Globules. Moreover, both datasets exhibit high data imbalance among the attributes. For example, in the ISIC 2018 dataset, the class "Streaks" accounts for only 3.86% of the training data, while "Pigment Network" accounts for 58.67%.
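The quoted percentages follow directly from the image counts reported for the two datasets, as the short check below shows. The dictionary layout and the helper name are illustrative choices for this example.

```python
# Image counts for the rarest and most common attributes, and dataset totals.
counts = {
    "ISIC 2017": {"Streaks": 113, "Pigment Network": 1119, "Total": 1416},
    "ISIC 2018": {"Streaks": 100, "Pigment Network": 1522, "Total": 2594},
}


def share_percent(count: int, total: int) -> float:
    """Share of images containing an attribute, in percent, rounded to 2 decimals."""
    return round(100.0 * count / total, 2)


shares = {
    dataset: {
        name: share_percent(value, row["Total"])
        for name, value in row.items()
        if name != "Total"
    }
    for dataset, row in counts.items()
}
# shares["ISIC 2017"] -> {"Streaks": 7.98, "Pigment Network": 79.03}
# shares["ISIC 2018"] -> {"Streaks": 3.86, "Pigment Network": 58.67}
```

Note that the shares of different attributes need not sum to 100%, since a single image can contain several attributes.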
5.2 Experimental Settings
We conducted all experiments using the PyTorch framework (Paszke et al., 2019) on 4 NVIDIA TITAN RTX GPUs. All images were pre-processed by centering and normalizing the pixel intensities per channel. We used the SGD optimizer (Goodfellow et al., 2016), with an initial learning rate and momentum, to be consistent with the theory presented in Subsection 4.4. For TATL, we obtained the Segment-Net by training a U-shape network with an EfficientNet backbone on both the ISIC 2018 and ISIC 2017 Task 1 datasets using the loss function in Eq. (1). Given the segmentation results, we defined a bounding box around the masks with an offset of 40 pixels in all four directions to mitigate segmentation errors. The Attribute-Agnostic Segmenter and the Target-Segmenters were then trained for 40 epochs with early stopping after 10 epochs. The model that performed best on the validation data was used to measure the final performance.
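The bounding-box step can be sketched as follows. The mask is a nested list of 0/1 values, the function name is hypothetical, and clipping the box to the image borders is an assumption of this sketch rather than a stated detail of the pipeline.

```python
from typing import List, Optional, Tuple


def padded_bbox(mask: List[List[int]], offset: int = 40) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) around the mask, enlarged by `offset` pixels.

    Coordinates are inclusive and clipped to the image; returns None for an empty mask.
    """
    height, width = len(mask), len(mask[0])
    rows = [r for r in range(height) if any(mask[r])]
    cols = [c for c in range(width) if any(mask[r][c] for r in range(height))]
    if not rows:
        return None
    top = max(rows[0] - offset, 0)
    left = max(cols[0] - offset, 0)
    bottom = min(rows[-1] + offset, height - 1)
    right = min(cols[-1] + offset, width - 1)
    return top, left, bottom, right


# Small demonstration: a 10x10 image with a lesion at rows 4-5, columns 3-6.
demo_mask = [[0] * 10 for _ in range(10)]
for r in (4, 5):
    for c in (3, 4, 5, 6):
        demo_mask[r][c] = 1
```

The margin gives the later stages some tolerance: if the Segment-Net slightly under-segments the lesion, attribute regions near the lesion border are still likely to fall inside the crop.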
5.3 Comparison Against Other Approaches
Given the highly competitive nature of the skin lesion challenges, we utilize both U-shape and Link-shape architectures with EfficientNet (Tan and Le, 2019) as the backbone and average their probability predictions. Both networks are trained with our TATL framework. For a comprehensive comparison, we include four variants of TATL, obtained by removing either or both of Stages 1 and 2 of the framework: (i) the vanilla encoder-decoder architecture without the Segment-Net and the Attribute-Agnostic Segmenter (Stage 3); (ii) a variant that performs the first and last stages of TATL: segment the lesion regions and then the attributes (Stages 1 and 3); (iii) a variant that performs the second and last stages of TATL: first train the Attribute-Agnostic Segmenter on the original images and then a set of Target-Segmenters (Stages 2 and 3); (iv) our full TATL framework, which performs all three stages (Stages 1, 2, and 3). We report the Dice and Jaccard indices of our method against the winners of ISIC 2017 (Kawahara and Hamarneh, 2018) and ISIC 2018 (Koohbanani et al., 2018) in Table 3 and Table 2, respectively. Our TATL consistently outperforms its variants and the competitions' winners by a large margin on both benchmarks and both metrics. Notably, our method shows substantial improvements over the baselines on the attributes with the least training data, such as Streaks and Negative Network. These results corroborate our design of TATL, which improves the performance on minor classes by transferring knowledge from the Attribute-Agnostic Segmenter.
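Averaging the probability predictions of the two architectures can be sketched as below, on flattened probability maps. The function names and the 0.5 decision threshold are assumptions of this example.

```python
from typing import List, Sequence


def average_predictions(prob_maps: Sequence[Sequence[float]]) -> List[float]:
    """Pixel-wise mean of the probability maps produced by several networks."""
    if not prob_maps:
        raise ValueError("at least one probability map is required")
    n = len(prob_maps)
    return [sum(values) / n for values in zip(*prob_maps)]


def binarize(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn averaged probabilities into a binary mask (threshold is assumed)."""
    return [1 if p >= threshold else 0 for p in probs]


u_shape_probs = [0.9, 0.2, 0.6]     # hypothetical U-shape outputs
link_shape_probs = [0.7, 0.4, 0.2]  # hypothetical Link-shape outputs
ensemble_mask = binarize(average_predictions([u_shape_probs, link_shape_probs]))
```

In the last pixel of this toy example the two networks disagree (0.6 versus 0.2); the average of 0.4 falls below the threshold, so the ensemble sides with the more confident negative prediction.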
Performance of Each Backbone
To validate our method's performance with either the U-shape or the Link-shape design, we compare each network in detail against the backbones used by the ISIC 2018 winner: ResNet-152, ResNet-v2, and DenseNet-169. Besides, since our model uses EfficientNet as the main backbone, we also report the performance of this network on the ISIC 2018 challenge for an overall comparison. Table 4 shows the corresponding experimental results, with blue and red marking the best Jaccard and Dice scores, respectively. Our two versions, denoted U-EfficientNet (TATL) and L-EfficientNet (TATL), apply the Focus and transfer learning processes. All the highest results come from one of our models: the U-shape version wins on the Pigment Network and Globules attributes, and the remaining attributes are won by the Link-shape model. In addition, the baseline with the EfficientNet backbone appears to perform better than the other three backbones.
For attributes with many training images, such as Pigment Network and Milia-like cysts, our model only slightly increases the Jaccard index of the EfficientNet baseline, and the improvement in the Dice coefficient is likewise small. The fewer the training images, the larger the margin of improvement brought by the transfer learning stage. For instance, on the Streaks attribute, our L-EfficientNet (TATL) achieves Jaccard and Dice scores clearly higher than the best results of the baseline with the EfficientNet backbone. Overall, our model with the Link-shape structure performs best among all methods in terms of both Jaccard and Dice, although the U-shape variant obtains similar performance. It is important to mention that, without using any ensemble techniques, we still achieve a better result than the official ISIC 2018 winning model, which integrates three backbones, as shown in Table 2.
Our proposed method consistently outperforms its competitors in all experiments while using significantly fewer parameters. We provide the number of trainable parameters of the different architectures in Table 5. Notably, compared to the winners of the ISIC 2017 and ISIC 2018 challenges, each of our networks has 1.4 to 2.33 times and 30 to 50 times fewer parameters, respectively. Consequently, TATL consumes less GPU memory and can be trained with a larger batch size and a higher image resolution, which benefits both the convergence rate and the model's generalization.
|Our (EfficientNet, U-shape)||10,115,501|
|Our (EfficientNet, Link-shape)||6,096,333|
5.4 Evaluation of TATL
TATL performs well not only with the EfficientNet backbone but also with other architectures. We demonstrate this on five distinct networks, evaluated on the Negative and Streaks attributes because they have the fewest images in the dataset. The networks are VGG16, ResNet-152, ResNet-v2, DenseNet-169, and EfficientNet-b0, which are the backbones used by the two top-ranked methods in ISIC 2017 and ISIC 2018. Table 6 presents the main results on both datasets, which required large-scale experimentation. At a high level, we run three settings: TATL (FE), TATL (NF), and ImageNet. For each setting, we use five different backbone architectures and two different network shapes. We also rerun each configuration three times and use cross-validation to estimate average results. This results in a total of 600 models, which provides a comprehensive analysis of TATL. The first setting applies TATL but freezes the encoder and only updates the weights of the decoder while training for a specific attribute. The second setting is similar but also updates the parameters of the encoder. The last setting does not apply our transfer learning process and instead trains from weights pre-trained on the ImageNet dataset. As can be seen from Table 6, applying TATL improves the performance of all backbones except ResNet-v2 on the Negative attribute, where the difference between the Dice values is negligible. In contrast, TATL yields its largest Dice improvements when using DenseNet-169 with the U-shape design to segment Streaks regions in the ISIC 2018 dataset, and when using ResNet-152 with the Link-shape design on a similar task. TATL consistently improves performance across the different backbones and across both the U-shape and Link-shape designs.
|Architecture||Setting||ISIC 2017||ISIC 2018|
5.5 TATL Is More Than Just Data Augmentation
We now show that the benefit of TATL over ImageNet initialization does not simply come from increasing the number of training samples. To this end, we compare TATL with the popular data augmentation and image-context restoration approaches (Pathak et al., 2016; Chen et al., 2019). For the augmentation setting, we use common data augmentation transformations such as random rotation, flip, shift, brightness, and zoom to increase the total number of training instances for each attribute. For the image-context restoration setting, we aim to learn a powerful image representation by forcing the network to reconstruct randomly removed sub-regions of the training samples, and then fine-tune the obtained weights on each attribute. This approach was recently shown to be competitive with state-of-the-art methods in image classification, object detection, and semantic segmentation (Kolesnikov et al., 2019; Chen et al., 2020). We present the experimental results in Table 7, with the metrics computed over all lesion attributes in the ISIC 2018 challenge. In general, both the data augmentation and the image reconstruction tasks provide marginal improvements over the traditional ImageNet initialization with the U-shape design. However, TATL significantly outperforms these strategies on both evaluation metrics and both architecture designs. This result confirms our finding that transferring knowledge from the Attribute-Agnostic Segmenter is beneficial for the skin attribute segmentation task.
5.6 Generalization Bound Of TATL Compared to Other Strategies
In this part, we examine the theoretical insights of Proposition 1 by estimating the generalization error bound on the right-hand side of Eq. 6 in four different cases: TATL, ImageNet, Augmentation, and Image-Context Reconstruction (Pathak et al., 2016; Chen et al., 2019). For each setting, we run a full pass over all training samples of each attribute to estimate the spectral norm of the Hessian matrix and the empirical risk,
where the largest eigenvalue is approximated by the power iteration method (Solomon, 2015). We choose for all attributes and present the relative relations among categories in Figure 4. Interestingly, TATL attains the smallest upper bound among the four settings, especially for the two attributes with fewer samples, Streaks and Negative. These observations indicate that our scheme satisfies the upper bound conditions in Proposition 1 in the sense of data-dependent stability, which explains why TATL maintains competitive performance across several experiments.
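The power iteration step can be sketched as follows. This is a minimal illustration, assuming the Hessian is symmetric and accessible only through Hessian-vector products; the callable `hvp`, the iteration count, and the least-squares toy problem standing in for a segmentation network are all our assumptions.

```python
import numpy as np

def spectral_norm_power_iteration(hvp, dim, n_iters=200, seed=0):
    """Estimate the largest-magnitude eigenvalue of a symmetric matrix
    that is only available through matrix-vector products `hvp(v)`."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(n_iters):
        hv = hvp(v)
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            # v lies in the null space, so no direction is amplified.
            return 0.0
        eigenvalue = float(v @ hv)  # Rayleigh quotient of the current v
        v = hv / norm               # re-normalize for the next step
    return abs(eigenvalue)

# Toy stand-in for a network: the least-squares loss
# (1 / 2n) * ||Xw - y||^2 has the constant Hessian X^T X / n.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
hessian = X.T @ X / X.shape[0]
estimate = spectral_norm_power_iteration(lambda v: hessian @ v, dim=5)
exact = np.linalg.eigvalsh(hessian)[-1]  # reference value for the toy case
```

For a real network the explicit matrix is never formed; `hvp` would instead be supplied by automatic differentiation and averaged over the training samples, which is what makes a single full pass over the data sufficient per iteration.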
Figure 5 illustrates sample results of our proposed TATL model. The ground-truth segmentation is highlighted in green, and our prediction is marked in red. For attributes with many training images, such as Globules or Pigment Network, TATL produces segmentations that cover most of the ground-truth area. Although the predictions for Streaks and Negative Network miss some affected regions, they still capture the main location of the attribute. In addition, our model offers end-users extra information through the predicted union. This segmentation can help doctors see which areas they should pay attention to, making it easier for them to localize the lesion regions accurately, which is crucial because they, not our model, are the ones who give the final diagnosis.
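A visualization of this kind can be produced with a few lines of array code. The sketch below is illustrative only: the blending weight `alpha` and the function names are hypothetical, and the union mask is computed here as the pixel-wise OR of per-attribute masks, which is our reading of the "predicted union" rather than a confirmed detail of the figure.

```python
import numpy as np

def overlay_masks(image, truth, pred, alpha=0.5):
    """Blend ground truth (green) and prediction (red) onto an RGB image
    with values in [0, 1]. Where both masks overlap, red is applied last."""
    out = image.astype(float).copy()
    green = np.array([0.0, 1.0, 0.0])
    red = np.array([1.0, 0.0, 0.0])
    truth = truth.astype(bool)
    pred = pred.astype(bool)
    out[truth] = (1.0 - alpha) * out[truth] + alpha * green
    out[pred] = (1.0 - alpha) * out[pred] + alpha * red
    return out

def union_of_masks(attribute_masks):
    """Pixel-wise OR over a list of per-attribute binary masks."""
    return np.logical_or.reduce([m.astype(bool) for m in attribute_masks])

# Toy 2x2 example on a black image.
image = np.zeros((2, 2, 3))
truth = np.array([[1, 0], [0, 0]])
pred = np.array([[0, 0], [0, 1]])
blended = overlay_masks(image, truth, pred)
union = union_of_masks([truth, pred])
```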
6 Discussion and Future Work
Our work proposes a novel strategy to initialize the attribute segmenters’ parameters using an attribute-agnostic segmenter trained on abnormal skin regions. We empirically demonstrate its benefit over the traditional strategy of using ImageNet-pretrained models. Based on these promising results, we outline several potential and interesting directions for future research.
Generalization to Other Medical Image Analysis Tasks
We develop TATL to address the skin-attribute detection problem specifically. It would be interesting to test TATL’s generalization capabilities on other medical image analysis tasks, where using pretrained ImageNet models is likely to be suboptimal. For example, tasks such as brain lesion segmentation (Hu et al., 2018; Duy et al., 2018; Mallick et al., 2019; Nguyen et al., 2017) or chest abnormality detection (Hashir et al., 2020; Ibrahim et al., 2021; Nguyen et al., 2021) share similar characteristics with our problem setting: the data are often imbalanced, and classes share semantic features that can be leveraged to improve the overall performance. Therefore, it is of interest to explore the applications of TATL in such tasks and make possible adjustments.
Real-world Applications Using TATL
Our ultimate goal is to develop a model that not only makes predictions but also provides useful information and assists doctors in making the final decisions. Our TATL framework realizes this goal by providing a mask of abnormal regions, which compensates for inaccurate predictions in later stages, especially on minority classes. A promising future direction for TATL is integrating it into an online learning setting with a human in the loop. In particular, a model is trained to detect some diseases and then deployed to a real-world environment with a stream of data and feedback from doctors and patients. In such scenarios, the model can continuously improve its performance by accumulating attribute-agnostic information via the doctors’ feedback and then transferring it to the target segmenters, allowing for fast adaptation to newer patients and more accurate predictions over time.
A Holistic Medical Image Analysis Method Beyond TATL
Intuitively, TATL works by achieving a tighter generalization error bound compared to other initialization strategies. However, the theoretical result in Proposition 1 only bounds the generalization error in terms of the initialization parameters. In practice, additional aspects can affect the model’s generalization, such as (i) the number of source tasks (training classes in our case); (ii) which properties of those tasks can be safely transferred; and (iii) beyond initialization, which mechanisms allow for a successful knowledge transfer. Such properties are not yet rigorously studied, and exploring them can potentially provide a holistic method for medical image analysis: a method that not only starts with a quality initialization but also exploits the complex relationships among medical images to improve its performance over time. Such a method could provide accurate disease detection and assist doctors in diagnosing rare diseases more precisely, resulting in effective treatments at a lower cost.
We have investigated the limitations of the common fine-tuning strategy in state-of-the-art skin-attribute detection methods. We show that such strategies are not optimal when the target task differs substantially from ImageNet and has limited training data. This limitation motivated us to develop TATL, a novel method that exploits the data of all attributes to train an agnostic segmenter. By transferring the agnostic segmenter’s knowledge to each attribute segmenter, TATL significantly alleviates the lack of training data and allows knowledge sharing among attribute models. Through extensive experiments on the ISIC 2017 and ISIC 2018 benchmarks, we demonstrate the efficacy of TATL over existing state-of-the-art methods. Moreover, TATL works well with many different backbone networks while keeping model and computational complexity minimal. Finally, we provide theoretical insights showing that TATL works by bridging the domain gap via the task-agnostic segmenter, which sheds light on its strong performance.
This research has been supported by the Ki-Para-Mi project (BMBF, 01IS1903-8B), the pAItient project (BMG, 2520DAT0P2), and the Endowed Chair of Applied Artificial Intelligence, Oldenburg University. Binh T. Nguyen is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number NCM2019-18-01. We would like to thank Dr. Fabrizio Nunnari (DFKI) and Dr. Paul Swoboda (MPI-INF) for their valuable discussions.
- Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Investigative ophthalmology & visual science 57 (13), pp. 5200–5206. Cited by: §2.1.
- A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks 106, pp. 249–259. Cited by: §2.1.
- Dermoscopy Image Analysis: Overview and Future Directions. IEEE Journal of Biomedical and Health Informatics 23 (2), pp. 474–478. Cited by: §1.
- Linknet: exploiting encoder representations for efficient semantic segmentation. In 2017 IEEE Visual Communications and Image Processing (VCIP), pp. 1–4. Cited by: §4.1.
- Self-supervised learning for medical image analysis using image context restoration. Medical image analysis 58, pp. 101539. Cited by: §5.5, §5.6.
- A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §2.2, §5.5.
- Cats or cat scans: transfer learning from natural or medical image source data sets?. Current Opinion in Biomedical Engineering 9, pp. 21–27. Cited by: §1, §2.1.
- Xception: deep learning with depthwise separable convolutions. In , pp. 1251–1258. Cited by: §1.
- Skin lesion analysis toward melanoma detection: a challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic). In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 168–172. Cited by: §1.
- Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368. Cited by: §1.
- Artificial Intelligence Approach in Melanoma. In Melanoma, D. E. Fisher and B. C. Bastian (Eds.), pp. 1–31. Cited by: §1.
- Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature medicine 24 (9), pp. 1342–1350. Cited by: §2.1.
- Accurate brain extraction using active shape model and convolutional neural networks. arXiv preprint arXiv:1802.01268. Cited by: §6.
- Deep learning. Vol. 1, MIT press Cambridge. Cited by: §5.2.
- Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Jama 316 (22), pp. 2402–2410. Cited by: §2.1.
- Quantifying the value of lateral views in deep learning for chest x-rays. In Medical Imaging with Deep Learning, pp. 288–303. Cited by: §6.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.1.
- Sample-efficient deep learning for covid-19 diagnosis based on ct scans. medRxiv. Cited by: §2.1, §2.2, §2.2.
- Deep learning for image-based cancer detection and diagnosis- a survey. Pattern Recognition 83, pp. 134–149. Cited by: §6.
- Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1.
- Pneumonia classification using deep learning from chest x-ray images during covid-19. Cognitive Computation, pp. 1–13. Cited by: §6.
- Fully convolutional neural networks to detect clinical dermoscopic features. IEEE journal of biomedical and health informatics 23 (2), pp. 578–585. Cited by: §5.3.
- Visual diagnosis of dermatological disorders: human and machine performance. arXiv preprint arXiv:1906.01256. Cited by: §3.2.
- Revisiting self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1920–1929. Cited by: §5.5.
- Leveraging transfer learning for segmenting lesions and their attributes in dermoscopy images. arXiv preprint arXiv:1809.10243. Cited by: §1, §5.3.
- Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 2815–2824. Cited by: §4.4, §4.4, §4.4.
- Learning representations for automatic colorization. In European conference on computer vision, pp. 577–593. Cited by: §2.2.
- Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676. Cited by: §2.2.
- Joint-task self-supervised learning for temporal correspondence. In Advances in Neural Information Processing Systems, pp. 318–328. Cited by: §2.2.
- Brain mri image classification for cancer detection using deep wavelet autoencoder-based deep neural network. IEEE Access 7, pp. 46278–46287. Cited by: §6.
- Computer Aided Diagnostic Support System for Skin Cancer: A Review of Techniques and Algorithms. International Journal of Biomedical Imaging 2013, pp. 1–22. Cited by: §1.
- Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527–544. Cited by: §2.2.
- An attention mechanism with multiple knowledge sources for covid-19 detection from ct images. In AAAI 2021 Workshop on Trustworthy AI for Healthcare. Cited by: §6.
- 3D-brain segmentation using deep neural network and gaussian mixture model. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 815–824. Cited by: §6.
- A visually explainable learning system for skin lesion detection using multiscale input with attention u-net. In German Conference on Artificial Intelligence (Künstliche Intelligenz), pp. 313–319. Cited by: §1.
- Pytorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703. Cited by: §5.2.
- Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536–2544. Cited by: §2.2, §5.5, §5.6.
- Transfusion: understanding transfer learning for medical imaging. In Advances in neural information processing systems, pp. 3347–3357. Cited by: §1.
- Chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225. Cited by: §2.1.
- Incidence estimate of nonmelanoma skin cancer (keratinocyte carcinomas) in the us population, 2012. JAMA dermatology 151 (10), pp. 1081–1086. Cited by: §1.
- U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Vol. 9351, pp. 234–241. Cited by: §1.
- U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §4.1.
- Making the world differentiable: on using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Cited by: §2.2.
- Cancer statistics, 2016. CA: a cancer journal for clinicians 66 (1), pp. 7–30. Cited by: §1.
- Numerical algorithms: methods for computer vision, machine learning, and graphics. CRC press. Cited by: §5.6.
- Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence. Cited by: §1.
- Efficientnet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §5.3.
- Tracking emerges by colorizing videos. In Proceedings of the European conference on computer vision (ECCV), pp. 391–408. Cited by: §2.2.
- Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2566–2576. Cited by: §2.2.
- Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6629–6638. Cited by: §2.2.
- Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §2.2.
- Colorful image colorization. In European conference on computer vision, pp. 649–666. Cited by: §2.2.