Melanoma is among the most lethal types of skin cancer, and its incidence is increasing rapidly throughout the world; the five-year survival rate is less than 15% for advanced-stage melanoma [31, 18]. The mortality of melanoma is associated with its high likelihood of metastasizing to other organs (e.g. lung and brain). Early melanoma usually starts as a brown or black spot that is confined to the cells in the top layer of the skin (epidermis). It then progressively extends through the epidermis and further into the dermis, followed by invasion of other tissues and organs through the circulatory system. Timely detection and proper treatment are crucial for patient survival, since melanoma can be cured with prompt excision. Dermoscopy, widely used in the clinical examination of melanoma, is a noninvasive imaging tool that provides an accurate and detailed visualization of the pigmented skin lesion structure [27, 40]. However, manual visual inspection based on dermoscopy is time-consuming, subjective, and hardly reproducible. Even experienced dermatologists may produce inconsistent diagnoses. In recent decades, computer-aided diagnosis systems (CADs) have been developed and have already demonstrated their strength in assisting dermatologists with the clinical diagnosis of melanoma [9, 4].
Automatic segmentation of skin lesions is a fundamental component of CADs for the analysis of melanoma. Recently, fully convolutional neural networks (FCNs) [53, 8, 7] have shown promising achievements in skin lesion segmentation. The success of FCNs relies on their powerful feature representation ability, which encodes both low-level appearance information and high-level semantic information. However, a series of pooling and down-sampling operations at consecutive layers of FCNs reduces the spatial resolution of feature maps and thus yields insufficiently precise skin lesion predictions. Some small skin lesions are so inconspicuous compared to the background that FCNs fail to extract valid feature information, although these lesions are important in the diagnosis of melanoma. In addition, the loss of detailed features in FCNs also limits the localization of the skin lesion boundary. Atrous convolutional neural networks have displayed their strength in semantic segmentation by handling the feature resolution reduction with multiple parallel dilated convolutional layers [12, 51]. Nevertheless, it is challenging for an atrous convolutional neural network to learn a discriminative feature representation for skin lesions because of their non-negligible heterogeneity. For instance, the shape, size, and color of a skin lesion vary greatly across the stages of lesion progression (Fig. 1 (a-b)). Other factors, such as the low contrast between lesion and background and the presence of artifacts (hairs, air bubbles, color calibration charts, etc.), also impede accurate skin lesion segmentation, as shown in Fig. 1.
Skin lesions develop as a result of the proliferation of one or more components of the skin, and range from benign lesions to premalignant lesions and aggressive tumors. Since skin lesions progressively invade nearby tissues, there exists a complex correlation between different parts of the lesion anatomical structure. Utilizing this correlation between candidate pixels and their surrounding contextual regions helps the network learn a discriminative feature representation. In this paper, we propose a bi-directional dermoscopic feature learning (biDFL) framework that integrates lesions with their informative context to achieve a substantially rich description of lesion structure. Different from the naive way of FCNs, which learn an abstract feature representation for each candidate pixel, our proposed biDFL enriches the feature representation by controlling information propagation from two complementary directions among high-level parsing layers. By integrating feature information passing along both directions, the proposed biDFL module enhances the network capability of learning the complex structure of the skin lesion. It improves the representative capability of feature maps without changing their spatial resolution. In addition, our biDFL module also mitigates the challenge posed by the high variation of skin lesion size.
Different score maps can be generated by classifying features learned from different layers of a neural network, showing the classification results from different scales of learned features. Score map integration via sum fusion has been adopted to aggregate multi-scale feature information for the refinement of semantic segmentation in many deep learning based approaches [30, 42, 16, 41, 14, 46]. However, some skin lesions are large while others are small. Moreover, pixels far away from the lesion boundary are more reliably classified by using larger scales of features, while pixels near the lesion boundary need smaller scales of features for better localization of the segmented lesion. These observations motivate us to explore the consistency of the classification scores in the local neighborhood for scale selection. Specifically, we propose a multi-scale consistent decision fusion (mCDF) that assesses the reliability of each decision in score maps generated from multiple classification layers. Our multi-scale consistent decision fusion embeds the consistency information around the local decision context to adjust the confidence of each decision, and thus allows more reliable and precise skin lesion delineation. If the decision for a candidate pixel is consistent with the decisions in its local context, the network gives high confidence to this decision; otherwise the network reduces the confidence of this decision.
Our segmentation network is fully convolutional and provides an end-to-end way for skin lesion training and prediction. The main contributions of this paper are summarized as follows:
We propose a bi-directional dermoscopic feature learning framework to generate a substantially rich and discriminative feature representation by integrating skin lesions with their informative context. By manipulating the feature propagation through two complementary directions among high-level layers, we improve the image parsing ability of the network.
We further propose a multi-scale consistent decision fusion to enhance the reliability and consistency of the decision by selectively fusing decisions generated from multiple classification layers.
We achieve state-of-the-art performance consistently on the evaluated benchmark databases. Even for challenging dermoscopic images, our proposed network also yields high performance on lesion segmentation.
II Previous Work
In the past decades, many methods have been reported to deal with the challenges in skin lesion segmentation. Those algorithms can be broadly divided into two categories: unsupervised and supervised methods.
Unsupervised methods mainly focus on thresholding [55, 10], clustering [38, 58], and deformable contour models [57, 32]. Specifically, Yüksel and Borlu utilized the type-2 fuzzy logic technique for automatic threshold determination. Celebi et al. fused four thresholding methods for skin lesion boundary detection. Both thresholding methods separate skin lesions from the background based on the histogram distribution of image intensity, which produces undesirable errors for inhomogeneous skin lesions. Fuzzy c-means clustering has also been employed to segment skin lesions from dermoscopy images. A clustering algorithm assigns pixels with similar characteristics to the same class, and is thus limited in handling artifacts such as hairs and air bubbles. Zhou et al. segmented skin lesions by using a mean shift based gradient vector flow (GVF) algorithm. Ma and Tavares exploited a geometric deformable model to simulate the process of skin lesion segmentation. Deformable model based methods detect the skin lesion boundary by minimizing an energy function defined within the image domain. It is difficult for a deformable model to converge around skin lesions that have low contrast against the background.
Supervised methods can be further grouped into shallow and deep learning based methods. Shallow learning based methods usually employ hand-crafted features for segmentation, and thus require domain knowledge. For instance, Wang et al. deployed 20 descriptive features for lesion segmentation with a neural network classifier. In another approach, a 74-dimension texture feature vector was extracted and then fed into a support vector machine for skin lesion segmentation. Jahanifar et al. applied a supervised saliency detection method tailored for dermoscopic images based on discriminative regional feature integration. Shallow learning based methods typically depend on capturing appropriate low-level appearance information (e.g. color and texture structure from shallow layers) without combining high-level semantic information, which limits both the capability of skin lesion localization and the generalization of these approaches to other medical image segmentation tasks.
Recently, deep learning has achieved great success [33, 29], and deep learning based segmentation techniques have been reported for skin lesion prediction following their success in other medical image analysis fields, such as multi-modal brain tumor segmentation [35, 44], gland segmentation, pulmonary nodule detection, and body organ recognition. Gu et al. integrated a dense atrous convolution module and residual multi-kernel pooling with an encoder-decoder structure for the segmentation of optic disc, retinal vessel, lung, cell contour and OCT layer. Zhang et al. embedded edge-attention representations to guide the process of segmentation on optic disc, retinal vessel, and lung. Attention modules incorporated in deep learning architectures have also shown their strengths in many computer vision tasks [48, 22]. Schlemper et al. encapsulated attention gates into a 3D U-Net architecture for abdominal organ segmentation. The purpose of the attention gates is to highlight salient features that are passed through the skip connections. Wang et al. built a 3D attention guided deep learning network for prostate segmentation by harnessing the spatial context across deep and shallow layers. It refines the features at each individual layer by selectively leveraging the multi-level features integrated from different layers. For the skin lesion segmentation task, Yuan et al. segmented skin lesions using deep fully convolutional networks (FCNs) with Jaccard distance. Yu et al. utilized a fully convolutional residual network incorporating a multi-scale contextual information integration scheme to automatically segment skin lesions. Bi et al. presented a multi-stage segmentation approach where early-stage FCNs extracted coarse appearance information and late-stage FCNs learned the subtle characteristics of the lesion boundaries. Yuan et al. designed a deeper network architecture with more and smaller convolutional kernels than in their previous work, and investigated the benefit of using the channels from the Hue-Saturation-Value color space. U-Net has also been applied to skin lesion segmentation. Al-masni et al. achieved pixel-wise segmentation of skin lesions with a full resolution convolutional network (FrCN), which eliminated the subsampling layers in the network and enabled the convolutional layers to extract and learn the full spatial features of the input image. Bi et al. trained the deep ResNet model independently across different classes and refined the segmentation performance by a probability based step-wise integration.
Different from the aforementioned medical image segmentation approaches, our proposed network investigates the complex correlation between skin lesions and their informative context to achieve a discriminative feature representation, which leads to more robust high-level parsing. Furthermore, a multi-scale consistent decision fusion technique is proposed to make more reliable and precise skin lesion prediction by analyzing the consistency of decisions in a local area. Quantitative and qualitative evaluations show the superiority of the proposed network on skin lesion segmentation.
III Proposed Framework
In this paper, we tackle the challenging task of skin lesion segmentation. Fig. 2 shows the overall framework of the proposed network, where an FCN-like architecture with ResNet50 (pre-trained on ImageNet) is applied as our baseline network. On top of it, we control feature information passing with a series of modulators from two directions: the forward direction from local to zooming-out region (marked as a green solid arrow in Fig. 2) and the backward direction from global to zooming-in region (marked as a red dotted arrow in Fig. 2). The forward feature propagation simulates the human visual perceptual process, in which the cerebral cortex encodes a visual stimulus by first extracting local features of the image in the lower visual pathways and then integrating the local features into global features in the higher visual pathways. This integration of local features into more global ones dominates the anatomical, physiological and behavioral studies of the vision system. The other direction of information passing follows another visual perceptual mechanism that relies on feature cognition from global to local [3, 59]. With this bi-directional dermoscopic feature learning design, local and contextual features cooperate with each other in describing the skin lesion. Furthermore, multi-scale consistent decision fusion (shown in the yellow box in Fig. 2) helps the network selectively combine the informative decisions from multiple classification layers, which improves the reliability and consistency of the prediction. Details of the proposed bi-directional dermoscopic feature learning and multi-scale consistent decision fusion are described in the following sections.
III-A Context Feature Map Generation with Multiple Dilation Rates
Deep convolutional neural networks (CNNs) progressively reduce the feature map resolution through a series of consecutive pooling and strided convolution operations. This produces an abstract feature representation that characterizes local areas of an image. Different from object recognition, which classifies the whole input image, the segmentation task requires classifying every local pixel separately. Thus, we need to generate high-level semantic context features for every local pixel for more reliable classification. To recognize the complex structure of the skin lesion, it is essential to form a discriminative high-level semantic context feature representation for skin lesions in dermoscopic images. Since the semantic meanings of feature maps generated at different spatial configurations of the skin lesion are correlated, passing information through each part of the lesion structure can effectively improve the feature description of the skin lesion. Herein, we explore an effective way to integrate the information of skin lesions with their context.
Let $F$ denote the feature maps generated at the top layer of a pretrained CNN, i.e. the output of Block 5 as illustrated in Fig. 2. The multi-scale context feature maps $F_i$ are produced by a series of dilated convolutions with ascending dilation rates $r_1 < r_2 < \dots < r_n$:

$$F_i = \mathcal{D}_{r_i}(F;\ \theta_i), \quad i = 1, \dots, n, \tag{1}$$

where $\mathcal{D}_{r_i}$ is the function of $3\times3$ convolution with the dilation rate $r_i$, $\theta_i$ are the respective parameters, and $n$ is the number of dilation rates. The set of feature maps $\{F_i\}_{i=1}^{n}$ covers rich context information of the skin lesion, since dilated convolution filters with different receptive field sizes are applied to perceive the information near and far. The relationship between different dilation rates and their capability of encoding spatial contextual information is elaborated as follows.
(1) Small dilation rates: Dilated convolutions with small rates focus on the extraction of features around the local region, which allows the network to form a detailed feature representation. Small dilation rates preserve the locality of features but are sensitive to artifacts that emerge during dermoscopic image acquisition, such as noise and air bubbles. Small dilation rates provide more accurate boundary information but have a high potential to lead the network to predict many isolated false lesion regions.
(2) Medium dilation rates: With increasing dilation rates, the receptive fields of the dilated convolution filters are enlarged to capture much larger spatial contextual information of skin lesions. Compared to small dilation rates, medium dilation rates allow the network to harness a larger scale of contextual information more efficiently and thus generate predictions that are more robust to challenging artifacts, irregular lesion tissues, and inhomogeneous dermoscopic images. Since medium dilation rates do not match the size of extremely tiny or extremely large skin lesions, features extracted with medium dilation rates can hardly characterize either of them.
(3) Large dilation rates: Large dilation rates further enlarge the receptive fields of the dilated convolution filters, which makes them suitable for capturing skin lesions of large scale. Dilated convolutions with large dilation rates are able to see a more global skin lesion pattern, but they lose power in capturing detailed local information and thus yield coarse predictions for skin lesions.
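As a rough illustration of how the dilation rate changes the spatial extent of a single filter, the following minimal sketch computes the extent covered by a 3×3 kernel for the five rates listed in the implementation details (3, 6, 12, 18, 24). The helper name `effective_kernel_size` is hypothetical, and the extents are measured on the grid of the Block 5 feature maps, not in input pixels.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent (in feature-map cells) spanned by one dilated kernel."""
    # A dilation rate r inserts (r - 1) gaps between neighbouring kernel taps.
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Dilation rates taken from the implementation details of this paper.
rates = (3, 6, 12, 18, 24)
extents = {r: effective_kernel_size(3, r) for r in rates}
print(extents)  # {3: 7, 6: 13, 12: 25, 18: 37, 24: 49}
```

The extent grows linearly with the rate (2r + 1 for a 3×3 kernel) while the number of weights stays at nine, which is why large rates see a more global pattern but sample it sparsely.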
III-B Bi-directional Dermoscopic Feature Learning
A common way of utilizing the feature maps generated by dilated convolutions with multi-scale rates is to concatenate them directly. For the task of skin lesion segmentation, simple concatenation largely increases the feature dimensionality that will degrade the generalization capability of the feature maps for classification. In this paper, we propose a bi-directional dermoscopic feature learning module to perceive the information of lesion characteristics. It controls the propagation of the feature flow among the complex spatial configuration of the skin lesion. Fig. 3 illustrates the process of the feature information passing along two directions. Given three feature maps obtained from dilated convolutions with different rates as shown in Fig. 3 (a), the complex correlation between skin lesion and their informative context can be captured through two directions: one from local to zooming-out region (Fig. 3 (b)) and the other from global to zooming-in region (Fig. 3 (c)), respectively. By integrating both directional feature learning propagation, the proposed biDFL module allows the network to form a substantially rich description of skin lesion.
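More specifically, let $F_i$ ($i = 1, \dots, n$) denote the context feature maps produced with the $i$-th dilation rate. The feature maps after information propagation from local to zooming-out region are refined to be

$$\overrightarrow{F}_i = \overrightarrow{\mathcal{C}}_i\big([\overrightarrow{F}_{i-1},\ F_i];\ \overrightarrow{\theta}_i\big), \quad i = 2, \dots, n, \qquad \overrightarrow{F}_1 = F_1, \tag{2}$$

where $[\cdot,\cdot]$ is the concatenation process, $\overrightarrow{\mathcal{C}}_i$ is the function of convolution, and $\overrightarrow{\theta}_i$ are the respective parameters. The number of channels of $\overrightarrow{F}_i$ is kept the same as that of $F_i$ through a number of $1\times1$ convolutions. This progressively accumulates the informative context $F_1, \dots, F_{i-1}$ into $\overrightarrow{F}_i$, which thus has a more powerful representation of the complex structure of the skin lesion. Especially for $\overrightarrow{F}_n$, generated from a large receptive field, the degradation of the representation ability of feature maps generated by simple dilated $3\times3$ convolutions is alleviated by the proposed feature information propagation.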
To obtain complementary and detailed features for the skin lesion, we exploit another information passing that has the opposite propagation direction. The refined feature maps through this direction are updated as

$$\overleftarrow{F}_i = \overleftarrow{\mathcal{C}}_i\big([\overleftarrow{F}_{i+1},\ F_i];\ \overleftarrow{\theta}_i\big), \quad i = n-1, \dots, 1, \qquad \overleftarrow{F}_n = F_n. \tag{3}$$

As formulated in Eqs. (2-3), the correlation between skin lesions and their informative context is learned through a series of modulators ($\overrightarrow{\mathcal{C}}_i$ and $\overleftarrow{\mathcal{C}}_i$). The feature maps $\overrightarrow{F}_i$ receive messages through the forward feature learning process, which aggregates a series of zooming-out context feature information from $F_j$ ($j \le i$). Moreover, the feature maps $\overleftarrow{F}_i$ obtain messages through the backward feature learning process, which fuses a set of zooming-in context feature information from $F_j$ ($j \ge i$). By combining the feature propagation along both directions, our proposed biDFL achieves a discriminative feature representation for lesions, where information progressively passes among the complex spatial configuration of the skin lesion.
Stacking convolution layers together by two directions also effectively enriches the set of feature maps with multiple receptive fields. This is important for skin lesion segmentation since features for each lesion pixel can be represented more discriminatively by propagating information flow between those consecutive convolution layers of the proposed framework. Especially for some skin lesions with similar appearance as their non-lesion neighborhoods, without the support of information from proper receptive fields, they are hard to be distinguished from background. With the design of sequentially aggregating the features from multiple receptive fields, our proposed biDFL assists the network to learn feature description with rich multi-scale information, which is helpful for those challenging cases.
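The two propagation directions can be sketched in a few lines of NumPy. This is only an illustrative reading of the text, under several assumptions that the paper does not confirm: the update is taken to be recursive (each step concatenates the previously refined map with the next context map), the modulator is a plain 1×1 convolution followed by a ReLU, and the endpoints are passed through unchanged. All function and variable names are hypothetical, and how the two directions are finally merged is left open.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution: x has shape (C_in, H, W), weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def bidirectional_propagation(context_maps, fwd_weights, bwd_weights):
    """Propagate information across context maps ordered by ascending dilation rate.

    context_maps: list of n arrays of shape (C, H, W).
    fwd_weights, bwd_weights: lists of n - 1 arrays of shape (C, 2 * C).
    """
    n = len(context_maps)

    # Forward: local -> zooming-out (assumed recursion, ReLU is an assumption).
    forward = [context_maps[0]]
    for i in range(1, n):
        stacked = np.concatenate([forward[-1], context_maps[i]], axis=0)
        forward.append(np.maximum(conv1x1(stacked, fwd_weights[i - 1]), 0.0))

    # Backward: global -> zooming-in, built from the largest rate downwards.
    backward = [context_maps[-1]]
    for i in range(n - 2, -1, -1):
        stacked = np.concatenate([backward[-1], context_maps[i]], axis=0)
        backward.append(np.maximum(conv1x1(stacked, bwd_weights[i]), 0.0))
    backward.reverse()  # restore ascending-rate order

    return forward, backward


rng = np.random.default_rng(0)
C, H, W, n = 8, 16, 16, 5
maps = [rng.standard_normal((C, H, W)) for _ in range(n)]
fw = [0.1 * rng.standard_normal((C, 2 * C)) for _ in range(n - 1)]
bw = [0.1 * rng.standard_normal((C, 2 * C)) for _ in range(n - 1)]
fwd, bwd = bidirectional_propagation(maps, fw, bw)
print(len(fwd), fwd[-1].shape, bwd[0].shape)
```

Because every step maps 2C channels back to C, the channel count and the spatial resolution of each refined map match those of the corresponding context map, in line with the description above.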
III-C Multi-scale Consistent Decision Fusion
Dermoscopic images have a large variation in the size of skin lesions. We measure the size of each skin lesion relative to that of the respective dermoscopic image and find great variation of this ratio in the dermoscopic image databases. Specifically, the maximum ratios for the ISBI 2016 and ISBI 2017 databases are 0.9954 and 0.9522 respectively, whereas the minimum ratios are 0.0027 and 0.0030 respectively. This shows that there exist skin lesions that are tiny and inconspicuous compared to the background. For traditional networks like VGG net and ResNet-50, the spatial resolutions of the feature maps are usually reduced by a factor of 32 after a series of down-sampling processes. It is hard for those networks to capture the information of such small skin lesions, even though they play an important role in the diagnosis of melanoma. In addition, the boundaries of skin lesions are complex curves that are difficult to delineate precisely from down-sampled feature maps with low resolution.
To further enhance the prediction performance, we exploit features from shallow layers, as they contain more detailed information about inconspicuous skin lesions as well as complex boundary structure. Previous works [30, 34, 42, 41] mainly integrate score maps non-selectively from different skip layers, where some inappropriate scores may decrease the precision of the skin lesion prediction. For example, the score maps from low-level layers may contain noisy information in homogeneous regions. On the other hand, the scores from high-level layers for pixels located near the boundary of the skin lesion are less informative, since they are insensitive to the spatial location of the skin lesion. Therefore, pixels far away from the lesion boundary are more reliably classified by using larger scales of features in high-level layers, while pixels near the lesion boundary need smaller scales of features in low-level layers for better localization of the segmentation boundary. An adaptive score map aggregation that selects reliable scale features for each pixel is therefore beneficial to skin lesion segmentation. In this paper, we propose a multi-scale consistent decision fusion strategy that selectively aggregates score maps by controlling the reliability of the multi-scale feature representation among skip layers. By embedding consistency analysis into the decisions from each classification layer, our proposed mCDF helps the network learn which scales of features are more desirable for each individual pixel.
Suppose that for a class $c$ (here skin lesion or background), there are $K$ score maps $S_k^c(p)$ generated by skip layers from different scales of features, where $p$ is the spatial position and $k = 1, \dots, K$. We compute the coefficient of mCDF, $w_k^c(p)$, by assessing the consistency of the $k$-th decision within a local region centered at $p$. We formulate $w_k^c(p)$ as a Gaussian function of the standard deviation of the scores $S_k^c$ over a spatial local region $\Omega_k(p)$.
The Gaussian function keeps the decision consistency within an effective range, i.e. from 0 to 1. If the prediction from a skip layer for a pixel is consistent with its context, the output of the Gaussian function is close to 1. Otherwise it decays to 0 quickly. In addition, the Gaussian form is convenient for implementing gradient back-propagation. The parameter $\lambda$ controls the sensitivity of the consistency coefficient to the variation of the score in the local region. For each position $p$, we compute $K$ multi-scale consistency coefficients for the $K$ skip layers. The coefficient $w_k^c(p)$ reflects the consistency of the decision in a local region centered at position $p$ at the $k$-th skip layer. A smaller coefficient implies a less reliable decision at the $k$-th skip layer, so the score fusion process should suppress the effect of $S_k^c(p)$. A larger coefficient means that, for position $p$, the decision at the $k$-th skip layer is more desirable and has high potential to achieve an accurate prediction. As a consequence, the effect of $S_k^c(p)$ with a larger coefficient should be highlighted during the decision fusion.
Let $S_k^c(p)$ denote the score of class $c$ at position $p$ from the $k$-th of $K$ skip layers, and $w_k^c(p)$ the corresponding consistency coefficient. Under the control of the consistency coefficient, all score maps are selectively fused by

$$S^c(p) = \sum_{k=1}^{K} w_k^c(p)\, S_k^c(p). \tag{5}$$

The coefficient $w_k^c(p)$ is adaptive to the score map $S_k^c$ without a complicated heuristic learning process. Eq. (5) is the decision refinement mechanism that adjusts the contribution of the decision of each skip layer by the factor $w_k^c(p)$. In contrast to simple score summation, the mCDF relies on the consistency of the score map at each skip layer to fuse multiple decisions selectively. Moreover, the model trained with mCDF implicitly learns to inhibit irrelevant noise while adaptively merging rich feature information at multiple scales. Therefore, by adding multi-scale consistent decision fusion, our proposed network is better able to recover the details in shallow layers and yields reliable pixel-wise skin lesion prediction maps.
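The consistency-weighted fusion described in this section can be sketched as follows. The exact Gaussian form is an assumption here: the coefficient is taken to be exp(-λσ²), where σ is the local standard deviation of the scores, and whether the paper multiplies or divides by its sensitivity parameter is not recoverable from the text. The square window, the edge padding and all names (`local_std`, `consistency_fusion`, `lam`) are likewise hypothetical, and the score maps are assumed to be already up-sampled to a common resolution.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def local_std(score, size):
    """Standard deviation of `score` (H, W) in a size x size window (size odd)."""
    pad = size // 2
    padded = np.pad(score, pad, mode="edge")  # border handling is an assumption
    windows = sliding_window_view(padded, (size, size))  # shape (H, W, size, size)
    return windows.std(axis=(-2, -1))


def consistency_fusion(score_maps, sizes, lam=10.0):
    """Fuse per-layer score maps, weighting each pixel by its local consistency.

    score_maps: list of K arrays (H, W) for one class.
    sizes: list of K odd window sizes, one per skip layer.
    """
    fused = np.zeros_like(score_maps[0], dtype=float)
    weights = []
    for score, size in zip(score_maps, sizes):
        # Assumed Gaussian form: 1 for a flat neighbourhood, near 0 for a noisy one.
        w = np.exp(-lam * local_std(score, size) ** 2)
        weights.append(w)
        fused += w * score
    return fused, weights


flat = np.full((6, 6), 0.7)                          # perfectly consistent decisions
noisy = (np.indices((6, 6)).sum(axis=0) % 2) * 1.0   # checkerboard of 0/1 scores
fused, ws = consistency_fusion([flat, noisy], sizes=[3, 3])
print(ws[0].min(), ws[1].max())
```

In this toy case the flat map keeps a coefficient of 1 everywhere while the checkerboard map is strongly suppressed, so the fused result stays close to the consistent decision.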
IV Experimental Evaluation
We test the proposed framework on two publicly available databases (ISBI 2016 and ISBI 2017) provided by the International Skin Imaging Collaboration (ISIC) archive. ISBI 2016 and ISBI 2017 are the two challenge databases of "Skin Lesion Analysis toward Melanoma Detection" hosted by the International Symposium on Biomedical Imaging (ISBI) in 2016 and 2017 respectively.
ISBI 2016 comprises 900 training images and 379 test images in JPEG format. Ground truths for ISBI 2016, created by an expert clinician, are encoded as single-channel (grayscale) 8-bit PNGs. Training and test images are diagnosed as non-melanoma or melanoma, resulting in 727 non-melanoma and 173 melanoma images in the training set, and 304 non-melanoma and 75 melanoma images in the test set.
ISBI 2017, an extension of ISBI 2016, consists of 2000 training images and 600 test images in JPEG format. The training set contains 374 melanoma, 254 seborrheic keratosis, and 1372 benign nevi images, while the test set includes 117 melanoma, 90 seborrheic keratosis, and 393 benign nevi images. In addition, the ISBI 2017 database also provides a validation set that includes 150 skin lesion images.
IV-B Evaluation Criterion
Performance of the proposed method is evaluated by comparing the segmented skin lesion with the ground truth provided by each database. Four measurements are used for skin lesion segmentation performance evaluation: Jaccard index (JA), Dice coefficient (DI), segmentation accuracy (AC), and G-mean (GM). JA and DI measure the similarity between the detected result and the ground truth:

$$\mathrm{JA} = \frac{TP}{TP + FN + FP}, \qquad \mathrm{DI} = \frac{2\,TP}{2\,TP + FN + FP},$$

where $TP$ and $TN$ represent the numbers of pixels correctly classified as lesion and background pixels, and $FN$ and $FP$ denote the numbers of pixels incorrectly classified as background and lesion pixels, respectively. AC is the pixel-wise accuracy, i.e. the ratio of correctly detected pixels to total pixels. GM is a metric that accounts for the balance between segmentation sensitivity and specificity by taking their geometric mean:

$$\mathrm{GM} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}.$$

The ISBI 2016 and 2017 challenges took the Jaccard index (JA) as the most important criterion for segmentation comparison, and participants were ranked based on it.
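The four criteria can be computed directly from a pair of binary masks. The sketch below is a minimal illustration using the standard confusion-matrix definitions; the function name `segmentation_metrics` is hypothetical, and it assumes that both lesion and background pixels are present in the ground truth so that no denominator is zero.

```python
import numpy as np


def segmentation_metrics(pred, truth):
    """Return JA, DI, AC and GM for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)

    tp = np.sum(pred & truth)      # lesion pixels predicted as lesion
    tn = np.sum(~pred & ~truth)    # background pixels predicted as background
    fp = np.sum(pred & ~truth)     # background pixels predicted as lesion
    fn = np.sum(~pred & truth)     # lesion pixels predicted as background

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "JA": tp / (tp + fn + fp),
        "DI": 2 * tp / (2 * tp + fn + fp),
        "AC": (tp + tn) / (tp + tn + fp + fn),
        "GM": np.sqrt(sensitivity * specificity),
    }


truth = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
pred = [[1, 1, 1, 0],
        [1, 0, 0, 0]]
metrics = segmentation_metrics(pred, truth)
print(metrics)  # JA = 0.6, DI = 0.75, AC = 0.75, GM = 0.75
```

In the toy example there are three true positives, three true negatives, one false positive and one false negative, which shows how JA penalizes the two errors more heavily than DI or AC.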
IV-C Implementation Details
We employ an FCN-like architecture with ResNet50 (pre-trained on ImageNet) as our baseline network. In detail, we first remove the last pooling layer and the layers after it. For the last two blocks of the network, we keep the resolution of feature maps unchanged but set the dilation rates of the convolution layers at the two blocks to be 2 and 4 respectively, which allows the reuse of the pre-trained weights. For Blocks 1, 2 and 3, the dimensions of feature maps after each block are , , and respectively. For Blocks 4 and 5, the dimensions of feature maps after each block become and . We generate five sets of feature maps with five different dilation rates for the output of Block 5, i.e. (3, 6, 12, 18, 24), which extend the four dilation rates of ASPP with large FOV. To avoid the network growing too wide and to reduce the computational cost, we apply a convolution layer before the dilated convolution to decrease the number of feature channels from 2048 to 512.
Since the distribution of lesion and non-lesion pixels is unbalanced, i.e. lesion pixels occupy a relatively small proportion of both the ISBI 2016 and ISBI 2017 databases, this paper uses the weighted cross entropy loss to measure the difference between the true label and the predicted result, shown as [42, 41, 46]:

$$\mathcal{L} = -\frac{1}{N} \sum_{p \in X} \sum_{c=1}^{C} \beta_c\, y_c(p)\, \log \hat{y}_c(p),$$

where $N$ is the number of pixels in image $X$, $y_c(p)$ is the true label for the pixel located at $p$, $\hat{y}_c(p)$ is the corresponding class likelihood, and $C$ is the number of categories. Here $c$ changes from 1 to 2, where 1 indexes the lesion class and 2 indexes the background class, and $\beta_c$ is the weight for class $c$. We set $\beta_1 = 0.8$ for the lesion class and $\beta_2 = 0.2$ for the background class to alleviate the imbalance problem, since the lesion class has only about 20% of the pixels in the whole images. We apply the "poly" learning rate policy, where the initial learning rate is multiplied by $\left(1 - \frac{iter}{max\_iter}\right)^{power}$. The initial learning rate and power are set to and 0.9 respectively.
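A minimal NumPy sketch of the class-weighted cross entropy and the "poly" learning rate schedule is given below. The class weights (0.8 for lesion, 0.2 for background) and the power of 0.9 come from the text; averaging over pixels, the zero-based class indices (0 for lesion, 1 for background), the small constant inside the logarithm and the function names are assumptions. The base learning rate used in the usage lines is a placeholder, not the value used by the authors.

```python
import numpy as np


def weighted_cross_entropy(probs, labels, class_weights=(0.8, 0.2)):
    """Class-weighted cross entropy averaged over pixels.

    probs: (N, C) predicted class likelihoods, rows sum to 1.
    labels: (N,) integer class indices (assumed 0 = lesion, 1 = background).
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    weights = np.asarray(class_weights)[labels]          # per-pixel class weight
    picked = probs[np.arange(len(labels)), labels]       # likelihood of the true class
    return float(-np.mean(weights * np.log(picked + 1e-12)))


def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """'Poly' policy: base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power


loss = weighted_cross_entropy([[0.5, 0.5]], [0])
print(round(loss, 4))  # 0.8 * ln(2) ≈ 0.5545

placeholder_lr = 0.01  # hypothetical value for illustration only
print(poly_lr(placeholder_lr, 0, 30000), poly_lr(placeholder_lr, 30000, 30000))
```

With these weights a misclassified lesion pixel costs four times as much as a misclassified background pixel, which counteracts the roughly 20%/80% pixel split noted above.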
The sensitivity parameter of the consistency coefficient (Section III-C) is empirically set to 10 for controlling the sensitivity of the consistency coefficient to the variation of the score. The stochastic gradient descent (SGD) algorithm is used to train our end-to-end network. The number of iterations is set to 30k for the ISBI 2017 database and 12k for the ISBI 2016 database. To capture effective feature maps at different scales, we design a controllable local region size for each skip layer. For feature maps from the first three Blocks, is set to as each block has a down-sampling process. For feature maps from Blocks 4 to 5, as well as the five dilated convolution layers, is progressively enlarged as , ,…,, respectively.
Skip layers utilize transposed convolution kernels to achieve the up-sampling operation. For batch processing, all images are resized to have a maximum extent of 512 pixels. Since the ISBI 2016 database only provides a training set and a test set, we randomly select 800 images from the training set for training, and use the remaining 100 images in the training set for validation. For the ISBI 2017 database, we conduct training, validation, and testing on its provided training, validation and test sets, respectively. One of the challenges of skin lesion segmentation is the insufficiency of high-quality training data. An augmentation strategy that includes random flipping of images (horizontally and vertically) and random scaling in the range of [0.8, 1.2] is applied to generate more diverse training data.
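The resizing rule and the augmentation parameters can be sketched with the standard library alone. The 512-pixel maximum extent and the [0.8, 1.2] scaling range are taken from the text; preserving the aspect ratio, the flip probability of 0.5 and the helper names are assumptions.

```python
import random


def resized_shape(height, width, max_extent=512):
    """Target (height, width) so that the longer side equals max_extent."""
    scale = max_extent / max(height, width)  # aspect ratio kept (assumption)
    return max(1, round(height * scale)), max(1, round(width * scale))


def sample_augmentation(rng):
    """Draw one random set of augmentation parameters."""
    return {
        "flip_horizontal": rng.random() < 0.5,  # probability 0.5 is an assumption
        "flip_vertical": rng.random() < 0.5,
        "scale": rng.uniform(0.8, 1.2),
    }


print(resized_shape(768, 1024))  # (384, 512)
params = sample_augmentation(random.Random(0))
print(params)
```

The same flips and scale factor would have to be applied to the image and to its ground-truth mask so that the two stay aligned.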
IV-D Ablation Studies
IV-D1 Evaluation of individual components in our approach
To evaluate the proposed architecture on skin lesion segmentation, we conduct step-by-step ablation experiments on dermoscopic images in the ISBI 2016 and ISBI 2017 databases. The detailed quantitative experimental results are shown in Table I, where we can observe that each proposed contribution improves the baseline network on skin lesion segmentation. By jointly learning features from the complex spatial configuration of the skin lesion through two complementary directions, we improve the segmentation performance JA for the ISBI 2016 and ISBI 2017 databases by 2.7% and 3.1% respectively. By integrating the multi-scale consistent decision fusion, we enhance the segmentation performance JA on both databases by about 1.5%-1.6%. The combination of bi-directional dermoscopic feature learning and multi-scale consistent decision fusion achieves a performance superior to the baseline network, with a JA enhancement of 3.7%-5.7% on both databases. In addition, the segmentation performance improves significantly and consistently on melanoma and non-melanoma cases in both databases, which demonstrates the reliability and robustness of the proposed architecture in dealing with the task of skin lesion segmentation.
We also qualitatively analyze the effectiveness of the proposed architecture for skin lesion prediction. Figs. 4 and 5 show experimental results on some challenging dermoscopic images, i.e. skin lesions with variable scale, low contrast, artifacts, and irregular convex and concave boundaries. Compared to the segmentation results of the baseline architecture, our proposed network predicts complex skin lesion boundaries more precisely. For example, the images in the first column of Fig. 5 show the segmentation results on a skin lesion with a convex boundary. The proposed model makes the segmentation result converge around the convex boundary, while the baseline model leaves a large gap between the segmentation and the ground truth. For a skin lesion with a concave boundary (e.g. the images in the fourth column of Fig. 4), the baseline model yields only a smooth boundary that loses parts of the geometrical structure of the lesion. By contrast, our proposed model delineates the concave boundary in more detail. In addition, the images in the last column of Fig. 4 show a low-contrast skin lesion for which the baseline model struggles to generate discriminative feature maps and make a correct prediction. Our proposed model alleviates this limitation and produces a better delineation of the lesion boundary. Moreover, Fig. 5 (i.e. the images in the sixth column) shows an interesting case in which the baseline with biDFL generates some holes in the skin lesion prediction map, while the proposed multi-scale consistent decision fusion effectively corrects the inconsistent false predictions and improves the overall segmentation result. The qualitative analyses in Figs. 4 and 5 clearly illustrate the validity of the proposed model for skin lesion segmentation.
IV-D2 Evaluation of detailed branch of the proposed feature learning scheme
To investigate the effect of the proposed biDFL strategy on skin lesion segmentation, we conduct several controlled experiments that vary the feature learning scheme. Table II shows the performance of dermoscopic feature learning with message propagation in single forward and backward directions (denoted by and , respectively). Compared to the baseline model, the proposed framework with feature passing along a single direction (forward or backward) improves the JA by 1.5%-2.6%, which validates the effectiveness of directional feature propagation for skin lesion segmentation. Moreover, combining feature passing along the two complementary directions further boosts the high-level parsing performance of the network, yielding an additional JA increase of 0.5%-1.2%. We also observe that feature passing along the forward direction performs slightly better than feature passing along the backward direction on lesion detection, i.e. the JA is 0.1%-0.3% higher.
Furthermore, we compare the proposed framework with atrous spatial pyramid pooling (ASPP) and DenseASPP in Table III. ASPP directly concatenates multiple atrous-convolved features with different dilation rates into a final feature representation. DenseASPP emphasizes generating features that densely cover a large scale range. It achieves a multi-scale feature representation by stacking a set of atrous convolutional layers. The feature maps generated at the smallest scale dominate the feature representation in DenseASPP, since they participate in producing the feature maps at every scale. In contrast to the feature concatenation schemes of ASPP and DenseASPP, our method focuses on learning the relationship between different parts of the lesion anatomical structure by passing messages among different receptive fields along two complementary directions, so that local and contextual features cooperate effectively to improve the discriminative power of the feature representation. The results in Table III show that the proposed feature learning framework outperforms ASPP and DenseASPP, i.e. improving the JA by 1.6%-2.0%.
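The difference between parallel (ASPP-style) and stacked (DenseASPP-style) atrous layers can be made concrete with a small receptive-field calculation. The sketch below assumes 3×3 kernels with stride 1 and uses the rate set (6, 12, 18, 24) purely as an illustration; the function names are hypothetical, and the code does not model the message passing of the proposed method.

```python
def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by one dilated (atrous) convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def parallel_receptive_fields(rates, kernel_size=3):
    """ASPP-style: each branch sees the input directly, one field per rate."""
    return [effective_kernel(kernel_size, r) for r in rates]


def stacked_receptive_field(rates, kernel_size=3):
    """Stacked layers: each stride-1 layer widens the receptive field further."""
    field = 1
    for r in rates:
        field += effective_kernel(kernel_size, r) - 1
    return field


rates = (6, 12, 18, 24)  # illustrative rate set, assumed 3x3 kernels
print(parallel_receptive_fields(rates))  # [13, 25, 37, 49]
print(stacked_receptive_field(rates))    # 121
```

Under these assumptions, parallel branches each cover one fixed scale, whereas stacking compounds the dilations into a much larger field that always passes through the smallest-rate layer first.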
IV-D3 Evaluation of different local region sizes of multi-scale consistent decision fusion
We investigate the segmentation performance obtained with different local region sizes for multi-scale consistent decision fusion, as shown in Table IV. When the local region size is fixed, a small size performs slightly better than a large one. However, in our method the size is progressively increased across layers, which produces visibly better segmentation performance, i.e. enhancing the JA by 1.0%-1.5%.
IV-D4 Evaluation of different numbers of training images
The ground truths for skin lesion segmentation are taken from the International Skin Imaging Collaboration (ISIC), where they are created by an expert clinician. Although human annotators are not error-free, the ISIC annotations of skin lesions are reliable enough for performance assessment. We therefore evaluate our proposed approach against the human-annotated ground truths. To analyze the effect of overfitting, we train the model with different numbers of training images on the ISBI 2017 database. Specifically, we randomly select 500, 1000, and 1500 training images for network training. Table V shows the segmentation performance for the different numbers of training samples. As the number of training images increases, the JA progressively improves, since more general information (less overfitting to specific training data) can be learned from a larger and more diverse set of training samples.
IV-D5 Evaluation of the sensitivity to dilation rates
To investigate how the dilation rates affect the segmentation performance, we conduct experiments with different dilation rate sets on the ISBI 2017 database, as shown in Table VI. Compared to the dilation rate set (3, 6, 12, 18), the set used in ASPP (6, 12, 18, 24) produces better segmentation performance, i.e. improving the JA by 0.3%. The dilation rate set designed in our work further improves the JA by 0.6%.
IV-E Comparison to Other Published Methods
|Arroyo and Zapirain ||79.10||86.90||93.40|
|Yuan et al. ||84.70||91.2||95.50|
|Bi et al. ||84.64||91.18||95.51|
|Yuan et al. ||84.90||91.30||95.70|
|Bi et al. ||85.92||91.77||95.78|
|Arroyo and Zapirain ||66.50||76.00||88.40|
|Lin et al. ||65.00||79.00||N.A|
|Al-masni et al. ||77.11||87.08||94.03|
|Bi et al. ||77.73||85.66||94.08|
We compare our method with the top 10 teams in the ISBI 2016 and 2017 challenges and with 7 other leading published methods. All compared results are taken from the respective publications. Tables VII and VIII report the segmentation performance of the different methods on the ISBI 2016 and 2017 databases, respectively. Compared to those state-of-the-art architectures, our proposed model consistently produces the best segmentation performance on skin lesions in both databases. On the ISBI 2017 database, which contains more complex skin lesions that are difficult to distinguish from the background, our proposed method shows a larger performance gain, with a JA improvement of more than 3.7% over the second-best approach. Tables IX and X list the segmentation performance of the different methods on melanoma and non-melanoma cases separately, where the proposed method consistently outperforms the other techniques in both cases. Segmentation of melanoma is more difficult than that of non-melanoma because of the severe inhomogeneity of the lesion patterns. The higher performance gain on melanoma than on non-melanoma segmentation in Table X indicates the effectiveness of the proposed method for melanoma detection, which benefits the further inspection of melanoma.
Fig. 6 compares the distributions of the Jaccard index on the two skin lesion databases. For the ISBI 2016 database, 57.8% of the segmentation results produced by the proposed architecture have a Jaccard index higher than 90%, a noticeable improvement over other methods, e.g. 17.4% higher than the second-best Team-EXB . For the more difficult ISBI 2017 database, 70.3% of the segmentation results produced by the proposed architecture have a Jaccard index larger than 80%, an increase of 12.6% compared to Team-Yuan . The analysis of the Jaccard index distribution indicates the utility and stability of our proposed method for skin lesion segmentation: results with a high Jaccard index make up the majority on both the ISBI 2016 and 2017 databases, significantly better than the other methods used for comparison.
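The kind of distribution statistic quoted above (the share of test images whose Jaccard index exceeds a threshold) can be computed as in the following sketch. The scores are invented for illustration and the function name is hypothetical; none of the values come from the paper.

```python
def share_above(scores, threshold):
    """Fraction of per-image scores that are strictly greater than threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s > threshold) / len(scores)


# Made-up per-image Jaccard indices, not results from the paper.
per_image_ja = [0.95, 0.91, 0.88, 0.72, 0.93]

print(share_above(per_image_ja, 0.90))  # 0.6
print(share_above(per_image_ja, 0.80))  # 0.8
```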
|Arroyo and Zapirain ||80.94||88.60||86.51||78.68|
|Bi et al. ||85.84||92.03||84.34||90.97|
|Bi et al. ||85.62||91.72||85.60||91.78|
|Arroyo and Zapirain ||65.81||76.56||66.70||75.88|
|Bi et al. ||72.18||81.65||79.07||86.63|
IV-F Result Summary
The proposed network consistently achieves state-of-the-art segmentation performance on two publicly available skin lesion databases, in both melanoma and non-melanoma cases. The results recorded in Table I and illustrated in Figs. 4 and 5 show the significant performance gain brought by the two proposed techniques, biDFL and mCDF. The superiority of the proposed biDFL comes from passing informative features along two complementary directions, which lets the feature maps receive discriminative information from the complex spatial configuration of the skin lesion. The proposed mCDF selectively fuses the decision scores of multi-scale features by checking their consistency in a local spatial area. Extensive comparisons with other reported methods show that our approach consistently performs better on skin lesion segmentation (see Tables VII-X). We attribute this gain to the fact that we investigate the relationship between skin lesions and their informative context more thoroughly, as well as the consistency of the decisions from multiple classification layers, neither of which has been well explored in previous studies. Moreover, beyond skin lesion segmentation, the proposed architecture is flexible enough to be extended to other fields of image analysis.
Although our proposed network achieves high performance for lesion segmentation, some segmentation cases can still be improved, as shown in Fig. 7. Most of those skin lesions have low contrast and irregular structure. The main reason for the insufficient segmentation of those lesions is the scarcity of relevant dermoscopic images in the training data. One way to further improve the segmentation performance is to learn effective feature representations from more readily available training samples. Skin lesion images acquired with mobile computing devices such as smartphones offer an appealing way to collect lesion images efficiently and to self-monitor melanoma. This is an interesting direction worthy of further investigation.
In this paper, we propose a bi-directional dermoscopic feature learning framework to generate a substantially richer description of skin lesion structure. Passing feature information along two complementary directions among high-level layers significantly improves the parsing ability of the network. Furthermore, we propose a multi-scale consistent decision fusion that selectively focuses on the more consistent decisions generated by multiple classification layers, which yields more reliable skin lesion prediction on dermoscopic images. Both qualitative and quantitative analyses of the segmentation performance on two publicly available databases show the superiority of the proposed method, especially for delineating tiny skin lesion structures and localizing complex boundaries. As an effective and efficient segmentation tool, the proposed network is flexible enough to be extended to many other image segmentation problems.
The authors would like to thank the organizers of the International Symposium on Biomedical Imaging 2016 and 2017 for kindly providing the benchmark databases and annotations.
- Skin lesion analysis towards melanoma detection. Note: https://challenge.kitware.com/#phase/566744dccad3a56fac786787.
- Skin lesion analysis towards melanoma detection. Note: https://challenge.kitware.com/#phase/584b0afacad3a51cc66c8e24.
- (2004) The reverse hierarchy theory of visual perceptual learning. Trends cogn. sci. 8 (10), pp. 457–464.
- (2017) Saliency-based lesion segmentation via background detection in dermoscopic images. IEEE J. Biomed. Health Inform 21 (6), pp. 1685–1693.
- (2018) Skin lesion segmentation in dermoscopy images via deep full resolution convolutional networks. Comput. Meth. Prog. Bio. 162, pp. 221–231.
- (2017) Development of a clinically oriented system for melanoma diagnosis. Pattern Recogn. 69, pp. 270–285.
- (2019) Step-wise integration of deep class-specific learning for dermoscopic image segmentation. Pattern Recogn. 85, pp. 78–89.
- (2017) Dermoscopic image segmentation via multi-stage fully convolutional networks. IEEE Trans. Biomed. Eng 64 (9), pp. 2065–2074.
- (2011) Automated prescreening of pigmented skin lesions using standard cameras. Comput. Med. Imag. Grap. 35 (6), pp. 481–491.
- (2013) Lesion border detection in dermoscopy images using ensembles of thresholding methods. Skin Research and Technology 19 (1), pp. e252–e258.
- DCAN: deep contour-aware networks for accurate gland segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2487–2496.
- (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40 (4), pp. 834–848.
- (2018) DermaKNet: incorporating the knowledge of dermatologists to convolutional neural networks for skin lesion diagnosis. IEEE J. Biomed. Health Inform.
- (2019) Boundary-aware feature propagation for scene segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6819–6829.
- (2019) Semantic correlation promoted shape-variant context for segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8885–8894.
- (2020) Semantic segmentation with context encoding and multi-path decoding. IEEE Transactions on Image Processing.
- (2018) Context contrasted feature and gated multi-scale aggregation for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2393–2402.
- (2019) Segmentation of skin lesions in dermoscopy images using fuzzy classification of pixels and histogram thresholding. Comput. Meth. Prog. Bio. 168, pp. 11–19.
- (2019) CE-net: context encoder network for 2d medical image segmentation. IEEE Trans. Med. Imaging.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- Automatic skin lesion segmentation based on texture analysis and supervised learning. In Asian Conference on Computer Vision, pp. 330–341.
- (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141.
- (2017) Rapid processing of a global feature in the on visual pathways of behaving monkeys. Front. neurosci. 11, pp. 474.
- (1995) Eye, brain, and vision. Scientific American Library/Scientific American Books.
- (2018) Supervised saliency map driven segmentation of lesions in dermoscopic images. IEEE J. Biomed. Health Inform.
- (2012) Skin tumours. Journal of cutaneous and aesthetic surgery 5 (3), pp. 159.
- (2017) Automated detection and segmentation of vascular structures of skin lesions seen in dermoscopy, with an application to basal cell carcinoma classification. IEEE J. Biomed. Health Inform 21 (6), pp. 1675–1684.
- (2017) Skin lesion segmentation: u-nets versus clustering. In Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7.
- Feature boosting network for 3d pose estimation. IEEE transactions on pattern analysis and machine intelligence.
- (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440.
- (2013) Analysis of the contour structural irregularity of skin lesions using wavelet decomposition. Pattern Recogn. 46 (1), pp. 98–106.
- (2016) A novel approach to segment skin lesions in dermoscopic images based on a deformable model. IEEE J. Biomed. Health Inform 20 (2), pp. 615–623.
- (2019) DeepDeblur: text image recovery from blur to sharp. Multimedia Tools and Applications 78 (13), pp. 18869–18885.
- (2017) Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4353–4361.
- (2016) Brain tumor segmentation using convolutional neural networks in mri images. IEEE Trans. Med. Imaging 35 (5), pp. 1240–1251.
- (2015) Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115 (3), pp. 211–252.
- (2019) Attention gated networks: learning to leverage salient regions in medical images. Medical image analysis 53, pp. 197–207.
- (1999) Segmentation of digitized dermatoscopic images by two-dimensional color clustering. IEEE Trans. Med. Imaging 18 (2), pp. 164–171.
- (2016) Pulmonary nodule detection in ct images: false positive reduction using multi-view convolutional networks. IEEE Trans. Med. Imaging 35 (5), pp. 1160–1169.
- (2015) Four-class classification of skin lesions with task decomposition strategy. IEEE Trans. Biomed. Eng 62 (1), pp. 274–283.
- (2019) Toward achieving robust low-level and high-level scene parsing. IEEE Trans. Image Process. 28 (3), pp. 1378–1390.
- Scene segmentation with dag-recurrent neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 40 (6), pp. 1480–1493.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2018) Interactive medical image segmentation using deep learning with image-specific fine-tuning. IEEE Trans. Med. Imaging.
- (2011) Modified watershed technique and post-processing for segmentation of skin lesions in dermoscopy images. Comput. Med. Imag. Grap. 35 (2), pp. 116–120.
- (2019) Dermoscopic image segmentation through the enhanced high-level parsing and class weighted loss. In Proceedings of the IEEE International Conference on Image Processing, pp. 245–249.
- (2019) Deep attentive features for prostate segmentation in 3d transrectal ultrasound. IEEE transactions on medical imaging.
- (2018) Cbam: convolutional block attention module. In Proceedings of the European Conference on Computer Vision, pp. 3–19.
- (2017) Melanoma classification on dermoscopy images using a neural network ensemble model. IEEE Trans. Med. Imaging 36 (3), pp. 849–858.
- (2016) Multi-instance deep learning: discover discriminative local anatomies for bodypart recognition. IEEE Trans. Med. Imaging 35 (5), pp. 1332–1343.
- (2018) Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3684–3692.
- (2017) Automated melanoma recognition in dermoscopy images via very deep residual networks. IEEE Trans. Med. Imaging 36 (4), pp. 994–1004.
- (2017) Automatic skin lesion segmentation using deep fully convolutional networks with jaccard distance. IEEE Trans. Med. Imaging 36 (9), pp. 1876–1886.
- (2017) Improving dermoscopic image segmentation with enhanced convolutional-deconvolutional networks. IEEE J. Biomed. Health Inform.
- (2009) Accurate segmentation of dermoscopic images by image thresholding based on type-2 fuzzy logic. IEEE Trans. Fuzzy Syst. 17 (4), pp. 976–982.
- (2019) ET-net: a generic edge-attention guidance network for medical image segmentation. Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention Society.
- (2011) Gradient vector flow with mean shift for skin lesion segmentation. Comput. Med. Imag. Grap. 35 (2), pp. 121–127.
- (2009) Anisotropic mean shift based fuzzy c-means segmentation of dermoscopy images. IEEE J. Sel. Top. Signal Process. 3 (1), pp. 26–34.
- (2015) Invariant visual object recognition and shape processing in rats. Behav. brain res. 285, pp. 10–33.