Boundary-aware Transformers for Skin Lesion Segmentation

Skin lesion segmentation from dermoscopy images is of great importance for improving the quantitative analysis of skin cancer. However, the automatic segmentation of melanoma is a very challenging task owing to the large variation of melanoma and ambiguous boundaries of lesion areas. While convolutional neutral networks (CNNs) have achieved remarkable progress in this task, most of existing solutions are still incapable of effectively capturing global dependencies to counteract the inductive bias caused by limited receptive fields. Recently, transformers have been proposed as a promising tool for global context modeling by employing a powerful global attention mechanism, but one of their main shortcomings when applied to segmentation tasks is that they cannot effectively extract sufficient local details to tackle ambiguous boundaries. We propose a novel boundary-aware transformer (BAT) to comprehensively address the challenges of automatic skin lesion segmentation. Specifically, we integrate a new boundary-wise attention gate (BAG) into transformers to enable the whole network to not only effectively model global long-range dependencies via transformers but also, simultaneously, capture more local details by making full use of boundary-wise prior knowledge. Particularly, the auxiliary supervision of BAG is capable of assisting transformers to learn position embedding as it provides much spatial information. We conducted extensive experiments to evaluate the proposed BAT and experiments corroborate its effectiveness, consistently outperforming state-of-the-art methods in two famous datasets.


page 2

page 7


XBound-Former: Toward Cross-scale Boundary Modeling in Transformers

Skin lesion segmentation from dermoscopy images is of great significance...

Global Guidance Network for Breast Lesion Segmentation in Ultrasound Images

Automatic breast lesion segmentation in ultrasound helps to diagnose bre...

Deeply Supervised Rotation Equivariant Network for Lesion Segmentation in Dermoscopy Images

Automatic lesion segmentation in dermoscopy images is an essential step ...

Action Quality Assessment using Transformers

Action quality assessment (AQA) is an active research problem in video-b...

BI-GCN: Boundary-Aware Input-Dependent Graph Convolution Network for Biomedical Image Segmentation

Segmentation is an essential operation of image processing. The convolut...

Frequency and Spatial domain based Saliency for Pigmented Skin Lesion Segmentation

Skin lesion segmentation can be rather a challenging task owing to the p...

Convolutional Nets Versus Vision Transformers for Diabetic Foot Ulcer Classification

This paper compares well-established Convolutional Neural Networks (CNNs...

Code Repositories


Boundary-aware Transformers for Skin Lesion Segmentation

view repo

1 Introduction

Melanoma is one of the most rapidly increasing cancers all over the world. According to the American Cancer Society’s estimation, there are about 100,350 new cases and over 65,00 deaths in 2020 

[mathur2020cancer]. Segmenting skin lesions from dermoscopy images is a key step in skin cancer diagnosis and treatment planning. In current clinical practice, dermatologists usually need to manually delineate skin lesions for further analysis. However, manual delineation is usually tedious, time-consuming, and error-prone. To the end, automated segmentation methods are highly demanded in clinical practice to improve the segmentation efficiency and accuracy.

Figure 1: The challenges of automatic skin lesion segmentation from dermoscopy images: (a)-(d) large skin lesion variations in size, shape, and color, (f)(g) partial occlusion by hair, and (f)-(h) ambiguous boundaries.

It remains, however, a very challenging task because (1) skin lesions have large variations in size, shape,and color (see Fig. 1 (a-d)), (2) present of hair will partially cover the lesions destroying local context, (3) the contrast between some lesions to normal skin are relatively low, resulting in ambiguous boundaries (see Fig. 1 (e-h)), and (4) the limited training data make the task even harder. A lot of effort has been dedicated to overcoming these challenges. Traditional methods based on various hand-crafted features are usually not stable and robust, leading to poor segmentation performance when facing lesions with large variations [5342426]

. The main reason is that these hand-crafted features are incapable of capturing distinctive representations of skin lesions. To solve the problem, deep learning models based on convolutional neural networks (CNN) have been proposed and achieved remarkable performance gains compared with traditional methods such as some advanced version of the fully convolutional network (FCN) 

[yuan2017automatic, 7792699]. However, these models are still insufficient to tackle the challenges of skin lesion segmentation due to the inductive bias caused by the lack of global context. With regard to this, researchers propose various approaches to enlarging the receptive fields inspired by the advancement of dilated convolution [yu2017dilated, yu2019multi]. Lee et al [9157193] extensively incorporate the dilated attention module with boundary prior so that the network predict boundary key-points maps to guide the attention module.

Nevertheless, most existing solutions are still incapable of effectively capturing sufficient global context to deal with above mentioned challenges. Recently, transformers have been proposed to regard an image as a sequence of patches and aggregate feature in global context by self-attention mechanisms [carion2020detr, prangemeier2020attention]. For example, TransUNet [chen2021transunet]

, a hybrid architecture of CNN and transformer, performs well on Synapse multi-organ segmentation. Yet, it is difficult for transformer based framework to achieve the same success on skin lesion segmentation, which usually has only thousands of data not the same as what they have done in the COCO 2017 Challenge 

[carion2020end] containing 118k training images and 5k validation images. Limited images make it difficult to encode position embedding, and hence will not always be able to accurately and effectively model long-range interactions. Moreover, regions of lesion cover a relatively small area compared to normal tissues and generally has ambiguous boundary not as human organs, which interferes with segmentation performance by a large margin.

In this paper, we propose boundary-aware transformer (BAT) to ably handle aforementioned problems, by holistically leveraging the advancement of boundary-wise prior knowledge and transformer-based network. In fact, this design is based on the intuitions for human beings to perceive lesions in vision, i.e. considering global context to coarsely locate lesion area and paying special attention to ambiguous area to specify the exact boundary. Concretely, we propose a boundary-wise attention gate (BAG) in transformer architecture to make full use of boundary-wise prior knowledge. Firstly, BAG would learn which patches in the sequence belong to ambiguous boundary, thus providing a patch-wise attention map to guide this attention gate. Secondly, a novel key-patch map generation algorithm is introduced for adeptly giving the ground-truth label that can best represent the ambiguous boundary of target lesion. Thirdly, the auxiliary supervision of BAG provides feedback to train transformers that can let it efficiently learn position embedding on a relatively small dataset. We evaluate our model on different publicly available databases. One is the ISBI 2016 and PH2 dataset following the experimental setting in most recent work [9157193], the other one is the latest ISIC 2018 dataset consisting of 2594 labeled images in total. All the experiment results demonstrate the significant performance gains of our proposed framework.

2 Method

Figure 2: An overview of the proposed Boundary-Aware Transformer framework.

An overview of our proposed model is illustrated in Fig. 2

. We will first introduce our basic transformer network that leverages the intrinsic local locality of CNN and innate long-range dependency to complement the skin lesion segmentation in Section 

2.1. Then, the overall framework of our boundary-aware transformer for segmenting regions of ambiguous boundary will be elaborated in Section 2.2. In the end, Section 2.3 introduces the hybrid objective function to efficiently train our boundary-aware transformer.

2.1 Basic Transformer for Segmentation

Our basic transformer (BT) blends typical CNN’s and transformer’s architectures to perform segmentation task, by three procedures: image sequentialization, sequence transformation, and atrous prediction. Image sequentialization is typically the primary step at which an image is converted into 1D sequential embedding fitting the input format of subsequent sequence transformation. Atrous prediction is designed to effectively remedy the local feature representation which has been ignored in sequence transformation and robustly segment lesions at multiple scales.

Image Sequentialization. Current literature processing sequentialization by typical CNN encoder [carion2020detr, li2020sttr] or linear projection [dosovitskiy2020vit, zheng2020rethinking]. Despite its successful advancement achieved on some natural vision tasks, linear projection remains the shortcoming that is the high dependency on the amount of dataset, leading to inferior segmentation performance in medical domain [chen2021transunet]. To this end, we perform sequentialization through the former. Given an input image with width and height , the CNN backbone (ResNet50 as default in this work) produces the corresponding image feature map , that the size of each patch is . The feature map is then flattened into 1D patch embedding and added by a learnable positial embedding [gehring2017convolutional] which is randomly initialized to compensate spatial information destroyed by sequentialization, resulting in final sequential embedding as .

Sequence Transformation. Transformer encoder composed of

stacked encoder layers is applied to capture long-range context in a whole dermoscopic image. Each layer in the encoder consists of a multi-head self-attention module (MSA) and a Multilayer Perceptron (MLP) following typical design 

[vaswani2017attention]. Assumed that the input of -th layer is (specially, ), the output can be written as follows:


Eventually, transformed feature of the last layer will be reshaped to 2D format as , for dense prediction in the next.

Atrous Prediction. For segmenting lesions at multiple scales, this module takes transformed feature after self-attention mechanism as input and aims to produce a dense prediction. Aiming to enhance the local feature representation and handle the multi-scale lesion context, an atrous prediction module is designed as follows:


Here, denotes dilated convolution function with a dilation rate and filter size of .

is a sigmoid function. The enhanced feature maps

with various receptive fields are concatenated across channel-wise and projected into segmentation map space.

2.2 Boundary-Aware Transformer

Efforts to incorporate structural boundary information to CNNs have been made a lot these years, but there is little literature investigating the effectiveness on transformer. We argue that the equipment of boundary information can also let transformer obtain more power in addressing lesions with ambiguous boundary. To this end, we devise the boundary-aware transformer (BAT), in which a boundary-wise attention gate (BAG) is added at end of each transformer encoder layer to refine transformed feature. BAG’s architecture is similar to conventional spatial attention gate including (1) a key-patch map generator which takes the transformed feature as input and output a binary patch-wise attention map , where value indicates that the corresponding patch is at ambiguous boundary. (2) and a residual attention scheme for preserving boundary-wise information. Hence, the boundary-aware transformed feature can be re-written as:


where and denote element-wise addition and channel-wise multiplication, respectively.

In addition to BAGs in transformer encoder layers, a query embedding based BAG is applied after encoder to refine the feature

. It plays the same role in boundary-wise attention but comes true by a totally different way. Here, instead of learning the linear projection as classifier, we refer a learnable embedding

as context prototype for regions among ambiguous boundary. It will be compared with all patch embedding () after aforementioned blocks, to produce a similarity map . Those patches with high similarity will be the regions of ambiguous boundary. Similar to other BAGs, a residual attention scheme is also applied here as: .

By this design, BAT learns robust feature representation of ambiguous boundary in a variety of ways, which is of great significance to handle segmentation of lesions with ambiguous boundary. Following our basic design, feature is fed into atrous prediction module to produce the segmentation map .

Boundary-Supervised Generator. As the generator doesn’t necessarily know on its own which patches can best represent structural boundary of target lesion, we introduce a novel algorithm to produce ground-truth key-patch map to train the generator with full supervision. Besides the enhancement of boundary features, this design can also help in accelerating training transformer thanks to the auxiliary constraints.

Specifically, boundary points set is produced using conventional edge detection algorithm at first. For each point in this set, we draw a circle of radius (set to as default) and calculate the proportion of lesion area in this circle. Larger or smaller proportion indicates that boundary is not smooth in this cricle. Thus we score each point as , representing the assistance in segmenting ambiguous parts. Non-maximum suppression is then utilized to filter points with larger proportion than neighbour (set to as default) points. Next, filtered points’ 2D location is mapped into 1D location as , and patch labels at these location are set to and others are set to 0, leading to final ground-truth .

2.3 Objective Function

To train the segmentation network including the proposed BAGs, we emply two types of loss functions. The first one is a Dice loss function to minimize the difference between the groud-truth segmentation map and the predicted segmentation map as

. The second one is a Cross-Entropy loss to reduce the predicted key-patch map and its ground-truth as . Total loss is defined as:


where denotes the predicted key-patch map at -th transformer encoder layer. denote Dice loss function and Cross-Entropy loss function, respectively. denotes the number of transformer encoder layers and is set to 4 as default.

3 Experimental Results

3.1 Datasets

We conduct extensive experiments on the skin lesion segmentation datasets from International Symposium on Biomedical Imaging (ISBI) of the years 2016 and 2018. The datasets are collected from a variety of different treatment centers, archived by the International Skin Imaging Collaboration (ISIC), which hosted a challenge named skin lesion analysis toward melanoma detection to boost the performance of melanoma diagnosis. ISIC 2016 contains a total number of 900 samples for training and a total number of 379 dermoscopy images for testing. We follow up the same experimental protocols in the most recent work [9157193], in which we train our model on the training set of ISIC 2016 and extensively evaluate it on PH2 dataset. ISIC 2018 contains 2594 training samples in total and annotation of its public test set is missing, therefore we perform five-fold cross-validation on its training set for fair comparison.

3.2 Implementation Details

Our network is implemented on a single NVIDIA RTX 3090Ti. All images are empirically resized to

considering the efficiency, and we do data augmentation including vertical flip, horizontal flip and random scale change (limited 0.9-1.1). Each mini-batch includes 24 images and we utilize Adam with an initial learning rate of 0.001 to optimize the network. Learning rate decrease in half when loss on the validation set has not dropped by 10 epochs. The encoder of each network has been pre-trained on ImageNet and all parameters are then fine-tuned for 500 epochs in total.

Model ISIC 2016 + PH2 Model ISIC 2018
SSLS[ahn2015automated] DeepLabv3 [chen2017rethinking]
MSCA[bi2016automated] U-Net++ [zhou2018unetpp]
FCN [long2015fully] CE-Net [gu2019cenet]
Bi et al [7942129] MedT [valanarasu2021medical]
Lee et al [9157193] TransUNet [chen2021transunet]
Table 1: Experimental results on different datasets.
Figure 3: Visual comparison of lesion segmentation results produced by different methods.

3.3 Comparison with State-of-the-Arts

For baseline comparisons, we run experiments on both convolutional and transformer-based methods. With regard to the evaluation metrics, we employ a Dice coefficient (Dice), and a Intersection over Union (IoU). Table 

1 displayed the comparative study of our proposed boundary-aware transformer (BAT) with other methods on different datasets. It’s obviously shown that our model achieves the best segmentation performance.

On the ISIC 2016 + PH2 dataset, we compare our method with five state-of-the-art methods. Among them, Lee et al [9157193] is a 2D attention-based model with use of boundary-prior knowledge, achieving best segmentation performance on skin lesion segmentation recently. As seen in the Table 1, our BAT achieves 0.920 in Dice and 0.858 in IoU, outperform Lee et al by 0.2% and 1.5% in Dice and IoU, respectively. We extensively conduct experiments with other SOTA segmentation networks on the ISIC 2018 dataset, including three famous convolutional models for segmentation (DeepLabv3 [chen2017rethinking], UNet++ [zhou2018unetpp], CE-Net [gu2019cenet]) and two transformer-based network to address medical image segmentation (TransUNet [chen2021transunet], MedT [valanarasu2021medical]). Even compared with other state-of-the-art segmentation models, our BAT still achieves the consistent and significant improvement on both metrics. It’s noteworthy that transformer-based network has superior performance than conventional CNNs, indicating the effectiveness of utilizing global context to detect skin lesion. In addition, compared with TransUNet, our method leveraging the boundary-prior knowledge significantly improves the segmentation performance (1.8% on Dice and 2.1% on IoU), proving that the combination of boundary information and transformer architecture is indeed helpful to segment target lesion.

Fig. 3 visualizes five typical challenging cases of lesion segmentation results. It is observed that our results are closest to the ground truth, when compared with our competitors. The first three rows represent cases with various color, size and shape, and our BAT outperforms others with most stable segmentation performance, indicating the robust advancement of global context. The last two rows highlight some small regions of ambiguous boundary and it’s shown that our BAT is capable of tacking such problems, due to the use of boundary-wise prior knowledge.

3.4 Ablation Study

Trans. BAG ISIC 2016 + PH2 ISIC 2018
Table 2: Experimental results on different datasets.

We further conduct ablation studies to demonstrate the effectiveness of three major components in BAT: (1) the transformer-based self-attention mechanism (Trans.), (2) boundary-wise attention gate (BAG). As shown in the Table. 2, by the incorporation of self-attention mechanism, the IoU increases by a large margin on both datasets. This result indicate that it’s essential to integrate global context to improve the skin lesion detection. On the other hand, applying BAGs to guide transformer further improves the performance significantly, confirming the effectiveness of boundary-wise prior knowledge to tackling challenging cases, such as lesions with ambiguous boundary.

4 Conclusion

We present a novel and efficient context-aware network, namely boundary-aware transformer (BAT) network, for accurate segmentation of skin lesion from dermoscopy images. Extensive experiments on two public datasets confirm the effectiveness of our proposed BAT, to help yield much better segmentation results for skin lesions. Our full model outperforms state-of-the-art models by a large margin in segmentation accuracy and the intuitive visualization shows that our BAT has most satisfactory performance on skin lesions with ambiguous boundary.