Segmenting skin lesions from dermoscopy images is critical for diagnosis and treatment planning, but doing so manually is tedious, time-consuming, and error-prone. Automated segmentation methods are therefore in high demand in clinical practice to improve the accuracy and efficiency of the clinical workflow. The task remains very challenging because (1) skin lesions vary widely in size, shape, and color (see Fig. 1 (a-b)), (2) hair can partially cover the lesions and destroy the local context (see Fig. 1 (c-d)), and (3) the contrast between lesions and normal skin is sometimes low, resulting in ambiguous boundaries (see Fig. 1 (e-h)).
Many efforts have been dedicated to overcoming these challenges. Hand-crafted features were adopted in the early years, but they are usually neither stable nor robust, leading to poor segmentation performance when facing lesions with large variations [51, 49]. However, due to the lack of global context modeling, these models are still insufficient to counteract the large variation of skin lesions. To address this, researchers have proposed various approaches to enlarge the receptive field, inspired by the advancement of dilated convolution [47, 48]. Lee et al. further incorporate a dilated attention module with a boundary prior so that the network predicts boundary key-point maps to guide the attention module.
Nevertheless, the receptive field of convolution is inevitably limited, so these solutions are still incapable of capturing sufficient global context to deal with the challenges mentioned above. Recently, vision transformers have been proposed to regard an image as a sequence of patches and aggregate features in a global manner through self-attention mechanisms [5, 29, 42, 43]. Transformers have also been shown to handle medical image segmentation tasks, e.g., TransUNet and TransFuse. In the field of skin lesion segmentation, studies improve transformer-based networks with boundary information [40, 4], but they have not thoroughly explored the potential usefulness of boundary information and global context in a multi-scale manner. Furthermore, these transformers still contain convolutional modules that may decrease the performance owing to their inductive bias.
In this paper, we propose a novel cross-scale boundary-aware transformer (XBound-Former) to ably handle the problems mentioned above by holistically leveraging the advancement of boundary-wise prior knowledge and self-attention mechanism. This method is inspired by the intuition that human beings perceive lesions in vision, i.e., considering global context to coarsely locate lesion areas and paying particular attention to the ambiguous area to specify the exact boundary. Concretely, we enhance the boundary modeling ability of the transformer-based network via three key learners: implicit boundary learner (im-Bound), explicit boundary learner (ex-Bound), and cross-scale boundary learner (X-Bound).
Im-Bound is designed to implicitly explore local contexts for accurate boundary modeling. As points with large boundary variation contribute more to the segmentation result than other boundary points, we constrain the network attention to such boundary key points. This enhances local boundary modeling while maintaining the global context.
Ex-Bound is proposed to explicitly extract the boundary knowledge as multiple embeddings where each embedding represents the boundary knowledge at a unique scale. They are used to further enhance the local boundary modeling and boost the cross-scale communication.
X-Bound is a cross-scale attention mechanism that simultaneously addresses the problems of ambiguous boundaries and size variation. Like human beings, who determine accurate boundaries by zooming in and zooming out, we use the boundary embedding learned at one scale to guide the boundary-aware attention at the other scales, enhancing cross-scale knowledge communication.
We evaluate our model on two skin lesion datasets, ISIC-2016&PH2 and ISIC-2018, following the standard experimental setup [19, 40, 4]. To evaluate generalization, we perform an additional experiment on polyp lesions, which have similar characteristics. Our model achieves superior performance in all experiments compared to state-of-the-art CNN-based and transformer-based models, indicating its strength in segmenting objects with ambiguous boundaries, especially skin lesions.
2 Related Work
2.1 Skin Lesion Segmentation
In the early years, traditional methods applied various hand-crafted features to lesion segmentation, but these features are neither robust nor stable, which leads to poor segmentation performance when facing lesions with large variations. Later, the fully convolutional network (FCN) brought deep learning to skin lesion segmentation and achieved much better results. Several improved networks following this direction were proposed to address the imbalance between foreground and background pixels and to enhance the multi-scale feature representation. With the widespread use of attention mechanisms, channel and spatial attention-based methods have been applied to enhance lesion modeling [1, 12]. The performance indeed improves, but the ambiguous boundaries of skin lesions are still hard to recognize. To address this issue, adaptive dual attention modules have been proposed to let the network focus on lesion boundaries, but they fail to cope with blurry boundaries owing to poor use of boundary-aware prior knowledge. More recently, following the success of vision transformers, several studies employ transformer-based networks for skin lesion segmentation [40, 4]. These networks address the problem of large lesion variation by capturing the global context. However, they are still unable to handle ambiguous boundaries, especially those with size variation. In contrast, our proposed XBound-Former exploits multi-scale boundary information through self-attention blocks and utilizes boundary-aware prior knowledge to supervise the transformer training. It can thus outperform the state-of-the-art methods and the latest vision transformers.
2.2 Vision Transformers
The transformer, a standard model in natural language processing, has recently made great progress in the field of computer vision. The first vision transformer, ViT, splits an image into a certain number of patches and utilizes self-attention blocks to embed the features, achieving competitive performance in image classification tasks compared to the latest convolution-based neural networks. Later work introduces a series of strategies to increase the training efficiency and improve the accuracy on small datasets. Although transformers were originally proposed to explore global dependency, recent studies find that they also need local communication [21, 50, 43], which can be achieved through local window shifts or a pyramid architecture, especially for tasks requiring dense representations [6, 54, 21]. As for medical image segmentation, the effectiveness of vision transformers is verified by TransUNet and TransFuse. In the field of skin lesion segmentation, vision transformers also push the performance to new highs [40, 4]. Despite their success, these models have not considered the complementary nature of boundary knowledge and global context in a multi-scale manner, which may help segment extremely challenging lesions. XBound-Former aims to mitigate this issue through cross-scale boundary learners and, in addition, builds a pure attention-based network instead of a fusion of transformer and convolution to avoid the inductive bias.
2.3 Boundary-aware Prior Knowledge
The accurate recognition of ambiguous boundaries is one of the trickiest problems in medical image segmentation. Many works address this issue by taking full advantage of boundary-aware prior knowledge. The earliest works modify the loss function to give boundary-aware supervision for network optimization, e.g., HD loss and Boundary loss. Later, multi-task learning was applied in this direction, where manually designed tasks provide extra supervision on the boundaries [41, 26]. Apart from boundary-aware supervision, several networks utilize spatial attention mechanisms to enhance the representation of boundaries. By contrast, we not only introduce boundary-aware prior knowledge into vision transformers but also present a novel key-patch map generator that selects the most ambiguous points along the boundaries and converts them to a key-patch map to supervise the transformers.
An overview of the cross-scale boundary-aware transformer (XBound-Former) is presented in Fig. 2, where we show the details of how to leverage boundary prior knowledge and global dependency across different scales. It first utilizes a pyramid vision transformer to coarsely extract the features of an input dermoscopy skin image. As a pyramid feature extractor, the backbone yields features at four different scales, . Here, denotes the lowest feature with the largest scale and denotes the deepest feature with the smallest scale (). Each feature is enhanced through the in-scale and cross-scale boundary aggregation to strengthen the boundary representation. Finally, several linear classification heads are used to predict the segmentation maps.
3.1 In-scale Boundary Modeling
As an attention-based mechanism, transformers treat each image as a sequence of patches and explore the global dependency to represent them. The global view is helpful for vision tasks, but recent studies have shown that transformers also require local context modeling in dense-level vision tasks [21, 42]. For segmentation tasks, especially for skin lesions with ambiguous boundaries, global dependency helps locate boundaries coarsely but lacks the local context needed to segment them accurately. Therefore, we propose to fuse boundary information into the transformers to explore the local context of boundaries. It is achieved by using a sequence of implicit boundary learners (im-Bound) and explicit boundary learners (ex-Bound) to refine the feature at each scale as , where . The process is denoted as,
where we simplify the notation . As the in-scale boundary modeling module takes the sequential features instead of 2-D maps as inputs and outputs, we re-define the inputted features as . They are the encoded features after sequentialization and are added with position embeddings .
3.1.1 Implicitly Boundary-wise Attention
The im-Bound aims to constrain the model’s attention to the points with large boundary variation, as they contribute more to the final segmentation result. Motivated by this, we utilize the self-attention module to find such points by predicting a boundary key-point map. The map is used for feature refinement and for offering a boundary-aware constraint. Specifically, it contains cascaded blocks in total. At the -th block, given the input feature , where , we first feed it into a sequence of multi-head self-attention (MSA) and multi-layer perceptron (MLP) modules to gather the global dependency for coarsely locating the boundaries . After each part, there is a Layer Normalization with a residual shortcut connection for a stable training process . We denote this intermediate feature as,
where denotes the element-wise addition and denotes the MSA operation. As the self-attention modules embed the query, key, and value together from , we simplify the equation. The LayerNorm operation is also omitted to save space. Then, a linear predictor with Sigmoid activation classifies whether each patch is a point with large boundary variation, supervised by the boundary key-point map pre-produced by our boundary key-point map generation algorithm (see Sec. 3.3). We denote the predicted key-point map as so that we can obtain the enhanced feature as,
where denotes the element-wise multiplication. After cascaded blocks, the resulting feature is sent to ex-Bound for further refinement.
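The gating step described above can be illustrated with a minimal sketch in plain Python. It covers only the linear predictor with Sigmoid activation and the element-wise refinement, not the MSA and MLP modules. The residual form used in `enhance` (token plus token times score) is an assumption, and the function names and parameter values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_keypoint_map(tokens, weight, bias):
    """Linear predictor + Sigmoid: one boundary key-point score per patch token."""
    return [sigmoid(sum(w * t for w, t in zip(weight, tok)) + bias) for tok in tokens]

def enhance(tokens, keypoint_map):
    """Assumed residual gating: token + token * score, applied element-wise."""
    return [[t + t * m for t in tok] for tok, m in zip(tokens, keypoint_map)]

# Three patch tokens with two channels each (toy values).
tokens = [[1.0, -2.0], [0.5, 0.5], [3.0, 1.0]]
weight, bias = [1.0, 1.0], 0.0  # hypothetical predictor parameters

scores = predict_keypoint_map(tokens, weight, bias)
refined = enhance(tokens, scores)
```

Tokens with a higher predicted score are amplified more strongly, which is one simple way to bias later layers toward likely boundary key points while keeping the original feature through the residual path.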
3.1.2 Learn Explicit Boundary Embedding
The ex-Bound is proposed to explicitly embed boundary information into a set of feature vectors, where each embedding contains the high-level boundary semantics at a unique scale. This learner differs from the im-Bound in its implementation as well as in its motivation: it not only refines the features but also provides an explicit expression for subsequent cross-scale communication. To achieve this goal, we treat the boundary key points as query objects and employ a transformer decoder [6, 30] to learn the boundary embeddings. The decoder contains a sequence of a Masked MSA module, an MSA module, and an MLP module, each followed by a LayerNorm layer and a shortcut connection . Thanks to the global context modeling, it refines the randomly initialized input vector into a boundary embedding that contains abundant boundary knowledge. After that, we send the feature and the boundary embedding into the MSA module and the boundary key-point prediction part to refine the features and, more importantly, to obtain a more precise boundary embedding.
We repeat the ex-Bound times to guarantee adequate boundary learning. The -th block takes the feature and the current embedding as input and outputs the aggregated feature , the embedding , and the predicted key-point map . After blocks, the resulting feature is reshaped as and sent to the cross-scale boundary aggregation along with the learned boundary embedding.
3.2 Attention-based Cross-scale Boundary Fusion
Automatic skin lesion segmentation suffers from the significant variance in lesion size and ambiguous boundaries. We take the first attempt to address these two issues simultaneously through the attention-based mechanism, our cross-scale boundary learners (X-Bound). It is inspired by the human beings that determine the accurate boundaries by zooming in and zooming out boundaries and combining multi-perspective information across different scales to make the final decision.
Generally, we visualize the details in Fig. 3 where features and boundary embeddings at low scale () and high scale () are inputted and the enhanced feature at low scale () is outputted. denotes the size larger than . Theoretically, the boundary embedding at a lower scale focuses on more local details and the boundary embedding at a larger scale focuses more on the high-level semantics. Thus, utilizing the embedding at one scale to attentively refine the features at another scale provides complementary boundary knowledge.
In detail, we compare to each point in the lower feature and compute the distance matrix, which is then used to transfer boundary knowledge in to each point in the feature . It means that the intermediate features can be calculated as:
where is the multi-head attention module used in Equation 2. After that, the intermediate features are concatenated after the up-sample operation of , which is fed into a linear projection head to reduce the feature dimension and refine the fusion. The resulted feature is denoted as .
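The attention step of this cross-scale fusion can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it uses single-head scaled dot-product attention without learned projections, and it leaves out the up-sampling, concatenation, and linear projection. Feature tokens at one scale act as queries, and boundary embeddings from another scale act as keys and values. All names and toy values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity between one feature token and every boundary embedding.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(logits)
        # Transfer boundary knowledge: weighted sum of the embeddings.
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

# Feature tokens at one scale attend to boundary embeddings of another scale.
feature_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
boundary_embeddings = [[2.0, 0.0], [0.0, 2.0]]
refined = cross_attention(feature_tokens, boundary_embeddings, boundary_embeddings)
```

Each output token is a convex combination of the boundary embeddings, so a feature token that resembles one embedding receives mostly that embedding's boundary knowledge.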
In total, except for the deepest feature , we perform the cross-scale boundary learning on to obtain , and is directly set as . For multi-scale model learning, we feed each feature into a linear classification head to predict the segmentation maps .
3.3 Boundary Key-point Generation Algorithm
As the boundary learners do not naturally know which points best represent the ambiguous boundaries, we propose a novel generation algorithm to pre-produce a ground-truth key-point map that supervises the boundary learning, as shown in Fig. 4. The first step is to extract all points on the boundary using a conventional contour detection algorithm , which gives a set of coordinates of the boundary points. Then, as points with larger boundary deviation deserve more attention than those with smoother deviation, we filter the points by scoring the deviation. For each point in this set, we draw a circle of radius and calculate the proportion of the lesion area in this circle region, where a larger or smaller indicates that the boundary is not smooth in this circle region. Hence, we score each point as to represent its deviation. To find the most valuable points, non-maximum suppression is performed, in which the points with larger than their neighbor points are selected. Next, the 2D locations of the selected points are mapped into the binary key-point map , where points at the selected locations are set to one and all others are set to zero. By minimizing the error between and , the supervision helps the boundary learners focus on the ambiguous boundary regions and helps the boundary embeddings learn correct boundary knowledge.
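The steps above can be sketched in plain Python under several loudly stated assumptions. A simple 4-neighbor test stands in for the conventional contour detection algorithm. The deviation score is taken to be the absolute difference between the lesion proportion and 0.5. The suppression keeps a point only if its score is strictly larger than that of every other boundary point within `nms_radius`. The radius values and all function names are hypothetical.

```python
def boundary_points(mask):
    """Lesion pixels with at least one 4-neighbor outside the lesion
    (a simple stand-in for a contour detection algorithm)."""
    h, w = len(mask), len(mask[0])
    pts = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    pts.append((y, x))
                    break
    return pts

def lesion_proportion(mask, y, x, radius):
    """Fraction of lesion pixels inside the disc of given radius around (y, x)."""
    h, w = len(mask), len(mask[0])
    inside = total = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy * dy + dx * dx > radius * radius:
                continue
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                total += 1
                inside += mask[ny][nx]
    return inside / total

def keypoint_map(mask, radius=5, nms_radius=3):
    pts = boundary_points(mask)
    # Assumed score: distance of the lesion proportion from 0.5.
    score = {(y, x): abs(lesion_proportion(mask, y, x, radius) - 0.5) for y, x in pts}
    selected = []
    for p in pts:
        rivals = [q for q in pts if q != p
                  and (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2 <= nms_radius ** 2]
        # Non-maximum suppression with a strict comparison (ties are dropped).
        if all(score[p] > score[q] for q in rivals):
            selected.append(p)
    kp = [[0] * len(mask[0]) for _ in mask]
    for y, x in selected:
        kp[y][x] = 1
    return kp, selected

# Toy example: a 16x16 square lesion inside a 32x32 image.
mask = [[1 if 8 <= y <= 23 and 8 <= x <= 23 else 0 for x in range(32)]
        for y in range(32)]
kp_map, keypoints = keypoint_map(mask)
```

On this toy square the sketch keeps only the four corners, which are the points where the boundary direction changes most, while points along the straight edges are suppressed.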
3.4 Objective Function
We design a joint objective to train the entire network, including the lesion segmentation loss for predicted segmentation maps and the key-point map loss for predicted boundary key-point maps, as
where denote Dice loss function and Cross Entropy function and are the ground-truth segmentation and boundary key-point maps pre-produced. is the weight to balance the two objectives. The detailed calculation is described as,
For deeply multi-scale supervision, given the original segmentation label, , we repeat the down-sample operation with different rates to obtain the set of ground-truth segmentation maps , where . For the key-point maps, we also repeat the down-sample operation and obtain where .
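A minimal sketch of such a joint objective, in plain Python on flattened probability maps, is given below. The exact combination is an assumption: here the segmentation term adds Dice loss and binary cross entropy at every scale, the key-point term uses binary cross entropy only, and the weight `lam` and all function names are hypothetical.

```python
import math

def dice_loss(pred, target, eps=1e-6):
    """1 - Dice overlap between a soft prediction and a binary target."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross entropy, with predictions clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)

def joint_loss(seg_preds, seg_gts, kp_preds, kp_gts, lam=0.1):
    """Sum over scales of the segmentation loss plus lam times the key-point loss."""
    seg = sum(dice_loss(p, g) + bce_loss(p, g) for p, g in zip(seg_preds, seg_gts))
    kp = sum(bce_loss(p, g) for p, g in zip(kp_preds, kp_gts))
    return seg + lam * kp

# One scale, four pixels (toy values).
gt = [1.0, 0.0, 1.0, 0.0]
good = [0.9, 0.1, 0.8, 0.2]
bad = [0.1, 0.9, 0.2, 0.8]
loss_good = joint_loss([good], [gt], [good], [gt])
loss_bad = joint_loss([bad], [gt], [bad], [gt])
```

Each list passed to `joint_loss` holds one flattened map per scale, so deep multi-scale supervision amounts to passing the down-sampled labels alongside the corresponding predictions.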
Following the classical experimental setting in previous studies , we evaluate our model on two skin lesion segmentation datasets, ISIC-2016&PH2 and ISIC-2018. To further evaluate the model generalization, we evaluate it on polyp lesion segmentation using five public polyp image datasets, collectively named Polyp-seg.
The ISIC-2016&PH2 dataset contains samples from two centers to evaluate the accuracy and generalization ability of skin lesion segmentation. One is the ISIC-2016 dataset, which contains 900 samples for training and 379 samples for validation. The other is the PH2 dataset , containing 200 samples in total. Here, we use the samples in the ISIC-2016 dataset for model learning through the official train-validation split and test the model on the 200 samples from the PH2 dataset.
The ISIC-2018 dataset was also collected by ISIC in 2018, which contains 2594 images and labels. The resolution of each image varies from to . As the public test set has not been released, we perform a 5-fold cross-validation for a fair comparison.
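A 5-fold split of the 2594 images can be sketched as below. The shuffling seed and the way indices are dealt into folds are assumptions made for illustration, and `k_fold_indices` is a hypothetical helper name.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k nearly equal folds.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

splits = list(k_fold_indices(2594))
```

Every image appears in exactly one validation fold, so the scores averaged over the five folds cover the whole dataset.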
The Polyp-seg dataset is collected following the most popular setting , and contains five public datasets: Kvasir-SEG , ClinicDB , ColonDB , Endoscene , and ETIS . Kvasir-SEG and ClinicDB contain 1000 and 612 samples, respectively, of which 900 and 548 samples are used for training and the remaining samples are used for testing. To evaluate the generalization ability, samples from the remaining three datasets are also used for testing.
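|Method|validation-ISIC-2016 IoU|validation-ISIC-2016 Dice|validation-ISIC-2016 ASSD|validation-ISIC-2016 HD95|test-PH2 IoU|test-PH2 Dice|test-PH2 ASSD|test-PH2 HD95|
|Attention U-Net|79.70|87.43|16.41|48.78|69.52|80.52|26.73|74.51|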
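|Method|Overall IoU|Overall ASSD|Fold 1 IoU|Fold 1 ASSD|Fold 2 IoU|Fold 2 ASSD|Fold 3 IoU|Fold 3 ASSD|Fold 4 IoU|Fold 4 ASSD|Fold 5 IoU|Fold 5 ASSD|
|Attention U-Net|75.94|17.24|78.01|15.13|77.31|16.04|73.78|19.16|75.58|17.15|75.01|18.73|
Table 2: Comparison of different skin lesion segmentation approaches with 5-fold cross-validation on the ISIC-2018 dataset. We present the averaged result and the standard error of all folds.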
4.2 Evaluation Metrics
We employ four widely used metrics to quantitatively evaluate skin lesion segmentation performance: the Dice coefficient, the IoU score, the average symmetric surface distance (ASSD), and the Hausdorff distance of boundaries (95th percentile; HD95). Generally, a better segmentation has higher area-based metrics (Dice and IoU) and lower boundary-based metrics (ASSD and HD95).
The area-based similarity of predicted segmentation map and the ground-truth are computed as:
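In their standard form, which is assumed here, the two area-based metrics can be written as follows, where $P$ denotes the predicted lesion region and $G$ the ground-truth region (both symbols are introduced only for illustration):

```latex
\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|},
\qquad
\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}
```

For a single sample the two are related by $\mathrm{Dice} = 2\,\mathrm{IoU} / (1 + \mathrm{IoU})$, so the Dice coefficient is never smaller than the IoU score.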
To better evaluate the segmentation performance of boundaries, we employ another two boundary-based metrics, as
where and denote the predicted boundary points and the ground-truth boundary points in the and , and denotes the minimum Euclidean distance function. Moreover, denotes the one-way hausdorff distance from to , and refers to the calculation of the percentile of the distances.
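A minimal sketch of the two boundary-based metrics on lists of boundary coordinates is shown below. It follows the standard definitions of ASSD and HD95, which are assumed to match the ones intended here. The nearest-rank percentile is one of several common conventions and is also an assumption, and all function names are hypothetical.

```python
import math

def min_dist(p, points):
    """Minimum Euclidean distance from point p to a set of points."""
    return min(math.dist(p, q) for q in points)

def percentile(values, q):
    """Nearest-rank percentile (one of several common conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

def assd(a, b):
    """Average symmetric surface distance between two boundary point sets."""
    d_ab = [min_dist(p, b) for p in a]
    d_ba = [min_dist(p, a) for p in b]
    return (sum(d_ab) + sum(d_ba)) / (len(d_ab) + len(d_ba))

def hd95(a, b):
    """Maximum of the two one-way 95th-percentile Hausdorff distances."""
    d_ab = [min_dist(p, b) for p in a]
    d_ba = [min_dist(p, a) for p in b]
    return max(percentile(d_ab, 95), percentile(d_ba, 95))

# Two parallel three-point boundaries, one pixel apart (toy values).
pred_boundary = [(0, 0), (0, 1), (0, 2)]
gt_boundary = [(1, 0), (1, 1), (1, 2)]
assd_value = assd(pred_boundary, gt_boundary)
hd95_value = hd95(pred_boundary, gt_boundary)
```

Using the 95th percentile instead of the maximum makes the Hausdorff distance less sensitive to a few outlier boundary points.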
As for the polyp segmentation, we adopt the same metrics as the latest work, Polyp-PVT , including the area-based metric, IoU, and the boundary-based metric, .
4.3 Implementation Details
All methods are implemented in PyTorch and run on a single NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. We empirically resize all images to a fixed resolution for computational efficiency. A series of data augmentations are applied to increase the data diversity, including vertical flip, horizontal flip, and random scale change (limited to 0.9-1.1). Each mini-batch includes eight images, and the AdamW optimizer with an initial learning rate of 0.0003 is used to optimize the parameters. We train the network for 200 epochs and save the model parameters with the best validation performance. We adopt the pyramid vision transformer, PVTv2, as the backbone and pre-train it on the ImageNet dataset. As for the hyper-parameters, we set the numbers of im-Bound and ex-Bound learners to 2 by default and discuss this choice in Section 4.6.2. In the boundary key-point generation algorithm, considering the image size and lesion size, we set and by default.
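The stated settings can be collected into a small configuration sketch. Only values given in the text are filled in. The field names, the flip probability, and the `sample_augmentation` helper are hypothetical and serve only as an illustration.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 8
    learning_rate: float = 3e-4
    optimizer: str = "AdamW"
    epochs: int = 200
    backbone: str = "PVTv2"          # pre-trained on ImageNet
    num_im_bound: int = 2
    num_ex_bound: int = 2
    scale_range: tuple = (0.9, 1.1)
    flip_probability: float = 0.5    # assumption: not stated in the text

def sample_augmentation(cfg, rng):
    """Draw one random augmentation setting: two flips and a scale factor."""
    return {
        "vertical_flip": rng.random() < cfg.flip_probability,
        "horizontal_flip": rng.random() < cfg.flip_probability,
        "scale": rng.uniform(*cfg.scale_range),
    }

cfg = TrainConfig()
aug = sample_augmentation(cfg, random.Random(0))
```

Keeping the settings in one immutable object makes it easier to run the compared models under the same training setup.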
4.4 Comparisons with state-of-the-art Methods
4.4.1 Quantitative results for skin lesion segmentation
We compare our model to several popular segmentation models, including the CNN-based models, U-Net , U-Net++ , Attention U-Net , DeepLabV3+ , CE-Net , CA-Net , and the transformer-based models, TransFuse  and TransUNet . All models are trained under the same setting as our model.
For the ISIC-2016&PH2 dataset, our model achieves the best performance on both the validation set and the test set. Since the samples from the PH2 dataset are unseen during model learning, this superior performance indicates a satisfactory generalization ability, which we attribute to the learning of boundaries, a general feature across different distributions. In comparison, TransFuse generalizes poorly to the test set and TransUNet has poor segmentation accuracy on the validation set. Furthermore, our model has clearly lower ASSD and HD95 on both sets, demonstrating a promising advantage in handling boundary segmentation.
To extensively evaluate the models, we perform the 5-fold cross-validation in the ISIC-2018 dataset and show the evaluated scores of each fold as well as the overall scores in Table 2. The results illustrate that our model achieves the highest IoU score and the shortest ASSD distance on all sets. In addition to this, although the improvement on the IoU score is not as large as that on the ISIC-2016&PH dataset, the ASSD score has decreased a lot compared to the other models. It means that our model has superior accuracy in reducing the false positives away from the boundaries and detecting the ambiguous boundaries that are ignored by other models.
4.4.2 Visualized Comparison for Skin Lesion Segmentation
We visualize the predictions for some representative images in Fig. 5, including lesions with hair occlusion, various sizes, and ambiguous boundaries. The first row shows that our model detects the lesion covered by hair with the highest accuracy. The second and third rows show that our model consistently yields stable and the best predictions on the smallest and largest lesions. For all rows, particularly the last two rows where the lesions appear extremely similar to the neighboring tissue, our model is still able to give accurate segmentation.
4.4.3 Extensive Evaluation for Polyp Segmentation
We show the comparison results in Table 3, where the overall scores and the scores on each dataset are presented. We highlight the best score in bold, and our model achieves the best scores on nearly all metrics. For overall performance, compared to the latest model, Polyp-PVT, which also uses PVTv2 as the backbone, our model yields a clear performance improvement, i.e., 1.5% on the IoU score and 2.7% on the boundary-based score. As the latter reflects the ability of accurate boundary segmentation, the result indicates that our boundary learners are genuinely able to enhance the determination of boundary points. The results on each dataset also support this conclusion, especially for the ETIS dataset. Samples from the ETIS dataset are more challenging to segment, leading to relatively poorer performance in all experiments. On such a difficult dataset, our model has a 4.5% improvement on the IoU score and a 7.0% improvement on the boundary-based score, indicating its superior ability to handle challenging boundaries.
4.5 Analytical Ablation Study
We conduct extensive ablation experiments on the ISIC-2016&PH2 dataset to demonstrate the effectiveness of the three boundary learners in our proposed method. For the baseline, we remove the learners from XBound-Former and keep the same linear prediction and up-sampling fusion as U-Net. Then, we add the im-Bound learners, ex-Bound learners, and X-Bound learners step by step and obtain three models: imBound-Former, exBound-Former, and XBound-Former.
4.5.1 Quantitative Analysis
The results of the ablation experiment are shown in Fig. 6(a) using bar plots, and the evaluated IoU scores are highlighted in red. Compared to the baseline model, imBound-Former improves the IoU on both the validation set and the test set, verifying its benefits in boosting segmentation accuracy and generalization. In addition, exBound-Former gains a further improvement on the validation set and a slight improvement on the test set. Since this module mainly aims to learn explicit boundary embeddings, which are essential for the X-Bound learners, its limited standalone improvement is not a concern. The complete version, XBound-Former, shows a clear and consistent improvement on the validation and test sets, verifying the usefulness of our attention-based cross-scale boundary fusion.
4.5.2 Visual Comparison on Lesion Boundaries
We also visually analyze the effectiveness of each component in Fig. 6. As it shows, the baseline model lacks sufficient ability to address lesions with ambiguous boundaries as there are a lot of false positives. This issue has decreased significantly in the predictions of imBound-Former and exBound-Former, while the determination is still not accurate enough. By combining the multi-scale boundary knowledge, XBound-Former achieves the best performance on the small lesion (the second row) or the large lesion (the third row).
4.6 Detailed Analysis of Bound Learners
4.6.1 Boundary Supervision
As shown in Equation 6, we utilize the factor () to balance the segmentation map loss and boundary key-point map loss. The smaller may fail to provide strong enough supervision, while the larger may sometimes bring the noise to the model learning. Hence, we have a discussion about how it affects the final segmentation performance. The results are shown in Fig. 7, where is set to and all models adopt the same architecture as XBound-Former. As the plot shows in Fig. 7(a), the evaluated scores increase on both sets when enlarging the from to . However, they decrease when the reaches . It verifies the assumption that the small limits the improvement and the large one will harm the segmentation training. We additionally visualize the predicted segmentation map along with the point map in Fig. 7(b). As it shows, the model without boundary supervision is still able to predict coarse lesion regions for spatial attention while it lacks the ability to recognize the most challenging regions of the boundaries. In comparison, our predicted point map concentrates on the ambiguous boundaries so that it can boost the challenging lesion’s segmentation.
4.6.2 Statistics of the Efficiency
We set to control the number of im-Bound and ex-Bound learners. Enlarging them leads to more computation, while too few learners may not be able to learn the correct boundary knowledge. Fig. 8 shows the evaluated IoU scores and inference time of the models with different . For the validation set, the evaluated IoU score increases clearly with more boundary learners, and the score changes little when . The IoU score also increases with increasing to but then drops with . The underlying reason may be that more learners make model optimization harder. Considering efficiency, accuracy, and generalization ability, we take as our final setting.
Skin lesion segmentation plays a vital role in the quantitative analysis of skin cancers, i.e., lesion size and shape analysis. Existing studies adopt attention-based networks to catch global context, and boundary-aware supervision is proved to be effective for object segmentation in other fields. In this work, we exploit the complementary advantage of global context and boundary knowledge at multi-scale, proposing a cross-scale boundary-aware transformer, XBound-Former, for precise segmentation of skin lesions with ambiguous boundaries. The main contribution is our three boundary learners to explore in-scale and cross-scale boundary knowledge. The experiment is conducted on two skin lesion datasets and an external polyp lesion dataset. The results have shown that our model has the best segmentation performance, especially in the determination of challenging boundaries. The generalization ability on unseen images and different tasks has also been verified.
In the medical field, targets usually have ambiguous boundaries that are hard to determine, even for human beings. The challenges mainly come from the limitations of imaging techniques and may be solved in the future by more advanced imaging techniques. At present, however, segmenting these challenging objects is of great significance for the diagnosis, quality control, and treatment planning of patients. Therefore, we thoroughly investigate and aim to solve the challenges in skin lesion segmentation and preliminarily discuss the potential use on other targets with similar characteristics.
How to fuse boundary information into the segmentation tasks is one of the most well-known topics in object segmentation. It can be achieved through designing boundary-aware loss objectives like HD loss. Recent studies show that it is more effective to transfer the boundary loss as boundary key-point map loss. In addition to the supervision, the predicted boundary key-point map can also be used as the spatial attention map. Following this direction, we propose XBound-Former, which takes the complementary usage of the attention-based network and boundary supervision. Based on this theory, we further explore the potential help in exploring cross-scale boundary knowledge. All our proposals are proved to be effective in our ablation experiment and the detailed discussion.
Our model still has some limitations, and addressing them would further improve the segmentation. First, in some extremely challenging images, the boundary key points still cannot be detected clearly. False point detections may bring harmful guidance to the lesion segmentation branch. Although the two branches are complementary in most cases, we should consider the potential harm in some noisy cases. Second, boundary key-point detection is a different task that requires unique representations compared to lesion segmentation. In future work, utilizing different models for the two branches instead of sharing the same architecture may help guarantee the accuracy of both.
We present a novel cross-scale boundary-aware transformer for skin lesion segmentation, which can be extended to similar targets that have ambiguous boundaries. We perform comparison experiments on two skin lesion datasets to verify the segmentation accuracy and the generalization ability. The additional experiment on polyp segmentation also indicates the feasibility of our method on more tasks. The detailed ablation study shows that the improvement comes from our implicit, explicit, and cross-scale boundary modeling. We also find that our model still fails on some extremely low-contrast lesions, which may be solved by fusing a deep learning-based model with a low-level feature extractor.
This work is supported by the Ministry of Science and Technology of the People’s Republic of China under grants No. 2021ZD0201900 and No. 2021ZD0201904.
-  (2020) Attention deeplabv3+: multi-level context attention mechanism for skin lesion segmentation. In European Conference on Computer Vision, pp. 251–266. Cited by: §2.1.
-  (2015) WM-dova maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics 43, pp. 99–111. Cited by: 3rd item.
-  (2021) Polyp-pvt: polyp segmentation with pyramid vision transformers. arXiv preprint arXiv:2108.06932v3. Cited by: §4.2, §4.4.3, Table 3.
-  (2022) ICL-net: global and local inter-pixel correlations learning network for skin lesion segmentation. IEEE Journal of Biomedical and Health Informatics. Cited by: §1, §1, §2.1, §2.2.
-  (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: §1.
-  (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: §2.2, §3.1.2.
-  (2021) TransUNet: transformers make strong encoders for medical image segmentation. Cited by: §1, §2.2, §4.4.1, Table 1, Table 2.
-  (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: §4.4.1, Table 1, Table 2.
-  (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §2.2, §3.1.1.
-  (2020) PraNet: parallel reverse attention network for polyp segmentation. MICCAI. Cited by: 3rd item, §4.4.3, Table 3.
-  (2017) Convolutional sequence to sequence learning. In International Conference on Machine Learning, pp. 1243–1252. Cited by: §3.1.
-  (2020) CA-net: comprehensive attention convolutional neural networks for explainable medical image segmentation. IEEE transactions on medical imaging 40 (2), pp. 699–711. Cited by: §2.1, §4.4.1, Table 1, Table 2.
-  (2019) Ce-net: context encoder network for 2d medical image segmentation. IEEE transactions on medical imaging 38 (10), pp. 2281–2292. Cited by: §4.4.1, Table 1, Table 2.
-  (2016) Skin lesion analysis toward melanoma detection: a challenge at the international symposium on biomedical imaging (isbi) 2016, hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1605.01397. Cited by: Table 1.
-  (2021) Hardnet-mseg: a simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 fps. arXiv preprint arXiv:2101.07172. Cited by: §4.4.3, Table 3.
-  (2020) Kvasir-seg: a segmented polyp dataset. In International Conference on Multimedia Modeling, pp. 451–462. Cited by: 3rd item.
-  (2019) Reducing the hausdorff distance in medical image segmentation with convolutional neural networks. IEEE Transactions on medical imaging 39 (2), pp. 499–513. Cited by: §2.3.
-  (2019) Boundary loss for highly unbalanced segmentation. In International conference on medical imaging with deep learning, pp. 285–296. Cited by: §2.3.
-  (2020) Structure boundary preserving segmentation for medical image with ambiguous boundary. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4816–4825. Cited by: §1, §1, §4.1.
-  (2018) Dense deconvolutional network for skin lesion segmentation. IEEE journal of biomedical and health informatics 23 (2), pp. 527–537. Cited by: §2.1.
-  (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030. Cited by: §2.2, §3.1.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §2.1.
-  (2018) Fixing weight decay regularization in adam. Cited by: §4.3.
-  (2020) Cancer statistics, 2020: report from national cancer registry programme, india. JCO Global Oncology 6, pp. 1063–1075. Cited by: §1.
-  (2013) PH 2-a dermoscopic image database for research and benchmarking. In 2013 35th annual international conference of the IEEE engineering in medicine and biology society (EMBC), pp. 5437–5440. Cited by: 1st item, Table 1.
-  (2021) Graph-based region and boundary aggregation for biomedical image segmentation. IEEE transactions on medical imaging. Cited by: §2.3.
-  (2018) Attention u-net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999. Cited by: §4.4.1, Table 1, Table 2.
-  (2021) Enhanced u-net: a feature enhancement network for polyp segmentation. In 2021 18th Conference on Robots and Vision (CRV), pp. 181–188. Cited by: §4.4.3, Table 3.
-  (2020) Attention-based transformers for instance segmentation of cells in microstructures. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 700–707. Cited by: §1.
-  (2020) Attention-based transformers for instance segmentation of cells in microstructures. Cited by: §3.1.2.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §4.4.1, Table 1, Table 2.
-  (2021) Cancer statistics, 2021. CA: a cancer journal for clinicians 71 (1), pp. 7–33. Cited by: §1.
-  (2014) Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer. International journal of computer assisted radiology and surgery 9 (2), pp. 283–293. Cited by: 3rd item.
-  (1985) Topological structural analysis of digitized binary images by border following. Computer vision, graphics, and image processing 30 (1), pp. 32–46. Cited by: §3.3.
-  (2015) Automated polyp detection in colonoscopy videos using shape and context information. IEEE transactions on medical imaging 35 (2), pp. 630–644. Cited by: 3rd item.
-  (2021) Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. Cited by: §2.2.
-  (2010) Auto-context and its application to high-level vision tasks and 3d brain image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (10), pp. 1744–1757. Cited by: §1, §2.1.
-  (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.2, §3.1.1.
-  (2017) A benchmark for endoluminal scene segmentation of colonoscopy images. Journal of healthcare engineering 2017. Cited by: 3rd item.
-  (2021) Boundary-aware transformers for skin lesion segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 206–216. Cited by: §1, §1, §2.1, §2.2.
-  (2019) CT male pelvic organ segmentation using fully convolutional networks with boundary sensitive representation. Medical image analysis 54, pp. 168–178. Cited by: §2.3.
-  (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578. Cited by: §1, §3.1, §3.
-  (2022) Pvtv2: improved baselines with pyramid vision transformer. Computational Visual Media 8 (3), pp. 1–10. Cited by: §1, §2.2, §4.3.
-  (2021) Shallow attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 699–708. Cited by: §4.4.3, Table 3.
-  (2020) Automated skin lesion segmentation via an adaptive dual attention module. IEEE Transactions on Medical Imaging 40 (1), pp. 357–370. Cited by: §2.1, §2.3.
-  (2022) Duplex contextual relation network for polyp segmentation. In 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), pp. 1–5. Cited by: §4.4.3, Table 3.
-  (2017) Dilated residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 472–480. Cited by: §1.
-  (2016) Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, Y. Bengio and Y. LeCun (Eds.). Cited by: §1.
-  (2017) Automated melanoma recognition in dermoscopy images via very deep residual networks. IEEE Transactions on Medical Imaging 36 (4), pp. 994–1004. Cited by: §1.
-  (2021) Tokens-to-token vit: training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986. Cited by: §2.2.
-  (2017) Automatic skin lesion segmentation using deep fully convolutional networks with jaccard distance. IEEE transactions on medical imaging 36 (9), pp. 1876–1886. Cited by: §1, §2.1.
-  (2020) Adaptive context selection for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 253–262. Cited by: §4.4.3, Table 3.
-  (2021) TransFuse: fusing transformers and cnns for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, M. de Bruijne, P. C. Cattin, S. Cotin, N. Padoy, S. Speidel, Y. Zheng, and C. Essert (Eds.), Lecture Notes in Computer Science, Vol. 12901, pp. 14–24. Cited by: §1, §2.2, §4.4.1, Table 1, Table 2.
-  (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6881–6890. Cited by: §2.2.
-  (2018) Unet++: a nested u-net architecture for medical image segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 3–11. Cited by: §4.4.1, Table 1, Table 2.