I Introduction
Colorectal cancer (CRC) is the third most common cancer in men and the second most common in women globally, according to the World Health Organization GLOBOCAN database [16]. Approximately 70% of cases occur in the colon and the remainder occur in the rectum [15]. Because colonoscopy is the primary screening modality and the most prevalent diagnostic technique in gastrointestinal endoscopy, quality assurance is critical [14]. Detecting cancer at an early, curable stage and removing precancerous adenomas or serrated lesions during colonoscopy are key to colon cancer diagnosis and are associated with a reduction in mortality [11, 24].
Colonoscopy is an expensive, resource-demanding, and unpleasant procedure, and many patients are unwilling to participate in CRC screening programs repeatedly. According to a recent meta-analysis, up to 26% of colonoscopies may miss lesions and adenomas [26]. This is because colonoscopy is an operator-driven procedure that depends solely on the clinical acumen and skill of the endoscopist. With current colonoscopy equipment, less experienced endoscopists cannot distinguish all neoplastic from non-neoplastic polyps during routine colonoscopy examination [21, 2]. An automatic, algorithm-based, real-time diagnosis of polyps (irrespective of their morphology) during colonoscopy could help endoscopists identify potential polyps for removal with improved efficiency and accuracy. It could also reduce the access barrier to pathological services [22].

Recently, there has been great interest in deep learning for CRC screening. Various studies have aimed to develop CADx models for automatic polyp segmentation [18, 6, 8, 4, 19, 7, 27]. Among recent deep learning architectures, Transformer [20] and U-Net [13] based architectures have attracted the most attention. Various extensions of U-Net have been proposed in the literature [27, 10, 8] for automatic polyp segmentation. Despite the good results reported by these studies, more research is needed on decent-sized polyp datasets to demonstrate the effectiveness of the proposed methods for automatic polyp segmentation. One promising extension of U-Net is ResU-Net [25]. Its architecture is built on residual units that use identity mapping (shortcut connections). The residual unit eases the training of deep neural networks, whereas the identity mapping facilitates better gradient flow. Similarly, Yu et al. [23] presented an efficient context module based on dilated convolutions that aggregates multi-scale contextual information, which helps to improve the accuracy of semantic segmentation networks.

Table I: Quantitative results on the Kvasir-SEG dataset [9].
Method | Backbone | DSC | mIoU | Recall | Precision | Accuracy | F2 | FPS |
---|---|---|---|---|---|---|---|---|
U-Net [13] | - | 0.8264 | 0.7472 | 0.8504 | 0.8703 | 0.9510 | 0.8353 | 156.83 |
ResU-Net [25] | - | 0.7642 | 0.6634 | 0.8025 | 0.8200 | 0.9341 | 0.7740 | 196.85 |
U-Net++ [27] | - | 0.8228 | 0.7419 | 0.8437 | 0.8607 | 0.9491 | 0.8295 | 126.14 |
ResU-Net++ [10] | - | 0.6453 | 0.5341 | 0.6964 | 0.7080 | 0.9044 | 0.6575 | 57.99 |
ColonSegNet [7] | - | 0.7920 | 0.6980 | 0.8193 | 0.8432 | 0.9415 | 0.7999 | 129.04 |
HarDNet-MSEG [6] | - | 0.8260 | 0.7459 | 0.8485 | 0.8652 | 0.9492 | 0.8358 | 42.00 |
DeepLabV3+ [1] | ResNet50 | 0.8837 | 0.8173 | 0.9014 | 0.9028 | 0.9679 | 0.8904 | 102.62 |
DDANet [17] | - | 0.7415 | 0.6448 | 0.7953 | 0.7670 | 0.9326 | 0.7640 | 88.70 |
TransResU-Net (Ours) | ResNet50 | 0.8884 | 0.8214 | 0.9106 | 0.9022 | 0.9651 | 0.8971 | 48.61 |
Inspired by the successes of Transformers [3], residual units [5], and dilated convolution [23], we develop a novel deep learning-based architecture, TransResU-Net. We evaluate TransResU-Net on two decent-sized, publicly available polyp datasets to confirm whether the proposed method can detect early signs of CRC with high performance and at real-time speed. The main contribution of our work can be summarized as follows:
- We propose a novel deep segmentation architecture called TransResU-Net, which combines the strengths of the transformer block and dilated convolution layers with a pre-trained ResNet50, a combination that has not been explored before.
II Method
Figure 1 shows the block diagram of the proposed TransResU-Net. The architecture follows an encoder-decoder scheme, with a pre-trained ResNet50 as the encoder and four decoder blocks. The input image is fed to the pre-trained encoder, which consists of multiple bottleneck residual blocks along with pooling layers that transform the input image into a spatially reduced feature representation. The output of the pre-trained encoder is then passed to both a transformer encoder block and a dilated convolution block. The transformer encoder block [20] consists of a self-attention network followed by a feed-forward neural network, which helps TransResU-Net learn a more robust representation. Meanwhile, the dilated convolution block enlarges the receptive field of the convolution filters and thus enhances the effective capacity of the network.
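To make the receptive-field argument concrete, the short sketch below computes how far a single dilated kernel reaches using the standard relation k_eff = k + (k - 1)(d - 1). The 3x3 kernel and the dilation rates used here are hypothetical values chosen only for illustration; they are not taken from the TransResU-Net configuration.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Return the input span, in pixels, covered by one dilated kernel."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive integers")
    # A dilation of d inserts (d - 1) gaps between neighbouring kernel taps.
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Hypothetical 3x3 kernel and dilation rates, chosen only for illustration.
ILLUSTRATIVE_RATES = (1, 2, 4, 8)
spans = {rate: effective_kernel_size(3, rate) for rate in ILLUSTRATIVE_RATES}
print(spans)  # {1: 3, 2: 5, 4: 9, 8: 17}
```

The number of weights per kernel stays the same for every rate, so running several rates in parallel widens the context seen by the block without adding parameters per layer.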
The dilated convolution block consists of four parallel convolution layers, each with its own dilation rate. Each of these layers is followed by batch normalization and a ReLU activation function. Next, we concatenate the features from all four layers and pass them through a convolution layer to reduce the number of feature channels. The outputs of the transformer encoder block and the dilated convolution block are concatenated and passed to the first decoder block. The decoder block begins with bilinear upsampling, followed by concatenation with the skip connection from the encoder block. These skip connections carry feature maps directly from the encoder to the decoder block, which is important because some features are lost due to the depth of the network. They also improve the flow of gradients during backpropagation and thus help to improve the overall performance of the network. The concatenated feature maps are then passed through two residual blocks, which consist of two convolution layers and an identity mapping. Subsequently, the output of the first decoder block is passed to the second decoder block, and so on. In this way, the feature maps are progressively transformed into more meaningful semantic features. The output of the last decoder block is passed through a convolution layer followed by a sigmoid activation function, which generates a binary segmentation mask.

Table II: Quantitative results on the BKAI-IGH dataset [12].
Method | Backbone | DSC | mIoU | Recall | Precision | Accuracy | F2 | FPS |
---|---|---|---|---|---|---|---|---|
U-Net [13] | - | 0.8286 | 0.7599 | 0.8295 | 0.8999 | 0.9903 | 0.8264 | 160.27 |
ResU-Net [25] | - | 0.7433 | 0.6580 | 0.7447 | 0.8711 | 0.9843 | 0.7387 | 197.94 |
U-Net++ [27] | - | 0.8275 | 0.7563 | 0.8388 | 0.8942 | 0.9895 | 0.8308 | 123.45 |
ResU-Net++ [10] | - | 0.7130 | 0.6280 | 0.7240 | 0.8578 | 0.9832 | 0.7132 | 55.86 |
ColonSegNet [7] | - | 0.7748 | 0.6881 | 0.7852 | 0.8711 | 0.9843 | 0.7746 | 122.42 |
HarDNet-MSEG [6] | - | 0.7627 | 0.6734 | 0.7532 | 0.8344 | 0.9863 | 0.7528 | 41.20 |
DeepLabV3+ [1] | ResNet50 | 0.8937 | 0.8314 | 0.8870 | 0.9333 | 0.9937 | 0.8882 | 99.16 |
DDANet [17] | - | 0.7269 | 0.6507 | 0.7454 | 0.7575 | 0.9851 | 0.7335 | 86.46 |
TransResU-Net (Ours) | ResNet50 | 0.9154 | 0.8568 | 0.9142 | 0.9299 | 0.9938 | 0.9129 | 42.09 |

Table III: Ablation study of TransResU-Net on the Kvasir-SEG [9] and BKAI-IGH [12] datasets.
No. | Dataset | Method | DSC | mIoU | Recall | Precision |
---|---|---|---|---|---|---|
#1 | Kvasir-SEG [9] | TransResU-Net (w/o Transformer Encoder block & Dilated Conv block) | 0.8679 | 0.7979 | 0.8863 | 0.8964 |
#2 | Kvasir-SEG [9] | TransResU-Net (Proposed) | 0.8884 | 0.8214 | 0.9106 | 0.9022 |
#1 | BKAI-IGH [12] | TransResU-Net (w/o Transformer Encoder block & Dilated Conv block) | 0.8763 | 0.8108 | 0.8908 | 0.9013 |
#2 | BKAI-IGH [12] | TransResU-Net (Proposed) | 0.9154 | 0.8568 | 0.9142 | 0.9299 |
III Experimental setup
We use the Kvasir-SEG [9] and BKAI-IGH [12] datasets to extensively evaluate the proposed TransResU-Net. All models in this study are implemented in the PyTorch framework and trained on an NVIDIA RTX 3090 GPU. The images and masks from both datasets are first resized to a common resolution and then split into training and testing sets. For Kvasir-SEG, we use the official split, where 880 images and masks are used for training and the rest are used for testing. For the BKAI-IGH dataset, we split the entire dataset 80:10:10, where 80% is used for training, 10% for validation, and the remaining 10% for testing. All models are trained for 200 epochs with an early stopping mechanism. We use the Adam optimizer with a learning rate of 1e and a batch size of 16, together with a combination of binary cross-entropy loss and dice loss. All models are trained with the same set of hyperparameters for a fair comparison.
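As a rough illustration of the combined objective mentioned above, the following sketch computes binary cross-entropy and dice loss on flattened probability maps in plain Python. The equal weighting of the two terms, the smoothing constant, and the function names are assumptions made for this example; the actual training code may differ, for instance by using PyTorch tensors and a different weighting.

```python
import math
from typing import Sequence


def bce_loss(probs: Sequence[float], targets: Sequence[float],
             eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over all pixels."""
    if not probs or len(probs) != len(targets):
        raise ValueError("probs and targets must be non-empty and equal in length")
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs: Sequence[float], targets: Sequence[float],
              smooth: float = 1.0) -> float:
    """Soft dice loss: 1 - (2 * overlap + smooth) / (sum of both masks + smooth)."""
    if not probs or len(probs) != len(targets):
        raise ValueError("probs and targets must be non-empty and equal in length")
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(probs: Sequence[float], targets: Sequence[float]) -> float:
    """Assumed combination: unweighted sum of BCE and dice loss."""
    return bce_loss(probs, targets) + dice_loss(probs, targets)


# Toy 4-pixel example: a good prediction scores lower than a poor one.
good = combined_loss([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])
poor = combined_loss([0.1, 0.9, 0.2, 0.8], [1, 0, 1, 0])
print(good < poor)  # True
```

Cross-entropy penalizes each pixel independently, while the dice term scores the overlap of the whole mask, which is why the two are often paired when the foreground region is small.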
IV Results and Discussion
We present the quantitative results in Table I and Table II. On Kvasir-SEG, TransResU-Net achieved a dice coefficient (DSC) of 0.8884, mIoU of 0.8214, recall of 0.9106, precision of 0.9022, accuracy of 0.9651, F2 of 0.8971, and a speed of 48.61 FPS. The most competitive network was DeepLabV3+ [1], which our architecture outperformed by 0.47 percentage points in DSC and 0.41 percentage points in mIoU. On BKAI-IGH [12], TransResU-Net achieved a high DSC of 0.9154 and mIoU of 0.8568, outperforming DeepLabV3+ by 2.17 percentage points in DSC and 2.54 percentage points in mIoU.
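For reference, the metrics listed above can all be derived from the pixel-wise confusion counts of a predicted mask against its ground truth. The sketch below is a minimal plain-Python version for a single flattened binary mask; the function name is hypothetical, and how scores are aggregated across images (for example, averaging per-image values) is an assumption not specified here.

```python
from typing import Dict, Sequence


def segmentation_metrics(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """Compute DSC, IoU, recall, precision, accuracy and F2 for binary masks."""
    if not pred or len(pred) != len(truth):
        raise ValueError("pred and truth must be non-empty and equal in length")
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)

    def safe_div(num: float, den: float) -> float:
        # Return 0.0 for empty denominators instead of raising.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "dsc": safe_div(2 * tp, 2 * tp + fp + fn),
        "iou": safe_div(tp, tp + fp + fn),
        "recall": recall,
        "precision": precision,
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        # F-beta with beta = 2 weights recall more heavily than precision.
        "f2": safe_div(5 * precision * recall, 4 * precision + recall),
    }


# Toy example: one true positive, one false positive, one false negative.
scores = segmentation_metrics([1, 1, 0, 0], [1, 0, 1, 0])
print(scores["dsc"], scores["iou"])
```

Because dataset-level scores are typically averaged over images, values such as F2 in the tables need not equal the value obtained by plugging the averaged precision and recall into the formula.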
Table III shows the results of the ablation study, in which we compared TransResU-Net without the transformer encoder block and dilated convolution block against the full proposed TransResU-Net. Adding the transformer encoder block and dilated convolution block increased performance by 2.05 percentage points in DSC and 2.35 percentage points in mIoU on Kvasir-SEG. On BKAI-IGH, the proposed TransResU-Net outperformed the ablated variant by 3.91 percentage points in DSC and 4.6 percentage points in mIoU. Recall and precision were also higher for the proposed method on both datasets.
Examples of qualitative results of TransResU-Net, along with its heatmaps, are presented in Figure 2. Here, we show the results of U-Net [13], DeepLabV3+ [1], and the proposed TransResU-Net on examples such as a small or diminutive polyp, a regular polyp, and a flat polyp from the Kvasir-SEG and BKAI-IGH datasets. The visual comparison shows that the mask predicted by TransResU-Net delineates boundaries better than those of DeepLabV3+ and U-Net. U-Net showed under-segmentation for flat polyps, whereas DeepLabV3+ showed over-segmentation for diminutive polyps. TransResU-Net characterized all of these polyp types accurately. In the qualitative results, we also show the intermediate results (heatmaps) of the proposed TransResU-Net. The red and yellow colors in the heatmap indicate the most relevant features of TransResU-Net, whereas the blue color indicates the least significant features produced by the architecture.
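The gains quoted in this section are absolute differences between scores, expressed in percentage points. As a quick check, the sketch below recomputes them from the DSC and mIoU values in Tables I to III; the helper name and the dictionary layout are hypothetical and exist only for this example.

```python
def gain_in_points(ours: float, baseline: float) -> float:
    """Absolute score difference expressed in percentage points."""
    return round((ours - baseline) * 100.0, 2)


# (proposed, DeepLabV3+, ablated variant) as reported in Tables I to III.
SCORES = {
    ("Kvasir-SEG", "DSC"): (0.8884, 0.8837, 0.8679),
    ("Kvasir-SEG", "mIoU"): (0.8214, 0.8173, 0.7979),
    ("BKAI-IGH", "DSC"): (0.9154, 0.8937, 0.8763),
    ("BKAI-IGH", "mIoU"): (0.8568, 0.8314, 0.8108),
}

gains = {
    key: (gain_in_points(ours, deeplab), gain_in_points(ours, ablated))
    for key, (ours, deeplab, ablated) in SCORES.items()
}

for (dataset, metric), (vs_deeplab, vs_ablated) in gains.items():
    print(f"{dataset} {metric}: +{vs_deeplab} vs DeepLabV3+, +{vs_ablated} vs ablated")
```

Relative improvements (the difference divided by the baseline score) would be slightly larger than these figures, so the two should not be mixed when comparing against other reports.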
V Conclusion
In this paper, we propose the TransResU-Net architecture, which takes advantage of a transformer encoder block, residual blocks, and dilated convolution as its core components for real-time colonoscopy polyp segmentation. The self-attention network in the transformer and the dilated convolution block further boost the performance of the architecture. Our experimental results show that the proposed architecture can efficiently segment polyp frames, with high dice coefficients of 0.8884 and 0.9154 on the highly diverse and well-curated Kvasir-SEG and BKAI-IGH colonoscopy datasets, respectively. The proposed model achieved real-time speeds of 48.61 and 42.09 FPS, respectively. The high performance of the algorithm on polyp segmentation tasks is a positive signal for the development of CADx systems that can be deployed in clinical settings in the near future. In future work, we plan to integrate more transformer blocks into the proposed network to further boost performance. Additionally, we will test our algorithm on video sequence datasets to observe whether it performs reasonably well on video frames as well.
Acknowledgement
This project is partially supported by the NIH funding: R01-CA246704 and R01-CA240639.
References
- [1] (2018) Encoder-decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818.
- [2] (2015) ASGE technology committee systematic review and meta-analysis assessing the ASGE PIVI thresholds for adopting real-time endoscopic assessment of the histology of diminutive colorectal polyps. Gastrointestinal endoscopy 81 (3), pp. 502–e1.
- [3] (2022) Transformer based Generative Adversarial Network for Liver Segmentation. In Proceedings of the International Conference on Parallel Artificial Intelligence.
- [4] (2020) PraNet: Parallel Reverse Attention Network for Polyp Segmentation. In Proceedings of the International conference on medical image computing and computer-assisted intervention (MICCAI), pp. 263–273.
- [5] (2016) Deep Residual Learning for Image Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 770–778.
- [6] (2021) HarDNet-MSEG: A Simple Encoder-Decoder Polyp Segmentation Neural Network that Achieves over 0.9 Mean Dice and 86 FPS. arXiv preprint arXiv:2101.07172.
- [7] (2021) Real-Time Polyp Detection, Localization and Segmentation in Colonoscopy using Deep Learning. IEEE Access 9, pp. 40496–40510.
- [8] (2020) DoubleU-Net: A Deep Convolutional Neural Network for Medical Image Segmentation. In Proceedings of the International symposium on computer-based medical systems (CBMS), pp. 558–564.
- [9] (2020) Kvasir-SEG: A Segmented Polyp Dataset. In Proceedings of the International Conference on Multimedia Modeling (MMM), pp. 451–462.
- [10] (2019) ResUNet++: An Advanced Architecture for Medical Image Segmentation. In Proceedings of the International Symposium on Multimedia (ISM), pp. 225–2255.
- [11] (1996) Randomised study of screening for colorectal cancer with faecal-occult-blood test. The Lancet 348 (9040), pp. 1467–1471.
- [12] (2021) NeoUNet: Towards Accurate Colon Polyp Segmentation and Neoplasm Detection. In International Symposium on Visual Computing, pp. 15–28.
- [13] (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical image computing and computer-assisted intervention (MICCAI), pp. 234–241.
- [14] (2004) How many endoscopies are performed for colorectal cancer screening? Results from CDC’s survey of endoscopic capacity. Gastroenterology 127 (6), pp. 1670–1677.
- [15] (2022) Cancer statistics, 2022. CA: a cancer journal for clinicians.
- [16] (2021) Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians 71 (3), pp. 209–249.
- [17] (2021) DDANet: Dual Decoder Attention Network for Automatic Polyp Segmentation. In Proceedings of the International Conference on Pattern Recognition workshop, pp. 307–314.
- [18] (2022) TGANet: Text-guided Attention for Improved Polyp Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).
- [19] (2022) FANet: A Feedback Attention Network for Improved Biomedical Image Segmentation. IEEE Transactions on Neural Networks and Learning Systems.
- [20] (2017) Attention is all you need. Advances in neural information processing systems (NIPS) 30.
- [21] (2020) Physician sentiment toward artificial intelligence (AI) in colonoscopic practice: a survey of US gastroenterologists. Endoscopy international open 8 (10), pp. E1379–E1384.
- [22] (2018) Access to pathology and laboratory medicine services: a crucial gap. The Lancet 391 (10133), pp. 1927–1938.
- [23] (2015) Multi-Scale Context Aggregation by Dilated Convolutions. arXiv preprint arXiv:1511.07122.
- [24] (2012) Colonoscopic polypectomy and long-term prevention of colorectal-cancer deaths. N Engl J Med 366, pp. 687–696.
- [25] (2018) Road Extraction by Deep Residual U-Net. IEEE Geoscience and Remote Sensing Letters 15 (5), pp. 749–753.
- [26] (2019) Magnitude, risk factors, and factors associated with adenoma miss rate of tandem colonoscopy: a systematic review and meta-analysis. Gastroenterology 156 (6), pp. 1661–1674.
- [27] (2018) UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 3–11.