
TransResU-Net: Transformer based ResU-Net for Real-Time Colonoscopy Polyp Segmentation

by   Nikhil Kumar Tomar, et al.

Colorectal cancer (CRC) is one of the most common cancers and a leading cause of cancer-related mortality worldwide. Timely colon cancer screening is the key to early detection. Colonoscopy is the primary modality used to diagnose colon cancer. However, the miss rate of polyps, adenomas, and advanced adenomas remains significantly high. Early detection of polyps at the precancerous stage can help reduce the mortality rate and the economic burden associated with colorectal cancer. A deep learning-based computer-aided diagnosis (CADx) system may help gastroenterologists identify polyps that would otherwise be missed, thereby improving the polyp detection rate. Additionally, a CADx system could prove to be a cost-effective way to improve long-term colorectal cancer prevention. In this study, we propose a deep learning-based architecture for automatic polyp segmentation called Transformer ResU-Net (TransResU-Net). The proposed architecture is built upon residual blocks with ResNet-50 as the backbone and takes advantage of the transformer self-attention mechanism as well as dilated convolution(s). Our experimental results on two publicly available polyp segmentation benchmark datasets show that TransResU-Net obtains a highly promising dice score at real-time speed. Given its strong performance metrics, we conclude that TransResU-Net could be a strong benchmark for building a real-time polyp detection system for the early diagnosis, treatment, and prevention of colorectal cancer. The source code of the proposed TransResU-Net is publicly available at



I Introduction

Colorectal cancer (CRC) is the third most common cancer affecting men and the second most common cancer affecting women globally according to the World Health Organization GLOBOCAN database [16]. Approximately 70% of cases occur in the colon and the remaining occur in the rectum [15]. Considering that colonoscopy is the primary screening modality and most prevalent diagnostic technique in gastrointestinal endoscopy, quality assurance is critical [14]. Detection of cancer at an early and curable stage and removal of precancerous adenomas or serrated lesions during colonoscopy is the key to colon cancer diagnosis. It is also associated with reduction in mortality [11, 24].

Colonoscopy is an expensive, resource-demanding, and unpleasant procedure, and many patients are unwilling to participate in CRC screening programs repeatedly. According to a recent meta-analysis, up to 26% of colonoscopies may have missed lesions and adenomas [26]. This is because colonoscopy is an operator-driven procedure that depends solely on the clinical acumen and skills of the endoscopist. With current colonoscopy equipment, less experienced endoscopists cannot distinguish all neoplastic from non-neoplastic polyps during routine colonoscopy examination [21, 2]. Automatic, algorithm-based real-time diagnosis of polyps (irrespective of their morphology) during colonoscopy could help endoscopists identify potential polyps for removal with improved efficiency and accuracy. It could also reduce the access barrier to pathological services [22].

Fig. 1: Block diagram of the proposed TransResU-Net architecture

Recently, there has been great interest in deep learning for CRC screening. Various studies have aimed to develop CADx models for automatic polyp segmentation [18, 6, 8, 4, 19, 7, 27]. Among recent deep learning architectures, Transformer [20] and U-Net [13] based architectures have attracted the most attention. Various extensions of U-Net have been proposed in the literature [27, 10, 8] for automatic polyp segmentation. Despite the good results produced by these studies, more research on decent-sized polyp datasets is needed to demonstrate the effectiveness of the proposed methods for automatic polyp segmentation. One of the promising extensions of U-Net is ResU-Net [25]. The architecture is built on residual units that use identity mapping (shortcut connections). The residual unit eases the training of the deep neural network, whereas the identity mapping facilitates better flow of the gradients. Similarly, Yu et al. [23] presented an efficient dilated convolution block that serves as a context module, which helps to improve the accuracy of the semantic segmentation network.

| Method | Backbone | DSC | mIoU | Recall | Precision | Accuracy | F2 | FPS |
|---|---|---|---|---|---|---|---|---|
| U-Net [13] | - | 0.8264 | 0.7472 | 0.8504 | 0.8703 | 0.9510 | 0.8353 | 156.83 |
| ResU-Net [25] | - | 0.7642 | 0.6634 | 0.8025 | 0.8200 | 0.9341 | 0.7740 | 196.85 |
| U-Net++ [27] | - | 0.8228 | 0.7419 | 0.8437 | 0.8607 | 0.9491 | 0.8295 | 126.14 |
| ResU-Net++ [10] | - | 0.6453 | 0.5341 | 0.6964 | 0.7080 | 0.9044 | 0.6575 | 57.99 |
| ColonSegNet [7] | - | 0.7920 | 0.6980 | 0.8193 | 0.8432 | 0.9415 | 0.7999 | 129.04 |
| HarDNet-MSEG [6] | - | 0.8260 | 0.7459 | 0.8485 | 0.8652 | 0.9492 | 0.8358 | 42.00 |
| DeepLabV3+ [1] | ResNet50 | 0.8837 | 0.8173 | 0.9014 | 0.9028 | 0.9679 | 0.8904 | 102.62 |
| DDANet [17] | - | 0.7415 | 0.6448 | 0.7953 | 0.7670 | 0.9326 | 0.7640 | 88.70 |
| TransResU-Net (Ours) | ResNet50 | 0.8884 | 0.8214 | 0.9106 | 0.9022 | 0.9651 | 0.8971 | 48.61 |

TABLE I: Quantitative results on the Kvasir-SEG [9] dataset.

Inspired by the successes of Transformers [3], residual units [5], and dilated convolution [23], we develop a novel deep learning-based architecture, TransResU-Net. We tested TransResU-Net on two decent-sized publicly available polyp datasets to determine whether the proposed method can detect early signs of CRC with high performance at real-time speed. The main contributions of our work can be summarized as follows:

  1. We propose a novel deep segmentation architecture called TransResU-Net, which combines the strengths of a transformer block and dilated convolution layers with a pre-trained ResNet50, a combination that, to our knowledge, has not been explored before.

  2. We compared TransResU-Net with eight commonly used benchmark algorithms for the automated polyp segmentation task. TransResU-Net showed state-of-the-art performance on the Kvasir-SEG [9] and BKAI-IGH [12] datasets.

II Method

Figure 1 shows the block diagram of the proposed TransResU-Net. The architecture follows an encoder-decoder scheme, with a pre-trained ResNet50 as the encoder and four decoder blocks. The input image is fed to the pre-trained encoder, which consists of multiple bottleneck residual blocks along with pooling layers that transform the input image into a spatially reduced feature representation. The output of the pre-trained encoder is then passed through a transformer encoder block and a dilated convolution block. The transformer encoder block [20] consists of a self-attention network followed by a feed-forward neural network, which helps TransResU-Net learn a more robust representation. Meanwhile, the dilated convolution block enlarges the receptive field of the convolution filters and thus enhances the effective capacity of the network.
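The following is a minimal PyTorch sketch of how a transformer encoder block could be applied to the encoder output, assuming the feature map is flattened into a sequence of spatial tokens. The class name `TransformerBottleneck`, the number of heads and layers, the feed-forward width, and the absence of positional encoding are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn


class TransformerBottleneck(nn.Module):
    """Self-attention + feed-forward over a CNN feature map.

    Hypothetical settings: one encoder layer, 8 heads, feed-forward
    width of 2 * channels, no positional encoding.
    """

    def __init__(self, channels: int, num_heads: int = 8, num_layers: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels,            # must be divisible by num_heads
            nhead=num_heads,
            dim_feedforward=2 * channels,
            batch_first=True,            # tokens are shaped (B, N, C)
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # (B, C, H, W) -> (B, H*W, C): each spatial location becomes a token
        tokens = x.flatten(2).transpose(1, 2)
        tokens = self.encoder(tokens)
        # Back to a feature map with the original layout
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = TransformerBottleneck(channels=64)
    feats = torch.randn(2, 64, 8, 8)     # stand-in for the encoder output
    print(block(feats).shape)
```

Because the block preserves the input shape, its output can be concatenated channel-wise with the output of a parallel convolutional branch.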

The dilated convolution block consists of four parallel convolution layers, each with its own dilation rate. Each layer is followed by batch normalization and a ReLU activation function. Next, we concatenate the features from all four layers and pass them through a convolution layer to reduce the number of feature channels. The outputs of the transformer encoder block and the dilated convolution block are concatenated and passed to the first decoder block. The decoder block begins with bilinear upsampling, followed by concatenation with the skip connection from the corresponding encoder block. These skip connections carry feature maps directly from the encoder to the decoder, which is important because some features are lost due to the depth of the network. They also improve the flow of gradients during backpropagation and thus help to improve the overall performance of the network. The concatenated feature maps are then passed through two residual blocks, each consisting of two convolution layers and an identity mapping. Subsequently, the output of the first decoder block is passed to the second decoder block, and so on. In this way, the feature maps are progressively transformed into more meaningful semantic features. The output of the last decoder block is passed through a convolution layer followed by a sigmoid activation function, which generates a binary segmentation mask.
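The dilated convolution block and the decoder block described above can be sketched in PyTorch as follows. The dilation rates and kernel sizes are not recoverable from the text here, so the rates `(1, 3, 6, 9)`, the 3×3 branch kernels, the 1×1 fusion convolution, and the 1×1 shortcut projection are placeholder assumptions; the class names are hypothetical as well.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedConvBlock(nn.Module):
    """Four parallel dilated convolutions, concatenated and fused."""

    def __init__(self, in_ch, out_ch, rates=(1, 3, 6, 9)):  # rates are placeholders
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # padding == dilation keeps the spatial size for a 3x3 kernel
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # Assumed 1x1 convolution to reduce the concatenated channels
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


class ResidualBlock(nn.Module):
    """Two convolution layers plus a shortcut connection."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Identity when shapes match, otherwise a 1x1 projection (assumption)
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, bias=False))

    def forward(self, x):
        return F.relu(self.body(x) + self.shortcut(x))


class DecoderBlock(nn.Module):
    """Bilinear upsampling, skip concatenation, then two residual blocks."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.res1 = ResidualBlock(in_ch + skip_ch, out_ch)
        self.res2 = ResidualBlock(out_ch, out_ch)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                          align_corners=False)
        x = torch.cat([x, skip], dim=1)
        return self.res2(self.res1(x))


if __name__ == "__main__":
    bottleneck = torch.randn(1, 64, 8, 8)    # stand-in encoder output
    skip = torch.randn(1, 16, 16, 16)        # stand-in skip feature map
    d = DilatedConvBlock(64, 32)(bottleneck)
    print(DecoderBlock(32, 16, 16)(d, skip).shape)
```

Upsampling to the spatial size of the skip tensor, rather than by a fixed factor of two, is a design choice made here so the sketch tolerates odd input sizes.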

| Method | Backbone | DSC | mIoU | Recall | Precision | Accuracy | F2 | FPS |
|---|---|---|---|---|---|---|---|---|
| U-Net [13] | - | 0.8286 | 0.7599 | 0.8295 | 0.8999 | 0.9903 | 0.8264 | 160.27 |
| ResU-Net [25] | - | 0.7433 | 0.6580 | 0.7447 | 0.8711 | 0.9843 | 0.7387 | 197.94 |
| U-Net++ [27] | - | 0.8275 | 0.7563 | 0.8388 | 0.8942 | 0.9895 | 0.8308 | 123.45 |
| ResU-Net++ [10] | - | 0.7130 | 0.6280 | 0.7240 | 0.8578 | 0.9832 | 0.7132 | 55.86 |
| ColonSegNet [7] | - | 0.7748 | 0.6881 | 0.7852 | 0.8711 | 0.9843 | 0.7746 | 122.42 |
| HarDNet-MSEG [6] | - | 0.7627 | 0.6734 | 0.7532 | 0.8344 | 0.9863 | 0.7528 | 41.20 |
| DeepLabV3+ [1] | ResNet50 | 0.8937 | 0.8314 | 0.8870 | 0.9333 | 0.9937 | 0.8882 | 99.16 |
| DDANet [17] | - | 0.7269 | 0.6507 | 0.7454 | 0.7575 | 0.9851 | 0.7335 | 86.46 |
| TransResU-Net (Ours) | ResNet50 | 0.9154 | 0.8568 | 0.9142 | 0.9299 | 0.9938 | 0.9129 | 42.09 |

TABLE II: Quantitative results on the BKAI-IGH [12] dataset.

Fig. 2: Qualitative results comparison along with the heatmap on the Kvasir-SEG [9] and BKAI-IGH [12] datasets.

| No. | Dataset | Method | DSC | mIoU | Recall | Precision |
|---|---|---|---|---|---|---|
| #1 | Kvasir-SEG [9] | TransResU-Net (w/o Transformer Encoder block & Dilated Conv block) | 0.8679 | 0.7979 | 0.8863 | 0.8964 |
| #2 | Kvasir-SEG [9] | TransResU-Net (Proposed) | 0.8884 | 0.8214 | 0.9106 | 0.9022 |
| #1 | BKAI-IGH [12] | TransResU-Net (w/o Transformer Encoder block & Dilated Conv block) | 0.8763 | 0.8108 | 0.8908 | 0.9013 |
| #2 | BKAI-IGH [12] | TransResU-Net (Proposed) | 0.9154 | 0.8568 | 0.9142 | 0.9299 |

TABLE III: Ablation study of the proposed TransResU-Net on the publicly available polyp datasets.

III Experimental setup

We used the Kvasir-SEG [9] and BKAI-IGH [12] datasets to extensively evaluate the proposed TransResU-Net. All models in this study are implemented in the PyTorch framework and trained on an NVIDIA RTX 3090 GPU. The images and masks from both datasets are first resized to a fixed resolution and then split into training and testing sets. For Kvasir-SEG, we use the official split, in which 880 images and masks are used for training and the rest for testing. For the BKAI-IGH dataset, we split the entire dataset 80:10:10, with 80% used for training, 10% for validation, and the remaining 10% for testing. All models are trained for 200 epochs with an early stopping mechanism. We use the Adam optimizer with a learning rate of 1e… and a batch size of 16, together with a combination of binary cross-entropy loss and dice loss. All models are trained with the same set of hyperparameters for a fair comparison.
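A combined binary cross-entropy and dice loss can be sketched in PyTorch as below. The equal weighting of the two terms, the smoothing constant of 1.0, and the choice to take raw logits (applying the sigmoid inside the loss for numerical stability) are assumptions made for illustration; the class name `DiceBCELoss` is hypothetical.

```python
import torch
import torch.nn as nn


class DiceBCELoss(nn.Module):
    """Sum of binary cross-entropy and (1 - soft dice), equally weighted."""

    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.smooth = smooth                  # avoids division by zero
        self.bce = nn.BCEWithLogitsLoss()     # sigmoid is applied internally

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        bce = self.bce(logits, target)
        # Soft dice is computed per image on predicted probabilities
        probs = torch.sigmoid(logits).flatten(1)
        target = target.flatten(1)
        inter = (probs * target).sum(dim=1)
        dice = (2 * inter + self.smooth) / (
            probs.sum(dim=1) + target.sum(dim=1) + self.smooth
        )
        return bce + (1 - dice).mean()


if __name__ == "__main__":
    logits = torch.randn(4, 1, 32, 32)                   # raw network outputs
    masks = (torch.rand(4, 1, 32, 32) > 0.5).float()     # binary ground truth
    print(DiceBCELoss()(logits, masks).item())
```

If the network already ends in a sigmoid, as described for the final layer of TransResU-Net, `nn.BCELoss` on the probabilities would be the corresponding choice and the explicit `torch.sigmoid` call would be dropped.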

IV Results and Discussions

We present the quantitative results in Table I and Table II. On Kvasir-SEG, TransResU-Net achieved a dice coefficient (DSC) of 0.8884, mIoU of 0.8214, recall of 0.9106, precision of 0.9022, accuracy of 0.9651, F2 of 0.8971, and a speed of 48.61 FPS. The most competitive network was DeepLabV3+ [1], which our architecture outperformed by 0.47% in DSC and 0.41% in mIoU. On BKAI-IGH [12], TransResU-Net achieved a DSC of 0.9154 and mIoU of 0.8568, outperforming DeepLabV3+ by 2.17% in DSC and 2.54% in mIoU.
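For reference, the metrics reported in the tables can be computed from the pixel-wise confusion counts of a binary mask, as in the plain-Python sketch below. The function name `segmentation_metrics` and the small `eps` term are illustrative, and whether the paper averages per image or over the whole test set is not stated here, so this sketch handles a single image only.

```python
def segmentation_metrics(pred, target, eps=1e-7):
    """Compute binary segmentation metrics for one image.

    pred, target: flat sequences of 0/1 pixel labels of equal length.
    """
    tp = fp = fn = tn = 0
    for p, t in zip(pred, target):
        if p and t:
            tp += 1          # polyp pixel correctly predicted
        elif p and not t:
            fp += 1          # background predicted as polyp
        elif t:
            fn += 1          # polyp pixel missed
        else:
            tn += 1          # background correctly predicted

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
        "recall": recall,
        "precision": precision,
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        # F-beta with beta = 2 weights recall more heavily than precision
        "f2": 5 * precision * recall / (4 * precision + recall + eps),
    }


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1, 0]
    target = [1, 0, 0, 0, 1, 1]
    for name, value in segmentation_metrics(pred, target).items():
        print(f"{name}: {value:.4f}")
```

In the toy example there are two true positives, one false positive, one false negative, and two true negatives, which gives a DSC of about 0.667 and an IoU of 0.5.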

Table III shows the results of the ablation study, in which we compared TransResU-Net without the transformer encoder block and dilated convolution block against the full proposed TransResU-Net. On Kvasir-SEG, adding the transformer encoder block and dilated convolution block increased performance by 2.05% in DSC and 2.35% in mIoU. On BKAI-IGH, the full TransResU-Net outperformed the ablated variant by 3.91% in DSC and 4.60% in mIoU. Recall and precision were also higher for the proposed method on both datasets.

Examples of qualitative results of TransResU-Net, along with its heatmaps, are presented in Figure 2. Here, we show the results of U-Net [13], DeepLabV3+ [1], and the proposed TransResU-Net on examples such as a small or diminutive polyp, a regular polyp, and a flat polyp from the Kvasir-SEG and BKAI-IGH datasets. The visual comparison shows that the masks predicted by TransResU-Net delineate boundaries better than those of DeepLabV3+ and U-Net. U-Net showed under-segmentation for flat polyps, whereas DeepLabV3+ showed over-segmentation for diminutive polyps. In these examples, TransResU-Net segmented all three polyp types accurately. In the qualitative results, we also show the intermediate results (heatmaps) of the proposed TransResU-Net. The red and yellow colors in the heatmap indicate the most relevant features of TransResU-Net, whereas blue indicates the least significant features produced by the architecture.

V Conclusion

In this paper, we propose the TransResU-Net architecture, which uses a transformer encoder block, residual blocks, and dilated convolution as its core components for real-time colonoscopy polyp segmentation. The self-attention network in the transformer and the dilated convolution block further boost the performance of the architecture. Our experimental results showed that the proposed architecture can efficiently segment polyp frames, with high dice coefficients of 0.8884 and 0.9154 on the Kvasir-SEG and BKAI-IGH datasets, respectively, both of which are highly diverse and well curated. The proposed model achieved real-time speeds of 48.61 and 42.09 FPS, respectively. The high performance of the algorithm on polyp segmentation tasks is a positive signal for the development of a CADx system that could be deployed in clinical settings in the near future. In future work, we plan to integrate more transformer blocks into the proposed network to further boost performance. Additionally, we will test our algorithm on a video sequence dataset to see whether it also performs reasonably well on video frames.


This project is partially supported by the NIH funding: R01-CA246704 and R01-CA240639.


  • [1] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: TABLE I, TABLE II, §IV, §IV.
  • [2] B. K. A. Dayyeh, N. Thosani, V. Konda, M. B. Wallace, D. K. Rex, S. S. Chauhan, J. H. Hwang, S. Komanduri, M. Manfredi, J. T. Maple, et al. (2015) ASGE technology committee systematic review and meta-analysis assessing the asge pivi thresholds for adopting real-time endoscopic assessment of the histology of diminutive colorectal polyps. Gastrointestinal endoscopy 81 (3), pp. 502–e1. Cited by: §I.
  • [3] U. Demir, Z. Zhang, B. Wang, M. Antalek, E. Keles, D. Jha, A. Borhani, D. Ladner, and U. Bagci (2022) Transformer based Generative Adversarial Network for Liver Segmentation. In Proceedings of the International Conference on Parallel Artificial Intelligence. Cited by: §I.
  • [4] D. Fan, G. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao (2020) PraNet: Parallel Reverse Attention Network for Polyp Segmentation. In Proceedings of the International conference on medical image computing and computer-assisted intervention (MICCAI), pp. 263–273. Cited by: §I.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 770–778. Cited by: §I.
  • [6] C. Huang, H. Wu, and Y. Lin (2021) HarDNet-MSEG A Simple Encoder-Decoder Polyp Segmentation Neural Network that Achieves over 0.9 Mean Dice and 86 FPS. arXiv preprint arXiv:2101.07172. Cited by: TABLE I, §I, TABLE II.
  • [7] D. Jha, S. Ali, N. K. Tomar, H. D. Johansen, D. Johansen, J. Rittscher, M. A. Riegler, and P. Halvorsen (2021) Real-Time Polyp Detection, Localization and Segmentation in Colonoscopy using Deep Learning. IEEE Access 9, pp. 40496–40510. Cited by: TABLE I, §I, TABLE II.
  • [8] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, and H. D. Johansen (2020) DoubleU-Net: A Deep Convolutional Neural Network for Medical Image Segmentation. In Proceedings of the International symposium on computer-based medical systems (CBMS), pp. 558–564. Cited by: §I.
  • [9] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. d. Lange, D. Johansen, and H. D. Johansen (2020) Kvasir-SEG: A Segmented Polyp Dataset. In Proceedings of the International Conference on Multimedia Modeling (MMM), pp. 451–462. Cited by: item 2, TABLE I, Fig. 2, TABLE III, §III.
  • [10] D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. De Lange, P. Halvorsen, and H. D. Johansen (2019) ResUNet++: An Advanced Architecture for Medical Image Segmentation. In Proceedings of the International Symposium on Multimedia (ISM), pp. 225–2255. Cited by: TABLE I, §I, TABLE II.
  • [11] O. Kronborg, C. Fenger, J. Olsen, O. D. Jørgensen, and O. Søndergaard (1996) Randomised study of screening for colorectal cancer with faecal-occult-blood test. The Lancet 348 (9040), pp. 1467–1471. Cited by: §I.
  • [12] P. N. Lan, N. S. An, D. V. Hang, D. Van Long, T. Q. Trung, N. T. Thuy, and D. V. Sang (2021) NeoUNet: Towards Accurate Colon Polyp Segmentation and Neoplasm Detection. In International Symposium on Visual Computing, pp. 15–28. Cited by: item 2, Fig. 2, TABLE II, TABLE III, §III, §IV.
  • [13] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical image computing and computer-assisted intervention (MICCAI), pp. 234–241. Cited by: TABLE I, §I, TABLE II, §IV.
  • [14] L. C. Seeff, T. B. Richards, J. A. Shapiro, M. R. Nadel, D. L. Manninen, L. S. Given, F. B. Dong, L. D. Winges, and M. T. McKenna (2004) How many endoscopies are performed for colorectal cancer screening? results from cdc’s survey of endoscopic capacity. Gastroenterology 127 (6), pp. 1670–1677. Cited by: §I.
  • [15] R. L. Siegel, K. D. Miller, H. E. Fuchs, and A. Jemal (2022) Cancer statistics, 2022. CA: a cancer journal for clinicians. Cited by: §I.
  • [16] H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray (2021) Global cancer statistics 2020: globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians 71 (3), pp. 209–249. Cited by: §I.
  • [17] N. K. Tomar, D. Jha, S. Ali, H. D. Johansen, D. Johansen, M. A. Riegler, and P. Halvorsen (2021) DDANet: Dual Decoder Attention Network for Automatic Polyp Segmentation. In Proceedings of the International Conference on Pattern Recognition workshop, pp. 307–314. Cited by: TABLE I, TABLE II.
  • [18] N. K. Tomar, D. Jha, U. Bagci, and S. Ali (2022) TGANet: Text-guided Attention for Improved Polyp Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Cited by: §I.
  • [19] N. K. Tomar, D. Jha, M. A. Riegler, H. D. Johansen, D. Johansen, J. Rittscher, P. Halvorsen, and S. Ali (2022) FANet: A Feedback Attention Network for Improved Biomedical Image Segmentation. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §I.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems (NIPS) 30. Cited by: §I, §II.
  • [21] V. Wadhwa, M. Alagappan, A. Gonzalez, K. Gupta, J. R. G. Brown, J. Cohen, M. Sawhney, D. Pleskow, and T. M. Berzin (2020) Physician sentiment toward artificial intelligence (ai) in colonoscopic practice: a survey of us gastroenterologists. Endoscopy international open 8 (10), pp. E1379–E1384. Cited by: §I.
  • [22] M. L. Wilson, K. A. Fleming, M. A. Kuti, L. M. Looi, N. Lago, and K. Ru (2018) Access to pathology and laboratory medicine services: a crucial gap. The Lancet 391 (10133), pp. 1927–1938. Cited by: §I.
  • [23] F. Yu and V. Koltun (2015) Multi-Scale Context Aggregation by Dilated Convolutions. arXiv preprint arXiv:1511.07122. Cited by: §I, §I.
  • [24] A. G. Zauber, S. J. Winawer, M. J. O’Brien, I. Lansdorp-Vogelaar, M. van Ballegooijen, B. F. Hankey, W. Shi, J. H. Bond, M. Schapiro, J. F. Panish, et al. (2012) Colonoscopic polypectomy and long-term prevention of colorectal-cancer deaths. N Engl J Med 366, pp. 687–696. Cited by: §I.
  • [25] Z. Zhang, Q. Liu, and Y. Wang (2018) Road Extraction by Deep Residual U-Net. IEEE Geoscience and Remote Sensing Letters 15 (5), pp. 749–753. Cited by: TABLE I, §I, TABLE II.
  • [26] S. Zhao, S. Wang, P. Pan, T. Xia, X. Chang, X. Yang, L. Guo, Q. Meng, F. Yang, W. Qian, et al. (2019) Magnitude, risk factors, and factors associated with adenoma miss rate of tandem colonoscopy: a systematic review and meta-analysis. Gastroenterology 156 (6), pp. 1661–1674. Cited by: §I.
  • [27] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang (2018) UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 3–11. Cited by: TABLE I, §I, TABLE II.