NanoNet: Real-Time Polyp Segmentation in Video Capsule Endoscopy and Colonoscopy

by   Debesh Jha, et al.

Deep learning in gastrointestinal endoscopy can help improve clinical performance and assess lesions more accurately. To this end, semantic segmentation methods that can perform automated real-time delineation of a region-of-interest, e.g., boundary identification of cancer or precancerous lesions, can benefit both diagnosis and interventions. However, accurate and real-time segmentation of endoscopic images is extremely challenging due to its high operator dependence and high-definition image quality. To utilize automated methods in clinical settings, it is crucial to design lightweight models with low latency such that they can be integrated with low-end endoscope hardware devices. In this work, we propose NanoNet, a novel architecture for the segmentation of video capsule endoscopy and colonoscopy images. Our proposed architecture allows real-time performance and has higher segmentation accuracy compared to other more complex ones. We use video capsule endoscopy and standard colonoscopy datasets with polyps, and a dataset consisting of endoscopy biopsies and surgical instruments, to evaluate the effectiveness of our approach. Our experiments demonstrate the increased performance of our architecture in terms of a trade-off between model complexity, speed, model parameters, and metric performances. Moreover, the resulting model is tiny, with only about 36,000 parameters compared to traditional deep learning approaches having millions of parameters.




I Introduction

Gastrointestinal (GI) endoscopy is a widely used technique to diagnose and treat anomalies in the upper (esophagus, stomach, and duodenum) and the lower (large bowel and anus) GI tract. Among GI tract cancers, colorectal cancer (CRC) has the highest incidence and mortality rate [37]. There are several CRC screening options. These are usually divided into two categories, namely, invasive tests (visual examination-based tests) and non-invasive tests (stool, blood, and radiological tests). Colonoscopy, the gold standard for examining the large bowel (colon and rectum), is an invasive examination used to detect, observe, and remove abnormalities (such as polyps). It detects colorectal cancer with both high sensitivity and specificity. Sigmoidoscopy is another invasive test. Computed Tomography (CT) Colonoscopy, Fecal Occult Blood Test (FOBT), Fecal Immunochemical Test (FIT), and Video Capsule Endoscopy (VCE) are non-invasive tests. VCE is a technology for capturing video inside the GI tract. It has evolved into an important tool for detecting small bowel diseases [28].

Deep Learning (DL) methods have made significant breakthroughs in several medical domains such as lung cancer detection [5], diabetic retinopathy progression [4], and obstructive hypertrophic cardiomyopathy detection [15]. They have provided new opportunities to solve challenges such as bleeding, light over/underexposure, smoke, and reflections [9]. However, DL methods normally need a large annotated dataset, and a labeled medical dataset is difficult to obtain. First, it needs collaboration with hospitals. For data collection, the doctors require approval from various authorities and patient consent. They need to set protocols for the collection, and the collected data must be anonymized and cleaned with the help of data engineers. Domain experts must label the raw data, and after labeling, the annotations must be done depending upon the needs of the task. The whole process requires a significant amount of expert time and is costly. Additionally, it is an operator-dependent process: the quality of the data labeling and annotation depends on the expertise of the clinicians. Therefore, it is challenging to curate a larger dataset.

One way of solving the dataset issue is to create synthetic images using a Generative Adversarial Network (GAN) [14]. However, generated synthetic images may not always capture all the properties and characteristics of real endoscopic images. Consequently, the model may only learn to predict the properties from the synthetic images and may not perform well on a real endoscopic dataset. Another solution could be domain adaptation from a similar endoscopic dataset. However, we lack large publicly available labeled endoscopic datasets. Thus, a viable and compelling approach to solve the semantic segmentation task is to reuse ImageNet pre-trained encoders in the segmentation model [10]. The predicted masks from the algorithm can provide reliable information to the endoscopic model.

A lightweight Convolutional Neural Network (CNN) model can be essential for the development of real-time and efficient semantic segmentation methods. Usually, lightweight models are computationally efficient and require less memory. A smaller number of parameters makes the network less redundant. Lightweight CNN models are mainly being deployed in mobile applications [27]. A lightweight model can play a crucial role in resource-constrained clinical systems that need real-time prediction. Consequently, we propose a novel architecture, NanoNet, optimized for faster inference and high accuracy. An extremely lightweight model with very few trainable parameters, fast inference, and high performance has a small memory footprint and can therefore be incorporated into a wide range of devices. Therefore, we put forward this approach to address the challenges in endoscopy.

The main contributions of this work include the following:

  1. We proposed a novel architecture, named NanoNet, to segment video capsule endoscopy and colonoscopy images in real-time with high accuracy. The proposed architecture is very lightweight, and the model size is smaller, requiring less computational cost.

  2. VCE datasets are difficult to obtain with pixel-wise annotations. In this context, we have annotated 55 polyps from the “polyp” class of the Kvasir-Capsule dataset with the help of an expert gastroenterologist. We have made this dataset public and provided the benchmark.

  3. NanoNet achieves promising performance on the KvasirCapsule-SEG, Kvasir-SEG [21], 2020 Medico automatic polyp segmentation challenge [19], 2020 EndoTect challenge [18], and Kvasir-Instrument [22] datasets. All experiments conform with state-of-the-art (SOTA) in terms of parameter uses (size), speed, computation, and performance metrics.

  4. The model can be integrated with mobile and embedded devices because of fewer parameters used in the network.

II Related work

II-A Semantic segmentation of endoscopic images

Semantic segmentation of endoscopic images has been a well-established topic in medical image segmentation. Earlier work mostly relied on the handcrafted descriptors for feature learning [26, 3]. The handcrafted features such as color, shape, texture, and edges were extracted and fed to the Machine Learning (ML) classifier, which separates lesions from the background. However, the traditional ML methods based on handcrafted features suffer from low performance [8]. The recent works on polyp segmentation using both video capsule endoscopy and colonoscopy mostly relied on Deep Neural Networks (DNNs) [25, 32, 38, 16, 2, 13, 20].

With the DNN methods, there is progress in the performance for segmenting endoscopic images (for example, polyps). However, the network architectures are often complex, require high-end GPUs for training, and are computationally expensive [24, 23, 13]. Additionally, real-time lesion segmentation has often been ignored. Although there are some recent initiatives for the real-time detection of lesions in endoscopic images, they have mostly used private datasets [29, 40, 31] for the experimentation. It is difficult to compare new methods on these datasets and extend the benchmark. Therefore, there is a need for a benchmark on publicly available datasets to minimize the research gap towards building a clinically relevant model.

II-B Lightweight model

There are a few works in the literature that have proposed lightweight models for image segmentation. Ni et al. [30] presented a novel bilinear attention network-based approach with an adaptive receptive field for the segmentation of surgical instruments. Wang et al. [39] proposed a lightweight encoder-decoder network (LEDNet) that uses ResNet50 in the encoder block and an attention pyramidal network in the decoder block. Beheshti et al. [6] proposed Squeeze U-Net, whose architecture is inspired by UNet [33]. The proposed model obtained a 12× reduction in model size and showed efficient performance in multiply-accumulate (MAC) operations and memory use.

From the above-related work, we identify a need for a real-time polyp segmentation method. A real-time polyp segmentation method can be achieved by building a lightweight network architecture by designing an efficient network with blocks that require fewer parameters. A lower number of network parameters will reduce the network complexity, leading to real-time or faster inference. In this respect, we propose NanoNet, which uses a lightweight pre-trained network MobileNetV2 [35], and simple convolutional blocks such as residual block and squeeze and excite block.

Fig. 1: Overview of the proposed NanoNet architecture

III Network architecture

The architecture of NanoNet follows an encoder-decoder approach, as shown in Figure 1. The network uses a pre-trained model as an encoder, followed by three decoder blocks. Using pre-trained ImageNet models for transfer learning has become the best choice for many CNN architectures [23, 10]. It helps the model converge much faster and achieve higher performance than a model without pre-training. The proposed architecture uses a MobileNetV2 [35] model pre-trained on the ImageNet [11] dataset as the encoder. The decoder is built using a modified version of the residual block, which was initially introduced by He et al. [17]. The encoder captures the required contextual information from the input, whereas the decoder generates the final output from the contextual information extracted by the encoder.

III-A MobileNetV2

The MobileNetV2 [35] is an architecture that is primarily designed for mobile and embedded devices. The architecture performed well on a variety of different datasets while maintaining high accuracy, despite having fewer parameters. The architecture of MobileNetV2 is based on the architecture of MobileNetV1, which uses depth-wise separable convolutions as the main building block. A depth-wise separable convolution consists of depth-wise convolution followed by a point-wise convolution. The MobileNetV2 introduces two main ideas: inverted residual block and linear bottleneck block [35].

The inverted residual block is based on the bottleneck residual block described in [17], which consists of three standard convolutions: 1×1, 3×3, and 1×1. Every convolution layer is followed by a Rectified Linear Unit (ReLU) non-linearity. In the first standard convolution, the number of feature channels is reduced, and in the last standard convolution, the number of feature channels is expanded. After that, an element-wise addition with the identity mapping is performed. The inverted residual block also has three convolution layers: a standard convolution, a depth-wise convolution, and a standard convolution. Every convolution has a ReLU activation function. Here, the exact opposite of the bottleneck residual block is performed. The first standard convolution expands the number of feature channels, and the last standard convolution reduces the number of feature channels. Due to this opposite functionality, it is referred to as an inverted residual block. The linear bottleneck block is the same as the inverted residual block, except the last standard convolution has a linear activation before an element-wise addition is performed with the identity mapping.
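To make the parameter saving of depth-wise separable convolutions concrete, the following is a minimal sketch that counts weights for a standard convolution and for a depth-wise convolution followed by a point-wise convolution. The layer sizes (32 input channels, 64 output channels, a 3×3 kernel) are illustrative assumptions rather than values from the paper, and bias terms are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depth-wise conv followed by a 1 x 1 point-wise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 32, 64, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
print(standard, separable, round(standard / separable, 2))
```

For these assumed sizes the separable form needs roughly an eighth of the weights, which is the kind of saving that lets the encoder stay small.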

III-B Modified Residual Block

The original residual block uses two standard convolutions, where the first convolution is followed by a batch-normalization and a ReLU activation function. After that, the second convolution is followed only by a batch-normalization. An element-wise addition is performed between the output of the batch-normalization and the identity mapping, followed by another ReLU activation function. An identity mapping consists of a standard convolution and a batch-normalization over the original input.

We have modified the residual block for our network. The modified residual block starts with a convolution followed by a convolution. In both of these convolutions, we reduce the number of filters by , which are then followed by the batch normalization and the ReLU activation function. We have a convolution with batch normalization. Now, we perform an element-wise addition with the identity mapping. Finally, we apply a ReLU activation function followed by the squeeze and excitation block. The squeeze and excitation block improves the quality of feature maps by increasing their sensitivity towards essential features.
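The text does not spell out the internals of the squeeze and excitation block, so the sketch below follows the commonly used formulation (global average pooling, a reducing dense layer with ReLU, an expanding dense layer with sigmoid, then channel-wise scaling). It is written in plain Python for readability; the function name, the weight matrices `w1` and `w2`, and the toy input are hypothetical and are not taken from the NanoNet implementation.

```python
import math


def squeeze_excite(feature_maps, w1, w2):
    """Re-weight the channels of `feature_maps` (a list of 2-D lists, one per channel).

    w1: weights of the reducing dense layer, shape [reduced][channels]
    w2: weights of the expanding dense layer, shape [channels][reduced]
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]
    # Excite: dense layer + ReLU, then dense layer + sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_maps, gates)
    ]
    return scaled, gates


# Two 2x2 channels and hand-picked (hypothetical) weights with one hidden unit.
features = [[[1.0, 1.0], [1.0, 1.0]], [[4.0, 4.0], [4.0, 4.0]]]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]
scaled, gates = squeeze_excite(features, w1, w2)
print(gates)
```

Each gate lies between 0 and 1, so the block can only attenuate or preserve a channel; in a trained network the weights decide which channels are emphasized.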

| Dataset | No. of Images | Imaging Type | Availability |
|---|---|---|---|
| KvasirCapsule-SEG | 55 | Video capsule endoscopy | |
| Kvasir-SEG [21] | 1000 | Colonoscopy | |
| 2020 Medico automatic polyp segmentation challenge [19] | 160† | Colonoscopy | |
| Endotect Challenge Dataset [18] | 200† | Colonoscopy | |
| Kvasir-Instrument [22] | 590 | Colonoscopy | |

† test images

TABLE I: Publicly available endoscopic datasets used in our experiments

III-C The NanoNet architecture

Figure 1 shows the block diagram of the NanoNet architecture. The NanoNet architecture starts with a pre-trained MobileNetV2 as an encoder followed by a decoder. There is a modified residual block between the encoder and the decoder, which acts as a bridge that connects the two. In the first step, we feed the image data into the pre-trained encoder. The pre-trained encoder starts with a standard convolution with 32 feature channels, followed by the bottleneck layer with ReLU6 as the activation function. All the convolution operations use a standard kernel size. The entire encoder network progressively downsamples the feature maps using strided convolutions and gradually increases the number of feature channels.

The output from the pre-trained encoder passes through the modified residual block and is then fed to the decoder. Every step in the decoder uses bilinear upsampling to increase the spatial dimensions (height and width) of the input feature maps. After that, the result is concatenated with the appropriate feature maps from the pre-trained encoder using the skip connections. These skip connections pass information that may otherwise be lost between the layers and are used to improve the quality of the feature maps. The concatenated feature maps are passed through the modified residual block, which further increases the generalization capacity of the decoder. After the feature maps pass through all three decoder blocks, the output of the last decoder block is fed to a convolution with the number of classes as the number of feature channels. This is followed by the sigmoid activation if it is a binary segmentation task; otherwise, we use the softmax activation function.
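The decoder steps described above can be traced at the level of tensor shapes. The sketch below is only a bookkeeping aid: the bridge shape, the skip-connection shapes, the decoder filter counts, and the upsampling factor of 2 are all hypothetical values chosen for illustration, and the helper names are not from the NanoNet code.

```python
def decoder_step(shape, skip_shape, out_channels, scale=2):
    """Track (height, width, channels) through one decoder step.

    Returns the output shape and the channel count right after concatenation.
    """
    h, w, c = shape
    up_h, up_w = h * scale, w * scale           # bilinear upsampling changes only H and W
    skip_h, skip_w, skip_c = skip_shape
    if (up_h, up_w) != (skip_h, skip_w):
        raise ValueError("skip connection must match the upsampled spatial size")
    concat_channels = c + skip_c                # concatenation stacks the channels
    # The modified residual block then maps concat_channels -> out_channels.
    return (up_h, up_w, out_channels), concat_channels


def output_activation(num_classes):
    """Sigmoid for binary segmentation, softmax otherwise, as described in the text."""
    return "sigmoid" if num_classes == 1 else "softmax"


# Hypothetical shapes: bridge output, three encoder skips, three decoder widths.
shape = (16, 16, 192)
skips = [(32, 32, 144), (64, 64, 96), (128, 128, 16)]
filters = [128, 64, 32]
trace = []
for skip, out_channels in zip(skips, filters):
    shape, concat_channels = decoder_step(shape, skip, out_channels)
    trace.append((shape, concat_channels))
print(trace, output_activation(1))
```

The spatial check mirrors a practical constraint: a skip connection can only be concatenated when its height and width equal those of the upsampled feature map.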

We have demonstrated three different NanoNet architectures: NanoNet-A, NanoNet-B, and NanoNet-C. Each architecture consists of different feature channels in its decoder block. NanoNet-A consists of , and feature channels. In NanoNet-B, the number of feature channels is reduced to , , and . In NanoNet-C, these feature channels are further reduced to , , and . The reduction in the number of feature channels leads to less trainable parameters, which simplifies the model complexity leading to a light-weight network.

IV Experimental setup

In this section, we will describe the dataset, evaluation metrics, implementation details, and data augmentation techniques used.

Fig. 2: Polyps and corresponding masks from KvasirCapsule-SEG

IV-A Datasets

To address the polyp segmentation problem in video capsule endoscopy images, we have selected the polyp class from the labelled images folder of the Kvasir-Capsule dataset [36] and annotated it with the help of an expert gastroenterologist. Kvasir-Capsule is an open-access dataset that contains 13 classes of labelled anomalies and findings. It includes only 55 polyp frames out of the 44,228 medically verified video capsule frames. We have annotated the polyp class of Kvasir-Capsule and generated the corresponding ground truth masks. Examples of polyps and their corresponding masks from KvasirCapsule-SEG can be found in Figure 2. Furthermore, we also provide bounding box information to be used for video capsule endoscopy detection and localization tasks. The Kvasir-Capsule and KvasirCapsule-SEG datasets are both publicly available for download.

Table I shows detailed information about the open imaging datasets used in our experiments. Each of the datasets presented in Table I also has the corresponding ground truth. The link for each of the datasets is provided in the table. The standard setting for the “Medico automatic polyp segmentation challenge” and the “Endotect challenge” is that they use Kvasir-SEG for training. The challenge organizers have provided 160 unseen images in the “Medico automatic polyp segmentation challenge” and released 200 images in the “Endotect challenge” to test the participants’ approaches. For the Kvasir-Instrument dataset, we experimented with the official split provided by the organizers. A detailed explanation of these datasets and the baseline results can be found in [21, 22, 19, 18].

IV-B Evaluation metrics

For the evaluation of our model, we have chosen standard computer vision metrics such as Dice Coefficient (DSC), mean Intersection over Union (mIoU), Precision, Recall, Specificity, Accuracy, and Frame-per-second (FPS). More explanation of these metrics can be found in [21, 22, 19, 18].
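As a reference for how these overlap metrics relate to one another, here is a minimal sketch that derives them from the confusion counts of a single binary mask. It uses the standard definitions; the function name and the toy masks are illustrative assumptions, and the IoU shown is for the foreground class of one image, whereas the reported mIoU is an average.

```python
def segmentation_metrics(pred, truth):
    """Compute overlap metrics for two flat sequences of 0/1 pixel labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)

    def ratio(numerator, denominator):
        # Return 0.0 instead of dividing by zero for empty masks.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "precision": precision,
        "recall": recall,
        # F-beta with beta = 2 weights recall higher than precision.
        "f2": ratio(5 * precision * recall, 4 * precision + recall),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Toy 8-pixel example: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(pred, truth)
print(metrics)
```

Because accuracy and specificity count the (usually dominant) background pixels, they tend to stay high even when DSC and IoU drop, which is visible in the result tables.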

| Method | Parameters | DSC | mIoU | Recall | Precision | F2 | Accuracy | FPS |
|---|---|---|---|---|---|---|---|---|
| ResUNet (GRSL’18) [41] | 8,227,393 | 0.9532 | 0.9137 | 0.9785 | 0.9325 | 0.9677 | 0.9386 | 17.96 |
| ResUNet++ (ISM’19) [24] | 4,070,385 | 0.9499 | 0.9087 | 0.9762 | 0.9296 | 0.9648 | 0.9334 | 15.39 |
| NanoNet-A (Ours) | 235,425 | 0.9493 | 0.9059 | 0.9693 | 0.9325 | 0.9609 | 0.9351 | 28.35 |
| NanoNet-B (Ours) | 132,049 | 0.9474 | 0.9028 | 0.9682 | 0.9308 | 0.9593 | 0.9324 | 27.39 |
| NanoNet-C (Ours) | 36,561 | 0.9465 | 0.9021 | 0.9754 | 0.9238 | 0.9629 | 0.9297 | 29.48 |

TABLE II: Performance evaluation of the proposed networks and recent SOTA methods on KvasirCapsule-SEG

| Method | Parameters | DSC | mIoU | Recall | Precision | F2 | Accuracy | FPS |
|---|---|---|---|---|---|---|---|---|
| ResUNet (GRSL’18) [41] | 8,227,393 | 0.7203 | 0.6106 | 0.7602 | 0.7624 | 0.7327 | 0.9251 | 17.72 |
| ResUNet++ (ISM’19) [24] | 4,070,385 | 0.7310 | 0.6363 | 0.7925 | 0.7932 | 0.7478 | 0.9223 | 19.79 |
| NanoNet-A (Ours) | 235,425 | 0.8227 | 0.7282 | 0.8588 | 0.8367 | 0.8354 | 0.9456 | 26.13 |
| NanoNet-B (Ours) | 132,049 | 0.7860 | 0.6799 | 0.8392 | 0.8004 | 0.8067 | 0.9365 | 29.73 |
| NanoNet-C (Ours) | 36,561 | 0.7494 | 0.6360 | 0.8081 | 0.7738 | 0.7719 | 0.9290 | 32.17 |

TABLE III: Performance evaluation of the proposed networks and recent SOTA methods on Kvasir-SEG [21]

| Method | Parameters | DSC | mIoU | Recall | Precision | F2 | Accuracy | FPS |
|---|---|---|---|---|---|---|---|---|
| ResUNet (GRSL’18) [41] | 8,227,393 | 0.6846 | 0.5599 | 0.7235 | 0.7236 | 0.6961 | 0.9231 | 18.54 |
| ResUNet++ (ISM’19) [24] | 4,070,385 | 0.6925 | 0.5849 | 0.8249 | 0.6840 | 0.7434 | 0.8995 | 19.47 |
| NanoNet-A (Ours) | 235,425 | 0.7364 | 0.6319 | 0.8566 | 0.7310 | 0.7804 | 0.9166 | 28.07 |
| NanoNet-B (Ours) | 132,049 | 0.7378 | 0.6247 | 0.8283 | 0.7373 | 0.7685 | 0.9223 | 29.04 |
| NanoNet-C (Ours) | 36,561 | 0.7070 | 0.5866 | 0.8095 | 0.7089 | 0.7432 | 0.9148 | 32.66 |

TABLE IV: Performance evaluation of the proposed networks and recent SOTA methods on the Medico 2020 dataset [19]

| Method | Parameters | DSC | mIoU | Recall | Precision | F2 | Accuracy | FPS |
|---|---|---|---|---|---|---|---|---|
| ResUNet (GRSL’18) [41] | 8,227,393 | 0.6640 | 0.5408 | 0.7510 | 0.6841 | 0.6943 | 0.9075 | 26.55 |
| ResUNet++ (ISM’19) [24] | 4,070,385 | 0.6940 | 0.5838 | 0.8797 | 0.6591 | 0.7597 | 0.8841 | 18.58 |
| NanoNet-A (Ours) | 235,425 | 0.7508 | 0.6466 | 0.8238 | 0.7744 | 0.7773 | 0.9255 | 27.19 |
| NanoNet-B (Ours) | 132,049 | 0.7362 | 0.6238 | 0.8109 | 0.7532 | 0.7646 | 0.9252 | 29.91 |
| NanoNet-C (Ours) | 36,561 | 0.7001 | 0.5792 | 0.8000 | 0.7159 | 0.7380 | 0.9091 | 32.98 |

TABLE V: Performance evaluation of the proposed networks and recent SOTA methods on the Endotect 2020 dataset [18]

| Method | Parameters | DSC | mIoU | Recall | Precision | F2 | Accuracy | FPS |
|---|---|---|---|---|---|---|---|---|
| UNet (Baseline) [34] | - | 0.9158 | 0.8578 | 0.9487 | 0.8998 | 0.9320 | 0.9864 | 20.46 |
| DoubleUNet (Baseline) [23] | - | 0.9038 | 0.8430 | 0.9275 | 0.8966 | 0.9147 | 0.9838 | 10.00 |
| ResUNet++ (ISM’19) [24] | 4,070,385 | 0.9140 | 0.8635 | 0.9103 | 0.9348 | 0.9140 | 0.9866 | 17.87 |
| NanoNet-A (Ours) | 235,425 | 0.9251 | 0.8768 | 0.9142 | 0.9540 | 0.9251 | 0.9887 | 28.00 |
| NanoNet-B (Ours) | 132,049 | 0.9284 | 0.8790 | 0.9205 | 0.9482 | 0.9284 | 0.9875 | 29.82 |
| NanoNet-C (Ours) | 36,561 | 0.9139 | 0.8600 | 0.9037 | 0.9452 | 0.9139 | 0.9863 | 32.18 |

TABLE VI: Performance evaluation of the proposed networks and recent SOTA methods on Kvasir-Instrument [22]

IV-C Implementation details

We have implemented NanoNet using Keras with TensorFlow [1] as the backend. The experiments were run on the Experimental Infrastructure for Exploration of Exascale Computing (eX3), NVIDIA DGX-2 machine. The code implementation of NanoNet is publicly available. As the model has very few trainable parameters, we have set a batch size of 16. We have resized the dataset images to a fixed resolution for better utilization of the GPU, which also helps to reduce the training time. The model is trained for 200 epochs with the Nadam optimizer [12] and the dice coefficient as the loss function. We choose a low learning rate for the optimizer to update the parameters slowly and carefully. The learning rate is reduced by a factor of 0.1 when the validation loss does not decrease in consecutive epochs, which helps to improve model performance. Additionally, we have used an early stopping mechanism to prevent over-fitting.
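The two training ingredients named above, a dice-based loss and a learning rate that drops by a factor of 0.1 when the validation loss stalls, can be sketched in plain Python as follows. This is not the Keras code used for the experiments: the smoothing constant, the `patience` value, the initial learning rate, and the toy loss curve are assumptions made only for illustration.

```python
def dice_loss(pred, truth, smooth=1e-15):
    """1 - dice coefficient for flat sequences of probabilities and 0/1 labels.

    `smooth` (an assumed constant) avoids division by zero on empty masks.
    """
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)


def reduce_on_plateau(val_losses, initial_lr, factor=0.1, patience=2):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` once the validation loss has failed to
    improve for `patience` consecutive epochs (`patience` is hypothetical).
    """
    lr, best, wait, history = initial_lr, float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr, wait = lr * factor, 0
        history.append(lr)
    return history


loss_value = dice_loss([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
schedule = reduce_on_plateau([0.5, 0.4, 0.41, 0.42, 0.39, 0.40, 0.41], initial_lr=1e-3)
print(round(loss_value, 4), schedule)
```

In the toy curve the rate drops twice, each time after two epochs without improvement, which is the behaviour the paragraph describes in words.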

IV-D Data augmentation

We use data augmentation on the training set to increase diversity and to improve the generalization of our model. Data augmentation techniques such as random cropping, random rotation, horizontal flipping, vertical flipping, grid distortion, and many more are used. We have used an offline data augmentation technique. The validation and testing sets are not augmented and are directly resized to the same resolution.
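For segmentation, every geometric augmentation has to be applied identically to the image and to its mask. The sketch below shows this for horizontal and vertical flips only, using nested lists in place of image arrays; the function names and the 0.5 flip probability are assumptions for illustration and do not reproduce the augmentation pipeline used in the experiments.

```python
import random


def horizontal_flip(grid):
    """Mirror a 2-D grid left to right."""
    return [row[::-1] for row in grid]


def vertical_flip(grid):
    """Mirror a 2-D grid top to bottom."""
    return [row[:] for row in grid[::-1]]


def augment_pair(image, mask, rng):
    """Apply the same randomly chosen flips to an image and its mask."""
    if rng.random() < 0.5:
        image, mask = horizontal_flip(image), horizontal_flip(mask)
    if rng.random() < 0.5:
        image, mask = vertical_flip(image), vertical_flip(mask)
    return image, mask


# Toy 2x2 "image" and "mask" with identical layout so alignment is easy to check.
image = [[1, 2], [3, 4]]
mask = [[1, 2], [3, 4]]
aug_image, aug_mask = augment_pair(image, mask, random.Random(0))
print(aug_image, aug_mask)
```

Drawing one random decision per transform and reusing it for both inputs is what keeps the mask aligned with the image; sampling separately for each would corrupt the labels.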

Fig. 3: Qualitative results of NanoNet-A on five different datasets

V Results and Discussion

In this section, we provide the experimental results for the segmentation task of the endoscopic image dataset. For the evaluation, we have used performance metrics such as DSC and mIoU, and FPS as the main evaluation metrics. We also calculate recall, precision, F2, and overall accuracy to support a complete set of metrics. Table II, Table III, Table IV, Table V, and Table VI show the results of the NanoNet model experiments using different parameters. The results are compared with the recent SOTA computer vision methods.

The quantitative results in these tables show that NanoNet consistently outperforms or performs nearly on par with its competitors. The quantitative results also show that NanoNet can produce real-time segmentation (i.e., it reaches close to 30 FPS or more on each dataset present in the tables). This is one of the major contributions of the work. The other strength of the work lies in the parameter use. From Table II, we can observe that the best performing NanoNet (i.e., NanoNet-A) uses nearly 35 times fewer parameters than ResUNet [41]. Similarly, NanoNet-C uses 225 times fewer parameters than ResUNet and also produces better DSC, mIoU, and FPS on Kvasir-SEG.
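The parameter ratios quoted above can be checked directly against the parameter counts listed in the tables. The short sketch below only repeats that arithmetic; the dictionary name is an illustrative choice.

```python
# Parameter counts as listed in the result tables.
parameters = {
    "ResUNet": 8_227_393,
    "ResUNet++": 4_070_385,
    "NanoNet-A": 235_425,
    "NanoNet-B": 132_049,
    "NanoNet-C": 36_561,
}

# How many times smaller each NanoNet variant is than ResUNet.
ratios = {
    name: parameters["ResUNet"] / count
    for name, count in parameters.items()
    if name.startswith("NanoNet")
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x fewer parameters than ResUNet")
```

The computed ratios for NanoNet-A and NanoNet-C round to 35 and 225, matching the figures stated in the text.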

The qualitative results are displayed in Figure 3. The first, second, and third columns show the image, ground truth, and prediction masks, respectively. The name of each dataset is provided on the left side, and one example image is shown per dataset. The qualitative results on these diverse medical datasets show that NanoNet can produce accurate segmentation results for different types of lesions (polyps) and therapeutic tools. The example images and the predictions also show that NanoNet produces good segmentation masks for large, medium, and small polyps (see Figure 3). On closer inspection of the qualitative results, we conclude that NanoNet produces good results for small-sized polyps but over-segments large-sized lesions. For future work, one could create a specific dataset consisting of small and large-sized polyps to explore this further.

Both the evaluation metrics and the qualitative results show that NanoNet performs on par with or better than the compared methods while running faster. Thus, the proposed NanoNet architecture is simple, compact, and provides a robust solution for real-time applications, as it produces satisfactory performance despite having fewer parameters.

VI Conclusion

In this paper, we proposed a novel lightweight architecture for real-time video capsule endoscopy and colonoscopy image segmentation. The proposed NanoNet architecture utilizes a pre-trained MobileNetV2 model and a modified residual block. The depthwise separable convolution is the main building block of the network and allows the model to achieve high performance with very few trainable parameters. The experimental results on varied endoscopy datasets demonstrate the strength of our model compared to state-of-the-art models with respect to speed and performance. The presented model has the potential to enable easier roll-out of deep learning models in clinical systems due to its fewer parameters, competitive accuracy, and low latency. In addition, the model does not require any sort of initialization, post-processing, or temporal regularization, which we consider another strength of this work. In the future, we will design an encoder lighter than the currently used pre-trained MobileNetV2. Moreover, we aspire to utilize the currently built segmentation module in the clinic and study the efficacy of our designed model.


The research is partially funded by the PRIVATON project (263248) and the Autocap project (282315) from the Research Council of Norway (RCN). Our experiments were performed on the Experimental Infrastructure for Exploration of Exascale Computing (eX3) system, which is financially supported by RCN under contract 270053.


  • [1] M. Abadi et al. (2016) Tensorflow: a system for large-scale machine learning. In Proc. of USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 265–283. Cited by: §IV-C.
  • [2] S. Ali et al. (2021) Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy. Medical Image Analysis, pp. 102002. Cited by: §II-A.
  • [3] S. Ameling, S. Wirth, D. Paulus, G. Lacey, and F. Vilarino (2009) Texture-based polyp detection in colonoscopy. In Bildverarbeitung für die Medizin 2009, pp. 346–350. Cited by: §II-A.
  • [4] F. Arcadu et al. (2019) Deep learning algorithm predicts diabetic retinopathy progression in individual patients. NPJ digital medicine 2 (1), pp. 1–9. Cited by: §I.
  • [5] D. Ardila et al. (2019) End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature medicine 25 (6), pp. 954–961. Cited by: §I.
  • [6] N. Beheshti and L. Johnsson (2020) Squeeze u-net: a memory and energy efficient image segmentation network. In Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 364–365. Cited by: §II-B.
  • [7] J. Bernal et al. (2015) WM-dova maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics 43, pp. 99–111. Cited by: TABLE I.
  • [8] J. Bernal, J. Sánchez, and F. Vilarino (2012) Towards automatic polyp detection with a polyp appearance model. Pattern Recognition 45 (9), pp. 3166–3182. Cited by: §II-A.
  • [9] S. Bodenstedt et al. (2018) Comparative evaluation of instrument segmentation and tracking methods in minimally invasive surgery. arXiv preprint arXiv:1805.02475. Cited by: §I.
  • [10] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §I, §III.
  • [11] J. Deng et al. (2009) Imagenet: a large-scale hierarchical image database. In Proc. of IEEE conference on computer vision and pattern recognition (CVPR), pp. 248–255. Cited by: §III.
  • [12] T. Dozat (2016) Incorporating nesterov momentum into adam. In Proc. of International Conference on Learning Representations, Cited by: §IV-C.
  • [13] D. Fan et al. (2020) Pranet: parallel reverse attention network for polyp segmentation. In Proc. of International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 263–273. Cited by: §II-A, §II-A.
  • [14] I. J. Goodfellow et al. (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661. Cited by: §I.
  • [15] E. M. Green et al. (2019) Machine learning detection of obstructive hypertrophic cardiomyopathy using a wearable biosensor. NPJ digital medicine 2 (1), pp. 1–4. Cited by: §I.
  • [16] Y. Guo, J. Bernal, and B. J Matuszewski (2020) Polyp segmentation with fully convolutional deep neural networks—extended evaluation study. Journal of Imaging 6 (7), pp. 69. Cited by: §II-A.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 770–778. Cited by: §III-A, §III.
  • [18] S. A. Hicks et al. (2020) The endotect 2020 challenge: evaluation and comparison of classification, segmentation and inference time for endoscopy. In Proceedings of ICPR 2020 Workshops and Challenges, Cited by: item 3, TABLE I, §IV-A, §IV-B, TABLE V.
  • [19] D. Jha, S. A. Hicks, K. Emanuelsen, H. Johansen, D. Johansen, T. de Lange, M. A. Riegler, and P. Halvorsen (2020) Medico multimedia task at mediaeval 2020: automatic polyp segmentation. In CEUR Proceedings of MediaEval Workshop, Cited by: item 3, TABLE I, §IV-A, §IV-B, TABLE IV.
  • [20] D. Jha et al. A Comprehensive Study on Colorectal Polyp Segmentation with ResUNet++, Conditional Random Field and Test-Time Augmentation. IEEE Journal of Biomedical and Health Informatics. Cited by: §II-A.
  • [21] D. Jha et al. (2020) Kvasir-seg: a segmented polyp dataset. In Proc. of International Conference on Multimedia Modeling (MMM), pp. 451–462. Cited by: item 3, TABLE I, §IV-A, §IV-B, TABLE III.
  • [22] D. Jha et al. (2021) Kvasir-instrument: diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy. In Proc. of Multimedia Modeling (MMM), Cited by: item 3, §IV-A, §IV-B, TABLE VI.
  • [23] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, and H. D. Johansen (2020) DoubleU-Net: A Deep Convolutional Neural Network for Medical Image Segmentation. In Proc. of IEEE International Symposium on Computer-Based Medical Systems (CBMS), pp. 558–564. Cited by: §II-A, §III, TABLE VI.
  • [24] D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. De Lange, P. Halvorsen, and H. D. Johansen (2019) ResUNet++: An Advanced Architecture for Medical Image Segmentation. In Proc. of IEEE International Symposium on Multimedia (ISM), pp. 225–2255. Cited by: §II-A, TABLE II, TABLE III, TABLE IV, TABLE V, TABLE VI.
  • [25] X. Jia, X. Xing, Y. Yuan, L. Xing, and M. Q. Meng (2019) Wireless capsule endoscopy: a new tool for cancer screening in the colon with deep-learning-based polyp recognition. Proceedings of the IEEE 108 (1), pp. 178–197. Cited by: §II-A.
  • [26] S. A. Karkanis, D. K. Iakovidis, D. E. Maroulis, D. A. Karras, and M. Tzivras (2003) Computer-aided tumor detection in endoscopic video using color wavelet features. IEEE transactions on information technology in biomedicine 7 (3), pp. 141–152. Cited by: §II-A.
  • [27] Y. Kim et al. (2015) Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530. Cited by: §I.
  • [28] A. Kornbluth, P. Legnani, and B. S. Lewis (2004) Video capsule endoscopy in inflammatory bowel disease: past, present, and future. Inflammatory Bowel Diseases 10 (3), pp. 278–285. Cited by: §I.
  • [29] J. Y. Lee et al. (2020) Real-time detection of colon polyps during colonoscopy using deep learning: systematic validation with four independent datasets. Scientific reports 10 (1), pp. 1–9. Cited by: §II-A.
  • [30] Z. Ni et al. (2020) BARNet: bilinear attention network with adaptive receptive field for surgical instrument segmentation. arXiv preprint arXiv:2001.07093. Cited by: §II-B.
  • [31] C. C. Poon et al. (2020) AI-doscopist: a real-time deep-learning-based algorithm for localising polyps in colonoscopy videos with edge computing devices. NPJ Digital Medicine 3 (1), pp. 1–8. Cited by: §II-A.
  • [32] V. Prasath (2017) Polyp detection and segmentation from video capsule endoscopy: a review. Journal of Imaging 3 (1), pp. 1. Cited by: §II-A.
  • [33] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo (2017) Erfnet: efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 19 (1), pp. 263–272. Cited by: §II-B.
  • [34] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Proc. of International Conference on Medical image computing and computer-assisted intervention (MICCAI), pp. 234–241. Cited by: TABLE V, TABLE VI.
  • [35] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proc. of IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §II-B, §III-A, §III.
  • [36] P. H. Smedsrud et al. (2021) Kvasir-capsule, a video capsule endoscopy dataset. Springer Nature Scientific Data. Cited by: §IV-A.
  • [37] H. Sung et al. (2021) Global cancer statistics 2020: globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians. Cited by: §I.
  • [38] N. K. Tomar et al. (2021) FANet: a feedback attention network for improved biomedical image segmentation. arXiv preprint arXiv:2103.17235. Cited by: §II-A.
  • [39] Y. Wang et al. (2019) Lednet: a lightweight encoder-decoder network for real-time semantic segmentation. In Proc. of IEEE International Conference on Image Processing (ICIP), pp. 1860–1864. Cited by: §II-B.
  • [40] M. Yamada et al. (2019) Development of a real-time endoscopic image diagnosis support system using deep learning technology in colonoscopy. Scientific reports 9 (1), pp. 1–9. Cited by: §II-A.
  • [41] Z. Zhang, Q. Liu, and Y. Wang (2018) Road extraction by deep residual u-net. IEEE Geoscience and Remote Sensing Letters 15 (5), pp. 749–753. Cited by: TABLE II, TABLE III, TABLE IV, §V.