AFP-Net: Realtime Anchor-Free Polyp Detection in Colonoscopy

by   Dechun Wang, et al.
UMass Lowell

Colorectal cancer (CRC) is a common and lethal disease. Globally, CRC is the third most commonly diagnosed cancer in males and the second in females. The best screening test available for colorectal cancer is colonoscopy. During a colonoscopic procedure, a tiny camera at the tip of the endoscope generates a video of the internal mucosa of the colon. The video is displayed on a monitor so that the physician can examine the lining of the entire colon and check for colorectal polyps. Detection and removal of colorectal polyps are associated with a reduction in mortality from colorectal cancer. However, the miss rate of polyp detection during colonoscopy is often high, even for very experienced physicians, because polyps vary widely in shape, size, texture, color and illumination. Though challenging, automated polyp detection, building on the great advances in object detection techniques, shows great potential for reducing the false negative rate while maintaining high precision. In this paper, we propose a novel anchor-free polyp detector that can localize polyps without using predefined anchor boxes. To further strengthen the model, we leverage a Context Enhancement Module and Cosine Ground-truth Projection. Our approach can respond in real time while achieving state-of-the-art performance with 99.36 and 96.44




I Introduction

Colorectal Cancer (CRC) is the third most common cancer worldwide and the second most lethal cancer in the USA [33], with an estimated 1.8 million newly diagnosed cases and 881,000 deaths in 2018 [7]. Colorectal cancer often begins with the growth of tissues known as polyps. Most of these polyps are initially benign, but some of them become malignant over time. Early diagnosis via colonoscopy is widely believed to be the best way to prevent colorectal cancer. During a colonoscopy, specialized physicians carefully inspect the intestinal wall. However, the procedure can have a false negative rate of 25% [19]. Multiple independent factors contribute to the miss rate. Some are related to polyp appearance, e.g. a small and flat polyp is less likely to be perceived, while others are operational: when the camera moves too fast, physicians can fail to catch suspicious regions that require closer inspection. For these reasons, real-time automated polyp detection can act as a complementary tool to assist physicians in improving the sensitivity of the diagnosis.

Fig. 1: The network architecture of our proposed framework. The network includes a top-down up-sampling path to enrich semantic features at different scales, followed by a Context Enhancement Module (CEM) to increase the receptive field. The enhanced feature maps are then fed into Anchor-free detection heads to predict bounding boxes.

In the past, automated polyp detection mainly utilized hand-crafted color, shape and texture features to distinguish polyps from normal mucosae [3, 16]. Bernal et al. [4] used the appearance of the polyps as a key factor. They assumed that all polyps have a protruding surface and used valley detection as the feature extractor. However, polyps and normal mucosae can share similar edge and texture features when the polyps are as flat as normal mucosae. In addition, some mucosae with valley-rich structures can appear similar to polyps. All these facts result in poor performance for these approaches in real applications.

Recently, Convolutional Neural Network (CNN) based object detectors such as SSD [25] and Faster R-CNN [30] dominate the object detection task across different image modalities. Compared with traditional approaches where features are fully designed based on human knowledge, these CNN based detectors can learn rich features automatically in the deep backbone networks. This is the key contributing factor for deep learning to thrive [41, 11, 1, 8, 21, 36].

For clarity, we divide CNN based detectors into the backbone part (e.g. ResNet [14], VGG [34]) and the head part, which is more specific to individual detectors. For the detection head, the “anchor” concept is shared by both single-stage [25, 29, 23] and two-stage [13, 30] detectors. Anchors are box templates of different scales and aspect ratios, enumerated with respect to the dataset at hand. In other words, anchors should adapt to the characteristics of the dataset. However, designing anchor sets and assigning objects to specific anchors require extensive experience. It is also indicated in [10] that the choice of anchors is especially important for small objects. Moreover, when objects are assigned to anchors, IoU is usually the major criterion, and different IoU thresholds can result in significant performance variations.

Motivated by these observations, the “Anchor-Free” approaches [18, 9, 40, 44, 42] have received much attention recently, where the anchor mechanism is removed and objects are represented as keypoints. For instance, CornerNet [18] represents an object as a pair of keypoints (top-left and bottom-right corners). To better detect these corners, a special corner pooling is devised to enrich the context information. Different from CornerNet, Zhou et al. [44] represent an object as a single keypoint located at its center. In this way, the time-consuming pairing is removed and the model achieves the best trade-off between performance and inference speed.

In polyp detection, we observe that the objects of interest (polyps) rarely overlap with each other and that their shapes do not vary much either. Therefore, we believe the idea of “objects as keypoints” fits well with our application scenario. Moreover, removing the anchor mechanism reduces the number of parameters in the detection heads and results in fewer highly overlapped proposals during inference, which can potentially speed up inference.

In this paper, we propose a novel anchor-free detector for fast polyp detection. Similar to [44], we formulate objects as center points, but we remove the time-consuming center pooling in CenterNet [9] to achieve real-time response. The role of center pooling is taken over by a context enhancement module together with a feature pyramid design. In addition, we devise a cosine ground-truth projection strategy to compensate for the potential drop in recall caused by removing the anchor mechanism. Our proposed polyp detector outperforms previous studies and achieves state-of-the-art performance in terms of both accuracy and inference speed.

II Related Work

II-A Polyp Detection

With the success of deep learning in natural image processing, CNN based polyp detectors have been proposed in the last few years. Compared to hand-crafted features, CNN based detectors automate the process of extracting abstract and discriminative features. They are more robust and require less domain knowledge, making them particularly suitable for this task. As our approach is also CNN based, we discuss only CNN based detectors in detail in this paper.

Zhang et al. [43] proposed a two-step pipeline for polyp detection in endoscopic videos. In the first step, they use a pre-trained ResYOLO to detect suspicious polyps. Polyps are assumed to be stable, without a sudden move from one location to another between two consecutive frames. Therefore, in the second step, a Discriminative Correlation Filter based tracking approach leverages the temporal information. This tracking based method refines the detection results given by ResYOLO and is capable of locating polyps missed by the detector in consecutive frames.

Mohammed et al. [27] proposed Y-Net for this task. It consists of two fully convolutional encoders followed by a fully convolutional decoder. The motivation is that a model pre-trained on natural images may not generalize well to medical images. To mitigate performance degradation due to the domain shift (natural images to medical images), they slowly fine-tune the first encoder from the pre-trained network while aggressively training the second encoder from scratch. The two encoders share the same network architecture, and their outputs are combined with a sum-skip-concatenation connection before entering the decoder network.

However, seemingly in contradiction with [27], Mo et al. [26] showed that fine-tuning a pre-trained Faster R-CNN can work considerably well. In addition, Shin et al. [32] proposed a post-learning scheme to enhance the Faster R-CNN detector. This scheme automatically collects hard negative samples and retrains the network with selected polyp-like false positives, which functions similarly to boosting.

Our proposed method is the first to apply the anchor-free approach to automated polyp detection. We believe that an anchor-free design is a very practical solution in our case in terms of speed and accuracy. Unlike natural images, where pre-defined anchors are introduced to address the occlusion issue, in medical images such as Computed Tomography and colonoscopy images, occlusion is rare. Another concern for the polyp detection task is the real-time requirement. With anchors, a large number of overlapping proposals would be generated, putting significant pressure on the post-processing step (Non-Maximum Suppression). Therefore, the anchor-free mechanism fits this polyp detection task better.

II-B Anchor Free Detectors

While almost all the state of the art object detectors employ pre-defined anchors, anchor-free object detectors [18, 9, 40, 39, 17] have received much attention in recent years because of their better adaptability towards different datasets. Representative approaches include CornerNet [18] and CenterNet [9].

In CornerNet [18], objects are represented as pairs of keypoints: top-left corners and bottom-right corners. The network is trained to predict, in parallel, one heatmap for all top-left corners and one for all bottom-right corners. To associate them in pairs, an embedding vector is also learned for each corner, pushing apart corners that belong to different objects while pulling together corners in the same group. To further strengthen the “corner learning” process, the authors also devise a special Corner Pooling. This pooling consists of two separate one-directional poolings (horizontal and vertical) whose results are combined at the end. In this way, more context information is gathered to identify corners. However, because of the pairing process, the inference speed drops significantly.

Zhou et al. [44] use the center keypoint to represent an object. In this way, the burden of learning to group corners is removed and the model achieves the best performance-speed trade-off. Our work follows this idea, but we add more constraints when assigning labels to keypoints in the context of the feature pyramid. These constraints prove to be critical in our experiments. Moreover, as indicated in CornerNet, enriching context information plays a key role in the success of anchor-free detectors. We also explore this direction by adding a Context Enhancement Module.

III Methodology

In this section, we will describe our proposed anchor-free approach in detail including the network design and the associated technical components.

III-A Network Architecture

Figure 1 illustrates our framework design. It is a fully convolutional network that classifies and localizes objects on each enhanced feature map. Our network uses VGG16 [34] as the backbone. Our framework selects six feature maps from the backbone, conv4_3, conv5_3, conv6_2, conv7_2, conv8_2 and conv9_2, which are responsible for detecting objects at different scales. To enrich the semantic information of each feature map, we build a feature pyramid similar to FPN [22]. Note that we use deconvolution (transposed convolution) for up-sampling instead of interpolation. Upstream and downstream features of the same scale are combined by element-wise addition. For each feature map $F_i$ that has an upper feature map $F_{i+1}$, we first use a 1x1 convolution to smooth $F_{i+1}$, which is then up-sampled to the size of $F_i$. Finally, we add the up-sampled $F_{i+1}$ to $F_i$. The enhanced feature map $\hat{F}_i$ can be mathematically represented as follows:

$$\hat{F}_i = F_i + \mathrm{Deconv}\big(\mathrm{Conv}_{1\times 1}(F_{i+1})\big)$$

To further increase context information for small objects, we feed each feature map to a context module before forwarding it to the anchor-free detection heads. These detection heads are single-stage and have a structure similar to the heads in SSD, where two parallel subnets are dedicated to classification and localization respectively.

Fig. 2: Illustration of the context enhancement module. The input feature channels are first divided into three chunks, each fed into dilated convolution layers of a different depth before they are concatenated. In this way, the effective receptive field is enlarged while detailed information is still retained.

III-B Context Enhancement Module (CEM)

In order to increase context information for small objects, we apply a context enhancement module (illustrated in Fig. 2) similar to those in [28, 20]. Since our anchor-free detection heads are fully convolutional, increasing context information is equivalent to enlarging the receptive field of our detection heads. However, instead of using 5x5 or 7x7 filters to enlarge the receptive field, we adopt dilated convolutions following [20, 37]. One good property of dilated convolutions is that they need fewer parameters to achieve the same receptive field size. Note that the input channels are equally split into three branches, with each branch having a different depth. The outputs of all branches are concatenated at the end. We find in experiments that this context enhancement module considerably increases precision.
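The parameter argument for dilated convolutions can be checked with a little arithmetic. The sketch below is illustrative only: the function names are ours, and the dilation rate of 2 and the branch depths of one to three layers are assumptions chosen for the example, not values taken from the paper.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Side length of the window covered by one dilated convolution."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(layers) -> int:
    """Receptive field of stride-1 convolutions applied in sequence.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += effective_kernel_size(kernel_size, dilation) - 1
    return field


def weights_per_channel_pair(kernel_size: int) -> int:
    """Weights for one input/output channel pair; dilation adds none."""
    return kernel_size * kernel_size


if __name__ == "__main__":
    # A 3x3 kernel with dilation 2 covers a 5x5 window using 9 weights,
    # while a dense 5x5 kernel needs 25 weights for the same window.
    print(effective_kernel_size(3, 2),
          weights_per_channel_pair(3),
          weights_per_channel_pair(5))
    # Assumed branch depths: one, two and three dilated 3x3 layers.
    for depth in (1, 2, 3):
        print(depth, stacked_receptive_field([(3, 2)] * depth))
```

Branches of different depth therefore see windows of different sizes, which is one way to read the "different depths" design of the module.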

(a) Standard anchor based ground truth matching.
(b) Our anchor free ground truth matching.
Fig. 3: Comparison between anchor-based detectors (a) and our anchor free detector (b) in ground truth target assigning.
(a) conv4_3
(b) conv5_3
(c) conv6_2
(d) conv7_2
(e) conv8_2
(f) conv9_2
Fig. 4: Ground truth projection on different feature maps (a - f represent different scales). The blue dots are the projected points from the feature map. The green rectangles and red rectangles represent the positive and non-negative regions respectively. In this figure, the ground truth was originally assigned to (c) conv6_2. We project its positive and non-negative regions to the other feature map levels with reduced sizes. The amount of reduction is based on the distance to (c) conv6_2.

III-C Anchor-free Label Assignment

We adopt a very different label assignment procedure in our anchor-free design compared with the anchor-based framework (Fig. 3). In anchor-based detectors (Fig. 3(a)), the green dashed rectangle is a positive anchor whose IoU with the ground truth exceeds the positive threshold, and the red dashed rectangle is a negative anchor whose IoU falls below the negative threshold. The white dashed anchor is ignored because its IoU lies in between. Unlike anchor-based approaches, which rely on manually set overlap thresholds, our anchor-free design labels center points solely based on the location and the size of the ground truth boxes. The whole process is demonstrated in Fig. 3(b). The blue dotted grid represents center points. Center points falling outside of the red rectangle are assigned as negative, those inside the green rectangle as positive, and those in the gap between the two are ignored.

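More formally, consider an object $B = (c_x, c_y, w, h)$, where $(c_x, c_y)$ is the center position of the bounding box and $w, h$ are the width and height respectively. Suppose this object is assigned to an arbitrary feature map of size $n \times n$ with stride $s$ during training (we assume the image is square for simplicity). This feature map provides $n^2$ center points $p_{ij}$, one per feature map location $(i, j)$ with $0 \le i, j < n$. We define a positive region $R^{pos}$ and a non-negative region $R^{nn}$, both centered at $(c_x, c_y)$. We mark a center point $p_{ij}$ as positive if $p_{ij} \in R^{pos}$, negative if $p_{ij} \notin R^{nn}$, and ignored if $p_{ij} \in R^{nn} \setminus R^{pos}$. The sizes of the positive and non-negative regions are controlled by scale factors $\epsilon_p$ and $\epsilon_n$ in proportion to the ground truth, i.e. $R^{pos} = (c_x, c_y, \epsilon_p w, \epsilon_p h)$ and $R^{nn} = (c_x, c_y, \epsilon_n w, \epsilon_n h)$. Both scale factors are fixed in this paper.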
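The labeling rule can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the scale factors 0.3 and 0.7 are placeholder values, and placing each center point in the middle of its feature map cell is our assumption.

```python
def scaled_region(cx, cy, w, h, scale):
    """Axis-aligned box sharing the object's center, scaled by `scale`."""
    half_w, half_h = 0.5 * w * scale, 0.5 * h * scale
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def inside(px, py, region):
    """True if the point (px, py) lies within the region (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = region
    return x1 <= px <= x2 and y1 <= py <= y2


def assign_labels(box, image_size, stride, pos_scale=0.3, nonneg_scale=0.7):
    """Label every center point of one square feature map for one object.

    `box` is (cx, cy, w, h) in image pixels. Returns a dict mapping
    (row, col) to 1 (positive), 0 (negative) or -1 (ignored).
    The default scale factors are placeholders, not the paper's values.
    """
    cx, cy, w, h = box
    pos_region = scaled_region(cx, cy, w, h, pos_scale)
    nonneg_region = scaled_region(cx, cy, w, h, nonneg_scale)
    n = image_size // stride
    labels = {}
    for row in range(n):
        for col in range(n):
            # Assumed convention: the center point is the middle of its cell.
            px, py = (col + 0.5) * stride, (row + 0.5) * stride
            if inside(px, py, pos_region):
                labels[(row, col)] = 1
            elif inside(px, py, nonneg_region):
                labels[(row, col)] = -1
            else:
                labels[(row, col)] = 0
    return labels


if __name__ == "__main__":
    labels = assign_labels((160, 160, 160, 160), image_size=320, stride=32)
    for value, name in ((1, "positive"), (-1, "ignored"), (0, "negative")):
        print(name, sum(1 for v in labels.values() if v == value))
```

No overlap threshold appears anywhere in the rule; only the box geometry and the two scale factors decide the label.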

III-D Cosine Ground-truth Projection

It has been shown that distributing objects of different sizes to different scales can greatly improve the detector performance. In this way, objects will be detected at the best granularity with respect to their size. On the other hand, numerically, bounding box regression would also benefit from this technique as the loss could be better constrained. Almost all current state-of-the-art object detectors [25, 30, 29, 18, 22] employ this strategy. We also adopt this multi-scale strategy in our design.

However, simply assigning each ground truth to a single scale may not fully utilize the capacity of a network, resulting in slower convergence and lower recall. Yet, if we use the same positive and non-negative regions across all feature maps, we cause a scale mismatch problem: low-level feature maps learn to regress large objects and small objects at the same time, which contradicts our initial idea of dedicating feature maps of the right scale to different objects.

To solve this issue, we define a Cosine Ground-truth Projection to maximize recall and speed up the training process. Figure 4 illustrates the projection of the positive and non-negative regions across different feature maps when one feature map (conv6_2 in the figure) is the best feature scale for the ground truth. The projected sizes of the positive and non-negative regions in the other feature maps are penalized with a cosine function, based on how far away they are from the best feature map. Suppose we have $N$ feature maps, and let $d$ be the distance between the current feature map and the best feature map. The penalizing factor on the current feature map is a cosine function of $d$, with a hyper-parameter that decides to what extent the projection is penalized; this hyper-parameter is fixed in this paper. The new positive and non-negative regions on each feature map are obtained by shrinking the original regions with this penalizing factor.

By using this cosine function, we penalize the positive and non-negative regions less on neighboring feature maps than on distant feature maps, as shown in Figure 4. As a result, we receive more responses from neighboring feature maps, which further improves the recall of our model.
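To make the idea concrete, the sketch below shows one possible cosine-shaped shrink factor. It is an assumption made for illustration only: the exact formula, the meaning of `alpha` and the default values here are hypothetical and are not taken from the paper. The only properties it is meant to share with the description above are that the best feature map is not penalized and that neighboring maps are penalized less than distant ones.

```python
import math


def cosine_penalty(distance, num_levels, alpha=1.0):
    """Hypothetical shrink factor in [0, 1] for a feature map that is
    `distance` levels away from the best one. `alpha` (assumed) controls
    how quickly the factor decays."""
    if num_levels < 2:
        return 1.0
    angle = alpha * (math.pi / 2.0) * distance / (num_levels - 1)
    # Clamp the angle so the factor never becomes negative.
    return max(0.0, math.cos(min(angle, math.pi / 2.0)))


def projected_scales(best_level, num_levels, pos_scale, nonneg_scale, alpha=1.0):
    """Scale factors of the positive and non-negative regions on every level."""
    scales = []
    for level in range(num_levels):
        factor = cosine_penalty(abs(level - best_level), num_levels, alpha)
        scales.append((pos_scale * factor, nonneg_scale * factor))
    return scales


if __name__ == "__main__":
    # Six feature maps; the ground truth is best matched to level 2
    # (conv6_2 in Fig. 4). The base scales 0.3 and 0.7 are placeholders.
    for level, (pos, nonneg) in enumerate(projected_scales(2, 6, 0.3, 0.7)):
        print(level, round(pos, 3), round(nonneg, 3))
```

Because the cosine is flat near zero, the first neighbor loses only a small fraction of the region size, while the loss per level grows for maps further away.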

III-E Box Regression

Unlike anchor-based methods, which predict offsets relative to reference anchor boxes, our bounding box regression outputs the offset of the center point and the box size, both encoded by the stride of the feature map. In particular, given an object with its category label, the center position of the ground truth is computed from the box coordinates, together with the width and height of the box. The offset vector for a center point at a given pixel location is then defined relative to this ground-truth center and size, encoded by the stride of the corresponding feature map.


To remove overlapped bounding boxes, we apply IoU-based Non Maximum Suppression (NMS) with a threshold of 0.1.
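For reference, greedy IoU-based NMS with the 0.1 threshold can be sketched as follows. This is a generic textbook version written for illustration, not the authors' code; boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.1):
    """Greedy NMS: keep the highest-scoring box, drop every remaining box
    whose IoU with an already kept box exceeds the threshold.
    Returns the indices of the kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

A threshold as low as 0.1 is aggressive: it suits a setting where objects rarely overlap, since almost any overlap between two detections is treated as a duplicate.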

Similar to Faster R-CNN [30], our localization loss for bounding box regression is the Smooth L1 loss [12]:

$$L_{loc} = \frac{1}{N_{pos}} \sum_{i \in pos} \mathrm{smooth}_{L1}(p_i - g_i)$$

where $p$, $g$ and $N_{pos}$ denote the predicted bounding boxes, the ground truth bounding boxes and the number of positive labels respectively.
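As a reminder of how the Smooth L1 loss behaves, here is a small sketch using the standard definition from Fast R-CNN (quadratic below an absolute error of 1, linear above). Averaging over the number of positive center points follows the description above; representing boxes as plain tuples of regression values is our simplification.

```python
def smooth_l1(x):
    """Standard Smooth L1: 0.5 * x^2 if |x| < 1, otherwise |x| - 0.5."""
    ax = abs(x)
    if ax < 1.0:
        return 0.5 * ax * ax
    return ax - 0.5


def localization_loss(predicted, target, num_positive):
    """Sum of Smooth L1 over all regression values of the positive
    samples, divided by the number of positive samples."""
    total = 0.0
    for p_box, g_box in zip(predicted, target):
        for p, g in zip(p_box, g_box):
            total += smooth_l1(p - g)
    return total / max(num_positive, 1)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0, 0.0)]
    target = [(0.5, 2.0, 0.0, 0.0)]
    print(localization_loss(predicted, target, num_positive=1))
```

The linear tail keeps large regression errors from dominating the gradient, which is the usual reason for preferring it over a plain L2 loss.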

III-F Classification Loss

In natural images, objects are often dense (in terms of the average number of objects per image) and of regular sizes. In colonoscopy, by contrast, polyps are often very sparse and small, so we face an extremely small positive/negative sample ratio. Thus, we feel that Focal Loss fits our case better than the Online Hard Example Mining (OHEM) mechanism. However, we only apply Focal Loss [24] to negative samples while adopting cross entropy with a penalty term for positive samples. We assume that center points closer to the ground truth centroid have a more precise view of an object and thus contribute more to the loss.

In particular, the penalty weight of a positive center point is determined by its Euclidean distance to the ground truth centroid and the size of the object. We use an unnormalized 2-D Gaussian to generate the penalty weight: for an object and a center point within its positive region, the weight is the value of a Gaussian centered at the ground truth centroid, with a spread tied to the size of the object and controlled by a fixed parameter.

In sum, the classification loss applies the Gaussian-weighted cross entropy to positive center points and Focal Loss to negative center points, given the predicted scores and the ground truth labels. Finally, we combine the regression loss and the classification loss to formulate our multitask loss.
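The structure of this loss can be sketched as below. Everything specific here is an assumption for illustration: the Gaussian spread `sigma`, the focusing parameter `gamma = 2`, the omission of a class-balancing weight, and the normalization by the number of positives are placeholder choices, not the paper's settings.

```python
import math


def gaussian_weight(px, py, cx, cy, w, h, sigma=0.5):
    """Unnormalized 2-D Gaussian centered at the ground truth centroid.
    The spread is tied to the object size through the assumed `sigma`."""
    dx = (px - cx) / (sigma * w)
    dy = (py - cy) / (sigma * h)
    return math.exp(-0.5 * (dx * dx + dy * dy))


def positive_loss(prob, weight, eps=1e-7):
    """Cross entropy for a positive center point, scaled by its weight."""
    return -weight * math.log(max(prob, eps))


def negative_focal_loss(prob, gamma=2.0, eps=1e-7):
    """Focal Loss term for a negative center point: confident (low `prob`)
    negatives are down-weighted by prob ** gamma."""
    return -(prob ** gamma) * math.log(max(1.0 - prob, eps))


def classification_loss(positives, negatives):
    """`positives` is a list of (prob, weight) pairs, `negatives` a list of
    predicted probabilities. Normalized by the number of positives (assumed)."""
    total = sum(positive_loss(p, w) for p, w in positives)
    total += sum(negative_focal_loss(p) for p in negatives)
    return total / max(len(positives), 1)


if __name__ == "__main__":
    w_center = gaussian_weight(160, 160, 160, 160, 80, 80)
    w_off = gaussian_weight(176, 160, 160, 160, 80, 80)
    print(w_center, round(w_off, 4))
    print(classification_loss([(0.8, w_center), (0.6, w_off)], [0.1, 0.05, 0.9]))
```

With this weighting, a positive point at the centroid contributes its full cross entropy, while positive points near the border of the positive region contribute less.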

IV Experiments

We conduct our experiments mainly on the GIANA [5], CVC-CLINIC [4] and ETIS-LARIB [31] datasets. We use GIANA as the training set, and CVC-CLINIC and ETIS-LARIB as the testing sets.

IV-A Data Preparation and Augmentation

GIANA [5] is a database from the MICCAI 2017 endoscopic sub-challenge. The dataset covers four tasks: polyp detection, polyp segmentation, small bowel lesion detection and small bowel lesion localization. For the polyp detection task, it contains 18 short videos for training and 20 short videos for testing. For the polyp segmentation task, it provides 300 images for training and 612 for testing, plus more than 150 additional high definition images. However, the 612 testing images for the polyp segmentation task are identical to CVC-CLINIC.

To build our training set, we combine the GIANA datasets designed for the polyp detection and polyp segmentation tasks. Image frames are extracted from the 18 short videos. Segmentation annotations are converted to bounding boxes. For data augmentation, we apply random rotation, zoom crop/expand, horizontal/vertical flip and distortions using Augmentor [6]. In the end, we obtain a training set of 25K images.
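Converting a pixel-level annotation into a bounding box amounts to taking the extent of the foreground pixels. The sketch below shows one straightforward way to do this for a binary mask stored as a list of rows; it assumes a single polyp per mask (several polyps in one mask would need a connected-component step first) and is not the authors' conversion script.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels in a
    binary mask given as a list of rows, or None if the mask is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    width = len(mask[0])
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    return (cols[0], rows[0], cols[-1], rows[-1])


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    print(mask_to_bbox(mask))
```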


CVC-CLINIC [4] contains 612 still frames from 29 endoscopic videos. Each image comes with pixel-level ground truth manually labeled by the Computer Vision Center (CVC), Barcelona, Spain. All 612 images are used as our testing set.

ETIS-LARIB [31] contains 192 high resolution images with a resolution of 1225 × 996. Annotations are also provided at the pixel level. We convert these pixel-level annotations to bounding boxes for our detection task. We also evaluate our model on this dataset.

Experiment | Precision | Recall | F1-score | F2-score
--- | --- | --- | --- | ---
1 | 97.27 | 93.65 | 95.43 | 94.35
2 | 97.31 | 95.51 | 96.33 | 95.84
3 | 95.20 | 92.11 | 93.63 | 92.71
4 | 98.56 | 95.36 | 96.93 | 95.98
5 | 98.42 | 96.28 | 97.34 | 96.70
6 | 99.36 | 96.44 | 97.88 | 97.01

TABLE I: Ablation study of each technical component (FPN, CEM, OHEM, Focal Loss, Cosine Projection, Gaussian Penalty). All the models use VGG16 as the backbone.

Backbone | Precision | Recall | F1-score | F2-score
--- | --- | --- | --- | ---
ResNet-50 | 99.04 | 95.51 | 97.24 | 96.20
ResNet-101 | 98.40 | 95.36 | 96.86 | 95.95
VGG16 | 99.36 | 96.44 | 97.88 | 97.01

TABLE II: Effectiveness of different backbones. For all these experiments, all technical components are added: FPN, CEM, Cosine projection and Gaussian penalty.

IV-B Training

Our backbone network is initialized with VGG16 pre-trained on ImageNet. All additional convolution layers added on top of the backbone network are initialized by the Xavier method. We train our model on an NVIDIA V100 with a batch size of 64. We use SGD with 0.9 momentum, 0.0005 weight decay and an initial learning rate of 0.001. During the training process, we apply cosine annealing for learning rate decay with an interval of 20 epochs. Following SSD, we also apply online data augmentation (photometric distortions and random sampling) during training.
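A cosine annealing schedule with the stated initial learning rate can be sketched as follows. Reading the "interval of 20 epochs" as the period after which the schedule restarts, and using a minimum learning rate of zero, are our assumptions; the paper does not spell out either detail.

```python
import math


def cosine_annealing_lr(epoch, base_lr=0.001, min_lr=0.0, period=20):
    """Learning rate at `epoch` for cosine annealing that restarts every
    `period` epochs (restart behavior and `min_lr` are assumptions)."""
    t = epoch % period
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t / period))


if __name__ == "__main__":
    for epoch in (0, 5, 10, 15, 19, 20):
        print(epoch, round(cosine_annealing_lr(epoch), 6))
```

The rate starts at the initial value, falls to half of it at the middle of the interval, and approaches the minimum just before the next interval begins.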

IV-C Evaluation metrics

We follow the protocol of the MICCAI2015 [5] challenge by using precision, recall, F1 and F2 scores as the major evaluation metrics. We define the following terms to calculate the performance:

True Positive (TP): When the centroid of the predicted bounding box falls in the polyp ground truth, it will count as a True Positive detection.

False Positive (FP): When the centroid of the predicted bounding box falls outside of polyp ground truth, it will be viewed as False Positive.

False Negative (FN): When a polyp ground truth is not detected by our detector, it will count as a False Negative.

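Note that, if multiple predicted bounding boxes fall within one polyp ground truth, only one TP will be counted. In sum, precision ($P$), recall ($R$), F1 and F2 scores are formulated as:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 P R}{P + R}, \qquad F2 = \frac{5 P R}{4 P + R}$$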
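The counting rule and the four scores can be sketched as follows. The score formulas are the standard definitions of precision, recall and F-beta; using rectangular ground truth regions instead of pixel masks, and crediting a prediction to the first ground truth that contains its centroid, are simplifications made for this example.

```python
def count_detections(pred_boxes, gt_boxes):
    """Return (tp, fp, fn) with the centroid rule: a prediction is a hit if
    its centroid falls inside a ground truth region; several hits on the
    same ground truth count as a single TP. Boxes are (x1, y1, x2, y2)."""
    hit = [False] * len(gt_boxes)
    fp = 0
    for x1, y1, x2, y2 in pred_boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        matched = None
        for i, (gx1, gy1, gx2, gy2) in enumerate(gt_boxes):
            if gx1 <= cx <= gx2 and gy1 <= cy <= gy2:
                matched = i
                break
        if matched is None:
            fp += 1
        else:
            hit[matched] = True
    tp = sum(hit)
    return tp, fp, len(gt_boxes) - tp


def detection_scores(tp, fp, fn):
    """Return (precision, recall, F1, F2) as fractions in [0, 1]."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    def f_beta(beta):
        b2 = beta * beta
        denom = b2 * precision + recall
        return (1.0 + b2) * precision * recall / denom if denom else 0.0

    return precision, recall, f_beta(1.0), f_beta(2.0)


if __name__ == "__main__":
    # TP, FP and FN reported for AFP-Net on ETIS-LARIB in Table III.
    print([round(100 * s, 2) for s in detection_scores(168, 21, 40)])
```

Plugging in the TP, FP and FN counts reported in Table III reproduces the precision, recall, F1 and F2 values listed in the same rows, which is a quick consistency check on the table.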

Fig. 5: Precision Recall Curve for different networks.

IV-D Ablation Study

We also conduct a full ablation study to isolate the effect of each technical component. All these models use VGG16 as the backbone and are evaluated on the CVC-Clinic dataset. Results are summarized in Table I.

FPN improves recall significantly. In Experiment 1 (Table I), the downstream pathway is removed as well as the CEM, so the network is similar to SSD. Comparing Experiment 1&2, we find that recall improves significantly, from 93.65 to 95.51. We attribute this to the enlarged effective receptive field from downstream features that carry more context information. This observation is further supported by the experiment on the Context Enhancement Module.

Context Enhancement Module plays a critical role in improving the model. Comparing Experiment 2&6 in Table I, the overall performance drops significantly when CEM is removed: about 2.1% for precision and 0.9% for recall. This is consistent with the effect of Corner Pooling in CornerNet [18] and, in turn, confirms the importance of enriching context in anchor-free detectors.

Focal Loss works better than OHEM. As we can observe from Table I (Experiment 3&6), Focal Loss brings an increase of about 4.2% in precision and 4.3% in recall over OHEM. Moreover, in practice, training is much faster and more stable with Focal Loss than with OHEM.

Method | Testing Dataset | TP | FP | FN | Precision | Recall | F1-score | F2-score | Inference time
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Y-Net [27] | ASU-MAYO* | 3582 | 513 | 662 | 87.4 | 84.4 | 85.9 | 85.0 | N/A
RYCO [43] | ASU-MAYO* | 3087 | 398 | 1226 | 88.6 | 71.6 | 79.2 | 74.4 | N/A
ASU [38] | CVC-Clinic-test | N/A | N/A | N/A | 97 | 85.2 | 90.8 | 87.4 | N/A
CVC-CLINIC [4] | CVC-Clinic-test | N/A | N/A | N/A | 83.5 | 83.1 | 83.3 | 83.2 | N/A
OUS [5] | ETIS-LARIB | 131 | 57 | 77 | 69.7 | 63.0 | 66.1 | 64.2 | N/A
RCNN-Mask [35] | ETIS-LARIB | 167 | 62 | 41 | 72.93 | 80.29 | 76.43 | 78.70 | N/A
FRCNNPL [32] | ETIS-LARIB | 167 | 26 | 41 | 86.5 | 80.3 | 83.3 | 81.5 | 5.0 FPS†
AFP-Net (ours) | ETIS-LARIB | 168 | 21 | 40 | 88.89 | 80.77 | 84.63 | 82.27 | 52.6 FPS
Mask-RCNN [2] | CVC-Clinic-train | N/A | N/A | N/A | 83.49 | 92.95 | 87.96 | 90.89 | N/A
Faster-RCNN [26] | CVC-Clinic-train | 523 | 81 | 8 | 86.6 | 98.5 | 92.2 | 95.9 | N/A
CenterNet-104 | CVC-Clinic-train | 603 | 30 | 43 | 95.27 | 93.35 | 94.30 | 93.73 | 7.8 FPS
SSD-baseline** | CVC-Clinic-train | 618 | 8 | 28 | 98.72 | 95.67 | 97.17 | 96.26 | 37.0 FPS
AFP-Net (ours) | CVC-Clinic-train | 623 | 4 | 23 | 99.36 | 96.44 | 97.88 | 97.01 | 52.6 FPS

TABLE III: Results of the proposed method compared to others. * dataset no longer available. ** baseline SSD (FPN + CEM + SSD Anchors) with input size of 320. † runs with an NVIDIA GTX TITAN X.
Fig. 6: Examples of False Negative (a & b) and False Positive (c & d). The first row shows the raw input images while the second row presents the corresponding detection results. The green rectangles are our predicted bounding boxes. The red rectangles are the ground truth boxes.

Cosine projection results in a performance boost. In Experiment 4, we remove the cosine projection, so that each ground truth is assigned to only one feature scale. Comparing Experiment 4&6 in Table I, we observe that recall increases by a considerable margin, from 95.36 to 96.44.

Gaussian penalty improves precision. To investigate the effectiveness of the Gaussian penalty, we train one network without it (Experiment 5). Comparing Experiment 5&6, the Gaussian penalty yields increases of 0.94% in precision and 0.16% in recall.

We also explored the effect of different backbones in our model (Table II), finding that VGG seems to work better than ResNets. In addition, to give a better view of the performance, we show the precision-recall curve in Fig. 5.

IV-E Overall Performance

We compared our proposed method with other models reported in the MICCAI2015 challenge on polyp detection [5] and with recent work [26, 27, 32]. We run our tests on the CVC-Clinic training dataset and the ETIS-LARIB dataset, with an input size of 320. We set the IoU threshold to 0.1 for non-maximum suppression. All results are summarized in Table III.

We refer to our model as AFP-Net (Experiment 6 in Table I). Our proposed method outperforms all previous approaches in terms of F1 and F2 scores on both testing datasets. We also compared our approach with another anchor-free design (CenterNet), showing the superior performance of our design. Note that our anchor-free detector even outperforms its anchor-based baseline model SSD-baseline (FPN + CEM + SSD anchors), where only the anchor part is different.

We test all models on an NVIDIA RTX-2080TI with an input size of 320 to investigate the inference speed (FPS, frames per second) of our models. Among all anchor-based models, the “SSD-baseline” (which differs from AFP-Net only in the anchor design) represents the fairest and most direct reference. As shown in Table III, our model cuts the per-frame inference time by around 30% relative to this baseline (52.6 FPS vs. 37.0 FPS). This speed advantage also holds when comparing with other anchor-based designs such as FRCNNPL [32]. However, we must note that in FRCNNPL, the inference time is evaluated on an NVIDIA GTX TITAN X, which is slightly slower than our GPU. Nevertheless, we are still confident in claiming that our model runs faster, given the large gap in inference speed (52.6 FPS vs. 5 FPS). On the other hand, to compare with other anchor-free designs, we retrained a CenterNet. As shown in Table III, the speed advantage of our model still holds.
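The speed comparison against the SSD-baseline can be read in two ways, and the short calculation below shows both using the frame rates from Table III. The helper names are ours.

```python
def ms_per_frame(fps):
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / fps


def relative_time_saving(fps_fast, fps_slow):
    """Fraction of per-frame time saved by the faster model."""
    return 1.0 - ms_per_frame(fps_fast) / ms_per_frame(fps_slow)


if __name__ == "__main__":
    afp_net_fps, ssd_baseline_fps = 52.6, 37.0  # values from Table III
    print(round(ms_per_frame(afp_net_fps), 2), "ms vs",
          round(ms_per_frame(ssd_baseline_fps), 2), "ms")
    print("time saved:", round(100 * relative_time_saving(afp_net_fps, ssd_baseline_fps), 1), "%")
    print("FPS ratio:", round(afp_net_fps / ssd_baseline_fps, 2))
```

Measured as time per frame the saving is close to 30%, while measured as throughput the same numbers correspond to roughly 1.4 times as many frames per second.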

IV-F Visualization

We visualize some of the hard cases in Fig. 6 where our model makes mistakes. Our predicted boxes are marked as green rectangles while the ground truth for each image is marked as red rectangles. The missed polyps in Fig. 6(a) and 6(b) are very challenging cases because they share similar texture and shape with normal mucosae. This is partially due to the view angle during image capturing. When the light hits the top of a polyp directly, the polyp may blend into the background and become difficult to detect. Fig. 6(c) and 6(d) show some examples of false positive detections. Normal mucosae with fold structures are very hard to distinguish from polyps, especially early-stage polyps, as they can share the same texture, as shown in Fig. 6(d).

V Conclusion and Future work

In this paper, we proposed a novel anchor-free polyp detector. It is faster than the anchor-based design while achieving state-of-the-art performance. In addition to the improved performance, we also remove the hassle of manually fine-tuning anchor related hyper-parameters.

Enriching context information is critical for anchor-free detectors, and both the Feature Pyramid and the Context Enhancement Module contribute to it. On the loss side, Focal Loss, the distance-based Gaussian Penalty and our proposed cosine ground truth projection all play important roles in improving performance. With all these technical components, the potential recall drop caused by removing anchors is well compensated, and we achieve state-of-the-art performance.
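To illustrate why Focal Loss helps a dense, anchor-free detector, the sketch below implements the standard binary focal loss of Lin et al. for a single predicted probability. This is the generic formulation with its commonly used defaults (alpha = 0.25, gamma = 2), shown as an assumption for illustration; the exact variant and hyper-parameters used in AFP-Net, including how the Gaussian penalty and cosine ground truth projection enter the loss, are not reproduced here.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0,
               eps: float = 1e-7) -> float:
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class (polyp).
    y: ground-truth label, 1 for positive and 0 for negative.
    """
    p = min(max(p, eps), 1.0 - eps)           # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of well-classified (easy) examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    easy = focal_loss(0.95, 1)   # confident, correct prediction
    hard = focal_loss(0.10, 1)   # confident, wrong prediction
    print(f"easy example loss: {easy:.6f}")
    print(f"hard example loss: {hard:.6f}")
```

Since most locations on a feature map are background, down-weighting easy negatives in this way keeps them from dominating training, which matters when no anchors are available to pre-filter candidate locations.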

We believe our cosine ground truth projection and Gaussian penalty will provide a vital building block for future anchor-free designs. In future work, we intend to further improve the classification strategies for small flat polyps, which are very hard to distinguish from normal mucosa.


  • [1] M. F. Alcantara, Y. Cao, C. Liu, B. Liu, M. Brunette, N. Zhang, T. Sun, P. Zhang, Q. Chen, Y. Li, C. M. Albarracin, J. Peinado, E. S. Garavito, L. L. Garcia, and W. H. Curioso (2017) Improving tuberculosis diagnostics using deep learning and mobile health technologies among resource-poor communities in Perú. Smart Health 1-2 (Supplement C), pp. 66 – 76. Note: Connected Health: Applications, Systems and Engineering Technologies (CHASE 2016) External Links: ISSN 2352-6483, Document, Link Cited by: §I.
  • [2] H. Ali Qadir, Y. Shin, J. Solhusvik, J. Bergsland, L. Aabakken, and I. Balasingham (2019-05) Polyp detection and segmentation using mask r-cnn: does a deeper feature extractor cnn always perform better?. pp. 1–6. External Links: Document Cited by: TABLE III.
  • [3] S. Ameling, S. Wirth, D. Paulus, G. Lacey, and F. Vilariño (2009-01) Texture-based polyp detection in colonoscopy. pp. 346–350. External Links: Document Cited by: §I.
  • [4] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño (2015) WM-dova maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians. 43, pp. 99 – 111. External Links: ISSN 0895-6111, Document, Link Cited by: §I, §IV-A, TABLE III, §IV.
  • [5] J. Bernal, N. Tajbakhsh, F. Javier Sanchez, B. J. Matuszewski, H. Chen, L. Yu, Q. Angermann, O. Romain, B. Rustad, I. Balasingham, K. Pogorelov, S. Choi, Q. Debard, L. Maier-Hein, S. Speidel, D. Stoyanov, P. Brandao, H. Córdova, C. Sánchez-Montes, and A. Histace (2017-02) Comparative validation of polyp detection methods in video colonoscopy: results from the miccai 2015 endoscopic vision challenge. PP, pp. 1–1. External Links: Document Cited by: §IV-A, §IV-C, §IV-E, TABLE III, §IV.
  • [6] M. D. Bloice, P. M. Roth, and A. Holzinger (2019-04) Biomedical image augmentation using Augmentor. External Links: ISSN 1367-4803, Document, Link, Cited by: §IV-A.
  • [7] F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, and A. Jemal (2018) Global cancer statistics 2018: globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians 68 (6), pp. 394–424. External Links: Document, Link, Cited by: §I.
  • [8] Y. Cao, C. Liu, B. Liu, M. J. Brunette, N. Zhang, T. Sun, P. Zhang, J. Peinado, E. S. Garavito, L. L. Garcia, and W. H. Curioso (2016-06) Improving tuberculosis diagnostics using deep learning and mobile health technologies among resource-poor and marginalized communities. In 2016 IEEE First International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), Vol. , pp. 274–281. External Links: Document, ISSN Cited by: §I.
  • [9] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) CenterNet: keypoint triplets for object detection. abs/1904.08189. External Links: Link, 1904.08189 Cited by: §I, §I, §II-B.
  • [10] C. Eggert, S. Brehm, A. Winschel, D. Zecha, and R. Lienhart (2017) A closer look: small object detection in faster r-cnn. In Multimedia and Expo (ICME), 2017 IEEE International Conference on, pp. 421–426. Cited by: §I.
  • [11] Y. Gao, N. Zhang, H. Wang, X. Ding, X. Ye, G. Chen, and Y. Cao (2016-06) IHear food: eating detection using commodity bluetooth headsets. In 2016 IEEE First International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), Vol. , pp. 163–172. External Links: Document, ISSN Cited by: §I.
  • [12] R. B. Girshick (2015) Fast R-CNN. abs/1504.08083. External Links: Link, 1504.08083 Cited by: §III-E.
  • [13] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2980–2988. Cited by: §I.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I.
  • [15] B. Y. Hsueh, W. Li, and I. Wu (2018) Stochastic gradient descent with hyperbolic-tangent decay. abs/1806.01593. External Links: Link, 1806.01593 Cited by: §IV-B.
  • [16] S. Hwang, J. Oh, W. Tavanapong, J. Wong, and P. C. de Groen (2007-Sep.) Polyp detection in colonoscopy video using elliptical shape feature. In 2007 IEEE International Conference on Image Processing, Vol. 2, pp. II – 465–II – 468. External Links: Document, ISSN 1522-4880 Cited by: §I.
  • [17] T. Kong, F. Sun, H. Liu, Y. Jiang, and J. Shi (2019) FoveaBox: beyond anchor-based object detector. Cited by: §II-B.
  • [18] H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. CoRR abs/1808.01244. External Links: Link, 1808.01244 Cited by: §I, §II-B, §II-B, §III-D, §IV-D.
  • [19] A. Leufkens, M. V. Oijen, F. Vleggaar, and P. Siersema (2012) Factors influencing the miss rate of polyps in a back-to-back colonoscopy study. Endoscopy 44 (05), pp. 470–475. External Links: Document Cited by: §I.
  • [20] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang (2018) DSFD: dual shot face detector. abs/1810.10220. External Links: Link, 1810.10220 Cited by: §III-B.
  • [21] P. Li, Y. Luo, N. Zhang, and Y. Cao (2015-08) HeteroSpark: a heterogeneous cpu/gpu spark platform for machine learning algorithms. In 2015 IEEE International Conference on Networking, Architecture and Storage (NAS), Vol. , pp. 347–348. External Links: Document, ISSN Cited by: §I.
  • [22] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2016) Feature pyramid networks for object detection. abs/1612.03144. External Links: Link, 1612.03144 Cited by: §III-A, §III-D.
  • [23] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Vol. 1, pp. 4. Cited by: §I.
  • [24] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. arXiv preprint arXiv:1708.02002. Cited by: §III-F.
  • [25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2015) SSD: single shot multibox detector. CoRR abs/1512.02325. External Links: Link, 1512.02325 Cited by: §I, §I, §III-D.
  • [26] X. Mo, K. Tao, Q. Wang, and G. Wang (2018-09) An Efficient Approach for Polyps Detection in Endoscopic Videos Based on Faster R-CNN. pp. arXiv:1809.01263. External Links: 1809.01263 Cited by: §II-A, §IV-E, TABLE III.
  • [27] A. K. Mohammed, S. Yildirim, I. Farup, M. Pedersen, and Ø. Hovde (2018) Y-net: A deep convolutional neural network for polyp detection. CoRR abs/1806.01907. External Links: Link, 1806.01907 Cited by: §II-A, §II-A, §IV-E, TABLE III.
  • [28] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis (2017) SSH: single stage headless face detector. abs/1708.03979. External Links: Link, 1708.03979 Cited by: §III-B.
  • [29] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi (2015) You only look once: unified, real-time object detection. CoRR abs/1506.02640. External Links: Link, 1506.02640 Cited by: §I, §III-D.
  • [30] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497. External Links: Link, 1506.01497 Cited by: §I, §I, §III-D, §III-E.
  • [31] J. S. Silva, A. Histace, O. Romain, X. Dray, and B. Granado (2013-09) Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer. 9, pp. . External Links: Document Cited by: §IV-A, §IV.
  • [32] Y. Shin, H. Ali Qadir, L. Aabakken, J. Bergsland, and I. Balasingham (2018-07) Automatic colon polyp detection using region based deep cnn and post learning approaches. PP, pp. 1–1. External Links: Document Cited by: §II-A, §IV-E, §IV-E, TABLE III.
  • [33] R. L. Siegel, K. D. Miller, S. A. Fedewa, D. J. Ahnen, R. G. S. Meester, A. Barzi, and A. Jemal (2017) Colorectal cancer statistics, 2017. CA: A Cancer Journal for Clinicians 67 (3), pp. 177–193. External Links: Document, Link, Cited by: §I.
  • [34] K. Simonyan and A. Zisserman (2014-09) Very deep convolutional networks for large-scale image recognition. pp. . Cited by: §I, §III-A.
  • [35] S. Sornapudi, F. Meng, and S. Yi (2019-06) Region-based automated localization of colonoscopy and wireless capsule endoscopy polyps. 9, pp. . External Links: Document Cited by: TABLE III.
  • [36] X. Sun, N. Zhang, Q. Chen, Y. Cao, and B. Liu (2019) People re-identification by multi-branch cnn with multi-scale features. In 2019 26th IEEE International Conference on Image Processing (ICIP), Cited by: §I.
  • [37] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015-06) Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 1–9. External Links: Document, ISSN 1063-6919 Cited by: §III-B.
  • [38] N. Tajbakhsh, S. Gurudu, and J. Liang (2015-10) Automated polyp detection in colonoscopy videos using shape and context information. 35, pp. . External Links: Document Cited by: TABLE III.
  • [39] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. Cited by: §II-B.
  • [40] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin (2019) Region proposal by guided anchoring. Cited by: §I, §II-B.
  • [41] N. Zhang, Y. Cao, B. Liu, and Y. Luo (2017) Improved multimodal representation learning with skip connections. In Proceedings of the 2017 ACM on Multimedia Conference, MM ’17, New York, NY, USA, pp. 654–662. External Links: ISBN 978-1-4503-4906-2, Link, Document Cited by: §I.
  • [42] N. Zhang, D. Wang, X. Sun, P. Zhang, C. Zhang, Y. Cao, and B. Liu (2019) 3D anchor-free lesion detector on computed tomography scans. External Links: 1908.11324 Cited by: §I.
  • [43] R. Zhang, Y. Zheng, C. C.Y. Poon, D. Shen, and J. Y.W. Lau (2018) Polyp detection during colonoscopy using a regression-based convolutional neural network with a tracker. Pattern Recognition 83, pp. 209 – 219. External Links: ISSN 0031-3203, Document, Link Cited by: §II-A, TABLE III.
  • [44] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. abs/1904.07850. External Links: Link, 1904.07850 Cited by: §I, §I, §II-B.