Colorectal cancer (CRC) is the third most common cancer worldwide and the second most lethal cancer in the USA, with an estimated 1.8 million newly diagnosed cases and 881,000 deaths in 2018. Colorectal cancer often begins with the growth of tissues known as polyps. Most of these polyps are initially benign, but some of them become malignant over time. Early diagnosis via colonoscopy is widely believed to be the best way to prevent colorectal cancer. During a colonoscopy, specialized physicians carefully inspect the intestinal wall. However, the procedure can have a false negative rate of 25%. Multiple independent factors contribute to this miss rate. Some are related to polyp appearance, e.g., a small, flat polyp is less likely to be perceived; others are operational: the camera may move too fast, causing physicians to miss suspicious regions that require closer inspection. For these reasons, real-time automated polyp detection can act as a complementary tool to help physicians improve the sensitivity of the diagnosis.
In the past, automated polyp detection mainly utilized hand-crafted color, shape and texture features to distinguish polyps from normal mucosae [3, 16]. Bernal et al. used the appearance of the polyps as a key factor. They assumed all polyps have protruding surfaces and used valley detection as the feature extractor. However, polyps and normal mucosae can share similar edge and texture features when polyps are as flat as the surrounding mucosa. In addition, some mucosae with valley-rich structures can appear similar to polyps. These facts result in poor performance for such approaches in real applications.
Recently, Convolutional Neural Network (CNN) based object detectors such as SSD and Faster R-CNN have come to dominate object detection across different image modalities. Compared with traditional approaches, where features are fully designed based on human knowledge, these CNN-based detectors learn rich features automatically in their deep backbone networks. This is a key contributing factor in the success of deep learning [41, 11, 1, 8, 21, 36].
To be more specific, we divide CNN-based detectors into a backbone part (e.g., ResNet, VGG) and a head part that is specific to each detector. In the detection head, the “anchor” concept is shared by both single-stage [25, 29, 23] and two-stage [13, 30] detectors. Anchors are box templates at different scales and aspect ratios, enumerated with respect to the dataset at hand; in other words, anchors should adapt to the characteristics of the dataset. However, designing anchor sets and assigning objects to specific anchors requires extensive experience. It has also been noted that the choice of anchors is especially important for small objects. Moreover, when assigning objects to anchors, Intersection over Union (IoU) is usually the major criterion, and different IoU thresholds can result in significant performance variations.
Motivated by these observations, “anchor-free” approaches [18, 9, 40, 44, 42] have received much attention recently; they remove the anchor mechanism and represent objects as keypoints. For instance, CornerNet represents an object as a pair of keypoints (top-left and bottom-right corners). To better detect these corners, a special corner pooling is devised to enrich the context information. Different from CornerNet, Zhou et al. represent an object as a single keypoint located at the center. In this way, the time-consuming pairing is removed and the model achieves the best trade-off between performance and inference speed.
In polyp detection, we observe that the objects of concern (polyps) rarely overlap with each other, and their shapes do not vary much. We therefore believe the idea of “objects as keypoints” fits our application scenario well. Moreover, removing the anchor mechanism reduces the number of parameters in the detection heads and results in fewer highly overlapped proposals during inference, which can accelerate inference.
In this paper, we propose a novel anchor-free detector for fast polyp detection. Similar to CenterNet, we formulate objects as center points, yet remove the time-consuming center pooling to achieve real-time response. The role of the center pooling is replaced by a context enhancement module and a feature pyramid design. In addition, we devise a special cosine ground-truth projection strategy to compensate for the potential drop in recall caused by removing the anchor mechanism. Our proposed polyp detector outperforms previous studies and achieves state-of-the-art performance in terms of both accuracy and inference speed.
II Related Work
II-A Polyp Detection
With the success of deep learning in natural image processing, CNN-based polyp detectors have been proposed in the last few years. Compared with hand-crafted features, CNN-based detectors automate the extraction of abstract and discriminative features. They are more robust and require less domain knowledge, making them particularly suitable for this task. As our approach is also CNN-based, we discuss only CNN-based detectors in detail in this paper.
Zhang et al. proposed a two-step pipeline for polyp detection in endoscopic videos. In the first step, they use a pre-trained ResYOLO to detect suspicious polyps. Polyps are assumed to be stable, without sudden moves from one location to another between two consecutive frames. Therefore, in the second step, a Discriminative Correlation Filter based tracking approach is used to leverage the temporal information. This tracking-based method refines the detection results given by ResYOLO and can locate polyps missed by the detector in consecutive frames.
Mohammed et al. proposed Y-Net for this task. It consists of two fully convolutional encoders followed by a fully convolutional decoder. The motivation is that a model pre-trained on natural images may not generalize well to medical images. To mitigate the performance degradation caused by this domain shift (natural to medical images), they fine-tune the first, pre-trained encoder slowly while training the second encoder from scratch aggressively. The two encoders share the same architecture, and their outputs are combined through a sum-skip-concatenation connection before entering the decoder network.
However, seemingly contradicting this, Mo et al. showed that fine-tuning a pre-trained Faster R-CNN can work considerably well. In addition, Shin et al. proposed a post-learning scheme to enhance the Faster R-CNN detector. This scheme automatically collects hard negative samples and retrains the network with selected polyp-like false positives, functioning similarly to boosting.
Our proposed method is the first to apply an anchor-free approach to automated polyp detection. We believe the anchor-free design is a very practical solution in our case in terms of speed and accuracy. Unlike natural images, where pre-defined anchors are introduced to tackle occlusion, occlusion is rare in medical images such as Computed Tomography and colonoscopy images. Another concern in this polyp detection task is the real-time requirement. With anchors, a large number of overlapped proposals are generated, putting significant pressure on the post-processing step (Non-Maximum Suppression). Therefore, the anchor-free mechanism fits this polyp detection task better.
II-B Anchor-Free Detectors
While almost all state-of-the-art object detectors employ pre-defined anchors, anchor-free object detectors [18, 9, 40, 39, 17] have received much attention in recent years because of their better adaptability to different datasets. Representative approaches include CornerNet and CenterNet.
In CornerNet, objects are represented as pairs of keypoints: top-left and bottom-right corners. The network is trained to predict heatmaps for all top-left corners and for all bottom-right corners in parallel. To associate corners in pairs, an embedding vector is also learned for each corner: corners belonging to different objects are pushed apart while corners in the same group are pulled together. To further strengthen the corner-learning process, the authors also devise a special Corner Pooling, consisting of two separate one-directional pooling operations (horizontal and vertical) that are combined at the end. In this way, more context information is gathered to identify corners. However, because of the pairing process, the inference speed drops significantly.
Zhou et al. use a single center keypoint to represent an object. This removes the burden of learning to group corners, and the model achieves the best performance-speed trade-off. Our work follows this idea, but we add more constraints when assigning labels to keypoints in the context of the feature pyramid; these constraints prove critical in our experiments. Moreover, as indicated in CornerNet, enriching context information plays a key role in the success of anchor-free detectors. We also explore this direction by adding a Context Enhancement Module.
In this section, we will describe our proposed anchor-free approach in detail including the network design and the associated technical components.
III-A Network Architecture
Fig. 1 illustrates our framework design. It is a fully convolutional network that classifies and localizes objects on each enhanced feature map. Our network uses VGG16 as the backbone and selects six feature maps from it: conv4_3, conv5_3, conv6_2, conv7_2, conv8_2 and conv9_2, responsible for detecting objects at different scales. To increase the semantic information of each feature map, we build a feature pyramid similar to FPN. Note that we use deconvolution (transposed convolution) for up-sampling instead of interpolation. Upstream and downstream features of the same scale are combined by element-wise addition: for each feature map C_i, we first use a 1×1 convolution to smooth the upper feature map P_{i+1}, which is then up-sampled to the size of C_i and added to it. The enhanced feature map can be written as P_i = C_i + Deconv(Conv_{1×1}(P_{i+1})).
To further increase context information for small objects, we feed each feature map to a context module before forwarding it to the anchor-free detection heads. These detection heads are single-stage and structurally similar to the heads in SSD, with two parallel subnets dedicated to classification and localization, respectively.
III-B Context Enhancement Module (CEM)
In order to increase context information for small objects, we apply a context enhancement module (illustrated in Fig. 2) similar to [28, 20]. Since our anchor-free detection heads are fully convolutional, increasing context information is equivalent to enlarging their receptive field. However, instead of using 5×5 or 7×7 filters to enlarge the receptive field, we adopt dilated convolutions following [20, 37]. A useful property of dilated convolutions is that they require fewer parameters to achieve the same receptive field size. Note that the input channels are split equally into three branches of different depths, whose outputs are concatenated at the end. In our experiments, this context enhancement module considerably increases precision.
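The parameter saving claimed for dilated convolutions can be checked directly. The sketch below compares receptive-field size and weight count for a plain k×k convolution versus a dilated 3×3 one; it is a simplified illustration and does not reproduce the actual CEM branch depths or channel splits.

```python
def receptive_field(kernel, dilation=1):
    """Spatial extent covered by one conv layer: dilation * (kernel - 1) + 1."""
    return dilation * (kernel - 1) + 1

def conv_params(kernel, in_ch, out_ch):
    """Weight count of a conv layer (bias ignored)."""
    return kernel * kernel * in_ch * out_ch

# A 3x3 conv with dilation 2 covers the same 5x5 field as a plain 5x5 conv,
# with 9/25 of the weights; dilation 3 matches a 7x7 conv with 9/49.
assert receptive_field(3, dilation=2) == receptive_field(5) == 5
assert receptive_field(3, dilation=3) == receptive_field(7) == 7
print(conv_params(3, 128, 128), conv_params(5, 128, 128))  # 147456 409600
```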
III-C Anchor-Free Label Assignment
We adopt a label-assignment procedure very different from that of anchor-based frameworks (Fig. 3). In anchor-based detectors (Fig. 3(a)), the green dashed rectangle is a positive anchor with IoU above the positive threshold, the red dashed rectangle is a negative anchor with IoU below the negative threshold, and the white dashed anchor is ignored because its IoU lies in between. Unlike anchor-based approaches, which rely on manually chosen overlap thresholds, our anchor-free design labels center points solely based on the location and size of the ground-truth boxes. The whole process is shown in Fig. 3(b). The blue dotted grid represents center points. Center points falling outside the red rectangle, inside the green rectangle, or in the gap between the two are assigned as negative, positive and ignored, respectively.
More formally, consider an object (c_x, c_y, w, h), where (c_x, c_y) is the center position of the bounding box and w, h are its width and height. Suppose this object is assigned to a feature map of size W_f × W_f and stride s during training (we assume the image is a square for simplicity). We then have W_f × W_f center points from the set C = {(i·s + s/2, j·s + s/2)}, where i, j ∈ {0, 1, …, W_f − 1}. We define a positive region R_p = (c_x, c_y, σ_p·w, σ_p·h) and a non-negative region R_n = (c_x, c_y, σ_n·w, σ_n·h). We mark a center point as positive if it falls inside R_p, negative if it falls outside R_n, and ignored otherwise. The sizes of the positive and non-negative regions are controlled by the scale factors σ_p and σ_n in proportion to the ground truth, with σ_p < σ_n; both factors are fixed in all our experiments.
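The assignment rule can be sketched as follows. The values of `sigma_p` and `sigma_n` below are illustrative placeholders, not the paper's actual settings.

```python
def assign_labels(gt, stride, fmap_size, sigma_p=0.3, sigma_n=0.5):
    """Label every center point on one feature map.

    gt: ground-truth box (cx, cy, w, h) in image coordinates.
    Returns a dict mapping (i, j) -> 1 (positive), 0 (negative), -1 (ignored).
    """
    cx, cy, w, h = gt
    labels = {}
    for j in range(fmap_size):
        for i in range(fmap_size):
            # center point of cell (i, j) in image coordinates
            x = i * stride + stride / 2.0
            y = j * stride + stride / 2.0
            dx, dy = abs(x - cx), abs(y - cy)
            if dx <= sigma_p * w / 2 and dy <= sigma_p * h / 2:
                labels[(i, j)] = 1      # inside the positive region
            elif dx <= sigma_n * w / 2 and dy <= sigma_n * h / 2:
                labels[(i, j)] = -1     # in the gap between regions: ignored
            else:
                labels[(i, j)] = 0      # outside the non-negative region
    return labels

# A 16x16 box centered at (16, 16) on an 8x8 feature map with stride 4:
labels = assign_labels((16, 16, 16, 16), stride=4, fmap_size=8)
```

No IoU threshold appears anywhere: a point's label depends only on where it sits relative to the scaled ground-truth box.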
III-D Cosine Ground-Truth Projection
It has been shown that distributing objects of different sizes to different scales can greatly improve the detector performance. In this way, objects will be detected at the best granularity with respect to their size. On the other hand, numerically, bounding box regression would also benefit from this technique as the loss could be better constrained. Almost all current state-of-the-art object detectors [25, 30, 29, 18, 22] employ this strategy. We also adopt this multi-scale strategy in our design.
However, simply assigning each ground truth to a single scale may not fully utilize the capacity of the network, resulting in slower convergence and lower recall. Yet if we use the same positive and non-negative regions across all feature maps, we cause a scale-mismatch problem: low-level feature maps learn to regress large and small objects at the same time, contradicting our initial idea of dedicating feature maps of the right scale to different objects.
To solve this issue, we define a cosine ground-truth projection to maximize recall and speed up the training process. Figure 4 illustrates the projection of the positive and non-negative regions across different feature maps when f_t is the best feature scale for the ground truth. The projected sizes of the positive and non-negative regions on neighboring feature maps are penalized with a cosine function, based on how far they are from the best feature map f_t. Suppose we have N feature maps, and let d be the distance between the current feature map and the best feature map f_t. The penalizing factor η_d on feature map f_{t+d} is computed as a cosine of the normalized distance d, where a hyper-parameter decides to what extent the projection is penalized; this parameter is fixed in all our experiments. The new positive and non-negative regions are then obtained by scaling σ_p and σ_n with η_d.
By using this cosine function, we penalize the positive and non-negative regions less on neighboring feature maps than on distant ones, as shown in Figure 4. As a result, we receive more responses from neighboring feature maps, which further improves the recall of our model.
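One plausible realization of the idea can be sketched as below: the region scale factors on a level |d| steps from the best level decay with a cosine, reaching zero at a cutoff `k`. Both the exact functional form and the cutoff `k` are illustrative assumptions, not the paper's published formula.

```python
import math

def cosine_penalty(d, k=2):
    """Penalty factor for a feature level |d| steps from the best level.

    Illustrative form: 1.0 at the best level, decaying to 0 at distance k.
    """
    d = abs(d)
    if d >= k:
        return 0.0
    return math.cos(d * math.pi / (2.0 * k))

def projected_region(sigma, d, k=2):
    """Scale factor of the (non-)negative region projected onto level t + d."""
    return sigma * cosine_penalty(d, k)

# The best level keeps the full region; neighbors get shrunken ones.
print(projected_region(0.3, 0))  # 0.3
print(projected_region(0.3, 1))  # ~0.212
```

Because nearby levels still receive (smaller) positive regions, a ground truth produces training signal on several scales instead of exactly one.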
III-E Box Regression
Unlike anchor-based methods, which predict offsets with respect to reference anchor boxes, our bounding-box regression outputs the offset of the center point and the box size, both encoded by the stride of the feature map. In particular, given an object of category c, let (c_x, c_y) be the center position of the ground truth and (w, h) the width and height of the box. The regression target for a center point at pixel location (x, y) on a feature map with stride s is the distance from (x, y) to (c_x, c_y) together with the box size, all normalized by s.
To remove overlapped bounding boxes, we apply IoU-based Non-Maximum Suppression (NMS) with a threshold of 0.1.
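A minimal sketch of the stride-normalized encoding described above, together with the IoU-based NMS used at inference. The exact offset parameterization is an assumption reconstructed from the prose description.

```python
def encode_box(cx, cy, w, h, x, y, stride):
    """Offset of the ground-truth center from point (x, y), plus size, in stride units."""
    return ((cx - x) / stride, (cy - y) / stride, w / stride, h / stride)

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thresh=0.1):
    """Greedy IoU-based NMS; returns indices of kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

# Two heavily overlapping boxes collapse to one; the distant box survives.
kept = nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)], [0.9, 0.8, 0.7])
print(kept)  # [0, 2]
```

The low 0.1 threshold is viable precisely because polyps rarely overlap, as argued in Section II-A.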
III-F Classification Loss
Different from natural image modalities, where objects are often dense (in terms of the average number of objects per image) and of regular sizes, polyps in colonoscopy are often very sparse and small. As a result, we face an extremely small positive/negative sample ratio. We therefore find that Focal Loss fits our case better than the Online Hard Example Mining (OHEM) mechanism. However, we apply Focal Loss only to negative samples, while adopting cross entropy with a penalty term for positive samples. We assume that center points closer to the ground-truth centroid have a more precise view of an object and thus should contribute more to the loss.
In particular, the penalty weight of a positive center point is determined by its Euclidean distance to the ground-truth centroid and the size of the object. We use an unnormalized 2-D Gaussian to generate the penalty weight: given an object (c_x, c_y, w, h) and a point at pixel location (x, y) within the positive region, the weight decays exponentially with the squared distance to (c_x, c_y), with the Gaussian bandwidth proportional to the object size.
The bandwidth of the Gaussian is fixed relative to the object size in all experiments. In sum, the loss for the classification end combines the Focal Loss over negative center points with the Gaussian-weighted cross entropy over positive center points, where ŷ denotes the predicted confidence and y the ground truth. Finally, we combine the regression loss L_reg and the classification loss L_cls to formulate our multi-task loss L = L_cls + λ·L_reg.
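A scalar sketch of the two loss terms follows. The focal exponent `gamma` and the Gaussian bandwidth factor `alpha` are illustrative assumptions, not the paper's values.

```python
import math

def focal_negative(p, gamma=2.0):
    """Focal loss for a negative sample with predicted polyp probability p."""
    return -(p ** gamma) * math.log(max(1.0 - p, 1e-12))

def gaussian_weight(x, y, cx, cy, w, h, alpha=0.5):
    """Unnormalized 2-D Gaussian penalty weight; bandwidth tied to object size."""
    sx, sy = alpha * w, alpha * h
    return math.exp(-((x - cx) ** 2 / (2 * sx ** 2) + (y - cy) ** 2 / (2 * sy ** 2)))

def weighted_positive(p, weight):
    """Cross entropy for a positive center point, scaled by its Gaussian weight."""
    return -weight * math.log(max(p, 1e-12))

# A confident wrong negative is punished far more than an easy one:
print(focal_negative(0.9) > focal_negative(0.1))  # True
# The weight peaks (at 1.0) exactly at the ground-truth centroid:
print(gaussian_weight(16, 16, 16, 16, 8, 8))      # 1.0
```

The focal term keeps the huge pool of easy negatives from dominating the gradient, while the Gaussian weight concentrates the positive signal near the centroid.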
IV-A Data Preparation and Augmentation
We use GIANA, a database from the MICCAI 2017 endoscopic vision sub-challenge. The dataset covers four tasks: polyp detection, polyp segmentation, small bowel lesion detection and small bowel lesion localization. For the polyp detection task, it contains 18 short videos for training and 20 short videos for testing. For the polyp segmentation task, it provides 300 images for training and 612 for testing, plus more than 150 high-definition images. Note that the 612 testing images for the polyp segmentation task are identical to CVC-CLINIC.
To build our training set, we combine the datasets of the polyp detection and polyp segmentation tasks from GIANA. Image frames are extracted from the 18 short videos, and segmentation annotations are converted to bounding boxes. For data augmentation, we apply random rotation, zoom crop/expand, horizontal/vertical flips and distortions using Augmentor. In the end, we obtain a training set of 25K images.
CVC-CLINIC contains 612 still frames from 29 endoscopic videos. Each image comes with manually labeled pixel-level ground truth provided by the Computer Vision Center (CVC), Barcelona, Spain. All 612 images are used as our testing set.
ETIS-LARIB contains 192 high-resolution images with a resolution of 1225 × 996. Annotations are also provided at the pixel level; we convert them to bounding boxes for our detection task. We also evaluate our model on this dataset.
|Experiment|FPN|CEM|OHEM|Focal Loss|Cosine Projection|Gaussian Penalty|Precision|Recall|F1-score|F2-score|
IV-B Training Details
Our backbone network is initialized with VGG16 pre-trained on ImageNet. All additional convolution layers added on top of the backbone are initialized with the Xavier method. We train our model on an NVIDIA V100 with a batch size of 64, using SGD with momentum 0.9, weight decay 0.0005 and an initial learning rate of 0.001. During training, we apply cosine annealing for learning-rate decay with an interval of 20 epochs. Following SSD, we also apply online data augmentation (photometric distortions and random sampling) during training.
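The schedule can be sketched as below, assuming the rate restarts from the initial value every 20 epochs and decays toward a minimum of 0 within each interval; the minimum value and the restart behavior are assumptions.

```python
import math

def cosine_annealing_lr(epoch, base_lr=1e-3, min_lr=0.0, interval=20):
    """Cosine-annealed learning rate with restarts every `interval` epochs."""
    t = epoch % interval
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t / interval))

# Decays smoothly from 1e-3 toward 0, then restarts:
print(cosine_annealing_lr(0))   # 0.001
print(cosine_annealing_lr(10))  # ~0.0005
print(cosine_annealing_lr(20))  # 0.001 (restart)
```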
IV-C Evaluation Metrics
We follow the protocol of the MICCAI 2015 challenge, using precision, recall, F1 and F2 scores as the major evaluation metrics. We define the following terms to calculate the performance:
True Positive (TP): when the centroid of a predicted bounding box falls inside a polyp ground truth, it counts as a true positive detection.
False Positive (FP): when the centroid of a predicted bounding box falls outside any polyp ground truth, it counts as a false positive.
False Negative (FN): when a polyp ground truth is not detected by our detector, it counts as a false negative.
Note that if multiple predicted bounding boxes fall within one polyp ground truth, only one TP is counted. In sum, precision (P), recall (R), F1 and F2 scores are formulated as: P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R), F2 = 5PR / (4P + R).
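These definitions can be checked against the ETIS-LARIB row of Table III. The sketch below computes the four metrics from raw TP/FP/FN counts:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, F1 and F2 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    def f_beta(beta):
        b2 = beta * beta
        denom = b2 * precision + recall
        return (1 + b2) * precision * recall / denom if denom else 0.0

    return precision, recall, f_beta(1), f_beta(2)

# ETIS-LARIB row of Table III: TP=168, FP=21, FN=40
p, r, f1, f2 = detection_metrics(168, 21, 40)
print(round(100 * p, 2), round(100 * r, 2), round(100 * f1, 2), round(100 * f2, 2))
# 88.89 80.77 84.63 82.27
```

F2 weights recall twice as heavily as precision, which suits a screening setting where a missed polyp is costlier than a false alarm.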
IV-D Ablation Study
We also conduct a full ablation study to isolate the effect of each technical component. All models use VGG16 as the backbone and are evaluated on the CVC-Clinic dataset; results are summarized in Table I.
FPN improves recall significantly. In Experiment 1 (Table I), the downstream pathway is removed along with the CEM, so the network is similar to SSD. Comparing Experiments 1 and 2, we find that the recall rate improves significantly. We attribute this to the enlarged effective receptive field from downstream features that carry more context information. This observation is further supported by the experiment on the Context Enhancement Module.
The Context Enhancement Module plays a critical role. Comparing Experiments 2 and 6 in Table I, the overall performance drops significantly when the CEM is removed: 2.1% in recall and 0.9% in precision. This is consistent with the effect of Corner Pooling in CornerNet and, in turn, confirms the importance of enriching context in anchor-free detectors.
Focal Loss works better than OHEM. As observed in Table I (Experiments 3 and 6), Focal Loss brings 4.1% and 4.3% increases in precision and recall over OHEM. Moreover, in practice, training is much faster and more stable with Focal Loss than with OHEM.
|Method|Testing Dataset|TP|FP|FN|Precision|Recall|F1-score|F2-score|Inference time|
|AFP-Net (ours)|ETIS-LARIB|168|21|40|88.89|80.77|84.63|82.27|52.6 FPS|
|AFP-Net (ours)|CVC-Clinic-train|623|4|23|99.36|96.44|97.88|97.01|52.6 FPS|
Cosine projection boosts recall. In Experiment 4, we remove the cosine projection, so each ground truth is assigned to only one feature scale. Comparing Experiments 4 and 6 in Table I, we observe that the cosine projection increases the recall rate by a considerable margin.
Gaussian penalty improves precision. To investigate the effectiveness of the Gaussian penalty, we train one network without it (Experiment 5). Comparing Experiments 5 and 6, the Gaussian penalty results in 0.94% and 0.18% increases in precision and recall, respectively.
IV-E Overall Performance
We compared our proposed method with the models reported in the MICCAI 2015 polyp detection challenge and with recent work [26, 27, 32]. We run our tests on the CVC-Clinic training dataset and the ETIS-LARIB dataset with an input size of 320, and set the IoU threshold for non-maximum suppression to 0.1. All results are summarized in Table III.
We refer to our model as AFP-Net (Experiment 6 in Table I). Our proposed method outperforms all previous approaches in terms of F1 and F2 scores on both testing datasets. We also compared our approach with another anchor-free design (CenterNet), showing the superior performance of our design. Note that our anchor-free detector even outperforms its anchor-based baseline, SSD-baseline (FPN + CEM + SSD anchors), from which it differs only in the anchor part.
We test all models on an NVIDIA RTX 2080 Ti with an input size of 320 to investigate inference speed (FPS, frames per second). Among all anchor-based models, the SSD-baseline (differing from AFP-Net only in the anchor design) is the fairest and most direct reference. As shown in Table III, our model gives a speed boost of around 30%. This speed boost is consistent when comparing with other anchor-based designs such as FRCNNPL. We must note that the inference time of FRCNNPL was evaluated on an NVIDIA GTX TITAN X, which is slightly slower than our GPU; nevertheless, given the large gap in inference time (52.6 FPS vs. 5 FPS), we are still confident that our model runs faster. To compare with other anchor-free designs, we also retrained a CenterNet; as shown in Table III, the speed advantage of our model still holds.
We visualize some hard cases in Fig. 6, where our model makes mistakes. Our predicted boxes are marked as green rectangles, while the ground truth for each image is marked in red. The missed polyps in Figures 6(a) and 6(b) are very challenging cases because they share similar texture and shape with normal mucosae. This is partially due to the viewing angle during image capture: when light hits the top of a polyp directly, the polyp may blend into the background and become difficult to detect. Figures 6(c) and 6(d) show samples of false positive detections. Normal mucosae with fold structures are very hard to distinguish from polyps, especially early-stage polyps, as they can share the same texture, as shown in Figure 6(d).
V Conclusion and Future Work
In this paper, we proposed a novel anchor-free polyp detector. It is faster than anchor-based designs while achieving state-of-the-art performance. In addition to the improved performance, we also remove the hassle of manually tuning anchor-related hyper-parameters.
Enriching context information is critical for anchor-free detectors; the feature pyramid and the Context Enhancement Module both contribute in this way. On the loss end, Focal Loss, the distance-based Gaussian penalty and our proposed cosine ground-truth projection all play important roles in improving performance. With all these technical components, the potential recall drop caused by removing anchors is well compensated, and we achieve state-of-the-art performance.
We believe our cosine ground-truth projection and Gaussian penalty will provide a useful building block for future anchor-free designs. In future work, we intend to further improve the classification strategies for small, flat polyps that are very hard to distinguish from normal mucosae.
-  (2017) Improving tuberculosis diagnostics using deep learning and mobile health technologies among resource-poor communities in perãº. Smart Health 1-2 (Supplement C), pp. 66 – 76. Note: Connected Health: Applications, Systems and Engineering Technologies (CHASE 2016) External Links: Cited by: §I.
-  (2019-05) Polyp detection and segmentation using mask r-cnn: does a deeper feature extractor cnn always perform better?. pp. 1–6. External Links: Cited by: TABLE III.
-  (2009-01) Texture-based polyp detection in colonoscopy. pp. 346–350. External Links: Cited by: §I.
-  (2015) WM-dova maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians. 43, pp. 99 – 111. External Links: Cited by: §I, §IV-A, TABLE III, §IV.
-  (2017-02) Comparative validation of polyp detection methods in video colonoscopy: results from the miccai 2015 endoscopic vision challenge. PP, pp. 1–1. External Links: Cited by: §IV-A, §IV-C, §IV-E, TABLE III, §IV.
-  (2019-04) Biomedical image augmentation using Augmentor. External Links: Cited by: §IV-A.
-  (2018) Global cancer statistics 2018: globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians 68 (6), pp. 394–424. External Links: Cited by: §I.
-  (2016-06) Improving tuberculosis diagnostics using deep learning and mobile health technologies among resource-poor and marginalized communities. In 2016 IEEE First International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), Vol. , pp. 274–281. External Links: Cited by: §I.
-  (2019) CenterNet: keypoint triplets for object detection. abs/1904.08189. External Links: Cited by: §I, §I, §II-B.
-  (2017) A closer look: small object detection in faster r-cnn. In Multimedia and Expo (ICME), 2017 IEEE International Conference on, pp. 421–426. Cited by: §I.
-  (2016-06) IHear food: eating detection using commodity bluetooth headsets. In 2016 IEEE First International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), Vol. , pp. 163–172. External Links: Cited by: §I.
-  (2015) Fast R-CNN. abs/1504.08083. External Links: Cited by: §III-E.
-  (2017) Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2980–2988. Cited by: §I.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §I.
-  (2018) Stochastic gradient descent with hyperbolic-tangent decay. abs/1806.01593. External Links: Cited by: §IV-B.
-  (2007-Sep.) Polyp detection in colonoscopy video using elliptical shape feature. In 2007 IEEE International Conference on Image Processing, Vol. 2, pp. II – 465–II – 468. External Links: Cited by: §I.
-  (2019) FoveaBox: beyond anchor-based object detector. Cited by: §II-B.
-  (2018) CornerNet: detecting objects as paired keypoints. CoRR abs/1808.01244. External Links: Cited by: §I, §II-B, §II-B, §III-D, §IV-D.
-  (2012) Factors influencing the miss rate of polyps in a back-to-back colonoscopy study. Endoscopy 44 (05), pp. 470–475. External Links: Cited by: §I.
-  (2018) DSFD: dual shot face detector. abs/1810.10220. External Links: Cited by: §III-B.
-  (2015) HeteroSpark: a heterogeneous cpu/gpu spark platform for machine learning algorithms. In 2015 IEEE International Conference on Networking, Architecture and Storage (NAS), pp. 347–348. External Links: Cited by: §I.
-  (2016) Feature pyramid networks for object detection. abs/1612.03144. External Links: Cited by: §III-A, §III-D.
-  (2017) Feature pyramid networks for object detection. In CVPR, Vol. 1, pp. 4. Cited by: §I.
-  (2017) Focal loss for dense object detection. arXiv preprint arXiv:1708.02002. Cited by: §III-F.
-  (2015) SSD: single shot multibox detector. CoRR abs/1512.02325. External Links: Cited by: §I, §I, §III-D.
-  (2018-09) An Efficient Approach for Polyps Detection in Endoscopic Videos Based on Faster R-CNN. pp. arXiv:1809.01263. External Links: Cited by: §II-A, §IV-E, TABLE III.
-  (2018) Y-net: A deep convolutional neural network for polyp detection. CoRR abs/1806.01907. External Links: Cited by: §II-A, §II-A, §IV-E, TABLE III.
-  (2017) SSH: single stage headless face detector. abs/1708.03979. External Links: Cited by: §III-B.
-  (2015) You only look once: unified, real-time object detection. CoRR abs/1506.02640. External Links: Cited by: §I, §III-D.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497. External Links: Cited by: §I, §I, §III-D, §III-E.
-  (2013-09) Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer. 9, pp. . External Links: Cited by: §IV-A, §IV.
-  (2018-07) Automatic colon polyp detection using region based deep cnn and post learning approaches. PP, pp. 1–1. External Links: Cited by: §II-A, §IV-E, §IV-E, TABLE III.
-  (2017) Colorectal cancer statistics, 2017. CA: A Cancer Journal for Clinicians 67 (3), pp. 177–193. External Links: Cited by: §I.
-  (2014-09) Very deep convolutional networks for large-scale image recognition. pp. . Cited by: §I, §III-A.
-  (2019-06) Region-based automated localization of colonoscopy and wireless capsule endoscopy polyps. 9, pp. . External Links: Cited by: TABLE III.
-  (2019) People re-identification by multi-branch cnn with multi-scale features. In 2019 26th IEEE International Conference on Image Processing (ICIP), Cited by: §I.
-  (2015-06) Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 1–9. External Links: Cited by: §III-B.
-  (2015-10) Automated polyp detection in colonoscopy videos using shape and context information. 35, pp. . External Links: Cited by: TABLE III.
-  (2019) FCOS: fully convolutional one-stage object detection. Cited by: §II-B.
-  (2019) Region proposal by guided anchoring. Cited by: §I, §II-B.
-  (2017) Improved multimodal representation learning with skip connections. In Proceedings of the 2017 ACM on Multimedia Conference, MM ’17, New York, NY, USA, pp. 654–662. External Links: Cited by: §I.
-  (2019) 3D anchor-free lesion detector on computed tomography scans. External Links: Cited by: §I.
-  (2018) Polyp detection during colonoscopy using a regression-based convolutional neural network with a tracker. Pattern Recognition 83, pp. 209–219. External Links: Cited by: §II-A, TABLE III.
-  (2019) Objects as points. abs/1904.07850. External Links: Cited by: §I, §I, §II-B.