WCE Polyp Detection with Triplet based Embeddings

12/10/2019 ∙ by Pablo Laiz, et al. ∙ Universitat de Barcelona

Wireless capsule endoscopy is a medical procedure used to visualize the entire gastrointestinal tract and to diagnose intestinal conditions, such as polyps or bleeding. Current analyses are performed by manually inspecting nearly every frame of the video, a tedious and error-prone task. Automatic image analysis methods can be used to reduce the time needed for physicians to evaluate a capsule endoscopy video; however, these methods are still in a research phase. In this paper we focus on computer-aided polyp detection in capsule endoscopy images. This is a challenging problem because of the diversity of polyp appearance, the imbalanced dataset structure and the scarcity of data. We have developed a new computer-aided decision system for polyps that combines a deep convolutional neural network and metric learning. The key point of the method is the use of the triplet loss function with the aim of improving feature extraction from the images when the dataset is small. The triplet loss function allows robust detectors to be trained by forcing images from the same category to be represented by similar embedding vectors while ensuring that images from different categories are represented by dissimilar vectors. Empirical results show a meaningful increase of AUC values compared to baseline methods. Good performance is not the only requirement when considering the adoption of this technology in clinical practice. Trust and explainability of decisions are as important as performance. For this purpose, we also provide a method to generate visual explanations of the outcome of our polyp detector. These explanations can be used to build a physician's trust in the system and also to convey information about the inner workings of the method to the designer for debugging purposes.




1 Introduction

According to the Global Health Organization, colorectal cancer is the third most common type of cancer, with 1.8 million people diagnosed in 2018 [cancer_statistics]. The early detection of cancer, when it is still small and has not spread, is essential for the treatment and the survival of the patient. The detection and removal of intestinal polyps, abnormal growths of tissue that can evolve into cancer, is especially important. According to the American Cancer Society, screening tests of the gastrointestinal (GI) tract have significantly increased the survival rate of colorectal cancer patients (https://www.cancer.org/cancer/colon-rectal-cancer/detection-diagnosis-staging/survival-rates.html).

The standard clinical procedure for the screening of the rectum and the colon is a colonoscopy. In spite of the fact that this procedure is widely accepted, it presents important drawbacks: it requires qualified personnel in expensive medical facilities and may result in patient discomfort.

Wireless Capsule Endoscopy (WCE), originally developed by [capsule_endoscopy], is an alternative technique designed to record the inside of the digestive tract with minimal patient discomfort and with fewer resources. Patients ingest a vitamin-size capsule that contains a camera and an array of LEDs powered by a battery, which records the images and sends them to an external device for later analysis.

WCE has the potential to revolutionize the screening of conditions such as polyposis syndromes, bleeding or Crohn’s disease [ai_capsule_here], but it presents an important drawback in practice: the resulting videos contain more than 200,000 images per patient, all of which must be screened by clinical specialists. This review is complex, tedious and time-consuming, often lasting 2 to 3 hours per video. Moreover, because of the fatigue caused by the visual inspection of these videos, procedures commonly have to be reviewed more than once to ensure that no pathological images are missed [ai_capsule_here]. All these inconveniences hinder the adoption of the WCE procedure, exposing the need for computer-aided decision (CAD) support systems [aid_polyps, aid_ulcer].

In the literature, we can find several AI-based CAD systems aimed at detecting suspicious or abnormal WCE frames. Most of these methods are aimed at reducing visualization and diagnosis time of the experts by detecting specific GI events with high performance machine learning systems.

With regard to the specific goal of polyp detection, most of the published systems have been built and validated as automatic detection methods. However, for legal and practical reasons, these systems cannot be used for automatic diagnosis and can only be deployed as decision support systems, which filter the whole set of frames to direct the physician’s attention to those images that show potential polyp structures. In most cases, this is a needle-in-a-haystack problem, because images showing these pathologies appear only occasionally. Figure 1 shows two sequences from different procedures where a polyp is observed. It is important to point out that, in both procedures, those are the only images, out of more than 200k images in the whole procedure, where a polyp can be seen. Figure 2 shows some random images from the same procedures.

Figure 1: Illustration of two polyp sequences extracted from different patients. In the first sequence, the polyp appears in all the frames in approximately the same location. In the second sequence, however, the polyp location changes as the WCE moves through the GI tract.
Figure 2: Illustration of 16 random samples obtained from the same procedures, representing the huge diversity of the dataset. For example, some of the frames show turbid content, GI walls or wrinkles, among others.

Polyp detection has been an active research topic, as can be seen in Table 2. However, to our knowledge, there is no agreement on a common evaluation methodology that allows the community to compare different CAD methods. Most of these methods have been developed and evaluated with private datasets and with different evaluation methodologies, which are suited to image detection systems but not fully informative for CAD systems in medical applications.

In this paper, we propose and validate a CAD system for the detection of polyps in WCE videos. The proposed system is based on Deep Convolutional Neural Networks (CNN). It is well known that CNNs have become the state of the art in many visual recognition problems, but their application in the medical field has been rather limited, with some exceptions such as dermatology or breast x-rays. The main reason for this is that medical databases are comparatively poor and small due to the high costs involved in data acquisition, their complex labelling, and the confidentiality issues that the use of these data often involves, as summarized in [deepLearningChallengesMIA]. Small size and imbalanced data are two of the main obstacles to developing reliable deep learning classifiers because, if not properly addressed, they may lead to overfitting during training and poor generalization. Several tricks and techniques, such as dropout [dropout], sampling strategies [sampling_strategy] and image augmentation [image_augmentation], try to alleviate this problem, but it is still an open and important challenge in the medical field, as described in [deepLearningChallengesMIA]. To this end, and to overcome the small amount of data available for training the CNN, in this paper we propose an optimization strategy based on deep metric learning that uses the triplet loss function [triplet_loss]. The obtained results show that, in our problem, this learning strategy outperforms the classical learning strategy based on the cross-entropy loss function.

Our contributions are as follows:

  • We propose an evaluation methodology that involves quantitative metrics as well as the reporting of qualitative database information in order to allow fair comparisons between different systems.

  • We show how to build an end-to-end CNN polyp detection system, based on the triplet loss function, that overcomes the problem of imbalanced datasets.

  • Finally, we propose the use of classifier interpretation techniques as a mechanism to build trust in the system.

The paper is organized as follows: first, we give an overview of the related work in the field. This is followed by a description of our methodology, presenting the system architecture, parameter optimization and evaluation methodology, and then by the experimental setup and results. Finally, we conclude the paper and give directions for future work.

2 Related Work

Since the presentation of WCE, several computational systems have been proposed to reduce its inherent drawbacks in clinical settings [liedlgruber2011computer, belle2013biomedical]. Generally, these systems are designed either for efficient video visualization [mackiewicz2008wireless, chu2010epitomized, iakovidis2013efficient, drozdzal2013adaptable] or to automatically detect different intestinal abnormalities such as bleeding [wce_3_dcnn_bleeding, wce_4_haemorrhage], polyps [rotation_invariant, ploosonePolyp], tumors [cobrin2006tumor], ulcers [Ciaccio2013ulcer], motility disorders [Malagelada2015motility, sseguiWrinkles] and other general pathologies [ciaccio2010distinguishing, 6051474, malagelada2012functional, Chen2013]. Deep learning nowadays represents the state of the art for most of these problems. Table 1 shows detailed information on those systems that have been implemented using deep learning methods.

| Reference (Year) | Class | Dataset: Videos | Dataset: Images | Validation: Method | Validation: Patient Separation | Architecture | Metrics |
|---|---|---|---|---|---|---|---|
| [8_classifying_digestive_organs] | Localization | 25 | 75k | 60k / 15k | Unknown | AlexNet | A |
| [9_hybrud_conv] | Digestive organs | 25 | 1M | 60k / 15k | Unknown | CNN + ELM | A |
| [wce_2_generic_feature] | Scene classification | 50 | 120k | 100k / 20k | Unknown | CNN | A-G |
| [wce_3_dcnn_bleeding] | Bleeding | - | 10k | 8.2k / 1.8k | Unknown | AlexNet | B-F-H |
| [wce_4_haemorrhage] | Haemorrhage | - | 11.9k | 9.6k / 2.24k | Unknown | LeNet, AlexNet, GoogleNet, VGG-Net | F-B-C-H |

Table 1: Comparison of existing Deep Learning methods for the classification problem in WCE. In the last column, Metrics, the legend used is: Accuracy (A), Sensitivity - Recall - TPR (B), Specificity - TNR (C), ROC (D), AUC (E), Precision (F), Confusion Matrix (G), F1-Score (H).

Among possible WCE uses, polyp detection is one of the problems that has attracted most attention from researchers. Table 2 presents a set of methods, published in highly reputed conferences or journals, aimed at detecting polyps using some of the GI examination modalities. As can be seen, prior to 2015, most of the published methods were based on conventional computer vision and machine learning techniques, which rely on the extraction of handcrafted visual features followed by a classifier. These systems have used several image features such as color, texture and shape to deal with the classification task. Since 2015, most of the methods are based on deep learning.

| Reference (Year) | Modality | Dataset: Videos | Dataset: Polyp | Dataset: Non-polyp | Validation: Method | Validation: Patient Separation | Feature | Metrics |
|---|---|---|---|---|---|---|---|---|
| [2_intestinal_polyp_recognition] | WCE | 2 | 150 | 150 | 3-fold | Unknown | Colour and shape | A-B-C |
| [1_poylp_detection_color_texture_features] | WCE | 2 | - | - | 5-fold | Unknown | Colour | A-D-E |
| [5_automatic_polyp_detection] | WCE | 10 | 600 | 600 | 10-fold | Unknown | Texture | A |
| [3_feature_polyp_detection] | WCE | 10 | 436 | 436 | 10 random splits | Unknown | Texture | A-B-C |
| [7_polyp_detection_imbalanced] | Endoscopy | 141 | 1k | 100k | 5-fold | Unknown | Shape | B-E-F |
| [4_automatic_polyp_detection] | Colonoscopy | 20 | 2k | 3k | - | Unknown | Texture | B |
| [10_lesion_dtection] | Endoscopy | - | 6.5k | 50k | 10-fold | Unknown | CNN | A-B-C |
| [endo_3_automatic_detection] | Colonoscopy | - | 826 | 1.1k | Random test | Unknown | CNN | A-C-F-H |
| [endo_4_automatic_polyp_detection] | Colonoscopy | 6 | 37k | 36k | Random test | Unknown | CNN (AlexNet) | A-B-G |
| [endo_5_integrating_online] | Colonoscopy | 20 | 3.7k | - | 18 dif. videos | Separate | FCN | G-F-B-H |
| [11_deep_polyp_recognition] | WCE | 35 | 1k | 3k | - | Unknown | CNN (SSAE) | A-G |
| [rotation_invariant] | WCE | 62 | 1.5k | 1.5k | 2.4k / 600 | Unknown | CNN (DenseNet) | A-B-F-H |
| [ploosonePolyp] | Colonoscopy | 215 | 404 | - | 50 rand. images | Unknown | CNN | G-F-B-H |
| This Paper | WCE | 120 | 2.1k | 250k | 5-fold | Separate | CNN (ResNet) | |

Table 2: Overview of the proposed method and existing methods for polyp detection. The nomenclature is the same as in Table 1.

There are three features of these methods that are worth analyzing in order to define a fair comparison methodology: database size, validation strategy and evaluation metrics.

Databases: As can be seen, in most cases the number of polyps in the dataset is relatively small. The paper that uses the largest number of polyps uses a total of 37k images, while the minimum uses just 25. If we consider only those papers that work with WCE images, the number of images is significantly smaller. The paper that uses the largest dataset uses a total of 1.5k polyp images obtained from 62 different patients, which means an average of 25 polyp frames per procedure. It is also important to point out that the number of procedures is from 2 to 1000 times smaller than the number of polyps. This means that several images of the same polyp are used in the dataset, but this information is not usually reported. Besides this, to understand how challenging the dataset is, it would also be important to report the size and type of the polyps.

Regarding negative samples, the paper with the largest database uses a total of 100k images, while the paper with the smallest set uses 75 images. None of the papers reports the sampling strategy used to obtain these negative images.

Training and validation strategy: As pointed out before, datasets usually contain several positive images from the same patient, and in most cases several images of the same polyp. For this reason it is very important to ensure that the training and the validation sets do not share images from the same procedure. If the partition into training and validation sets is not done properly, it is highly probable that consecutive and practically identical frames end up in both sets. This clearly contaminates any validation result based on those datasets. Only the method presented by [endo_5_integrating_online] reports this information.

Evaluation metrics: In order to validate these systems, authors use a variety of evaluation metrics: accuracy, precision, sensitivity/recall, specificity, ROC curve, AUC, F1-Score as well as the confusion matrix. This diversity of evaluation metrics hinders a clear comparison between methods, thus showing the need for a good and unified evaluation strategy.

3 Method

We designed and evaluated the approach not as a classical classification problem, but as an information retrieval problem: Given a WCE video, the problem is to rank the images of the video according to some criterion so that the more relevant images appear early in the result list displayed to the physician. This allows the visual screening of a reduced set of images and at the same time ensures the detection of a maximum number of positive images.
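The retrieval view described above can be illustrated with a minimal sketch. The scores, the set of positive frames and the function names `rank_frames` and `review_budget` are hypothetical and chosen only for illustration; they are not part of the proposed system.

```python
def rank_frames(scores):
    """Return frame indices sorted from the most to the least suspicious."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def review_budget(ranked, positives, fraction):
    """Recall reached when only the top `fraction` of the ranked frames is reviewed."""
    n_review = int(round(fraction * len(ranked)))
    found = sum(1 for i in ranked[:n_review] if i in positives)
    return found / len(positives)


# Hypothetical detector scores for a 10-frame video; frames 1 and 6 show a polyp.
scores = [0.05, 0.90, 0.10, 0.75, 0.20, 0.02, 0.60, 0.15, 0.01, 0.30]
positives = {1, 6}

ranked = rank_frames(scores)
print(ranked[:3])                              # frames shown first to the physician
print(review_budget(ranked, positives, 0.3))   # recall after reviewing 30% of the frames
```

In this toy case both polyp frames fall within the first three ranked frames, so reviewing 30% of the video is enough to see all positives.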

The description of the method has three parts with the following contents:

  • System Architecture: Introduction of the CNN architecture used to detect polyp images.

  • Parameter Optimization: Explanation of how the chosen architecture is optimized. Presentation of the problems derived from the database and how to adapt the learning process to achieve better results.

  • Evaluation Methodology: Presentation of standard metrics and discussion about how to evaluate polyp detection systems to be able to compare with other works.

3.1 System Architecture

The proposed deep learning method is based on the Deep Residual Network (ResNet) architecture, presented by [resnet_paper]. This network has shown outstanding results in important image recognition problems.

The main novelty of this architecture is the use of a large number of layers that progressively learn more complex features. The first layers learn edges, shapes or colors, while the last ones are able to learn concepts. In order to be trainable, this architecture introduces a set of residual blocks that avoid the problem of vanishing gradients when the number of layers increases. These blocks are built using skip connections, or identity shortcuts, that reuse the outputs from previous layers.

The residual block has the following form:

$$y = F(x) + x$$

where $F(x)$ represents the stacked non-linear layers and $x$ the identity function.
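As a toy illustration of the skip connection, the sketch below applies a residual block to a small vector. It assumes a stand-in $F$ made of two linear maps with a ReLU in between; the real ResNet blocks use convolutions and batch normalization, and the helper names are hypothetical.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Matrix-vector product for a weight matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute y = F(x) + x with the toy choice F(x) = w2 * relu(w1 * x)."""
    fx = linear(w2, relu(linear(w1, x)))
    return [f + xi for f, xi in zip(fx, x)]


x = [1.0, -2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

print(residual_block(x, zero, zero))          # F(x) = 0, so the block is the identity
print(residual_block(x, identity, identity))  # F(x) = relu(x) is added to the input
```

When the stacked layers output zero, the block simply passes its input through, which is why adding more such blocks does not make the signal or its gradient vanish.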

Taking into account the performance of this architecture in other image classification problems, we used the fifty-layers ResNet version, known as ResNet-50.

Figure 3: Overview of the proposed CNN structure. The upper part of the scheme shows the ResNet architecture with our methodology applied to it. The background colour indicates the layers affected by the gradients generated by each of the main losses. The lower part of the figure shows how the class activation mapping is built.

3.2 Parameter optimization

ResNet-50 has over 23 million trainable parameters. The robust estimation of these parameters requires millions of images, as described in [resnet_paper], but once estimated these parameters have been shown to be useful for a variety of visual recognition problems. The original paper used the cross-entropy loss function with an L2 regularization term to estimate all these parameters. The binary cross-entropy loss decreases as the prediction converges to the true label and increases otherwise, as its expression indicates:

$$L_{CE} = -\left( y \log(p) + (1 - y) \log(1 - p) \right)$$

where $y$ is the true label of the sample and $p$ is the estimated probability that the sample belongs to class $1$.
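A short numerical sketch of the binary cross-entropy behaviour for a positive sample follows. The clipping constant `eps` is an assumption added only to avoid evaluating the logarithm at zero.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for a true label y in {0, 1} and a predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# The loss shrinks as the prediction for a positive sample approaches 1.
for p in (0.1, 0.5, 0.9):
    print(p, binary_cross_entropy(1, p))
```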

In our case, to deal with a small and imbalanced dataset, we propose an optimization of the ResNet in two stages. First, images are projected into an embedding space using the Triplet Loss (TL) as described in [triplet_loss]. Then, we consider the cross-entropy loss function in the embedding space. The proposed methodology is shown in the upper part of Figure 3.

TL, a deep metric learning (ML) method, has shown great generalization results when dealing with a large number of classes, for instance in the problem of face re-identification. The goal of the TL is to optimize a latent feature space such that examples from the same class are closer to each other than to those belonging to different classes.

In order to learn this embedding representation, the triplet loss aims to ensure that an anchor image, $x_a$, is closer to all other images from the same class, $x_p$, than to any image from a different class, $x_n$. This concept, illustrated in Figure 4, can be formulated as follows:

$$D(f(x_a), f(x_p)) + m < D(f(x_a), f(x_n))$$

where $f(x)$ is the embedding of $x$, $D$ is the Euclidean distance and $m$ is a margin, which defines the minimum distance between elements of different classes.

In order to train the network and reach the sought condition, the triplet loss function is defined as follows:

$$L_{TL} = \max\left(0,\; m + D(f(x_a), f(x_p)) - D(f(x_a), f(x_n))\right)$$

Figure 4: Representation of the behaviour of the triplet loss using one triplet. The arrows on each image indicate the direction in which the embedding will move following the gradient.
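As a worked example of the triplet loss on a single triplet, the sketch below uses 2-D embeddings and a margin of 1.0. Both the embeddings and the margin value are illustrative assumptions, not the values used in the experiments.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Hinge on margin + D(anchor, positive) - D(anchor, negative)."""
    return max(0.0, margin + euclidean(anchor, positive) - euclidean(anchor, negative))


anchor, positive = [0.0, 0.0], [0.0, 1.0]

# Negative far away: the margin condition already holds, so the loss is zero.
print(triplet_loss(anchor, positive, [3.0, 4.0], margin=1.0))

# Negative too close: the loss is positive and pushes the embeddings apart.
print(triplet_loss(anchor, positive, [0.0, 1.5], margin=1.0))
```

A triplet contributes to the gradient only when the negative lies within the margin of the positive, which is what makes the choice of triplets important.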

Training a neural network using the TL is not simple. At training time, the network receives triplets of images. For small datasets, the generation of each triplet is feasible, but when the amount of images increases, the number of possible triplets grows with cubic complexity. If we try to generate all of them, it becomes intractable and inefficient, making it impossible to optimize the loss. As a consequence, a sampling strategy for the images becomes an essential part of the learning method. The right choice of triplets can increase the speed of convergence, the probability to find a lower minimum and even the possibility of getting better generalization.

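In the literature we can find two different methodologies to face the problem of triplet sampling for each batch. The first one, called Batch All, was introduced in [batch_methodologies] and is denoted here $L_{BA}$. In this case, for each sample in the batch, we consider all possible triplets, so the number of triplets per batch grows quickly with the batch size. With the embedding $f$, the distance $D$ and the margin $m$ of the triplet loss, and with $y_i$ the class label of sample $x_i$, the loss function is:

$$L_{BA} = \sum_{a} \sum_{p \neq a,\; y_p = y_a} \sum_{n,\; y_n \neq y_a} \max\left(0,\; m + D(f(x_a), f(x_p)) - D(f(x_a), f(x_n))\right)$$
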
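The use of the previous methodology declined with the appearance of the Batch Hard approach [batch_methodologies], denoted here $L_{BH}$. It takes each anchor $x_a$ and generates a single triplet by searching the batch for the hardest positive sample $x_p$, defined as the farthest positive sample, and the hardest negative sample $x_n$, defined as the closest negative sample. With the embedding $f$, the distance $D$ and the margin $m$ of the triplet loss, and with $y_i$ the class label of sample $x_i$, the loss function is:

$$L_{BH} = \sum_{a} \max\left(0,\; m + \max_{p \neq a,\; y_p = y_a} D(f(x_a), f(x_p)) - \min_{n,\; y_n \neq y_a} D(f(x_a), f(x_n))\right)$$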
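The two batch strategies can be compared on a toy batch with the following sketch. It assumes 1-D embeddings, a margin of 2.0 and a plain sum over terms; actual implementations typically operate on tensors and may average instead of summing, so this is an illustration rather than the training code.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_all_loss(embeddings, labels, margin):
    """Sum the hinge term over every valid (anchor, positive, negative) triplet."""
    total, n = 0.0, len(embeddings)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            for q in range(n):
                if labels[q] == labels[a]:
                    continue
                d_ap = euclidean(embeddings[a], embeddings[p])
                d_an = euclidean(embeddings[a], embeddings[q])
                total += max(0.0, margin + d_ap - d_an)
    return total


def batch_hard_loss(embeddings, labels, margin):
    """One triplet per anchor: farthest positive and closest negative in the batch."""
    total, n = 0.0, len(embeddings)
    for a in range(n):
        pos = [euclidean(embeddings[a], embeddings[i])
               for i in range(n) if i != a and labels[i] == labels[a]]
        neg = [euclidean(embeddings[a], embeddings[i])
               for i in range(n) if labels[i] != labels[a]]
        if pos and neg:  # skip anchors without a positive or a negative
            total += max(0.0, margin + max(pos) - min(neg))
    return total


embeddings = [[0.0], [1.0], [2.5], [4.0]]
labels = [1, 1, 0, 0]
print(batch_all_loss(embeddings, labels, margin=2.0))   # 8 triplets contribute
print(batch_hard_loss(embeddings, labels, margin=2.0))  # 4 triplets, one per anchor
```

Batch Hard keeps only the most difficult triplet of each anchor, which concentrates the gradient on outliers; this is consistent with the difficulty that very diverse positive and negative images can cause.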


3.3 Evaluation

In the field of medical imaging, and in particular when databases are protected and not released to the public domain, the evaluation of different proposals is perhaps the hardest and most critical part. However, as shown in the related work, a unified procedure that allows an objective comparison of methods does not exist. A diversity of metrics is used and, in most cases, they are not the ideal ones for the problem. Moreover, in most papers the information reported about the dataset is not sufficient to understand the relevance of the proposal. For this reason, in this section we study and propose a methodology for validating computer-aided polyp detection systems. The proposed evaluation methodology is divided into three fundamental points:

  • Databases and cross validation strategy: How to properly build it and what information must be reported to understand the relevance of the proposal and allow a detailed comparison of models.

  • Quantitative Results: Standard metrics in computer vision problems have several drawbacks that can affect the understanding of the performance of the methods. For this reason we propose and justify a set of metrics to be used.

  • Qualitative Results: Aside from the numeric results, it is important to consider qualitative results in order to trust the system. To this end, we propose the use of a method to understand the output of the model.

3.3.1 Databases and cross-validation strategy

The creation of medical databases is an essential step before training and validating any type of system. Both positive and negative samples must be collected with care. With respect to size and diversity, training data can follow any distribution that one deems appropriate; however, it is crucial that results are reported using a test set large enough to also capture the diversity of non-pathological images that can be found in the GI tract. In order to capture this diversity, we consider a uniform time sampling strategy to be the best option for creating the negative set. As negative samples are cheap, since we have as many as needed, a minimum number of images per video should be used, 2k being a reasonable number.
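Uniform time sampling of negative frames can be sketched as follows, assuming that frame indices are proportional to acquisition time. The function name and the 200,000-frame video length are illustrative assumptions.

```python
def uniform_time_sample(n_frames, n_samples):
    """Indices of n_samples frames spread evenly over a video with n_frames frames."""
    if n_samples >= n_frames:
        return list(range(n_frames))
    step = n_frames / n_samples
    return [int(i * step) for i in range(n_samples)]


# 2k candidate negatives from a hypothetical 200,000-frame procedure.
indices = uniform_time_sample(200_000, 2_000)
print(len(indices), indices[:3], indices[-1])
```

The sampled frames still have to be checked so that no frame showing a polyp ends up in the negative set.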

The second important point to consider when creating the database and its evaluation methodology is that, although all polyps have common visual characteristics, the similarity between different polyps from the same patient must be taken into account. The first row of Figure 5 shows three different polyps from the same patient, while the second row shows three polyps from different patients. As can be observed, polyps from the same patient are generally similar in shape and texture, while polyps from other patients are more diverse. For this reason, the training and test sets must not use images from the same procedures.

Figure 5: Example of polyps extracted from the procedures. In the first row, the polyps come from the same procedures, while the polyps from the second row come from different ones.

Additionally, since the datasets are small, it is advisable to perform cross-validation to avoid data selection problems. As mentioned before, it is important that the folds of the cross-validation procedure are built by leaving procedures out, not by leaving images out, thereby ensuring that images from the same procedure never belong to two different partitions.
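A minimal sketch of a procedure-level split is shown below. It assigns procedures to folds in round-robin order and does not balance the number of polyps per fold, so the stratification used in practice is not reproduced here; the procedure identifiers are hypothetical.

```python
def procedure_folds(frame_procedures, n_folds):
    """Assign each frame to a fold so that all frames of a procedure share one fold."""
    procedures = sorted(set(frame_procedures))
    fold_of = {proc: i % n_folds for i, proc in enumerate(procedures)}
    return [fold_of[proc] for proc in frame_procedures]


# Hypothetical procedure identifier of each frame in the dataset.
frame_procedures = ["p1", "p1", "p2", "p3", "p3", "p3", "p4"]
folds = procedure_folds(frame_procedures, n_folds=2)
print(folds)

# Check that no procedure is spread over more than one fold.
seen = {}
for proc, fold in zip(frame_procedures, folds):
    assert seen.setdefault(proc, fold) == fold
```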

Lastly, and since in most of the cases databases are not released to the public domain, it is fundamental to have a detailed description of the dataset in order to understand the complexity and impact of the solution. We consider that the following information should necessarily be reported:

  • Number of procedures/cases used in the dataset.

  • How many of them suffer a pathology.

  • Distribution of unique pathological events.

  • Frames per each pathological event.

3.3.2 Quantitative Results

Evaluation metrics illustrate the performance of the system and allow comparisons. For this reason, they must discriminate well among models and they must match the aim of the system.

Accuracy is the most frequently used metric to validate polyp detection systems. However, it does not reflect what is expected from the system, since it does not necessarily correlate with the time needed by the physician to reach a diagnosis. Accuracy depends on the threshold defined for the system and does not give the full picture of the system output. Moreover, in imbalanced problems accuracy is mostly affected by the predominant class. Weighted accuracy is a more suitable metric, although it is still dependent on a fixed threshold.

Precision and recall (also known as sensitivity) have also been widely used by the community. Precision is the fraction of true polyp images (TP) among all the positives returned by the system (TP + FP), while recall is the fraction of true polyp images (TP) that have been detected out of the total number of polyp images (TP + FN). However, as with accuracy, these measurements are also affected by the threshold of the classifier. Since the goal of the system is to reduce the time needed by the physician, it is interesting to report the recall obtained at different specificity (TNR) values instead of using the best trade-off between both metrics. The recall at these fixed specificity values makes it possible to understand how many images the physician needs to review in order to obtain a certain performance, i.e., the recall at a specificity of 95% measures the percentage of detected polyps (TPR) if only 5% of the images are reviewed. The recall for specificity values of 80%, 90% and 95% is analyzed in this paper.
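Recall at a fixed specificity can be computed as in the sketch below. It assumes that a frame is flagged when its score is strictly above the threshold, and it uses hypothetical scores for 10 negative and 4 positive frames.

```python
import math


def recall_at_specificity(scores, labels, specificity):
    """Recall at the lowest threshold that still reaches the requested specificity."""
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    # Number of negatives that must score at or below the threshold.
    n_keep = math.ceil(specificity * len(neg) - 1e-9)
    threshold = neg[n_keep - 1] if n_keep > 0 else float("-inf")
    detected = sum(1 for s in pos if s > threshold)
    return detected / len(pos)


neg_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55, 0.80]
pos_scores = [0.90, 0.60, 0.45, 0.20]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

for spec in (0.80, 0.90, 0.95):
    print(spec, recall_at_specificity(scores, labels, spec))
```

With only 10 negatives, a specificity of 95% can only be met by rejecting all of them, which is one reason why a large and diverse negative test set is needed for these figures to be meaningful.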

The Area Under the Curve (AUC) is another interesting measurement. The AUC is computed from the ROC curve, which relates the specificity and the recall obtained for all the possible thresholds of the classifier. The AUC value can be understood as the probability that the classifier assigns a higher score to a randomly chosen positive image than to a randomly chosen negative one; therefore, the larger the AUC value, the better the classifier. A limitation of the AUC is that both negative and positive classes have the same impact on the output, so FP and FN are penalized equally.
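The probabilistic reading of the AUC can be checked directly by comparing every positive score with every negative score, as in the sketch below. The scores are hypothetical, and ties are counted as half a win, which is the usual convention.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked in the right order."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(pos) * len(neg))


neg_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55, 0.80]
pos_scores = [0.90, 0.60, 0.45, 0.20]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)
print(auc(scores, labels))
```

This pairwise computation is quadratic in the number of frames, so for full procedures a rank-based implementation from a numerical library would normally be used instead.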

3.3.3 Qualitative Results

Although deep learning has shown impressive results, its application to the medical field raises worries and criticisms, since computerized diagnostic methods are seen as black boxes that do not show how the data is analyzed or how the output is obtained. In medical imaging problems, and particularly in polyp detection, it is essential to trust and understand the predictions obtained by the system. Understanding how the outcome was obtained allows us to: 1) understand why something is considered pathological; 2) give physicians and scientists the trust they need in the system; 3) debug and identify errors of the system more easily.

To this end, we consider that a qualitative evaluation, showing where and why the system is failing, is a very important element. Failing on a small or partly occluded polyp is not the same as missing a large polyp. It is also important to show False Positive (FP) cases, since images with shapes or textures that are similar to polyps may be understandable errors and may increase the confidence in the system.

Studying where the system detected a polyp in a frame can be useful for two main reasons: 1) to identify whether the detector is focusing on the area of interest and 2) to help physicians in the review process.

The class activation map (CAM), presented by [cam], is a generic localized deep representation that can be used to interpret the prediction made by the system. This method indicates the implicit attention that the network gives to the pixels of an image for the class it is assigned to.

To obtain the class activation map, a linear combination is computed between the feature maps and the classifier weights, since these weights connect the output and the last feature maps and therefore identify the importance of each response obtained. For a given class $c$, the class activation map is defined as:

$$M_c(x, y) = \sum_{k=1}^{K} w_k^c \, f_k(x, y)$$

where $K$ is the number of channels, $f_k$ are the different responses of the filters and $w_k^c$ is the weight that relates class $c$ with filter $k$, which is activated by some visual pattern within its receptive field. Briefly, this technique is a weighted linear sum of these visual patterns at different spatial locations, which gives the most relevant regions for the system.
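The weighted sum that defines the class activation map can be sketched with nested lists as follows. The two 2×2 feature maps and the class weights are made-up values; in the real network the maps come from the last convolutional layer and are upsampled to the image size.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum over channels: cam[y][x] = sum_k w_k * f_k[y][x]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, w in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += w * fmap[y][x]
    return cam


# Two hypothetical 2x2 feature maps and the weights linking them to the polyp class.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
class_weights = [0.5, 1.0]
print(class_activation_map(feature_maps, class_weights))
```

The bottom-right location receives the largest value because the channel that fires there carries the largest weight for the class.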

3.4 Guidelines

After analyzing and discussing each one of the previous aspects, in order to get a full validation of the system, we propose that the following items should be included in the validation methodology:

  1. A fully detailed report of the dataset used.

  2. Training and validation sets must not contain images from the same procedure.

  3. The negative images in the validation set must represent the diversity of the domain. A random sampling or uniform time sampling from the same videos are good strategies (patients and control cases).

  4. The number of negative images in the validation set must be higher than the number of positive images. At least 2k times the number of positive images should be considered.

  5. For small datasets it is necessary to apply a cross validation method.

  6. Recall@80, Recall@90 and Recall@95 should be used in order to make the system comparable with other methods in the community.

  7. A qualitative evaluation is recommended to build trust in the system.

4 Experimental Setup and Results

4.1 Dataset Details

The database used in this paper is composed of 120 procedures from different patients. All these procedures were performed using Medtronic PillCam COLON2 or PillCam SB3. After an exhaustive visual review by expert physicians and trained nurses, polyps were found in 52 of the 120 analyzed procedures. A total of 165 different polyps from those procedures were annotated and used as the positive set. Table 3 summarizes the number of polyps found per procedure. As can be observed, the number of polyps per procedure is diverse: in the majority of procedures the experts did not report any polyp, the average is 1.37 reported polyps per procedure, and the maximum is 11 in a single procedure. Table 4 shows the number of frames in which each polyp is visible within the video. Since most polyps are observed in more than one frame, a total of about 2.1k images with polyps (Table 2) were considered as positive images. More details of the database are reported in Table 5, which summarizes the morphology and size of the polyps. The size of the polyps was determined using the PillCam Software V9.

Regarding the negative set, a total of about 250k images (Table 2) were selected. These images were obtained using random time sampling from the total set of procedures, and were then reviewed by trained nurses.

All the images share the same resolution, and the time stamp and device information were removed.

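| # Polyps | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 11 |
|---|---|---|---|---|---|---|---|---|---|
| # Procedures | 68 | 17 | 11 | 8 | 3 | 5 | 3 | 2 | 3 |

Table 3: Number of polyps per procedure.

| # Frames | 1-2 | 3-4 | 5-6 | 7-10 | 11-20 | 21+ |
|---|---|---|---|---|---|---|
| # Polyps | 33 | 32 | 20 | 19 | 31 | 30 |

Table 4: Number of frames per polyp.

| Size | Sessile | Pedunculated | Undefined | Total |
|---|---|---|---|---|
| 2-6 mm | 65 | 4 | 19 | 88 |
| 7-11 mm | 29 | 4 | 20 | 53 |
| 12+ mm | 8 | 3 | 13 | 24 |
| Total | 102 | 11 | 52 | 165 |

Table 5: Morphology and size of the polyps.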

4.2 Architecture and Evaluation Details

In all experiments, a model pretrained on the ImageNet dataset is used to alleviate the problem of data scarcity. Moreover, to enlarge the number of available images, data augmentation is performed during training by applying rotations, horizontal and vertical flips, and changes in the brightness of the images, each with a random probability.
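The augmentation step can be illustrated with a minimal pure-Python sketch on a grayscale image stored as a list of rows. The rotation angles (multiples of 90 degrees), the probability `p` and the brightness range are assumptions chosen for illustration, and all function names are hypothetical; none of these values are taken from the experiments.

```python
import random

def hflip(img):
    """Mirror the image left-right by reversing every row."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror the image top-bottom by reversing the row order."""
    return [row[:] for row in img[::-1]]

def rot90(img):
    """Rotate 90 degrees clockwise: reverse the row order, then transpose."""
    return [list(col) for col in zip(*img[::-1])]

def adjust_brightness(img, factor):
    """Scale pixel intensities and clip them to the range [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]

def augment(img, rng, p=0.5):
    """Random quarter turns, then flips and a brightness change, each with probability p."""
    for _ in range(rng.randrange(4)):          # 0 to 3 quarter turns (assumed angles)
        img = rot90(img)
    if rng.random() < p:
        img = hflip(img)
    if rng.random() < p:
        img = vflip(img)
    if rng.random() < p:
        img = adjust_brightness(img, rng.uniform(0.8, 1.2))  # assumed range
    return img
```

Calling `augment(image, random.Random(0))` returns a new image and leaves the input untouched, so the same source frame can be augmented repeatedly during training.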

Networks are optimized using Stochastic Gradient Descent with a fixed learning rate for 100 epochs. The margin hyper-parameter of the triplet loss has been fixed to 0.2.
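To make the role of the margin concrete, the following is a minimal sketch of the triplet loss for a single (anchor, positive, negative) triplet. It assumes plain (non-squared) Euclidean distance between embeddings; some formulations use the squared distance instead. The function names are hypothetical and the margin is passed explicitly.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin):
    """Hinge loss: zero once the negative is farther than the positive by at least margin."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

The loss vanishes for triplets that already satisfy the margin, so only triplets in which the negative is not yet sufficiently farther from the anchor than the positive contribute to the gradient.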


To assess performance, results are reported following a 5-fold cross-validation strategy. It is important to remark that the stratified partitions have been made by patient rather than by individual frame, so all images from the same patient belong to the same partition.
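A patient-wise split of this kind can be sketched as follows. This is an illustrative assumption about how such folds might be built, not the procedure actually used: patients are stratified by whether they have at least one polyp frame and then dealt round-robin into the folds. The function name `patient_folds` and the `(patient_id, label)` input format are hypothetical.

```python
import random
from collections import defaultdict

def patient_folds(frames, k=5, seed=0):
    """frames: list of (patient_id, label) pairs, with label 1 for a polyp frame.

    Returns k lists of frame indices; all frames of a patient land in one fold.
    """
    by_patient = defaultdict(list)
    for idx, (pid, _) in enumerate(frames):
        by_patient[pid].append(idx)

    # Patient-level stratum: does the patient have at least one polyp frame?
    has_polyp = {pid: any(frames[i][1] == 1 for i in idxs)
                 for pid, idxs in by_patient.items()}

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    slot = 0
    for stratum in (True, False):
        pids = sorted(p for p in by_patient if has_polyp[p] == stratum)
        rng.shuffle(pids)
        for pid in pids:                       # deal patients round-robin
            folds[slot % k].extend(by_patient[pid])
            slot += 1
    return folds
```

Each fold then serves once as the validation partition while the remaining folds are used for training, so no patient contributes frames to both sides of a split.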

4.3 Quantitative Results

In the first experiment, we compare the performance of the methodologies explained previously: ResNet and the two triplet-based variants. As shown in Table 6, our methodology outperforms the results obtained with the standard optimization of ResNet, while the variant that selects the hardest triplets has not worked at all. Its poor performance could be a consequence of the high diversity of positive and negative images: in each batch, for each anchor image, it uses the most dissimilar positive image and the most similar negative image, which makes the optimization problem extremely difficult. Our methodology achieves a higher AUC than ResNet. The improvement is also reflected in the sensitivity scores, which increase with respect to the next best model; with the same percentage of frames reviewed by the experts, the system finds more pathological frames. Figure 6 shows the ROC curves of the three studied models. On the left side of the curves, our model obtains a higher recall than the other methods at the same specificity, which means that it detects more frames containing polyps, while on the right side of the curve our model and ResNet perform similarly. It is also worth noting that the low standard deviation values of our model indicate that it is more robust than the others.
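The selection rule described above (most dissimilar positive, most similar negative for each anchor), often called batch-hard mining in the triplet-loss literature, can be sketched as below. The Euclidean distance and the function names are assumptions made for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_hard_triplets(embeddings, labels):
    """For each anchor, return (anchor, hardest positive, hardest negative) indices."""
    triplets = []
    for a, (emb_a, lab_a) in enumerate(zip(embeddings, labels)):
        pos = [i for i, lab in enumerate(labels) if lab == lab_a and i != a]
        neg = [i for i, lab in enumerate(labels) if lab != lab_a]
        if not pos or not neg:
            continue                           # no valid triplet for this anchor
        # Hardest positive: same label, farthest from the anchor
        hardest_pos = max(pos, key=lambda i: euclidean(emb_a, embeddings[i]))
        # Hardest negative: other label, closest to the anchor
        hardest_neg = min(neg, key=lambda i: euclidean(emb_a, embeddings[i]))
        triplets.append((a, hardest_pos, hardest_neg))
    return triplets
```

With highly diverse classes, the hardest positive can lie very far from the anchor while the hardest negative sits almost on top of it, which is consistent with the difficulty reported for this variant.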

Parameter Optimization   Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC (%)   Sensitivity (%) at Spec. 95%   Sensitivity (%) at Spec. 90%   Sensitivity (%) at Spec. 80%
Table 6: Performance comparison of ResNet and the two triplet-based variants. Each method has been evaluated with 5-fold cross-validation on the classification task.
Figure 6: ROC Curve of the three models. Each vertical line represents a specificity value that indicates the percentage of true negative images predicted in the video, and the percentage of polyps that the system is expected to detect.
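The "sensitivity at a given specificity" columns and the vertical lines in the ROC plot correspond to fixing an operating point on the negatives and reading off the recall on the positives. A minimal sketch, assuming that a frame is flagged when its score is strictly above the threshold (the function name is hypothetical):

```python
import math

def sensitivity_at_specificity(scores, labels, target_spec):
    """Sensitivity at the lowest threshold whose specificity is >= target_spec.

    scores: predicted polyp scores; labels: 1 for polyp frames, 0 otherwise.
    """
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    # Number of negatives that must fall at or below the threshold;
    # the small epsilon guards against floating-point overshoot.
    need = math.ceil(target_spec * len(neg) - 1e-9)
    threshold = neg[need - 1] if need > 0 else float("-inf")
    # Fraction of polyp frames scored strictly above the threshold
    return sum(s > threshold for s in pos) / len(pos)
```

Lowering the target specificity lowers the threshold, so the returned sensitivity can only stay the same or increase.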

When deep metric losses are used, it is common to add an extra dense layer between the extracted features and the classification layer. This layer adds flexibility to the data representation while compressing the information into the embedding. In the second experiment, we compare our methodology against the same methodology with an extra dense layer that acts as the embedding layer. Table 7 lists the embedding sizes used in these experiments. Although the new networks have more parameters, none of them exceeds the previous AUC score. The AUC and sensitivity values show a correlation between the embedding size and the obtained scores. The model with an embedding size of 2048 outperforms the other models, because this embedding size allows the network to build a better representation for detecting polyps.

Embedding   Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC (%)   Sensitivity (%) at Spec. 95%   Sensitivity (%) at Spec. 90%   Sensitivity (%) at Spec. 80%
Table 7: Performance of our methodology and of versions of it that add an extra dense layer with different embedding sizes. Each method has been evaluated with 5-fold cross-validation on the classification task.

The margin hyper-parameter of the triplet loss has so far been set to 0.2, as in other works such as [triplet_loss] or [defense_triplet]. Since the domain of the problem differs from previous applications of the triplet loss, our third experiment evaluates the behaviour of the system with other margins. As shown in Table 8, none of these margins is better on all the metrics. The larger margins obtain higher standard deviation values than the small margins, indicating that the model is less robust. However, one margin achieves the best results on almost all the reported metrics.

Margin   Accuracy (%)   Sensitivity (%)   Specificity (%)   AUC (%)   Sensitivity (%) at Spec. 95%   Sensitivity (%) at Spec. 90%   Sensitivity (%) at Spec. 80%
0.2 ()
Table 8: Performance of our method when changing the margin parameter. Each network has been evaluated with 5-fold cross-validation on the classification task.

The comparison of models in Tables 6, 7 and 8 shows that our methodology is the best computer-aided decision support system for polyp detection in terms of accuracy and sensitivity.

From a medical point of view, the computer-aided system should help to detect polyps, but it does not necessarily need to detect every image in which a given polyp is seen. For this reason, we analyzed the performance of our proposed system at the polyp level. A global overview of the numerical results is given in Table 9, where each score represents the percentage of detected polyps in different scenarios over the entire dataset. Each column is computed with a different specificity value: 95%, 90% and 80%. The first row of the table contains the percentage of detected polyps, which grows as the specificity decreases. Setting the specificity at 95%, the system misses only 11 out of 165 polyps; if we decrease the specificity to 90% and 80%, the missed polyps are 9 and 5 out of 165, respectively. A complete view of the curve is shown in Figure 7.
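The polyp-level score can be sketched as follows, under the assumption that a polyp counts as detected when at least one of its frames is scored above the operating threshold. The criterion, the function name and the input format are illustrative assumptions.

```python
def polyp_detection_rate(frame_scores, frame_polyp_ids, threshold):
    """Fraction of polyps with at least one frame scored above threshold.

    frame_scores: scores of the positive frames.
    frame_polyp_ids: identifier of the polyp shown in each of those frames.
    """
    best = {}
    for score, pid in zip(frame_scores, frame_polyp_ids):
        # Keep the highest score seen for each polyp
        best[pid] = max(score, best.get(pid, float("-inf")))
    detected = sum(1 for s in best.values() if s > threshold)
    return detected / len(best)
```

For example, with frames scored `[0.9, 0.1, 0.2, 0.3, 0.6]` belonging to polyps `["a", "a", "b", "b", "c"]` and a threshold of 0.5, polyps `a` and `c` are detected and `b` is missed, giving a rate of 2/3 even though only two of the five frames are flagged.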

The second set of results in Table 9 presents the detection rates according to polyp size. For small polyps, with a specificity of 90% or higher the system fails to detect 6 out of 88, but at lower specificity the number of detected polyps increases. For medium-sized polyps, at a specificity of 95% the system misses 4 out of 53. For large polyps, a single polyp is missed at all three specificity values; however, with a slight further decrease in specificity, the polyp is detected.

Finally, the last rows of Table 9 show the detection rates by polyp morphology: sessile, pedunculated or undefined. As reported previously, most polyps are labeled as sessile, and they obtain high detection scores despite the loss of 4 out of 102 at a specificity of 90% or higher. Pedunculated polyps are relatively rare, and the system detected all except one at the three specificity values.

detection Specificity@95 Specificity@90 Specificity@80
Small Polyps 93.18% 93.18% 96.59%
Medium Polyps 92.45% 96.23% 98.11%
Large Polyps 95.83% 95.83% 95.83%
Sessile Polyps 96.08% 96.08% 98.04%
Pedunculated Polyps 90.91% 90.91% 90.91%
Undefined Morphology 88.46% 92.31% 96.15%
Table 9: Detection vs. specificity with our model.
Figure 7: Percentage of polyps detected with two different models.

4.4 Qualitative Results and Polyp Localization

CAM visualization was applied to the output of the network. This method generates a heat map in which red tones mark the regions of the image that produce a high response from the filters. The first row of Figure 8 shows eight polyp frames that illustrate the different morphologies and sizes of the polyps. In the second row, the CAM visualization highlights the location on which the system focused to predict that a polyp was present.
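As an illustration of how such a heat map can be computed, the sketch below follows the classic class activation mapping formulation, in which the map is the sum of the last convolutional feature maps weighted by the classifier weights of the target class. Which CAM variant was used is not specified here, so this formulation, the function name and the min-max normalization are assumptions.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W), rescaled to [0, 1].

    feature_maps: K activation maps from the last convolutional layer.
    class_weights: K weights linking each map to the target class.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[sum(w * fm[y][x] for w, fm in zip(class_weights, feature_maps))
            for x in range(width)]
           for y in range(height)]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = (hi - lo) or 1.0                    # avoid division by zero on flat maps
    return [[(v - lo) / span for v in row] for row in cam]
```

The resulting low-resolution map is typically upsampled to the input size and overlaid on the image with a colour map, so that values close to 1 appear in red tones.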

Figure 8: The first row contains eight examples of TP of our proposed method, with polyps of different morphology and size. The second row incorporates the CAM representation that locates each one of the polyps over the original image.

Figure 9 shows eight images without polyps where the system has erroneously detected a polyp. In these samples, some of the regions highlighted by the network show features associated with polyps, such as tissue growths, mucous membranes or reddish areas of the wall, which might suggest the presence of a polyp.

Figure 9: The first row contains eight examples of FP of our proposed method, where the system has detected polyps. In some images abnormal tissue, mucous membrane or reddish zones can be seen, which are features related to polyps. The second row shows the CAM representation that indicates where these features are located.

Figure 10 shows eight polyp images where the system has not obtained enough features to classify the frame as containing a polyp. In each image, the location of the polyp is outlined. These cases are difficult for the system to detect from a single image. Evaluating the whole sequence of images in which the polyp is seen facilitates detection by the human eye; given the complexity of polyp detection, it is sometimes easier for humans to detect polyps through the sequence.

Figure 10: The images correspond to eight examples of FN of our proposed method, where a polyp is in the frame, but the system couldn’t detect it. To help the reader to find the polyps in the images, the outline of the polyp has been drawn in white color.

Figure 11 shows the second sequence of images in Figure 1 with the output of the system represented by adding a green square around the frames where the system has detected a polyp. Although in this example the system missed two frames where the polyp is present, the detection in four frames is sufficient for the physician to establish the diagnosis.

Figure 11: Polyp sequence where the green squares denote the presence of polyps detected by the system. In this sequence there are two frames where the polyp is not detected; despite this, the support system has found the polyp in the previous frames, allowing the doctor to diagnose the patient.

5 Conclusion

The methodology proposed in this study improves automatic polyp detection in WCE images and additionally enables localization of the polyp in each image. The reported experiments demonstrate that the triplet loss method improves feature extraction, outperforming previous results, and that limited and unbalanced data availability may be alleviated with appropriate losses. Furthermore, the qualitative output of the system may increase trust in the prediction.

Future research will focus on the detection of other intestinal pathologies to develop a complete computer-aided detection system for WCE videos.


The authors would like to thank the team from CorporateHealth International ApS for their feedback and economic support and NVIDIA for their GPU donations. This work has also been supported by MINECO Grant RTI2018-095232-B-C21 and SGR 1742.