A Comparative Study on Polyp Classification using Convolutional Neural Networks

07/12/2020 ∙ by Krushi Patel, et al. ∙ The University of Kansas

Abstract

Colorectal cancer is the third most common cancer diagnosed in both men and women in the United States. Most colorectal cancers start as a growth on the inner lining of the colon or rectum, called a polyp. Not all polyps are cancerous, but some can develop into cancer. Early detection and recognition of the type of polyp is critical to prevent cancer and change outcomes. However, visual classification of polyps is challenging due to varying illumination conditions during endoscopy, variations in texture and appearance, and overlapping morphology between polyps. More importantly, the evaluation of polyp patterns by gastroenterologists is subjective, leading to poor agreement among observers. Deep convolutional neural networks have proven very successful in object classification across various object categories. In this work, we compare the performance of state-of-the-art general object classification models for polyp classification. We trained a total of six CNN models end-to-end using a dataset of 157 video sequences composed of two types of polyps: hyperplastic and adenomatous. Our results demonstrate that state-of-the-art CNN models can successfully classify polyps with an accuracy comparable to or better than that reported among gastroenterologists. The results of this study can guide future research in polyp classification.

Introduction

Colorectal cancer is the third most common cancer diagnosed in both men and women in the United States [1]. According to the American Cancer Society, a total of 101,420 new cases of colon cancer and 44,180 new cases of rectal cancer were estimated for 2019. The lifetime risk of developing colorectal cancer is about 4.99% for men and 4.15% for women [1]. Colorectal cancer is also the second leading cause of cancer-related deaths. Colon cancer is expected to cause about 51,020 deaths in the United States during 2020.

Polyps are considered the harbinger of colorectal cancer. Early detection and recognition of polyps can reduce deaths caused by colorectal cancer. Broadly speaking, colorectal polyps can be divided into two categories: non-neoplastic (hyperplastic) and neoplastic (adenomatous) [2]. Hyperplastic polyps do not predispose to cancer, whereas adenomatous polyps are considered pre-cancerous, as they account for approximately 85% [3] of sporadic colorectal cancers via the adenoma-carcinoma pathway. For this reason, adenomatous polyps are removed during colonoscopy to prevent future cancer. Differentiating the two types of polyp histology is therefore critical to determine which patients need close follow-up at shorter intervals and which patients can be surveyed every 10 years.

Colonoscopy is the main diagnostic procedure to detect and recognize polyps located on colorectal walls. Accurate detection and correct classification depend on the skills and experience of the endoscopists. However, even for experienced endoscopists, performing conventional colonoscopy for long hours leads to mental and physical fatigue and to degraded analysis and diagnosis. Other factors that may affect the classification results include varying illumination conditions, variations in texture and appearance, and occlusion. Moreover, different types of polyps are hard to differentiate since they may exhibit a very similar appearance with only subtle differences, as shown in Fig 1. Distinguishing one category from the other requires a thorough examination of fine details. Therefore, an accurate and effective automatic computer-aided system for colonoscopy is required to help endoscopists detect polyps and classify their type. Such an automated recognition mechanism can also be used as a second opinion to determine whether a further biopsy is required for diagnosis, which in turn would greatly reduce the cost of diagnosis. In addition, such an intelligent system can be used as an educational resource for gastroenterology trainees to shorten the learning curve and reduce cost.

Fig 1: Examples of polyps from different classes with subtle differences: (a) Upper: three examples of adenomatous polyps. (b) Lower: three examples of hyperplastic polyps. They are visually very similar although they belong to different categories.

In recent years, deep learning algorithms have shown outstanding performance on various generic datasets [4]. In some tasks, including strategic board games, Atari games, and generic object recognition, deep learning even outperforms humans. However, there is a significant difference between generic images and medical images, as medical images contain more quantitative information and the objects have no canonical orientation. In addition, acquiring medical data is expensive, and labeling it requires the involvement of domain experts. In this work, although we have used a total of 27,048 images to train our models, they are extracted from only 119 video sequences, each of which contains one polyp. In short, we have only 119 distinct polyps, imaged from various viewpoints under varying lighting conditions, to train our models.

Based on the results of our previous studies [5][6] and the results of the MICCAI Endoscopic Vision Challenge [7], state-of-the-art object detection models can already yield very high precision in polyp detection. In this study, we therefore assume that the polyps have been detected and focus only on classification.

In our previous work [6], we collected and annotated an endoscopic dataset, which contains 157 video sequences and a total of 35,981 frames. We also labeled the ground truth of the polyp location and histology class. In order to evaluate the performance of different classification models, we generate two polyp datasets from the annotated endoscopic dataset. As shown in Fig 2, one dataset (set-1) contains only the cropped polyp patches from the original video frames; the other dataset (set-2) contains not only the cropped polyps but also around 55% background around the polyps. As described in [8], the surrounding and vascular patterns, as well as the color of the vessels and background, differ according to the type of polyp. Therefore, we generate set-2 to study the effect of background features [8] in polyp classification.

Fig 2: Type of polyp input: the same polyp frame with different versions of input. (a) Full frame, where the actual polyp features are less prominent than the background features. (b) The cropped polyp. (c) The cropped polyp with around 55% background. We generate set-1 using (b) and set-2 using (c) in this study.
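The set-2 crops in Fig 2 (c) can be illustrated with a small sketch. It is not the authors' cropping code: the function name `expand_box`, the example coordinates, and the interpretation of "around 55% background" as the share of the crop area not covered by the polyp bounding box are all assumptions made for illustration.

```python
import math

def expand_box(x0, y0, x1, y1, img_w, img_h, background_fraction=0.55):
    """Grow a polyp bounding box so that background fills roughly the given share of the crop area."""
    # If background is a fraction f of the crop area, the crop area equals
    # box_area / (1 - f), so each side is scaled by sqrt(1 / (1 - f)).
    scale = math.sqrt(1.0 / (1.0 - background_fraction))
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * scale / 2.0
    half_h = (y1 - y0) * scale / 2.0
    # Clamp to the frame; near the image border the realised background share is smaller.
    return (max(0, round(cx - half_w)), max(0, round(cy - half_h)),
            min(img_w, round(cx + half_w)), min(img_h, round(cy + half_h)))

# Hypothetical 100x100 polyp box inside a 640x480 frame.
polyp_box = (200, 150, 300, 250)
crop_box = expand_box(*polyp_box, img_w=640, img_h=480)

polyp_area = (polyp_box[2] - polyp_box[0]) * (polyp_box[3] - polyp_box[1])
crop_area = (crop_box[2] - crop_box[0]) * (crop_box[3] - crop_box[1])
background_share = 1.0 - polyp_area / crop_area
```

Under this reading, each side of the box grows by a factor of about 1.49, so set-1 corresponds to `polyp_box` and set-2 to `crop_box`.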

Fig 2 illustrates the difference between the two generated datasets. We have evaluated and compared the performance of six classification models on these two datasets. Our results show that there is no significant difference in classification accuracy between the two datasets. We have also analyzed the performance based on both individual frames and individual sequences. The major contributions of this work include:

  • We have generated two datasets for polyp classification. To the best of our knowledge, there are no such datasets available in the literature.

  • We have implemented six state-of-the-art deep learning-based image classification models and compared their performance on the two datasets. This is the first comparative evaluation for polyp classification using different convolutional neural network (CNN) models.

  • This study can serve as a baseline for future studies on polyp classification. The trained classification models, as well as the test dataset, will be available for free to the research community on the authors' website.

Related Work

Various approaches and models have been proposed for polyp detection in colonoscopy. A previous comparative validation study on the MICCAI 2015 polyp detection challenge covered models using handcrafted features as well as deep learning models. However, to the best of our knowledge, most previous works focused on polyp detection rather than classification, due to the unavailability of datasets. Very few models have been proposed that classify polyps into the hyperplastic and adenomatous types. Previous polyp classification approaches can be broadly divided into two categories: handcrafted-feature-based models and deep-learning-based models.

Conventional Computer Vision Approaches: Most of the polyp classification work in the literature is based on handcrafted features. Some approaches employ a pit pattern classification scheme to classify polyps [9] into two classes: normal mucosa and hyperplastic. Hafner et al. [10] went beyond the conventional pit pattern approach and exploited a local fractal dimension (LFD) based strategy. Uhl et al. proposed a blob-adapted local fractal dimension (BA-LFD) approach [11] to classify polyps. The maximal-minimal filter bank strategy proposed in [12] outperformed the BA-LFD based approach.

Neural Network Based Approaches: The study [13] provided a first review of various deep-learning-based models for polyp classification. It compared the performance of VGG-VD [14], CNN-F [15], CNN-M [15], CNN-S [15], AlexNet [16], and GoogLeNet [17] on the i-Scan1, i-Scan2, and i-Scan3 databases. The paper [18] utilized a CNN model to classify polyps, but its experiments employed whole-slide images instead. The study [19] classified polyps into informative and non-informative categories instead of hyperplastic and adenomatous.

Deep Learning Models: Inspired by the success of AlexNet [16] in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, convolutional neural networks (CNNs) have attracted a lot of attention and have been successfully applied to image classification [20][21][22], object detection [23][4][24], depth estimation [25][26], image transformation [27][28], and crowd counting [29]. VGGNet [14] and GoogLeNet [17], the top performers in ILSVRC 2014, showed that deeper models could significantly increase representational power. ResNet [30] proposed a skip-connection-based residual module to address the vanishing gradient problem of very deep models. Highway networks [31] proposed a gating mechanism to regulate the flow of information through shortcut connections. ResNeXt [32] employed a multi-branch architecture and showed that cardinality is an essential factor in CNN architecture. Huang et al. proposed DenseNet [33], in which each layer is connected to all subsequent layers. The winner of ILSVRC 2017, SENet [34], achieved 82.7% top-1 accuracy by modeling channel interdependencies at almost no additional computational cost. Recently, EfficientNet [35] introduced a new scaling method for CNNs and achieved improved performance.

Most of the proposed CNN models are based on one of the following three approaches: (1) increasing the depth (number of layers) and/or width of the block architecture; (2) introducing an attention module; and (3) using a neural architecture search mechanism. The models chosen in this work are classical representatives of these three approaches. In the task of object detection, classification models are used as the backbone network, and the performance of object detection relies largely on the backbone network. The most widely adopted backbone networks include VGG, ResNet, and DenseNet. Therefore, we include all three of these models in our study. In addition, we include SENet and MnasNet. SENet employs a novel channel-wise attention mechanism, while MnasNet uses a neural architecture search. Together, these models demonstrate the performance of state-of-the-art CNN models in polyp classification.

Materials and methods

Convolutional neural networks have been widely applied to various computer vision tasks, including object detection and classification. A general CNN consists of different blocks, including an input layer, an output layer, and a number of hidden layers made up of convolution layers, pooling layers, and activation layers. CNNs adaptively learn spatial hierarchies of features via backpropagation through these building blocks. In this section, we briefly review the classical object classification models used in this comparative study. These models include VGG [14], ResNet [30], DenseNet [33], Squeeze-and-Excitation Network (SENet) [34], and MnasNet [36].

VGG

VGG Net [14] was proposed by Simonyan and Zisserman to improve classification performance by adding more convolutional layers to increase the depth of the network. This is made possible by replacing large filters (11×11 and 5×5) with multiple 3×3 filters stacked together. A max pooling layer is used to reduce the spatial dimensions every few layers. The stack of convolution layers is followed by three back-to-back fully connected layers and a softmax layer at the end. VGG is the first network structure that adopts a block-based architecture. ReLU non-linearity is added to all hidden layers. The number of weight parameters in VGG is larger than in the previously proposed AlexNet, though it takes fewer epochs to converge because of the implicit regularization imposed by its depth and small convolution filter size.
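The benefit of stacking small filters can be checked with a little arithmetic. The sketch below assumes 3×3 kernels with stride 1, as in the original VGG design, and equal input and output channel counts; the function names and the channel count `C = 64` are illustrative choices, not taken from this study.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of a stack of stride-1 convolutions with the same kernel size."""
    # The first layer sees `kernel` pixels; each extra layer adds (kernel - 1).
    return 1 + num_layers * (kernel - 1)

def conv_weights(num_layers, kernel, channels):
    """Weight count (biases ignored) for a stack with `channels` input and output maps."""
    return num_layers * kernel * kernel * channels * channels

C = 64  # illustrative channel count
two_3x3 = conv_weights(2, 3, C)    # 2 * 9 * C^2 = 18 C^2
one_5x5 = conv_weights(1, 5, C)    # 25 C^2
three_3x3 = conv_weights(3, 3, C)  # 27 C^2
one_7x7 = conv_weights(1, 7, C)    # 49 C^2
```

Two stacked 3×3 layers cover the same 5×5 window as a single 5×5 layer with fewer weights, and they insert an extra non-linearity in between. This is a per-layer comparison; it does not contradict the statement that VGG as a whole has more parameters than AlexNet.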

ResNet

To address the problem of vanishing gradients in deep neural networks, He et al. [30] proposed ResNet, which is built from residual blocks whose skip connections pass the input of a block to its output without modifying it. In addition, the residual block was restructured with a bottleneck design for the deeper variants of ResNet, such as ResNet-50 and ResNet-101. For each residual block, they used a stack of 3 layers instead of 2 layers, consisting of a 3×3 convolution layer placed between two 1×1 convolution layers. Here the 1×1 layers are responsible for reducing and then restoring the dimensions. Though ResNet is deeper than VGG, it has fewer filters and lower complexity. ResNet-34 requires 3.6 billion FLOPs, which is only 18% of VGG-19.
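The skip connection itself is simple to state in code. The sketch below is a toy illustration on plain Python lists, not a ResNet implementation: `transform` stands in for the stacked convolution layers of a block, and the two example branches are hypothetical.

```python
def residual_block(x, transform):
    """Return F(x) + x, where `transform` plays the role of the residual branch F."""
    fx = transform(x)
    # The identity shortcut adds the unmodified input to the branch output.
    return [f + xi for f, xi in zip(fx, x)]

def zero_branch(x):
    # A branch that has learned nothing: the block reduces to the identity mapping.
    return [0.0 for _ in x]

def halve(x):
    return [0.5 * v for v in x]

identity_out = residual_block([1.0, 2.0], zero_branch)
mixed_out = residual_block([2.0, 4.0], halve)
```

Because the input always reaches the output through the shortcut, the gradient has a direct path back to earlier layers even when the residual branch contributes little.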

DenseNet

Huang et al. [33] proposed DenseNet based on the observation that deep networks are more efficient to train if they contain shorter connections between layers close to the input and layers close to the output. DenseNet is made up of several dense blocks. Within a block, each layer takes the feature maps of all preceding layers as input, and its own feature maps are used as input to all subsequent layers. DenseNet uses a concatenation operation to combine the features from previous layers instead of element-wise addition. In DenseNet, each layer has a small number of filters (12 filters), which makes the network thin and compact. In addition to having fewer weight parameters, DenseNet is easy to train because of the improved flow of information and gradients throughout the network.

As each layer produces k feature maps, a 1×1 convolution layer is used to reduce the number of input feature maps before they are passed to a 3×3 convolution layer. With this design, DenseNet reduces the vanishing gradient problem, strengthens feature propagation, and encourages feature reuse.
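The effect of concatenation on channel counts can be traced with a short bookkeeping sketch. The growth rate of 12 follows the filter count mentioned above; the function name, the 24 input channels, and the six layers are illustrative assumptions rather than the configuration of DenseNet121.

```python
def dense_block_channels(in_channels, growth_rate=12, num_layers=6):
    """Return the input channel count seen by each layer and the block's output channel count."""
    layer_inputs = []
    channels = in_channels
    for _ in range(num_layers):
        layer_inputs.append(channels)  # the layer sees every earlier feature map
        channels += growth_rate        # and appends its own maps by concatenation
    return layer_inputs, channels

layer_inputs, block_output = dense_block_channels(24)
# layer_inputs is [24, 36, 48, 60, 72, 84] and block_output is 96
```

The input width grows linearly with depth, which is why a 1×1 convolution that shrinks the concatenated input before the 3×3 convolution keeps the cost of later layers under control.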

SENet

Researchers have tried to improve accuracy by stacking layers in different ways. Hu et al. [34] proposed a new architectural block, squeeze-and-excitation (SE), based on the observation that not all feature maps are equally important. In conventional convolutional networks, the output feature maps are equally weighted, whereas the SE block weights each channel adaptively in a content-aware manner. In more formal terms, the SE block employs global information to selectively emphasize informative features and suppress less useful ones. The SE block is made up of two operations: squeeze and excitation. The squeeze operation uses global average pooling to generate channel-wise statistics in the form of a C-dimensional feature vector, where C is the number of channels. The excitation operation passes this C-dimensional feature vector through two fully connected layers and generates a vector of the same length. The resulting vector is used to weight the original feature maps. The squeeze-and-excitation block can be embedded into any state-of-the-art object classification model at a slight additional cost. The squeeze-and-excitation network won first place in the ILSVRC 2017 classification task and reduced the top-5 error to 2.251%.
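The squeeze and excitation steps can be sketched on plain nested lists. This is a minimal illustration, not the SE-ResNet50 code used in the study: the ReLU between the two fully connected layers and the final sigmoid follow the original SENet design, biases are omitted, and the weight matrices below are made-up values.

```python
import math

def se_block(feature_maps, w1, w2):
    """Apply squeeze-and-excitation to C feature maps (each an H x W nested list).

    w1 has shape (C/r) x C and w2 has shape C x (C/r); biases are omitted for brevity.
    """
    # Squeeze: global average pooling gives one statistic per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: every channel is multiplied by its gate.
    scaled = [[[g * v for v in row] for row in ch]
              for g, ch in zip(gates, feature_maps)]
    return scaled, gates

maps = [[[1.0, 1.0], [1.0, 1.0]],   # channel 0
        [[3.0, 3.0], [3.0, 3.0]]]   # channel 1
# Hypothetical weights that emphasise channel 0 and suppress channel 1.
scaled, gates = se_block(maps, w1=[[1.0, 1.0]], w2=[[10.0], [-10.0]])
# With all-zero weights every gate is sigmoid(0) = 0.5.
neutral, neutral_gates = se_block(maps, w1=[[0.0, 0.0]], w2=[[0.0], [0.0]])
```

The output has the same shape as the input, which is what allows the block to be dropped into an existing architecture such as ResNet.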

MnasNet

MnasNet [36], proposed by Google Brain, is an automated mobile neural architecture search approach based on reinforcement learning, which can identify a model that achieves a good trade-off between accuracy and latency. MnasNet introduced a hierarchical search space that provides layer diversity throughout the network instead of repeatedly stacking the same cells. The main components of MnasNet include (i) an RNN controller used for sampling model architectures; (ii) a trainer used to train the models sampled by the RNN controller; and (iii) a mobile-phone-based inference engine for measuring latency. MnasNet has been evaluated on the ImageNet [37] and COCO [38] datasets. In this work, we used the architecture that was found by MnasNet on the ImageNet [37] dataset.

Implementation

Dataset Preparation

In order to evaluate the performance of different models on the classification of polyps, we collected and labeled the following datasets.

  1. MICCAI 2017 Dataset: This dataset was published at the GIANA Endoscopic Vision Challenge held at MICCAI 2017. It contains 18 short videos for training and 20 videos for testing [7]. Each frame in the training set has its associated ground truth in the form of a segmentation mask.

  2. CVC ColonDB Dataset: This dataset was published by Bernal et al. [39] and contains 15 short colonoscopy video sequences with ground-truth polyp segmentation masks.

  3. ISIT-UMR Colonoscopy Dataset: This dataset was published by Mesejo et al. [40]. It contains 76 short video sequences. Each video sequence is labeled with its polyp category, but there is no segmentation ground truth.

  4. KUMC Colonoscopy Dataset: This dataset was collected at the University of Kansas Medical Center with ethical oversight. It consists of 80 colonoscopy video sequences.

With the help of three endoscopists from the medical school of Jilin University and the University of Kansas Medical Center, we labeled the polyp class of all videos in datasets 1, 2, and 4. We also annotated the bounding boxes for all the polyps in datasets 3 and 4. During the annotation process, the endoscopists could not reach an agreement on some sequences, since these may need further biopsy verification. Those videos were removed from the datasets. We finally obtained a dataset of 157 videos (35,981 frames) with labeled ground truth for the polyp histology and bounding boxes.

For the labeled dataset, we randomly split all the videos into training, validation, and test sets containing 119, 16, and 22 video sequences, respectively. The study focuses on evaluating the performance of state-of-the-art classification models. We assume the polyps have been accurately detected and generate two separate datasets for the evaluation. As shown in Fig 2, set-1 contains only the patches of the cropped polyps, and set-2 contains not only the cropped polyps but also about 55% background around the polyps.
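Splitting at the level of videos rather than frames keeps all frames of one polyp in the same subset. A minimal sketch of such a split is given below; the integer video identifiers, the seed, and the function name are hypothetical, and the actual random assignment used in the study is not specified.

```python
import random

def split_by_video(video_ids, n_val=16, n_test=22, seed=0):
    """Randomly assign whole videos to train/validation/test so no polyp spans two subsets."""
    ids = sorted(video_ids)
    random.Random(seed).shuffle(ids)  # fixed seed only for reproducibility of the sketch
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

# 157 labeled videos, identified here by placeholder integers.
train_ids, val_ids, test_ids = split_by_video(range(157))
```

Frames are then gathered per subset from the videos assigned to it, which avoids near-duplicate frames of the same polyp leaking from the training set into the test set.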

Training

In this study, we implemented and compared a total of six classical classification models: VGG19 with and without batch normalization [14], ResNet50 [30], DenseNet121 [33], SE-ResNet50 [34], and MnasNet [36]. The training dataset contains 119 sequences (27,048 images). We trained all the models using NVIDIA Tesla K80 or P100 GPUs. The hyperparameters used to train the models are listed in Table 1. All models were initialized with pre-trained ImageNet weights, and the training time of each model ranged from 1 to 3 hours.

Table 1: Hyperparameters used to train the different models.

Model       Learning rate   Batch size   Epochs   Step size   Gamma
VGG19       0.001           32           25       -           -
VGG19-BN    0.001           32           25       -           -
ResNet50    0.001           64           25       -           -
DenseNet    0.001           64           25       -           -
SE-ResNet   0.001           64           50       30          0.1
MnasNet     0.001           64           150      -           -
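The step size and gamma entries for SE-ResNet in Table 1 read like a step-decay learning rate schedule. The sketch below assumes that interpretation (the learning rate is multiplied by gamma every `step_size` epochs, as in common deep learning frameworks); the function name is hypothetical, and the training framework is not stated in the text.

```python
def step_lr(base_lr, epoch, step_size=None, gamma=1.0):
    """Learning rate at a given epoch under step decay; constant when no step size is set."""
    if step_size is None:
        return base_lr
    return base_lr * gamma ** (epoch // step_size)

# SE-ResNet row of Table 1: 50 epochs, step size 30, gamma 0.1.
se_resnet_schedule = [step_lr(0.001, e, step_size=30, gamma=0.1) for e in range(50)]
# Rows marked "-" keep the base learning rate for the whole run.
vgg_schedule = [step_lr(0.001, e) for e in range(25)]
```

Under this assumption, SE-ResNet trains at 0.001 for the first 30 epochs and at 0.0001 for the remaining 20.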

Evaluation Metrics

In the experiments, we train each model until it achieves its best performance on the validation set. To evaluate model performance, we calculate the top-1 classification error. In order to make a fair comparison of the different models, performance is also evaluated in terms of sensitivity, specificity, accuracy, precision, and F1-score. The definitions of these metrics are listed in Table 2. We evaluate the performance of all models on each sequence individually for both datasets.

Table 2: Evaluation metrics used in the comparison. Precision, recall (class-based accuracy), and F1-score are calculated for both classes.

Metric                      Definition (polyp classification)
True Positive (TP)          Number of adenomatous polyps that are correctly classified
True Negative (TN)          Number of hyperplastic polyps that are correctly classified
False Positive (FP)         Number of hyperplastic polyps that are incorrectly classified as adenomatous
False Negative (FN)         Number of adenomatous polyps that are incorrectly classified as hyperplastic
Sensitivity                 % of actual adenomas that are correctly classified. Also termed recall or accuracy of adenoma.
Specificity                 % of actual hyperplastic polyps that are correctly classified. Also termed recall or accuracy of hyperplastic.
Precision (Adenoma)         % of predicted adenomas that are truly adenomas.
Precision (Hyperplastic)    % of predicted hyperplastic polyps that are truly hyperplastic.
Accuracy                    Overall accuracy of both classes.
F1-Score                    Harmonic mean of precision and recall.
Error                       Top-1 classification error (100% minus accuracy).
ROC                         Receiver operating characteristic curve
AUC                         Area under the curve (of ROC)
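The definitions in Table 2 translate directly into code. The sketch below treats adenomatous as the positive class, as the table does, and uses the frame counts reported for VGG-19 on set-1 in the results; the function name and dictionary keys are illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the Table 2 metrics (in percent) with adenomatous as the positive class."""
    sensitivity = tp / (tp + fn)     # recall of adenomatous
    specificity = tn / (tn + fp)     # recall of hyperplastic
    precision_ade = tp / (tp + fp)
    precision_hyp = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1_ade = 2 * precision_ade * sensitivity / (precision_ade + sensitivity)
    f1_hyp = 2 * precision_hyp * specificity / (precision_hyp + specificity)
    values = {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision_adenoma": precision_ade,
        "precision_hyperplastic": precision_hyp,
        "accuracy": accuracy,
        "error": 1.0 - accuracy,
        "f1_adenoma": f1_ade,
        "f1_hyperplastic": f1_hyp,
    }
    return {name: 100.0 * value for name, value in values.items()}

# Frame counts reported for VGG-19 on set-1.
metrics = classification_metrics(tp=2424, tn=1149, fp=680, fn=466)
```

These counts give a sensitivity of about 83.9% and a specificity of about 62.8%, in line with the reported values up to rounding in the last digit.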

Results

In this section, we report the classification results of all compared models on the two datasets. All input images are resized to the same size for a fair comparison. All models include batch normalization except VGG-19. The test set contains a total of 22 sequences (4719 frames), of which 13 sequences (2890 frames) are adenomatous and 9 sequences (1829 frames) are hyperplastic. All models employ softmax as the classifier to yield the scores for the two classes, and each model outputs the class corresponding to the higher score. The top-1 error, precision, recall (individual class accuracy), and F1-score for both categories are shown in Table 3. To alleviate the influence of illumination variations, all images in the datasets were normalized with respect to their mean and standard deviation. The mean and standard deviation of both datasets are listed in Table 4.

Table 3: Evaluation results. Overall performance of all models on set-1 and set-2 based on individual frames, irrespective of sequence. Ade and Hyper are the recall of the adenomatous and hyperplastic classes; Pre-1 and F1-1 refer to the adenomatous class, and Pre-2 and F1-2 to the hyperplastic class. All columns from Ade to AUC are percentages.

Model             TP     TN     FP    FN    Ade     Hyper   Acc     Err     Pre-1   Pre-2   F1-1    F1-2    AUC
VGG-19(set-1)     2424   1149   680   466   83.87   62.82   75.71   24.28   78.09   71.14   80.88   66.72   76.43
VGG-19(set-2)     2419   1346   483   471   83.70   73.59   79.78   20.21   83.35   74.07   83.52   73.83   84.80
VGG19-BN(set-1)   2071   1440   389   819   71.66   78.73   74.40   25.59   84.18   63.74   77.42   70.45   78.58
VGG19-BN(set-2)   2295   1345   484   595   79.41   73.53   77.13   22.86   82.58   69.32   80.96   71.37   82.20
ResNet50(set-1)   2350   1222   607   540   81.31   66.81   75.69   24.30   79.47   69.35   80.38   68.05   77.25
ResNet50(set-2)   2042   1305   524   848   70.65   71.35   70.92   29.07   79.57   60.61   74.85   65.54   76.27
DenseNet(set-1)   2246   1282   547   644   77.71   70.09   74.76   25.23   80.41   66.56   79.04   68.28   79.28
DenseNet(set-2)   2065   1306   523   825   71.45   71.40   71.43   28.56   79.79   61.28   75.39   65.95   78.65
SENet(set-1)      2230   1320   509   660   77.16   72.17   75.22   24.77   81.41   66.66   79.23   69.30   72.78
SENet(set-2)      2338   1138   691   552   80.89   62.21   73.65   26.34   77.18   62.21   78.99   64.67   82.05
MnasNet(set-1)    2239   1213   616   651   77.47   66.32   73.15   26.84   78.42   65.07   77.94   65.69   73.32
MnasNet(set-2)    2115   1242   587   775   73.18   67.90   71.13   28.86   78.27   61.57   75.64   64.58   77.11

Table 4: Mean and standard deviation of set-1 and set-2, used to normalize the input images.

Dataset   Mean                       Standard deviation
Set-1     [0.6916, 0.5297, 0.4158]   [0.1439, 0.1377, 0.1306]
Set-2     [0.6594, 0.5112, 0.4026]   [0.2469, 0.2254, 0.2095]
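The normalization with the Table 4 statistics amounts to a per-channel shift and scale. The sketch below assumes pixel values already scaled to the range [0, 1] and a channel order matching the three entries of each vector; neither detail is stated in the text, and the function name is hypothetical.

```python
SET1_MEAN = [0.6916, 0.5297, 0.4158]
SET1_STD = [0.1439, 0.1377, 0.1306]

def normalize_pixel(pixel, mean, std):
    """Subtract the dataset mean and divide by the dataset standard deviation, channel by channel."""
    return [(value - m) / s for value, m, s in zip(pixel, mean, std)]

# A pixel equal to the dataset mean maps to zero in every channel.
centered = normalize_pixel(SET1_MEAN, SET1_MEAN, SET1_STD)
# A pixel one standard deviation above the mean maps to roughly one.
shifted = normalize_pixel([m + s for m, s in zip(SET1_MEAN, SET1_STD)],
                          SET1_MEAN, SET1_STD)
```

The larger standard deviations of set-2 in Table 4 are consistent with the extra background widening the intensity distribution of the crops.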

Discussion

Frame-based Performance

We first report the comparative performance of the different models based on each individual frame. Frame-based performance is measured without considering the sequence to which each frame belongs. It measures the overall accuracy in the same way as generic classification evaluation on other datasets. As shown in Table 3, VGG-19 outperforms all other models with an overall accuracy of 75.71% and 79.78% for set-1 and set-2, respectively. The recall of the adenomatous class is higher than that of the hyperplastic class for every model in both datasets, except for VGG-19 with batch normalization (on set-1) and ResNet50 (on set-2). If we consider precision and F1-score, the values for the adenomatous class are always higher than those for the hyperplastic class, for every model in both datasets. VGG-19 also achieves the highest recall for both classes on set-2. The more recently proposed models, such as ResNet, SENet, and MnasNet, did not perform as well on either dataset, although they perform better than VGG-19 on generic image classification datasets.

From Table 3 we also observe that VGG-19 outperforms VGG-19 with batch normalization in most metrics. This contradicts what has been observed on other datasets. The reason might be that, in polyp classification, the exact intensity values of the pixels are more useful for discriminating between different types of polyps than they are in generic image classification. Batch normalization layers, however, rescale the activations with respect to batch statistics, which may distort the intensity information and degrade the performance.

To better visualize the performance, we use the receiver operating characteristic (ROC) curve and the area under the curve (AUC) to demonstrate the frame-based performance. The AUC represents the degree of separability of a classification problem and demonstrates the capability of a model to differentiate between classes. Fig 3 and Fig 4 show the ROC curves of the different models for set-1 and set-2, respectively. The results show that, in general, the models achieve better classification performance on set-2 than on set-1, except for ResNet50 and DenseNet. We can also see that VGG-19 achieves the highest AUC and the best accuracy on set-2.
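The separability interpretation of the AUC can be made concrete: the AUC equals the probability that a randomly chosen positive frame receives a higher score than a randomly chosen negative frame. The sketch below computes it by direct pairwise comparison; the scores and labels are toy values, not outputs of the models in this study, and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy adenomatous-class scores; label 1 = adenomatous, 0 = hyperplastic.
toy_auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
perfect_auc = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
chance_auc = roc_auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

The pairwise loop is quadratic in the number of frames, so a rank-based formulation is preferable for large test sets, but it makes the meaning of the score explicit.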

Fig 3: AUC-ROC curves of different models on set-1: (a) VGG19 (b) VGG19-BN (c) ResNet50 (d) DenseNet (e) SENet (f) MnasNet

Fig 4: AUC-ROC curves of different models on set-2: (a) VGG19 (b) VGG19-BN (c) ResNet50 (d) DenseNet (e) SENet (f) MnasNet

Sequence-based Performance

Based on the classification of each frame, we can measure the performance on each sequence. The sequence-by-sequence performance for the two datasets is shown in Fig 5 and Fig 6, respectively. We can see that the results are not consistent among all frames within the same sequence of the same polyp. This is because the polyp may be subject to significant appearance changes due to variations in viewpoint, zoom scale, and illumination. Fig 7 shows some sample frames of a sequence under different viewpoints and lighting conditions. In this case, even experienced endoscopists cannot make an accurate prediction from a single frame. As a result, not all frames can be correctly classified. In practice, we calculate the percentage of correctly classified frames for each sequence. Then, we set a threshold in terms of this percentage, and a sequence is considered to be correctly classified if the percentage of correctly classified frames is greater than the specified threshold. Table 5 shows the performance corresponding to different thresholds for the two datasets.

Table 5: Sequence-based accuracy (%) of all models at different thresholds. In each cell, the value before '/' is the accuracy on set-1 and the value after '/' is the accuracy on set-2.

Model       Threshold(70%)   Threshold(60%)   Threshold(50%)
VGG-19      63.63/68.18      72.72/81.81      81.81/90.90
VGG19-BN    69.63/68.18      72.72/81.81      81.81/90.90
ResNet50    68.18/59.09      77.27/72.72      86.36/81.81
DenseNet    59.09/63.63      72.72/68.18      86.36/68.18
SE-ResNet   63.63/54.54      72.72/72.72      72.72/77.27
MnasNet     54.54/54.54      68.18/68.18      81.81/81.81
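The thresholding rule behind Table 5 can be sketched as follows. The per-frame results below are toy data for four hypothetical sequences, not the 22 test sequences of the study, and the function name is illustrative.

```python
def sequence_accuracy(frame_correct_by_seq, threshold):
    """Percentage of sequences whose share of correctly classified frames exceeds the threshold."""
    correct_sequences = 0
    for flags in frame_correct_by_seq.values():
        share_correct = sum(flags) / len(flags)
        if share_correct > threshold:
            correct_sequences += 1
    return 100.0 * correct_sequences / len(frame_correct_by_seq)

# Toy per-frame outcomes (True = frame classified correctly).
toy_results = {
    "seq_a": [True] * 8 + [False] * 2,    # 80% of frames correct
    "seq_b": [True] * 13 + [False] * 7,   # 65%
    "seq_c": [True] * 11 + [False] * 9,   # 55%
    "seq_d": [True] * 4 + [False] * 6,    # 40%
}
accuracy_by_threshold = {t: sequence_accuracy(toy_results, t) for t in (0.7, 0.6, 0.5)}
```

With 22 test sequences, each sequence contributes about 4.55 percentage points, which is why the entries of Table 5 cluster at values such as 68.18 (15 of 22) and 81.81 (18 of 22).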

Fig 5: Sequence-based performance of set-1: The performance of different models for each test sequence of set-1.

Fig 6: Sequence-based performance of set-2: The performance of different models for each test sequence of set-2.

Fig 7: Images from different viewpoints: Six sample frames, (a) to (f), from the same sequence. The same polyp looks considerably different due to the variations of viewpoints and lighting conditions.

As shown in Fig 5 and Fig 6, the classification result is not consistent across sequences. Test sequences 1, 3, 10, 12, 13, 14, 18, 19, 21, and 22 are correctly classified by all models for both datasets, while the results for sequences 2, 4, 5, 6, 7, 9, 11, 17, and 20 are not consistent because the percentage of correctly classified frames lies between 40% and 50%. Sequences 5 and 6 could not be classified well by almost any of the models. Some sample frames of sequences 5 and 6 are shown in Fig 8; they are subject to large variations in appearance, which makes classification difficult. Table 5 shows the threshold-based performance of all models. The results indicate how consistent the predictions of the different models are, and we can see that the VGG models achieve relatively better performance than the other models. For example, on set-2, VGG-19 achieves around 70%, 80%, and 90% accuracy at the thresholds of 70%, 60%, and 50%, respectively. Comparing Table 3 and Table 5, we find that at a threshold of 50% the sequence-based accuracy is much higher than the frame-based accuracy, especially for the VGG models. However, at the higher threshold of 70%, the frame-based accuracy is higher than the sequence-based accuracy, which indicates that the predictions are not fully consistent within each sequence.

To better visualize the sequence-based performance, we include box plots, which show the distribution of per-sequence accuracy over all 22 test sequences. Fig 9 shows the box plots of the different models on set-1 and set-2, respectively. The maximum accuracy of every model is 100% because each model classifies at least one sequence entirely correctly. The range of the upper half depends on the median: because the maximum is 100%, a high median shrinks the upper half of the distribution, which reflects the ability of the model to classify sequences correctly and consistently. On set-1, VGG-19 achieves the highest median value, which indicates that half of the sequences are correctly classified even at a very high threshold. On set-2, ResNet-50 yields the most consistent results with the highest median value. We can also see from the results that the upper quartile ranges are smaller than the lower quartile ranges, which indicates that the spread of accuracy below the median is very large.
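The quantities read off a box plot can be computed directly from the per-sequence accuracies. The sketch below is a minimal illustration using only the Python standard library. The accuracy values are made up, and the names `box_plot_summary`, `upper_range`, and `lower_range` are hypothetical and not taken from this study.

```python
import statistics


def box_plot_summary(accuracies):
    """Five-number summary plus the ranges above and below the median."""
    values = sorted(accuracies)
    median = statistics.median(values)
    # quantiles(n=4) returns the three quartile cut points; only Q1 and Q3 are used.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": values[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values[-1],
        "upper_range": values[-1] - median,  # spread above the median
        "lower_range": median - values[0],   # spread below the median
    }


# Made-up per-sequence accuracies for one model (fraction of correct frames).
toy_accuracies = [1.0, 1.0, 0.9, 0.95, 0.45, 0.2, 0.8, 1.0]
summary = box_plot_summary(toy_accuracies)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With a maximum of 1.0 and a median of 0.925, the range above the median in this toy example is far smaller than the range below it, which is the pattern described for Fig 9.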


Fig 8: Misclassified sequences: Sample frames from different sequences that could not be correctly classified by almost all models. (a) and (b) are sequences 5 and 6, respectively, where sequence 5 is of type adenomatous and sequence 6 is of type hyperplastic.


Fig 9: Box plots of set-1 and set-2: The distribution of per-sequence accuracy for the different models on (a) set-1 and (b) set-2.

Polyp Crops vs Crops with Background

To test the role of background information in polyp classification, we generated two datasets for the experiments: set-1 contains only polyp crops, and set-2 contains polyp crops with 50% background. From Table 3 we can see that, in terms of frame-based performance, all models except the VGG models achieve higher accuracy on set-1 than on set-2. In terms of the overall AUC-ROC score, set-2 yields better performance, which means the two classes are easier to distinguish in set-2 than in set-1. In the sequence-based analysis, the performance on each sequence is similar for both datasets. For consistency-based performance, VGG-19, VGG-19 with batch normalization, and DenseNet are more consistent on set-2, whereas for the other models the overall threshold-based accuracy is very close on the two sets. If we consider the box plots and use the median as the threshold, ResNet, DenseNet, and SENet classify sequences correctly more consistently on set-2.
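Accuracy and AUC-ROC can rank two datasets differently because accuracy depends on a fixed decision threshold, whereas AUC-ROC only measures how well the scores separate the two classes. The following minimal Python sketch computes AUC-ROC from its pairwise definition, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The labels, scores, 0.5 decision threshold, and function names are made-up illustrations and not results from this study.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Accuracy when a score >= threshold is predicted as class 1."""
    preds = [int(s >= threshold) for s in scores]
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


# Made-up example; here 1 and 0 stand for the two polyp classes (an assumption).
labels = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.2, 0.4, 0.3, 0.1]    # one positive scored very low
scores_b = [0.45, 0.4, 0.6, 0.3, 0.2, 0.1]   # perfect ranking, low scores

for name, scores in (("A", scores_a), ("B", scores_b)):
    print(name, "accuracy:", round(accuracy(labels, scores), 3),
          "AUC-ROC:", round(auc_roc(labels, scores), 3))
```

In this example score set A has the higher accuracy (5/6 versus 4/6), while score set B has the higher AUC-ROC (1.0 versus 7/9), so a dataset can be easier to separate by score even when its accuracy at a fixed threshold is lower.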

Conclusion

In this paper, we have established two datasets and compared six state-of-the-art deep learning-based classification models. We have evaluated the results at both the frame level and the polyp level. Our results show that VGG-19, in general, outperforms the other models in both cases on both datasets. Some more advanced classification models, such as ResNet, DenseNet, SENet, and MnasNet, did not perform well in our experiments, although they have advantages on other benchmark datasets. The poor performance may be caused by the limited size of the polyp dataset. This study provides a good baseline for future research to develop more accurate and more robust polyp classification models.

Acknowledgement

The authors would like to thank Dr. Vijay Kanakadandi at the University of Kansas Medical Center for his insightful help and advice for this study.

References

  •  1. American Cancer Society. Key Statistics for Colorectal Cancer.
  •  2. Shinya H, Wolff WI. Morphology, anatomic distribution and cancer potential of colonic polyps. Annals of surgery. 1979;190(6):679.
  •  3. Kim DH, Pickhardt PJ. Chapter 1 - Colorectal polyps: Overview and classification. In: Pickhardt PJ, Kim DH, editors. CT Colonography: Principles and Practice of Virtual Colonoscopy. 2010. p. 3–9.
  •  4. Li K, Ma W, Sajid U, Wu Y, Wang G. Object Detection with Convolutional Neural Networks. arXiv preprint arXiv:1912.01844. 2019.
  •  5. Mo X, Tao K, Wang Q, Wang G. An efficient approach for polyps detection in endoscopic videos based on faster R-CNN. In: 2018 24th International Conference on Pattern Recognition (ICPR). IEEE; 2018. p. 3929–3934.
  •  6. Li K, Fathan MI, Patel K, Wang G. Colonoscopy Polyp Detection and Classification: Dataset Creation and Comparative Evaluation. ITTC Technical Report, the University of Kansas. 2019.
  •  7. Bernal J, Tajbakhsh N, Sánchez FJ, Matuszewski BJ, Chen H, Yu L, et al. Comparative validation of polyp detection methods in video colonoscopy: results from the MICCAI 2015 endoscopic vision challenge. IEEE transactions on medical imaging. 2017;36(6):1231–1249.
  •  8. NICE Polyp Classification. https://www.endoscopy-campus.com/en/classifications/polyp-classification-nice/.
  •  9. Wimmer G, Gadermayr M, Kwitt R, Häfner M, Merhof D, Uhl A. Evaluation of i-scan virtual chromoendoscopy and traditional chromoendoscopy for the automated diagnosis of colonic polyps. In: International Workshop on Computer-Assisted and Robotic Endoscopy. Springer; 2016. p. 59–71.
  •  10. Häfner M, Tamaki T, Tanaka S, Uhl A, Wimmer G, Yoshida S. Local fractal dimension based approaches for colonic polyp classification. Medical image analysis. 2015;26(1):92–107.
  •  11. Uhl A, Wimmer G, Hafner M. Shape and size adapted local fractal dimension for the classification of polyps in HD colonoscopy. In: 2014 IEEE International Conference on Image Processing (ICIP). IEEE; 2014. p. 2299–2303.
  •  12. Wimmer G, Uhl A, Häfner M. A novel filterbank especially designed for the classification of colonic polyps. In: 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE; 2016. p. 2150–2155.
  •  13. Ribeiro E, Uhl A, Wimmer G, Häfner M. Exploring deep learning and transfer learning for colonic polyp classification. Computational and mathematical methods in medicine. 2016;2016.
  •  14. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014.
  •  15. Chatfield K, Simonyan K, Vedaldi A, Zisserman A. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531. 2014.
  •  16. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems; 2012. p. 1097–1105.
  •  17. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015. p. 1–9.
  •  18. Korbar B, Olofson AM, Miraflor AP, Nicka CM, Suriawinata MA, Torresani L, et al. Deep learning for classification of colorectal polyps on whole-slide images. Journal of pathology informatics. 2017;8.
  •  19. Akbari M, Mohrekesh M, Rafiei S, Soroushmehr SR, Karimi N, Samavi S, et al. Classification of Informative Frames in Colonoscopy Videos Using Convolutional Neural Networks with Binarized Weights. In: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE; 2018. p. 65–68.
  •  20. Cen F, Wang G. Boosting occluded image classification via subspace decomposition-based estimation of deep features. IEEE transactions on cybernetics. 2019.
  •  21. Cen F, Wang G. Dictionary representation of deep features for occlusion-robust face recognition. IEEE Access. 2019;7:26595–26605.
  •  22. Wu Y, Zhang Z, Wang G. Unsupervised deep feature transfer for low resolution image classification. In: Proceedings of the IEEE International Conference on Computer Vision Workshops; 2019. p. 0–0.
  •  23. Ma W, Wu Y, Cen F, Wang G. MDFN: Multi-scale deep feature learning network for object detection. Pattern Recognition. 2020;100:107149.
  •  24. Ma W, Wu Y, Wang Z, Wang G. Mdcn: Multi-scale, deep inception convolutional neural networks for efficient object detection. In: 2018 24th International Conference on Pattern Recognition (ICPR). IEEE; 2018. p. 2510–2515.
  •  25. He L, Wang G, Hu Z. Learning depth from single images with deep neural network embedding focal length. IEEE Transactions on Image Processing. 2018;27(9):4676–4689.
  •  26. He L, Yu M, Wang G. Spindle-Net: CNNs for monocular depth inference with dilation kernel method. In: 2018 24th International Conference on Pattern Recognition (ICPR). IEEE; 2018. p. 2504–2509.
  •  27. Xu W, Shawn K, Wang G. Toward learning a unified many-to-many mapping for diverse image translation. Pattern Recognition. 2019;93:570–580.
  •  28. Xu W, Keshmiri S, Wang G. Adversarially approximated autoencoder for image generation and manipulation. IEEE Transactions on Multimedia. 2019;21(9):2387–2396.
  •  29. Sajid U, Sajid H, Wang H, Wang G. Zoomcount: A zooming mechanism for crowd counting in static images. IEEE Transactions on Circuits and Systems for Video Technology. 2020.
  •  30. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 770–778.
  •  31. Srivastava RK, Greff K, Schmidhuber J. Highway networks. arXiv preprint arXiv:1505.00387. 2015.
  •  32. Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1492–1500.
  •  33. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 4700–4708.
  •  34. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 7132–7141.
  •  35. Tan M, Le QV. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv preprint arXiv:1905.11946. 2019.
  •  36. Tan M, Chen B, Pang R, Vasudevan V, Sandler M, Howard A, et al. Mnasnet: Platform-aware neural architecture search for mobile. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2019. p. 2820–2828.
  •  37. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE; 2009. p. 248–255.
  •  38. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, et al. Microsoft coco: Common objects in context. In: European conference on computer vision. Springer; 2014. p. 740–755.
  •  39. Bernal J, Sánchez J, Vilarino F. Towards automatic polyp detection with a polyp appearance model. Pattern Recognition. 2012;45(9):3166–3182.
  •  40. Mesejo P, Pizarro D, Abergel A, Rouquette O, Beorchia S, Poincloux L, et al. Computer-aided classification of gastrointestinal lesions in regular colonoscopy. IEEE transactions on medical imaging. 2016;35(9):2051–2063.