In medicine, endoscopy procedures on the gastrointestinal (GI) tract play an important role in helping domain experts track down abnormalities within a patient's GI tract. Such abnormalities may be a symptom of a life-threatening disease such as colorectal cancer. This analysis is typically carried out manually by a medical expert, so detecting critical symptoms relies solely on the experience of the practitioner and is susceptible to human error. As such, we seek to automate the process of endoscopic video analysis, providing support to human experts during diagnosis.
Due to advancements in biomedical engineering, extensive research has been performed to support and improve the detection of anomalies via machine learning and computer vision techniques. These methods have shown great promise, and can detect abnormalities that are easily missed by human experts [9, 25, 13]. Yet automated methods face multiple challenges when analysing endoscopic videos, due to overlaps between symptoms and the difficult imaging conditions.
In one hand-crafted approach, the encoded image features are obtained through bidirectional marginal Fisher analysis (BMFA) and classified using a support vector machine (SVM). In another, local binary patterns (LBP) and edge histogram features are used with logistic regression. A limitation of these hand-crafted methods is that they are highly dependent on the domain knowledge of the human designer, and as such risk losing information that best describes the image. Therefore, given the advancement of deep learning approaches and their ability to learn features automatically, research has shifted towards deep learning methods. However, training these deep learning models from scratch is time consuming and requires a large amount of data. To overcome this challenge, transfer learning has been widely used, whereby a deep neural network that is trained on a different domain is adapted to the target domain by fine-tuning some or all of its layers. Such approaches have been widely used for anomaly detection in endoscopy videos obtained from the GI tract. Recent methods [21, 19] for computer-aided video endoscopy analysis predominantly extract discriminative features from a pre-trained convolutional neural network (CNN), and classify them using a classifier such as a Logistic Model Tree (LMT) or an SVM. A Bayesian optimisation method has also been used to optimise the hyper-parameters of a CNN-based model for endoscopy data analysis, and features from multiple existing pre-trained CNNs have been tested to better detect abnormalities.
One line of related work proposes an architecture that consists of two feature extractors. The outputs of these are multiplied using an outer product at each location of the image and are pooled to obtain an image descriptor. This architecture models local pairwise feature interactions. A hierarchical bilinear pooling framework has also been introduced, which integrates multiple cross-layer bilinear modules to obtain information from intermediate convolution layers. Elsewhere, several skip connections between different layers were added to detect objects at different scales and aspect ratios. In contrast, the proposed work extracts semantic features from different CNN layers and explicitly models the relationship between these through a novel relation mapping network.
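The outer-product pooling described above can be illustrated with a minimal sketch. It assumes sum pooling over locations and omits any normalisation applied in the cited work; the function name `bilinear_pool` and the toy inputs are hypothetical.

```python
from typing import List

Vector = List[float]
FeatureMap = List[Vector]  # one feature vector per image location


def bilinear_pool(feats_a: FeatureMap, feats_b: FeatureMap) -> List[float]:
    """Sum-pool the outer product of two feature vectors over all locations."""
    if len(feats_a) != len(feats_b):
        raise ValueError("both extractors must cover the same locations")
    dim_a, dim_b = len(feats_a[0]), len(feats_b[0])
    pooled = [[0.0] * dim_b for _ in range(dim_a)]
    for vec_a, vec_b in zip(feats_a, feats_b):
        for i, a in enumerate(vec_a):
            for j, b in enumerate(vec_b):
                pooled[i][j] += a * b  # pairwise channel interaction
    # Flatten the (dim_a x dim_b) matrix into a single image descriptor.
    return [value for row in pooled for value in row]


# Toy input: two locations, extractor A with 2 channels, extractor B with 3.
descriptor = bilinear_pool(
    [[1.0, 2.0], [0.0, 1.0]],
    [[1.0, 0.0, 1.0], [2.0, 2.0, 0.0]],
)
```

The descriptor length is the product of the two channel counts, which is why such pairwise models grow quickly with feature depth.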
In this paper, we introduce a relational reasoning approach that is able to map the relationships among individual features extracted by a pre-trained deep neural network. We extract features from the mid layers of a pre-trained deep model and pass them through the relational network, which considers all possible relationships among individual features to classify an endoscopy image. Our primary evaluations are performed on the KVASIR dataset, which contains endoscopic images across eight classes. We also evaluate the proposed model on the Nerthus dataset to further demonstrate its effectiveness. For both datasets, the proposed method outperforms the existing state-of-the-art.
In this paper, we propose a deep relational model that obtains deep information from two feature streams, which are combined to determine the class of the input endoscopy image. An overview of our proposed architecture is given in Figure 1.
Training a CNN model from scratch is time consuming and requires a large dataset. Therefore, in practice it is more convenient to use a pre-trained network and adapt it to the target domain, an approach that has been shown to be effective in both the computer vision [7, 8] and medical [1, 21] domains. To obtain the two feature streams we utilise a pre-trained ResNet50 network, trained on ImageNet. Training on large-scale datasets such as ImageNet improves the ability of the network to capture important patterns in input images and translate them into more discriminative feature vectors that support different computer vision tasks.
When extracting features from a pre-trained CNN model, features from earlier layers contain more local information than those from later layers, whereas later layers contain more semantic information. Combining such features therefore offers more discriminative information and facilitates our final prediction task. In this study, we combine features from an earlier layer and a later layer of the pre-trained CNN model. This allows us to capture spatial and semantic features, both of which are useful for accurate classification of endoscopy images. We avoid features from the final layers as they are over-compressed and primarily contain information for the task on which the network was previously trained (i.e. ImageNet object classification), rather than information relating to our task. Our extracted features are further encoded through 1D convolutional and max pooling layers, and passed through a relational network to map the relationship between feature vectors, facilitating the final classification task.
2.1 Semantic Feature Extractor
The input image is first passed through the Semantic Feature Extractor (SFE) module. The SFE is based on a ResNet50 pre-trained CNN, and features are extracted from two of its layers. Each extracted feature is a three-dimensional tensor, which we reshape to two dimensions before further encoding.
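The reshaping step can be sketched as follows, assuming the spatial grid is flattened row by row so that each location contributes one depth-wise feature vector. The helper name `flatten_spatial` and the toy sizes are hypothetical and are not the shapes used in the paper.

```python
from typing import List

FeatureMap3D = List[List[List[float]]]  # (width, height, depth)
FeatureMap2D = List[List[float]]        # (width * height, depth)


def flatten_spatial(feature_map: FeatureMap3D) -> FeatureMap2D:
    """Reshape a (W, H, D) feature map into a (W * H, D) sequence of vectors."""
    return [list(cell) for row in feature_map for cell in row]


# Hypothetical toy sizes; real CNN feature maps are far larger.
w, h, d = 2, 3, 4
feature_map = [
    [[float(100 * x + 10 * y + c) for c in range(d)] for y in range(h)]
    for x in range(w)
]
sequence = flatten_spatial(feature_map)
```

After this step each stream is a sequence of feature vectors, which is the form a 1D convolution along the sequence axis expects.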
2.2 Relational Network
The resultant two-dimensional feature vectors are passed through separate 1D convolution functions, one per stream, to further encode the features of the individual streams.
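A minimal sketch of such a per-stream encoder is given below. It assumes valid padding, a stride of one, a ReLU non-linearity and non-overlapping max pooling with a pool size of two. Batch normalisation and dropout are omitted, and all weights are toy values chosen for illustration.

```python
from typing import List

Sequence2D = List[List[float]]  # (length, channels)


def conv1d_relu(seq: Sequence2D, kernels: List[List[List[float]]],
                biases: List[float]) -> Sequence2D:
    """Valid 1D convolution along the length axis, followed by ReLU.

    kernels[f][k][c] is the weight of filter f at offset k for input channel c.
    """
    kernel_size = len(kernels[0])
    out = []
    for start in range(len(seq) - kernel_size + 1):
        window = seq[start:start + kernel_size]
        row = []
        for kernel, bias in zip(kernels, biases):
            total = bias + sum(
                w * x
                for weights, vector in zip(kernel, window)
                for w, x in zip(weights, vector)
            )
            row.append(max(0.0, total))
        out.append(row)
    return out


def max_pool1d(seq: Sequence2D, pool_size: int = 2) -> Sequence2D:
    """Non-overlapping max pooling along the length axis."""
    pooled = []
    for start in range(0, len(seq) - pool_size + 1, pool_size):
        window = seq[start:start + pool_size]
        pooled.append([max(column) for column in zip(*window)])
    return pooled


# Toy stream: 6 positions with 2 channels; 2 filters with kernel size 3.
stream = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0], [1.0, 1.0], [3.0, 0.0], [0.0, 2.0]]
kernels = [
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],     # sums channel 0 over the window
    [[0.0, -1.0], [0.0, -1.0], [0.0, -1.0]],  # negated sum of channel 1
]
encoded = max_pool1d(conv1d_relu(stream, kernels, biases=[0.0, 0.0]))
```

In practice a deep learning framework would supply these layers; the explicit loops only make the arithmetic visible.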
Then, through a relational network, we map all possible relations between the two input feature streams. Our relational network is inspired by an existing relational reasoning model; however, there is a clear distinction between the two. That model utilises a relational network to map the relationships among the pixels of an input image, whereas in the proposed work we illustrate that a relational network can be effectively utilised to map the correspondences between two distinct feature streams. The relational network is composed of two functions, both of which are Multi-Layer Perceptrons (MLPs).
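The relational step can be sketched as below, assuming that a pair-wise function is applied to every cross-stream pair of feature vectors, the results are summed, and a second function maps the sum to the output. The single-layer stand-ins for the two MLPs, the names `g` and `f`, and the toy weights are illustrative assumptions rather than the authors' configuration.

```python
from typing import Callable, List

Vector = List[float]


def dense_relu(weights: List[Vector], bias: Vector) -> Callable[[Vector], Vector]:
    """Return a single fully connected layer with ReLU as a callable."""
    def layer(x: Vector) -> Vector:
        return [max(0.0, b + sum(w * v for w, v in zip(row, x)))
                for row, b in zip(weights, bias)]
    return layer


def relation_network(stream_a: List[Vector], stream_b: List[Vector],
                     g: Callable[[Vector], Vector],
                     f: Callable[[Vector], Vector]) -> Vector:
    """Apply g to every cross-stream pair, sum the results, then apply f."""
    if not stream_a or not stream_b:
        raise ValueError("both streams must contain at least one vector")
    total = None
    for vec_a in stream_a:
        for vec_b in stream_b:
            relation = g(vec_a + vec_b)  # concatenate the pair before scoring it
            total = relation if total is None else [
                t + r for t, r in zip(total, relation)
            ]
    return f(total)


# Toy streams of one-dimensional vectors and hand-picked weights.
g = dense_relu([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0])
f = dense_relu([[0.5, 2.0]], [0.0])
relation_vector = relation_network([[1.0], [2.0]], [[1.0], [3.0]], g, f)
```

Because the pair-wise outputs are summed, the result does not depend on the order in which pairs are visited.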
The resultant vector is passed through a decoding function, which is composed of a layer of LSTM cells and three fully connected layers, to generate the classification of the input image.
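The processing chain described in this section can be summarised compactly as follows. All symbols are assumed here for illustration: $X$ is the input image, $f^{SF}_1$ and $f^{SF}_2$ are the two feature-extraction layers, $f^{E}_1$ and $f^{E}_2$ are the per-stream 1D convolutional encoders, $g_{\vartheta}$ and $f_{\phi}$ are the two MLPs of the relational network, and $f^{D}$ is the decoder. The relational term uses the general pair-wise form of relation networks, with pairs taken across the two streams, which is one reading of "all possible relations among the two input feature streams".

```latex
\begin{align}
\theta_1 &= f^{SF}_1(X), & \theta_2 &= f^{SF}_2(X), \\
\beta_1 &= f^{E}_1(\theta_1), & \beta_2 &= f^{E}_2(\theta_2), \\
\gamma &= f_{\phi}\Big(\sum_{i}\sum_{j} g_{\vartheta}\big(\beta_{1,i},\, \beta_{2,j}\big)\Big), \\
y &= f^{D}(\gamma),
\end{align}
```

where $\theta_1$ and $\theta_2$ are taken after reshaping to two dimensions, $\beta_{1,i}$ and $\beta_{2,j}$ index individual feature vectors within each encoded stream, and $y$ is the predicted class distribution.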
We utilise two publicly available endoscopy datasets, KVASIR and Nerthus, to demonstrate the capability of our model to analyse endoscopy images and detect varying conditions within the GI tract.
The KVASIR dataset was released as part of the medical multimedia challenge presented by MediaEval. It is based on images obtained from the GI tract via an endoscopy procedure. The dataset is composed of images that are annotated and verified by medical doctors, and captures 8 different classes. The classes are based on three anatomical landmarks (z-line, pylorus, cecum), three pathological findings (esophagitis, polyps, ulcerative colitis) and two other classes (dyed and lifted polyps, dyed resection margins) related to the polyp removal process. Overall, the dataset contains 8,000 endoscopic images, with 1,000 image examples per class. We utilise the standard test set released by the dataset authors, where 4,000 samples are used for model training and 4,000 for testing.
The Nerthus dataset is composed of 2,552 images from 150 colonoscopy videos. The dataset contains 4 different classes defined by the Boston Bowel Preparation Scale (BBPS) score, which ranks the cleanliness of the bowel; adequate bowel cleanliness is an essential part of a successful colonoscopy (the endoscopic examination of the bowel). The number of examples per class lies within the range 160 to 980, and the data is annotated by medical doctors. We use the training/testing splits provided by the dataset authors.
For the evaluations on the KVASIR dataset we utilise accuracy, precision, recall, F1-score and the Matthews correlation coefficient (MCC), as suggested in prior work. The evaluations on the Nerthus dataset utilise the accuracy metric.
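For reference, these metrics can be computed from a multi-class confusion matrix as sketched below. The sketch assumes macro-averaging over classes for precision, recall and F1-score, and uses the standard multi-class generalisation of the MCC; the averaging scheme actually used for the reported results is not stated here, and the function name and toy matrix are hypothetical.

```python
from math import sqrt
from typing import Dict, List


def multiclass_metrics(confusion: List[List[int]]) -> Dict[str, float]:
    """Compute summary metrics from confusion[t][p] (true class t, predicted p)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(k))
    true_counts = [sum(row) for row in confusion]
    pred_counts = [sum(confusion[t][p] for t in range(k)) for p in range(k)]

    precisions, recalls, f1_scores = [], [], []
    for c in range(k):
        tp = confusion[c][c]
        precision = tp / pred_counts[c] if pred_counts[c] else 0.0
        recall = tp / true_counts[c] if true_counts[c] else 0.0
        denom = precision + recall
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)

    # Multi-class generalisation of the Matthews correlation coefficient.
    numerator = correct * total - sum(
        p * t for p, t in zip(pred_counts, true_counts)
    )
    norm = sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / k,  # macro average
        "recall": sum(recalls) / k,        # macro average
        "f1": sum(f1_scores) / k,          # macro average
        "mcc": numerator / norm if norm else 0.0,
    }


# Toy two-class example: one sample of class 1 is mistaken for class 0.
scores = multiclass_metrics([[2, 0], [1, 1]])
```

For two classes this MCC expression reduces to the familiar binary formula, which gives a quick sanity check on the implementation.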
3.3 Implementation Details
We use a pre-trained ResNet50 network and extract features from two layers: ‘activation_36’ and ‘activation_40’. For the encoding of each feature stream we utilise a 1D convolution layer with a kernel size of 3 and 32 filters, followed by a BatchNorm_ReLu layer and a dropout layer with a dropout rate of 0.25. The LSTM has 300 hidden units, and its output is further passed through three fully connected layers with dimensionalities of 256, 128 and k (the number of classes) respectively. The model is trained using the RMSProp optimiser with a learning rate of 0.001 and learning-rate decay, and is implemented in Keras [5] with a Theano backend.
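As a rough illustration of the optimiser, the sketch below implements a single RMSProp update with a time-based learning-rate decay of the form lr / (1 + decay · step). The decay schedule, the decay value, the smoothing factor `rho` and the stabiliser `eps` are assumptions for illustration; only the base learning rate of 0.001 is taken from the text.

```python
from math import sqrt
from typing import List, Tuple


def rmsprop_step(params: List[float], grads: List[float], cache: List[float],
                 step: int, base_lr: float = 0.001, decay: float = 0.0,
                 rho: float = 0.9, eps: float = 1e-7
                 ) -> Tuple[List[float], List[float]]:
    """Apply one RMSProp update and return the new parameters and cache."""
    lr = base_lr / (1.0 + decay * step)  # assumed time-based decay
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        c = rho * c + (1.0 - rho) * g * g  # running average of squared gradients
        new_cache.append(c)
        new_params.append(p - lr * g / (sqrt(c) + eps))
    return new_params, new_cache


# Toy problem: minimise f(x) = x^2 from x = 1 with a hypothetical decay value.
x, cache = [1.0], [0.0]
for step in range(100):
    grads = [2.0 * x[0]]
    x, cache = rmsprop_step(x, grads, cache, step, decay=1e-6)
```

Dividing by the running root-mean-square of the gradient keeps the effective step size close to the learning rate regardless of the gradient's scale.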
We use the KVASIR dataset for our primary evaluation and compare our results with recent state-of-the-art models (see Table 1). The first block of results in Table 1 comprises the results obtained by various methods introduced for the MediaEval Challenge on the KVASIR data. Among these, one method proposes a dimensionality reduction technique called bidirectional marginal Fisher analysis (BMFA) and classifies the result with a Support Vector Machine (SVM), while another combines 6 different features (JCD, Edge Histogram, Color Layout, AutoColor Correlogram, LBP, Haralick) and uses a logistic regressor to classify them. Aside from hand-crafted feature based methods, ResNet50 CNN features have been extracted and fed to a Logistic Model Tree (LMT) classifier, and a GoogLeNet based model has also been employed. A further approach obtains a collection of hand-crafted features (Tamura, ColorLayout, EdgeHistogram and AutoColorCorrelogram) and deep CNN features (VGGNet and Inception-V3 features), and trains a multi-class SVM. This model records the highest performance among the previous state-of-the-art methods. However, with two streams of deep feature fusion and relation mapping, our proposed model is able to outperform it by 2.3% in accuracy, 5.1% in precision, 4.5% in recall, 5% in F1-score, 5.1% in MCC and 1.4% in specificity.
Prior work has also tested extracting features from input endoscopic images through different pre-trained networks and classifying them through a multi-class SVM. In Table 1 we show these results for ResNet50 features, MobileNet features and a combined deep feature obtained from multiple pre-trained CNNs. In our proposed method we also utilise features from a ResNet50 network, yet instead of naively combining features we utilise the proposed relational network to effectively attend to the feature vectors and derive salient features for classification.
|Combined feat. ||0.838||-||-||-||-||-|
In the confusion matrix for the evaluation results on the KVASIR dataset, we represent the classes as 0 - ‘dyed-lifted-polyps’, 1 - ‘dyed-resection-margins’, 2 - ‘esophagitis’, 3 - ‘normal-cecum’, 4 - ‘normal-pylorus’, 5 - ‘normal-z-line’, 6 - ‘polyps’ and 7 - ‘ulcerative-colitis’ for clarity. Confusions occur primarily between the normal-z-line and esophagitis classes, and for a number of classes all instances are classified correctly.
To further illustrate the importance of our two-stream architecture and the value of the relational network for combining these feature streams, we visualise (in Figure 3) the activations obtained from the LSTM layer of the proposed model and two ablation models, each with only one input stream. The ablation model in Figure 3 (b) receives one of the two feature streams as its input, while the ablation model in Figure 3 (c) receives the other. In the ablation models (b) and (c), as in prior work, the relational network is used to model relationships within a single vector.
The activations are obtained for a randomly selected set of 500 images from the KVASIR test set, and we use t-SNE to plot them in two dimensions.
Considering Fig. 3 (a), we observe that samples from a particular class are tightly grouped and clear separation exists between classes. However, in the ablation models (b) and (c) we observe significant overlaps between the embeddings from different classes, indicating that the model is not capable of discriminating between those classes. These visualisations provide further evidence of the importance of utilising multiple input streams, and how they can be effectively fused together with the proposed relational model to learn discriminative features to support the classification task.
To demonstrate the effectiveness of our model on different problem domains, we evaluated it on the Nerthus dataset. While the task in this dataset, measuring the cleanliness of the bowel based on the BBPS value, is less challenging than the abnormality classification task in the KVASIR dataset, the Nerthus dataset provides a different evaluation scenario in which to investigate the generalisability of the proposed approach. We obtained 100% accuracy when predicting the BBPS value with our proposed model, while the baseline model achieved only 95% accuracy. This illustrates the applicability of the proposed architecture to different classification tasks within the domain of automated endoscopy image analysis.
Endoscopy image analysis is a challenging task and automating this process can aid both the patient and the medical practitioner. Our approach is significantly different from the previous approaches that are based on obtaining handcrafted features or extracting pre-trained CNN features and learning a classifier based on these features. Our relational model, with two discriminative feature streams, is able to map dependencies between feature streams to help detect and identify salient features, and outperforms state-of-the-art methods for the KVASIR and Nerthus datasets. Furthermore, as our model learns the image to label mapping automatically, it is applicable for detecting abnormalities in other medical domains apart from the analysis of endoscopy images.
The research presented in this paper was supported by an Australian Research Council (ARC) grant DP170100632.
-  (2019) On evaluating cnn representations for low resource medical image classification. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1363–1367. Cited by: §1, §2, §3.4, Table 1.
-  (2017) SCL-umd at the medico task-mediaeval 2017: transfer learning based classification of medical images.. In MediaEval, Cited by: §1, §3.4, Table 1.
-  (2016) Theano: a python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688. Cited by: §3.3.
-  (2019) Automatic hyperparameter optimization for transfer learning on medical image datasets using bayesian optimization. In 2019 13th International Symposium on Medical Information and Communication Technology (ISMICT), pp. 1–6. Cited by: §1.
-  (2015) Keras. Note: https://keras.io Cited by: §3.3.
-  (2017) Two stream lstm: a deep fusion framework for human action recognition. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pp. 177–186. Cited by: §2.
-  (2019) Forecasting future action sequences with neural memory networks. British Machine Vision Conference (BMVC). Cited by: §2.
-  (2019) Predicting the future: a jointly learnt model for action anticipation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5562–5571. Cited by: §2.
-  (2019) Triple anet: adaptive abnormal-aware attention network for wce image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 293–301. Cited by: §1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §2, §3.3.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.2.
-  (2017-07) Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3.
-  (2019) Kernel generalized-gaussian mixture model for robust abnormality detection. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 21–29. Cited by: §1.
-  (2015) Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE international conference on computer vision, pp. 1449–1457. Cited by: §1.
-  (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1.
-  (2017) HKBU at mediaeval 2017 medico: medical multimedia task. Cited by: §1, §3.4, Table 1.
-  (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §3.4.
-  (2017) Ensemble of texture features for finding abnormalities in the gastro-intestinal tract.. In MediaEval, Cited by: §1, §3.4, Table 1.
-  (2017) An inception-like cnn architecture for gi disease and anatomical landmark classification.. In MediaEval, Cited by: §1, §3.4, Table 1.
-  (2017) Nerthus: a bowel preparation quality video dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference, pp. 170–174. Cited by: §1, §3.1, §3.4.
-  (2017) Kvasir: a multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference, pp. 164–169. Cited by: §1, §1, §2, §3.1, §3.2, §3.4, Table 1.
-  (2017) Multimedia for medicine: the medico task at mediaeval 2017. Cited by: §3.1, §3.4.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §2.
-  (2017) A simple neural network module for relational reasoning. In Advances in neural information processing systems, pp. 4967–4976. Cited by: §1, §2.2, §3.4.
-  (2019) Retinal abnormalities recognition using regional multitask learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 30–38. Cited by: §1.
-  (2018) Hierarchical bilinear pooling for fine-grained visual recognition. In Proceedings of the European conference on computer vision (ECCV), pp. 574–589. Cited by: §1.