Unsupervised Shot Boundary Detection for Temporal Segmentation of Long Capsule Endoscopy Videos

10/18/2021 · by Sodiq Adewole, et al.

Physicians use Capsule Endoscopy (CE) as a non-invasive and non-surgical procedure to examine the entire gastrointestinal (GI) tract for diseases and abnormalities. A single CE examination can last between 8 and 11 hours, generating up to 80,000 frames that are compiled into a video. Physicians have to review and analyze the entire video to identify abnormalities or diseases before making a diagnosis. This review task can be very tedious, time consuming and prone to error. While as little as a single frame may capture content relevant to the physicians' final diagnosis, the frames covering the small bowel region alone can number as many as 50,000. To minimize physicians' review time and effort, this paper proposes a novel unsupervised and computationally efficient temporal segmentation method to automatically partition long CE videos into homogeneous and identifiable video segments. However, searching for temporal boundaries in a long video using the high-dimensional frame-feature matrix is computationally prohibitive and impracticable for real clinical application. Therefore, leveraging both spatial and temporal information in the video, we first extracted high-level frame features using a pretrained CNN model and then projected the high-dimensional frame-feature matrix to a 1-dimensional embedding. Using this 1-dimensional sequence embedding, we applied the Pruned Exact Linear Time (PELT) algorithm to search for temporal boundaries that indicate the transition points from normal to abnormal frames and vice versa. We experimented with multiple real patients' CE videos, and our model achieved an AUC of 66% on multiple test videos against expert-provided labels.




I Introduction

With an estimated 70 million Americans affected by digestive tract diseases each year, physicians use endoscopy as a non-surgical procedure to visualize and examine the stomach, upper small bowel and colon [19]. Using an endoscope, a flexible tube that carries light through fibreoptic bundles to an attached camera, the physician can view pictures of the digestive tract on a color monitor. The three main traditional endoscopy procedures are gastroscopy, small-bowel endoscopy and colonoscopy. During gastroscopy, also known as upper endoscopy, an endoscope is passed through the mouth and throat into the esophagus, allowing the physician to view the esophagus and stomach [43]. Small-bowel endoscopy advances further and allows visibility into the upper part of the small intestine. Colonoscopy involves passing an endoscope into the colon through the rectum to examine the colon. Small-bowel endoscopy is especially limited by how far it can advance into the small bowel, thereby limiting the extent of the physician's examination. All three traditional methods are also limited by the invasiveness and discomfort that accompany them. While there is no complete replacement for these traditional procedures, especially when a biopsy (removal of tissue) is necessary, Video Capsule Endoscopy (VCE) has made the endoscopy procedure far less invasive and less uncomfortable.

VCE is currently the standard procedure for examining the entire digestive tract without the invasiveness associated with traditional gastroscopy, small-bowel endoscopy and colonoscopy. While VCE eases the diagnosis of many digestive tract diseases, a single capsule endoscopy study can last between 8 and 11 hours, generating up to 80,000 images of various sections of the digestive tract. In a typical VCE study, up to 50,000 images are obtained for the small bowel region alone; however, pathology of interest may be present in as few as a single frame. Nevertheless, physicians have to review the entire video in order to identify frames capturing diseases or abnormalities.

Research efforts on automating the analysis of VCE videos have been ongoing for more than two decades, and many promising methods and techniques have been developed in the literature (see Section II). However, many of the proposed techniques focus on identifying specific abnormalities in individual frames independently of the other frames in the video. Secondly, Deep Convolutional Neural Network (DCNN) models [41] are currently the state of the art in medical image analysis [8, 38, 27] and object recognition, including various abnormality detection tasks in VCE video frames [15]. However, despite their impressive performance on VCE video data, the variety of possible abnormalities in the gastrointestinal (GI) tract, the wide inter-patient variation, and the sample inefficiency of DCNN models limit their direct applicability toward a fully automated system for reviewing and analyzing CE videos.

Thirdly, the capsule camera used in CE is propelled down the GI tract through peristaltic movement of the intestinal walls, and the output videos have unique properties that tend to degrade the performance of generic video analysis techniques, leading to high miss rates in diagnosing diseases. For example, poor illumination, occlusion by food particles, and unstable peristaltic movement of the GI walls cause frequent camera flips, sometimes resulting in poor-quality video output.

Lastly, many open datasets used in traditional video analysis research have already been manually segmented into short video clips with fixed frame counts or fixed time durations [42, 16]. Therefore many video analysis techniques, especially deep learning based models [40, 14], are designed to operate mostly on short video clips. Manually segmenting a long video into clips has two main problems: 1) the sequence of frames contained in each clip cannot be guaranteed to be uncorrelated, so manual segmentation will not yield the homogeneous and identifiable segments needed for optimal summarization; 2) when a non-homogeneous video segment is summarized, a non-key frame may be selected as the representative frame, leading to a higher miss rate in diagnosis.

II Related Work

II-A VCE Video Analysis

Analyzing CE videos encompasses disease or abnormality detection, quantifying the severity of identified diseases, localizing identified abnormalities, and deciding on appropriate intervention by the physician. For more than two decades, researchers have proposed techniques to automate some of these steps by leveraging classical image analysis and machine learning [33] as well as more recent deep learning based methods [34, 3, 7, 1]. Prior works on VCE fall into three broad categories: 1) detection of specific lesions such as bleeding [37], polyps [29], ulcers [51], and angioectasia [47, 32]; 2) abnormal or outlier frame detection, where frames with abnormalities are considered outliers [15, 52]; and 3) VCE video summarization, where representative frames are selected from the entire video [18, 13, 30, 31, 21, 7] for review by experts.

II-B Video Temporal Segmentation

Temporal segmentation is usually the first step in automating the analysis of long videos. The goal is to divide the video stream into a set of meaningful segments or shots. The frames within a segment are correlated and visually similar, while each segment is largely independent of the others. Vu et al. [49] proposed a coherent three-stage procedure to detect intestinal contractions, utilizing changes in the edge structure of the intestinal folds for contraction assessment; the output is contraction-based shots. Mackiewicz et al. [26] utilized a three-dimensional LBP operator, color histograms, and motion vectors to classify every 10th image of the video; the final classification result was refined using a 4-state hidden Markov model for topographical segmentation. In [9], two color vectors created from the hue and saturation components of the HSI model were used to represent the entire video, and spectrum analysis was applied to detect sudden changes in the peristalsis pattern. The authors assumed that each organ has a different peristalsis pattern, so any change in the pattern may suggest an event of interest to a gastroenterologist. Energy and High Frequency Content (HFC) functions were subsequently used to identify such changes, while two other specialized features enhanced the detection of the duodenum and cecum. Zhao et al. [52] proposed a temporal segmentation approach based on an adaptive non-parametric key-point detection model using multi-feature extraction and fusion. Their aim was not only to detect key abnormal frames using pairwise distances, but also to augment the gastroenterologist's performance by minimizing the miss rate and thus improving detection accuracy. None of these prior works considered the computational cost of the temporal segmentation task; given the complexity of CE videos, the time it takes to run a model may render the solution impracticable. The work presented in this paper is motivated by this challenge. In another work aimed at summarizing VCE, [7] proposed finding transition boundaries in the video using pairwise similarity between the sequence of frames, with a threshold parameter determining the boundaries based on the similarity score between frame pairs. Computing pairwise similarity between all video frames can be computationally prohibitive and impracticable in a real clinical setting.

II-C Boundary Detection

Detection of boundaries or transition points (TP) on sequence data [39] has been considered in solving many sequence segmentation problems across various applications such as medical condition monitoring [28], climate change detection [35], audio activity segmentation and boundary recognition for silence in speech [11], speaker segmentation, scene change detection, and human activity analysis [10]. Other areas where detection and localization of distributional changes in sequence data arises include online sequential time series analysis [4, 44]. Essentially, Change Point Detection (CPD) involves partitioning a sequence into several homogeneous temporal segments.

Techniques such as probabilistic sequence models, including Hidden Markov Models (HMM) [24] and their discriminative counterpart, Conditional Random Fields [25], are well validated. These probabilistic models require good knowledge of the transition structure between segments and careful pre-training to yield competitive performance, which may not be practicable for online applications where data are acquired incrementally [2, 39]. Parametric approaches model the distribution before and after the change in a maximum likelihood framework [6], while non-parametric methods [12] have mostly been limited to univariate data. Kernel-based methods [17] use the maximum kernel Fisher discriminant ratio as a measure of homogeneity between segments and can achieve good results for moderately multidimensional data, or in specific situations where the data lie on a low-dimensional manifold. The approach uses a regularized kernel-based test statistic to determine 1) whether there is a change point in the data and, if so, 2) the location of the change point. However, the method lacks robustness in larger dimensions: kernel-based methods are sensitive to contaminating noise and to changes that affect only a subset of the components of high-dimensional data.

Algorithms such as Binary Segmentation (BS) and dynamic programming [45, 36] can identify locations where there are significant changes in the distribution of a sequence through recursive search. However, these techniques require prior knowledge of the number of change points in the sequence; the algorithms only recursively find the locations of these points using maximum likelihood estimation. BS is the most established search method in the literature; it is an approximate method with an efficient computational cost of O(n log n), where n is the number of data points. Dynamic Programming (DP) is an exact search method with a computational cost of O(Qn^2), where Q is the maximum number of change points and n is the number of data points [45]. DP can also be applied with different kernels, such as the linear or Gaussian kernel. Window-based search is an approximate method that computes the discrepancy between two adjacent windows that slide along the signal; when the two windows are highly dissimilar, a high discrepancy occurs, which is indicative of a change point. Upon generating a discrepancy curve, the algorithm locates the optimal change point indices in the sequence [45]. Pruned Exact Linear Time (PELT) [23] is an unsupervised CPD technique that requires no prior knowledge of the number of change points; the model finds both the optimal locations and the count of the change points in the series based on a cost function. In temporally segmenting CE videos, no prior knowledge of the number of boundaries is available; we therefore consider this technique the most suitable for our task. Other related methods include the Segment Neighbourhood (SN) algorithm [5] and the Optimal Partitioning (OP) algorithm [22].
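To make the PELT recursion concrete, the following is a minimal sketch (not the authors' implementation) of the algorithm with a Gaussian-type l2 segment cost on a 1-D signal; the function names and the penalty value are illustrative.

```python
import numpy as np

def pelt(signal, penalty):
    """Minimal PELT with an l2 (piecewise-constant mean) cost.

    Returns the sorted segment end points, including len(signal)."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    cs = np.concatenate([[0.0], np.cumsum(x)])        # prefix sums
    cs2 = np.concatenate([[0.0], np.cumsum(x ** 2)])  # prefix sums of squares

    def cost(s, e):
        # sum of squared deviations from the segment mean on x[s:e]
        return (cs2[e] - cs2[s]) - (cs[e] - cs[s]) ** 2 / (e - s)

    F = np.full(n + 1, np.inf)   # F[t]: optimal penalized cost of x[:t]
    F[0] = -penalty
    last = np.zeros(n + 1, dtype=int)
    candidates = [0]             # admissible last-change-point positions
    for t in range(1, n + 1):
        vals = [F[s] + cost(s, t) + penalty for s in candidates]
        best = int(np.argmin(vals))
        F[t] = vals[best]
        last[t] = candidates[best]
        # pruning step: discard s that can never be optimal again
        candidates = [s for s, v in zip(candidates, vals) if v - penalty <= F[t]]
        candidates.append(t)

    # backtrack the optimal change points
    bounds, t = [], n
    while t > 0:
        bounds.append(int(t))
        t = last[t]
    return sorted(bounds)
```

On a noise-free piecewise-constant signal such as 50 zeros followed by 50 fives, this sketch returns the boundary at index 50 plus the trailing end point 100; the pruning step is what keeps the expected cost linear in n.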

The key to analyzing video-structured data is leveraging both the spatial (image) and temporal information in the data. While analysis of CE videos has been studied for more than two decades, little attention has been paid to the temporal relationship between the sequence of frames in the video. In this work, we consider both the spatial and temporal structure of the video to develop a computationally efficient method for temporally segmenting long VCE videos, with the aim of generating multiple shorter, homogeneous and identifiable segments that are faster and easier to review and analyze. The output of our model could be applied in other domains and integrated into a long-video summarization model.

II-D Problem Formulation

Let X = {x_1, ..., x_T} be the unlabelled sequence of frames in a sample CE video V. Our hypothesis test consists of:

Step 1: Test the null hypothesis H_0 that no transition point exists in X against the alternative H_1 that at least one transition point exists.

Step 2: Estimate the transition point locations from the sample if H_1 is true.

Fig. 1: Recursive Search Temporal Shot Boundary in CE Videos

Figure 1 illustrates the recursive search for a boundary in the contiguous sequence of frames. The temporal segmentation algorithm is as follows:

Data: VCE video V with frames {x_1, ..., x_T}
Result: k short video segments
       for t = 1, ..., T do
             Extract features f_t using the CNN
             Project each feature vector f_t to a 1-D embedding z_t
       end for
       Concatenate the embedding projections Z = (z_1, ..., z_T)
       Compute transition points (t_1, ..., t_k) from Z
       Get segments S_i = (x_{t_{i-1}+1}, ..., x_{t_i}) for i = 1, ..., k
Algorithm 1 VCE Video Temporal Segmentation algorithm

III Methodology

III-A Overview of Proposed Method

Algorithm 1 shows an overview of the proposed technique. Detecting temporal boundaries in long videos allows us to automatically segment long CE videos into short, meaningful, homogeneous and identifiable clips. Our work leverages concepts from time-series change point analysis [23, 6, 17] to detect multiple transition points in a sequence of video frames. CPD methods have been successfully applied to one-dimensional time-series data in linear computational time; however, video frame features are usually high-dimensional, which sharply increases the computational cost. In our model, we extracted the frame-feature matrix using a VGG-19 [41] network pretrained on the large ImageNet dataset and then fine-tuned on our VCE video frames. The choice of architecture is motivated by [3]. Due to the significant class imbalance in the data, we over-sampled the minority classes to minimize the bias of the network towards the normal class. Thereafter, we projected the frame features into a 1-dimensional manifold space so that the sequence for the entire video resembles a single time series. Projecting from the p-dimensional feature space substantially reduces the computational cost of segmenting the video. We then applied the Pruned Exact Linear Time (PELT) algorithm proposed in [23] to detect multiple transition points in the video. Our model does not require any annotation from medical experts. To the best of our knowledge, this is the first work to approach VCE video analysis using concepts from CPD to exploit the temporal information in the sequence of frames. We experimented with multiple embedding methods to compare performance on the segmentation task.

Fig. 2: Proposed Temporal Segmentation Pipeline for CE Videos

III-B Lower Dimensional Feature Projection

In this section, we describe our approach for embedding the extracted features into a 1-dimensional representation. We applied this technique to reduce the computational complexity of finding the temporal boundaries in the video sequence to linear time in the number of frames, by projecting the high-dimensional frame feature vectors to a 1-dimensional embedding space. We first experimented with detecting change boundaries directly on the high-dimensional feature matrix of the video; however, after running for several days on a single video, we recognized that this is impracticable for real clinical application. Representing the abnormalities captured in a VCE image by a single 1-dimensional value is not a trivial task, so we experimented with several embedding methods to compare performance. Specifically, we experimented with PCA for linear projection, and with an auto-encoder, TSNE, and kernel-PCA with different kernels to account for some non-linearities. We restricted our tests to these techniques based on computational cost, after experimenting with many manifold learning techniques. We briefly describe each embedding technique below.

III-B1 Principal Component Embedding (PCE)

The principal components of a feature matrix extract the dominant patterns in the matrix in terms of a complementary set of score and loading plots [50]. PCA is a linear dimensionality reduction method that decomposes a multivariate dataset into a set of successive orthogonal components that capture maximum variance in the data. The input data are centered but not scaled for each feature before applying Singular Value Decomposition (SVD). The computational efficiency and speed of the PC method make it a very popular option in the machine learning community. See figure 3 for the visualization of a sample video projected onto the single dimension that explains the most variance, using the 4096-dimensional feature vectors extracted from VGG-19.
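As an illustration of this linear projection step, a minimal SVD-based sketch (assuming a frames-by-features matrix; not the exact implementation used in the paper) is:

```python
import numpy as np

def pca_1d(features):
    """Project an (n_frames, p) feature matrix onto its first principal
    component: center each feature (no scaling), then apply SVD."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)                   # center, as described above
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[0]                         # one scalar score per frame
```

Applied to the 4096-dimensional VGG-19 features, this yields one scalar per frame, producing (up to an arbitrary sign) the single time series fed to the change point detector.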

III-B2 Kernel Principal Component Embedding (KPCE)

In order to capture some non-linearities in the embedding, we applied kernel principal component analysis, which achieves non-linear dimensionality reduction through the use of kernels. While PCA uses a linear kernel to construct the eigendecomposition of the covariance matrix of the data, kernel-PCA uses the kernel trick, mapping the data so that the linear eigendecomposition is performed in a reproducing kernel Hilbert space. We experimented with two different kernels: Gaussian and cosine. Figures 3(b) and 3(c) show the 1-D projections using the two kernels.

(a) PCA - Linear kernel
(b) Kernel PCA - Cosine kernel
(c) Kernel PCA - Gaussian kernel
(d) Autoencoder Representation
(e) TSNE
Fig. 3: 1-D Plot of Sample Video Using Extracted VGG-19 Frame Feature-Matrix

Figure 3 shows the visualization of a sample video after projection into a 1-dimensional embedding space. The cosine kernel computes similarity using the cosine distance metric, k(x, y) = x·y / (||x|| ||y||); two objects that are exactly alike have zero cosine distance. The Gaussian kernel is an exponential function of the gamma-scaled squared distance between any two points, k(x, y) = exp(-γ ||x − y||²). The aim of comparing multiple kernels, as shown in figure 3, is to understand the sensitivity of the change point algorithm to the structure of the video embedding.
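The two kernels can be written down directly; a small illustrative sketch (with γ as the scale parameter) is:

```python
import numpy as np

def cosine_similarity(x, y):
    # cosine kernel: 1.0 for identical directions (cosine distance 0)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def gaussian_kernel(x, y, gamma=1.0):
    # exponential of the gamma-scaled squared Euclidean distance
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))
```

Because the Gaussian kernel decays with distance while the cosine kernel ignores vector magnitude, the two produce differently shaped 1-D embeddings, which is exactly the sensitivity being compared in figure 3.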

III-B3 Auto-Encoder

Auto-encoders learn useful representations without any supervision [46]. The goal is to learn a mapping from high-dimensional observations to a lower-dimensional representation space such that the original observations can be (approximately) reconstructed from the lower-dimensional representation. It is a parametric model trained using an encoder-decoder neural network architecture. We applied a 2-layer architecture and optimized the parameters by minimizing the mean squared loss between the actual frame features and the reconstruction, with a learning rate of 0.001. The pretrained auto-encoder was subsequently used to encode the extracted features of the test videos into a 1-dimensional sequence. While training, we also over-sampled the minority classes to account for the class imbalance, as described in section IV-A.
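As a sketch of the idea (a linear, single-hidden-unit auto-encoder in plain NumPy rather than the 2-layer PyTorch model used in the paper; the data, dimensions and step count are illustrative), training by gradient descent on the reconstruction MSE looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 0] *= 5.0                               # one dominant direction to recover

W_enc = rng.normal(scale=0.1, size=(8, 1))   # encoder: 8-D features -> 1-D code
W_dec = rng.normal(scale=0.1, size=(1, 8))   # decoder: 1-D code -> 8-D reconstruction

def mse():
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

lr = 1e-3
before = mse()
for _ in range(500):
    Z = X @ W_enc                            # 1-D codes for all samples
    R = Z @ W_dec - X                        # reconstruction residual
    g_dec = 2 * Z.T @ R / len(X)             # gradient w.r.t. decoder weights
    g_enc = 2 * X.T @ R @ W_dec.T / len(X)   # gradient w.r.t. encoder weights
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
after = mse()                                # reconstruction loss has decreased
```

In the paper's pipeline the trained encoder then maps each frame's 4096-D features to the single scalar that forms the 1-D sequence.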
III-B4 T-Stochastic Neighborhood Embedding (TSNE)

TSNE [48] uses a probabilistic model to minimize the KL-divergence between the Gaussian-distributed similarities of the high-dimensional input feature vectors and the t-distributed similarities in the lower-dimensional embedding. We applied TSNE to encode the extracted video features into a 1-dimensional manifold, setting the perplexity parameter to 50, which plays a role similar to the number of nearest neighbors used in other manifold learning methods. TSNE can be computationally costly, especially with high-dimensional input; we mitigated this by first applying PCA to reduce the full video frame features to 50-100 dimensions before applying TSNE to the PCA output. Figure 3 shows the 1-D embedding plot for our test video.

III-C Shot Boundary Detection (SBD) in CE Video

In temporally segmenting a CE video, we consider a temporal boundary to be a point where a pathology appears or disappears between a pair of frames in the sequence. In this paper, we employed the PELT algorithm since it requires no supervision in detecting the transition points in the video. The algorithm is derived from the Optimal Partitioning algorithm but adds a pruning step within the dynamic program to minimize the computational cost. The pruning reduces the computational cost without affecting the exactness of the resulting segmentation, making it an ideal candidate for high-dimensional video data. The PELT algorithm can detect multiple transition points and generally produces quick and consistent results. It solves the penalized detection problem when the number of transition points in the sequence is unknown: by minimizing a penalized log-likelihood cost function, it estimates both the number of transition points and the locations of the changes in a sequence of data. The algorithm has an expected computational cost of O(n), where n is the number of data points; in our case, n is the number of frames in the video. The PELT algorithm can solve the change point detection problem using different kernels, but the most validated is the Gaussian kernel.

On an ordered sequence of frame features y_{1:n} = (y_1, ..., y_n), our SBD model will have m transition points with positions t_{1:m} = (t_1, ..., t_m), where each t_i is an integer between 1 and n − 1. We set t_0 = 0 and t_{m+1} = n and assume the transition points are ordered such that t_1 < t_2 < ... < t_m. The transition points split the data into m + 1 segments, with the i-th segment containing the points y_{(t_{i-1}+1):t_i}.

The algorithm begins by conditioning on the last change point: it iteratively relates the optimal value of the cost function to the cost of the optimal partition of the data prior to the last transition point, plus the cost of the segment from the last transition point to the end of the data [23]. Let T = {t_{1:m} : 0 = t_0 < t_1 < ... < t_m < t_{m+1} = n} be the set of possible vectors of transition points for the video. The optimal partition is defined as the minimizer of

   Σ_{i=1}^{m+1} [ C(y_{(t_{i-1}+1):t_i}) + β ]   over m and t_{1:m} in T,

where C(·) is a cost function for a segment and β is a regularizer that guards against overfitting and essentially determines how many transition points the algorithm will find. The higher the specified β, the fewer transition points are detected, forcing the algorithm to minimize the False Positive Rate (FPR). It is important to experiment with this hyper-parameter to ensure that increasing the penalty does not jeopardize the ability to detect true transition points, i.e. True Positives (TP). The cost C(·) is chosen as twice the negative log-likelihood, as in [23], and a minimum segment length is enforced.

IV Experiments

We conducted experiments using eight VCE videos collected during real clinical examinations under the supervision of expert gastroenterologists. When reviewing and analyzing CE videos, gastroenterologists are mostly interested in the small bowel region, which can only be accessed through VCE and not through any of the other upper or lower endoscopy procedures. Detecting pathological change within the small bowel is a much more difficult problem than detecting transitions between regions of the GI tract such as the esophagus, stomach and colon. For our experiments, we therefore trimmed each long video to focus only on the small bowel region. Table I shows the number of frames per video covering only the small bowel region after removing the other regions, such as the upper esophagus, stomach and lower colon.

We extracted the videos from the RapidReader software program and pre-processed each video into frames. The eight videos were collected from different patients during clinical endoscopy procedures using SB3 Given Imaging PillCam capsules equipped with a 576 x 576 pixel camera. For each complete video, the small bowel transit time corresponds to approximately hr [20]. In order to isolate the small bowel region, two endoscopy research scientists annotated each video, identifying the region where each image was captured as well as any disease or abnormality found. After annotation, the number of frames in each video is summarized in table I.

We randomly selected 5 videos for pre-training both the feature extraction model and the auto-encoder. We reserved the remaining three videos for testing the entire system. Using videos from completely different patients during testing minimizes bias and helps ensure our approach generalizes to new, unseen patient videos.

Video ID Training samples Testing samples
Video 1 12,303 -
Video 2 13,177 -
Video 3 8,452 -
Video 4 23,124 -
Video 5 32,181 -
Video 6 - 8,701
Video 7 - 16,909
Video 8 - 10,037
TABLE I: Small Bowel Frame Count for Train and Test videos

IV-A Implementation

We developed our entire system using the PyTorch framework on an NVIDIA GTX 2080 machine. We ensured that all our experiments were run on the same configuration for consistency across the compared techniques. Each feature extractor was trained for up to 30 epochs with a learning rate of 0.001 and Stochastic Gradient Descent optimization. We trained the auto-encoder used to embed the frame features for about 50 epochs. During each pre-training run, we over-sampled the minority classes based on the inverse of their proportion in the data, which gave a significant boost to the network's representation of the abnormal frames.

IV-B Evaluation

We evaluated the performance of this method based on the AUC-ROC, as shown in table II. At each time step t, the model predicts whether t is a transition point or not; a transition point occurs when the class of the frame at t differs from the class of the frame at t − 1. Using the predicted output, we computed the True Positive and False Positive rates and applied these in computing the ROC. Each transition point is considered a pathological event, so we benchmarked against the ground truth labels provided by the medical experts. This is obviously a very challenging problem, as neither the change point detection algorithm nor the feature-embedding models have any information on the statistical properties that characterize any of the pathologies in the video.
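This evaluation step can be sketched as follows (a hypothetical frame-class sequence and helper names, not the authors' exact code): transition labels are derived from consecutive class changes, and the rates follow from the usual confusion-matrix counts.

```python
import numpy as np

def transition_labels(frame_classes):
    """True at step t when the class at t differs from the class at t-1."""
    c = np.asarray(frame_classes)
    return np.concatenate([[False], c[1:] != c[:-1]])

def tpr_fpr(predicted, truth):
    """True/False Positive rates for boolean transition predictions."""
    p, g = np.asarray(predicted, bool), np.asarray(truth, bool)
    tp, fn = np.sum(p & g), np.sum(~p & g)
    fp, tn = np.sum(p & ~g), np.sum(~p & ~g)
    return tp / (tp + fn), fp / (fp + tn)
```

Sweeping the detection threshold (here, the PELT penalty β) and recomputing these rates traces out the ROC curve from which the AUC in table II is obtained.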

V Results and Discussion

Figure 4 shows the experimental results of detected boundaries using the PCA embedding and the PELT change point algorithm. Each alternating pink-colored interval is a section of pathological abnormality. There are points where we can visually observe changes that are not pathological events; these are due to camera rotations and flips as the capsule is propelled down the GI tract through peristalsis. That is, there is a spatial transition in the content captured by the camera, but those changes are not pathological.

Fig. 4: Detected Boundaries vs Ground Truth using PCA @

Experiments on feature extraction also showed that the representation capability of the base CNN model is critical to what the boundary detector is able to identify: the base CNN's representation of diseased frames directly impacts the performance of the boundary-detection algorithm. In addition, different CNN architectures showed varying representation performance on different classes of diseases (or lesions); for example, ResNet-152 may represent diffuse bleeding in a frame better than VGG-19. Lesions differ significantly in geometry, color and texture, as well as in the surrounding lighting conditions. This indicates that base CNN capabilities are not universal, and some architectures better capture certain structures than others.

Table II below shows comparative results using different parametric and non-parametric embedding techniques. Parametric representation frameworks such as the auto-encoder are difficult to train, but can capture some non-linearities in the data when they train successfully.

Embedding AUC-Score
PCA 0.66
TSNE 0.49
Kernel-PCA (Cosine) 0.50
Kernel-PCA (RBF) 0.50
Auto-Encoder 0.42
TABLE II: AUC Score of Different Embedding @

Table II compares the receiver operating characteristics of the different embedding techniques, with the TPR and FPR aggregated over the test videos. While some embeddings performed worse than a random encoding, PCA with a linear kernel achieved an AUC of 0.66, outperforming the other embedding techniques. Clearly, PCA better encodes the frame features to capture abnormal boundaries than the other embeddings, including the auto-encoder (0.42) and TSNE (0.49).

While VGG-19 was able to encode the frames and separate diseased frames from normal ones, its final feature layer has 4096 dimensions. Embedding this into a 1-dimensional vector is not a trivial problem, due to the complexity of CE video frame features and the complex geometry of some abnormalities, such as angioectasia, which may be difficult to detect even for humans.

V-1 Detected Video Boundaries in a Sample Test Video

Figure 5 below shows the detected transition points in the sequence of frames.

Fig. 5: Visual Illustration of Detected Video Boundaries

As shown in figure 5, some of the boundaries detected in the sequence of video frames are not necessarily indicative of a pathological change event; however, very similar frames fall within the same temporal boundaries. Clearly, detecting pathological boundaries in VCE videos is a non-trivial and very challenging problem. A binary classification model that encodes the abnormalities into binary categories may help mitigate this challenge.

Conclusion and Future Works

In this paper, we developed a novel unsupervised technique for temporal segmentation of long capsule endoscopy videos. While our method can be generalized to videos in other domains, we experimented using capsule endoscopy videos collected from patients during real clinical examinations. All collected data went through proper IRB approval prior to analysis. After downloading each video from the RapidReader software, we extracted features from each frame using a pre-trained CNN model. The high-dimensional frame features were projected into a 1-dimensional representation for the entire video, and we applied the Pruned Exact Linear Time algorithm to detect transition boundaries using this lower-dimensional embedding. Our results showed that the transition detection algorithm better captures pathological events in the sequence of frames when PCA is used as the embedding mechanism; PCA achieved an AUC-ROC of 66% and outperformed the other, non-linear embedding techniques. When applied to CE videos, the proposed technique can facilitate experts' review through significant savings in time and effort. As a next step, we will develop a fully integrated long-video summarization model requiring little or no expert supervision.


  • [1] S. Adewole, P. Fernandez, J. Jablonski, S. Syed, A. Copland, M. Porter, and D. Brown (2021) Lesion2Vec: deep metric learning for few shot multiple lesions recognition in wireless capsule endoscopy. arXiv preprint arXiv:2101.04240. Cited by: §II-A.
  • [2] S. Adewole, E. Gharavi, B. Shpringer, M. Bolger, V. Sharma, S. M. Yang, and D. E. Brown (2020) Dialogue-based simulation for cultural awareness training. arXiv preprint arXiv:2002.00223. Cited by: §II-C.
  • [3] S. Adewole, M. Yeghyayan, D. Hyatt, L. Ehsan, J. Jablonski, A. Copland, S. Syed, and D. Brown (2020) Deep learning methods for anatomical landmark detection in video capsule endoscopy images. In Proceedings of the Future Technologies Conference, pp. 426–434. Cited by: §II-A, §III-A.
  • [4] S. Aminikhanghahi and D. J. Cook (2017) A survey of methods for time series change point detection. Knowledge and information systems 51 (2), pp. 339–367. Cited by: §II-C.
  • [5] I. E. Auger and C. E. Lawrence (1989) Algorithms for the optimal identification of segment neighborhoods. Bulletin of mathematical biology 51 (1), pp. 39–54. Cited by: §II-C.
  • [6] J. Chen and A. K. Gupta (2012) Parametric Statistical Change Point Analysis. Birkhäuser, Basel, Switzerland. External Links: ISBN 978-0-8176-4800-8, Document Cited by: §II-C, §III-A.
  • [7] J. Chen, Y. Zou, and Y. Wang (2016) Wireless capsule endoscopy video summarization: a learning approach based on siamese neural network and support vector machine. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 1303–1308. Cited by: §II-A, §II-B.
  • [8] M. Chen, X. Shi, Y. Zhang, D. Wu, and M. Guizani (2017) Deep features learning for medical image analysis with convolutional autoencoder neural network. IEEE Transactions on Big Data. Cited by: §I.
  • [9] Y. Chen, W. Yasen, J. Lee, D. Lee, and Y. Kim (2009) Developing assessment system for wireless capsule endoscopy videos based on event detection. In Medical Imaging 2009: Computer-Aided Diagnosis, Vol. 7260, pp. 72601G. Cited by: §II-B.
  • [10] H. Cho and P. Fryzlewicz (2015) Multiple-change-point detection for high dimensional time series via sparsified binary segmentation. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pp. 475–507. Cited by: §II-C.
  • [11] M. F. R. Chowdhury, S. Selouani, and D. O’Shaughnessy (2012) Bayesian on-line spectral change point detection: a soft computing approach for on-line asr. International Journal of Speech Technology 15 (1), pp. 5–23. Cited by: §II-C.
  • [12] M. Csörgő and L. Horváth (1988) 20 nonparametric methods for changepoint problems. Handbook of statistics 7, pp. 403–425. Cited by: §II-C.
  • [13] A. Z. Emam, Y. A. Ali, and M. M. B. Ismail (2015) Adaptive features extraction for capsule endoscopy (ce) video summarization. In International Conference on Computer Vision and Image Analysis Applications, pp. 1–5. Cited by: §II-A.
  • [14] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen (2017) Video captioning with attention-based lstm and semantic consistency. IEEE Transactions on Multimedia 19 (9), pp. 2045–2055. Cited by: §I.
  • [15] Y. Gao, W. Lu, X. Si, and Y. Lan (2020) Deep model-based semi-supervised learning way for outlier detection in wireless capsule endoscopy images. IEEE Access 8, pp. 81621–81632. Cited by: §I, §II-A.
  • [16] G. Geisler and G. Marchionini (2000) The open video project: research-oriented digital video repository. In Proceedings of the fifth ACM conference on Digital libraries, pp. 258–259. Cited by: §I.
  • [17] Z. Harchaoui, E. Moulines, and F. R. Bach (2009) Kernel change-point analysis. In Advances in neural information processing systems, pp. 609–616. Cited by: §II-C, §III-A.
  • [18] D. K. Iakovidis, S. Tsevas, and A. Polydorou (2010) Reduction of capsule endoscopy reading times by unsupervised image mining. Computerized Medical Imaging and Graphics 34 (6), pp. 471–478. Cited by: §II-A.
  • [19] G. Iddan, G. Meron, A. Glukhovsky, and P. Swain (2000) Wireless capsule endoscopy. Nature 405 (6785), pp. 417–417. Cited by: §I.
  • [20] S. Iobagiu, L. Ciobanu, and O. Pascu (2008) Colon capsule endoscopy: a new method of investigating the large bowel. Journal of Gastrointestinal and Liver Diseases 17 (3), pp. 347–352. Cited by: §IV.
  • [21] M. M. B. Ismail, O. Bchir, and A. Z. Emam (2013) Endoscopy video summarization based on unsupervised learning and feature discrimination. In 2013 Visual Communications and Image Processing (VCIP), pp. 1–6. Cited by: §II-A.
  • [22] B. Jackson, J. D. Scargle, D. Barnes, S. Arabhi, A. Alt, P. Gioumousis, E. Gwin, P. Sangtrakulcharoen, L. Tan, and T. T. Tsai (2005) An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing Letters 12 (2), pp. 105–108. Cited by: §II-C.
  • [23] R. Killick, P. Fearnhead, and I. A. Eckley (2012) Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association 107 (500), pp. 1590–1598. Cited by: §II-C, §III-A, §III-C.
  • [24] F. D. la Torre Frade, J. Campoy, Z. Ambadar, and J. F. Cohn (2007-10) Temporal segmentation of facial behavior. In Proceedings of (ICCV) International Conference on Computer Vision, Cited by: §II-C.
  • [25] J. D. Lafferty, A. McCallum, and F. C. N. Pereira (2001-06) Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML ’01: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289. External Links: ISBN 978-155860778, Document Cited by: §II-C.
  • [26] M. Mackiewicz, J. Berens, and M. Fisher (2008) Wireless capsule endoscopy color video segmentation. IEEE Transactions on Medical Imaging 27 (12), pp. 1769–1781. Cited by: §II-B.
  • [27] A. Malik, P. Patel, L. Ehsan, S. Guleria, T. Hartka, S. Adewole, and S. Syed (2021) Ten simple rules for engaging with artificial intelligence in biomedicine. Public Library of Science San Francisco, CA USA. Cited by: §I.
  • [28] R. Malladi, G. P. Kalamangalam, and B. Aazhang (2013) Online bayesian change point detection algorithms for segmentation of epileptic activity. In 2013 Asilomar Conference on Signals, Systems and Computers, pp. 1833–1837. Cited by: §II-C.
  • [29] A. V. Mamonov, I. N. Figueiredo, P. N. Figueiredo, and Y. R. Tsai (2014) Automated polyp detection in colon capsule endoscopy. IEEE transactions on medical imaging 33 (7), pp. 1488–1502. Cited by: §II-A.
  • [30] I. Mehmood, M. Sajjad, and S. W. Baik (2014) Video summarization based tele-endoscopy: a service to efficiently manage visual data generated during wireless capsule endoscopy procedure. Journal of medical systems 38 (9), pp. 109. Cited by: §II-A.
  • [31] A. Mohammed, S. Yildirim, M. Pedersen, Ø. Hovde, and F. Cheikh (2017) Sparse coded handcrafted and deep features for colon capsule video summarization. In 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS), pp. 728–733. Cited by: §II-A.
  • [32] K. Pogorelov, O. Ostroukhova, M. Jeppsson, H. Espeland, C. Griwodz, T. de Lange, D. Johansen, M. Riegler, and P. Halvorsen (2018) Deep learning and hand-crafted feature based approaches for polyp detection in medical videos. In 2018 IEEE 31st International Symposium on Computer-Based Medical Systems (CBMS), pp. 381–386. Cited by: §II-A.
  • [33] T. Rahim, M. A. Usman, and S. Y. Shin (2020) A survey on contemporary computer-aided tumor, polyp, and ulcer detection methods in wireless capsule endoscopy imaging. Computerized Medical Imaging and Graphics, pp. 101767. Cited by: §II-A.
  • [34] S. Rasoul, S. Adewole, and A. Akakpo (2021) Feature selection using reinforcement learning. arXiv preprint arXiv:2101.09460. Cited by: §II-A.
  • [35] J. Reeves, J. Chen, X. L. Wang, R. Lund, and Q. Q. Lu (2007) A review and comparison of changepoint detection techniques for climate data. Journal of applied meteorology and climatology 46 (6), pp. 900–915. Cited by: §II-C.
  • [36] C. Rohrbeck (2013) Detection of changes in variance using binary segmentation and optimal partitioning. Note: [Online; accessed 17. Jul. 2021] External Links: Link Cited by: §II-C.
  • [37] S. Sainju, F. M. Bui, and K. A. Wahid (2014) Automated bleeding detection in capsule endoscopy videos using statistical features and region growing. Journal of medical systems 38 (4), pp. 25. Cited by: §II-A.
  • [38] R. Sali, S. Adewole, L. Ehsan, L. A. Denson, P. Kelly, B. C. Amadi, L. Holtz, S. A. Ali, S. R. Moore, S. Syed, et al. (2020) Hierarchical deep convolutional neural networks for multi-category diagnosis of gastrointestinal disorders on histopathological images. arXiv preprint arXiv:2005.03868. Cited by: §I.
  • [39] V. Sharma, B. Shpringer, S. M. Yang, M. Bolger, S. Adewole, D. Brown, and E. Gharavi (2019) Data collection methods for building a free response training simulation. In 2019 Systems and Information Engineering Design Symposium (SIEDS), pp. 1–6. Cited by: §II-C, §II-C.
  • [40] Z. Shou, D. Wang, and S. Chang (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1049–1058. Cited by: §I.
  • [41] K. Simonyan and A. Zisserman (2014-09) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv. External Links: 1409.1556, Link Cited by: §I, §III-A.
  • [42] A. F. Smeaton, P. Over, and A. R. Doherty (2010) Video shot boundary detection: seven years of trecvid activity. Computer Vision and Image Understanding 114 (4), pp. 411–418. Cited by: §I.
  • [43] P. Swain (2003) Wireless capsule endoscopy. Gut 52 (suppl 4), pp. iv48–iv50. Cited by: §I.
  • [44] A. Tartakovsky, I. Nikiforov, and M. Basseville (2014) Sequential analysis: hypothesis testing and changepoint detection. CRC Press. Cited by: §II-C.
  • [45] C. Truong, L. Oudre, and N. Vayatis (2018-01) Selective review of offline change point detection methods. arXiv. External Links: 1801.00718, Document Cited by: §II-C.
  • [46] M. Tschannen, O. Bachem, and M. Lucic (2018) Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069. Cited by: §III-B3.
  • [47] A. Tsuboi, S. Oka, K. Aoyama, H. Saito, T. Aoki, A. Yamada, T. Matsuda, M. Fujishiro, S. Ishihara, M. Nakahori, et al. (2020) Artificial intelligence using a convolutional neural network for automatic detection of small-bowel angioectasia in capsule endoscopy images. Digestive Endoscopy 32 (3), pp. 382–390. Cited by: §II-A.
  • [48] L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne.. Journal of machine learning research 9 (11). Cited by: §III-B4.
  • [49] H. Vu, T. Echigo, R. Sagawa, K. Yagi, M. Shiba, K. Higuchi, T. Arakawa, and Y. Yagi (2009) Detection of contractions in adaptive transit time of the small bowel from wireless capsule endoscopy videos. Computers in biology and medicine 39 (1), pp. 16–26. Cited by: §II-B.
  • [50] S. Wold, K. Esbensen, and P. Geladi (1987) Principal component analysis. Chemometrics and intelligent laboratory systems 2 (1-3), pp. 37–52. Cited by: §III-B1.
  • [51] Y. Yuan, J. Wang, B. Li, and M. Q. Meng (2015) Saliency based ulcer detection for wireless capsule endoscopy diagnosis. IEEE transactions on medical imaging 34 (10), pp. 2046–2057. Cited by: §II-A.
  • [52] Q. Zhao and M. Q. Meng (2010) An abnormality based wce video segmentation strategy. In 2010 IEEE International Conference on Automation and Logistics, pp. 565–570. Cited by: §II-A, §II-B.