I Introduction
With an estimated 70 million Americans affected by digestive tract diseases each year, physicians use endoscopy as a nonsurgical procedure to visualize and examine the stomach, upper small bowel, and colon of a person
[19]. Using an endoscope, a flexible tube that carries light through fibre-optic bundles and has an attached camera, the physician is able to view pictures of the digestive tract on a color TV monitor. Traditionally, the three main endoscopy procedures are gastroscopy, small bowel endoscopy, and colonoscopy. During gastroscopy, also known as upper endoscopy, an endoscope is passed through the mouth and throat into the esophagus, allowing the physician to view the esophagus and stomach [43]. Small bowel endoscopy advances further and allows visibility into the upper part of the small intestine. Colonoscopy involves passing an endoscope into the colon through the rectum to examine the colon. Small bowel endoscopy is especially limited by how far the instrument can advance into the small bowel, thereby limiting the extent of the physician's examination. All three traditional methods are also limited by the invasiveness and discomfort that accompany them. While there is no complete replacement for these traditional procedures, especially when a biopsy (removal of tissue) is necessary, Video Capsule Endoscopy (VCE) has made the endoscopy procedure far less invasive and less uncomfortable. VCE is currently the standard procedure for examining the entire digestive tract without the invasiveness associated with traditional gastroscopy, small bowel endoscopy, and colonoscopy. While VCE helps ease diagnosis of many digestive tract diseases, a single capsule endoscopy study can last between 8 and 11 hours, generating up to 80,000 images of various sections of the digestive tract. In a typical VCE study, up to 50,000 images are obtained for the small bowel region alone; however, it is possible for pathology of interest to be present in as few as a single frame. Consequently, physicians have to review the entire video in order to identify frames capturing diseases or abnormalities.
Research efforts on automating the analysis of VCE videos have been ongoing for more than two decades, and many promising methods and techniques have been developed in the literature (see Section II). However, these efforts face several challenges. First, many of the proposed techniques focus on identifying specific abnormalities in individual frames, independently of other frames in the video. Second, Deep Convolutional Neural Network (DCNN) models
[41] are currently the state-of-the-art models in medical image analysis [8, 38, 27] and object recognition, including detection of various abnormalities in VCE video frames [15]. However, despite their impressive performance on VCE video data, the variety of possible abnormalities in the gastrointestinal (GI) tract, coupled with the wide inter-patient variation and the sample inefficiency of DCNN models, limits their direct applicability for developing a fully automated system to review and analyze CE videos. Third, the capsule camera used in CE is propelled down the GI tract through peristaltic movement of the intestinal walls, and the output videos have unique properties that tend to degrade the performance of generic video analysis techniques, leading to high miss rates in diagnosing diseases. For example, poor illumination, occlusion by food particles, and unstable peristaltic movement of the GI walls result in frequent camera flips, sometimes leading to poor quality video output.
Lastly, many open datasets used in traditional video analysis research have already been manually segmented into short video clips with fixed frame counts or fixed time durations [42, 16]. Therefore, many video analysis techniques, especially deep learning based models [40, 14], are designed to operate mostly on short video clips. Manually segmenting a long video into clips has two main problems: 1) the sequence of frames contained in each video clip cannot be guaranteed to be uncorrelated; manually segmenting long videos therefore will not yield homogeneous and identifiable segments that can lead to optimal summarization output; 2) when a non-homogeneous video segment is summarized, there is a chance of selecting a non-key frame as the representative frame, leading to a higher miss rate in any diagnosis.

II Related Work
II-A VCE Video Analysis
Analysing CE videos encompasses disease or abnormality detection, quantifying the severity of identified diseases, localizing identified abnormalities, and decision making on appropriate intervention by the physician. For more than two decades, researchers have proposed different techniques to automate some of these steps by leveraging both classical image analysis and machine learning techniques [33] as well as more recent deep learning based methods [34, 3, 7, 1]. Prior works on VCE fall into three broad categories: 1) detecting specific lesions such as bleeding [37], polyps [29], ulcers [51], and angioectasia [47, 32]; 2) abnormal or outlier frame detection, where frames with abnormalities are considered outliers [15, 52]; and 3) VCE video summarization, where representative frames are selected from the entire video [18, 13, 30, 31, 21, 7] for review by the experts.

II-B Video Temporal Segmentation
Temporal segmentation is usually the first step when trying to automate the analysis of long videos. The goal is to divide the video stream into a set of meaningful segments or shots. Frames within a segment are correlated and visually similar, while the segments themselves exhibit mutual independence. In [49], Vu et al. proposed a coherent three-stage procedure to detect intestinal contractions. The authors utilized changes in the edge structure of the intestinal folds for contraction assessment; the output is a set of contraction-based shots. Mackiewicz et al. in [26] utilized a three-dimensional LBP operator, color histograms, and motion vectors to classify every 10th image of the video. The final classification result was assessed using a 4-state hidden Markov model for topographical segmentation. In
[9], two color vectors created from the hue and saturation components of the HSI model were used to represent the entire video. Spectrum analysis was applied to detect sudden changes in the peristalsis pattern. The authors assumed that each organ has a different peristalsis pattern and hence any change in the pattern may suggest an event of interest to a gastroenterologist. Energy and High Frequency Content (HFC) functions are subsequently used to identify such changes, while two other specialized features aim to enhance the detection of the duodenum and cecum. Zhao et al. [52] proposed a temporal segmentation approach based on an adaptive nonparametric keypoint detection model using multi-feature extraction and fusion. The aim of their work was not only to detect key abnormal frames using pairwise distances, but also to augment the gastroenterologist's performance by minimizing the miss rate and thus improving detection accuracy. None of these prior works considered the computational cost of the temporal segmentation task, and given the complexity of CE videos, the time it takes to run a model may render the solution impracticable. The work presented in this paper is motivated by this challenge. In another work aimed at summarizing VCE,
[7] proposed to find transition boundaries in the video using pairwise similarity between the sequence of frames. A threshold parameter is used to determine the boundaries based on the similarity score between frame pairs. Computing pairwise similarity between video frames can be computationally prohibitive and impracticable in a real-world clinical setting.

II-C Boundary Detection
Detection of boundaries or transition points (TP) in sequence data [39] has been considered in many sequence segmentation problems across various applications, such as medical condition monitoring [28], climate change detection [35], audio activity segmentation and boundary recognition for silence in speech [11], speaker segmentation, scene change detection, and human activity analysis [10]. Other areas where detection and localization of distributional changes in sequence data arise include online sequential time series analysis [4, 44]. Essentially, Change Point Detection (CPD) involves partitioning a sequence into several homogeneous temporal segments.
Techniques such as probabilistic sequence models, including Hidden Markov Models (HMM)
[24] and their discriminative counterparts such as Conditional Random Fields [25], are well validated. These probabilistic models require good knowledge of the transition structure between the segments and also require careful pretraining to yield competitive performance. This may not be practicable for applications where data are acquired online [2, 39]. Parametric approaches model the distribution before and after the change based on a maximum likelihood framework [6], while nonparametric methods [12] have mostly been limited to univariate data. Kernel-based methods [17] use the maximum kernel Fisher discriminant ratio as a measure of homogeneity between segments and can achieve good results for moderately multidimensional data, or in specific situations where the data lie in a low-dimensional manifold. The approach involves a regularized kernel-based test statistic to determine 1) whether there is a change point in the data and, if so, 2) the location/instant of the change point. However, the method lacks robustness in larger dimensions. In particular, kernel-based methods are not robust to the presence of contaminating noise, nor to changes that affect only a subset of the components of high-dimensional data.
Algorithms such as Binary Segmentation (BS) and dynamic programming [45, 36] can identify locations where there are significant changes in the distribution of a sequence of data through recursive search. However, in order to use these techniques, prior knowledge of the number of change point instances in the sequence is required; the algorithms only recursively find the locations of these points using maximum likelihood estimation. BS is the most established search method in the literature. It is an approximate method with an efficient computational cost of O(n log n), where n is the number of data points. Dynamic Programming (DP) search is an exact search method with a computational cost of O(Qn^2), where Q is the maximum number of change points and n is the number of data points [45]. DP can also be applied using different kernels, such as the linear or Gaussian kernels. Window-based search is an approximate method that computes the discrepancy between two adjacent windows that slide along the signal. When the two windows are highly dissimilar, a high discrepancy between the two values occurs, which is indicative of a change point. Upon generating a discrepancy curve, the algorithm locates optimal change point indices in the sequence [45]. Pruned Exact Linear Time (PELT) [23] is an unsupervised CPD technique in which no prior knowledge of the number of change points is necessary. Rather, the model finds the optimal locations as well as the count of the change points in the series based on a cost function. In temporally segmenting CE videos, no prior knowledge of the number of boundaries is available; we therefore consider this technique the most suitable for our task. Other related methods include the Segment Neighbourhood (SN) algorithm [5] and the Optimal Partitioning (OP) algorithm [22].
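To make the mechanics concrete, the following is a minimal, pure-Python sketch of PELT with a Gaussian mean-shift cost on a 1-D signal; the function names (`pelt`, `_seg_cost`) and the toy signals are our own illustrative choices, not from the paper:

```python
import numpy as np

def _seg_cost(csum, csum2, a, b):
    # Gaussian mean-shift cost of segment y[a:b] (half-open): the sum of
    # squared deviations from the segment mean, computed in O(1) via cumsums.
    n = b - a
    s = csum[b] - csum[a]
    return (csum2[b] - csum2[a]) - s * s / n

def pelt(y, beta):
    """Minimal PELT (Killick et al., 2012) for a 1-D signal.

    Returns the detected change point indices. `beta` is the penalty per
    change point; larger values yield fewer detected points.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    csum = np.concatenate(([0.0], np.cumsum(y)))
    csum2 = np.concatenate(([0.0], np.cumsum(y * y)))
    F = np.full(n + 1, np.inf)   # F[t]: optimal penalized cost of y[:t]
    F[0] = -beta
    last = np.zeros(n + 1, dtype=int)
    candidates = [0]             # admissible last-change-point positions
    for t in range(1, n + 1):
        costs = [F[s] + _seg_cost(csum, csum2, s, t) + beta for s in candidates]
        best = int(np.argmin(costs))
        F[t] = costs[best]
        last[t] = candidates[best]
        # Pruning: drop candidates that can never be optimal for any t' > t.
        candidates = [s for s, c in zip(candidates, costs) if c - beta <= F[t]]
        candidates.append(t)
    # Backtrack the optimal segmentation.
    cps, t = [], n
    while last[t] > 0:
        t = last[t]
        cps.append(t)
    return sorted(cps)
```

The pruning step is what distinguishes PELT from plain Optimal Partitioning: candidate start points that can no longer yield the optimum are discarded, which keeps the expected cost linear in the number of points.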
The key to analysing video-structured data is leveraging both the spatial (image) and temporal information in the data. While analysis of CE videos has been ongoing for more than two decades, little to no attention has been paid to the temporal relationship between the sequence of frames in the video. In this work, we consider both the spatial and temporal structure of the video to develop a computationally efficient method to temporally segment long VCE videos, with the aim of generating multiple shorter, homogeneous, and identifiable video segments that are faster and easier to review and analyse. The output of our model could be applied in other domains and also integrated into a long video summarization model.
II-D Problem Formulation
Let $X = (x_1, x_2, \ldots, x_n)$ be an unlabelled sequence of frames in a sample CE video $V$. Our hypothesis test consists of:
Step 1: Test $H_0$ (no change point exists in $X$) against $H_1$ (a change point exists at some instant $\tau$).
Step 2: Estimate $\tau$ from the sample if $H_1$ is true.
Figure 1 illustrates the recursive search for a boundary in the contiguous sequence of frames; the temporal segmentation procedure is summarized in Algorithm 1.
III Methodology
III-A Overview of Proposed Method
Algorithm 1 shows an overview of the technique proposed in this work. Detecting temporal boundaries in long videos allows us to automatically segment long CE videos into short, meaningful, homogeneous, and identifiable clips. Our work leverages concepts from time series change point analysis [23, 6, 17] to detect multiple transition points in a sequence of video frames. CPD methods have been successfully applied to one-dimensional time-series data in linear computational time; however, video frame features usually live in much higher dimensions, which greatly increases the computational cost. In our model, we extracted the frame-feature matrix using a VGG19 [41] network pretrained on the large ImageNet dataset and then fine-tuned on our VCE video frames. The choice of architecture is motivated by [3]. Due to the significant class imbalance in the data, we oversampled the minority classes to minimize the bias of the network towards the normal class. Thereafter, we projected the frame features into a 1-dimensional manifold space so that the sequence for the entire video resembles a single time series. Projecting down from the p-dimensional video features reduces the computational cost of segmenting the video from O(pn) to O(n). We then applied the Pruned Exact Linear Time (PELT) algorithm proposed in [23] to detect multiple transition points in the video. Our model does not require any form of annotation from medical experts. To the best of our knowledge, this is the first work to approach VCE video analysis using concepts from CPD to exploit the temporal information in the sequence of frames. We experimented with multiple embedding methods to compare performance on the segmentation task.

III-B Lower Dimensional Feature Projection
In this section, we describe our approach for embedding the extracted features into a lower, 1-dimensional representation. We applied this technique to reduce the computational complexity of finding the temporal boundaries in the video sequence from O(pn), for p-dimensional frame features, to a linear time complexity of O(n). We approached this by projecting the high-dimensional frame feature vectors into a 1-dimensional embedding space. We first experimented with detecting change boundaries using the high-dimensional feature matrix of the video; however, after running for several days on a single video, we recognized its impracticability for real clinical application. Representing the abnormalities captured in a VCE image by a single 1-dimensional feature is not a trivial task. Therefore, we experimented with several embedding methods to compare performance. Specifically, we experimented with PCA for linear projection, and with an autoencoder, t-SNE, and kernel PCA with different kernels to account for some nonlinearities. We restricted our tests to these techniques based on considerations of computational cost and after experimenting with many manifold learning techniques. We briefly describe each of these embedding techniques below.
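As a concrete illustration of the projection step shared by these techniques, the sketch below collapses a (frames x features) matrix onto its first principal component with plain NumPy; the feature matrix here is random stand-in data, not real VGG19 output:

```python
import numpy as np

def pca_1d(features):
    """Project an (n_frames, p) feature matrix onto its first principal component."""
    centered = features - features.mean(axis=0)      # center each feature column
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]                          # one scalar per frame

# Hypothetical stand-in for VGG19 frame features (the real ones are 4096-D):
# 100 "normal" frames followed by 60 frames drawn from a shifted distribution.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0.0, 1.0, (100, 16)),
                   rng.normal(3.0, 1.0, (60, 16))])
signal = pca_1d(feats)   # a 1-D, time-series-like input for change point detection
```

The resulting 1-D sequence is what the change point detector operates on, one value per frame in temporal order.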
III-B1 Principal Component Embedding (PCE)
The principal components of a feature matrix extract the dominant patterns in the matrix in terms of a complementary set of score and loading plots [50]. PCA is a linear dimensionality reduction technique that decomposes a multivariate dataset into a set of successive orthogonal components that capture the maximum variance in the data. The input data is centered, but not scaled, for each feature before applying Singular Value Decomposition (SVD). The computational efficiency and speed of PCA make it a very popular option in the machine learning research community. See Figure 3 for a visualization of a sample video projected onto the single dimension that explains the most variance, using the 4096-dimensional feature vectors extracted from VGG19.

III-B2 Kernel Principal Component Embedding (KPCE)
In order to capture some nonlinearities in the embedding, we applied kernel principal component analysis, which achieves nonlinear dimensionality reduction through the use of kernels. While PCA uses a linear kernel to construct the eigendecomposition of the covariance matrix of the data, kernel PCA uses the kernel trick, implicitly mapping the data to a higher-dimensional feature space with the original linear eigendecomposition performed in a reproducing kernel Hilbert space. We experimented with two different kernels: the Gaussian and cosine kernels. Figures 2(b) and 2(c) show the 1-D projections using the two kernels, and Figure 3 shows the visualization of a sample video after projection into a 1-dimensional embedding space. The cosine kernel computes similarity using the cosine distance metric, $k(x, y) = \frac{x^{\top} y}{\|x\| \, \|y\|}$; two objects that are exactly alike have zero distance. The Gaussian kernel is an exponential function of the $\gamma$-scaled quadratic distance between any two points, $k(x, y) = \exp(-\gamma \|x - y\|^{2})$. The aim of comparing multiple kernels, as shown in Figure 3, is to understand the sensitivity of the change point algorithm to the structure of the video embedding.
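A minimal NumPy sketch of this kernelized projection, assuming the standard double-centering of the Gram matrix (the function name `kernel_pca_1d` and its defaults are ours, for illustration):

```python
import numpy as np

def kernel_pca_1d(X, kernel="rbf", gamma=None):
    """Project the rows of X onto the leading kernel principal component."""
    n = X.shape[0]
    if kernel == "rbf":       # Gaussian kernel: exp(-gamma * ||x - y||^2)
        gamma = gamma if gamma is not None else 1.0 / X.shape[1]
        sq = np.sum(X ** 2, axis=1)
        K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    elif kernel == "cosine":  # cosine similarity of L2-normalized rows
        Xn = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
        K = Xn @ Xn.T
    else:                     # linear kernel recovers ordinary PCA (up to scaling)
        K = X @ X.T
    # Double-center the Gram matrix (centering in the implicit feature space).
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one
    w, v = np.linalg.eigh(Kc)                    # eigenvalues in ascending order
    return v[:, -1] * np.sqrt(max(w[-1], 0.0))   # 1-D embedding per frame
```

Swapping the kernel changes the geometry of the resulting 1-D signal, which is exactly the sensitivity the comparison above is probing.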
III-B3 Autoencoder
Autoencoders learn useful representations without any supervision [46]. The goal is to learn a mapping from high-dimensional observations to a lower-dimensional representation space such that the original observations can be (approximately) reconstructed from the lower-dimensional representation. An autoencoder is a parametric model trained using an encoder-decoder neural network architecture. We applied a 2-layer architecture and optimized the parameters by minimizing the mean squared loss between the actual frame features and the reconstruction, using a learning rate of 0.001. The pretrained autoencoder was subsequently used to encode the extracted features of the test videos into a 1-dimensional sequence. While training, we also oversampled the minority classes to account for the class imbalance, as described in III-A.

III-B4 t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE [48] uses a probabilistic model to minimize the KL-divergence between the high-dimensional, Gaussian-distributed input feature vectors and the lower-dimensional t-distributed embedding. We applied t-SNE to encode the extracted video features into a 1-dimensional manifold. We set the perplexity parameter to 50, which is analogous to the number of nearest neighbours used in other manifold learning methods. Because t-SNE can be computationally costly, especially with high-dimensional input, we mitigated this by first applying PCA to reduce the full video frame features to between 50 and 100 dimensions before applying t-SNE to the PCA output feature matrix. Figure 3 shows the 1-D embedding plot for our test video.

III-C Shot Boundary Detection (SBD) in CE Video
In temporally segmenting a CE video, we consider a temporal boundary to be a point where a pathology appears between a pair of frames in the sequence. In this paper, we employed the PELT algorithm, since it requires no supervision in detecting the transition points in the video. The algorithm is derived from the Optimal Partitioning algorithm but introduces a pruning step within the dynamic program to minimize the computational cost. The pruning reduces the computational cost without affecting the exactness of the resulting segmentation, making it an ideal candidate for high-dimensional video data. The PELT algorithm is able to detect multiple transition points and generally produces quick and consistent results. It solves the penalized detection problem when the number of transition points in the sequence is unknown: by minimizing the penalized cost function in eq. (1), it estimates both the number of transition points and the locations of the changes in a sequence of data. The algorithm has an expected computational cost of O(n), where n is the number of data points; in our case, n is the number of frames in the video. The PELT algorithm can solve the change point detection problem using different kernels, but the most validated is the Gaussian kernel.
On an ordered sequence of frame features $y_{1:n} = (y_1, \ldots, y_n)$, our SBD model will have $m$ transition points with positions $\tau_{1:m} = (\tau_1, \ldots, \tau_m)$, where each $\tau_i \in \{1, \ldots, n-1\}$. We specify $\tau_0 = 0$ and $\tau_{m+1} = n$, and assume the transition points are ordered such that $\tau_1 < \tau_2 < \cdots < \tau_m$. The transition points split the data into $m+1$ segments, with the $i$th segment containing $y_{(\tau_{i-1}+1):\tau_i}$.
The algorithm begins by conditioning on the last change point; it then iteratively relates the optimal value of the cost function to the cost of the optimal partition of the data prior to the last transition point, plus the cost of the segment from the last transition point to the end of the data [23]. Let $\mathcal{T} = \{\tau : 0 = \tau_0 < \tau_1 < \cdots < \tau_m < \tau_{m+1} = n\}$ denote the set of possible vectors of transition points for the video. The optimal partition is defined as:
$$\min_{\tau \in \mathcal{T}} \sum_{i=1}^{m+1} \left[ \mathcal{C}\left(y_{(\tau_{i-1}+1):\tau_i}\right) + \beta \right] \qquad (1)$$
where $\mathcal{C}$ is a cost function for a segment, and $\beta$ is a regularizer that guards against overfitting and essentially determines how many transition points the algorithm will find. The higher the specified $\beta$, the fewer transition points are detected, forcing the algorithm to minimize the False Positive Rate (FPR). It is important to experiment with this hyperparameter to ensure that increasing the penalty does not jeopardise the ability to detect true transition points, or true positives (TP).
$$\mathcal{C}\left(y_{(\tau_{i-1}+1):\tau_i}\right) = -2 \sum_{t=\tau_{i-1}+1}^{\tau_i} \log f\left(y_t \mid \hat{\theta}_i\right) \qquad (2)$$
The cost function $\mathcal{C}$ is chosen as twice the negative log-likelihood, as in eq. (2), where $\hat{\theta}_i$ is the maximum likelihood estimate of the parameters of the $i$th segment; a minimum segment length is also imposed.
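The effect of the penalty can be illustrated with a toy sweep. The sketch below uses a simple penalized binary segmentation as a lightweight stand-in for PELT (splitting only while the cost reduction exceeds the penalty); the signal and penalty values are hypothetical:

```python
import numpy as np

def sse(y):
    # Segment cost: sum of squared deviations from the segment mean.
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def binseg(y, beta, offset=0):
    """Recursive binary segmentation: keep splitting while the best split
    reduces the total cost by more than the penalty beta."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    if n < 2:
        return []
    total = sse(y)
    gains = [total - sse(y[:k]) - sse(y[k:]) for k in range(1, n)]
    k = int(np.argmax(gains)) + 1
    if gains[k - 1] <= beta:
        return []          # no split is worth the penalty
    return (binseg(y[:k], beta, offset) + [offset + k]
            + binseg(y[k:], beta, offset + k))

# A toy 1-D embedding with one weak boundary (at 40) and one strong one (at 80).
y = [0.0] * 40 + [1.0] * 40 + [6.0] * 40
```

With beta = 5 both boundaries survive ([40, 80]); with beta = 100 only the strong one does ([80]); with beta = 1000 none do, which is exactly the false-positive versus true-positive trade-off described above.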
IV Experiments
We conducted experiments using eight VCE videos collected during real clinical examinations under the supervision of expert gastroenterologists. When reviewing and analyzing CE videos, gastroenterologists are mostly interested in the small bowel region, which can only be accessed through VCE and not through any of the other upper or lower endoscopy procedures. Detecting pathological change within the small bowel is a much more difficult problem than detecting transitions between regions of the GI tract such as the esophagus, stomach, and colon. For our experiments, we therefore trimmed the long videos to focus only on the small bowel region. Table I shows the number of frames per video covering only the small bowel region, after removing other regions such as the upper esophagus, the stomach, and the lower colon.
We extracted the videos from the RapidReader software program and preprocessed each video into frames. The eight videos were collected from different patients during clinical endoscopy procedures using SB3 Given Imaging PillCam capsules, each equipped with a 576 x 576 pixel camera. For each complete video, the small bowel transit time corresponds to approximately hr [20]. In order to isolate the small bowel region, two endoscopy research scientists annotated each video, identifying the region where each image was captured as well as any disease or abnormality found. The number of frames per video after annotation is summarized in Table I.
We randomly selected five videos for pretraining our feature extraction model and the autoencoder. We reserved the remaining three videos for testing the entire system. Using videos from completely different patients during testing helps minimize bias and ensures that our approach will generalize to new, unseen patient video data.
Video ID | Training samples | Testing samples
Video 1  | 12,303           |
Video 2  | 13,177           |
Video 3  | 8,452            |
Video 4  | 23,124           |
Video 5  | 32,181           |
Video 6  |                  | 8,701
Video 7  |                  | 16,909
Video 8  |                  | 10,037
IV-A Implementation
We developed our entire system using the PyTorch framework on an NVIDIA GTX 2080 machine. We ensured that all our experiments were run on the same configuration, for consistency across the compared techniques. Each of the feature extractors was trained for up to 30 epochs using a learning rate of 0.001 and Stochastic Gradient Descent optimization. We also trained the autoencoder to embed the frame features for about 50 epochs. During each pretraining run, we oversampled the minority classes based on the inverse of their proportion in the data; this gave a significant boost to the representation capability of the network on the abnormal frames.
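The inverse-proportion oversampling can be sketched as follows; the label counts are hypothetical, and in a PyTorch setup like ours the resulting per-sample weights would be handed to `torch.utils.data.WeightedRandomSampler`:

```python
import numpy as np

# Hypothetical frame labels: 0 = normal, 1 and 2 = minority abnormal classes.
labels = np.array([0] * 900 + [1] * 80 + [2] * 20)

# Inverse-of-proportion weights: every class gets equal total sampling mass,
# so minority (abnormal) frames are drawn far more often per epoch.
classes, counts = np.unique(labels, return_counts=True)
class_weight = {c: 1.0 / n for c, n in zip(classes, counts)}
sample_weights = np.array([class_weight[c] for c in labels])

# With PyTorch:
#   sampler = torch.utils.data.WeightedRandomSampler(
#       weights=sample_weights, num_samples=len(labels), replacement=True)
```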
IV-B Evaluation
We evaluated the performance of our method based on the AUC-ROC, as shown in Table II. At each time step $t$, the model predicts whether $t$ is a transition point or not. A transition point occurs when the class of the frame at $t$ differs from the class of the frame at $t-1$. Using the predicted output, we computed the True Positive and False Positive rates, which we used to compute the ROC. Each transition point is considered a pathological event, so we benchmarked against the ground truth labels provided by the medical experts. This is obviously a very challenging problem, as neither the change point detection algorithm nor the feature-embedding models have any information on the statistical properties that characterize any of the pathologies in the video.
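The ground-truth transition points and the rates above can be computed with a small helper; the function names are ours, and the logic assumes, as described, that every index t = 1..n-1 is a candidate transition:

```python
import numpy as np

def transition_points(frame_classes):
    """Ground-truth transitions: indices t where the class at t differs
    from the class at t-1 (expert frame labels)."""
    c = np.asarray(frame_classes)
    return set((np.flatnonzero(c[1:] != c[:-1]) + 1).tolist())

def tpr_fpr(predicted, truth, n_frames):
    """True and false positive rates for a set of predicted transition indices."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    negatives = (n_frames - 1) - len(truth)   # candidate points are t = 1..n-1
    return tp / max(len(truth), 1), fp / max(negatives, 1)
```

Sweeping the detector's penalty (or a decision threshold) and recording these two rates traces out the ROC curve that the AUC summarizes.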
V Results and Discussion
Figure 4 shows experimental results of detected boundaries using the PCA embedding and the PELT change point algorithm. Each of the alternating pink-colored intervals marks a section of some pathological abnormality. There are points where changes are visually observable but do not correspond to pathological events; these points are due to camera rotations and flips as the capsule is propelled down the GI tract through peristalsis. In other words, there is a spatial transition in the content captured by the camera, but those changes are not pathological changes.
Experiments on feature extraction also showed that the representation capability of the base CNN model is critical to what the boundary detector is able to identify: the quality of the base CNN's representation of diseased frames directly impacts the performance of the boundary detection algorithm. In addition, different CNN architectures showed varying representation performance when applied to different classes of diseases (or lesions); for example, ResNet-152 may represent diffuse bleeding in a frame better than VGG19. Lesions differ significantly in geometry, color, texture, and surrounding lighting conditions. This indicates that the capabilities of base CNNs are not universal, and some architectures capture certain structures better than others.
Table II shows comparative results using different parametric and nonparametric embedding techniques. Parametric representation frameworks such as the autoencoder are difficult to train, but are able to capture some nonlinearities in the data when training succeeds.
Embedding | AUC Score
PCA | 0.66
t-SNE | 0.49
Kernel PCA (cosine) | 0.50
Kernel PCA (RBF) | 0.50
AutoEncoder | 0.42
Table II compares the receiver operating characteristics of the different embedding techniques, with the TPR and FPR aggregated over the test videos. While some embeddings performed worse than a random encoding, PCA with the linear kernel achieved an AUC of 0.66, outperforming the other embedding techniques. Clearly, PCA is better able to encode the frame features to capture abnormal boundaries than the other embeddings, including the autoencoder (0.42) and t-SNE (0.49).
While VGG19 was able to encode the frames and separate diseased frames from normal frames, the final pooling layer of the model has 4096 features. Embedding this into a 1-dimensional vector is not a trivial problem, due to the complexity of CE video frame features and the complex geometry of some abnormalities, such as angioectasia, which may be difficult to detect even for humans.
V-1 Detected Video Boundaries in a Sample Test Video
Figure 5 below shows the detected transition points in the sequence of frames.

As shown in Figure 5, some of the boundaries detected in the sequence of video frames are not necessarily indicative of a pathological change event; however, very similar frames are captured within the same temporal boundaries. Clearly, detecting pathological boundaries in VCE videos is a nontrivial and very challenging problem. A binary classification model that encodes frames into normal and abnormal categories may therefore help mitigate this challenge.
VI Conclusion and Future Work
In this paper, we developed a novel unsupervised technique for the temporal segmentation of long capsule endoscopy videos. While our method can generalize to videos from other domains, we experimented using capsule endoscopy videos collected from patients during real clinical examinations. All collected data went through proper IRB approval prior to analysis. After downloading the videos from the RapidReader software, we extracted features from each frame using a pretrained CNN model. The high-dimensional frame features were projected into a lower, 1-dimensional representation for the entire video, and we applied the pruned exact linear time algorithm to detect transition boundaries in the video using this lower-dimensional embedding. Our results showed that the transition detection algorithm is better able to capture pathological events in the sequence of frames when PCA is used as the embedding mechanism: PCA achieved an AUC-ROC of 66% and outperformed the other, nonlinear embedding techniques. While our method can easily generalize across multiple domains, when applied to CE videos the proposed technique can facilitate experts' review through significant savings in time and effort. As a next step, we will develop a fully integrated long video summarization model requiring little or no expert supervision.
References
 [1] (2021) Lesion2Vec: deep metric learning for few shot multiple lesions recognition in wireless capsule endoscopy. arXiv preprint arXiv:2101.04240. Cited by: §II-A.
 [2] (2020) Dialogue-based simulation for cultural awareness training. arXiv preprint arXiv:2002.00223. Cited by: §II-C.
 [3] (2020) Deep learning methods for anatomical landmark detection in video capsule endoscopy images. In Proceedings of the Future Technologies Conference, pp. 426–434. Cited by: §II-A, §III-A.
 [4] (2017) A survey of methods for time series change point detection. Knowledge and Information Systems 51 (2), pp. 339–367. Cited by: §II-C.
 [5] (1989) Algorithms for the optimal identification of segment neighborhoods. Bulletin of Mathematical Biology 51 (1), pp. 39–54. Cited by: §II-C.
 [6] (2012) Parametric Statistical Change Point Analysis. Birkhäuser, Basel, Switzerland. External Links: ISBN 9780817648008, Document Cited by: §II-C, §III-A.

 [7] (2016) Wireless capsule endoscopy video summarization: a learning approach based on siamese neural network and support vector machine. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 1303–1308. Cited by: §II-A, §II-B.
 [8] (2017) Deep features learning for medical image analysis with convolutional autoencoder neural network. IEEE Transactions on Big Data. Cited by: §I.
 [9] (2009) Developing assessment system for wireless capsule endoscopy videos based on event detection. In Medical Imaging 2009: Computer-Aided Diagnosis, Vol. 7260, pp. 72601G. Cited by: §II-B.
 [10] (2015) Multiple-change-point detection for high dimensional time series via sparsified binary segmentation. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pp. 475–507. Cited by: §II-C.
 [11] (2012) Bayesian online spectral change point detection: a soft computing approach for online ASR. International Journal of Speech Technology 15 (1), pp. 5–23. Cited by: §II-C.
 [12] (1988) Nonparametric methods for change-point problems. Handbook of Statistics 7, pp. 403–425. Cited by: §II-C.

 [13] (2015) Adaptive features extraction for capsule endoscopy (CE) video summarization. In International Conference on Computer Vision and Image Analysis Applications, pp. 1–5. Cited by: §II-A.
 [14] (2017) Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia 19 (9), pp. 2045–2055. Cited by: §I.
 [15] (2020) Deep model-based semi-supervised learning way for outlier detection in wireless capsule endoscopy images. IEEE Access 8, pp. 81621–81632. Cited by: §I, §II-A.
 [16] (2000) The open video project: research-oriented digital video repository. In Proceedings of the Fifth ACM Conference on Digital Libraries, pp. 258–259. Cited by: §I.
 [17] (2009) Kernel change-point analysis. In Advances in Neural Information Processing Systems, pp. 609–616. Cited by: §II-C, §III-A.
 [18] (2010) Reduction of capsule endoscopy reading times by unsupervised image mining. Computerized Medical Imaging and Graphics 34 (6), pp. 471–478. Cited by: §II-A.
 [19] (2000) Wireless capsule endoscopy. Nature 405 (6785), p. 417. Cited by: §I.
 [20] (2008) Colon capsule endoscopy: a new method of investigating the large bowel. Journal of Gastrointestinal and Liver Diseases 17 (3), pp. 347–352. Cited by: §IV.
 [21] (2013) Endoscopy video summarization based on unsupervised learning and feature discrimination. In 2013 Visual Communications and Image Processing (VCIP), pp. 1–6. Cited by: §II-A.
 [22] (2005) An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing Letters 12 (2), pp. 105–108. Cited by: §II-C.
 [23] (2012) Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association 107 (500), pp. 1590–1598. Cited by: §II-C, §III-A, §III-C.
 [24] (2007-10) Temporal segmentation of facial behavior. In Proceedings of (ICCV) International Conference on Computer Vision. Cited by: §II-C.
 [25] (2001-06) Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289. External Links: ISBN 978155860778, Document Cited by: §II-C.
 [26] (2008) Wireless capsule endoscopy color video segmentation. IEEE Transactions on Medical Imaging 27 (12), pp. 1769–1781. Cited by: §II-B.
 [27] (2021) Ten simple rules for engaging with artificial intelligence in biomedicine. Public Library of Science, San Francisco, CA, USA. Cited by: §I.
 [28] (2013) Online Bayesian change point detection algorithms for segmentation of epileptic activity. In 2013 Asilomar Conference on Signals, Systems and Computers, pp. 1833–1837. Cited by: §II-C.
 [29] (2014) Automated polyp detection in colon capsule endoscopy. IEEE Transactions on Medical Imaging 33 (7), pp. 1488–1502. Cited by: §II-A.
 [30] (2014) Video summarization based tele-endoscopy: a service to efficiently manage visual data generated during wireless capsule endoscopy procedure. Journal of Medical Systems 38 (9), pp. 109. Cited by: §II-A.
 [31] (2017) Sparse coded handcrafted and deep features for colon capsule video summarization. In 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS), pp. 728–733. Cited by: §II-A.
 [32] (2018) Deep learning and handcrafted feature based approaches for polyp detection in medical videos. In 2018 IEEE 31st International Symposium on Computer-Based Medical Systems (CBMS), pp. 381–386. Cited by: §II-A.
 [33] (2020) A survey on contemporary computer-aided tumor, polyp, and ulcer detection methods in wireless capsule endoscopy imaging. Computerized Medical Imaging and Graphics, pp. 101767. Cited by: §II-A.
 [34] (2021) Feature selection using reinforcement learning. arXiv preprint arXiv:2101.09460. Cited by: §II-A.
 [35] (2007) A review and comparison of changepoint detection techniques for climate data. Journal of Applied Meteorology and Climatology 46 (6), pp. 900–915. Cited by: §II-C.
 [36] (2013) Detection of changes in variance using binary segmentation and optimal partitioning. Note: [Online; accessed 17 Jul. 2021] External Links: Link Cited by: §II-C.
 [37] (2014) Automated bleeding detection in capsule endoscopy videos using statistical features and region growing. Journal of Medical Systems 38 (4), pp. 25. Cited by: §II-A.
 [38] (2020) Hierarchical deep convolutional neural networks for multi-category diagnosis of gastrointestinal disorders on histopathological images. arXiv preprint arXiv:2005.03868. Cited by: §I.
 [39] (2019) Data collection methods for building a free response training simulation. In 2019 Systems and Information Engineering Design Symposium (SIEDS), pp. 1–6. Cited by: §II-C.
 [40] (2016) Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1049–1058. Cited by: §I.
 [41] (2014-09) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv. External Links: 1409.1556, Link Cited by: §I, §III-A.
 [42] (2010) Video shot boundary detection: seven years of TRECVid activity. Computer Vision and Image Understanding 114 (4), pp. 411–418. Cited by: §I.
 [43] (2003) Wireless capsule endoscopy. Gut 52 (suppl 4), pp. iv48–iv50. Cited by: §I.
 [44] (2014) Sequential Analysis: Hypothesis Testing and Changepoint Detection. CRC Press. Cited by: §II-C.
 [45] (2018-01) Selective review of offline change point detection methods. arXiv. External Links: 1801.00718, Document Cited by: §II-C.
 [46] (2018) Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069. Cited by: §III-B3.
 [47] (2020) Artificial intelligence using a convolutional neural network for automatic detection of small-bowel angioectasia in capsule endoscopy images. Digestive Endoscopy 32 (3), pp. 382–390. Cited by: §II-A.
 [48] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (11). Cited by: §III-B4.
 [49] (2009) Detection of contractions in adaptive transit time of the small bowel from wireless capsule endoscopy videos. Computers in Biology and Medicine 39 (1), pp. 16–26. Cited by: §II-B.
 [50] (1987) Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2 (1–3), pp. 37–52. Cited by: §III-B1.
 [51] (2015) Saliency based ulcer detection for wireless capsule endoscopy diagnosis. IEEE Transactions on Medical Imaging 34 (10), pp. 2046–2057. Cited by: §II-A.
 [52] (2010) An abnormality based WCE video segmentation strategy. In 2010 IEEE International Conference on Automation and Logistics, pp. 565–570. Cited by: §II-A, §II-B.