Deep Learning and Conditional Random Fields-based Depth Estimation and Topographical Reconstruction from Conventional Endoscopy

by   Faisal, et al.

Colorectal cancer is the fourth leading cause of cancer deaths worldwide and the second leading cause in the United States. The risk of colorectal cancer can be mitigated by the identification and removal of premalignant lesions through optical colonoscopy. Unfortunately, conventional colonoscopy misses more than 20% of the polyps that should be removed, due in part to poor contrast of lesion topography. Imaging tissue topography during a colonoscopy is difficult because of the size constraints of the endoscope and the deforming mucosa. Most existing methods make geometric assumptions or incorporate a priori information, which limits accuracy and sensitivity. In this paper, we present a method that avoids these restrictions, using a joint deep convolutional neural network-conditional random field (CNN-CRF) framework. Estimated depth is used to reconstruct the topography of the surface of the colon from a single image. We train the unary and pairwise potential functions of a CRF in a CNN on synthetic data, generated by developing an endoscope camera model and rendering over 100,000 images of an anatomically-realistic colon. We validate our approach with real endoscopy images from a porcine colon, transferred to a synthetic-like domain, with ground truth from registered computed tomography measurements. The CNN-CRF approach estimates depths with a relative error of 0.152 for synthetic endoscopy images and 0.242 for real endoscopy images. We show that the estimated depth maps can be used for reconstructing the topography of the mucosa from conventional colonoscopy images. This approach can easily be integrated into existing endoscopy systems and provides a foundation for improving computer-aided detection algorithms for detection, segmentation and classification of lesions.




I Introduction

COLORECTAL cancer (CRC) is the third most commonly diagnosed cancer in the United States [1]. Colonoscopy screening can significantly reduce colorectal cancer mortality by detecting and removing premalignant lesions. However, this approach has well-known limitations [2, 3] and recent studies have suggested that gastroenterologists can easily miss more than 20% of clinically relevant polyps [4, 5, 6, 7]. Approximately 60% of colorectal cancer cases detected after optical colonoscopy are associated with missed lesions [8, 9]. In addition to the well-characterized problem of missed polyps, non-polypoid lesions are even more difficult to screen for and are increasingly recognized as harboring significant malignant potential [10].

One of the most effective ways to increase lesion detection rates is by using chromoendoscopy. Chromoendoscopy increases both polypoid and non-polypoid lesion contrast by iteratively spraying and rinsing a topical dye through the colon, effectively encoding surface topography as color contrast [11]. However, chromoendoscopy is not used in routine screening because it doubles procedure time and requires specialized training.

Computational measurement of colon topography has the potential to improve lesion detection rates while meeting practical colonoscopy workflow and clinical constraints. Surface features could be used to amplify lesion contrast, assist in geometric lesion classification [12], augment conventional images [13], or improve computer-aided lesion detection algorithms. Current state-of-the-art lesion detection and classification methods rely on color and texture of the lesions for feature extraction [14, 15]. However, surface topography of the colon could prove vital for automatic lesion detection and classification. Lastly, 3D mapping of the colon surface may be useful for advanced colonoscopy quality metrics, such as fractional coverage of the examination [16].

I-A Related Work

Depth Estimation from Endoscopy Images

Despite the recent advances in computer vision (CV) and image processing, colonoscopy remains a particularly challenging environment for depth estimation and 3D reconstruction. Colonoscopes have a monocular camera with close light sources and a wide field-of-view, and both the endoscope and the colon are in frequent motion. The unpredictable movement, limited working area, small endoscope size, non-uniform colon texture patterns and deformable nature of the colon render conventional CV techniques like shape from stereo (SfSt) [17] and shape from texture (SfT) [18, 19] inadequate for robust depth estimation. More advanced approaches attempt to reconstruct colon surfaces with restrictive assumptions [20], and there is currently no model-based approach that robustly and accurately estimates depth from a colonoscopy video. Photometric stereo endoscopy captures some 3D structure of the mucosa [21, 11, 22] but is inherently qualitative due to the unknown working distances from each object point to the endoscope.

Learning for Depth Estimation

Learning-based methods have been used for monocular depth prediction for conventional computer vision applications [23, 24, 25, 26, 27, 28], particularly for autonomous navigation [29, 30]. A dictionary learning-based approach has recently been employed for depth estimation for colonoscopy images in [31]. However, the virtual colonoscopy data used for training does not simulate the optical properties of an actual endoscope and there was no validation of the technique presented on real endoscopy images. Promising preliminary results have also been shown for 3D monocular reconstruction for functional endoscopic sinus surgery [32], but the network used was shallow with a single layer and eight nodes trained from only 36 images. A small number of images were used because it is difficult to get ground truth training data. Moreover, the texture in the data is patient specific and cannot be used to estimate depth from other patients, requiring a new training set to be acquired at the beginning of each new procedure. Recent work on monocular 3D reconstruction for assisted navigation in bronchoscopy uses deep learning for monocular depth estimation [33]. However, they validate their method only on phantom data and require training from patient specific CT data every time depth estimates are required from a new patient.

Combining CNNs with a graphical model for structured learning problems has shown considerable promise [26, 34, 35]. Pixel-wise labeling, which assigns a continuous or discrete label to each pixel of the image, has generally been tackled with feature engineering. Recently, CNNs have been used extensively and with great success for a variety of CV tasks [35, 36]. However, such approaches lack spatial consistency, such as smooth transitions between neighboring pixels. Spatial consistency has traditionally been captured by probabilistic graphical models such as CRFs [37], which can augment CNNs to improve depth estimation accuracy [26, 24].

I-B Contributions and Significance

Traditional approaches of learning-based depth estimation are trained on data that include patient-specific texture, color, and shape, making them difficult to generalize without acquiring a large amount of ground truth data. Low-level texture details are patient-specific and not diagnostic, such as vascular patterns. High-level texture, on the other hand, contains clinically-relevant features that can be generalized across patients. Although details from texture and color are important, these approaches fail to exploit what may be the strongest cue of depth in the small working distances encountered in endoscopy, the inverse square fall-off in light intensity with propagation distance. Most CT colonoscopy (CTC) or virtual colonoscopy software packages such as Slicer 3D [38, 39] and Viatronix [40] do not utilize an accurate model for the optical properties of an endoscope.

Depth values for a specific object in view of the endoscope are inherently continuous, thus depth estimation from monocular endoscope images can easily be formulated as a Conditional Random Fields (CRF) learning problem. In this work we develop and train a CNN-CRF network on a large dataset of realistic endoscopy data with ground truth depth. Our specific contributions can be summarized as follows:

  1. We developed an accurate optical model of an endoscope that includes the inverse-square law of intensity fall-off, to generate synthetic and virtual images of the colon with ground truth depth.

  2. We use this data to train a network that consists of the unary and pairwise parts of a joint CNN-CRF framework.

  3. We validate our results on test data from the digital synthetic colon, a silicone colon phantom, and endoscopy images collected from a porcine colon registered with CT to give accurate ground truth depth.

To the best of our knowledge, this is the first deep learning network trained on synthetically-generated endoscopy images. In practice, large datasets of labeled or annotated medical images are not generally available due to privacy issues, scarcity of experts available for annotation, underrepresentation of rare conditions, which leads to highly correlated features of the normal condition, and non-standardized datasets. This problem has recently been tackled with transfer learning, which shows promising results on conventional computer vision networks fine-tuned for medical images [41, 42, 43, 44]. However, transfer learning can lead to artifacts, specifically for regression problems [41]. We show that the significant performance benefits of training with large datasets can be realized by utilizing synthetically-generated medical data with an accurate forward model for the imaging system and an anatomically-realistic model of the organ. We further show that this network accurately estimates depth in real endoscopy images after transfer to a synthetic-like domain.


Fig. 1: Training frames and ground truth depths from a texture free synthetic colon. This model was used to generate a large dataset of rendered images with low spatial frequency topography and noise-free ground-truth depth.


Fig. 2: Digital data collection using a virtual endoscope moving through a synthetic colon. The endoscope was randomly translated and rotated to generate a variety of data with ground truth depth for training.

II Methods

II-A Generating Ground Truth Training and Test Data

A large dataset of endoscopy images with corresponding ground truth depth maps is required for training a CNN to estimate depth from a monocular scene. This data is challenging to generate because depth sensors are impractical to couple to a small endoscope and must receive regulatory approval to be used in humans. Moreover, the low-level texture of the colon is patient-specific and cannot be used to efficiently learn depth. To circumvent these obstacles, we generate several datasets of images from synthetic, phantom, porcine, and human models, which have increasing realism, decreasing quality of ground truth depth, and decreasing dataset sizes.

Synthetic Colon - Virtual Endoscopy Data.

To train our network, we generated over 100,000 texture-free endoscopy images, each with an associated ground truth depth, from a digital synthetic colon. This synthetic colon phantom was generated using Blender [45] and the data were recorded using a virtual endoscope with parameters selected to mimic the range found in common colonoscopes. The virtual colon had anatomically realistic diameter, bending angles and polyps [46] (Fig. 1). The viewing angle was varied across the rendered images. A Mitchell-Netravali filter [47, 48] was used to prevent aliasing. Two virtual light sources were placed on either side of the camera on the virtual endoscope and each was configured to provide inverse square fall-off of illumination intensity. The depth of the scene was recorded by calculating the distance from the camera to each point on the synthetic colon being imaged. Fig. 1 shows the synthetic colon and representative endoscopy data with aligned depth maps generated using this procedure. We varied the position of the virtual endoscope to generate a diverse set of endoscopy data and to accurately model the effects of endoscope motion and of light illuminating similar surfaces from different angles (Fig. 2). The virtual endoscope was randomly translated along the horizontal axis within the bounds of the synthetic colon and randomly rotated. This dataset was used for pre-training the CNN-CRF network.
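The inverse square fall-off assigned to each virtual light source can be illustrated with a minimal sketch. The function names below are hypothetical and are not part of the rendering pipeline; the sketch assumes idealized point sources and ignores surface orientation and reflectance.

```python
def relative_intensity(distance, reference_distance=1.0):
    """Irradiance of an ideal point source at `distance`, relative to its
    value at `reference_distance` (inverse-square law)."""
    if distance <= 0 or reference_distance <= 0:
        raise ValueError("distances must be positive")
    return (reference_distance / distance) ** 2


def two_source_intensity(d_left, d_right):
    """Sum the contributions of two identical point sources, mirroring the
    two lights placed on either side of the virtual camera (assumption:
    the sources are identical and add linearly)."""
    return relative_intensity(d_left) + relative_intensity(d_right)
```

Under these assumptions, doubling the working distance reduces the received intensity to one quarter, which is the depth cue the network is expected to learn from texture-free images.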


Fig. 3: Training frames and ground truth depths from reconstructed CT of a silicone colon phantom model. The 3D model was imaged using our virtual endoscope to generate training data with realistic high-spatial frequency topography.
| Dataset | Virtual / Real Endoscopy | Low Spatial Freq. Detail | High Spatial Freq. Detail | Ground Truth Depth Available | Training | Testing | Dataset Size |
|---|---|---|---|---|---|---|---|
| Synthetic Colon | Virtual | Yes | No | Yes | Pre-training | Yes | 100,000 |
| Phantom Colon | Virtual | Yes | Yes | Yes | Fine-tuning | Yes | 100,000 |
| Porcine Colon | Real | Yes | Yes | Yes | No | Yes | 1460 |
| Human Colonoscopy | Real | Yes | Yes | No | No | Qualitative | N/A |
| Human CTC | Virtual | Yes | No | Yes | Not Used | Not Used | N/A |

TABLE I: A comparison of different endoscopy datasets.

Colon Phantom CT - Virtual Endoscopy Data.

Although the synthetic colon data may be sufficient for learning the inverse square law, our synthetic colon model lacks high spatial frequency detail. To incorporate sensitivity to these features, we also generated training data from virtual endoscopy images from a CT dataset of a silicone colon phantom (The Chamberlain Group Colonoscopy Trainer #2003). The CT reconstruction was performed using filtered back-projection and was filtered using a Ram-Lak filter with linear interpolation in Slicer 3D [38]. The data was then imaged using the virtual endoscope described previously. These endoscopy images with higher frequency details help the network learn the properties of inverse intensity fall-off on ridges and polyps in the colon. This reconstructed CT data was filtered to reduce the effects of fine texture in the scene, which may be specific to the cadaver the phantom was molded from. Over 100,000 images were collected from this setup. Fig. 3 shows a portion of the reconstructed colon phantom, rendered virtual endoscopy images, and corresponding ground truth depth. This dataset was used to fine-tune our CNN-CRF depth estimation network.

Porcine Colon CT - Real Endoscopy Data.


Fig. 4: Monocular endoscopy depth prediction framework. The input image is transformed to its synthetic-like representation and fed into a shared unary and pairwise network trained on synthetic endoscopy data. The unary part regresses the depth and the pairwise part is responsible for smoothing based on neighboring superpixels. The predicted depth can be used to reconstruct a topographical map of the surface of the mucosa.

To validate the accuracy of our trained model on real tissue, we dissected a porcine colon and fixed it to a half-pipe scaffold with a bend to simulate the anatomy of the human transverse and descending colon (Fig. 7). Metallic pins were used as fiducial markers for localization and size estimation. The tissue was then imaged using a benchtop cone beam CT scanner with 720 projections at half-degree increments by rotating the scaffold on a stepper motor stage. The CT projections were reconstructed using filtered back-projection and a Ram-Lak filter [49]. The resulting 3D model was imaged using the virtual endoscope in Blender. The scaffold was then covered with black foil and was imaged with an optical endoscope (Misumi MO-V5006L) with a wide-angle lens (Misumi L23010IR-M5.5-53). The CT and optical endoscopy results were registered to get ground truth depth maps.

Optimization-based multimodal registration was performed on the virtual endoscopy image collected from a CT reconstructed porcine colon and an optical endoscopy image of the same view. A one-plus-one evolutionary optimizer [50, 51] was used to optimize a Mattes mutual information metric for similarity. The growth factor was set to 1.52, initial radius was set to 0.30, the radius adjustment was set to 0.014 and the maximum number of random spatial samples used to compute the metric was set to 500. The optimization was run for 250 iterations at 3 pyramid levels.
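The search behaviour of a one-plus-one evolutionary optimizer can be sketched as follows. This is a simplified illustration and not the registration code used in this work: the cost function is a toy quadratic rather than a Mattes mutual information metric, the shrink factor (growth factor raised to the power of -1/4) is an assumption, and the radius adjustment parameter is not modelled. Only the growth factor, initial radius and iteration count are taken from the text.

```python
import random


def one_plus_one_es(cost, x0, initial_radius=0.30, growth_factor=1.52,
                    shrink_factor=None, iterations=250, seed=0):
    """Minimise `cost` with a simplified (1+1) evolutionary strategy.

    One parent produces one perturbed child per iteration. The child
    replaces the parent only if it is better; the search radius grows
    after a success and shrinks after a failure.
    """
    rng = random.Random(seed)
    if shrink_factor is None:
        # Assumed default for this sketch, not a value from the paper.
        shrink_factor = growth_factor ** -0.25
    parent = list(x0)
    parent_cost = cost(parent)
    radius = initial_radius
    for _ in range(iterations):
        child = [p + radius * rng.gauss(0.0, 1.0) for p in parent]
        child_cost = cost(child)
        if child_cost < parent_cost:
            parent, parent_cost = child, child_cost
            radius *= growth_factor
        else:
            radius *= shrink_factor
    return parent, parent_cost
```

Because a child is accepted only when it improves the cost, the returned cost can never exceed the cost of the starting point. In a registration setting the vector being optimized would hold the transform parameters, and the cost would be the negative similarity metric evaluated at each pyramid level.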

Human Endoscopy Data. Human endoscopy data available from the NIH and other datasets available from the MICCAI endoscopy challenge [52] were used to qualitatively evaluate the performance of our network. Since real endoscopy data can have specular reflections we used graph-based in-painting [53, 54] to partially remove specular reflections.

Table I compares the datasets used in this study with CTC (virtual colonoscopy) data. CTC has been used in other depth estimation studies such as [31]; however, poor cleaning is a major limiting factor [55]. Moreover, recent studies on nonpolypoid lesion detection using CTC are restricted to polyps above a minimum size because of the limited practical resolution of in-vivo CTC and the requirement of post-processing to remove artifacts from incomplete preparation [56]. CTC also has a relatively high miss rate for non-polypoid lesions [57]. For these reasons CTC data is not used for training.

II-B Deep Learning with Conditional Random Fields

Inspired by the success of similar models for analyzing conventional images in [58, 59, 60, 26], we implemented an algorithm to estimate depth using a continuous CRF and a CNN. A continuous CRF is able to exploit the continuous depth values within specific regions of an endoscopic scene. Moreover, unlike several previous methods, this method does not require approximations, since the partition function can be calculated analytically and the log-likelihood optimization problem can therefore be solved directly. Fig. 4 shows a top-level flow diagram of the setup. The following sections describe the unary and pairwise parts and the training in detail. Then we discuss how the network trained on virtual endoscopy data is adapted to real data.

Preliminaries. Let $x$ be an image acquired from an endoscopic camera which has been divided into $n$ super-pixels and $y = [y_1, \dots, y_n]^\top \in \mathbb{R}^n$ be the depth vector corresponding to the super-pixels. Based on conventional graphical models, the depth of an image can be predicted by solving the following maximum a posteriori (MAP) problem:

$$\hat{y} = \arg\max_{y} \Pr(y \mid x) \tag{1}$$


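As with general CRFs, the conditional probability distribution of the raw data can be defined as:

$$\Pr(y \mid x) = \frac{1}{Z(x)} \exp\{-E(y, x)\}, \qquad Z(x) = \int_{y} \exp\{-E(y, x)\}\, dy \tag{2}$$

where $E(y, x)$ is the energy function and $Z(x)$ is the partition function.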



Fig. 5: The overall architecture for depth estimation from monocular endoscopy images. The input image is fed into a fully convolutional network which produces convolution maps. The maps were then related back to super-pixels in a pooling layer which gives feature vectors for each super-pixel. This is followed by 3 fully connected layers which give the unary part. For the pairwise part, similarities based on neighboring super-pixels were considered and fed into a fully connected layer. The outputs of both the unary and pairwise parts were fed into a CRF loss layer which minimizes the negative log likelihood of the probability density function.

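The energy function, $E(y, x)$, can be defined in terms of the unary potentials $U$ and pairwise potentials $V$ over the nodes $\mathcal{N}$ and edges $\mathcal{S}$ of $x$,

$$E(y, x) = \sum_{i \in \mathcal{N}} U(y_i, x; \theta) + \sum_{(i,j) \in \mathcal{S}} V(y_i, y_j, x; \beta) \tag{3}$$

where the unary part, $U$, regresses the depth from each superpixel and the pairwise part, $V$, enforces smoothness between similar neighboring superpixels. $\theta$ and $\beta$ are the two learning parameters associated with the unary and pairwise terms respectively.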

Unary Potential. The unary part is designed to regress superpixel-wise depth for an input endoscopy image. Similar to [26], the unary potential can be defined as follows,

$$U(y_i, x; \theta) = (y_i - z_i(\theta))^2 \tag{4}$$

where $z_i$ is the regressed depth of superpixel $i$ and $\theta$ represents CNN parameters. The architecture of the training network is described systematically in Fig. 5. The CNN used for the unary part makes use of recent developments in fully convolutional networks (FCNs). Unlike standard CNNs, which are composed of convolution followed by fully connected layers and produce non-spatial outputs, FCNs can take images of any size and produce spatial convolutional maps. FCNs have been extensively used for complicated problems, specifically for semantic segmentation [25, 61, 62, 63]. We initialize the first five layers from AlexNet [64, 65]. Two additional convolutional layers are added to the network (Fig. 5), which is capable of taking an input image of any size and producing convolutional maps. A typical problem with all fully convolutional architectures is that the feature maps produced can be significantly smaller than the actual size of the images. We mitigate this problem using a convolution map up-sampling step. For ease of implementation we use nearest neighbor up-sampling. Moreover, we incorporate a super-pixel pooling layer similar to [66, 26] to acquire super-pixel features from convolutional maps.
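The up-sampling and super-pixel pooling steps can be sketched on a single-channel map as follows. This is a minimal illustration under stated assumptions, not the network layers themselves: the function names are hypothetical, only one channel is handled, and average pooling within each super-pixel is assumed.

```python
def upsample_nearest(feature_map, out_h, out_w):
    """Nearest-neighbour up-sampling of a 2-D map given as a list of rows."""
    in_h, in_w = len(feature_map), len(feature_map[0])
    return [[feature_map[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]


def superpixel_average_pool(feature_map, labels):
    """Average the map values over each super-pixel.

    `labels` has the same shape as `feature_map` and holds one
    super-pixel id per pixel. Returns {super-pixel id: mean value}.
    """
    sums, counts = {}, {}
    for feat_row, label_row in zip(feature_map, labels):
        for value, label in zip(feat_row, label_row):
            sums[label] = sums.get(label, 0.0) + value
            counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}
```

For example, a 2x2 map up-sampled to 4x4 and pooled over two super-pixels that split the image into left and right halves yields one feature value per super-pixel, which is the form of input the fully connected layers of the unary part expect.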

Pairwise Potential. The objective of the pairwise potential is to smooth the depth regressed from the unary part based on the neighboring super-pixels. The pairwise potential function is based on standard CRF vertex and edge feature functions studied extensively in [58] and other works. Let $\beta = [\beta_1, \dots, \beta_K]^\top$ be the network parameters and $S^{(k)}$ be the $k$-th similarity matrix, where $S^{(k)}_{ij}$ represents a similarity metric between the $i$-th and $j$-th superpixel. We hypothesize that intensity is a valuable cue for depth estimation. With this in mind, we use intensity difference and the greyscale histogram as pairwise similarities expressed in a general form. The pairwise potential can then be defined as,

$$V(y_i, y_j, x; \beta) = \frac{1}{2} R_{ij} (y_i - y_j)^2, \qquad R_{ij} = \sum_{k=1}^{K} \beta_k S^{(k)}_{ij} \tag{5}$$


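Learning $\theta$ and $\beta$. The overall energy function defined in Eq. 3 can now be populated with unary and pairwise terms and can be written as,

$$E(y, x) = \sum_{i \in \mathcal{N}} (y_i - z_i)^2 + \sum_{(i,j) \in \mathcal{S}} \frac{1}{2} R_{ij} (y_i - y_j)^2 \tag{6}$$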


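For simplicity and explicit vector calculations, the term $R = [R_{ij}]$ can be defined as the affinity matrix, and $D$ as a diagonal matrix with $D_{ii} = \sum_{j} R_{ij}$. $D - R$ defines the graph Laplacian; for further simplicity we write $A = I + D - R$, where $I$ is an $n \times n$ identity matrix. Using these notations, with $z = [z_1, \dots, z_n]^\top$ the vector of regressed depths, $R$ symmetric, and the pairwise sum taken over ordered pairs of neighboring superpixels, Eq. 6 can be simplified as,

$$E(y, x) = y^\top A y - 2 z^\top y + z^\top z \tag{7}$$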
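The equivalence between the summation form of the energy (Eq. 6) and its matrix form (Eq. 7) can be checked numerically on a small example. The sketch below is illustrative only; the function names are hypothetical, and it assumes a symmetric affinity matrix with a zero diagonal and a pairwise sum over ordered pairs.

```python
def energy_sum_form(y, z, R):
    """E = sum_i (y_i - z_i)^2 + sum_{i,j} 0.5 * R_ij * (y_i - y_j)^2."""
    n = len(y)
    unary = sum((y[i] - z[i]) ** 2 for i in range(n))
    pairwise = sum(0.5 * R[i][j] * (y[i] - y[j]) ** 2
                   for i in range(n) for j in range(n))
    return unary + pairwise


def build_A(R):
    """A = I + D - R, with D the diagonal matrix of the row sums of R."""
    n = len(R)
    return [[(1.0 + sum(R[i]) if i == j else 0.0) - R[i][j]
             for j in range(n)] for i in range(n)]


def energy_matrix_form(y, z, A):
    """E = y^T A y - 2 z^T y + z^T z."""
    n = len(y)
    yAy = sum(y[i] * A[i][j] * y[j] for i in range(n) for j in range(n))
    zy = sum(z[i] * y[i] for i in range(n))
    zz = sum(v * v for v in z)
    return yAy - 2.0 * zy + zz
```

With three superpixels, affinities of 0.5 and 0.2 between consecutive neighbors, `y = [1, 2, 4]` and `z = [1.5, 2.5, 3]`, both forms evaluate to 2.8.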


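Because the energy is quadratic in $y$ (and $A$ is positive definite when the affinities $R_{ij}$ are non-negative), the integral in the partition function can be evaluated analytically, and the probability density function in Eq. 2 can now be simplified to the following form,

$$\Pr(y \mid x) = \frac{|A|^{1/2}}{\pi^{n/2}} \exp\{-y^\top A y + 2 z^\top y - z^\top A^{-1} z\} \tag{8}$$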
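The analytic evaluation of the partition function follows from completing the square in the quadratic energy. The derivation below is a sketch under the assumption that $A$ is symmetric positive definite, and it uses the standard Gaussian integral $\int \exp\{-u^\top A u\}\,du = \pi^{n/2} / |A|^{1/2}$.

```latex
\begin{aligned}
-E(y, x) &= -y^\top A y + 2 z^\top y - z^\top z \\
         &= -(y - A^{-1} z)^\top A \,(y - A^{-1} z) + z^\top A^{-1} z - z^\top z, \\
Z(x) &= \int_{y} \exp\{-E(y, x)\}\, dy
      = \frac{\pi^{n/2}}{|A|^{1/2}} \exp\{ z^\top A^{-1} z - z^\top z \}.
\end{aligned}
```

Dividing $\exp\{-E(y, x)\}$ by $Z(x)$ cancels the $z^\top z$ terms and leaves a density whose only dependence on $y$ is through $-y^\top A y + 2 z^\top y$.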



Fig. 6: Input frames and predicted depths for synthetic colon data and colon phantom CT data imaged using our virtual endoscope.

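Given $\Pr(y \mid x)$, we can now calculate the negative log-likelihood, which simplifies to,

$$-\log \Pr(y \mid x) = y^\top A y - 2 z^\top y + z^\top A^{-1} z - \frac{1}{2} \log |A| + \frac{n}{2} \log \pi \tag{9}$$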


The negative log-likelihood of the training data is minimized during the training process and the optimization problem can be represented as the following objective function,

$$\min_{\theta, \beta} \; -\sum_{m=1}^{M} \log \Pr\big(y^{(m)} \mid x^{(m)}; \theta, \beta\big) + \frac{\lambda_1}{2} \|\theta\|_2^2 + \frac{\lambda_2}{2} \|\beta\|_2^2 \tag{10}$$

where $M$ represents the number of images in the training set. In order to prevent over-fitting, two regularization terms have been added, one for each learning parameter. $\lambda_1$ and $\lambda_2$ represent the regularization or weight decay parameters. $\ell_2$ regularization penalizes heavily weighted vectors and promotes weight diffusion by encouraging the network to utilize all its inputs.

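Optimization Solution. The optimization problem is solved by standard stochastic gradient descent-based back-propagation. For the unary part, the partial derivatives of $-\log \Pr(y \mid x)$ are calculated with respect to each element $\theta_l$ of $\theta$. In Eq. 9 only the terms containing $z$ depend on $\theta$, so all other terms vanish under the partial derivative,

$$\frac{\partial \{-\log \Pr(y \mid x)\}}{\partial \theta_l} = 2\,(z^\top A^{-1} - y^\top)\, \frac{\partial z}{\partial \theta_l} \tag{11}$$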


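For the pairwise part, the partial derivatives are calculated with respect to each element $\beta_k$ of $\beta$,

$$\frac{\partial \{-\log \Pr(y \mid x)\}}{\partial \beta_k} = y^\top J y - z^\top A^{-1} J A^{-1} z - \frac{1}{2} \operatorname{Tr}(A^{-1} J) \tag{12}$$

where $J = \partial A / \partial \beta_k$ has entries $J_{ij} = -S^{(k)}_{ij} + \delta_{ij} \sum_{j'} S^{(k)}_{ij'}$.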


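Depth Estimation. To estimate the depth of a new endoscopic image, the MAP problem in Eq. 1 must be solved. Here we show that a closed-form solution of the problem exists based on the definitions presented above.

$$\hat{y} = \arg\max_{y} \log \Pr(y \mid x) = \arg\max_{y} \Big\{ -y^\top A y + 2 z^\top y - z^\top A^{-1} z + \frac{1}{2} \log |A| - \frac{n}{2} \log \pi \Big\} \tag{13}$$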


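To solve the above maximization problem, the partial derivative of the maximization term has to be calculated with respect to $y$. Thus, all terms without $y$ can be ignored, simplifying the problem to,

$$\hat{y} = \arg\max_{y} \{ -y^\top A y + 2 z^\top y \} = A^{-1} z \tag{14}$$

Setting the gradient $-2 A y + 2 z$ to zero gives this closed-form solution, so the MAP estimate can be computed directly by solving a linear system.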
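The closed-form estimate $\hat{y} = A^{-1} z$ can be illustrated with a small pure-Python sketch. The helper names are hypothetical, and a dense Gaussian elimination is used only for clarity; it is assumed here that $A = I + D - R$ with $D$ the diagonal matrix of the row sums of a symmetric, non-negative affinity matrix $R$.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def map_depth(z, R):
    """Closed-form MAP depth estimate: solve (I + D - R) y = z."""
    n = len(z)
    A = [[(1.0 + sum(R[i]) if i == j else 0.0) - R[i][j]
          for j in range(n)] for i in range(n)]
    return solve_linear_system(A, z)
```

With no affinities the estimate equals the unary prediction, while two superpixels with unary depths 0 and 2 and an affinity of 1 are pulled towards each other, to 2/3 and 4/3, which is the smoothing effect of the pairwise term.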


Fig. 7: Process of data generation from a porcine colon using reconstructed CT and optical endoscopy images. A porcine colon was dissected and placed on a scaffold. The scaffold was imaged on a cone-beam bench-top CT scanner and the 720 projections obtained were reconstructed using filtered back-projection. The CT reconstructed model was imaged with a virtual endoscope. The porcine colon was also imaged using an optical endoscope. The optical and virtual endoscopy views were registered and ground truth depth for each optical endoscopy view was obtained.


Fig. 8: Views from an optical and CT-virtual endoscopy on a porcine colon were registered using a one-plus-one evolutionary optimizer to generate ground truth depth for optical endoscopy images. The depth from CT-virtual endoscopy was calculated by measuring the distance from the virtual endoscope to each point appearing on the image. The depth from optical endoscopy was estimated using our network and was filtered like the reconstructed CT density for a fair comparison.

II-C Adversarial Training for Domain Adaptation

Since our network was trained on synthetic data, where low-level patient-specific texture details are absent, we include a domain adaptation step to test the network on real images. For new input images that contain this texture, we developed a network that transforms them to a synthetic-like representation. This bridges the gap between real and synthetic domains. We use adversarial training between a discriminator network and a transformer network. This setup is based on recent advances in generative adversarial networks and adversarial training [67, 68]. The transformer network takes batches of synthetic images for unsupervised training and learns to remove patient-specific texture from the input images. The discriminator, which is embedded in the transformer’s loss function, classifies the output as real or synthetic. Once the training reaches Nash equilibrium, the transformer is able to fool the discriminator and can transform a real image to its synthetic-like counterpart. To prevent the synthetic-like representation of a real image from deviating significantly from the original image, we use a self-regularization term to preserve patient-independent features. The overall transformer loss term is defined in terms of the learning parameters of the transformer and the discriminator, the trained discriminator, and the output of the transformer,


The second term defines the self-regularization between the real endoscopy image and its synthetic-like counterpart. It is computed on feature transforms of the two images; for the sake of simplicity, this was chosen to be a per-pixel loss. More details about this domain adaptation step and its implementation are beyond the scope of this paper and can be found in [69].

III Experiments

III-A Experimental Setup

We implemented the training networks using VLFeat MatConvNet [48] with MATLAB 2017a and CUDA 8.0. The training data was prepared by over-segmenting each virtual endoscopy image into superpixels using SLIC [70], and the corresponding ground truth depth was assigned to each superpixel. The data was randomized to prevent the network from learning too many similar features quickly. The network was pre-trained on synthetic colon data and fine-tuned on colon phantom data. The data was split into training, validation, and testing subsets. Training was done using K80 GPUs on the Maryland Advanced Computing Cluster (MARCC). Momentum was set at 0.9 as suggested in [26] and both weight decay parameters were set to 0.0007. The learning rate was initialized at 0.00001 and decreased by 20% every 20 epochs. These parameters were tuned to achieve the best results. A total of 300 epochs were run and the epochs with the least error were selected to avoid the selection of an over-fitted model.
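The learning-rate schedule described above can be written as a short function. This is a sketch under one assumption that the text does not settle: the 20% decrease is taken to be multiplicative, so that the rate is scaled by 0.8 after every 20 epochs. The function name is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-5, drop=0.20, step=20):
    """Step decay: multiply the rate by (1 - drop) once every `step` epochs.

    `epoch` is zero-based. The multiplicative interpretation of the
    20% decrease is an assumption of this sketch.
    """
    return initial_lr * (1.0 - drop) ** (epoch // step)
```

Under this interpretation the rate stays at 0.00001 for epochs 0 to 19, drops to 0.000008 at epoch 20, and has been reduced 14 times by the last of the 300 epochs.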

III-B Quantitative Evaluation Metrics

We evaluated the three datasets mentioned in section II based on metrics used by other monocular depth estimation work [23, 28, 26, 24], mostly for conventional vision. These metrics are:

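  1. Relative Error (rel): $\frac{1}{T} \sum_{p} \frac{|d_p^{gt} - d_p|}{d_p^{gt}}$

  2. Root Mean Square Error (rms): $\sqrt{\frac{1}{T} \sum_{p} (d_p^{gt} - d_p)^2}$

  3. Average $\log_{10}$ Error (log10): $\frac{1}{T} \sum_{p} |\log_{10} d_p^{gt} - \log_{10} d_p|$

where $d_p^{gt}$ and $d_p$ are the ground truth and predicted depths at pixel $p$, and $T$ is the total number of evaluated pixels.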
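These three metrics, in the form commonly used in monocular depth estimation work, can be computed as in the following sketch. The function name and the flat-list input format are hypothetical, and depths are assumed to be strictly positive.

```python
import math


def depth_errors(predicted, ground_truth):
    """Return the rel, rms and log10 errors for paired depth values.

    Both inputs are flat sequences of strictly positive depths of
    equal, non-zero length.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    t = len(predicted)
    pairs = list(zip(predicted, ground_truth))
    rel = sum(abs(g - p) / g for p, g in pairs) / t
    rms = math.sqrt(sum((g - p) ** 2 for p, g in pairs) / t)
    log10 = sum(abs(math.log10(g) - math.log10(p)) for p, g in pairs) / t
    return {"rel": rel, "rms": rms, "log10": log10}
```

For instance, predictions of 1 and 10 against ground truth depths of 1 and 100 give a relative error of 0.45 and an average log10 error of 0.5.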


Fig. 9: Qualitative results showing predicted depth and topographical reconstructions for monocular endoscopy images.

| Method | rel | log10 | rms |
|---|---|---|---|
| Unary (FCN Only) | 0.211 | 0.094 | 0.847 |
| Unary (Smooth) | 0.196 | 0.083 | 0.781 |
| CNN-CRF | 0.152 | 0.061 | 0.612 |

TABLE II: Performance Evaluation for Synthetic Blender Generated Endoscopy Data

| Method | rel | log10 | rms |
|---|---|---|---|
| Unary (FCN Only) | 0.227 | 0.097 | 0.907 |
| Unary (Smooth) | 0.216 | 0.094 | 0.884 |
| CNN-CRF | 0.183 | 0.080 | 0.753 |

TABLE III: Performance Evaluation for Colon Phantom CT Rendered Virtual Endoscopy Data

| Method | rel | log10 | rms |
|---|---|---|---|
| Unary (FCN Only) | 0.293 | 0.136 | 1.216 |
| Unary (Smooth) | 0.279 | 0.122 | 1.043 |
| CNN-CRF | 0.242 | 0.098 | 0.973 |

TABLE IV: Performance Evaluation on Real Endoscopic Data Collected on a Porcine Colon Registered with CT

III-C Comparative Analysis

It is not possible to compare our results directly with existing endoscopy depth estimation work because of the diversity of datasets and evaluation methods used. However, we do make a comparison with an FCN regression model that does not employ CRFs. Simple FCNs have recently been used for a variety of CV tasks, including work for endoscopy [33]. This comparison allows us to judge the benefit of using a graphical model. The CRF loss layer in the network is replaced with least squares regression. However, we do not claim this to be a direct comparison with [33] or other works because the data used is drastically different and their setup does not use super-pixels.


Fig. 10: Comparison of our proposed depth estimation and reconstruction method with a reconstruction from [20]. Input image and 3D reconstruction was taken from [20].

III-D Results with Synthetic Colon and Phantom Virtual Endoscopy Data

The trained network was tested on images from the synthetic colon and silicone colon phantom that were not used for training (Fig. 6). Using 10,000 randomly-selected test images, we observe that the accuracy of the network improves by every metric with the CNN-CRF method for images which are very similar to the training data (Tables II and III).

III-E Results with Porcine Colon Real Endoscopy Data

As mentioned earlier, optical and virtual endoscopy views from a CT reconstruction were registered to get ground truth depth maps for optical endoscopy images (Fig. 7). For a fair comparison, the depth map from endoscopy was filtered through the same pipeline of filters used for the CT reconstruction. Only registered regions of the two depth maps were compared. Representative images from this process are shown in Fig. 8 and the algorithm performance on porcine colon images is summarized in Table IV.

III-F Qualitative Results with Real Human Endoscopy Data

We tested the trained network on real data from colonoscopy images available from the NIH and the MICCAI endoscopy challenge databases [52, 71, 72]. The results in Fig. 9 show that the network can regress coarse depth maps that match intuitive cues from the image. These depth maps were then used to reconstruct the topography of the surface of the colon. Fig. 10 compares our method with 3D monocular reconstruction from the tubular assumption-based approach presented by Hong et al. [20].

IV Conclusions

This paper presents a novel architecture for monocular endoscopy depth estimation and topographical reconstruction that exploits the advantages of a joint CNN- and CRF-based framework. Unlike previous approaches, this method does not require geometric assumptions. The network was trained using 200,000 images from synthetically generated data and CT reconstructions imaged using a virtual endoscope. To the best of our knowledge, this is the first work that trains a network from a large set of synthetically generated and rendered medical images. This approach is particularly relevant to 3D endoscopy applications because, despite the clinical need, there is no practical way to acquire large datasets of real endoscopy images with corresponding ground truth. We validate our network on real colon tissue by generating a test dataset from a porcine colon that was mounted on a scaffold and imaged with CT and registered optical endoscopy. Our work adds to the active area of 3D endoscopy research and has the potential to improve CAD algorithms for detection, segmentation and classification of lesions.

The limitations of the current method include artifacts due to specular reflections, cases where the inverse of intensity might not be the major depth cue, and instances where the pairwise similarities give rise to artifacts. Moreover, there were several sources of testing error beyond the inherent accuracy of the network. The reconstruction, refinement, and filtering of the raw CT data all contribute to inaccuracies in the depth map used for ground truth. The CT data also include streaking artifacts due to non-uniform x-ray absorption, specifically around the metallic pin fiducial markers. Further error sources include inconsistency of the stepper motor that rotated the scaffold and errors in registering the virtual endoscopy CT view with the optical endoscopy image.

Our future work will focus on generalizing the concept of synthetic data generation for medical images and utilizing depth estimation as an additional cue for other endoscopy applications. The proposed network can also be used as an initialization for future deep learning-oriented endoscopy applications.

V Acknowledgments

The authors thank Dr. J. Webster Stayman and Mr. Steven Tilley II (Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD) for collecting cone-beam CT data on the porcine colon, and Dr. Jeffrey Siewerdsen for his insightful feedback on the manuscript. The authors also thank the staff at Maryland Advanced Computing Cluster (MARCC) for their efficient technical support and training.


  • [1] R. L. Siegel, K. D. Miller, S. A. Fedewa, D. J. Ahnen, R. G. Meester, A. Barzi, and A. Jemal, “Colorectal cancer statistics, 2017,” CA: a cancer journal for clinicians, vol. 67, no. 3, pp. 177–193, 2017.
  • [2] D. F. Ransohoff, “How much does colonoscopy reduce colon cancer mortality?” Annals of internal medicine, vol. 150, no. 1, pp. 50–52, 2009.
  • [3] N. N. Baxter, M. A. Goldwasser, L. F. Paszat, R. Saskin, D. R. Urbach, and L. Rabeneck, “Association of colonoscopy and death from colorectal cancer,” Annals of internal medicine, vol. 150, no. 1, pp. 1–8, 2009.
  • [4] A. Leufkens, M. van Oijen, F. Vleggaar, and P. Siersema, “Factors influencing the miss rate of polyps in a back-to-back colonoscopy study,” Endoscopy, vol. 44, no. 05, pp. 470–475, 2012.
  • [5] A. Pabby, R. E. Schoen, J. L. Weissfeld, R. Burt, J. W. Kikendall, P. Lance, M. Shike, E. Lanza, and A. Schatzkin, “Analysis of colorectal cancer occurrence during surveillance colonoscopy in the dietary polyp prevention trial,” Gastrointestinal endoscopy, vol. 61, no. 3, pp. 385–391, 2005.
  • [6] J. C. Van Rijn, J. B. Reitsma, J. Stoker, P. M. Bossuyt, S. J. Van Deventer, and E. Dekker, “Polyp miss rate determined by tandem colonoscopy: a systematic review,” The American journal of gastroenterology, vol. 101, no. 2, p. 343, 2006.
  • [7] D. H. Kim, P. J. Pickhardt, A. J. Taylor, W. K. Leung, T. C. Winter, J. L. Hinshaw, D. V. Gopal, M. Reichelderfer, R. H. Hsu, and P. R. Pfau, “Ct colonography versus colonoscopy for the detection of advanced neoplasia,” New England journal of medicine, vol. 357, no. 14, pp. 1403–1412, 2007.
  • [8] C. M. le Clercq, M. W. Bouwens, E. J. Rondagh, C. M. Bakker, E. T. Keulen, R. J. de Ridder, B. Winkens, A. A. Masclee, and S. Sanduleanu, “Postcolonoscopy colorectal cancers are preventable: a population-based study,” Gut, pp. gutjnl–2013, 2013.
  • [9] D. Heresbach, T. Barrioz, M. Lapalus, D. Coumaros, P. Bauret, P. Potier, D. Sautereau, C. Boustière, J. Grimaud, C. Barthélémy et al., “Miss rate for colorectal neoplastic polyps: a prospective multicenter study of back-to-back video colonoscopies,” Endoscopy, vol. 40, no. 04, pp. 284–290, 2008.
  • [10] R. M. Soetikno, T. Kaltenbach, R. V. Rouse, W. Park, A. Maheshwari, T. Sato, S. Matsui, and S. Friedland, “Prevalence of nonpolypoid (flat and depressed) colorectal neoplasms in asymptomatic and symptomatic adults,” Jama, vol. 299, no. 9, pp. 1027–1035, 2008.
  • [11] N. J. Durr, G. González, and V. Parot, “3d imaging techniques for improved colonoscopy,” 2014.
  • [12] A. Axon, M. Diebold, M. Fujino, R. Fujita, R. Genta, J. Gonvers, M. Guelrud, H. Inoue, M. Jung, H. Kashida et al., “Update on the paris classification of superficial neoplastic lesions in the digestive tract,” Endoscopy, vol. 37, no. 6, pp. 570–578, 2005.
  • [13] G. González, V. Parot, W. Lo, B. J. Vakoc, and N. J. Durr, “Feature space optimization for virtual chromoendoscopy augmented by topography,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2014, pp. 642–649.
  • [14] D. K. Iakovidis, D. E. Maroulis, S. A. Karkanis, and A. Brokos, “A comparative study of texture features for the discrimination of gastric polyps in endoscopic video,” in Computer-Based Medical Systems, 2005. Proceedings. 18th IEEE Symposium on.   IEEE, 2005, pp. 575–580.
  • [15] L. A. Alexandre, N. Nobre, and J. Casteleiro, “Color and position versus texture features for endoscopic polyp detection,” in BioMedical Engineering and Informatics, 2008. BMEI 2008. International Conference on, vol. 2.   IEEE, 2008, pp. 38–42.
  • [16] P. C. De Groen, “Advanced systems to assess colonoscopy,” Gastrointestinal endoscopy clinics of North America, vol. 20, no. 4, pp. 699–716, 2010.
  • [17] L. Cohen, L. Vinet, P. Sander, and A. Gagalowicz, “Hierarchical region based stereo matching.”   IEEE Comput. Soc. Press, 1989, pp. 416–421.
  • [18] J. Aloimonos, “Shape from texture,” Biological cybernetics, vol. 58, no. 5, pp. 345–360, 1988.
  • [19] A. Lobay and D. A. Forsyth, “Shape from Texture without Boundaries,” International Journal of Computer Vision, vol. 67, no. 1, pp. 71–91, Apr. 2006.
  • [20] D. Hong, W. Tavanapong, J. Wong, J. Oh, and P. C. De Groen, “3d Reconstruction of virtual colon structures from colonoscopy images,” Computerized Medical Imaging and Graphics, vol. 38, no. 1, pp. 22–33, 2014.
  • [21] N. J. Durr, G. González, D. Lim, G. Traverso, N. S. Nishioka, B. J. Vakoc, and V. Parot, “System for clinical photometric stereo endoscopy,” in SPIE BiOS.   International Society for Optics and Photonics, 2014, pp. 89 351F–89 351F.
  • [22] V. Parot, D. Lim, G. González, G. Traverso, N. S. Nishioka, B. J. Vakoc, and N. J. Durr, “Photometric stereo endoscopy,” Journal of biomedical optics, vol. 18, no. 7, pp. 076 017–076 017, 2013.
  • [23] R. Ranftl, V. Vineet, Q. Chen, and V. Koltun, “Dense monocular depth estimation in complex dynamic scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4058–4066.
  • [24] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1119–1127.
  • [25] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in neural information processing systems, 2014, pp. 2366–2374.
  • [26] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolutional neural fields,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 2024–2039, 2016.
  • [27] A. Saxena, S. H. Chung, and A. Y. Ng, “3-d depth reconstruction from a single still image,” International journal of computer vision, vol. 76, no. 1, pp. 53–69, 2008.
  • [28] S. Ashutosh and A. Y. Ng, “Learning depth from single monocular images,” in Advances in neural information processing systems, 2006, pp. 1161–1168.
  • [29] J. Michels, A. Saxena, and A. Y. Ng, “High speed obstacle avoidance using monocular vision and reinforcement learning,” in Proceedings of the 22nd international conference on Machine learning.   ACM, 2005, pp. 593–600.
  • [30] E. Royer, M. Lhuillier, M. Dhome, and J.-M. Lavest, “Monocular vision for mobile robot localization and autonomous navigation,” International Journal of Computer Vision, vol. 74, no. 3, pp. 237–260, 2007.
  • [31] S. Nadeem and A. Kaufman, “Computer-aided detection of polyps in optical colonoscopy images,” G. D. Tourassi and S. G. Armato, Eds., Mar. 2016, p. 978525.
  • [32] A. Reiter, S. Léonard, A. Sinha, M. Ishii, R. H. Taylor, and G. D. Hager, “Endoscopic-ct: learning-based photometric reconstruction for endoscopic sinus surgery.” in Medical Imaging: Image Processing, 2016, p. 978418.
  • [33] M. Visentini-Scarzanella, T. Sugiura, T. Kaneko, and S. Koto, “Deep monocular 3d reconstruction for assisted navigation in bronchoscopy,” International Journal of Computer Assisted Radiology and Surgery, pp. 1–11, 2017.
  • [34] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in neural information processing systems, 2014, pp. 2366–2374.
  • [35] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.
  • [36] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
  • [37] J. Lafferty, A. McCallum, F. Pereira, and others, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the eighteenth international conference on machine learning, ICML, vol. 1, 2001, pp. 282–289.
  • [38] A. Fedorov, R. Beichel, J. Kalpathy-Cramer, J. Finet, J.-C. Fillion-Robin, S. Pujol, C. Bauer, D. Jennings, F. Fennessy, M. Sonka et al., “3d slicer as an image computing platform for the quantitative imaging network,” Magnetic resonance imaging, vol. 30, no. 9, pp. 1323–1341, 2012.
  • [39] D. Nain, S. Haker, R. Kikinis, and W. E. L. Grimson, “An interactive virtual endoscopy tool.”   Georgia Institute of Technology, 2001.
  • [40] P. J. Pickhardt, J. R. Choi, I. Hwang, J. A. Butler, M. L. Puckett, H. A. Hildebrandt, R. K. Wong, P. A. Nugent, P. A. Mysliwiec, and W. R. Schindler, “Computed tomographic virtual colonoscopy to screen for colorectal neoplasia in asymptomatic adults,” New England Journal of Medicine, vol. 349, no. 23, pp. 2191–2200, 2003.
  • [41] H. Ravishankar, P. Sudhakar, R. Venkataramani, S. Thiruvenkadam, P. Annangi, N. Babu, and V. Vaidya, “Understanding the mechanisms of deep transfer learning for medical images,” in International Workshop on Large-Scale Annotation of Biomedical Data and Expert Label Synthesis.   Springer, 2016, pp. 188–196.
  • [42] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” arXiv preprint arXiv:1702.05747, 2017.
  • [43] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, “Convolutional neural networks for medical image analysis: Full training or fine tuning?” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1299–1312, 2016.
  • [44] M. H. Yap, G. Pons, J. Martí, S. Ganau, M. Sentís, R. Zwiggelaar, A. K. Davison, and R. Martí, “Automated breast ultrasound lesions detection using convolutional neural networks,” IEEE Journal of Biomedical and Health Informatics, 2017.
  • [45] R. Hess, The essential Blender: guide to 3D creation with the open source suite Blender.   No Starch Press, 2007.
  • [46] G. Hounnou, C. Destrieux, J. Desme, P. Bertrand, and S. Velut, “Anatomical study of the length of the human intestine,” Surgical and radiologic anatomy, vol. 24, no. 5, pp. 290–294, 2002.
  • [47] D. P. Mitchell and A. N. Netravali, “Reconstruction filters in computer-graphics,” ACM Siggraph Computer Graphics, vol. 22, no. 4, pp. 221–228, 1988.
  • [48] T. M. Lehmann, C. Gonner, and K. Spitzer, “Survey: Interpolation methods in medical image processing,” IEEE transactions on medical imaging, vol. 18, no. 11, pp. 1049–1075, 1999.
  • [49] A. C. Kak and M. Slaney, Principles of computerized tomographic imaging.   SIAM, 2001.
  • [50] M. Styner, C. Brechbuhler, G. Szckely, and G. Gerig, “Parametric estimate of intensity inhomogeneities applied to mri,” IEEE transactions on medical imaging, vol. 19, no. 3, pp. 153–165, 2000.
  • [51] M. Styner and G. Gerig, “Evaluation of 2d/3d bias correction with 1+ 1es-optimization,” Rapport de recherche, vol. 179, 1997.
  • [52] N. Tajbakhsh, S. R. Gurudu, and J. Liang, “Automated polyp detection in colonoscopy videos using shape and context information,” IEEE transactions on medical imaging, vol. 35, no. 2, pp. 630–644, 2016.
  • [53] G. Peyré, S. Bougleux, and L. Cohen, “Non-local regularization of inverse problems,” Computer Vision–ECCV 2008, pp. 57–68, 2008.
  • [54] Y. Liu and V. Caselles, “Exemplar-based image inpainting using multiscale graph cuts,” IEEE transactions on image processing, vol. 22, no. 5, pp. 1699–1711, 2013.
  • [55] P. J. Pickhardt and J.-H. R. Choi, “Electronic cleansing and stool tagging in ct colonography: advantages and pitfalls with primary three-dimensional evaluation,” American Journal of Roentgenology, vol. 181, no. 3, pp. 799–805, 2003.
  • [56] P. J. Pickhardt, P. A. Nugent, J. R. Choi, and W. R. Schindler, “Flat colorectal lesions in asymptomatic adults: implications for screening with ct virtual colonoscopy,” American Journal of Roentgenology, vol. 183, no. 5, pp. 1343–1347, 2004.
  • [57] J. C. D. Fidler, J. L., “Detection of flat lesions in the colon with ct colonography,” Abdominal Imaging, vol. 27, no. 3, pp. 292–300, May 2002.
  • [58] T. Qin, T.-Y. Liu, X.-D. Zhang, D.-S. Wang, and H. Li, “Global ranking using continuous conditional random fields,” in Advances in neural information processing systems, 2009, pp. 1281–1288.
  • [59] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe, “Multi-scale continuous crfs as sequential deep networks for monocular depth estimation,” arXiv preprint arXiv:1704.02157, 2017.
  • [60] V. Radosavljevic, S. Vucetic, and Z. Obradovic, “Continuous conditional random fields for regression in remote sensing.” in ECAI, 2010, pp. 809–814.
  • [61] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  • [62] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European Conference on Computer Vision.   Springer, 2014, pp. 184–199.
  • [63] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, “Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation,” Medical image analysis, vol. 36, pp. 61–78, 2017.
  • [64] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [65] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” arXiv preprint arXiv:1405.3531, 2014.
  • [66] S. Kwak, S. Hong, and B. Han, “Weakly supervised semantic segmentation using superpixel pooling network.” in AAAI, 2017, pp. 4111–4117.
  • [67] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [68] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” arXiv preprint arXiv:1612.07828, 2016.
  • [69] F. Mahmood, R. Chen, and N. J. Durr, “Unsupervised reverse domain adaption for synthetic medical images via adversarial training,” arXiv preprint arXiv:1711.06606, 2017.
  • [70] W. F. Noh and P. Woodward, “SLIC (Simple Line Interface Calculation),” in Proceedings of the Fifth International Conference on Numerical Methods in Fluid Dynamics June 28 – July 2, 1976 Twente University, Enschede, J. Ehlers, K. Hepp, H. A. Weidenmüller, J. Zittartz, W. Beiglböck, A. I. van de Vooren, and P. J. Zandbergen, Eds.   Springer Berlin Heidelberg, 1976, pp. 330–340, doi: 10.1007/3-540-08004-X_336.
  • [71] J. Silva, A. Histace, O. Romain, X. Dray, and B. Granado, “Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer,” International Journal of Computer Assisted Radiology and Surgery, vol. 9, no. 2, pp. 283–293, 2014.
  • [72] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño, “Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Computerized Medical Imaging and Graphics, vol. 43, pp. 99–111, 2015.