Towards automatic initialization of registration algorithms using simulated endoscopy images

06/28/2018 · by Ayushi Sinha, et al.

Registering images from different modalities is an active area of research in computer-aided medical interventions. Several registration algorithms have been developed, many of which achieve high accuracy. However, these results depend on many factors, including the quality of the extracted features or segmentations being registered as well as the initial alignment. Although several methods have been developed towards improving segmentation algorithms and automating the segmentation process, few automatic initialization algorithms have been explored. In many cases, the initial alignment from which a registration is initiated is performed manually, which interferes with the clinical workflow. Our aim is to use scene classification in endoscopic procedures to achieve coarse alignment of the endoscope and a preoperative image of the anatomy. In this paper, we show using simulated scenes that a neural network can predict the region of anatomy (with respect to a preoperative image) that the endoscope is located in by observing a single endoscopic video frame. With limited training and without any hyperparameter tuning, our method achieves an accuracy of 76.53 (±1.19)%, with several avenues for improvement, making this a promising direction of research. Code is available at




1 Introduction

Several surgical procedures, especially minimally invasive surgeries (MIS), require registration or alignment between a preoperative image and an intraoperative image in order to provide additional knowledge available from preoperative images [MJB13]. For MIS through the nasal cavity, these modalities are generally computed tomography (CT) scans and endoscopic video. Registration provides contextual cues to augment the limited information provided by endoscopic video. This allows surgeons to locate their endoscope and tools in relation to critical anatomical structures visible in CT scans, making the surgery safer [SKT09]. Registration also enables navigation during surgery, allowing surgeons to additionally track the endoscope and tools within the CT coordinate frame. High accuracy in these registrations is crucial because of the proximity of critical anatomical structures, like the carotid arteries, optic nerves, eyes, brain, etc., to the nasal cavity region [TMDJ99]. However, registration algorithms generally require a coarse initial alignment to bring the two modalities close enough before launching the registration, and registration accuracy is dependent on the quality of this initial alignment. Often, this initial alignment is performed manually [MDS02, HHP03, LRS16]. This can be tedious and interferes with the clinical workflow. Accurate automatic initialization of registration algorithms is one of the major components required to fully automate surgical navigation, making this an important area of research.

Many previous methods have attempted to automatically initialize registration algorithms for different applications. Methods have been presented that align features from the two modalities to a common coordinate frame before computing the final initialization [FTF08] or that align the principal modes of variation to find the initial coarse alignment [BKZB18]. In video-based registration algorithms, several methods rely on finding canonical landmarks or other features in both modalities [MLL12, RRT18]. These features can be specific to certain anatomical structures and, therefore, hard to generalize. Many of these methods require several preprocessing steps, which can be time-consuming as well as prone to noise and outliers.

We present a method that takes as input a frame of endoscopic video and directly predicts the region in which the endoscope must be located in order to generate the view seen in the video frame. Our preliminary results are on simulated data and show that our region classifier is able to identify the region that the endoscope is in with an accuracy of 76.53 (±1.19)%. We are currently working on vastly expanding the dataset and hope to extend this method to real endoscopy video.

2 Methods

We approach the problem of camera localization by trying to learn the types of views observed from cameras located in different regions of the nasal cavity. However, it is difficult to generate reliable ground truth in large numbers of in-vivo endoscopy images because the exact location of the camera is unknown. Therefore, we generate labeled data in simulation where the exact ground truth is known in order to learn the types of views that are observed from different camera poses. In the following sections, we explain how the simulated data is generated and how the camera location is learned.

2.1 Data generation

Our simulated dataset is generated in OpenGL using a textured mesh of the nasal cavity. The mesh is a mean mesh built from a publicly available dataset [BUB15, BSMP15, CVS13, FCU16] of CTs which were automatically segmented [SRL17]. This mean mesh is textured using an image generated from in-vivo nasal endoscopy video. A camera and a single light source are co-located and steered through this textured mesh environment. At each camera location, the camera is rotated slightly to observe random views. The images rendered from these views as well as the corresponding camera poses are saved. The camera poses are grouped into classes based on the region that the camera is located in (Figure 1). Examples of rendered images from each class are shown in Figure 2.
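The data-generation loop above can be sketched as follows. This is a minimal illustration, not the paper's rendering code: the region boundaries, cavity extents, and rotation range are hypothetical placeholders standing in for the regions defined in Figure 1.

```python
import numpy as np

# Hypothetical region boundaries (mm along the anterior-posterior axis);
# the actual regions in the paper are defined in Figure 1.
REGION_BOUNDS = [0.0, 20.0, 40.0, 60.0, 80.0]

def region_label(camera_pos):
    """Map a camera position to its region class (1-4) by its depth
    along the anterior-posterior (z) axis of the mean mesh."""
    z = camera_pos[2]
    for k in range(len(REGION_BOUNDS) - 1):
        if REGION_BOUNDS[k] <= z < REGION_BOUNDS[k + 1]:
            return k + 1
    raise ValueError("camera outside the labeled regions")

def sample_views(rng, n_positions=5, views_per_position=3, max_angle_deg=10.0):
    """Steer a camera through a toy cavity volume and record (pose, label)
    pairs.  Each position is perturbed by small random rotations, mimicking
    the slight rotations used to render random views at each location."""
    samples = []
    for _ in range(n_positions):
        pos = rng.uniform([0.0, 0.0, 0.0], [5.0, 5.0, 80.0])  # toy extents
        for _ in range(views_per_position):
            angles = rng.uniform(-max_angle_deg, max_angle_deg, size=3)
            samples.append({"position": pos,
                            "euler_deg": angles,
                            "label": region_label(pos)})
    return samples
```

In the paper the label comes from the known ground-truth camera pose in the simulation; the rendered image paired with this label forms one training example.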

Figure 1: The regions of the nasal cavity: images acquired from cameras located in these different regions of the nasal cavity contain different salient features which we want to learn.

2.2 Classification

We trained a convolutional neural network [LBH15] with cross-entropy loss to learn the classes associated with the rendered images. Details about our network are fully specified in our code, available at Our training dataset contained images, and we tested on a separate set of images. Since we are not performing any hyperparameter tuning, we do not have a validation set. We trained and tested our network multiple times, each time training for a fixed number of epochs with a fixed learning rate and weight decay factor. We report the average error for evaluation.
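The training setup (cross-entropy loss with weight decay, repeated runs, averaged accuracy) can be sketched with a minimal linear softmax classifier standing in for the CNN; the architecture, dataset sizes, and hyperparameters here are illustrative, not the paper's values:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_classifier(X, y, n_classes, epochs=200, lr=0.5,
                     weight_decay=1e-4, seed=0):
    """Minimal linear classifier trained with cross-entropy loss and
    weight decay; a stand-in for the CNN specified in the code release."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]               # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W + b)
        G = (P - Y) / len(X)               # gradient of the cross-entropy
        W -= lr * (X.T @ G + weight_decay * W)
        b -= lr * G.sum(axis=0)
    return W, b

def accuracy(W, b, X, y):
    return float((softmax(X @ W + b).argmax(axis=1) == y).mean())
```

As in the paper, training can be repeated with different seeds and the per-run accuracies averaged to report a mean and standard deviation.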

Figure 2: Examples of the different scenes observed from cameras located in the different regions of the nasal cavity.

3 Results and discussion

We were able to train our classifier to learn the class associated with each image (Figure 3). The images in our test dataset were classified with a mean accuracy of 76.53 (±1.19)% over multiple runs. Further, the errors in classification are almost always with neighboring classes (Figure 4). That is, images in Region 2 are sometimes misclassified as belonging to Region 1 or Region 3, but never as belonging to Region 4. This is reasonable given that images rendered from endoscopes located at the border of the specified regions (Figure 1) are likely to be similar in appearance but have different associated labels. A larger training dataset and tuning the hyperparameters can help reduce such errors.

Figure 3: Evolution of the loss function for two representative runs (above and below): the loss function converges quickly and remains stable for all runs.

These preliminary results are promising given the limited size of our current training dataset and that no hyperparameter tuning has been performed. We are currently working on expanding our training dataset in order to generalize better to different scenarios. Different images will be used to texture the mean mesh to simulate different patients and different lighting conditions will be added to simulate different endoscopes. Further, we hope to use transfer learning to extend this classification to in-vivo endoscopy video and evaluate whether these classifications are accurate enough to result in reliable final registrations.

We will continue to attempt to associate the simulated and in-vivo endoscopy images with regions in the mean mesh because we expect all nasal passages to contain some number of common salient features. If a patient-specific CT is segmented such that the patient mesh is in correspondence with the mean mesh, then the associated region in the mean mesh can be easily transferred to the patient mesh. If the two meshes are not in correspondence, then a registration between the two might also require manual initialization. However, since CT scans are available before the endoscopic procedure, this initialization and registration can be performed offline without interfering with the surgical workflow. Once a region in the patient mesh is determined, registrations between the patient CT and features extracted from patient endoscopic video can be initiated from multiple poses within the specified region. The registration with the lowest error or highest stability [LSR18] can be chosen as the final registration. If a patient CT is not available, then recent methods are able to deformably register features from patient endoscopic video to a statistical mean mesh [SLR18]. In this case, registrations can be initiated directly within the region in the mean mesh determined by our classifier. Further, with a much larger dataset, we hope to additionally be able to estimate the camera pose that would generate a rendered image and use the estimated pose to initialize the registration.
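The multi-pose initialization strategy described above (seed the registration from several poses inside the predicted region, keep the best result) can be sketched as follows; `run_registration` is a hypothetical callable standing in for the actual registration algorithm:

```python
import numpy as np

def best_initialization(seed_poses, run_registration):
    """Run the registration from each seed pose within the predicted
    region and keep the result with the lowest residual error.

    `run_registration` is a hypothetical callable mapping a seed pose to
    a (registered_pose, error) pair; in practice this would be the video-
    to-CT registration, and 'error' could instead be replaced by a
    stability score to be maximized, as in [LSR18]."""
    best_pose, best_err = None, np.inf
    for pose in seed_poses:
        reg_pose, err = run_registration(pose)
        if err < best_err:
            best_pose, best_err = reg_pose, err
    return best_pose, best_err
```

Because each seeded registration is independent, the candidate runs could also be executed in parallel offline before the procedure.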

Figure 4: Confusion matrices for two representative runs (above and below): most labels are classified correctly, and errors in classification are generally with neighboring classes.

4 Conclusions

This work shows preliminary results towards building a closed loop surgical navigation pipeline that does not require any user interaction. Our results show that it is possible to reliably classify the region that an endoscope is located in by simply observing a single simulated endoscopic image. We hope to further improve our accuracy and achieve such classifications in in-vivo endoscopic images. Assuming that features from video and preoperative images can be extracted automatically, further work in this area holds the potential for fully automated registrations and, consequently, seamless navigation during clinical endoscopic explorations as well as during endoscopic surgery.


This work was funded by NIH R01-EB015530, the Provost’s Postdoctoral Fellowship at the Johns Hopkins University, fellowship support from Intuitive Surgical, Inc., and the Johns Hopkins University internal funds.


  • [BKZB18] Bricq S., Kidane H. L., Zavala-Bojorquez J., Oudot A., Vrigneaud J.-M., Brunotte F., Walker P. M., Cochet A., Lalande A.: Automatic deformable pet/mri registration for preclinical studies based on b-splines and non-linear intensity transformation. Medical & Biological Engineering & Computing (Feb 2018). doi:10.1007/s11517-018-1797-0.
  • [BSMP15] Bosch W. R., Straube W. L., Matthews J. W., Purdy J. A.: Data from head-neck_cetuximab. The Cancer Imaging Archive., 2015. doi:10.7937/K9/TCIA.2015.7AKGJUPZ.
  • [BUB15] Beichel R. R., Ulrich E. J., Bauer C., Wahle A., Brown B., Chang T., Plichta K. A., Smith B. J., Sunderland J. J., Braun T., Fedorov A., Clunie D., Onken M., Riesmeier J., Pieper S., Kikinis R., Graham M. M., Casavant T. L., Sonka M., Buatti J. M.: Data from QIN-HEADNECK. The Cancer Imaging Archive., 2015. doi:10.7937/K9/TCIA.2015.K0F5CGLI.
  • [CVS13] Clark K., Vendt B., Smith K., Freymann J., Kirby J., Koppel P., Moore S., Phillips S., Maffitt D., Pringle M., Tarbox L., Prior F.: The cancer imaging archive (TCIA): Maintaining and operating a public information repository. Journal of Digital Imaging 26, 6 (2013), 1045–1057. doi:10.1007/s10278-013-9622-7.
  • [FCU16] Fedorov A., Clunie D., Ulrich E., Bauer C., Wahle A., Brown B., Onken M., Riesmeier J., Pieper S., Kikinis R., Buatti J., Beichel R. R.: DICOM for quantitative imaging biomarker development: a standards based approach to sharing clinical data and structured PET/CT analysis results in head and neck cancer research. PeerJ 4 (May 2016), e2057. doi:10.7717/peerj.2057.
  • [FTF08] Foroughi P., Taylor R. H., Fichtinger G.: Automatic initialization for 3d bone registration. In Proc.SPIE (2008), vol. 6918, pp. 6918 – 6918 – 8. doi:10.1117/12.772632.
  • [HHP03] Higgins W. E., Helferty J. P., Padfield D. R.: Integrated bronchoscopic video tracking and 3d ct registration for virtual bronchoscopy. In Proc.SPIE (2003), vol. 5031, pp. 5031 – 5031 – 10. doi:10.1117/12.483825.
  • [LBH15] LeCun Y., Bengio Y., Hinton G.: Deep learning. Nature 521 (May 2015), 436–444.
  • [LRS16] Leonard S., Reiter A., Sinha A., Ishii M., Taylor R. H., Hager G. D.: Image-based navigation for functional endoscopic sinus surgery using structure from motion. In Proc. SPIE (2016), vol. 9784, pp. 97840V–97840V–7. doi:10.1117/12.2217279.
  • [LSR18] Leonard S., Sinha A., Reiter A., Ishii M., Gallia G. L., Taylor R. H., Hager G. D.: Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in-vivo clinical data. IEEE Transactions on Medical Imaging (2018). doi:10.1109/TMI.2018.2833868.
  • [MDS02] Mori K., Deguchi D., Sugiyama J., Suenaga Y., Toriwaki J., Maurer C., Takabatake H., Natori H.: Tracking of a bronchoscope using epipolar geometry analysis and intensity-based image registration of real and virtual endoscopic images. Medical Image Analysis 6, 3 (2002), 321–336. Special Issue on Medical Image Computing and Computer-Assisted Intervention - MICCAI 2001.
  • [MJB13] Mezger U., Jendrewski C., Bartels M.: Navigation in surgery. Langenbecks Arch Surg 398, 4 (Apr 2013), 501–514. doi:10.1007/s00423-013-1059-4.
  • [MLL12] Miao S., Lucas J., Liao R.: Automatic pose initialization for accurate 2d/3d registration applied to abdominal aortic aneurysm endovascular repair. In Proc.SPIE (2012), vol. 8316, pp. 8316 – 8316 – 8. doi:10.1117/12.911495.
  • [RRT18] Robu M. R., Ramalhinho J., Thompson S., Gurusamy K., Davidson B., Hawkes D., Stoyanov D., Clarkson M. J.: Global rigid registration of ct to video in laparoscopic liver surgery. International Journal of Computer Assisted Radiology and Surgery 13, 6 (Jun 2018), 947–956. doi:10.1007/s11548-018-1781-z.
  • [SKT09] Senior B. A., Kennedy D. W., Tanabodee J., Kroger H., Hassab M., Lanza D.: Long-term results of functional endoscopic sinus surgery. The Laryngoscope 108, 2 (2009), 151–157. doi:10.1097/00005537-199802000-00001.
  • [SLR18] Sinha A., Liu X., Reiter A., Ishii M., Hager G. D., Taylor R. H.: Endoscopic navigation in the absence of ct imaging. arXiv:1806.03997 (2018).
  • [SRL17] Sinha A., Reiter A., Leonard S., Ishii M., Hager G. D., Taylor R. H.: Simultaneous segmentation and correspondence improvement using statistical modes. In Proc. SPIE (2017), vol. 10133, pp. 101331B–101331B–8. doi:10.1117/12.2253533.
  • [TMDJ99] Tao H., Ma Z., Dai P., Jiang L.: Computer-aided three-dimensional reconstruction and measurement of the optic canal and intracanalicular structures. The Laryngoscope 109, 9 (1999), 1499–1502. doi:10.1097/00005537-199909000-00026.