
Deep EndoVO: A Recurrent Convolutional Neural Network (RCNN) based Visual Odometry Approach for Endoscopic Capsule Robots

Ingestible wireless capsule endoscopy is an emerging minimally invasive diagnostic technology for inspection of the GI tract and diagnosis of a wide range of diseases and pathologies. Medical device companies and many research groups have recently made substantial progress in converting passive capsule endoscopes to active capsule robots, enabling more accurate, precise, and intuitive detection of the location and size of the diseased areas. Since reliable real-time pose estimation is crucial for actively controlled endoscopic capsule robots, in this study we propose a monocular visual odometry (VO) method for endoscopic capsule robot operations. Our method relies on deep Recurrent Convolutional Neural Networks (RCNNs) for the visual odometry task, where Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are used for feature extraction and for inference of dynamics across the frames, respectively. Detailed analyses and evaluations on a real pig stomach dataset show that our system achieves high translational and rotational accuracies for different types of endoscopic capsule robot trajectories.



1 Introduction

Following the advances in material science in the last decades, untethered, pill-size, swallowable capsule endoscopes with an on-board camera and a wireless image transmission device have been developed and used in hospitals for screening the gastrointestinal tract and diagnosing diseases such as inflammatory bowel disease, ulcerative colitis and colorectal cancer. Unlike standard endoscopy, endoscopic capsule robots are non-invasive, painless and better suited to long-duration screening. Moreover, they can access body parts that standard endoscopy could not reach before (e.g., the small intestine). Such advantages make pill-size capsule endoscopes a significant alternative to standard endoscopy for screening (liao2010indications, ; nakamura2008capsule, ; pan2011swallowable, ; than2012review, ; sitti2015biomedical, ). However, current capsule endoscopes used in hospitals are passive devices moved by the peristaltic motions of the inner organs. Control over the capsule’s position, orientation, and functions would let the doctor reach targeted body parts more precisely and diagnose more intuitively and accurately (turan2017non, ; turan2017deep, ; turan2017six, ; turan2017fully, ; turan2017sparse, ). Therefore, several groups have recently proposed active, remotely controllable robotic capsule endoscope prototypes equipped with additional functionalities such as local drug delivery, biopsy and other medical functions (goenka2014capsule, ; nakamura2008capsule, ; munoz2014review, ; carpi2011magnetically, ; keller2012method, ; mahoney2013managing, ; yim2014biopsy, ; petruska2013omnidirectional, ; son20165, ; son2017magnetically, ). However, active motion control needs feedback from precise and reliable real-time pose estimation.
In the last decade, several localization methods (than2012review, ; fluckiger2007ultrasound, ; rubin2006sonographic, ; kim2008noninvasive, ; yim20133, ) were proposed to calculate the 3D position and orientation of the endoscopic capsule robot, such as fluoroscopy (than2012review, ), ultrasonic imaging (fluckiger2007ultrasound, ; rubin2006sonographic, ; kim2008noninvasive, ; yim20133, ), positron emission tomography (PET) (than2012review, ; yim20133, ), magnetic resonance imaging (MRI) (than2012review, ), radio transmitter based techniques and magnetic field based techniques (yim2014biopsy, ). The common drawback of these localization methods is that they require extra sensors and hardware design. Such extra sensors have their own deficiencies and limitations when applied to small-scale medical devices, such as space limitations, cost, design incompatibilities, biocompatibility issues and interference of the sensors with the actuation system of the device.

As a solution to these issues, visual odometry methods have attracted attention for the localization of such small-scale medical devices. A classic visual odometry pipeline, typically consisting of camera calibration, feature detection, feature matching, outlier rejection (e.g., RANSAC), motion estimation, scale estimation and global optimization (bundle adjustment), is depicted in Fig. 1.


Figure 1: Traditional visual odometry pipeline.

Although some state-of-the-art algorithms based on this traditional pipeline have been applied to the visual odometry task of hand-held endoscopes in the past decades, their main deficiency is tracking failure in low-textured areas. In recent years, deep learning (DL) techniques have come to dominate many computer vision tasks with promising results, e.g., object detection, object recognition and classification. In contrast to these high-level computer vision tasks, VO mainly deals with motion dynamics and relations across a sequence of images, which can be defined as a sequential learning problem. With that motivation, we propose a novel monocular VO algorithm based on deep Recurrent Convolutional Neural Networks (RCNNs). Since it is designed in an end-to-end fashion, it does not need any module from the classic VO pipeline. The main contributions of our paper are as follows:

  • To the best of our knowledge, this is the first monocular VO approach through deep learning techniques developed for the endoscopic capsule robot and hand-held standard endoscope localization.

  • Neither prior knowledge nor parameter tuning is needed to recover the absolute trajectory scale, contrary to the traditional monocular VO approach.

  • A novel RCNN architecture is introduced which can successfully model sequential dependence and complex motion dynamics across endoscopic video frames.

  • A real pig stomach dataset and a synthetic human simulator dataset with 6-DoF ground truth pose labels and 3D scans were recorded, which we are considering publishing for the benefit of other researchers in the area.

The proposed method addresses several issues faced by typical visual odometry pipelines, e.g., the need to establish frame-to-frame feature correspondences, vignetting, motion blur, specularity and low signal-to-noise ratio (SNR). We think that a DL based endoscopic VO approach is more suitable for such challenging areas, since the operation environment (GI tract) has similar organ tissue patterns among different patients, which can be learned by a sophisticated machine learning approach. Even the dynamics of common artefacts such as vignetting, motion blur and specularity across frame sequences could be learned and used for better pose estimation.

The remainder of this paper is organized as follows. Section 2 introduces the proposed RCNN based localization method in detail. Section 3 presents our dataset and the experimental setup. Section 4 shows the experimental results we achieved for 6-DoF localization of the endoscopic capsule robot. Section 5 gives future directions.

Figure 2: Experimental overview.

2 System Overview and Analysis

Our architecture makes use of inception modules for feature extraction and an RNN for sequential modelling of motion dynamics to regress the robot’s orientation and position in real time (5.3 ms per frame). It takes two consecutive endoscopic RGB Depth frames, each with a timestamp, and regresses the 6-DoF pose of the robot without the need for any extra sensor. To create the depth images from the RGB input images, we used the shape from shading (SfS) technique of Tsai and Shah, which is based on the following assumptions (PingSing1994, ):

  • The object surface is Lambertian.

  • The light comes from a single point light source.

  • The surface has no self-shaded areas.

For more details of the Tsai-Shah SfS method, the reader is referred to the original paper of the authors. In the past couple of years, some powerful CNN architectures, such as GoogLeNet (szegedy2015going, ), VGG16 (Simonyan14c, ) and ResNet50 (he2016deep, ), have been developed and evaluated for various high-level computer vision tasks, e.g., object detection, object recognition and classification (szegedy2015going, ), (sunderhauf2015performance, ), (russakovsky2015imagenet, ), (kendall2015posenet, ). One major drawback of CNN architectures is that they only analyse just-in-moment information, whereas VO depends on the correlative information across frames. Unlike traditional feed-forward artificial neural networks, an RCNN can use its internal memory to process arbitrarily long sequences through the directed cycles between its hidden units. Therefore, we think that RCNN architectures are more suitable than CNN architectures for VO tasks. The proposed deep EndoVO (endoscopic visual odometry) approach works as follows:

Algorithm 1 Deep EndoVO
1: Take two consecutive input RGB images.
2: Create the depth images from the RGB images using the Tsai-Shah SfS method.
3: Subtract the mean RGB Depth value of the training set from the RGB Depth images.
4: Stack the preprocessed RGB Depth frame pair to form a tensor.
5: Feed the tensor into the stack of inception modules to create the feature vector.
6: Feed the feature representation into the RNN layers.
7: Estimate the 6-DoF relative pose.
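The mean-subtraction and stacking steps of Algorithm 1 can be illustrated with a small sketch in plain Python. It is only an illustration under stated assumptions: frames are represented as nested H x W x 4 lists (R, G, B, depth), the mean is taken per channel, and the pair is stacked along the channel axis. The text does not specify the stacking axis, and the function names `subtract_mean` and `stack_pair` are hypothetical.

```python
def subtract_mean(frame, mean):
    """Subtract a per-channel mean (R, G, B, depth) from an H x W x 4 frame."""
    return [[[value - m for value, m in zip(pixel, mean)] for pixel in row]
            for row in frame]


def stack_pair(frame_prev, frame_curr, mean):
    """Mean-subtract both frames, then concatenate them along the channel axis.

    Stacking along channels is an assumption; the result is H x W x 8.
    """
    prev_centered = subtract_mean(frame_prev, mean)
    curr_centered = subtract_mean(frame_curr, mean)
    return [[p_pixel + c_pixel for p_pixel, c_pixel in zip(p_row, c_row)]
            for p_row, c_row in zip(prev_centered, curr_centered)]


# Two tiny 1 x 2 RGB Depth frames; the values are made up for illustration.
frame_a = [[[10.0, 20.0, 30.0, 1.0], [12.0, 22.0, 32.0, 2.0]]]
frame_b = [[[11.0, 21.0, 31.0, 1.5], [13.0, 23.0, 33.0, 2.5]]]
train_mean = [10.0, 20.0, 30.0, 1.0]

tensor = stack_pair(frame_a, frame_b, train_mean)
```

In a real pipeline the frames would be array types and the resulting tensor would be passed to the inception modules; the nested lists here only keep the sketch dependency-free.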
(a) Information flow through the hidden units of the LSTM (gers1999learning, ).
(b) Inception layer(szegedy2015going, )
Figure 3: The structure of the LSTM and inception layers of the proposed model is shown.

The proposed DL network consists of three inception layers and two LSTM layers concatenated sequentially. The inception layers, imitating the human visual cortex, extract multi-level features, i.e., features of different sizes such as small details, middle-size or larger features (see Fig. 3(b)). The final inception layer passes the feature representation into the RNN modules (see Fig. 3(a)). RNNs are well suited to modelling the dependencies across image sequences and to creating a temporal motion model, since an RNN keeps a memory of hidden states over time and has directed cycles among hidden units, enabling the current hidden state to be a function of arbitrary sequences of inputs (see Fig. 3(a)). Thus, using an RNN, the pose estimation of the current frame benefits from information encapsulated in previous frames (walch2016image, ; wang2017deepvo, ). Given a set of inception features $x_t$ at time $t$, the RNN updates at time step $t$ by

$$h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y,$$

where $h_t$ and $y_t$ are the hidden state and the output at time $t$, the $W$ terms denote corresponding weight matrices of the hidden units, the $b$ terms the bias vectors, and $\mathcal{H}$ an element-wise hyperbolic tangent based activation function.

Long Short-Term Memory (LSTM) is more suitable than a plain RNN for exploiting longer trajectories, since it avoids the vanishing gradient problem of the RNN. It gains a higher capacity for learning long-term relations among the sequences by introducing memory gates such as input, forget and output gates and hidden units of several blocks. The input gate controls the amount of new information flowing into the current state, the forget gate adjusts the amount of existing information that remains in the memory, and the output gate decides which part of the information triggers the activations. The folded LSTM and its unfolded version over time are shown in Fig. 3(a) along with the internal structure of an LSTM memory cell. It can be seen that unfolded LSTMs correspond to timestamps. Given the input vector $x_t$ at time $t$, the output vector $h_{t-1}$ and the cell state vector $c_{t-1}$ of the previous LSTM unit, the LSTM updates at time step $t$ according to the following equations, where $\sigma$ is the sigmoid non-linearity, $\tanh$ is the hyperbolic tangent non-linearity, $W$ terms denote corresponding weight matrices, $b$ terms denote bias vectors, $\odot$ is the element-wise product, and $i_t$, $f_t$, $g_t$, $c_t$ and $o_t$ are the input gate, forget gate, input modulation gate, cell state and output gate at time $t$, respectively (gers1999learning, ):

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$g_t = \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$
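The gate updates described above (input, forget, input modulation and output gates acting on the cell state) can be sketched for a single time step in plain Python. This is a minimal illustration of a standard LSTM cell, not the authors' Caffe implementation; the parameter layout `params[gate] = (W_x, W_h, b)` and the helper names are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, x, W_h, h, b):
    """Compute W_x x + W_h h + b for list-based matrices and vectors."""
    return [sum(wx * xi for wx, xi in zip(row_x, x))
            + sum(wh * hi for wh, hi in zip(row_h, h)) + bias
            for row_x, row_h, bias in zip(W_x, W_h, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update; params maps a gate name to (W_x, W_h, b)."""
    def gate(name, activation):
        W_x, W_h, b = params[name]
        return [activation(z) for z in affine(W_x, x, W_h, h_prev, b)]

    i = gate("i", sigmoid)    # input gate: how much new information enters
    f = gate("f", sigmoid)    # forget gate: how much old memory is kept
    g = gate("g", math.tanh)  # input modulation gate: candidate values
    o = gate("o", sigmoid)    # output gate: which part of the state is exposed
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# Toy cell with 2 inputs and 1 hidden unit; all weights are zero, so every
# sigmoid gate evaluates to 0.5 and the candidate g evaluates to 0.
zero_params = {name: ([[0.0, 0.0]], [[0.0]], [0.0]) for name in "ifgo"}
h_new, c_new = lstm_step([1.0, -1.0], [0.0], [2.0], zero_params)
```

With zero weights the forget gate halves the previous cell state, which makes the role of each gate easy to check by hand before moving to trained weights.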

Although the LSTM alleviates the vanishing gradient problem of the RNN and is capable of detecting long-term dependencies, its learning capacity can be increased further by stacking multiple LSTM layers vertically. Thus, our deep RNN consists of two LSTM layers, with the output sequence of the first one forming the input sequence of the second one, each containing hidden units, as illustrated in Fig. 4. The proposed system, which learns translational and rotational motions simultaneously to regress the 6-DoF pose, is trained on a Euclidean loss using the Adam optimization method with the following objective loss function:


$$\mathrm{loss}(I) = \lVert \hat{x} - x \rVert_2 + \beta \, \lVert \hat{q} - q \rVert_2,$$

where $x$ is the translation vector and $q$ is the rotation vector, hatted quantities denote the network predictions, and $\beta$ is the balance factor between the two terms. The pseudo-code to calculate the loss value is given in Algorithm 2. In our loss function, a balance must be kept between the orientation and translation loss values, which are highly coupled to each other as they are learned from the same model weights. Experimental results show that the optimal $\beta$ is given by the ratio between the loss values of predicted positions and orientations at the end of the training session (kendall2015posenet, ).

Figure 4: Architecture of the proposed RCNN based monocular VO system.
Algorithm 2: Pseudo-code to calculate the loss over the network (procedure CalculateLoss).
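Since the loss is described as a Euclidean loss that balances translation and orientation errors, a hedged sketch of such a computation is given here in Python. It assumes a PoseNet-style weighted sum of two Euclidean norms averaged over a batch; the weight name `beta`, the function names, and the batch layout are hypothetical and are not taken from the original Algorithm 2.

```python
import math


def euclidean_distance(a, b):
    """Euclidean (L2) distance between two equally sized vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def pose_loss(pred_t, true_t, pred_r, true_r, beta):
    """Translation error plus beta-weighted rotation error for one sample."""
    return (euclidean_distance(pred_t, true_t)
            + beta * euclidean_distance(pred_r, true_r))


def batch_loss(predictions, targets, beta):
    """Average the per-sample pose loss over (translation, rotation) pairs."""
    total = 0.0
    for (pred_t, pred_r), (true_t, true_r) in zip(predictions, targets):
        total += pose_loss(pred_t, true_t, pred_r, true_r, beta)
    return total / len(predictions)


# One sample: translation error 5 (a 3-4-5 triangle), rotation error 1.
predictions = [([3.0, 4.0, 0.0], [0.0, 0.0, 1.0])]
targets = [([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])]
loss_value = batch_loss(predictions, targets, beta=2.0)
```

A larger `beta` pushes training to favour orientation accuracy, which is why the text ties its value to the ratio of the two loss terms.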

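The back-propagation algorithm is used to calculate the gradients of the RCNN weights, which are passed to the Adam optimization method to compute adaptive learning rates for each parameter, employing first-order gradient-based optimization of the stochastic objective function. In addition to storing an exponentially decaying average of past squared gradients, $v_t$, Adam keeps an exponentially decaying average of past gradients, $m_t$, which is similar to momentum. The update equations are given as

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t,$$

where $g_t$ is the gradient at step $t$, $\theta_t$ are the parameters and $\eta$ is the learning rate. We used the default values proposed by (kingma2014adam, ) for the parameters $\beta_1$, $\beta_2$ and $\epsilon$: $0.9$, $0.999$ and $10^{-8}$.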
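The Adam update described above, with a decaying average of past gradients and of past squared gradients, can be sketched as a single step over a list of scalar parameters. The decay rates and epsilon below are the defaults published by Kingma and Ba; the learning rate `lr=0.001` is only a placeholder, since the value used for training is not given here, and the function name `adam_step` is hypothetical.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step counter."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_k, v_k in zip(theta, grad, m, v):
        m_k = beta1 * m_k + (1.0 - beta1) * g      # average of past gradients
        v_k = beta2 * v_k + (1.0 - beta2) * g * g  # average of squared gradients
        m_hat = m_k / (1.0 - beta1 ** t)           # bias correction
        v_hat = v_k / (1.0 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_k)
        new_v.append(v_k)
    return new_theta, new_m, new_v


# First step on f(theta) = theta^2 at theta = 1, where the gradient is 2.
theta, m, v = adam_step([1.0], [2.0], [0.0], [0.0], t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient magnitude.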

3 Dataset

This section demonstrates the experimental setup of the proposed study, introduces our magnetically actuated soft capsule endoscopes (MASCE) and explains how the training and testing datasets were recorded.

3.1 Magnetically Actuated Soft Capsule Endoscopes (MASCE)

Our capsule prototype is a magnetically actuated soft capsule endoscope (MASCE) designed for disease detection, drug delivery and biopsy operations in the upper gastrointestinal tract. The prototype is composed of an RGB camera, a permanent magnet, a fine needle and a drug chamber (see Fig. 5 for visual reference). The magnet exerts magnetic force and torque on the robot in response to a controlled external magnetic field (son2017magnetically, ). The magnetic torque and forces are used to actuate the capsule robot, to release the drug and to deliver the needle through the hole in the bottom of the capsule. Magnetic fields from the electromagnets generate the magnetic force and torque on the magnet inside the MASCE so that the robot moves inside the workspace. Sixty-four three-axis magnetic sensors are placed on the top, and nine electromagnets are placed at the bottom (son2017magnetically, ).

(a) Actuation system of the MASCE (son2017magnetically, )
(b) Exterior (left) and section view (right) of MASCE (son2017magnetically, )
Figure 5: MASCE design features and actuation unit

3.2 Training dataset

We created two groups of training datasets. The first training dataset was recorded on five different real pig stomachs (see Fig. 2), whereas the second dataset, which was used only for training purposes, was captured using a non-rigid open GI tract model EGD (esophagus gastro duodenoscopy) surgical simulator LM-103 (see Fig. 2). To ensure that our algorithm is not tuned to a specific camera model, four different commercial endoscopic cameras were employed, the specifications of which are shown in Table 1. For each pig stomach-camera combination, frames were acquired which makes for four cameras and five pig stomachs frames, in total. Sample real pig stomach frames are shown in Fig. 6(a) for visual reference. As a second training dataset, for each of four cameras, we captured frames on an EGD human stomach simulator making frames, in total. Sample synthetic training frames are shown in Fig. 6(b) for visual reference. During video recording, an Optitrack motion tracking system consisting of eight Prime-13 cameras and tracking software was used to obtain 6-DoF localization ground truth data with sub-millimeter precision (see Fig. 2), which served as the gold standard for evaluating the pose estimation accuracy.

Camera                                 Resolution         Size                             Pixel size     Frame rate
(a) Awaiba Naneye Endoscopic Camera    250 x 250 pixel    2.2 x 1.0 x 1.7 mm (footprint)   3 x 3          44 fps
(b) Misumi-V3506-2ES camera            400 x 400 pixel    8.2 mm (diameter)                5.55 x 5.55    30 fps
(c) Misumi-V5506-2ES camera            640 x 480 pixel    8.6 mm (diameter)                6.0 x 6.0      30 fps
(d) Potensic Mini Camera               1280 x 720 pixel   8.8 mm (diameter)                10.0 x 10.0    30 fps
Table 1: Endoscopic camera specifications used for the experiments.

3.3 Testing dataset

We created a testing dataset recorded using five different real pig stomachs, which were not used in the training session. For each pig stomach-camera combination, frames are acquired making frames, in total. We did not capture any synthetic dataset for the testing session, since it is less realistic due to the obvious patterns of such artificial simulators. For all of the video recordings, the Optitrack motion tracking system was again used to obtain the 6-DoF localization ground truth.

(a) Sample frames recorded on a real pig stomach
(b) Sample frames recorded on EGD simulator
Figure 6: Sample frames from the datasets used in the experiments.

4 Evaluations and Results

(a) Change in the loss values for a good fitting
(b) Change in the loss values for overfitting
Figure 7: The decrease in the training and validation loss values. In overfitting case, the training loss gets smaller than the validation loss. However, the loss values are balanced for a good fit.
(a) Trajectory 1
(b) Trajectory 2
(c) Trajectory 3
(d) Trajectory 4
Figure 8: Sample ground truth trajectories and estimated trajectories predicted by the DL based VO models. As seen, deep EndoVO is the closest to the ground truth trajectories. The scale is calculated and maintained correctly by the models.
(a) Trajectory length vs translation error
(b) Trajectory length vs rotation error
Figure 9: Deep EndoVO outperforms both of the other models in terms of translational and rotational position estimation.

The architecture was trained using the Caffe library on an NVIDIA Tesla K40 GPU. Using the back-propagation-through-time method, the weights of hidden units were trained for up to epochs with an initial learning rate of . Overfitting means that noise or random fluctuations in the training data are picked up and learned as concepts by the model; these concepts do not apply to new data and reduce the ability of the model to generalize. It was prevented using dropout and early stopping techniques (see Fig. 10). The dropout regularization technique introduced by (srivastava2014dropout, ) is an extremely effective and simple method to avoid overfitting. It samples a part of the whole network and updates its parameters based on the input data. Early stopping is another widely used technique to prevent overfitting of a complex neural network architecture optimized by a gradient-based method. The approach splits the dataset into a training and a validation set to evaluate the generalization capability of the model.
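The early stopping rule described above can be sketched as a small Python function that scans per-epoch validation losses and reports which epoch's weights would be kept. The `patience` parameter (how many epochs without improvement are tolerated) and the function name are assumptions for illustration; the stopping criterion actually used in training is not specified in the text.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch with the best validation loss seen
    before training would be stopped."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember this epoch's weights.
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_epoch


# Made-up validation curve that starts to rise after epoch 2.
val_curve = [1.0, 0.8, 0.7, 0.75, 0.9, 0.6]
kept_epoch = early_stopping_epoch(val_curve, patience=2)
```

With `patience=2` the scan stops at epoch 4 and keeps epoch 2, so the later improvement at epoch 5 is never seen; a larger patience trades training time for a lower risk of stopping too early.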

(a) Training good fitting
(b) Training overfitting
(c) Test good fitting
(d) Test overfitting
Figure 10: The effect of good fitting and overfitting. The first and the second rows show over-fitted and well-fitted models, respectively. As seen in the subfigures, the over-fitted model learns the details and noise in the training data to such an extent that it negatively impacts the performance of the model on the test data.

For the testing sessions, only real pig stomach recordings were used to ensure real-world conditions. Additionally, we strictly avoided using any frame from the training session in the testing session. Two separate experiments were conducted: the training session of the first experiment used only the synthetic training dataset (see Fig. 6(b)), which we call simEndoVO, and the training session of the second experiment used frames from both the synthetic and the real pig stomach datasets (see Fig. 6(b) and 6(a)), which we call realEndoVO. The performance of the simEndoVO and realEndoVO approaches was analysed using averaged Root Mean Square Errors (RMSEs) for translational and rotational motions. For various trajectories with different complexity levels of motion, including uncomplicated paths with slow incremental translations and rotations, comprehensive scans with many local loop closures, and complex paths with sharp rotational and translational movements, we tested both simEndoVO and realEndoVO and compared them with the GoogLeNet and ResNet50 architectures, which were modified to regress 6-DoF pose values by removing the softmax layer and integrating a fully-connected (FC) layer and an affine regressor layer. The average translational and rotational RMSEs for the simEndoVO, realEndoVO, GoogLeNet and ResNet50 networks against different path lengths are shown in Fig. 9. The results indicate that realEndoVO clearly outperforms GoogLeNet and ResNet50, whereas simEndoVO slightly outperforms them. We presume that the effective use of LSTM in the EndoVO architecture enabled learning motion dynamics across frame sequences, which is not feasible for architectures that process only just-in-moment information, i.e., GoogLeNet and ResNet50. The results in Fig. 9 also indicate that training with both the simulator and the real dataset was more informative than training only with the simulator dataset. On the other hand, the accuracies achieved by the modified GoogLeNet are slightly better than those achieved by the modified ResNet50, suggesting an advantage of inception layers over residual networks for feature extraction in this task. Judging from the calculated RMSEs, the rotational motion parameters seem to be more prone to overfitting than the translational motion parameters (see Fig. 10 for visual reference). The reason for that observation could be that inner organ scanning procedures generally contain more translational than rotational motion, resulting in better learning for translations. As the length of the trajectory increases, both the translational and rotational errors of all the evaluated models decrease significantly (see Fig. 9). Some sample ground truth and estimated trajectories for realEndoVO, GoogLeNet and ResNet50 are shown in Fig. 8 for visual reference. As seen in these sample trajectories, realEndoVO is able to stay close to the ground truth pose values even for sharp motions, in contrast to the GoogLeNet and ResNet50 path estimations, which deviate drastically from the ground truth path values. Even for very fast and challenging paths such as Fig. 8(a) and 8(c), the deviations of realEndoVO from the ground truth still remain in an acceptable range for medical operations.
In addition, all three evaluated neural network architectures are able to estimate the scale very accurately without using any prior information or post-alignment techniques, contrary to traditional VO. Solving the scale ambiguity of monocular camera based VO makes our proposed DL based method more beneficial than the traditional VO approach. As opposed to the traditional VO pipeline (see Fig. 1), DL-based VO does not require any explicit feature extraction, matching, outlier detection or multi-scale bundle adjustment-like operations that need parameter tuning, which can be seen as further benefits of the proposed approach.
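The translational RMSE used in the evaluation above can be illustrated with a short Python sketch. It assumes the error is the root of the mean squared Euclidean distance between estimated and ground-truth positions along a trajectory; how the errors were averaged across trajectories is not specified in the text, and the function name `rmse` is hypothetical.

```python
import math


def rmse(estimated, ground_truth):
    """Root mean square error between two equally long lists of vectors."""
    squared_errors = [sum((e - g) ** 2 for e, g in zip(est, gt))
                      for est, gt in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two made-up positions: the first matches, the second is off by 5 units.
estimated_positions = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
true_positions = [[0.0, 0.0, 0.0], [1.0, 3.0, 4.0]]
translation_rmse = rmse(estimated_positions, true_positions)
```

The same function could be applied to rotation vectors to obtain a rotational RMSE, provided both trajectories use the same rotation parametrization.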

4.1 Comparisons of deep EndoVO with state-of-the-art SLAM methods

In this subsection, we compare the performance of the proposed deep EndoVO with two widely used state-of-the-art SLAM methods, i.e., large-scale direct monocular SLAM (LSD SLAM) (engel2014lsd, ) and oriented FAST and rotated BRIEF SLAM (ORB SLAM) (mur2015orb, ). LSD SLAM is a direct image alignment-based method which optimizes the geometry using all of the image intensities. In addition to higher accuracy and robustness, particularly in environments with few key points, this provides substantially more information about the geometry of the environment, which can be very valuable for medical robot applications as well. ORB SLAM, on the other hand, relies on feature point extraction and tracking to estimate the camera pose and build a 3D map of the environment. Even though it gives very promising results for feature-rich areas, its main deficiency appears once the robot enters poorly featured areas. Tracking failures are commonly observed for poorly featured GI tract tissues, making ORB SLAM less suitable for our case. We believe that our deep EndoVO architecture makes effective use of both direct and feature point information to estimate the pose. The average translational and rotational RMSEs for simEndoVO, realEndoVO, LSD SLAM and ORB SLAM, shown in Fig. 11, indicate that both simEndoVO and realEndoVO clearly outperform LSD SLAM and ORB SLAM in terms of pose accuracy. The sample trajectory estimations shown in Fig. 12 illustrate that the tracking capability of the proposed deep EndoVO is much more robust and reliable than that of LSD SLAM and ORB SLAM. In many parts of the trajectories, ORB SLAM and LSD SLAM deviate drastically from the ground truth trajectory, whereas deep EndoVO is still able to stay close to the ground truth values even for the most challenging trajectory sections (see Fig. 12(b), 12(c)).

(a) Trajectory length vs translation error
(b) Trajectory length vs rotation error
Figure 11: Deep EndoVO outperforms the state-of-the-art SLAM methods ORB SLAM and LSD SLAM in both the translation and orientation estimation.
(a) Trajectory 1
(b) Trajectory 2
(c) Trajectory 3
(d) Trajectory 4
Figure 12: The ground truth and the trajectory plots acquired via deep EndoVO, LSD SLAM and ORB SLAM. Deep EndoVO is the closest to the ground truth trajectories compared to the state-of-the-art SLAM methods.

5 Conclusion

In this study, we presented, to the best of our knowledge, the first deep VO method for endoscopic capsule robot and standard hand-held endoscope operations. The proposed system achieves simultaneous representation learning and sequential modelling of motion dynamics across frames by concatenating the inception modules with RNN layers. Many issues faced by traditional VO techniques, such as feature correspondence establishment in low-textured areas, high reflections, motion blur and low image quality, are handled successfully by the proposed deep EndoVO. Since it is trained in an end-to-end manner, there is no need to carefully fine-tune the parameters of the system. As a future step, we plan to combine deep EndoVO with some functionalities from the traditional VO pipelines, such as RANSAC for outlier detection and bundle fusion for globally consistent pose estimation, to avoid drift. Moreover, we plan to develop a stereo version of the proposed deep EndoVO approach.