Automatic 3D Point Set Reconstruction from Stereo Laparoscopic Images using Deep Neural Networks

07/31/2016 ∙ by Balint Antal, et al. ∙ University of Debrecen (UD)

In this paper, an automatic approach to predict 3D coordinates from stereo laparoscopic images is presented. The approach maps a vector of pixel intensities to 3D coordinates by training a six-layer deep neural network. The architectural aspects of the approach are presented in detail, and the method is evaluated on a publicly available dataset with promising results.




1 Introduction

Minimally invasive surgery (MIS) has become a widespread technique for gaining surgical access to the abdomen of patients without causing major damage to the skin or tissues. Since MIS-supporting techniques such as laparoscopy or endoscopy provide only restricted access to the surgeon, computer-aided visualisation systems have been developed. One of the major research areas is the 3D reconstruction of stereo endoscope images. See Figure 1 for an example stereo cardiac laparoscopy image pair.

(a) Left image
(b) Right image
Figure 1: An example stereo laparoscopy image pair [Pratt et al., 2010] [Stoyanov et al., 2010].

The images are acquired from two distinct viewpoints, giving surgeons a sense of depth during surgery. The usual 3D reconstruction approach consists of several steps [Maier-Hein et al., 2013], including the establishment of stereo correspondence between the pixels of the two viewpoints. This step is computationally expensive and also requires prior knowledge of the endoscope used in the procedure, which limits its reusability. Figure 2 shows the disparity map, distance map and 3D point cloud reconstructed from the example stereo image pair of Figure 1, which is taken from the dataset of [Pratt et al., 2010] [Stoyanov et al., 2010]. In this paper, an automatic approach for the 3D reconstruction of stereo endoscopic images is presented. The approach is based on deep neural networks (DNN) [LeCun et al., 2015] and aims to predict 3D coordinates without the costly procedure of stereo correspondence. We have evaluated our approach on a publicly available database, where it performed well compared to a stereo correspondence approach.

(a) Disparity map
(b) Distance map
(c) Reconstructed 3D point cloud
Figure 2: Disparity map, distance map and 3D point cloud extracted from the images shown in Figure 1.

The proposed approach only takes the pixel intensity values for the left and right images, and learns their 3D depth map during training. Figure 3 shows the flow chart of the proposed approach.

Figure 3: Flow chart of the proposed approach.

The rest of the paper is organized as follows: section 2 describes the proposed approach in detail. We provide our experimental methodology in section 3. Section 4 contains the results of our experiments and, finally, we draw conclusions in section 5.

2 3D Reconstruction of Stereo Endoscopic Images Using Deep Neural Networks

In this section, the proposed approach is described in detail. Section 2.1 presents an overview of DNNs. Section 2.2 describes the architecture of our approach, while section 2.3 covers the optimization aspects of the proposed method.

2.1 Deep Neural Networks

Deep Neural Networks (DNNs) are biologically inspired machine learning techniques which do not rely on engineering domain-specific features for each separate problem, but instead map the input data (e.g. images) to a target vector using a sequence of non-linear transformations [LeCun et al., 2015]. In particular, DNNs consist of several layers of high-level representations of the input data. Each layer consists of several nodes, which contain weights, biases and activations in the following form:

$$f(Wx + b),$$

where $x$ is an input vector, $f$ is a non-linear activation function, $W$ is a weight matrix of shape $m \times n$, with $n$ and $m$ being the output dimension of the preceding layer and the input dimension of the succeeding layer, respectively, and $b$ is a bias vector. Each DNN has an input layer, an output layer and several hidden layers of this form. DNNs have proven to be very successful in a wide variety of machine learning tasks.
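As a minimal sketch of the layer form described above, the following NumPy snippet evaluates f(Wx + b) for a single layer. The function name `dense_forward`, the use of tanh as the activation and the toy dimensions are assumptions made purely for illustration; none of them is taken from the paper.

```python
import numpy as np

def dense_forward(x, W, b, activation=np.tanh):
    """Compute activation(W @ x + b) for one fully connected layer."""
    return activation(W @ x + b)

# Toy dimensions (hypothetical): 6 inputs mapped to 4 outputs.
rng = np.random.default_rng(0)
x = rng.random(6)                 # input vector
W = rng.standard_normal((4, 6))   # weight matrix, shape (outputs, inputs)
b = np.zeros(4)                   # bias vector
y = dense_forward(x, W, b)
print(y.shape)  # (4,)
```

Stacking several such calls, each feeding its output into the next, gives the sequence of non-linear transformations that makes up a DNN.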

2.2 Architecture of the proposed DNN

The proposed DNN consists of six layers (see Figure 4).

Figure 4: Architecture of the proposed DNN.

The input layer takes a six-element vector of the following form:

$$\left(l^{1}_{u,v},\; l^{2}_{u,v},\; l^{3}_{u,v},\; r^{1}_{u,v},\; r^{2}_{u,v},\; r^{3}_{u,v}\right),$$

where $l^{c}_{u,v}$ and $r^{c}_{u,v}$ are the pixel intensities at the position $(u, v)$ in channel $c$ of the left and right images, respectively. The output layer produces three outputs, namely the $x$, $y$, $z$ coordinates of the input points. Two fully connected dense layers with 500 neurons each serve as hidden layers in our architecture. A high-dimensional upscaling of the input features like this is effective in learning complex mappings. To avoid overfitting, a dropout layer was also included after each dense layer, which introduces random noise into the outputs of each layer [Srivastava et al., 2014]. We have used Rectified Linear Units (ReLU) as activations [Nair and Hinton, 2010] [Glorot et al., 2011], which are of the form:

$$f(x) = \max(0, x).$$
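The architecture of this section can be sketched in plain NumPy as follows. This is an illustrative forward pass only, not the authors' Keras implementation. The helper names (`build_inputs`, `forward`), the dropout rate of 0.5, the linear output layer, the left-then-right channel ordering and the small random weights are all assumptions, since the text does not specify them. Samples are stored as rows, so each weight matrix has shape (inputs, outputs).

```python
import numpy as np

def build_inputs(left, right):
    """Stack the RGB values of both images into one six-element row per pixel."""
    h, w, _ = left.shape
    return np.concatenate([left, right], axis=2).reshape(h * w, 6)

def relu(a):
    """Rectified linear unit: max(0, a), applied element-wise."""
    return np.maximum(0.0, a)

def dropout(a, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

def forward(X, params, rate=0.5, rng=None):
    """Input -> dense(500) -> dropout -> dense(500) -> dropout -> output(3).

    Dropout is applied only when an rng is passed, i.e. during training.
    """
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(X @ W1 + b1)
    if rng is not None:
        h1 = dropout(h1, rate, rng)
    h2 = relu(h1 @ W2 + b2)
    if rng is not None:
        h2 = dropout(h2, rate, rng)
    return h2 @ W3 + b3  # assumed linear output: predicted x, y, z

rng = np.random.default_rng(0)
sizes = [6, 500, 500, 3]
# Placeholder weights (hypothetical); a trained network would supply these.
params = [(rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

# Tiny random stand-ins for a left/right RGB image pair.
left, right = rng.random((4, 5, 3)), rng.random((4, 5, 3))
X = build_inputs(left, right)
pred = forward(X, params)
print(X.shape, pred.shape)  # (20, 6) (20, 3)
```

Because every pixel is treated as an independent six-element sample, the same network can be applied to frames of any size.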


2.3 Optimization

We have used Adadelta, a version of the stochastic gradient descent (SGD) approach with an adaptive per-dimension learning rate, which is less sensitive to hyperparameter settings than SGD [Zeiler, 2012]. Each weight matrix and vector was initialized using the normalized initialization [Glorot and Bengio, 2010]:

$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\; \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right],$$

where $U$ is the uniform distribution, and $n_j$ and $n_{j+1}$ are the sizes of the previous and next layers, respectively. To avoid overfitting, we have also applied Tikhonov regularization to each weight matrix [Hoerl and Kennard, 1970]. As an energy function, we have used the mean squared error:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (z_i - \hat{z}_i)^2 \right],$$

where $x_i$, $y_i$ and $z_i$ are the ground truth coordinates, $\hat{x}_i$, $\hat{y}_i$ and $\hat{z}_i$ are the predictions of the DNN, and $N$ is the number of training vectors.
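The three ingredients of this section (normalized initialization, a Tikhonov-regularized mean squared error, and the Adadelta update) can be sketched in NumPy as below. This is a hedged illustration rather than the authors' code: the names `glorot_uniform`, `regularized_mse` and `Adadelta` are chosen here, the regularization strength `lam` is a hypothetical value (the text does not report one), and `rho=0.95`, `eps=1e-6` are commonly used Adadelta settings assumed for the example.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Normalized initialization: U[-limit, limit], limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def regularized_mse(pred, target, weights, lam=1e-4):
    """Squared error summed over x, y, z and averaged over points,
    plus an L2 (Tikhonov) penalty on every weight matrix."""
    mse = np.mean(np.sum((pred - target) ** 2, axis=1))
    penalty = lam * sum(np.sum(W ** 2) for W in weights)
    return mse + penalty

class Adadelta:
    """Adadelta update rule (Zeiler, 2012) for a single parameter array."""

    def __init__(self, shape, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.avg_sq_grad = np.zeros(shape)  # running average of squared gradients
        self.avg_sq_step = np.zeros(shape)  # running average of squared updates

    def step(self, param, grad):
        rho, eps = self.rho, self.eps
        self.avg_sq_grad = rho * self.avg_sq_grad + (1 - rho) * grad ** 2
        delta = -np.sqrt(self.avg_sq_step + eps) / np.sqrt(self.avg_sq_grad + eps) * grad
        self.avg_sq_step = rho * self.avg_sq_step + (1 - rho) * delta ** 2
        return param + delta

rng = np.random.default_rng(0)
W = glorot_uniform(6, 500, rng)

# Toy check of the optimizer on f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
opt = Adadelta(w.shape)
for _ in range(100):
    w = opt.step(w, 2 * w)
```

Note that Adadelta needs no hand-tuned global learning rate: the step size for each dimension is derived from the two running averages.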

3 Methodology

We have used a laparoscopic cardiac dataset to evaluate our approach [Pratt et al., 2010] [Stoyanov et al., 2010]. The dataset consists of a pair of videos showing heart movement. The videos consist of 2427 frames in a standard RGB format. The ground truth is a depth map containing the $x$, $y$, $z$ coordinates for each point. We have used a training set of 20 images and tested our approach on the remaining 2407 frames. For training, we have allowed 20 epochs to establish the optimal parameters for the DNN. After training, we have applied the trained DNN pixel-wise to the images of the test set. We have calculated the root mean squared error (RMSE) for each image:

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}},$$

where the mean squared error of section 2.3 is taken over the points of the given image.
We have used Theano [Bergstra et al., 2010] [Bastien et al., 2012] and Keras [Chollet, 2015] for implementation.
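The evaluation protocol of this section can be sketched as follows. The snippet is an assumption-laden illustration: the text does not say which 20 frames were used for training, so the first 20 are taken here, and marking missing ground truth with NaN is a hypothetical encoding chosen only to show how incomplete points could be skipped. The function name `rmse_per_image` is likewise introduced for this example.

```python
import numpy as np

def rmse_per_image(pred, truth):
    """Root mean squared error over the points of one image.

    pred, truth: arrays of shape (num_points, 3) with x, y, z coordinates.
    Points whose ground truth contains NaN are skipped (assumed encoding
    of incomplete ground truth).
    """
    valid = ~np.isnan(truth).any(axis=1)
    sq_err = np.sum((pred[valid] - truth[valid]) ** 2, axis=1)
    return float(np.sqrt(np.mean(sq_err)))

# Train/test split over the 2427 frames (choice of frames is an assumption).
frame_ids = np.arange(2427)
train_ids, test_ids = frame_ids[:20], frame_ids[20:]
print(len(train_ids), len(test_ids))  # 20 2407

# Toy example with three points, one of which lacks ground truth.
truth = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [np.nan, 0.0, 0.0]])
pred = np.array([[3.0, 4.0, 0.0], [1.0, 1.0, 1.0], [9.0, 9.0, 9.0]])
score = rmse_per_image(pred, truth)
```

Averaging `rmse_per_image` over all test frames would yield a single summary figure per experiment.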

4 Results and Discussion

First, to show the validity of the DNN architecture and the training hyperparameters, we have measured the training loss at each epoch. As can be seen in Figure 5, the training loss decreased in each epoch, but the curve started to flatten at later epochs.

Figure 5: Training losses per epoch.

We have also calculated the loss on the test instances. On average, the RMSE per image was 13.18. As can be seen from Figure 6, the losses fluctuated considerably (roughly between 11 and 15). However, the ground truth was incomplete for some of the images, which might have affected the ability to properly evaluate each individual instance.

Figure 6: Histogram of root mean squared error losses on the test data.

In some extreme cases (see Figure 7), the reconstruction by the proposed approach was only partially successful. To correct issues like this, an approach which also incorporates contextual information could be used in the future.

(a) Point cloud from ground truth
(b) Point cloud predicted by the proposed method
Figure 7: Comparison of point clouds extracted from the ground truth and by our approach.

5 Conclusions

Minimally invasive surgical techniques are very important in clinical settings; however, they require computational support to allow surgeons to use these techniques effectively in practice. In this paper, an approach based on deep neural networks has been introduced which, unlike the state-of-the-art approaches, relies only on the input pixels of the stereo image pair. The approach has been evaluated on a publicly available dataset and compared well to the results obtained by a state-of-the-art technique.


This work was supported in part by the project VKSZ 14-1-2015-0072, SCOPIA: Development of diagnostic tools based on endoscope technology supported by the European Union, co-financed by the European Social Fund.


  • [Bastien et al., 2012] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
  • [Bergstra et al., 2010] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
  • [Chollet, 2015] Chollet, F. (2015). Keras.
  • [Glorot and Bengio, 2010] Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256.
  • [Glorot et al., 2011] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315–323.
  • [Hoerl and Kennard, 1970] Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
  • [LeCun et al., 2015] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
  • [Maier-Hein et al., 2013] Maier-Hein, L., Mountney, P., Bartoli, A., Elhawary, H., Elson, D., Groch, A., Kolb, A., Rodrigues, M., Sorger, J., Speidel, S., et al. (2013). Optical techniques for 3d surface reconstruction in computer-assisted laparoscopic surgery. Medical image analysis, 17(8):974–996.
  • [Nair and Hinton, 2010] Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.
  • [Pratt et al., 2010] Pratt, P., Stoyanov, D., Visentini-Scarzanella, M., and Yang, G.-Z. (2010). Dynamic guidance for robotic surgery using image-constrained biomechanical models. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010, pages 77–85. Springer.
  • [Srivastava et al., 2014] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
  • [Stoyanov et al., 2010] Stoyanov, D., Scarzanella, M. V., Pratt, P., and Yang, G.-Z. (2010). Real-time stereo reconstruction in robotically assisted minimally invasive surgery. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2010, pages 275–282. Springer.
  • [Zeiler, 2012] Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701.