3dReconstructionDL
In this paper, an automatic approach to predicting 3D coordinates from stereo laparoscopic images is presented. The approach maps a vector of pixel intensities to 3D coordinates by training a six-layer deep neural network. The architectural aspects of the approach are presented in detail, and the method is evaluated on a publicly available dataset with promising results.
Minimally invasive surgery (MIS) has become a widespread technique for gaining surgical access to the abdomen of patients without causing major damage to the skin or tissues. Since MIS techniques such as laparoscopy and endoscopy provide only restricted access for the surgeon, computer-aided visualisation systems are being developed. One of the major research areas is the 3D reconstruction of stereo endoscope images. See Figure 1 for an example stereo cardiac laparoscopy image pair.
The images are acquired from two distinct viewpoints, helping surgeons gain a sense of depth during surgery. The usual 3D reconstruction approach consists of several steps [Maier-Hein et al., 2013], involving the establishment of stereo correspondence between the pixels of the two viewpoints, which is computationally expensive and also requires prior knowledge of the endoscope used in the procedure, limiting its reusability. Figure 2 shows an example stereo image pair from the dataset of [Pratt et al., 2010] [Stoyanov et al., 2010], alongside the reconstructed disparity map, distance map and 3D point cloud. In this paper, an automatic approach for the 3D reconstruction of stereo endoscopic images is presented. The approach is based on deep neural networks (DNNs) [LeCun et al., 2015] and aims to predict 3D coordinates without the costly procedure of stereo correspondence. We have evaluated our approach on a publicly available database, where it performed well compared to a stereo correspondence approach.
The proposed approach takes only the pixel intensity values of the left and right images, and learns the corresponding 3D depth map during training. Figure 3 shows the flow chart of the proposed approach.
In this section, the proposed approach is described in detail. Section 2.1 presents an overview of DNNs. Section 2.2 presents the architecture of our approach, while Section 2.3 describes the optimization aspects of the proposed method.
Deep Neural Networks (DNNs) are biologically inspired machine learning techniques which do not rely on engineered domain-specific features for each separate problem, but instead map the input data (e.g. images) to a target vector using a sequence of non-linear transformations [LeCun et al., 2015]. In particular, DNNs consist of several layers of high-level representations of the input data. Each layer consists of several nodes, which contain weights, biases and activations of the following form:

a = f(Wx + b)    (1)

where x is an input vector, f is a non-linear activation function, W is a weight matrix of shape m × n, where n and m are the dimensions of the preceding and succeeding layers, respectively, and b is a bias vector. Each DNN has an input layer, an output layer and several hidden layers of this form. DNNs have proven to be very successful in a wide variety of machine learning tasks.
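As a minimal sketch of the layer computation in equation (1) — with illustrative layer sizes, not the ones prescribed by any particular architecture — the forward step of a single dense layer can be written as:

```python
import numpy as np

def dense_layer(x, W, b, f):
    """One fully connected layer: a = f(Wx + b), as in equation (1)."""
    return f(W @ x + b)

def relu(z):
    """A common non-linear activation function."""
    return np.maximum(0.0, z)

# Illustrative shapes: a 6-dimensional input mapped to 500 hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(500, 6))
b = np.zeros(500)
x = rng.random(6)              # e.g. six pixel intensities
a = dense_layer(x, W, b, relu)
print(a.shape)                 # (500,)
```

Stacking several such layers, each feeding its output `a` into the next, yields the multi-layer mapping described above.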
The proposed DNN consists of six layers (see Figure 4).
The input layer takes a six-element vector of the form (l_R, l_G, l_B, r_R, r_G, r_B), where l_c and r_c are the pixel intensities at the given position in channel c of the left and right images, respectively. The output layer produces three outputs, namely the x, y, z coordinates of the input point. Two fully connected dense layers with 500 neurons each serve as hidden layers in our architecture. A high-dimensional upscaling of the input features like this is effective in learning complex mappings. To avoid overfitting, a dropout layer was also included after each dense layer, which introduces random noise into the outputs of the layer [Srivastava et al., 2014]. We have used Rectified Linear Units (ReLUs) as activations [Nair and Hinton, 2010] [Glorot et al., 2011], which are of the form:

f(x) = max(0, x)    (2)
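The 6 → 500 → 500 → 3 forward pass can be sketched as follows. This assumes standard inverted dropout with a rate of 0.5; the rate and the initialization scales here are our own illustrative assumptions, not values stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(z):
    return np.maximum(0.0, z)

def dropout(a, rate, training, rng):
    """Inverted dropout: randomly zero activations during training only."""
    if not training or rate == 0.0:
        return a
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

# Hypothetical parameters for the 6 -> 500 -> 500 -> 3 architecture
W1, b1 = rng.normal(scale=0.1, size=(500, 6)), np.zeros(500)
W2, b2 = rng.normal(scale=0.05, size=(500, 500)), np.zeros(500)
W3, b3 = rng.normal(scale=0.05, size=(3, 500)), np.zeros(3)

def predict(pixels, training=False):
    """Map (l_R, l_G, l_B, r_R, r_G, r_B) to predicted (x, y, z)."""
    h1 = dropout(relu(W1 @ pixels + b1), 0.5, training, rng)
    h2 = dropout(relu(W2 @ h1 + b2), 0.5, training, rng)
    return W3 @ h2 + b3

xyz = predict(rng.random(6))
print(xyz.shape)  # (3,)
```

At test time (`training=False`) the dropout layers are identity functions, so the network reduces to the plain six-layer mapping.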
We have used an adaptive per-dimension learning rate version of the stochastic gradient descent (SGD) approach called Adadelta, which is less sensitive to hyperparameter settings than SGD
[Zeiler, 2012]. Each weight matrix and bias vector were initialized using the normalized initialization [Glorot and Bengio, 2010]:

W ~ U[−√6/√(n_j + n_{j+1}), √6/√(n_j + n_{j+1})]    (3)

where U is the uniform distribution, and n_j and n_{j+1} are the sizes of the previous and next layers, respectively. To avoid overfitting, we have also applied Tikhonov regularization to each weight matrix [Hoerl and Kennard, 1970]. As an energy function, we have used the mean squared error:

MSE = (1/N) Σ_{i=1}^{N} [(x_i − x̂_i)² + (y_i − ŷ_i)² + (z_i − ẑ_i)²]    (4)

where x_i, y_i and z_i are the ground truth coordinates, x̂_i, ŷ_i and ẑ_i are the DNN's predictions, and N is the number of training vectors.
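The normalized initialization of equation (3) and the energy function of equation (4) can be sketched as follows (the function names are ours):

```python
import numpy as np

def glorot_uniform(n_prev, n_next, rng):
    """Normalized initialization, eq. (3): W ~ U[-limit, limit]."""
    limit = np.sqrt(6.0 / (n_prev + n_next))
    return rng.uniform(-limit, limit, size=(n_next, n_prev))

def mse_loss(truth, pred):
    """Eq. (4): mean squared (x, y, z) error over N training vectors.

    truth, pred: arrays of shape (N, 3).
    """
    return np.mean(np.sum((truth - pred) ** 2, axis=1))

rng = np.random.default_rng(0)
W = glorot_uniform(6, 500, rng)   # input layer -> first hidden layer
print(W.shape)                    # (500, 6)

truth = rng.random((10, 3))
print(mse_loss(truth, truth))     # 0.0 for a perfect prediction
```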
We have used a laparoscopic cardiac dataset to evaluate our approach [Pratt et al., 2010] [Stoyanov et al., 2010]. The dataset consists of a pair of videos showing heart movement. Each video consists of 2427 frames, all with the same spatial resolution, in standard RGB format. The ground truth is a depth map containing the x, y, z coordinates for each point. We have used a training set of 20 images and tested our approach on the remaining 2407 frames. For training, we have allowed 20 epochs to establish the optimal parameters of the DNN. After training, we have applied pixel-wise prediction on the images of the test set. We have calculated the root mean squared error for each image:

RMSE = √((1/P) Σ_{i=1}^{P} [(x_i − x̂_i)² + (y_i − ŷ_i)² + (z_i − ẑ_i)²])    (5)

where P is the number of pixels in the image.
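Under the assumption that the per-image error averages the squared 3D distance over all P pixels of an image, it can be computed as:

```python
import numpy as np

def per_image_rmse(truth, pred):
    """Per-image RMSE over P pixels, each with (x, y, z) coordinates.

    truth, pred: arrays of shape (P, 3).
    """
    sq_dist = np.sum((truth - pred) ** 2, axis=1)  # squared 3D error per pixel
    return np.sqrt(np.mean(sq_dist))

truth = np.zeros((4, 3))
pred = np.ones((4, 3))              # every pixel off by (1, 1, 1)
print(per_image_rmse(truth, pred))  # sqrt(3) ~ 1.732
```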
We have used Theano [Bergstra et al., 2010] [Bastien et al., 2012] and Keras [Chollet, 2015] for the implementation. First, to show the validity of the DNN architecture and the training hyperparameters, we measured the training loss at each epoch. As can be seen in Figure 5, the training loss decreased in every epoch, though the curve started to flatten at later epochs.
We have also calculated the loss on the test instances. On average, the RMSE per image was 13.18. As can be seen in Figure 6, there was considerable fluctuation in the losses (around 11–15). However, the ground truth was incomplete for some of the images, which might have affected our ability to properly evaluate each individual instance.
However, in some extreme cases (see Figure 7), the reconstruction by the proposed approach was only partially successful. To correct such issues, an approach that also incorporates contextual information could be used in the future.
Minimally invasive surgical techniques are very important in clinical settings; however, they require computational support to allow surgeons to use them effectively in practice. In this paper, an approach based on deep neural networks has been introduced which, unlike state-of-the-art approaches, relies only on the input pixels of the stereo image pair. The approach has been evaluated on a publicly available dataset and compared well to the results obtained by a state-of-the-art technique.
This work was supported in part by the project VKSZ 14-1-2015-0072, SCOPIA: Development of diagnostic tools based on endoscope technology supported by the European Union, co-financed by the European Social Fund.
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256.
Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.