Robot-assisted minimally invasive surgery with stereo laparoscopic vision has become popular due to the advantages of enhanced movement range, precision, vision and proficiency [23, 24, 33]. Surgical scene depth estimation is a fundamental problem in image-guided intervention and has received substantial prior interest due to its promise for robot navigation, 3D registration between pre- and intra-operative organ models, and augmented reality. Obtaining depth maps is not trivial due to inherent problems such as tissue deformation, specular reflections, and lack of photometric constancy across frames.
, but these struggle with less textured tissues. More recently, deep learning-based depth estimation has used RGB images as the training data and Convolutional Neural Networks (CNNs) for supervised learning [6, 4]. To produce accurate results in less than a second of GPU time, Luo et al. treated the problem as a multi-class classification over all possible disparities, and exploited a product layer to simplify the representations of a Siamese architecture. Chang et al. proposed PSMNet, in which a spatial pyramid pooling module extracted global context information at different scales and locations to form a cost volume. Duggal et al. sped up stereo matching by developing a differentiable PatchMatch module that could discard most disparities without requiring full cost volume evaluation.
The methods above are fully supervised and require ground truth depth during training. However, acquiring per-pixel ground truth depth data is challenging in real-world settings, and especially for laparoscopic vision, where port space is limited, working distance is short and sterilization is required. One alternative is self-supervised training of depth estimation models using image reconstruction as the supervisory signal. The input is usually a set of monocular or stereo images. Godard et al. proposed a training loss that included a left-right depth consistency term and a reconstruction term for single image depth estimation, despite the absence of ground truth depth. This was later extended with full-resolution multi-scale sampling to reduce visual artifacts, and a minimum reprojection loss to robustly handle occlusions. Johnston et al. further closed the gap with fully-supervised methods by including a self-attention mechanism and making use of contextual information. Ye et al. proposed a deep learning framework for surgical scene depth estimation in self-supervised mode for scalable data acquisition by adopting a differentiable spatial transformer and an autoencoder.
In this paper, we present a new method for self-supervised adversarial depth estimation: SADepth. A U-Net architecture was adopted as the generative structure and fed with stereo pairs as inputs to benefit from complementary information. To cope with local minima caused by the classic photometric reprojection loss, we applied the disparity smoothness loss and trained the network across multiple scales. The use of a generative adversarial network (GAN) allowed us to improve the reconstructed image quality, which formed the supervisory signal for training, while keeping the overall end-to-end optimization objective.
Here we describe the proposed self-supervised adversarial depth estimation framework, SADepth. Stereo depth estimation predicts depth maps from stereo RGB images of height $H$ and width $W$. A generative network $G$, with the stereo image pair $I_l$ and $I_r$ as inputs, was used to produce two distinct left and right disparity maps $d_l$ and $d_r$, i.e. $(d_l, d_r) = G(I_l, I_r)$. As the two disparity maps were generated from different input images, a ‘reprojection sampler’ could be used to compute the photometric reprojection loss on the mutual counterparts, i.e. the reconstructed left and right images $I_l^*$ and $I_r^*$. The discriminator $D$ was exploited to indicate if the reconstructed images were real or fake (original input images were regarded as real). By forcing the reconstructed image to be consistent with the original input, we could derive accurate disparity maps for depth inference, as shown in the following sections.
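To make the role of the reprojection sampler concrete, the sketch below rebuilds one row of the right image from the matching row of the left image. It is a simplified illustration, not the paper's implementation: it assumes rectified stereo, so that reprojection reduces to a horizontal shift, and it assumes the sign convention that a right-image pixel at column x corresponds to column x + d in the left image. The function names `sample_linear` and `reconstruct_right_row` are hypothetical.

```python
import math


def sample_linear(row, x):
    """Linearly interpolate a 1-D intensity row at fractional column x.

    Positions outside the row are clamped to the border (an assumption).
    """
    x = min(max(x, 0.0), len(row) - 1.0)
    x0 = int(math.floor(x))
    x1 = min(x0 + 1, len(row) - 1)
    w = x - x0
    return (1.0 - w) * row[x0] + w * row[x1]


def reconstruct_right_row(left_row, right_disparity):
    """Rebuild a right-image row by sampling the left row at x + d_r(x)."""
    return [sample_linear(left_row, x + d) for x, d in enumerate(right_disparity)]


left_row = [0.0, 10.0, 20.0, 30.0, 40.0]
right_disparity = [1.0, 1.5, 0.0, 0.5, 2.0]
right_star = reconstruct_right_row(left_row, right_disparity)
print(right_star)
```

Because the interpolation weights depend smoothly on the disparity, a photometric error on the reconstructed row can be propagated back to the disparity values, which is what allows image reconstruction to act as the training signal.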
2.2 Network Architecture
The generator followed the general U-Net architecture consisting of an encoder-decoder network, where the encoder was designed to obtain compact image representations and the decoder produced disparity maps for the left and right input images, recovering them at the original scale (illustrated in Figure 3). Encoder-decoder skip connections were applied to represent deep abstract features while preserving local information. To keep the model compact, and unlike previous approaches that used two branches or two sub-networks for the encoder, we first concatenated the left and right images into a 6-channel tensor and then fed it to a ResNet18 model. Similar to prior work, our decoder was formed of five cascaded blocks, each with four parts: a first convolutional layer, an upsampling layer, a concatenation operation, and a second convolutional layer. In the upsampling layer, features were interpolated to twice the input size, and both convolutional layers were followed by an ELU activation function. Sigmoids were applied at the output to generate a 2-channel tensor representing the left and right disparities $d_l$ and $d_r$. Finally, the sigmoid outputs $\sigma$ were converted to depth by $Z = 1/(a\sigma + b)$, where parameters $a$ and $b$ were selected to constrain the depth $Z$ between 0.1 and 100 units. The depth maps were then back-projected into point clouds using the camera intrinsic parameters, and the point clouds were reprojected with the counterpart camera’s extrinsic parameters to form the reconstructed stereo images. The structural similarity between the original and reconstructed images was regarded as a supervisory signal to train the generator (see section 2.3 for the generator loss).
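The sigmoid-to-depth conversion can be illustrated with a short sketch. The text only states that the two parameters constrain depth between 0.1 and 100 units. The closed-form choice below, where a sigmoid output of 0 maps to the far limit and 1 maps to the near limit, is an assumption that satisfies that constraint, and the names `A`, `B` and `sigmoid_to_depth` are hypothetical.

```python
import math

MIN_DEPTH = 0.1    # near limit stated in the text
MAX_DEPTH = 100.0  # far limit stated in the text

# Assumed parameter choice for depth = 1 / (A * sigma + B):
# sigma = 0 gives MAX_DEPTH and sigma = 1 gives MIN_DEPTH.
B = 1.0 / MAX_DEPTH
A = 1.0 / MIN_DEPTH - B


def sigmoid(x):
    """Logistic function mapping a raw network output to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_to_depth(sigma):
    """Convert a sigmoid output in [0, 1] to a depth in [MIN_DEPTH, MAX_DEPTH]."""
    return 1.0 / (A * sigma + B)


for raw in (-10.0, 0.0, 10.0):
    print(raw, sigmoid_to_depth(sigmoid(raw)))
```

Larger sigmoid outputs correspond to larger disparities and therefore to smaller depths, so the mapping is monotonically decreasing.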
Goodfellow et al. introduced a generative adversarial learning strategy and presented impressive results for image generation tasks. GANs have since been widely exploited in different tasks, with model variants including DualGAN and CycleGAN. To improve the generation quality of the reconstructed images $I_l^*$ and $I_r^*$, and following earlier work on natural scenes, we applied an adversarial learning strategy for laparoscopic images to include geometry constraints during training and force the network to make a consistent depth map prediction. The original input stereo image pairs and the reconstructed images generated by the ‘reprojection sampler’ were fed into the discriminator $D$, which consisted of convolutional, batch normalization and activation function layers and classified the input and reconstructed images as real or fake. As training progressed, the reconstructed images became more similar to the original inputs, while the discriminator also became better at distinguishing between the input and reconstructed images, resulting in an overall improvement of the associated disparity maps.
2.3 Training Losses
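In the depth estimation generator network $G$, the loss $\mathcal{L}_{G}$ was formed from the appearance matching loss $\mathcal{L}_{ap}$ and the disparity smoothness loss $\mathcal{L}_{ds}$:
$$\mathcal{L}_{G} = \mathcal{L}_{ap} + \lambda\,\mathcal{L}_{ds},$$
where $\lambda$ balanced the loss magnitude of the two parts to stabilize the training and was set to 0.001.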
Self-supervised training typically assumes that the appearance and material properties (e.g. brightness and Lambertian reflectance) of object surfaces are consistent between frames. A local structure-based appearance loss can improve depth estimation performance compared with simple pairwise pixel differences. Following previous work, we exploited the appearance-matching loss as part of the generator loss, which forced the reconstructed image to be similar to the corresponding training input. During training, the right disparity map $d_r$ generated by the autoencoder was transformed to produce $I_r^*$, a reconstruction of the original right input image $I_r$, using RGB intensity information from the counterpart camera image $I_l$ (see Fig. 1). This was achieved by first converting the disparity map $d_r$ to a depth map $Z_r$, from which a point cloud of the surgical scene could be generated. The point cloud was then transferred into the other camera’s coordinate system and projected onto its image plane. The reconstructed image was generated with bilinear interpolation, in which each output pixel is the weighted sum of the four neighboring intensities. In contrast to earlier approaches, this bilinear sampling was locally fully differentiable, which allowed it to be integrated into the fully convolutional architecture without requiring simplification or approximation of the cost function. To compare the reconstructed image $I_r^*$ and the original input image $I_r$, a combination of the structural similarity (SSIM) index and the $L_1$ loss was applied as the photometric image reconstruction cost $\mathcal{L}_{ap}^{r}$:
$$\mathcal{L}_{ap}^{r} = \frac{1}{N}\sum_{i,j}\left[\frac{\gamma}{2}\left(1 - \mathrm{SSIM}\left(I_r^{ij}, I_r^{*ij}\right)\right) + (1-\gamma)\left\lVert I_r^{ij} - I_r^{*ij}\right\rVert_1\right],$$
where $N$ denotes the number of pixels and $\gamma$ weights the SSIM term against the $L_1$-norm term, and was set to 0.85. Similar to previous work, the calculation of SSIM here was simplified to a block filter instead of a Gaussian. The training of the depth estimation generator then involved minimizing the reconstruction loss between input and reconstructed images.
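The appearance-matching cost can be sketched in plain Python for small grayscale images stored as nested lists with intensities in [0, 1]. Several details here are assumptions made for illustration rather than taken from the paper: the 3x3 block size, the SSIM stabilizing constants `C1` and `C2`, the restriction to interior pixels, and the placement of the 0.85 weight on the SSIM term (the common convention in the cited self-supervised depth work).

```python
C1 = 0.01 ** 2  # assumed SSIM constants for intensities in [0, 1]
C2 = 0.03 ** 2


def ssim_at(a, b, r, c):
    """SSIM of images a and b over the 3x3 block centred at (r, c)."""
    xs = [a[i][j] for i in range(r - 1, r + 2) for j in range(c - 1, c + 2)]
    ys = [b[i][j] for i in range(r - 1, r + 2) for j in range(c - 1, c + 2)]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum(x * x for x in xs) / n - mx * mx
    vy = sum(y * y for y in ys) / n - my * my
    cxy = sum(x * y for x, y in zip(xs, ys)) / n - mx * my
    num = (2 * mx * my + C1) * (2 * cxy + C2)
    den = (mx * mx + my * my + C1) * (vx + vy + C2)
    return num / den


def appearance_loss(target, recon, gamma=0.85):
    """Mean of gamma/2 * (1 - SSIM) + (1 - gamma) * |diff| over interior pixels."""
    h, w = len(target), len(target[0])
    total, count = 0.0, 0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ssim_term = (1.0 - ssim_at(target, recon, r, c)) / 2.0
            l1_term = abs(target[r][c] - recon[r][c])
            total += gamma * ssim_term + (1.0 - gamma) * l1_term
            count += 1
    return total / count


image = [[0.1 * (r + c) for c in range(5)] for r in range(5)]
blank = [[0.0] * 5 for _ in range(5)]
print(appearance_loss(image, image), appearance_loss(image, blank))
```

A perfect reconstruction gives a cost of zero, and the cost grows as either the local structure or the per-pixel intensity departs from the target.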
Disparity Smoothness Loss.
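Since disparities should be locally smooth and discontinuities usually occur at image gradients, we applied the disparity smoothness loss to penalize unexpected discontinuities in the disparity maps. Following previous work, this cost was an edge-aware term weighted with the input image gradients:
$$\mathcal{L}_{ds}^{r} = \frac{1}{N}\sum_{i,j}\left(\left|\partial_x d_r^{ij}\right| e^{-\left|\partial_x I_r^{ij}\right|} + \left|\partial_y d_r^{ij}\right| e^{-\left|\partial_y I_r^{ij}\right|}\right),$$
where $d_r$ represents the generated disparity map and $I_r$ is the original input right image.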
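A minimal sketch of an edge-aware smoothness term is given below, assuming forward finite differences for the gradients and single-channel images stored as nested lists. The function name `edge_aware_smoothness` is hypothetical, and the normalization by the pixel count is one reasonable choice rather than a confirmed detail.

```python
import math


def edge_aware_smoothness(disp, image):
    """Penalize disparity gradients, down-weighted where the image has edges."""
    h, w = len(disp), len(disp[0])
    total = 0.0
    for r in range(h):
        for c in range(w):
            if c + 1 < w:  # horizontal forward difference
                d_grad = abs(disp[r][c + 1] - disp[r][c])
                i_grad = abs(image[r][c + 1] - image[r][c])
                total += d_grad * math.exp(-i_grad)
            if r + 1 < h:  # vertical forward difference
                d_grad = abs(disp[r + 1][c] - disp[r][c])
                i_grad = abs(image[r + 1][c] - image[r][c])
                total += d_grad * math.exp(-i_grad)
    return total / (h * w)


disp_step = [[1.0, 1.0, 2.0, 2.0] for _ in range(2)]
image_flat = [[0.5, 0.5, 0.5, 0.5] for _ in range(2)]
image_edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(2)]
print(edge_aware_smoothness(disp_step, image_flat))
print(edge_aware_smoothness(disp_step, image_edge))
```

The same disparity step costs less when it coincides with an image edge than when it falls in a textureless region, which is the behaviour the edge-aware weighting is meant to produce.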
The adversarial objective of the generative network can be expressed as follows:
$$\mathcal{L}_{adv}^{r} = \mathbb{E}_{I_r}\left[\log D(I_r)\right] + \mathbb{E}_{I_r^*}\left[\log\left(1 - D(I_r^*)\right)\right],$$
where a cross-entropy loss measured the expectation of the reconstructed image $I_r^*$ against the distribution of the input image $I_r$. Note that both generator and discriminator losses included terms for the left and right images, but only the right-image equations are shown.
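The cross-entropy form of the adversarial objective can be illustrated numerically. In this sketch the discriminator outputs are plain probabilities that an image is real. The non-saturating generator term, in which the generator is rewarded when reconstructions are classified as real, is a common practical variant and an assumption here, not something the text specifies. All function names are hypothetical.

```python
import math


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def discriminator_loss(p_real, p_fake):
    """Input images should score 1 (real), reconstructions should score 0 (fake)."""
    real = sum(bce(p, 1.0) for p in p_real) / len(p_real)
    fake = sum(bce(p, 0.0) for p in p_fake) / len(p_fake)
    return real + fake


def generator_adversarial_loss(p_fake):
    """Assumed non-saturating form: reconstructions should be scored as real."""
    return sum(bce(p, 1.0) for p in p_fake) / len(p_fake)


print(discriminator_loss([0.9, 0.8], [0.2, 0.1]))
print(generator_adversarial_loss([0.2, 0.1]))
```

When the discriminator cannot tell the two sets apart and outputs 0.5 everywhere, its loss equals 2 ln 2, the usual equilibrium value for this objective.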
One remaining issue with the above learning pipeline was that the training objective risked becoming stuck in local minima due to the use of a photometric reprojection loss. Prior work indicated that combining the individual losses across multiple scales in the decoder was effective, improving depth estimation performance and reducing sensitivity to architectural choices. Hence, the lower resolution depth maps (from the intermediate layers) were first upsampled to the input image resolution and then reprojected and resampled, with the errors computed at the higher input resolution. This procedure is similar to matching patches: it enables low-resolution disparity maps to warp an entire patch of pixels in a high resolution image, while encouraging the depth maps at every scale to reconstruct the high resolution input image as accurately as possible.
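The multi-scale strategy can be sketched as follows: each intermediate disparity map is upsampled to the input resolution before any loss is evaluated. Nearest-neighbour upsampling and a plain average over scales are simplifying assumptions made for brevity, and the loss used in the example is a stand-in for the full reprojection error. The names `upsample_nearest`, `mean_abs_diff` and `multiscale_loss` are hypothetical.

```python
def upsample_nearest(grid, factor):
    """Upsample a 2-D list by an integer factor using nearest-neighbour copies."""
    h, w = len(grid), len(grid[0])
    return [[grid[r // factor][c // factor] for c in range(w * factor)]
            for r in range(h * factor)]


def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized 2-D lists."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def multiscale_loss(disps_by_scale, full_size, loss_at_full_res):
    """Average a full-resolution loss over disparity maps from several scales."""
    losses = []
    for disp in disps_by_scale:
        factor = full_size // len(disp)  # assumes the same factor in both axes
        losses.append(loss_at_full_res(upsample_nearest(disp, factor)))
    return sum(losses) / len(losses)


reference = [[2.0] * 4 for _ in range(4)]
scales = [
    [[2.0] * 4 for _ in range(4)],  # full resolution
    [[3.0] * 2 for _ in range(2)],  # half resolution
    [[4.0]],                        # quarter resolution
]
print(multiscale_loss(scales, 4, lambda d: mean_abs_diff(d, reference)))
```

Evaluating every scale against the full-resolution target means one coarse disparity value is judged by how well it explains a whole patch of high-resolution pixels.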
Joint Optimization Loss
Finally, the joint optimization loss was a weighted combination of the generator loss and the adversarial loss, written as:
$$\mathcal{L} = \alpha\,\mathcal{L}_{G} + \beta\,\mathcal{L}_{adv}.$$
The depth estimation procedure was trained using the reconstruction supervision signal, and no per-pixel ground truth depth labels were needed. The input data were augmented on the fly by flipping 50% of the input images horizontally and reorienting the stereo pairs. The number of output scales was set to 4, each at a different fraction of the input resolution. The weights of the generator loss and the adversarial loss in the joint objective were both set to 0.5.
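The flip augmentation needs care for stereo data, because mirroring both views reverses which camera is on the left. The sketch below takes "reorienting the stereo pairs" to mean swapping the two views after mirroring. That reading is an assumption, and the function names are hypothetical. Images are nested lists of pixel values.

```python
import random


def hflip(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]


def augment_pair(left, right, rng, p=0.5):
    """With probability p, mirror both views and swap them.

    After mirroring, the original right view plays the role of the left view,
    so swapping keeps the pair geometrically consistent.
    """
    if rng.random() < p:
        return hflip(right), hflip(left)
    return left, right


rng = random.Random(0)  # seeded only to make the sketch repeatable
left = [[1, 2, 3], [4, 5, 6]]
right = [[7, 8, 9], [10, 11, 12]]
new_left, new_right = augment_pair(left, right, rng)
print(new_left, new_right)
```

Without the swap, the mirrored pair would have negative disparities relative to the convention the network was trained on.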
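Table 1. Mean and standard deviation (Std.) of the SSIM index on the dVPN dataset.

| Method | Training | Mean SSIM | Std. SSIM |
|---|---|---|---|
| ELAS | No training | 47.3 | 0.079 |
| SPS | No training | 54.7 | 0.092 |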
3 Experiments and results
We evaluated SADepth on two datasets. The first was the dVPN dataset, collected from da Vinci partial nephrectomy, with 34320 pairs of rectified stereo images for training and 14382 pairs for testing. The second was the SCARED dataset released during the Endovis challenge at MICCAI 2019, with 17206 pairs (datasets 1, 2, 3, 6 and 7) of rectified stereo images for training and 5637 pairs for testing. To verify the generalization of our framework, we trained only on the dVPN dataset but tested on both the dVPN and SCARED datasets.
3.2 Evaluation Metrics, Baseline, and Implementation Details
3.2.1 Evaluation Metrics
As ground truth depth labels were not available for the in vivo surgical data in the dVPN dataset, we adopted the SSIM index between the reconstructed image and the original input image (i.e. $I^*$ and $I$) as the evaluation metric. For the SCARED dataset, the team at Intuitive Surgical collected the ground truth using structured light, so we used the absolute error to assess our SADepth model.
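Both evaluation metrics reduce to simple aggregates, sketched below. The handling of invalid ground truth pixels (represented here as `None` or non-positive values) is an assumption motivated by the fact that structured-light depth is typically incomplete. The use of the population standard deviation for the SSIM summary is also an assumption. Function names are hypothetical.

```python
import statistics


def mean_absolute_depth_error(pred, gt):
    """Mean |pred - gt| over pixels whose ground truth is valid."""
    errors = [abs(p - g)
              for pred_row, gt_row in zip(pred, gt)
              for p, g in zip(pred_row, gt_row)
              if g is not None and g > 0]
    if not errors:
        raise ValueError("no valid ground truth pixels")
    return sum(errors) / len(errors)


def summarize_ssim(scores):
    """Mean and (population) standard deviation of per-image SSIM scores."""
    return statistics.mean(scores), statistics.pstdev(scores)


pred = [[10.0, 20.0], [30.0, 40.0]]
gt = [[12.0, None], [30.0, 0.0]]  # two pixels lack valid ground truth
print(mean_absolute_depth_error(pred, gt))
print(summarize_ssim([0.7, 0.8, 0.9]))
```

Masking invalid pixels keeps missing ground truth from being counted as a large error.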
3.2.2 Baseline
We compared SADepth with several recent works. For the dVPN dataset, we compared our method with stereo matching-based methods (ELAS and SPS), Siamese-based networks (V-Basic and V-Siamese), and recent deep learning methods (Monodepth and the stereo mode of Monodepth2). For the SCARED dataset, we compared our results with the methods summarized in the recent MICCAI sub-challenge paper.
3.2.3 Implementation Details
The SADepth model was implemented in PyTorch, with a batch size of 16 and the same resolution for network input and output. The learning rate was held at its initial value for the first 15 epochs and then dropped to a lower value for the remainder. The model was trained for 20 epochs using the Adam optimizer, which took about 22 hours on a single NVIDIA 2080 Ti GPU.
Table 2. Average absolute depth error on test set 1 and test set 2 of the SCARED dataset.

| Method | Training | Test Set 1 Average | Test Set 2 Average |
|---|---|---|---|
| Lalith Sharan | Supervised | 43.03 | 48.72 |
| Xiaohong Li | Supervised | 22.77 | 20.52 |
| Huoling Luo | Supervised | 19.52 | 18.21 |
| Zhu Zhanshi | Supervised | 9.60 | 21.20 |
| Wenyao Xia | Supervised | 6.73 | 9.44 |
| Congcong Wang | Supervised | 4.10 | 4.28 |
| Trevor Zeffiro | Supervised | 3.60 | 3.47 |
| J.C. Rosenthal | Supervised | 3.44 | 4.05 |
| Dimitris Psychogyios 1 | Supervised | 3.00 | 1.67 |
| Dimitris Psychogyios 2 | Supervised | 2.95 | 2.30 |
| KeXue Fu | Unsupervised | 20.94 | 17.22 |
The results of SADepth and other state-of-the-art methods on the dVPN dataset are summarized in Table 1 using the mean and standard deviation (Std.) of the SSIM index. The SADepth model outperformed the other methods with an SSIM of 79.6, i.e. 24.7 units higher than Monodepth, 8.4 units higher than Monodepth2, and 19.2 units higher than the Siamese architecture.
Table 2 presents the results of SADepth on test set 1 and test set 2 (as defined in the SCARED dataset), together with the performance reported in the MICCAI sub-challenge summary paper. The results show an improvement over the unsupervised methods from the summary paper and recent baselines, while also being competitive with some supervised approaches. This indicates that SADepth generalizes well across datasets collected from different laparoscopes and subjects, while still producing superior performance compared with state-of-the-art unsupervised approaches.
We have presented a new self-supervised adversarial depth estimation framework, SADepth, with an encoder-decoder generator and a concatenated stereo image pair as the input. The adversarial learning strategy improved the generation quality of the framework and led to state-of-the-art performance on two public datasets. Furthermore, SADepth did not require any per-pixel depth labels and generalized well across different laparoscopes, suggesting good applicability to scalable data acquisition when accurate ground truth depth cannot be collected.
-  (2021) Stereo correspondence and reconstruction of endoscopic data challenge. arXiv:2101.01133. Cited by: §3.1, §3.2.2, §3.3, Table 2.
-  (2018) Pyramid stereo matching network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5410–5418. Cited by: §1, §2.2.
-  (2015) Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289. Cited by: §2.2.
-  (2021) Multiple meta-model quantifying for medical visual question answering. arXiv preprint arXiv:2105.08913. Cited by: §1.
-  (2019) Deeppruner: learning efficient stereo matching via differentiable patchmatch. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4384–4393. Cited by: §1.
-  (2014) Depth map prediction from a single image using a multi-scale deep network. arXiv preprint arXiv:1406.2283. Cited by: §1.
-  (2016) Unsupervised cnn for single view depth estimation: geometry to the rescue. In European conference on computer vision, pp. 740–756. Cited by: §1, §2.3.
-  (2010) Efficient large-scale stereo matching. In Asian conference on computer vision, pp. 25–38. Cited by: Table 1, §3.2.2.
-  (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 270–279. Cited by: §1, §2.2, §2.3, Table 1, §3.2.2, §3.3, Table 2.
-  (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838. Cited by: §1, §2.3, §2.3, Table 1, §3.2.2, §3.3, Table 2.
-  (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661. Cited by: §2.2.
-  (2013) Visual slam for handheld monocular endoscope. IEEE transactions on medical imaging 33 (1), pp. 135–146. Cited by: §1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.2.
-  (2013) Pm-huber: patchmatch with huber regularization for stereo matching. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2360–2367. Cited by: §2.3.
-  (2020) Tracking and visualization of the sensing area for a tethered laparoscopic gamma probe. International Journal of Computer Assisted Radiology and Surgery 15 (8), pp. 1389–1397. Cited by: §1.
-  (2021) H-net: unsupervised attention-based stereo depth estimation leveraging epipolar geometry. arXiv preprint arXiv:2104.11288. Cited by: §2.2.
-  (2015) Spatial transformer networks. arXiv preprint arXiv:1506.02025. Cited by: §2.1.
-  (2020) Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4756–4765. Cited by: §1.
-  (2019) Unsupervised stereo matching using confidential correspondence consistency. IEEE Transactions on Intelligent Transportation Systems 21 (5), pp. 2190–2203. Cited by: §1.
-  (2018) Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in vivo clinical data. IEEE transactions on medical imaging 37 (10), pp. 2185–2195. Cited by: §1.
-  (2019) Dense depth estimation in monocular endoscopy with self-supervised learning methods. IEEE transactions on medical imaging 39 (5), pp. 1438–1447. Cited by: §1.
-  (2016) Efficient deep learning for stereo matching. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5695–5703. Cited by: §1.
-  (2001) Minimally invasive and robotic surgery. Jama 285 (5), pp. 568–572. Cited by: §1.
-  (2020) End-to-end real-time catheter segmentation with optical flow-guided warping during endovascular intervention. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9967–9973. Cited by: §1.
-  (2017) Automatic differentiation in pytorch. Cited by: §3.2.3.
-  (2018) Unsupervised adversarial depth estimation using cycled generative networks. In 2018 International Conference on 3D Vision (3DV), pp. 587–595. Cited by: §2.2, §2.2.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1, §2.2.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §2.3.
-  (2019) Self-supervised monocular depth hints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2162–2171. Cited by: §2.3.
-  (2014) Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In European Conference on Computer Vision, pp. 756–771. Cited by: Table 1, §3.2.2.
-  (2017) Self-supervised siamese learning on stereo image pairs for depth estimation in robotic surgery. arXiv preprint arXiv:1705.08260. Cited by: §1, §1, Table 1, §3.1, §3.2.2, §3.3.
-  (2017) Dualgan: unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE international conference on computer vision, pp. 2849–2857. Cited by: §2.2.
-  (2018) A self-adaptive motion scaling framework for surgical robot remote control. IEEE Robotics and Automation Letters 4 (2), pp. 359–366. Cited by: §1.
-  (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1851–1858. Cited by: §1, §2.3, §2.3.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §2.2.