1 Description of Purpose
Minimally Invasive Surgery (MIS) techniques have become popular over the last decades because they offer significant clinical benefits, including reduced recovery time and less scarring for patients. However, most endoscopic systems rely on monocular video, which is known to compromise depth perception, spatial orientation and field of view, making it more difficult for the surgeon to perform suturing reliably.
In MIS, the manual interaction between the needle holder instrument and the surgical needle is a key skill that must be trained and exercised constantly. However, active support and enhanced quantitative analysis for guidance of the suturing process remain relatively unexplored. So far, suturing has been described in many textbooks and tutorials in terms of geometric relations between needle and instruments [9, 7], but live support during surgical training or surgery itself is not available.
Four relevant phases of the suturing process can be defined (cf. Fig. 1):
Grasping the needle with the needle holder: Grasp the needle with the needle holder at approximately 2/3 of its length from the tip. Fix the needle at an angle between 90° and 120° with respect to the needle holder's axis.
Placing the needle: Place the tip of the needle at the entry point of the tissue, forming a 90° angle between the needle tip and the tissue.
Moving the hand: The needle holder is rotated around its longitudinal axis to pass the needle through the tissue. Parallel planes between needle holder and wound should be maintained.
Exiting the needle: Grasp the exiting needle with the second instrument at approximately 1/3 from the tip. Rotate it out of the tissue.
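The geometric rules above can be checked numerically once the needle pose is known. A minimal sketch in Python, assuming the needle is modelled as a circular arc (the helper names and the half-circle default are illustrative, not part of the original system):

```python
import math

def grasp_point(needle_radius, arc_deg=180.0, fraction_from_tip=2.0 / 3.0):
    """Return the (x, y) position on a circular-arc needle (1/2 circle by
    default) located at a given arc-length fraction from the tip."""
    arc_rad = math.radians(arc_deg)
    # On a circular arc, arc-length fraction maps linearly to swept angle.
    theta = fraction_from_tip * arc_rad
    return (needle_radius * math.cos(theta), needle_radius * math.sin(theta))

def grasp_angle_ok(angle_deg, lo=90.0, hi=120.0):
    """Check the needle/needle-holder angle against the recommended range."""
    return lo <= angle_deg <= hi
```

For a 1/2 circle needle, the 2/3 grasp point corresponds to an arc angle of 120° from the tip; `grasp_angle_ok(100.0)` returns `True`, while an 80° grasp is flagged as outside the recommended range.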
Common technical mistakes are a poorly positioned needle or traumatic and inefficient handling of needle and tissue (e.g. pulling the needle instead of performing a pure rotation around its rotation center). This wastes considerable time and reduces the quality and efficiency of the whole process. Training labs offer medical students and surgical residents the possibility to learn suturing on a laparoscopic box trainer and thereby avoid such mistakes. However, the learning process is still not very effective and must be guided by an expert surgeon. At the same time, reliable real-time 3D localization of needle and instruments could be leveraged to augment the laparoscopic scene with additional quantitative parameters and visual cues that better describe their relation. We therefore propose a novel use-case for a laparoscopic Augmented Reality (AR) system, as shown in Fig. 2, which has not been described before.
Realizing such a system requires several sophisticated technical steps, including segmentation, 3D scene understanding and tracking. While some of these steps, like segmentation and vision-based tracking of surgical instruments, have been addressed by several works [2, 3] and challenges (e.g. EndoVis, https://endovissub-instrument.grand-challenge.org/), the same tasks for the needle are relatively unexplored. In comparison to the instruments, the needle is considerably smaller and thinner, exhibits stronger specular reflections and is most of the time partially occluded by the needle holder and tissue. Segmentation and tracking of the needle are therefore highly challenging. Speidel et al. implemented a markerless needle tracking method which combines colour- and geometry-based approaches. The method, however, failed to produce robust segmentation of the needle in the presence of specular highlights, varying light conditions and occlusions.
In this work, we take first steps towards the vision of quantitative suturing support in surgical training. We propose a multi-task supervised convolutional neural network that segments the needle and the needle holders and estimates depth of the scene from a monocular camera. To overcome the scarcity of annotations, we propose to create a virtual representation of a surgical training scenario that includes a suture pad as commonly used in our clinic to train medical students. Furthermore, we propose example AR visualizations that can be connected to the recovered 3D information to guide the trainee.
2 Material and Methods
Supervised training of neural networks for computer vision tasks requires large amounts of training data. Currently, manual annotations or additional sensor data are commonly used as ground truth for visual recognition tasks. However, manual annotation for low-level tasks like semantic segmentation is time-consuming and error-prone, and dense depth information cannot be obtained by manual annotation at all. Beyond that, using additional laser sensors for objects very close to the laparoscope is difficult. Creating virtual environments for the generation of synthetic training data offers a viable solution to this problem, especially for a monocular approach that cannot infer disparity from a second view. This has been investigated by other works in computer vision that address e.g. semantic segmentation, optical flow or disparity estimation [10, 13, 1].
2.1 Creation of a virtual environment
A virtual environment consisting of 3D mesh models of needle holders, needle, thread and a silicone suture pad was created to closely match the arrangement of these objects in a laparoscopic box trainer. A virtual model of the CV-25, 1/2 circle needle was carefully designed. An example scene is shown in Fig. 2 (upper row). The material and illumination properties of the various objects, such as colour, reflections and roughness, were fine-tuned to produce renderings closer to photo-realism. Careful attention was also paid to the generation of the 3D models so as to faithfully recreate the geometric properties of the original objects. In this work, we use Blender 2.79, a 3D modelling and animation software, to create the virtual environment for the generation of synthetic images. These synthetic images, together with their segmentation labels and depth information, form part of the training data for supervised training of the deep learning approach presented below.
2.2 Multi-task deep learning network
For the purpose of solving segmentation and depth estimation jointly, we developed a multi-task encoder-decoder architecture as illustrated in Fig. 3. The basic structure of the network relies on the U-Net, with two different decoders being used to predict the segmentation map and the depth map. Each stage of the encoder consists of two convolution layers, and max-pooling layers down-sample the feature maps at each level of the encoder. The two decoders specialize in depth map estimation and multi-class segmentation, respectively. The basic block in each decoder consists of two convolution layers, each followed by a Leaky ReLU activation. Transposed convolution layers allow the network to learn optimal up-sampling of the feature maps, and skip connections to the decoders restore spatial information lost during down-sampling.
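The level-by-level shape bookkeeping of such an encoder-decoder can be sketched in plain Python (the channel counts and number of levels below are assumptions following the standard U-Net pattern; the paper does not state them):

```python
def unet_shapes(input_hw=(512, 512), base_channels=64, levels=4):
    """Trace feature-map shapes through a U-Net-style shared encoder.

    Returns the (height, width, channels) at each encoder level plus the
    bottleneck; each decoder mirrors this list in reverse, concatenating
    the matching encoder map via a skip connection before up-sampling.
    """
    h, w = input_hw
    c = base_channels
    shapes = []
    for _ in range(levels):
        shapes.append((h, w, c))          # two conv layers keep spatial size
        h, w, c = h // 2, w // 2, c * 2   # max-pooling halves H and W
    shapes.append((h, w, c))              # bottleneck
    return shapes
```

For a 512×512 input this yields (512, 512, 64), (256, 256, 128), (128, 128, 256), (64, 64, 512) and a (32, 32, 1024) bottleneck, which both decoders then up-sample back to full resolution.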
The multi-task learning strategy based on hard parameter sharing of the encoder acts as a regularization method and reduces the risk of overfitting. The network uses categorical cross-entropy and mean squared error (MSE) loss functions for multi-class segmentation and depth map estimation, respectively. The two branches of the network were initially trained simultaneously on the synthetic data set, and the segmentation branch was subsequently retrained on annotated real images to improve the network's segmentation performance on real data. Sharing the encoder weights between segmentation and depth prediction enables both tasks to jointly use the same high-level information.
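The combined objective can be sketched on toy per-pixel lists (the equal weighting of the two branch losses is an assumption; the actual weighting is not stated in the text):

```python
import math

def cross_entropy(probs, target_class):
    """Categorical cross-entropy for one pixel: -log p of the true class."""
    return -math.log(probs[target_class])

def mse(pred_depth, true_depth):
    """Mean squared error over a flat list of depth values."""
    return sum((p - t) ** 2 for p, t in zip(pred_depth, true_depth)) / len(pred_depth)

def multitask_loss(seg_probs, seg_targets, pred_depth, true_depth, w_depth=1.0):
    """Hard parameter sharing: one encoder, two branch losses summed."""
    l_seg = sum(cross_entropy(p, t)
                for p, t in zip(seg_probs, seg_targets)) / len(seg_targets)
    l_depth = mse(pred_depth, true_depth)
    return l_seg + w_depth * l_depth
```

Because the gradient of the summed loss flows into the shared encoder from both decoders, features useful for one task regularize the other.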
2.3 Augmented Reality Visualizations
The proposed AR environment could assist the surgeon in the suturing task through additional visualization of 1) coloured needle segments with the optimal center of rotation and 2) the plane of the needle and a fixed coordinate system of the needle holder. These visualizations, along with measurements of the angle between the plane of the needle and the coordinate system of the instrument, could enable surgeons to grasp the needle at the recommended position and to maintain the correct needle trajectory.
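The angle measurement mentioned above reduces to a plane-versus-line angle. A minimal sketch, assuming the needle plane normal and the instrument axis are already recovered as 3D vectors (the function name is illustrative):

```python
import math

def angle_plane_axis_deg(plane_normal, axis):
    """Angle between a plane (given by its normal) and a line direction,
    in degrees: 90 degrees minus the normal-to-axis angle."""
    dot = sum(n * a for n, a in zip(plane_normal, axis))
    nn = math.sqrt(sum(n * n for n in plane_normal))
    na = math.sqrt(sum(a * a for a in axis))
    # abs() folds the two normal orientations together; clamp before acos
    # for numerical safety.
    c = max(-1.0, min(1.0, abs(dot) / (nn * na)))
    return 90.0 - math.degrees(math.acos(c))
```

An axis lying in the plane yields 0°, an axis parallel to the normal yields 90°; the AR overlay could colour-code this value against the recommended grasp range.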
2.4 Training Data
In total, 218 frames were rendered in the virtual scene from different perspectives, including random noise. 21 images were used for testing, and data augmentation was performed on the 197 remaining images. Images were resized to 512 pixels along one axis, and a random crop of 512 pixels was performed along the other axis. This resulted in 732 synthetic images used for training. In addition to the synthetic data set, video data from real laparoscopic training sequences was captured, segmented by a physician and randomly cropped, yielding 144 frames for training. The model was first trained on the synthetic data set, and the encoder and segmentation branch were subsequently re-trained on the 144 real images to fine-tune the network weights for the segmentation task. Training was initiated with a learning rate of 1e-5, and the Adam solver was used as the optimizer.
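The resize-and-crop augmentation amounts to simple coordinate bookkeeping. A sketch of the crop-window selection (the resize step itself is left to any image library; the function name is illustrative):

```python
import random

def random_crop_box(width, height, crop=512, rng=None):
    """Pick a crop window of size crop x crop as (x0, y0, x1, y1);
    assumes the image was already resized so min(width, height) >= crop."""
    rng = rng or random.Random()
    x0 = rng.randint(0, width - crop)
    y0 = rng.randint(0, height - crop)
    return (x0, y0, x0 + crop, y0 + crop)
```

Repeated crops from each rendered frame are what expand the 197 remaining synthetic frames into the 732 training images.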
The virtual environment has been used to create example mock-up AR-visualizations that have the sole purpose of illustrating the concept of the proposed AR-based suturing support (Fig. 2). In a final application scenario, these visualizations must be connected to the recovered 3D position of the needle and the instrument. We will address this connection in future work.
3 Results
The network produces a multi-channel output for the multi-class segmentation task, with the prediction for each object class obtained in a separate channel. Prediction accuracy was evaluated on an independent test set consisting of synthetic and real images. The network achieved a Dice score of 0.39 for the needle on synthetic images and a much higher Dice score of 0.67 on real images. The instruments achieved Dice scores of 0.95 (synthetic) and 0.81 (real).
Mean absolute error, $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert d_i - \hat{d}_i \rvert$, was used as the metric for estimating the model's performance in the depth map prediction, with $\hat{d}_i$ representing the corresponding ground truth depth value.
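Both evaluation metrics are standard and can be sketched directly, with flat 0/1 or depth-value lists standing in for image pixels:

```python
def dice_score(pred_mask, gt_mask):
    """Dice coefficient for binary masks given as flat 0/1 lists."""
    intersection = sum(p * g for p, g in zip(pred_mask, gt_mask))
    total = sum(pred_mask) + sum(gt_mask)
    # Convention: two empty masks count as a perfect match.
    return 2.0 * intersection / total if total else 1.0

def mean_abs_error(pred_depth, gt_depth):
    """Mean absolute error between predicted and ground-truth depth maps."""
    return sum(abs(p - g) for p, g in zip(pred_depth, gt_depth)) / len(pred_depth)
```

In the multi-class setting, the Dice score is computed per output channel against the corresponding one-hot ground-truth mask.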
The model achieved a mean absolute error of 6.5 mm on the synthetic test data set. The Dice scores of the individual classes are shown in Table 1. Qualitative results of the predictions on real and synthetic images are depicted in Figs. 4 and 5. The model was able to produce a segmentation and depth estimation of the scene even when faced with challenges like occlusion.
Table 1:
|                | Synthetic Images | Real Images (without fine-tuning) | Real Images (after fine-tuning) |
| MAE depth (mm) | 6.5              | -                                 | -                               |
4 Discussion
This paper describes a novel use-case for an AR concept that could potentially guide surgical training. The work takes first steps towards this goal with regard to joint segmentation and depth estimation of the needle and the needle holder. To overcome the scarcity of labels, in particular of dense depth information, a virtual environment was created and subsequently leveraged for training data generation in a supervised deep learning approach.
With the emergence of higher resolutions of endoscopic images, i.e. full high definition or even 4K, it is now possible to capture the needle more reliably. The proposed method achieved good segmentation and reasonable depth estimation results on both synthetic and real images (cf. Fig. 4). We want to further use this information to train another network for fine 3D localization on the relevant sub-region, since this application requires a much higher accuracy than has been achieved yet.
Because it is not straightforward to obtain reliable depth information from a surgical training scenario, due to the lack of available sensors working on such a small scale with sufficient resolution, we have chosen to recreate the same scene in a virtual environment to generate training data. The drawback of this approach is that, so far, we can only provide qualitative results on depth estimation of the real scene.
We exploited the advantage of using a virtual environment, which enables the generation of high-quality training data, potentially on a large scale in the future. Given the latest developments in domain transfer using generative adversarial networks (GANs), we think that comparable results can be achieved in a more complex training scenario or an intraoperative scenario. For instance, Pfeiffer et al. showed that learned anatomical appearances can be mapped onto virtual renderings of the liver using GANs. Similarly, Engelhardt et al. [6, 5] showed improvements of realistic appearance in minimally invasive endoscopic surgical training for mitral valve repair. Similar approaches could be taken into consideration here.
The presented approach will enable autonomous learning applications without the need for assistance from teaching personnel. Additionally, it could be highly relevant for surgical skill assessment, as it quantitatively describes the suturing process in addition to, e.g., robotic kinematic data.
The Titan Xp GPU card used for this research was donated by the NVIDIA Corporation. We furthermore thank Karl Storz for providing the 3D-surface of the needle holder.
- (2019) Monocular depth estimation: a survey. arXiv abs/1901.09402.
- (2018) Comparative evaluation of instrument segmentation and tracking methods in minimally invasive surgery. arXiv abs/1805.02475.
- (2017) Vision-based and marker-less surgical tool detection and tracking: a review of the literature. Medical Image Analysis 35, pp. 633–654.
- (2019) Replicated mitral valve models from real patients offer training opportunities for minimally invasive mitral valve repair. Interactive CardioVascular and Thoracic Surgery 29 (1), pp. 43–50.
- (2019) Cross-domain conditional generative adversarial networks for stereoscopic hyperrealism in surgical training. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, Cham, pp. 155–163.
- (2018) Improving surgical training phantoms by hyperrealism: deep unpaired image-to-image translation from real surgeries. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, Cham, pp. 747–755.
- (2015) Fundamentals of geometric laparoscopy and suturing. 1st edition, Endo Press.
- (2016) Development and validation of a sensor- and expert model-based training system for laparoscopic surgery: the iSurgeon. Surgical Endoscopy 31, pp. 2155–2165.
- (2015) Romeo's gladiator rule: knots, stitches and knot tying techniques – a tutorial based on a few simple rules. 2nd edition, Endo Press.
- (2018) What makes good synthetic training data for learning disparity and optical flow estimation? International Journal of Computer Vision 126 (9), pp. 942–960.
- (2019) Generating large labeled data sets for laparoscopic image processing tasks using unpaired image-to-image translation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, Cham, pp. 119–127.
- (2015) U-Net: convolutional networks for biomedical image segmentation. arXiv abs/1505.04597.
- (2016) The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3234–3243.
- (2017) An overview of multi-task learning in deep neural networks. arXiv abs/1706.05098.
- (2020) Self-directed training with e-learning using the first-person perspective for laparoscopic suturing and knot tying: a randomised controlled trial. Surgical Endoscopy 34, pp. 869–879.
- (2015) Image-based tracking of the suturing needle during laparoscopic interventions. In Medical Imaging 2015: Image-Guided Procedures, Robotic Interventions, and Modeling, R. J. W. III and Z. R. Yaniv (Eds.), Vol. 9415, pp. 70–75.