1 Introduction and Related Work
In recent years there has been significant progress towards computer-based surgical assistance in Minimally Invasive Surgery (MIS) and Retinal Microsurgery (RM). Two of the key components are segmentation and localization of surgical instruments during the intervention: tool segmentation provides, for example, suitable regions for a graphical overlay of additional information without obstructing the surgeon’s view; tool movement is an indicator for surgical workflow analysis; localization of the instrument tips in RM allows proximity estimation to the retina by aligning a cross-sectional view. For these tasks, marker-free approaches are particularly desirable as they do not interfere with the surgical workflow and do not require modifications to the tracked instrument. Despite recent advances, the vision-based tracking of surgical tools in in-vivo scenarios remains challenging, as summarized by Bouget et al. [1], mainly due to nuisances such as strong illumination changes and blur. Prior work in the field relies on handcrafted features, such as Haar wavelets, gradient [3, 4, 5, 6] or color features [7, 8], which come with their own advantages and disadvantages. While color features, for example, are computationally cheap, they are not robust towards the strong illumination changes that are often present during surgery. Gradient features, on the other hand, do not reliably withstand the typical motion blur of the tools. Rieke et al. employed both feature types in two separate Random Forests and proposed to adaptively choose the more reliable one, depending on the confidence of the respective forest’s leaf nodes. Since their explicit feature representation incorporates implicit simplifications, this tends to limit the generalization power of the forests and therefore leads to the risk of tracking failure during surgery. Furthermore, temporal trackers [5, 9, 10] require an initialization of the region of interest. Sarikaya et al. [11] present a deep learning approach for tool detection via region proposals, which provides a bounding box but not a precise localization of the landmarks. Instead of tracking the tool directly, two-step methods based on tool segmentation have also been proposed. Color, HOG and SIFT features were employed by Allan et al. [12] for pixel-wise classification of the image. The position was subsequently determined based on the largest connected components. Instead of reducing the region of interest, Reiter et al. [13] employ the segmentation as a post-processing step for improving the localization accuracy. Recent segmentation methods [14, 15] can also be employed for these two-step approaches. However, the observation that segmentation can be used during both pre- and post-processing suggests that tracking of an instrument landmark and its segmentation are not only dependent, but indeed interdependent.
Our contributions are as follows. Instead of carrying out the tasks as two subsequent pipeline stages, we propose to perform tool segmentation and pose estimation simultaneously, in a unified deep learning approach (Fig. 1). To this end, we reformulate the pose estimation task and model the problem as a heatmap regression where every pixel represents a confidence proportional to its proximity to the correct landmark location. This modeling allows for representing semantic segmentation and localization with equal dimensionality, leveraging their spatial dependency and facilitating simultaneous learning. It also enables employing state-of-the-art deep learning techniques, such as Fully Convolutional Residual Networks [16, 17]. The resulting model is trained jointly and end-to-end for both tasks. It relies only on contextual information and is thus capable of reaching both objectives efficiently without requiring any post-processing technique. We compare the proposed method to state-of-the-art algorithms on the EndoVis Challenge (MICCAI 2015 Endoscopic Vision Challenge, Instrument Segmentation and Tracking Sub-challenge, http://endovissub-instrument.grand-challenge.org) and on a benchmark dataset of in-vivo RM sequences, on which we also outperform other popular CNN architectures, such as U-Net [18] and the FCN-based approach of . To the best of our knowledge, this is the first approach that employs deep learning for surgical instrument tracking and 2D pose estimation by predicting semantic segmentation and localization simultaneously and is successful despite limited data.
2 Method
This section describes our CNN-based approach to model the mapping from an input image to the location of the tool landmarks and the corresponding dense semantic labeling. For this purpose, we motivate the use of a fully convolutional network that models the problem of landmark localization as a regression of a set of heatmaps (one per landmark) in combination with semantic segmentation. This approach exploits global context to identify the position of the tool and has clear advantages compared to patch-based techniques, which rely only on local information and are thus less robust towards false positives, e.g. specular reflections on the instrument or shadows. We compare the proposed architecture with two baselines and discuss its advantages. A common block for all discussed architectures is the encoder (Sec. 2.1), which progressively down-samples the input image through a series of convolutions and pooling operations. The differences lie in the subsequent decoding stages (Sec. 2.2) and the output formulation. An overview of these models is depicted in Fig. 2. We denote a training sample as $(X, S, y)$, where $y \in \mathbb{R}^{n \times 2}$ refers to the 2D coordinates of $n$ tracked landmarks in the image $X \in \mathbb{R}^{w \times h \times 3}$, $S \in \mathbb{R}^{w \times h \times c}$ represents the semantic segmentation for $c$ labels and $w$, $h$ denote the image width and height respectively.
2.1 Encoder
For the encoding part of the three proposed models, we employ ResNet-50 [17], a state-of-the-art architecture that has achieved top performance in several computer vision tasks, such as classification and object detection. It is composed of successive residual blocks, each one consisting of several convolutions and a shortcut (identity) connection summed to its output. In this way, it allows for a very deep architecture without hindering the learning process and with relatively low complexity. Although deeper versions of ResNet exist, we use the 50-layer variant, as computation time is still crucial for our problem. As input to the network, we consider images with pixels. Thus, the feature maps at the last convolutional layer of ResNet have a resolution of pixels. The last pooling layer and the loss layer are removed.
2.2 Decoder Tasks
We then define three different CNN variants, appended to the encoder, to find the best formulation for our task. In the following we outline the characteristics of each model, discuss their differences and motivate the choice of the final proposed model.
2.2.1 Localization (L):
First, we examine the naïve approach, frequently used in the literature, that regresses the real 2D locations of the landmarks directly as a $2n$-dimensional vector representing the $x$ and $y$ coordinates of the $n$ tracked landmarks of the instrument. Here, the segmentation task is excluded. To further reduce the spatial dimensions of the last feature maps, we append another strided residual block to the end of the encoder. Similarly to the original architecture [17], this is followed by an average pooling layer and a fully-connected layer with $2n$ units which produces the output. This dimensionality reduction is needed so that the averaging is not applied over a large region, which would result in a greater loss of spatial information, thus affecting the precision with which the network is able to localize. In this case, the training sample is $(X, y)$, i.e. the image and its landmark coordinates, and the predicted location is $\tilde{y}$. The network is trained with a standard $L_2$ loss: $l_L(\tilde{y}, y) = \lVert y - \tilde{y} \rVert_2^2$.
2.2.2 Segmentation and Localization (SL):
In this model, we regress the 2D locations and additionally predict the semantic segmentation map of an input within a single architecture. Both tasks share weights along the encoding part of the network and then split into two distinct parts to model their different dimensionality. For the regression of the landmark positions we follow the aforementioned model (L). For the semantic segmentation, we employ successive residual up-sampling layers as in [16], to predict the probability of each pixel belonging to a specified class, e.g. manipulator, shaft or background. Due to real-time constraints, we produce the network output with half of the input resolution and bilinearly up-sample the result. By sharing the encoder weights, the two tasks can influence each other while upholding their own objectives. Here, the training sample is $(X, S, y)$, i.e. the image, its segmentation and its landmark coordinates, and the prediction consists of $\tilde{y}$ and $\tilde{S}$. The network is trained by combining the losses for the separate tasks: $l_{SL} = \lambda_L \, l_L(\tilde{y}, y) + l_S(\tilde{S}, S)$, where $l_L$ is the localization loss of model (L) and $\lambda_L$ balances the influence of both loss terms. For the segmentation we employ a pixel-wise softmax-log loss:
$$l_S(\tilde{S}, S) = -\frac{1}{wh} \sum_{u=1}^{w} \sum_{v=1}^{h} \sum_{j=1}^{c} S(u, v, j) \, \log \frac{e^{\tilde{S}(u, v, j)}}{\sum_{k=1}^{c} e^{\tilde{S}(u, v, k)}}$$
where $(u, v)$ indexes the pixels of a $w \times h$ map and $j$, $k$ index the $c$ semantic classes.
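The pixel-wise softmax-log loss named above can be illustrated with a minimal Python sketch. The function name `softmax_log_loss`, the nested-list layout and the use of integer class labels instead of one-hot maps are assumptions made for this example only; they are not taken from the original MatConvNet implementation.

```python
import math


def softmax_log_loss(scores, labels):
    """Mean pixel-wise softmax-log (cross-entropy) loss.

    scores: h x w x c nested lists of raw class scores (before softmax).
    labels: h x w nested lists of integer class indices in [0, c).
    Both layouts are assumptions of this sketch.
    """
    total = 0.0
    count = 0
    for score_row, label_row in zip(scores, labels):
        for pixel_scores, label in zip(score_row, label_row):
            # Log-sum-exp with the maximum subtracted for numerical stability.
            m = max(pixel_scores)
            log_norm = m + math.log(sum(math.exp(s - m) for s in pixel_scores))
            # -log(softmax(scores)[label]) = log_norm - scores[label]
            total += log_norm - pixel_scores[label]
            count += 1
    return total / count


# Toy 1 x 2 "image" with three classes (values are made up).
toy_scores = [[[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]]
toy_labels = [[0, 1]]
print(softmax_log_loss(toy_scores, toy_labels))
```

With uniform scores the loss equals the logarithm of the number of classes, which is a convenient sanity check for an untrained network.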
2.2.3 Concurrent Segmentation and Localization (CSL):
In both L and SL architectures, only a single 2D position is considered as the correct target for each landmark. However, manual annotations can differ in a range of several pixels, which in turn implies discrepancies or imprecise labeling. Predicting an absolute target location is somewhat arbitrary and ignores image context. Therefore, in the proposed model (CSL), we address this problem by regressing a heatmap for each tracked landmark instead of its exact coordinates, as recently used in the field of human pose estimation [21, 22]. The heatmap represents the confidence of being close to the actual location of the tracked point and is created by applying a Gaussian kernel to its ground truth position. The heatmaps have the same size as the segmentation and can explicitly share weights over the entire network. We further enhance the architecture with long-range skip connections that sum lower-level feature maps from the encoding into the decoding stage, in addition to the residual connections of the up-sampling layers. This allows higher resolution information from the initial layers to flow to the output layers without being compressed through the encoder, thus increasing the model’s accuracy. Finally, we enforce a strong dependency of the two tasks by only separating them at the very end and concatenating the predicted segmentation scores (before softmax) to the last set of feature maps as an auxiliary means for guiding the location heatmaps. The overall loss is given by:
$$l_{CSL} = l_S(\tilde{S}, S) + \frac{\lambda_H}{n \, w \, h} \sum_{i=1}^{n} \sum_{u=1}^{w} \sum_{v=1}^{h} \bigl( H_i(u, v) - \tilde{H}_i(u, v) \bigr)^2$$
where $l_S$ is the segmentation loss, $\tilde{H}_i$ is the predicted heatmap of landmark $i$, $H_i(u, v) \propto \exp\bigl( -\lVert y_i - (u, v)^\top \rVert_2^2 / (2\sigma^2) \bigr)$ is its Gaussian ground-truth heatmap with standard deviation $\sigma$ centered at the annotated position $y_i$, and $\lambda_H$ balances the two terms. At inference, the point of maximum confidence in each predicted heatmap is used as the location of the instrument landmark. Notably, a misdetection is indicated by high variance in the predicted map.
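To make the heatmap formulation concrete, the following sketch builds a Gaussian target heatmap around an annotated landmark and recovers the landmark as the position of maximum confidence. The function names, the unnormalized Gaussian (peak value 1) and the toy values for map size and `sigma` are assumptions of this example, not details confirmed by the text.

```python
import math


def gaussian_heatmap(width, height, landmark, sigma):
    """Return a height x width heatmap (list of rows) peaking at landmark = (x, y).

    The Gaussian is left unnormalized so that the peak equals 1; this scaling
    is an assumption of the sketch.
    """
    lx, ly = landmark
    return [
        [math.exp(-((x - lx) ** 2 + (y - ly) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def heatmap_argmax(heatmap):
    """Return the (x, y) position of the maximum confidence in a heatmap."""
    best_value, best_xy = float("-inf"), None
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


# Toy example: an 8 x 6 map with a landmark annotated at x=5, y=2.
target = gaussian_heatmap(width=8, height=6, landmark=(5, 2), sigma=1.5)
print(heatmap_argmax(target))  # prints (5, 2)
```

Because the target is a smooth bump rather than a single pixel, annotations that differ by a few pixels yield similar training targets, which matches the motivation given above.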
3 Experiments and Results
In this section, we evaluate the performance of the proposed method in terms of localization of the instrument landmarks, as well as segmentation accuracy.
Datasets: The Retinal Microsurgery dataset consists of in-vivo sequences, each with frames of resolution pixels. The dataset is further classified into four instrument-dependent subsets. There are three annotated tool joints (the two tool tips and the center joint) and two semantic classes (tool and background).
In the EndoVis challenge, the training data contains four ex-vivo 45 s sequences and the testing data includes the remaining 15 s of the same sequences, plus two new 60 s videos. Notably, the guidelines require excluding the respective surgery from training when testing on the additional 15 s sequence, and one of the long testing sequences includes a previously unseen tool type. All sequences have a resolution of pixels and include one or two surgical instruments. There is one joint per tool and three semantic classes (manipulator, shaft and background).
Implementation details: variance. All images are resized to pixels and augmented during training with random rotations , scaling , random crops of , gamma correction with , a multiplicative color factor and specular reflections. For localization, we set for RM and for EndoVis, in which the tools are larger. All CNNs are trained with stochastic gradient descent with learning rate, momentum and empirically chosen . The inference time is 56 ms per frame on an NVIDIA GeForce GTX TITAN X using MatConvNet.
3.1 Evaluation of Modeling Strategies
First, we evaluate the models for tool landmark localization by training on sequences of the RM dataset and testing on the remaining ones. In Fig. 3, the baseline of explicitly predicting the 2D coordinates of the landmark locations (L) shows the lowest accuracy, while combining localization with the segmentation task (SL) increases performance. The proposed CSL model achieves the highest accuracy of over 90% for both tool tips and 79% for the center joint at an acceptance threshold of 20 pixels. Our model exploits contextual information for precise localization of the tool, by sharing features with the semantic segmentation task. Another baseline is the U-Net architecture [18] trained with the same objectives. CSL is consistently more accurate for the localization task, as well as for the segmentation, achieving a DICE score of 75.4%, compared to 74.4% for CSL without the skip connections, 73.7% for SL and 72.5% for U-Net.
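The threshold-based localization accuracy reported above can be sketched as follows. This is a minimal illustration that assumes the accuracy is the fraction of frames whose predicted landmark lies within the given Euclidean pixel distance of the annotation (inclusive); the function name and the coordinates are made up for the example.

```python
import math


def threshold_accuracy(predictions, ground_truth, threshold):
    """Fraction of frames whose prediction lies within `threshold` pixels.

    predictions, ground_truth: equally long lists of (x, y) tuples, one per frame.
    A prediction exactly at the threshold distance counts as correct
    (an assumption of this sketch).
    """
    hits = 0
    for (px, py), (gx, gy) in zip(predictions, ground_truth):
        if math.hypot(px - gx, py - gy) <= threshold:
            hits += 1
    return hits / len(ground_truth)


# Four toy frames: the errors are 5, 30, 13 and 0 pixels.
predicted = [(100, 100), (130, 95), (210, 40), (50, 60)]
annotated = [(103, 104), (100, 95), (215, 52), (50, 60)]
print(threshold_accuracy(predicted, annotated, threshold=20))  # prints 0.75
```

Sweeping `threshold` over a range of pixel values gives the kind of accuracy-versus-threshold curve typically plotted for such comparisons.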
3.2 Retinal Microsurgery
Analogously to , we train the proposed model (CSL) on all first halves of the RM sequences and evaluate on the remaining frames, referred to as Half Split Experiment. As shown in Fig. 4, the proposed method clearly outperforms the state-of-the-art methods, reaching an average accuracy of more than 84% for the KBB score with . In a second experiment, we evaluate the generalization ability of our method not only to unseen sequences but also to unknown geometry. We employ a leave-one-out scheme on the subsets given by the different instrument types, referred to as Cross Validation Experiment, and show that our method achieves state-of-the-art performance.
3.3 EndoVis Challenge
For this publicly available dataset, we performed our experiments in a leave-one-surgery-out fashion, as specified by the guidelines. We report our quantitative results in Table 1 and compare to the previous state of the art, which we significantly outperform. In all of our experiments, the network was trained with the objective of multi-class segmentation. For the binary prediction, the instrument classes (Shaft and Grasper) were merged. Notably, the proposed method can also distinguish among parts of multiple instruments (Fig. 5), for example left and right, when trained with five classes (left shaft, left grasper, right shaft, right grasper, background) and the corresponding joints.
A challenging aspect of this dataset is that two instruments can be present in the testing set, while only one is included in the training set. To alleviate this problem, we additionally augment with horizontal flips, such that the instrument is at least seen from both sides. Moreover, in Sets 5 and 6 the network was capable of successfully localizing and segmenting a previously unseen instrument and viewpoint. (The challenge administrators believe that the ground truth regarding tracking for sequences 5 and 6 is in fact not as accurate as for the rest of the sequences, which explains the higher localization errors.)
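The horizontal-flip augmentation mentioned above has to mirror the landmark annotations together with the image. A minimal sketch of that bookkeeping, assuming an image stored as a list of pixel rows and landmarks as zero-based `(x, y)` tuples (both representations and the function name are hypothetical):

```python
def flip_horizontal(image, landmarks):
    """Mirror an image and its landmark coordinates along the vertical axis.

    image: list of rows, each row a list of pixel values.
    landmarks: list of (x, y) tuples with zero-based pixel coordinates.
    """
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # Column x moves to column (width - 1 - x); rows are unchanged.
    flipped_landmarks = [(width - 1 - x, y) for x, y in landmarks]
    return flipped_image, flipped_landmarks


# Toy 2 x 3 image with one landmark on the pixel of value 4.
toy_image = [[1, 2, 3],
             [4, 5, 6]]
flipped, points = flip_horizontal(toy_image, [(0, 1)])
print(flipped, points)  # prints [[3, 2, 1], [6, 5, 4]] [(2, 1)]
```

The same mirroring would apply to the segmentation map. If separate left and right instrument classes are used, their labels would presumably need to be swapped as well; the text does not state how this case is handled.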
|Sequence|B.Acc.|Rec.|Spec.|DICE|Prec.|Rec.|Spec.|DICE|Prec.|Rec.|Spec.|DICE|loc. error|
|Balanced Accuracy (B.Acc.), Recall (Rec.), Specificity (Spec.), DICE and Precision (Prec.) are in %.|
|The average localization error (loc. error) is in pixels.|
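For reference, the segmentation metrics listed in the table can be computed from pixel-wise confusion counts as sketched below. The sketch uses the usual definitions (balanced accuracy as the mean of recall and specificity); the function names and the flat binary-mask representation are assumptions of this example, and the challenge's own evaluation code may differ in detail.

```python
def confusion_counts(predicted, reference):
    """Count TP, FP, TN, FN for two equally long binary masks (flat lists of 0/1)."""
    tp = fp = tn = fn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif not p and not r:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def segmentation_metrics(tp, fp, tn, fn):
    """Return the metrics in percent (assumes no denominator is zero)."""
    recall = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    precision = 100.0 * tp / (tp + fp)
    dice = 100.0 * 2 * tp / (2 * tp + fp + fn)
    return {
        "B.Acc.": (recall + specificity) / 2.0,
        "Rec.": recall,
        "Spec.": specificity,
        "Prec.": precision,
        "DICE": dice,
    }


# Toy masks with 3 true positives, 1 false positive, 3 true negatives, 1 false negative.
predicted_mask = [1, 1, 1, 0, 0, 0, 0, 1]
reference_mask = [1, 1, 0, 0, 0, 0, 1, 1]
print(segmentation_metrics(*confusion_counts(predicted_mask, reference_mask)))
```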
4 Conclusion
In this paper, we propose to model the localization of surgical instrument landmarks as heatmap regression. This allows us to leverage deep-learned features via a CNN to concurrently predict the instrument segmentation and its articulated 2D pose in an end-to-end manner. It is worth noting that the resulting method is flexible regarding the number of tracked joints and semantic classes and even allows distinguishing between left and right instruments. These objectives can be specified during training by simply setting the number of the respective semantic classes and heatmaps. Inference runs in near real time and the method does not require an initialization, post-processing technique or temporal regularization. The performance is evaluated on two different surgical intervention benchmarks, on which the proposed approach delivers state-of-the-art results.
References
[1] Bouget, D., Allan, M., Stoyanov, D., Jannin, P.: Vision-based and marker-less surgical tool detection and tracking: a review of the literature. Medical Image Analysis 35 (2017)
[2] Sznitman, R., Richa, R., Taylor, R.H., Jedynak, B., Hager, G.D.: Unified detection and tracking of instruments during retinal microsurgery. IEEE Trans. on Pattern Analysis and Machine Intelligence 35(5) (2013)
[3] Rieke, N., Tan, D.J., Amat di San Filippo, C., Tombari, F., Alsheakhali, M., Belagiannis, V., Eslami, A., Navab, N.: Real-time localization of articulated surgical instruments in retinal microsurgery. Medical Image Analysis 34 (2016)
[4] Bouget, D., Benenson, R., Omran, M., Riffaud, L., Schiele, B., Jannin, P.: Detecting surgical tools by modelling local appearance and global shape. Trans. on Medical Imaging 34(12) (2015)
[5] Li, Y., Chen, C., Huang, X., Huang, J.: Instrument tracking via online learning in retinal microsurgery. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2014) 464–471
[6] Ye, M., Zhang, L., Giannarou, S., Yang, G.Z.: Real-time 3d tracking of articulated tools for robotic surgery. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2016) 386–394
[7] Zhou, J., Payandeh, S.: Visual tracking of laparoscopic instruments. J. Autom. Cont. Eng. Vol. 2(3) (2014) 234–241
[8] Speidel, S., Benzko, J., Krappe, S., Sudra, G., Azad, P., Peter, B.: Automatic classification of minimally invasive instruments based on endoscopic image sequences. In: SPIE medical imaging, International Society for Optics and Photonics (2009) 72610A–72610A
[9] Rieke, N., Tan, D.J., Tombari, F., Page Vizcaíno, J., Amat di San Filippo, C., Eslami, A., Navab, N.: Real-time online adaption for robust instrument tracking and pose estimation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2016) 422–430
[10] Rieke, N., Tan, D.J., Alsheakhali, M., Tombari, F., Amat di San Filippo, C., Belagiannis, V., Eslami, A., Navab, N.: Surgical tool tracking and pose estimation in retinal microsurgery. (2015) 266–273
[11] Sarikaya, D., Corso, J., Guru, K.: Detection and localization of robotic tools in robot-assisted surgery videos using deep neural networks for region proposal and detection. IEEE Transactions on Medical Imaging (2017)
[12] Allan, M., Ourselin, S., Thompson, S., Hawkes, D.J., Kelly, J., Stoyanov, D.: Toward detection and localization of instruments in minimally invasive surgery. IEEE Transactions on Biomedical Engineering 60 (2013) 1050–1058
[13] Reiter, A., Allen, P.K., Zhao, T.: Marker-less articulated surgical tool detection. In: Proc. Computer assisted radiology and surgery. Volume 7. (2012) 175–176
[14] Garcia Peraza Herrera, L., Li, W., Gruijthuijsen, C., Devreker, A., Attilakos, G., Deprest, J., Vander Poorten, E., Stoyanov, D., Vercauteren, T., Ourselin, S.: Real-time segmentation of non-rigid surgical tools based on deep learning and tracking. In: CARE workshop at International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2016)
[15] Pakhomov, D., Premachandran, V., Allan, M., Azizian, M., Navab, N.: Deep residual learning for instrument segmentation in robotic surgery. arXiv preprint arXiv:1703.08580 (2017)
[16] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: Int. Conf. on 3D Vision (3DV), IEEE (2016) 239–248
[17] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 770–778
[18] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2015) 234–241
[19] Alsheakhali, M., Eslami, A., Navab, N.: Detection of articulated instruments in retinal microsurgery. In: Biomedical Imaging (ISBI), IEEE (2016)
[20] Rupprecht, C., Lea, C., Tombari, F., Navab, N., Hager, G.D.: Sensor substitution for video-based action recognition. In: Intelligent Robots and Systems (IROS), IEEE (2016) 5230–5237
[21] Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric prediction for single-image 3D human pose. In: CVPR. (2017)
[22] Pfister, T., Charles, J., Zisserman, A.: Flowing convnets for human pose estimation in videos. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 1913–1921
[23] Sznitman, R., Becker, C., Fua, P. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2014) 692–699