The automatic analysis of surgical videos is at the core of many potential assistance systems for the operating room. The localization of surgical tools, in particular, is required in many applications, such as the analysis of tool-tissue interactions, the development of novel human-robot assistance platforms and the automated annotation of video databases.
In the literature, surgical tool localization has traditionally been approached with fully supervised methods, with the most recent localization and segmentation methods relying on deep learning [4, 8, 10, 11, 13]. However, training fully supervised approaches requires the data to be fully annotated with spatial information, which is tedious and expensive. This may explain why the datasets used so far for tool localization are small, namely on the order of a few thousand images and with a maximum of 5-6 sequences, as described in the recent review [1]. This in turn limits the applicability and generalizability of the approaches that can be developed.
Recently, it has been shown that when a convolutional neural network is trained for the task of classification, the convolutional layers of the network learn general notions about the detected objects. Some recent works have used this fact to successfully localize objects in images without explicitly training for localization [12, 15, 17]. The proposed deep learning approaches directly output spatial heat maps, where the detected position corresponds to the strongest activations. This is achieved by replacing all fully connected layers with equivalent convolutions or removing them altogether. The resulting architectures are called fully convolutional networks (FCNs). Others have extended this approach to address the challenging task of semantic segmentation with weak supervision [2, 9, 14]. In the medical community as well, weakly supervised learning (WSL) has been applied to tasks such as the detection of cancerous regions in medical images [6, 7]. Combined with the recent release of large public surgical video datasets such as Cholec80 [16], which contains 80 complete cholecystectomy videos fully annotated with binary tool presence information (180K frames in total), WSL techniques can potentially help develop tool localization methods that scale up to larger datasets containing much more variability.
In this paper, we propose a method for detecting and localizing surgical tools. It is based on weakly-supervised learning using only image-level labels and does not require any spatial annotation. Our contributions are twofold: (1) we propose the first surgical tool localization approach based on weakly-supervised learning; (2) we demonstrate our approach on the largest public endoscopic video dataset to date, namely Cholec80 [16].
In this work, we present a method for the localization of surgical tools in endoscopic videos that does not require spatial annotations. This is possible with an FCN architecture that preserves the spatial information and permits us to observe activation regions where the tool is detected. Therefore, our method addresses two tasks: binary presence classification and tool localization, with the latter hinging on the former.
Our model takes an image as input and returns $C$ localization heat maps, where $C$ is the number of tools to be detected. For our task on the Cholec80 dataset, $C = 7$. The heat maps are used to compute a confidence value for each class and to perform the binary classification.
2.1 Network Architecture
We build our network on ResNet [5] because it has been shown to perform well on a multitude of tasks. Since we want to preserve relative spatial information throughout our network, we remove the fully connected layer and the average pooling from the end of the network. Additionally, we change the stride in the last two banks of ResNet from 2 to 1 pixel to obtain localization maps with a higher resolution. Note that reducing the strides for all banks would dramatically increase the dimensions of intermediate tensors during training, making it computationally infeasible. These changes have the collective effect of quadrupling the resolution of the output along each spatial dimension. At the output of ResNet, we thus obtain a tensor of 512 feature maps with a global stride of 8 relative to the input image.
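The stride arithmetic behind this change can be checked with a short sketch. The per-stage strides below are an assumption based on the standard ResNet layout (an initial convolution, a max pooling layer and four banks of residual blocks); the stage names are hypothetical labels used only for this illustration.

```python
from math import prod

# Assumed per-stage strides of a standard ResNet (hypothetical stage names).
STANDARD_STRIDES = {
    "conv1": 2, "maxpool": 2,
    "bank1": 1, "bank2": 2, "bank3": 2, "bank4": 2,
}


def global_stride(strides):
    """Return the product of all stage strides (input pixels per output cell)."""
    return prod(strides.values())


# Modified network: the last two banks use a stride of 1 instead of 2.
modified_strides = dict(STANDARD_STRIDES, bank3=1, bank4=1)

standard = global_stride(STANDARD_STRIDES)
modified = global_stride(modified_strides)
gain = standard // modified  # resolution gain per spatial dimension
print(standard, modified, gain)
```

Under these assumptions the global stride drops from 32 to 8, which corresponds to a fourfold gain in output resolution along each spatial dimension.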
Then, we convert the 512 feature maps into localization maps by adding a convolutional layer. To obtain one map per class, we set the number of filters in this layer to the number of classes $C$. Finally, with pooling we transform these maps into a vector of class-wise confidence values, which are, in turn, used for the binary classification of the tools. Instead of using conventional max pooling, we use the extended spatial pooling (ESP) from [2], which extracts more details about the detection of the object. In the ESP formulation, $h_c$ denotes the localization map for class $c$, and the pooling parameter $\alpha$ is set to 0.6, as advised by [2].
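To make the difference between the two pooling schemes concrete, the following is a minimal sketch and not the authors' implementation. It assumes that ESP adds the minimum activation of each map, scaled by a coefficient of 0.6, to its maximum activation; this form is an assumption, since the exact equation is not reproduced in the text above. The function names and the (height, width, classes) array layout are hypothetical.

```python
import numpy as np


def max_spatial_pool(maps):
    """Conventional max pooling: (H, W, C) maps -> (C,) confidences."""
    flat = maps.reshape(-1, maps.shape[-1])
    return flat.max(axis=0)


def extended_spatial_pool(maps, alpha=0.6):
    """Assumed ESP form: per-class maximum plus alpha times per-class minimum."""
    flat = maps.reshape(-1, maps.shape[-1])
    return flat.max(axis=0) + alpha * flat.min(axis=0)


# Two classes on a 2x2 grid.
demo_maps = np.zeros((2, 2, 2))
demo_maps[0, 0, 0] = 3.0    # strong positive evidence for class 0
demo_maps[1, 1, 0] = -1.0   # negative evidence for class 0
demo_maps[..., 1] = 2.0     # uniform activation for class 1
esp_scores = extended_spatial_pool(demo_maps)
msp_scores = max_spatial_pool(demo_maps)
```

With this form, strongly negative regions lower the confidence of a class, whereas max pooling ignores everything but the single strongest activation.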
During inference, we use the raw localization maps to find the predicted position of the tools. First, the localization maps are resized to the original size of the input image with bilinear interpolation. Then, the position of the maximum activation is considered to be the predicted location of the tool.
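The inference step can be sketched as follows, using only NumPy. The bilinear resize below samples with aligned corners, which is an assumption: deep learning frameworks differ in their sampling conventions, and the text does not state which one was used. Function names are hypothetical.

```python
import numpy as np


def resize_bilinear(loc_map, out_h, out_w):
    """Resize a 2-D map with bilinear interpolation (corner-aligned sampling)."""
    in_h, in_w = loc_map.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :]   # horizontal interpolation weights
    top = loc_map[y0][:, x0] * (1 - wx) + loc_map[y0][:, x1] * wx
    bottom = loc_map[y1][:, x0] * (1 - wx) + loc_map[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def predict_location(loc_map, image_h, image_w):
    """Return (row, col) of the maximum activation at image resolution."""
    resized = resize_bilinear(loc_map, image_h, image_w)
    row, col = np.unravel_index(np.argmax(resized), resized.shape)
    return int(row), int(col)


# A 3x4 map with a single peak, upsampled to a 5x7 "image".
demo_map = np.zeros((3, 4))
demo_map[1, 2] = 1.0
demo_location = predict_location(demo_map, 5, 7)
```

In this example the peak at cell (1, 2) of the coarse map lands at pixel (2, 4) of the upsampled map.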
Before training on Cholec80, the ResNet layers are initialized with ImageNet-pretrained weights. During training, the data is first randomly shuffled and batched; data augmentation is then applied independently to each image in a batch.
2.2.1 Data Augmentation
During training, all images in the batch are augmented before being given to the network. Augmentation includes horizontal flipping, random rotation by +90/-90 degrees, as well as the masking procedure introduced in [15]. Masking entails randomly replacing patches in the image with the mean pixel of the training set. This improves the quality of the predicted localization maps.
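A minimal sketch of such a masking step is given below. It assumes that patches are square and aligned on a regular grid, and that each patch is masked independently; the function name, the patch size used in the example and the zero mean pixel are hypothetical values chosen for illustration only.

```python
import numpy as np


def mask_patches(image, mean_pixel, patch_size, p_mask=0.5, rng=None):
    """Replace grid-aligned square patches with mean_pixel, each with probability p_mask."""
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()  # leave the input image untouched
    height, width = image.shape[:2]
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            if rng.random() < p_mask:
                out[top:top + patch_size, left:left + patch_size] = mean_pixel
    return out


# Example on a tiny 4x4 RGB image with hypothetical 2x2 patches.
demo_image = np.ones((4, 4, 3))
demo_mean = np.zeros(3)
all_masked = mask_patches(demo_image, demo_mean, patch_size=2, p_mask=1.0)
none_masked = mask_patches(demo_image, demo_mean, patch_size=2, p_mask=0.0)
```

Because random patches are hidden at every iteration, the network cannot rely on a single discriminative region and is encouraged to respond to the whole tool.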
The models are trained for multi-label classification with the weighted cross-entropy loss presented in Equation 1, where $k_c$ and $v_c$ are respectively the ground truth and predicted tool presence for class $c$, $\sigma$ is the sigmoid function, and $W_c$ is the weight for class $c$. Weights are added to counteract the polarizing effect of class imbalance. The weight for each class is inversely proportional to the number of occurrences of the class in the training set.
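The loss can be illustrated with the sketch below, which is one plausible reading rather than the exact formulation of Equation 1. Two points are assumptions: the class weight is applied to the positive term only, and the weights are normalized to have a mean of 1. Function names are hypothetical.

```python
import numpy as np


def class_weights(labels):
    """Weights inversely proportional to class counts; labels has shape (N, C)."""
    counts = labels.sum(axis=0)
    inverse = 1.0 / np.maximum(counts, 1)  # guard against empty classes
    return inverse / inverse.mean()        # normalization is an assumption


def weighted_cross_entropy(logits, labels, weights):
    """Mean over examples of the summed, class-weighted sigmoid cross-entropy."""
    # Numerically stable log(sigmoid(x)) and log(1 - sigmoid(x)).
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    per_class = -(weights * labels * log_p + (1.0 - labels) * log_not_p)
    return float(per_class.sum(axis=1).mean())


# Class 0 occurs three times, class 1 once: class 1 gets three times the weight.
demo_labels = np.array([[1, 0], [1, 0], [1, 1]])
demo_weights = class_weights(demo_labels)

# With zero logits every term equals log(2).
uniform_loss = weighted_cross_entropy(
    np.zeros((2, 2)), np.array([[1.0, 0.0], [0.0, 1.0]]), np.ones(2)
)
```

Rare classes such as scissors receive a larger weight, so that missing them is penalized more heavily than missing a frequent tool.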
3 Experimental Results
For our experiments, we use the Cholec80 dataset [16] containing 80 videos of cholecystectomy procedures, fully annotated with image-level surgical tool labels for binary detection. Our training, validation and test sets consist of 40, 10 and 30 videos, respectively. Additionally, for the purpose of evaluating the performance of our localization method, our team has fully annotated 5 videos from the test set with bounding boxes and tool centers. The details of these annotations are presented in Table 1. They are also illustrated in column 1 of Figure 3. As part of the preprocessing, we randomly mask square patches of the image by filling them with the average pixel value of the training set. For each patch, the probability of masking is 0.5. We train all the evaluated models for 120 epochs with an initial learning rate of 0.1, which is divided by 10 at epochs 60 and 100. That learning rate is applied to the new convolutional layer, while the layers of ResNet are trained with a learning rate smaller by a factor of 100. Our loss function includes weight decay. The models were trained with the momentum optimizer and a batch size of 16.
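The learning rate schedule described above can be written as a small helper. This is a sketch under two assumptions: epochs are counted from 0, so that the drops take effect at the start of epochs 60 and 100, and the ResNet layers follow the same decay as the new layer. The function and parameter names are hypothetical.

```python
def learning_rates(epoch, base_lr=0.1, boundaries=(60, 100),
                   decay=10.0, backbone_factor=100.0):
    """Return (new_layer_lr, resnet_lr) for a 0-indexed epoch."""
    n_drops = sum(epoch >= boundary for boundary in boundaries)
    new_layer_lr = base_lr / decay ** n_drops
    resnet_lr = new_layer_lr / backbone_factor
    return new_layer_lr, resnet_lr


# Print the rates at the first epoch, after each drop, and at the last epoch.
for epoch in (0, 59, 60, 99, 100, 119):
    print(epoch, learning_rates(epoch))
```

Keeping the pretrained layers at a much smaller rate than the newly added layer limits how far they drift from their ImageNet initialization.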
3.2 Evaluated Models
We evaluate several variants of the architecture presented in Section 2.1 in order to compare them and to search for the best performing configuration. The models we devised are as follows: FCN_ESP (M1), FCN_ESP_Msk (M2), FCN_ESP_MM (M3), FCN_ESP_MM_Msk (M4), FCN_MSP (M5), FCN_MSP_Msk (M6), FCN_MSP_MM (M7), FCN_MSP_MM_Msk (M8). The models M1-M4 use the ESP method described in Section 2.1. To see whether that spatial pooling method is beneficial, we included identical models that use max pooling (MSP) instead: M5-M8. Similarly, to evaluate the benefit of masking images during training, architectures M2, M4, M6 and M8 incorporate masking, while M1, M3, M5 and M7 do not. Finally, models M3, M4, M7 and M8 use multi-maps [2], described below.
Our network architecture contains a convolutional layer of 7 kernels, each dedicated to one tool. Introduced in [2], the notion of multi-maps is based on the following idea: instead of using a single kernel for each class, multiple kernels can be used, followed by class-wise averaging to obtain 7 localization maps. This helps the network to extract more details about the object than when a single feature map is used. The authors of [2] advise using 8 kernels per class. However, since the objects we detect are significantly simpler than the classes used in [2], we use only 4 kernels per class (28 filters altogether).
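The class-wise averaging step can be sketched as a reshape followed by a mean. The sketch assumes that the 28 channels are ordered class by class, that is, the 4 maps of a class occupy consecutive channels; this ordering and the function name are assumptions made for illustration.

```python
import numpy as np


def average_multi_maps(features, num_classes=7, maps_per_class=4):
    """Collapse (H, W, C*M) multi-maps into (H, W, C) localization maps."""
    height, width, channels = features.shape
    if channels != num_classes * maps_per_class:
        raise ValueError("channel count must equal num_classes * maps_per_class")
    grouped = features.reshape(height, width, num_classes, maps_per_class)
    return grouped.mean(axis=-1)  # class-wise averaging


# A single spatial location with channel values 0..27.
demo_features = np.arange(28, dtype=float).reshape(1, 1, 28)
demo_class_maps = average_multi_maps(demo_features)
```

Here the first class averages channels 0 to 3 and the last class averages channels 24 to 27, giving one localization map per tool.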
As mentioned above, we use the dataset Cholec80 to test our method. Specifically, the 30 videos of the test set are used for testing the classification performance. To quantify the results, we use average precision (AP), which is defined as the area under the precision-recall curve. We illustrate the curve for architecture FCN_ESP_MM_Msk in Figure 2, where we see that results for scissors and clipper fall behind the rest of the tools. A similar pattern can be observed in Table 2. All models detect most tools quite well with AP values above 93%. However, the results for scissors (50%) and clipper (82%) are significantly worse than those of the other tools. This may be due to the fact that scissors and clipper are present only in 2% and 4% of annotations, respectively. In contrast, hook is present in 64% of all annotations (see Table 1, row 2).
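For reference, average precision for one tool can be computed from ranked confidence values as sketched below. The area under the precision-recall curve is approximated here by a step-wise sum, which is an assumption: other conventions, such as interpolated precision, give slightly different values, and the text does not state which one was used. The function name is hypothetical, and at least one positive example is required.

```python
import numpy as np


def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve for one class."""
    order = np.argsort(-scores, kind="stable")  # rank by decreasing confidence
    ranked = labels[order]
    true_pos = np.cumsum(ranked)
    false_pos = np.cumsum(1 - ranked)
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / ranked.sum()
    previous_recall = np.concatenate(([0.0], recall[:-1]))
    # Accumulate precision wherever recall increases.
    return float(np.sum((recall - previous_recall) * precision))


demo_scores = np.array([0.9, 0.8, 0.7, 0.6])
demo_truth = np.array([1, 0, 1, 0])
demo_ap = average_precision(demo_scores, demo_truth)
perfect_ap = average_precision(demo_scores, np.array([1, 1, 0, 0]))
```

In the first example one negative frame is ranked above a positive one, which lowers the AP to 5/6; a perfect ranking yields an AP of 1.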
With our method, we are able to obtain localization maps that contain information about the positions of the tools in the frame. Multiple classes of tools can be detected in the same frame. Note, however, that our approach is not designed to detect multiple instances of the same class, because all instances would share the same localization map. In this work, we limit detection to a single instance of each type of tool, even though multiple instance detection could, for example, be possible with post-processing heuristics.
We evaluate the quality of the predictions by comparing them against the ground truth bounding boxes that we have annotated for that purpose. In the cases where multiple instances of the same tool are present in the frame, we pick the bounding box closest to the prediction.
3.4.1 Localization AP
If the predicted location lies in a ground truth bounding box of the same class, with a tolerance of 8 pixels (the global stride of the network), the example is considered a true positive. Otherwise, it is a false positive. Taking that into account, we compute precision and recall as described in [3], where recall is the proportion of ground truth positives that are correctly predicted, and precision is the proportion of true positives among positive predictions. AP is then computed as the area under the precision-recall curve. For this evaluation, we use only the positive classes, as the negative class corresponds to having no tool in the image and cannot be annotated with a bounding box. The results of this computation are presented in Table 3. The localization AP values for all models are similar, ranging approximately between 87% and 89%. Our intuition is that all models are almost equally likely to predict a tool center that lies in the bounding box, and that this metric does not capture the quality of the precise location inside the bounding box. In the next section, we quantify the accuracy of the predicted tool centers relative to the ground truth.
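The true positive criterion can be sketched as follows. Two details are assumptions: the tolerance is applied by enlarging the box by 8 pixels on each side, and the closest box is chosen by the Euclidean distance from the prediction to the box (zero when the point is inside). Boxes are given as hypothetical (x_min, y_min, x_max, y_max) tuples.

```python
import math


def distance_to_box(point, box):
    """Euclidean distance from a point to a box (0 if the point is inside)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    dx = max(x_min - x, 0, x - x_max)
    dy = max(y_min - y, 0, y - y_max)
    return math.hypot(dx, dy)


def is_true_positive(point, boxes, tolerance=8):
    """Check the prediction against the closest same-class bounding box."""
    x, y = point
    x_min, y_min, x_max, y_max = min(
        boxes, key=lambda box: distance_to_box(point, box)
    )
    return (x_min - tolerance <= x <= x_max + tolerance
            and y_min - tolerance <= y <= y_max + tolerance)


# Two instances of the same tool in one frame.
demo_boxes = [(10, 10, 50, 50), (100, 100, 150, 150)]
near_hit = is_true_positive((55, 30), demo_boxes)    # 5 px outside the first box
clear_miss = is_true_positive((60, 30), demo_boxes)  # 10 px outside the first box
```

A prediction that falls 5 pixels outside a box is still accepted, since it is within the 8-pixel tolerance, while one that falls 10 pixels outside is counted as a false positive.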
3.4.2 Distance Error
Localization AP gives a coarse idea about the quality of obtained predictions. To get a better sense of the accuracy of the localization, we compute the distance between the predicted tool center and its ground truth. We normalize this value by the diagonal of the image. The results are presented in Table 4. We can see that, generally, masking and ESP improve the quality of predicted tool centers. On the other hand, multi-maps do not seem to affect the outcome significantly. It is also noteworthy that specimen bag is localized significantly worse than the other tools. This can be explained by the varying shape of the bag, as well as the ambiguity of its center.
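The normalized distance error can be written in a few lines. The sketch assumes that points are (x, y) pixel coordinates and that the Euclidean distance is used; the function name and the image size in the example are hypothetical.

```python
import math


def normalized_distance(predicted, truth, image_width, image_height):
    """Euclidean distance between two points, divided by the image diagonal."""
    diagonal = math.hypot(image_width, image_height)
    return math.dist(predicted, truth) / diagonal


# A 5-pixel error in a 6x8 image whose diagonal is 10 pixels long.
demo_error = normalized_distance((0, 0), (3, 4), image_width=6, image_height=8)
```

Dividing by the diagonal makes the error independent of the image resolution, so values can be compared across videos of different sizes.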
3.4.3 Qualitative Results
For the sake of visual comparison, we present qualitative results for the 8 evaluated models in Figure 3, where input images are overlaid with localization maps. Just as the quantitative results suggest, the performances of the networks are very similar and the detected tool centers are very close to one another in most cases. However, the models with masking and ESP generate more detailed maps that cover the tools better than the other models and provide strong regions of interest (ROIs) for the tools.
In Figure 4, we present additional results for the architecture FCN_ESP_MM_Msk. In the figure, we see which features the network finds most discriminative about each of the tools. Ideally, we aim to localize the working end of the tools only, as the shaft does not usually contain tool-specific features. In Figure 4, we can see that for scissors and irrigator (rows 4 and 6, respectively) the shafts themselves are very distinctive and discriminative. In the case of scissors, the brightest detection corresponds to the shaft. This may explain why the localization AP values for scissors are the lowest among all tools, as the annotated bounding boxes are defined over tool tips only (see column 1 in Figure 3). Specimen bag (last row) is an exception since it is not connected to a shaft. We should also note that the second tool, bipolar, is not fully detected. The network detects the blue insulated section of the forceps but not the metal tips. Our intuition is that the tips look very similar to those of the grasper and hence cannot be used to discriminate one tool from the other. Additional qualitative results can be seen in the supplementary video (https://youtu.be/7VWVY04Z0MA).
In this work, we showed that reliable surgical tool detection and localization can be achieved without the use of spatial annotations during training. Our method relies on an FCN architecture that preserves relative spatial information of the input image. This enables us to localize the surgical tools while using only binary presence annotations for training. We evaluated several variants of our network, obtaining very promising AP values of around 87% and 88% for classification and localization on the test set, respectively. These results also suggest that the proposed approach could be used to ease the generation of spatial annotations within surgical video labeling software and could be extended to tool segmentation.
This work was supported by French state funds managed within the Investissements d’Avenir program by BPI France (project CONDOR) and by the ANR (references ANR-11-LABX-0004 and ANR-10-IAHU-02). The authors would also like to acknowledge the support of NVIDIA with the donation of a GPU used in this research.
1. Bouget, D., Allan, M., Stoyanov, D., Jannin, P.: Vision-based and marker-less surgical tool detection and tracking: a review of the literature. Medical Image Analysis 35, pp. 633–654 (2017)
2. Durand, T., Mordan, T., Thome, N., Cord, M.: Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5957–5966 (2017)
3. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88(2), pp. 303–338 (2010)
4. Garcia-Peraza-Herrera, L.C., Li, W., Fidon, L., Gruijthuijsen, C., Devreker, A., Attilakos, G., Deprest, J., Vander Poorten, E., Stoyanov, D., Vercauteren, T., et al.: Toolnet: Holistically-nested real-time segmentation of robotic surgical tools. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (2017)
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 770–778 (2016)
6. Hwang, S., Kim, H.E.: Self-transfer learning for weakly supervised lesion localization. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 239–246. Springer International Publishing, Cham (2016)
7. Jia, Z., Huang, X., Chang, E.I.C., Xu, Y.: Constrained deep weak supervision for histopathology image segmentation. IEEE Transactions on Medical Imaging 36(11), pp. 2376–2388 (2017)
8. Jin, A., Yeung, S., Jopling, J., Krause, J., Azagury, D., Milstein, A., Fei-Fei, L.: Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. In: IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 691–699 (2018)
9. Kim, D., Cho, D., Yoo, D.: Two-phase learning for weakly supervised object localization. In: IEEE International Conference on Computer Vision (ICCV). pp. 3554–3563 (2017)
10. Kurmann, T., Neila, P.M., Du, X., Fua, P., Stoyanov, D., Wolf, S., Sznitman, R.: Simultaneous recognition and pose estimation of instruments in minimally invasive surgery. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 505–513. Springer (2017)
11. Laina, I., Rieke, N., Rupprecht, C., Vizcaíno, J.P., Eslami, A., Tombari, F., Navab, N.: Concurrent segmentation and localization for tracking of surgical instruments. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 664–672. Springer International Publishing, Cham (2017)
12. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Is object localization for free? - weakly-supervised learning with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 685–694 (2015)
13. Sahu, M., Mukhopadhyay, A., Szengel, A., Zachow, S.: Addressing multi-label imbalance problem of surgical tool detection using cnn. International Journal of Computer Assisted Radiology and Surgery 12(6), pp. 1013–1020 (2017)
14. Saleh, F.S., Aliakbarian, M.S., Salzmann, M., Petersson, L., Alvarez, J.M., Gould, S.: Incorporating network built-in priors in weakly-supervised semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(6), pp. 1382–1396 (2018)
15. Singh, K.K., Lee, Y.J.: Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In: IEEE International Conference on Computer Vision (ICCV) (2017)
16. Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., Padoy, N.: Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging 36(1), pp. 86–97 (2017)
17. Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., Torralba, A.: Object detectors emerge in deep scene cnns. International Conference on Learning Representations (ICLR) (2015)