Weakly Supervised Convolutional LSTM Approach for Tool Tracking in Laparoscopic Videos

12/04/2018 ∙ by Chinedu Innocent Nwoye, et al. ∙ Université de Strasbourg

Purpose: Real-time surgical tool tracking is a core component of the future intelligent operating room (OR), as it is highly instrumental for analyzing and understanding surgical activities. Current methods for surgical tool tracking in videos must be trained on data in which the spatial positions of the tools are manually annotated. Generating such training data is difficult and time-consuming. Instead, we propose to use solely binary presence annotations to train a tool tracker for laparoscopic videos.

Methods: The proposed approach is composed of a CNN + Convolutional LSTM (ConvLSTM) neural network trained end-to-end, but weakly supervised on tool binary presence labels only. We use the ConvLSTM to model the temporal dependencies in the motion of the surgical tools and leverage its spatio-temporal ability to smooth the class peak activations in the localization heat maps (Lh-maps).

Results: We build a baseline tracker on top of the CNN model and demonstrate that our approach based on the ConvLSTM outperforms the baseline in tool presence detection, spatial localization, and motion tracking by over 5.0%, 13.9%, and 12.6%, respectively.

Conclusions: In this paper, we demonstrate that binary presence labels are sufficient for training a deep learning tracking model using our proposed method. We also show that the ConvLSTM can leverage the spatio-temporal coherence of consecutive image frames across a surgical video to improve tool presence detection, spatial localization, and motion tracking.




1 Introduction

The automated analysis of surgical workflow can support many routine surgical activities by providing clinical decision support, report generation, and data annotation. This has sparked active research in the medical computer vision community, particularly on surgical phase recognition tmi:twinanda2017endonet ; ipcai:zisimopoulos2018deepphase and tool detection ipcai:richa2011visual ; miccai:sznitman2014fast ; miccai:vardazaryan2018weakly ; miccai:sznitman2011unified ; mai:al2018monitoring ; wacv:jin2018tool . Since surgical activities can now be captured using cameras, large amounts of data become available for their analysis. Surgical tool tracking is a multi-object tracking (MOT) problem that entails modeling the trajectories of all surgical tools throughout a surgical video sequence; it is needed to model and analyze tool-tissue interactions. The predominant MOT approach has been tracking-by-detection ipcai:richa2011visual ; iccv:singh2017hide ; lstm_da:milan2017online , which integrates a detection model, a localization model, and a tracking algorithm. In this approach, object detectors such as cvpr:he2016deep are used to predict the presence or absence of the objects of interest. The bounding box coordinates of the detected objects are then extracted using a localization model, as in miccai:vardazaryan2018weakly ; wacv:jin2018tool . In most cases, the localization model is regressed over bounding box annotations in a fully supervised manner. This is usually concluded by a one-to-one assignment of the detected objects to object trajectories using a data association algorithm. Bipartite graph matching hungarian:kuhn1955hungarian has been widely used in this regard, and most works in the medical computer vision community view this matching as a linear assignment problem learnable by stochastic optimization ipcai:richa2011visual ; miccai:sznitman2014fast ; miccai:sznitman2011unified ; cvpr:mishra2017learning . Meanwhile, recent works have also shown that the long short-term memory (LSTM) model has the capability to learn a data association task lstm_da:milan2017online , making it easier to build a unified deep learning tracking model.
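To make the one-to-one assignment step concrete, the bipartite matching between existing tracks and new detections can be sketched as below. Since only a handful of tools appear in a laparoscopic frame, exhaustive search over permutations is sufficient here; the function name and cost values are illustrative, not taken from the paper, and larger problems would use the Hungarian algorithm instead.

```python
from itertools import permutations

def assign_detections_to_tracks(cost):
    """Optimal one-to-one assignment for small problems by exhaustive search.

    cost[i][j] is the association cost between track i and detection j
    (e.g. 1 - IoU). Returns a list of (track, detection) index pairs.
    Assumes there are at least as many detections as tracks.
    """
    n_tracks, n_dets = len(cost), len(cost[0])
    best, best_perm = None, None
    for perm in permutations(range(n_dets), n_tracks):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if best is None or total < best:
            best, best_perm = total, perm
    return list(enumerate(best_perm))

# Example: two tracks, three candidate detections.
cost = [[0.9, 0.1, 0.8],
        [0.2, 0.7, 0.6]]
print(assign_detections_to_tracks(cost))  # [(0, 1), (1, 0)]
```

Track 0 is matched to detection 1 and track 1 to detection 0, the pairing with the lowest total cost (0.3).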

Surgical tool tracking in endoscopic videos is not an easy task. In particular, laparoscopic data presents several challenges, such as the presence of blood stains on the tools, smoke from electric coagulation and cutting, motion blur on fast-moving tools, and the removal and re-insertion of the endoscope during the procedure. Furthermore, most endoscopic datasets are not fully exploited by deep learning methods because only a small fraction of each dataset can be spatially annotated with localization information. The implication is that many interesting tasks can only be explored and tested on a very small fraction of the data. Creating spatial annotations such as region boundaries and pixel-wise masks is indeed tedious and time-consuming. Since generating binary annotations that merely indicate the presence of the tools requires far less effort, exploiting this information for tracking becomes an interesting research question.

Previous tool tracking work in the medical computer vision community relies on spatially annotated data ipcai:richa2011visual ; miccai:sznitman2011unified . In this paper, we propose a new deep learning object tracking method that circumvents the lack of spatially annotated surgical data by using weak supervision on binary presence labels. Weak supervision is here motivated by the observation that when a convolutional neural network (CNN) is trained in a fully convolutional manner for a classification task, some of the convolution layers before the dense layer learn a general notion of the detected object. The activations in these inner layers can therefore be exploited for tasks other than the ones they were originally trained for. Based on this observation, weak supervision has been employed for cancerous region detection miccai:hwang2016self ; ieee:jia2017constrained , surgical tool center localization miccai:vardazaryan2018weakly , and object instance segmentation arXiv:zhou2018weakly .

Following the same trend, we propose a weakly-supervised approach for surgical tool tracking. First, we train a surgical tool detector on image-level labels. From a class peak response, we learn the whole region boundaries of the surgical tools in the laparoscopic videos. Then, we employ a Convolutional LSTM (ConvLSTM) to learn the spatio-temporal coherence across the surgical video frames. Without any spatial appearance or motion cues, the ConvLSTM is naturally able to learn the tools' spatio-temporal positions for tracking. To the best of our knowledge, this is the first study that builds a complete deep learning tracking model for endoscopic surgery using weak supervision, and also the first that evaluates surgical tool tracking performance on MOT metrics. Finally, we evaluate our approach on the largest public endoscopic video dataset to date, Cholec80, which is fully annotated with binary presence information for 7 tools and of which 5 videos have been annotated with bounding box information for testing.

The remainder of this paper presents a review of related literature (Sect. 2), our proposed methods (Sect. 3) and implementation details (Sect. 4), followed by a comparative discussion of our results (Sect. 5) and a conclusion (Sect. 6).

2 Related Work

In the past, while many works have focused on surgical tool detection tmi:twinanda2017endonet ; ipcai:richa2011visual ; miccai:sznitman2014fast ; miccai:vardazaryan2018weakly ; miccai:sznitman2011unified ; mai:al2018monitoring ; wacv:jin2018tool , fewer have explored their localization miccai:vardazaryan2018weakly ; wacv:jin2018tool ; mai:rieke2016real and tracking ipcai:richa2011visual ; miccai:sznitman2011unified ; miccai:sznitman2012data from video data alone. This may be because most localization tasks, and tracking by extension, have traditionally been approached with fully supervised methods that require spatially annotated datasets mai:bouget2017vision .

Surgical Tools Detection, Localization and Tracking:

We review some of the endoscopic tool detection, localization and tracking approaches from the literature, which are mostly concentrated in retinal microsurgery ipcai:richa2011visual ; miccai:sznitman2014fast ; miccai:sznitman2011unified ; mai:rieke2016real ; miccai:sznitman2012data and laparoscopic surgery miccai:sznitman2014fast ; miccai:vardazaryan2018weakly ; wacv:jin2018tool . In most cases, the localization and/or tracking models rely on a fully supervised object detector miccai:sznitman2012data ; lstm_da:milan2017online ; mai:rieke2016real ; miccai:sznitman2014fast . Sometimes, a unified detector-tracker framework is used miccai:sznitman2011unified . Whereas some tracking models use an optical flow tracker miccai:lo2003episode , others have cast tracking as an energy minimization problem using a gradient-based tracker ipcai:richa2011visual ; mai:rieke2016real ; miccai:sznitman2012data , density estimation, or an image similarity measure based on weighted mutual information ipcai:richa2011visual . From another perspective, the works in miccai:sznitman2014fast ; mai:rieke2016real model the articulated parts of the tools and estimate the instruments' locations either by a non-maximum suppression technique miccai:sznitman2014fast or by template tracking mai:rieke2016real . In wacv:jin2018tool , a fully supervised region-based convolutional network is employed to detect and localize surgical tools in laparoscopic videos. While that model can detect tool presence and localize beyond the tool tips, it requires bounding box annotations for training and does not take temporal consistency into account. Its experiments are carried out on selected images from surgical videos in the m2cai-tool-locations dataset. In all the above-reviewed literature, the object detection and localization models, and, by extension, the trackers, are fully supervised on a spatially annotated dataset for position estimation.

Weak Supervision:

Considering the difficulty of annotating datasets spatially, miccai:vardazaryan2018weakly localized surgical tools across whole laparoscopic video sequences using a weakly-supervised Fully Convolutional Network (FCN). The localization is limited to the center pixels of the tools. Other interesting applications of weak supervision in medical imaging include the segmentation of cancerous regions in histopathological images ieee:jia2017constrained and the detection of regions of interest (ROI) in chest X-rays and mammograms miccai:hwang2016self . These weakly-supervised approaches do not exploit the temporal coherence of a video sequence and do not perform tracking.

Temporal Coherence:

An effort to utilize the temporal interconnection of video frames in deep learning approaches is presented in cvpr:luo2018fast , where 3D object detection and motion forecasting are integrated to track moving objects. The core idea of this approach is the modeling of temporal coherence using early and late fusion in CNNs. However, the decisions on the birth/death of an object track are hard-coded by the aggregation of past, current and future predictions. A unified approach for processing temporal streams of images is also presented in arXiv:liu2017mobile , where ConvLSTMs are injected between convolution layers to refine the feature maps and propagate frame-level information across time. While these models exploit the temporal coherence of a video sequence, the approaches are all fully supervised. Temporal coherence has also been used to improve binary tool presence detection by adding an LSTM to the output of ResNet-50 cvpr:mishra2017learning and to the output of ensembled CNN architectures mai:al2018monitoring . These approaches are, however, not designed for localization and tracking.

Building upon miccai:vardazaryan2018weakly , we implement a weakly-supervised approach to train a ConvLSTM for surgical tool tracking. The model is trained on image-level binary labels only. As in wacv:jin2018tool , our model localizes the whole region boundaries of the tools beyond their center points. We leverage the ConvLSTM's spatio-temporal ability to learn the surgical tool trajectories across the frames without requiring more than the image-level class labels. The ConvLSTM not only improves presence detection, but also refines and propagates the localization feature maps across time and helps to track over occlusions. Its internal gating mechanism enables it to naturally handle the birth, propagation, and death of tool tracks.

3 Methodology

3.1 Architecture

Our models are built on the ResNet-18 architecture cvpr:he2016deep , a widely used backbone known for its strong performance on visual recognition tasks. We present below the architectures used in this work.

3.1.1 FCN Baseline

Detector: To build a tracker for surgical tools, we first reproduce the FCN model in miccai:vardazaryan2018weakly (illustrated in Fig. 1) with similar accuracy on surgical tool presence detection and spatial localization. The FCN baseline consists of a modified ResNet-18 network followed by a final convolution layer: a 1x1, 7-channel convolution that produces the localization heat map (Lh-map) and replaces the fully connected layer of ResNet-18. The strides of the last two blocks of ResNet-18 are adjusted from 2 to 1 pixel to obtain an Lh-map with higher resolution. The FCN model is trained only on tool presence binary labels. Taking an RGB input image, the ResNet-18 backbone extracts spatial feature maps and the final convolution layer convolves these maps into a 7-channel Lh-map. Each channel is by design constrained to learn and localize a distinct tool type out of the 7 tools present in the considered laparoscopic procedure. With wildcat spatial pooling cvpr:durand2017wildcat , we transform the Lh-map into a vector of class-wise confidence values indicating the probability of each tool being present or absent. The positive classes are selected by a threshold of 0.5.
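The spatial pooling step can be sketched as follows. This is a simplified reading of WILDCAT-style pooling, averaging the top activations (positive evidence) and adding a weighted average of the bottom activations (negative evidence); the kmax, kmin and alpha values are illustrative, not the paper's settings.

```python
def wildcat_pool(lh_map, kmax=1, kmin=1, alpha=0.7):
    """Collapse one channel of the Lh-map into a single class score.

    Averages the kmax largest activations and adds alpha times the
    average of the kmin smallest ones, in the spirit of WILDCAT pooling.
    lh_map is a 2-D list of activations for one tool class.
    """
    flat = sorted(v for row in lh_map for v in row)
    top = sum(flat[-kmax:]) / kmax        # strongest positive evidence
    bottom = sum(flat[:kmin]) / kmin      # strongest negative evidence
    return top + alpha * bottom

# A map with one strong peak yields a high class score for that tool.
heat = [[0.0, 0.1, 0.0],
        [0.1, 2.5, 0.2],
        [0.0, 0.1, 0.0]]
print(wildcat_pool(heat))  # 2.5 + 0.7 * 0.0 = 2.5
```

The resulting score would then be passed through a sigmoid and thresholded at 0.5 to decide presence, as described above.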

Apart from the single-channel map (single-map) model discussed above, a multiple-channel map (multi-map cvpr:durand2017wildcat ) variant is built by using a convolution layer with several channels per class, followed by an average pooling over each consecutive group of channels to give the final 7 channels. We retain the same number of maps per class as used in miccai:vardazaryan2018weakly . Both variants are trained on the normal images and on patch-masked images. During patch masking iccv:singh2017hide , random patches are created on the original images and their pixel values are replaced with the mean pixel value of the entire training dataset. According to iccv:singh2017hide , this forces the network to learn the finer details of the object of interest. We thus obtain a total of four variants of the FCN model by pairing the single- and multi-map models with masked and unmasked inputs.

Figure 1: Architecture of the FCN baseline model.

We leverage the separation of the tool types in the 7-channel Lh-map from the FCN detector to build a baseline model for tool tracking. For localization, the raw Lh-map is resized to the original input image size by bilinear interpolation. Then, with a disc structuring element of size 12, we perform a morphological closing on the resized map to fill small holes. On each channel of the Lh-map, a segmentation mask is extracted from the connected component around the pixel with maximum value using Otsu automatic thresholding smc:otsu1979threshold . A bounding box is then drawn over the mask to extract the tool location coordinates.
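The thresholding step can be sketched with a minimal Otsu implementation, which picks the cut maximizing the between-class variance of a bimodal intensity histogram. This is a simplified single-channel sketch (no resizing, closing, or connected-component step), and the bin count is an arbitrary choice.

```python
def otsu_threshold(values, bins=256):
    """Otsu's method: pick the threshold maximizing between-class variance.

    values are intensities in [0, 1); returns a cutoff such that pixels
    strictly above it belong to the foreground blob (here, the tool).
    """
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_b = sum_b = 0
    for t in range(bins):
        w_b += hist[t]                      # background weight
        if w_b == 0:
            continue
        w_f = total - w_b                   # foreground weight
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b                   # background mean
        m_f = (total_sum - sum_b) / w_f     # foreground mean
        var_between = w_b * w_f * (m_b - m_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return (best_t + 1) / bins

# Bimodal toy data: dark background pixels plus a bright tool blob.
pixels = [0.05] * 90 + [0.9] * 10
t = otsu_threshold(pixels)
print(0.05 < t < 0.9)  # True: the threshold separates the two modes
```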

For tracking, the Intersection over Union (IoU) of the bounding boxes between the current frame and the previous frame is computed for each detected tool. Tools detected at time t are included in the previous trajectories if their IoU with the detections at time t-1 is at least the chosen threshold. In the case of multiple instances of the same tool, the instance closest to the detections at t-1 is selected. Unmatched detections are discarded as false detections, while tracks left without a matching detection are terminated as dead tracks.
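The IoU-based association rule above can be sketched as follows; the threshold value tau = 0.5 is an assumed placeholder, since the exact value is not stated here.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter else 0.0

def update_track(prev_box, candidates, tau=0.5):
    """One baseline association step: keep the candidate with the highest
    IoU against the previous detection, provided it reaches threshold tau.
    Returns None when the track dies (no candidate overlaps enough)."""
    best = max(candidates, key=lambda c: iou(prev_box, c), default=None)
    if best is None or iou(prev_box, best) < tau:
        return None
    return best

prev = (10, 10, 50, 50)
dets = [(100, 100, 140, 140), (12, 11, 52, 49)]
print(update_track(prev, dets))  # (12, 11, 52, 49)
```

The second detection overlaps the previous box with IoU ≈ 0.86 and is kept; the first has zero overlap and would be discarded as a false detection.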

3.1.2 ConvLSTM Tracker

The aforementioned FCN baseline tracker is trained on images and does not utilize the temporal cues of video data. This can become a problem when a tool's motion grows irregular beyond what an IoU threshold with the previous frame can capture, since the tracking algorithm is hard-coded. Knowing that object motion is encoded in temporal information cvpr:luo2018fast ; arXiv:liu2017mobile , we propose to integrate a temporal model into the previous FCN framework, in a manner that still allows for weakly-supervised training. This results in an elegant end-to-end tracking method that can model the spatio-temporal motion of the tools and also adapt to the various types of motion appearing in a video.

As temporal model, we propose to use a recurrent neural network (RNN), with the aim of determining the current position of each tool from the input feature map along with information from prior images captured in the RNN's state. In designing this architecture, it is necessary to ensure that the overall network can still retain spatio-temporal information for each tool when trained in a weakly-supervised manner on binary presence data, namely that the localization information per tool is not lost but remains the key information used for predicting the binary presence.

Using a fully convolutional architecture is key in this regard. We therefore employ a ConvLSTM unit for its ability to learn the spatio-temporal dependencies of the localization heat maps. The ConvLSTM achieves this by using a convolution kernel whose receptive field considers temporal information. Compared to stacking a regular LSTM, the spatial relationships are maintained. And unlike a simple convolution layer, the ConvLSTM takes into account the features from the previous frames, thereby enforcing consistency across time. At the level of the ConvLSTM, the localization heat maps for each tool remain independent: in this final part of the network, information is indeed not shared across maps, so as to retain the spatial information for each tool. Our ConvLSTM tracker is constructed by adding a ConvLSTM unit to the FCN baseline detector, as illustrated in Figure 2. We have explored several variants of the architecture, described further below. By naturally smoothing out the class peak activations using temporal information, the ConvLSTM replaces the IoU-based selection of the baseline tracker and naturally handles the birth and death of tracks for each tool.
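For reference, the ConvLSTM cell of Shi et al. replaces the matrix products of a regular LSTM with convolutions, which is what preserves the spatial layout of the heat maps. A standard formulation of the gates is given below, where $*$ denotes convolution and $\circ$ the Hadamard product; peephole connections, present in some formulations, are omitted in this sketch.

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\!\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\!\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\!\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
```

The forget gate $f_t$ and input gate $i_t$ are what allow the cell to terminate and initiate tool tracks, respectively, as discussed above.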

In practice, we construct the ConvLSTM trackers on top of the baseline model with the best performance across the 3 tasks (as shown in Tables 1-3). We select the single-map architecture, as the multi-map architecture is more complex and shows no better performance either in miccai:vardazaryan2018weakly or in our baseline spatial experiments (as shown in Tables 2 and 3). Like ResNet-18, which contains skip connections between its layers, we include skip connections around the final convolution and ConvLSTM layers for their efficiency in training large networks cvpr:he2016deep ; iwwwcsw:wei2018residual .

To perform weakly supervised training on the image-level labels, we transform the Lh-maps (see Figure 2) into class-wise probabilities using wildcat pooling cvpr:durand2017wildcat . We then learn a weighted cross-entropy loss function for multi-label classification:

L(y, ŷ) = -(1/C) Σ_{c=1..C} [ w_c⁺ y_c log(σ(ŷ_c)) + w_c⁻ (1 - y_c) log(1 - σ(ŷ_c)) ]     (1)

where y_c and ŷ_c are respectively the ground-truth and predicted tool presence for class c, σ is the sigmoid function, and w_c⁺, w_c⁻ are the weights for class c. The effect of the class weights in this loss function is that w_c⁺ decreases false negatives (FN) while w_c⁻ decreases false positives (FP). With this, we counteract the polarizing effect of class imbalance by reducing FN for less frequent tools and reducing FP for dominant tools. The positive class weight is calculated as in Equation 2, where f_m is the median frequency of all tools in the train set and f_c is the frequency of the tools in class c:

w_c⁺ = f_m / f_c     (2)
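A minimal sketch of median-frequency class weighting and the weighted loss follows. Applying the weight to the positive term only (as in TensorFlow's weighted cross-entropy with logits) is our reading of the scheme, not a verbatim reproduction of the paper's implementation, and the frequency values are illustrative.

```python
import math

def class_weights(frequencies):
    """Median-frequency balancing: w_c = median frequency / frequency of c.
    Rare tools get weights above 1, dominant tools below 1."""
    freqs = sorted(frequencies)
    n = len(freqs)
    median = (freqs[n // 2] if n % 2 else
              0.5 * (freqs[n // 2 - 1] + freqs[n // 2]))
    return [median / f for f in frequencies]

def weighted_bce(logits, labels, weights):
    """Weighted cross-entropy with logits, averaged over the C classes.
    The weight scales the positive (presence) term, so under-detected
    rare tools are penalized more, reducing false negatives for them."""
    total = 0.0
    for x, y, w in zip(logits, labels, weights):
        p = 1.0 / (1.0 + math.exp(-x))  # sigmoid
        total += -(w * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

w = class_weights([100, 400, 50])  # median frequency = 100
print(w)  # [1.0, 0.25, 2.0]
```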


We propose three different configurations sharing a similar architecture, which differ in where the ConvLSTM unit is placed with respect to the final convolution layer of the baseline FCN: after it, before it, or replacing it.

Figure 2: The ConvLSTM tracker architecture, in the configuration with the ConvLSTM placed after the final convolution layer.

ConvLSTM after the final convolution layer: In this configuration, illustrated in Figure 2, the ConvLSTM receives spatial input features from the final convolution layer, refines them with temporal information, and outputs spatio-temporal Lh-maps. The motivation for adding the ConvLSTM unit immediately after the baseline FCN is to refine the spatial Lh-maps with spatio-temporal information. This helps to smooth the class peak activations as well as the shape and size of the tools' segmentation masks. It is important to note that the localization process is performed on the spatio-temporal Lh-maps.


ConvLSTM before the final convolution layer: With the ConvLSTM unit added before the last convolution layer of the baseline FCN, it refines the spatial features with spatio-temporal information before localization by the convolution layer. This guides the model in choosing relevant features based on temporal information across the video frames. By doing so, the receptive fields of the final convolution layer become aware of the temporal information. It is also important to note that the localization is performed by the convolution layer, which receives a spatio-temporal feature map and outputs a spatial Lh-map. This model is expected to be more robust to occlusion and noise.


ConvLSTM replacing the final convolution layer: The last variant replaces the final convolution layer of the FCN baseline detector with a ConvLSTM layer. Owing to its internal convolution process, the ConvLSTM layer takes over the task of localization from the convolution layer as well as the refinement of the feature maps with temporal information. This results in a less complex architecture, with the localization process performed by the ConvLSTM layer, which produces spatio-temporal Lh-maps.

4 Experimental Setup

4.1 Dataset

The dataset used in this experiment is Cholec80 tmi:twinanda2017endonet . It consists of 80 videos of cholecystectomy surgeries, in which the gallbladder is removed laparoscopically under endoscope monitoring. The videos are recorded at 25 fps and downsampled to 1 fps, the rate at which the tool presence binary annotations are generated. While most of the videos are recorded at one resolution, a few use a lower resolution with the same aspect ratio; for uniformity, all frames are resized to a common size in our experiment. For the tool detection task, the dataset is split into 40, 10 and 30 videos for training, validation, and testing respectively. For localization and tracking evaluation, we use 5 videos from the test set annotated with tool centers and bounding boxes around the tool tips. The tool shafts are excluded, following common practice.

4.2 Training

All the models presented in this paper are trained by transfer learning. The FCN baseline models are trained for 160 epochs with stepwise decaying learning rates, using different initial values for the ResNet-18 backbone and for the final convolution layer. We use the different learning rates to strike a learning balance between the backbone, which has been pretrained on ImageNet, and the final convolution layer, which is trained from scratch.

The ConvLSTM and the baseline models share the same backbone feature extractor, which converges after 160 epochs. For spatio-temporal refinement, the ConvLSTM models are then trained for up to 120 epochs with an initial learning rate that decays exponentially. During this period, the backbone is frozen for a fair comparison with the baseline. The training input images are masked by patches selected randomly with a fixed probability. This patch masking, together with rotation and horizontal flipping of the images, constitutes the three data augmentation methods employed in training the model. When finetuning the ConvLSTM layer, the data augmentation is limited to image patch masking to reduce the training time, since the video dataset already contains a lot of variability in the images.
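The patch-masking augmentation can be sketched as below, in the style of Hide-and-Seek iccv:singh2017hide : the image is split into a grid and each cell is hidden with some probability by replacing it with the dataset mean pixel. The grid size, hiding probability and mean value here are illustrative, not the paper's settings.

```python
import random

def hide_patches(image, grid=4, p_hide=0.5, mean_pixel=0.4, rng=None):
    """Replace random grid patches of a 2-D image (list of float rows)
    with the dataset mean pixel value, with probability p_hide each."""
    rng = rng or random.Random(0)   # seeded for reproducibility
    h, w = len(image), len(image[0])
    ph, pw = h // grid, w // grid
    out = [row[:] for row in image]  # leave the input untouched
    for gy in range(grid):
        for gx in range(grid):
            if rng.random() < p_hide:
                for y in range(gy * ph, (gy + 1) * ph):
                    for x in range(gx * pw, (gx + 1) * pw):
                        out[y][x] = mean_pixel
    return out

img = [[1.0] * 8 for _ in range(8)]
masked = hide_patches(img)
# Every pixel is either original (1.0) or masked to the mean (0.4).
print(all(v in (1.0, 0.4) for row in masked for v in row))  # True
```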

All the models are trained for multi-label classification. The optimized loss function is the weighted cross-entropy with logits presented in Equation 1. An L2-norm regularizer, with separate weight decay constants for the baseline FCN and for the ConvLSTM models, is applied to the optimization. The models are trained with the momentum optimizer and truncated back-propagation. Owing to our GPU memory constraints and the large input dimension, the network is trained with a maximum batch size of 16 and the ConvLSTM models are unrolled for 16 timesteps. We also propagate the ConvLSTM states between batches: to maintain continuity within a video, we initialize the ConvLSTM input states of every batch with the output states of the immediately preceding batch. State propagation is performed during testing as well. Our network is implemented in Tensorflow, using TFRecords to build the dataset input pipeline, and trained on GeForce GTX 1080 Ti GPUs.

5 Results and Discussion

5.1 Presence Detection Results

To quantify the tool presence detection results, we use average precision (AP), defined as the area under the precision-recall curve. Comparing the AP of our model with the baseline (as presented in Table 1) shows that temporal information helps improve tool presence detection by over 5.0% on mean AP. The performance improvement can also be seen across the tools. This suggests that the temporal information helps the detection of tools under occlusion and noise.
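The AP metric can be computed with the standard ranked-retrieval estimator, summing precision at each correctly detected positive; this is a generic sketch of the metric, not the paper's evaluation code.

```python
def average_precision(scores, labels):
    """AP as the area under the precision-recall curve, estimated by
    averaging precision-at-k over the positions of the positives
    when predictions are sorted by decreasing confidence."""
    ranked = sorted(zip(scores, labels), reverse=True)
    num_pos = sum(labels)
    hits, ap = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            ap += hits / k
    return ap / num_pos

# A perfect ranking puts both positive frames first: AP = 1.0.
print(average_precision([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```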








Grasper | Bipolar | Hook | Scissors | Clipper | Irrigator | Specimen Bag | Mean

FCN baseline variants:
96.7 | 91.9 | 99.4 | 50.6 | 80.3 | 85.2 | 88.3 | 84.6
99.8 | 92.6 | 99.8 | 85.1 | 96.9 | 60.9 | 78.6 | 87.7
95.9 | 89.4 | 99.5 | 69.3 | 85.4 | 89.5 | 87.1 | 87.9
99.6 | 90.9 | 99.8 | 48.5 | 88.5 | 66.2 | 91.0 | 83.6

ConvLSTM tracker variants:
99.7 | 95.6 | 99.8 | 86.9 | 97.5 | 74.7 | 96.1 | 92.9
99.8 | 95.6 | 99.9 | 76.1 | 97.1 | 77.4 | 93.9 | 91.4
99.5 | 93.8 | 99.9 | 90.3 | 97.5 | 65.1 | 74.0 | 88.5

Table 1: Tool presence detection average precision (AP) per tool and on average for the evaluated models.

5.2 Spatial Localization Results

To quantify the network's ability to localize the distinct tools in the frames, we compute the bounding box IoU between each detected tool and the ground truth. This performance measure does not take into account the temporal consistency of the tools across the frames. A localization is considered correct if and only if its IoU reaches the evaluation threshold. Note that this is stricter than the center-in-bounding-box localization metric in miccai:vardazaryan2018weakly , which does not take the IoU into consideration. The localization results, compared with our baseline models, are presented in Table 2.








Grasper | Bipolar | Hook | Scissors | Clipper | Irrigator | Specimen Bag | Mean

FCN baseline variants:
05.9 | 20.5 | 34.7 | 03.5 | 06.4 | 55.1 | 44.4 | 24.3
15.5 | 10.1 | 27.8 | 20.0 | 13.3 | 53.7 | 06.4 | 21.0
05.0 | 11.5 | 15.5 | 25.1 | 8.7 | 42.5 | 14.8 | 17.6
08.7 | 0.01 | 25.6 | 20.0 | 20.0 | 49.0 | 02.2 | 17.9

ConvLSTM tracker variants:
33.8 | 20.8 | 41.9 | 21.1 | 12.6 | 52.1 | 23.8 | 29.3
54.5 | 14.6 | 50.0 | 23.2 | 11.8 | 53.6 | 60.1 | 38.2
42.5 | 08.0 | 44.4 | 25.3 | 14.0 | 53.5 | 41.7 | 32.8

Table 2: Localization accuracy of tools detected at the evaluation IoU threshold for the evaluated models.

From these results, our model improves the spatial localization of five out of the seven surgical tools: grasper, bipolar, hook, scissors and specimen bag. For the irrigator and the clipper, where the ConvLSTM models do not reach the best performance, the results remain comparable. Overall, the ConvLSTM performs well on this metric, improving the mean accuracy by 13.9%, which illustrates the benefit of using temporal information during training. Moreover, all the ConvLSTM models outperform all the baseline models on mean spatial localization accuracy. This shows that temporal modeling can help in understanding the full spatial boundaries of moving objects.

5.3 Motion Tracking Results

For the tracking performance evaluation, we adopt the widely used CLEAR MOT metrics mot:bernardin2008evaluating : multiple object tracking precision (MOTP) and multiple object tracking accuracy (MOTA). MOTP, a measure of the localization precision, gives the average overlap between all correctly matched hypotheses and their corresponding targets for a given IoU threshold τ:

MOTP = (Σ_{t,i} d_{t,i}) / (Σ_t c_t)     (3)

where d_{t,i} is the bounding box IoU of tracked target i with its ground truth and c_t is the number of matches in frame t. The value typically ranges in [τ%, 100%]. MOTA, on the other hand, reflects the tracker's ability to keep consistent trajectories. It evaluates the effectiveness of the tracker from three error sources, namely FP, FN and identity switches (IDSW), with respect to the total number of ground-truth objects (GT), as in Equation 4:

MOTA = 1 - (Σ_t (FP_t + FN_t + IDSW_t)) / (Σ_t GT_t)     (4)

This score, which usually ranges in (-∞, 100], can be negative when the number of errors made by the tracker exceeds the number of objects in the scene. Refer to mot:bernardin2008evaluating for more details on the MOT metrics.
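Both metrics aggregate simple per-frame counts, as the following sketch of the standard CLEAR MOT definitions shows (the frame dictionaries and their values are illustrative):

```python
def clear_mot(frames):
    """Compute MOTP and MOTA (as percentages) from per-frame statistics.

    Each frame is a dict with: 'ious' (IoUs of matched target-hypothesis
    pairs), 'fp', 'fn', 'idsw' (identity switches) and 'gt' (number of
    ground-truth objects), following the standard CLEAR MOT definitions.
    """
    total_iou = sum(sum(f['ious']) for f in frames)
    matches = sum(len(f['ious']) for f in frames)
    errors = sum(f['fp'] + f['fn'] + f['idsw'] for f in frames)
    gt = sum(f['gt'] for f in frames)
    motp = 100.0 * total_iou / matches if matches else 0.0
    mota = 100.0 * (1.0 - errors / gt)
    return motp, mota

frames = [
    {'ious': [0.8, 0.6], 'fp': 0, 'fn': 0, 'idsw': 0, 'gt': 2},
    {'ious': [0.7],      'fp': 1, 'fn': 1, 'idsw': 0, 'gt': 2},
]
motp, mota = clear_mot(frames)
print(round(motp, 1), round(mota, 1))  # 70.0 50.0
```

Note that with 2 errors against 4 ground-truth objects, MOTA drops to 50%; a tracker making more errors than there are objects would go negative, as stated above.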



MOTP | MOTA | MOTP | MOTA | MOTP | MOTA | mean MOTP | mean MOTA
(IoU threshold increasing from left to right)

FCN baseline variants:
58.1 | 29.8 | 66.6 | 19.3 | 77.3 | 05.3 | 67.3 | 18.1
49.9 | 47.9 | 61.2 | 21.2 | 75.3 | 02.7 | 62.1 | 23.9
46.6 | 29.6 | 60.4 | 09.6 | 75.4 | -00.3 | 60.8 | 13.1
48.3 | 40.4 | 61.0 | 15.3 | 75.8 | 01.9 | 61.7 | 19.2

ConvLSTM tracker variants:
58.0 | 46.4 | 65.9 | 29.4 | 77.4 | 03.2 | 67.1 | 26.3
59.0 | 59.6 | 65.9 | 41.0 | 77.3 | 09.0 | 67.4 | 36.5
54.4 | 47.7 | 63.3 | 26.1 | 76.7 | 00.3 | 64.8 | 24.7

Table 3: Tracking performance (MOTP/MOTA) of the evaluated models at increasing IoU thresholds.

The tracking results across varying IoU thresholds, in comparison with our baseline models, are presented in Table 3. Our approach improves the baseline performance significantly: with comparable MOTP, the ConvLSTM improves the baseline MOTA at every evaluated IoU threshold, including the strictest one. Overall, the ConvLSTM demonstrates its ability to learn smoother trajectories by significantly outperforming all the baselines in both mean MOTP and mean MOTA.

5.4 Qualitative Results

The qualitative results in Figure 3 show visually how the ConvLSTM is able to leverage the temporal coherence for tracking and localization of the 7 tools. From the positioning of the bounding boxes around the tools, it can be seen that the ConvLSTM model learns the region boundaries better than the baseline. The Lh-maps show that the ConvLSTM helps to smooth the localization and approximates the shape and size of the tools in each image. The overlay shows that it satisfactorily learns a trajectory close to the ground truth. A supplementary video that further demonstrates the qualitative performance of our approach can be found here: https://youtu.be/vnMwlS5tvHE. Our experiments also show that the ConvLSTM model trained on videos at 1 fps can generalize to unlabelled videos at 25 fps, making it unconstrained by the frame rate, as can be seen here: https://youtu.be/SNhd1yzOe50.

Figure 3: Qualitative results showing the localization and tracking performance of the baseline and ConvLSTM models for the 7 tools. For each tool, we present a comparison of the detected bounding box (cyan in colour) with the ground truth (dotted yellow box), the Lh-map, and the overlay of the segmented mask with the original image (best seen in colour).

5.5 Discussion

The evaluation presented in this paper shows the positive contribution of the ConvLSTM in modeling temporal data during weakly-supervised training for surgical tool tracking in laparoscopic videos. The most notable improvement is seen in the variant with the ConvLSTM placed before the final convolution layer, which obtains the best results both in localization and in tracking. We believe this is because, in this configuration, the ConvLSTM refines the feature maps from the backbone with temporal considerations before they are localized separately by the final convolution layer; the temporal information across the video frames thus guides the model in choosing the relevant features for the Lh-maps. This is more robust than in the other two variants, where the temporal refinement at the end of the pipeline may dilute the localization information and output a map with a slightly different semantic.

In the qualitative results, we observe failure cases in several situations. First, due to the nature of the model, tools are missed when multiple instances of the same class are present. It would be interesting to see whether the low activations in the Lh-maps could be exploited to estimate the number of instances per class. The qualitative results also show that the models fail to detect a tool when only a small portion of its tip is visible. We also observe that our models localize only the tool's tip, not its shaft, likely because shafts are similar across tools and cannot easily be captured by a weakly-supervised approach relying on binary presence labels.

From the qualitative results, we however notice that the Lh-maps produce a weak segmentation of the tool tips, suggesting that this approach could be extended to segmentation.

6 Conclusion

This paper addresses the tracking of tools in laparoscopic surgical videos without using any spatial annotation during training. A weakly supervised Convolutional LSTM approach that relies solely on binary tool presence information is proposed. First, we build a baseline tracker by performing a one-to-one data association on the localization results generated by the FCN proposed in Vardazaryan et al. (5). Then, we propose a fully convolutional spatio-temporal model for end-to-end tracking that is suitable for weakly supervised training. It relies on a ConvLSTM that leverages the temporal information present in the video to smooth the class peak activations, thereby better detecting the presence of tools, improving their spatial localization, and smoothing their trajectories over time. The approach is evaluated on the Cholec80 dataset and yields overall improvements in MOTA, localization mean accuracy, and tool presence detection mAP over the baseline. The quantitative and qualitative results also suggest that the proposed approach could be integrated into surgical video labeling software to initialize tool annotations such as bounding boxes and segmentation masks.
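The one-to-one data association underlying the baseline tracker can be illustrated with a minimal sketch: match detected tool centroids of consecutive frames so that the total Euclidean distance is minimized. For the handful of tools visible per laparoscopic frame, a brute-force search over permutations suffices; the Hungarian algorithm (Kuhn, 1955) computes the same optimal matching in polynomial time for larger sets. The function name `associate` and the equal-count assumption are illustrative simplifications, not the paper's exact procedure.

```python
from itertools import permutations
from math import hypot, inf

def associate(prev_centroids, curr_centroids):
    """One-to-one data association between tool centroids of two
    consecutive frames, minimizing the total Euclidean distance.

    Assumes the same number of detections in both frames (a
    simplification; missed/new detections need extra handling).
    Returns a list of (prev_index, curr_index) pairs.
    """
    assert len(prev_centroids) == len(curr_centroids)
    n = len(prev_centroids)
    best_cost, best_pairs = inf, []
    for perm in permutations(range(n)):     # brute force: fine for few tools
        pairs = list(zip(range(n), perm))
        cost = sum(hypot(prev_centroids[i][0] - curr_centroids[j][0],
                         prev_centroids[i][1] - curr_centroids[j][1])
                   for i, j in pairs)
        if cost < best_cost:
            best_cost, best_pairs = cost, pairs
    return best_pairs
```

Chaining these per-frame assignments over a video yields the identity-preserving trajectories that the MOT metrics (MOTA) evaluate.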

This work was supported by French state funds managed within the Investissements d'Avenir program by BPI France (project CONDOR) and by the ANR (references ANR-11-LABX-0004 and ANR-10-IAHU-02).


  • (1) Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N.: Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging 36(1), 86–97 (2017)
  • (2) Zisimopoulos, O., Flouty, E., Luengo, I., Giataganas, P., Nehme, J., Chow, A., Stoyanov, D.: Deepphase: Surgical phase recognition in cataracts videos. In: MICCAI, pp. 265–272 (2018)
  • (3) Richa, R., Balicki, M., Meisner, E., Sznitman, R., Taylor, R., Hager, G.: Visual tracking of surgical tools for proximity detection in retinal surgery. In: IPCAI, pp. 55–66 (2011)
  • (4) Sznitman, R., Becker, C., Fua, P.: Fast part-based classification for instrument detection in minimally invasive surgery. In: MICCAI, pp. 692–699 (2014)
  • (5) Vardazaryan, A., Mutter, D., Marescaux, J., Padoy, N.: Weakly-supervised learning for tool localization in laparoscopic videos. arXiv preprint arXiv:1806.05573 (2018)
  • (6) Sznitman, R., Basu, A., Richa, R., Handa, J., Gehlbach, P., Taylor, R.H., Jedynak, B., Hager, G.D.: Unified detection and tracking in retinal microsurgery. In: MICCAI, pp. 1–8 (2011)
  • (7) Al Hajj, H., Lamard, M., Conze, P.H., Cochener, B., Quellec, G.: Monitoring tool usage in surgery videos using boosted convolutional and recurrent neural networks. Medical Image Analysis 47, 203–218 (2018)
  • (8) Jin, A., Yeung, S., Jopling, J., Krause, J., Azagury, D., Milstein, A., Fei-Fei, L.: Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. In: WACV, pp. 691–699 (2018)
  • (9) Singh, K.K., Lee, Y.J.: Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In: ICCV (2017)
  • (10) Milan, A., Rezatofighi, S.H., Dick, A.R., Reid, I.D., Schindler, K.: Online multi-target tracking using recurrent neural networks. In: AAAI, vol. 2, p. 4 (2017)
  • (11) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
  • (12) Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2), 83–97 (1955)
  • (13) Mishra, K., Sathish, R., Sheet, D.: Learning latent temporal connectionism of deep residual visual abstractions for identifying surgical tools in laparoscopy procedures. In: CVPR Workshops, pp. 58–65 (2017)
  • (14) Hwang, S., Kim, H.E.: Self-transfer learning for weakly supervised lesion localization. In: MICCAI, pp. 239–246 (2016)
  • (15) Jia, Z., Huang, X., Eric, I., Chang, C., Xu, Y.: Constrained deep weak supervision for histopathology image segmentation. IEEE Transactions on Medical Imaging 36(11), 2376–2388 (2017)
  • (16) Zhou, Y., Zhu, Y., Ye, Q., Qiu, Q., Jiao, J.: Weakly supervised instance segmentation using class peak response. arXiv preprint arXiv:1804.00880 (2018)
  • (17) Rieke, N., Tan, D.J., di San Filippo, C.A., Tombari, F., Alsheakhali, M., Belagiannis, V., Eslami, A., Navab, N.: Real-time localization of articulated surgical instruments in retinal microsurgery. Medical Image Analysis 34, 82–100 (2016)
  • (18) Sznitman, R., Ali, K., Richa, R., Taylor, R.H., Hager, G.D., Fua, P.: Data-driven visual tracking in retinal microsurgery. In: MICCAI, pp. 568–575 (2012)
  • (19) Bouget, D., Allan, M., Stoyanov, D., Jannin, P.: Vision-based and marker-less surgical tool detection and tracking: a review of the literature. Medical Image Analysis 35, 633–654 (2017)
  • (20) Lo, B.P., Darzi, A., Yang, G.Z.: Episode classification for the analysis of tissue/instrument interaction with multiple visual cues. In: MICCAI, pp. 230–237 (2003)
  • (21) Luo, W., Yang, B., Urtasun, R.: Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In: CVPR, pp. 3569–3577 (2018)
  • (22) Liu, M., Zhu, M.: Mobile video object detection with temporally-aware feature maps. arXiv preprint arXiv:1711.06368 (2017)
  • (23) Durand, T., Mordan, T., Thome, N., Cord, M.: Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In: CVPR, vol. 2 (2017)
  • (24) Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1), 62–66 (1979)
  • (25) Wei, H., Zhou, H., Sankaranarayanan, J., Sengupta, S., Samet, H.: Residual convolutional lstm for tweet count prediction. In: Companion of the The Web Conference 2018 on The Web Conference 2018, pp. 1309–1316. International World Wide Web Conferences Steering Committee (2018)
  • (26) Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the clear mot metrics. Journal on Image and Video Processing 2008, 1 (2008)