LapTool-Net: A Contextual Detector of Surgical Tools in Laparoscopic Videos Based on Recurrent Convolutional Neural Networks

05/22/2019 · by Babak Namazi et al. · The University of Texas at Arlington and Baylor Scott & White Health

We propose a new multilabel classifier, called LapTool-Net, to detect the presence of surgical tools in each frame of a laparoscopic video. The novelty of LapTool-Net is the exploitation of the correlations among the usage of different tools and between the tools and tasks - namely, the context of the tools' usage. Towards this goal, the patterns in the co-occurrence of the tools are utilized to design a decision policy for a multilabel classifier based on a Recurrent Convolutional Neural Network (RCNN) architecture that simultaneously extracts spatio-temporal features. In contrast to previous multilabel classification methods, the RCNN and the decision model are trained in an end-to-end manner using a multitask learning scheme. To overcome the high imbalance, and to avoid the overfitting caused by the lack of variety in the training data, a high down-sampling rate is chosen based on the more frequent combinations. Furthermore, at the post-processing step, the predictions for all the frames of a video are corrected by a bi-directional RNN designed to model the long-term order of the tasks. LapTool-Net was trained using a publicly available dataset of laparoscopic cholecystectomy videos. The results show that LapTool-Net outperforms existing methods significantly, even while using fewer training samples and a shallower architecture.






1 Introduction

Numerous advantages of minimally invasive surgery, such as shorter recovery time, less pain and blood loss, and better cosmetic results, make it the preferred choice over conventional open surgeries (Velanovich, 2000). In laparoscopy, the surgical instruments are inserted through small incisions in the abdominal wall and the procedure is monitored using a laparoscope. The special way of manipulating the surgical instruments and the indirect observation of the surgical scene introduce additional challenges in performing laparoscopic procedures (Ballantyne, 2002). The complexity of laparoscopy requires special training and assessment for surgery residents to gain the required bi-manual dexterity. Videos of procedures previously performed by expert surgeons can be used for such training and assessment. The tedium and cost of such assessment can be dramatically reduced by an automated tool detection system, which is therefore the focus of this paper.

In computer-aided intervention, the surgical tools are controlled by a surgeon with the aid of a specially designed robot (Antico et al., 2019), which requires a real-time understanding of the current task. Therefore, detecting the presence, location or pose of the surgical instruments may be useful in robotic surgeries as well (Du et al., 2016), (Allan et al., 2013), (Allan et al., 2018). Finally, the actual location and movement of the tools can be extremely useful in rating the surgeries as well as generating an operative summary.

In order to track the surgical instruments, several approaches have been introduced, which use the signals collected during the procedure (Elfring et al., 2010), (Reiter et al., 2012). For instance, in vision-based methods, the instruments can be localized using the videos captured during the operation (Wesierski and Jezierska, 2018). These methods are generally reliable and inexpensive. Traditional vision-based methods rely on extracted features such as shape, color, the histogram of oriented gradients, etc., along with a classification or regression method, to estimate the presence, location or pose of the instruments in the captured images or videos (Bouget et al., 2017). However, these methods depend on pre-defined, painstakingly extracted hand-crafted features; defining and extracting such features is itself a major part of the detection process. Thus, methods based on hand-crafted features are not well suited for real-time applications.

Recent years have witnessed great advances in deep learning techniques in various computer vision areas such as image classification, object detection and segmentation, as well as in medical imaging (Litjens et al., 2017), due to the availability of large datasets and much-improved computational power compared to the 1990s. The main advantage of deep learning methods over traditional computer vision techniques is that optimal high-level features can be directly and automatically extracted from the data. Therefore, there is a trend towards using these methods in analyzing the videos taken from laparoscopic operations (Twinanda et al., 2017a).

Compared with the other surgical video tasks, detecting the presence and usage of surgical instruments in laparoscopic videos has certain challenges that need to be considered.

Firstly, since multiple instruments might be present at the same time, detecting the presence of these tools in a video frame is a multilabel (ML) classification problem. In general, ML classification is more challenging than the well-studied multiclass (MC) problem, where every instance is related to only one output. These challenges include, but are not limited to, exploiting the correlation and co-existence of different objects/concepts with each other and with the background/context, and handling the variations in the occurrence of different objects.

Secondly, as opposed to other surgical videos, such as cataract surgery (Al Hajj et al., 2019), robot-assisted surgery (Sarikaya et al., 2017) or videos from a simulation (Zisimopoulos et al., 2017), where the camera is stationary or moving smoothly, in laparoscopic videos, the camera is constantly shaking. Due to the rapid movement and changes in the field of view of the camera, most of the images suffer from motion blur and the objects can be seen in various sizes and locations. Also, the camera view might be blocked by the smoke caused by burning tissue during cutting or cauterizing to arrest bleeding. Therefore, using still images is not sufficient for detecting the instruments.

Thirdly, surgical operations follow a specific order of tasks. Although the usage of the tools doesn’t strictly adhere to that order, it is nevertheless highly correlated with the task being performed. Using the information about the task and the relative position of the frame with regard to the entire video, the performance of the tool detection can be improved.

Lastly, since the performance of a deep classifier in a supervised learning method is highly dependent on the size and the quality of the labeled dataset, collecting and annotating a large dataset is a crucial task.

EndoNet (Twinanda et al., 2017a) was the first deep learning model designed for detecting the presence of surgical instruments in laparoscopic videos; Alexnet (Krizhevsky et al., 2012) was used as a Convolutional Neural Network (CNN) for feature extraction and was trained for the simultaneous detection of surgical phases and instruments. Inspired by this work, other researchers used different and more accurate CNN architectures with transfer learning (Sahu et al., 2016), (Prellberg and Kramer, 2018) to classify the frames based on the visual features. For example, in (Zia et al., 2016), three CNN architectures were compared, and (Wang et al., 2017) proposed an ensemble of two deep CNNs.

(Sahu et al., 2017) were the first to address the class imbalance in an ML classification of video frames. They balanced the training set according to the combinations of the instruments: the data were re-sampled to have a uniform distribution in label-set space, and class re-weighting was used to balance the data at the single-class level. Despite the improvement gained by considering the co-occurrence in balancing the training set, the correlation of the tools' usage was not considered directly in the classifier and the decision was made solely based on the presence of individual tools.

(Abdulbaki Alshirbaji et al., 2018) used class weights and re-sampling together to deal with the imbalance issue.

In order to consider the temporal features of the videos, Twinanda et al. employed a hidden Markov model (HMM) in (Twinanda et al., 2017a) and a Recurrent Neural Network (RNN) in (Twinanda et al., 2017b). Sahu et al. utilized a Gaussian distribution fitting method in (Sahu et al., 2016) and a temporal smoothing method using a moving average in (Sahu et al., 2017) to improve the classification results after the CNN was trained. (Mishra et al., 2017) were the first to apply a Long Short-Term Memory model (LSTM) (Hochreiter and Schmidhuber, 1997), as an RNN, to a short sequence of frames to simultaneously extract both spatial and temporal features for detecting the presence of the tools through end-to-end training.

Other papers invoked different approaches to address the issues in detecting the presence of surgical tools. (Hu et al., 2017) proposed an attention-guided method using two deep CNNs to extract local and global spatial features. In (Al Hajj et al., 2018), a boosting mechanism was employed to combine different CNNs and RNNs. In (Jin et al., 2018a), the tools were localized by the Faster R-CNN method (Ren et al., 2015), after the dataset was labeled with bounding boxes containing the surgical tools.

It should be noted that none of the previous methods takes advantage of any knowledge regarding the order of the tasks, and the correlations among the tools are not directly utilized in identifying different surgical instruments.

In this paper, we propose a novel system called LapTool-Net to detect the presence of surgical instruments in laparoscopic videos. The main features of the proposed model are summarized as follows:

  1. Exploiting the spatial discriminating features and the temporal correlation among them by designing a deep Recurrent Convolutional Neural Network (RCNN)

  2. Taking advantage of the relationship among the usage of different tools by considering their co-occurrences

  3. The end-to-end training of the tool detector using a multitask learning approach

  4. Considering the inherent long-term pattern of the tools’ presence via a bi-directional RNN

  5. Using only a small portion of the labeled samples, given the high correlation among video frames, to avoid overfitting

  6. Addressing the imbalance issue using re-sampling and re-weighting methods

  7. Providing state-of-the-art performance on a publicly available dataset on laparoscopic cholecystectomy

The remainder of the paper is organized as follows: the main approach of LapTool-Net is described in section 2 and is elaborated in section 3. The performance of LapTool-Net is evaluated through experiments described in section 4. Section 5 concludes the paper.

2 Approach

The uniqueness of our approach is based on the following three original ideas:

  • A novel ML classifier is proposed as a part of LapTool-Net to take advantage of the co-occurrence of different tools in each frame - in other words, the context is taken into account in the detection process. In order to accomplish this objective, each combination of tools is considered as a separate class during training and testing, and is further used in a decision model for the ML classifier. To the best of our knowledge, this is the first attempt at directly using the information about the co-occurrence of surgical tools in laparoscopic videos in the classifier's decision-making.

  • The ML classifier and the decision model are trained in an end-to-end fashion. For this purpose, the training is performed by jointly optimizing the loss functions for the ML classifier and the decision model using a multitask learning approach.

  • At the post-processing step, the trained model’s prediction for each video is sent to another RNN to consider the order of the usage of different tools/tool combinations and long-term temporal dependencies – yet another consideration for the context.

The overview of the proposed model is illustrated in Fig. 1. Let D = {(x_i^j, Y_i^j)} be an ML dataset, where x_i^j is the i-th frame of the j-th video, Y_i^j ⊆ L is the corresponding set of surgical instruments and L is the set of all possible tools. Each subset of L is called a label-set and each frame can have a different number of labels |Y_i^j|. The tool associations can also be represented as an |L|-dimensional binary vector y_i^j, where each element is 1 if the corresponding tool is present and 0 otherwise. The goal is to design a classifier F that maps the frames of surgical videos to the tools in the observed scene.

Figure 1:

Block diagram of a) the proposed multiclass classifier F which consists of f and g, b) the architecture for Gated Recurrent Units (GRU) and c) The bi-directional RNN for post-processing.

In order to take advantage of the combination of the surgical tools in a laparoscopic video, the well-known label power-set (LP) method is adopted in a novel way. The output of F is a label-set (also called a superclass) from P(L), the set of all possible subsets of L, of size K.

In order to calculate the confidence scores for each tool along with the final decision, which is the class index in P(L), the classifier is decomposed into F = g ∘ f, where g is the decision model, which maps the confidence scores of the i-th frame of the j-th video to the label-set Ŷ_i^j. The model f takes the video frames as input and produces the confidence scores s_i^j, where each element is the probability of the presence of one tool from the set L. It's worth mentioning that f is an ML classifier, while the output of the decision model g is a label-set; therefore, the classifier F is an MC classifier based on the LPs.

The ML classifier f consists of a CNN and an RNN. The CNN is responsible for extracting the visual features, while the RNN uses the sequence of features extracted by the CNN and calculates the confidence scores s_i^j.

The output of the decision model for all the frames of each video forms a larger sequence of the model’s predictions. The sequence is used as the input to a bi-directional RNN to exploit the long-term order of the tool usage.

The overall system is designed and tested using the dataset from the M2CAI 2016 tool detection challenge. The dataset contains 15 videos of the cholecystectomy procedure, which is the surgery for removing the gallbladder. All the videos are labeled for the presence of 7 tools at every 25th frame (i.e., at 1 frame per second). The tools are Bipolar, Clipper, Grasper, Hook, Irrigator, Scissors, and Specimen Bag. There are 10 videos for training and 5 videos for validation.

The performance of LapTool-Net is measured through common metrics for ML and MC classification and a comparison is made with the current methods. The methodology derived from this approach is provided in more detail in the following section.

3 Methodology

3.1 Multilabel Classification

In ML classification, the goal is to assign multiple labels to each image. The higher dimensionality of the label space and the correlation between the labels make ML classification more challenging compared with MC problems. In the literature, two main approaches are accepted for dealing with such issues in ML classification (Gibaja and Ventura, 2015). One approach is called adaptation, which aims at adapting existing machine learning models to the requirements of ML classification. Since the output of an ML classifier is a set of confidence scores, one for each class, a decision policy is needed to make the final prediction. This decision is usually made with top-k or thresholding methods.

The second paradigm for ML classification is based on problem transformation. The goal of problem transformation is to transform the ML problem into a more well-defined binary or MC classification. The most popular methods include binary relevance (BR), classifier chains (Read et al., 2011) and LP. In BR, the problem is transformed into multiple binary classifiers, one for each class. This method doesn't take the dependencies between the classes into account. In a classifier chain, on the other hand, the binary classifiers are linked in a chain to capture the classes' correlations. In LP, each combination of classes is treated as one superclass and the problem is transformed into an MC classification. The advantage of this method is that the class dependencies are automatically considered. However, as the number of classes increases, the number of superclasses grows exponentially. This is not an issue in laparoscopic videos, because 1) there is a limit on the number of tools in each frame (usually 3 or 4) and 2) the combinations of the tools are known. Since the LP method uses just one classifier, it is more efficient than a classifier chain and was determined to be the better fit for detecting the usage of surgical tools. Thus, we propose a novel classifier with LP as the decision layer for an ML classifier.
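The LP transformation can be sketched in a few lines of Python. The helper below is an illustrative sketch, not the paper's code; `build_superclasses` and the sample frames are hypothetical names:

```python
def build_superclasses(labelsets):
    """Map each observed tool combination (label-set) to a
    superclass index, as in the label power-set (LP) transform.
    Only combinations that actually occur become superclasses,
    so impossible combinations are excluded automatically.
    """
    # frozenset makes a combination hashable and order-independent;
    # sorting gives a deterministic index assignment.
    unique = sorted({frozenset(ls) for ls in labelsets},
                    key=lambda s: sorted(s))
    return {s: k for k, s in enumerate(unique)}

# Three annotated frames; the last two are the same combination.
frames = [["Grasper"], ["Grasper", "Hook"], ["Hook", "Grasper"]]
superclasses = build_superclasses(frames)
```

Because the combination is stored order-independently, two annotations listing the same tools in different orders map to the same superclass.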

3.2 Spatio-temporal Features

In order to detect the presence of surgical instruments in laparoscopic videos, the visual features (intra-frame spatial and inter-frame temporal features) need to be extracted. We use a CNN to extract the spatial features. A CNN consists of an input layer, multiple convolutional layers, non-linear activation units, and pooling layers, followed by a fully connected (FC) layer to produce the outputs, which are typically the classification results or confidence scores. Each layer passes its results to the next layer, and the weights of the convolutional and FC layers are trained using back-propagation to minimize a cost function. The output of the last convolutional layer is a lower-dimensional representation of the input and can therefore be considered as the spatial features. As shown in Fig. 1, the input frame is sent through the trained CNN and the output of the last convolutional layer (after pooling) forms a fixed-size spatial feature vector v_i^j.

In the literature, several approaches have been proposed for utilizing the temporal features in videos for tasks such as activity recognition and object detection (Karpathy et al., 2014), (Simonyan and Zisserman, 2014). In particular, the high correlation among neighboring video frames can be exploited to improve the performance of the tool detection algorithm.

An RNN is typically used to exploit the pattern of the instruments' usage. It uses its internal memory (states) to process a sequence of inputs in time-series and video processing tasks (Jin et al., 2018b). Although motion features are not extracted explicitly when using an RNN, the temporal features are exploited through the correlation of the spatial features in neighboring frames.

Since the point of combining the RNN with the CNN is to capture the high correlation among neighboring frames, short sequences of frames (say 5 frames) are selected. Shorter sequences also help the RNN converge better and faster.

For each frame x_i^j, the sequence of spatial features (v_{i-(t-1)d}^j, …, v_{i-d}^j, v_i^j) is the input for the RNN, where the hyper-parameters t and d are the number of frames in the sequence and the constant inter-frame interval, respectively. The total length of the input is no longer than one second, which ensures that the tools remain visible during that time interval. Since the tool detection model is designed to be causal and to perform in real-time, only the frames preceding the current frame can be used with the RNN.
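As a minimal sketch of how such a causal input sequence could be assembled (the `causal_sequence` helper and its clamping behavior at the start of a video are our assumptions, not specified in the paper):

```python
def causal_sequence(i, t, d):
    """Indices of the t input frames ending at frame i, spaced d
    frames apart; only past frames are used (causal), and indices
    before the start of the video are clamped to 0."""
    return [max(0, i - k * d) for k in range(t - 1, -1, -1)]

# With t=5 frames and d=5 at 25 fps, the window spans <= 1 second.
```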

We selected the Gated Recurrent Unit (GRU) (Cho et al., 2014) as our RNN for its simplicity. The architecture is illustrated in Fig. 1.(b) and formulated as:

z_t = σ(W_z x_t + U_z h_{t-1})
r_t = σ(W_r x_t + U_r h_{t-1})
h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t (1)

where the W and U matrices are the GRU weights, ⊙ is the element-wise multiplication and σ is the sigmoid activation function. z_t and r_t are the update gate and reset gate respectively. The final hidden state h_t is the output of the GRU and is the input to a fully connected neural network. The output layer is of size |L| (the number of tools) and, after the sigmoid function is applied, produces the vector of confidence scores s for all classes.
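The GRU gate equations can be mirrored in a few lines of NumPy. This is an illustrative single-step sketch, not the paper's implementation (biases are omitted and the weight shapes are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, W, U):
    """One GRU time step (biases omitted for brevity)."""
    z = sigmoid(Wz @ x + Uz @ h)             # update gate
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate
    h_tilde = np.tanh(W @ x + U @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde       # new hidden state
```

Iterating `gru_step` over the sequence of spatial feature vectors yields the final hidden state that is passed to the fully connected layer.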

We designed the above RCNN architecture as the ML classifier model f shown in Fig. 1.a, which exploits the spatio-temporal features of a given input frame and produces the vector of confidence scores of all the tools, which in turn is the input to the decision model g.

3.3 Decision Model

One of the main challenges in ML classification is effectively utilizing the correlation among different classes. Using LP (as described earlier), uncommon combinations of the classes are automatically eliminated from the output and the classifier's attention is directed towards the more probable combinations.

As mentioned before, not all combinations are possible in a laparoscopic surgery. Fig. 2 shows the percentage of the most likely combinations in the M2CAI dataset. The first 15 classes, out of a possible maximum of 128, span more than 99.95% of the frames in both the training and the validation sets, and the tool combinations have almost the same distribution in both cases.

Figure 2: The distribution for the combination of the tools in M2CAI dataset

Since an LP classifier is MC, the cost function for training a machine learning algorithm has to be a conventional one-vs-all (categorical) loss; for example, Softmax cross-entropy (CE) is the most popular MC loss function. However, Softmax CE requires the classes to be mutually exclusive, which is not true in the LP method. In other words, when a Softmax loss is used, each superclass is treated as a separate class, i.e. separate features activate each superclass. This degrades the performance of the classifier and, therefore, more data is required for training. We address this issue by a novel use of LP as the decision model g, which we apply on top of the ML classifier f. Our method helps the classifier consider the superclasses as combinations of classes rather than separate, mutually exclusive classes.

The decision model g is a fully connected neural network, which takes the confidence scores s of f and maps them to the corresponding superclass. When the Softmax function is applied, the output of g is the probability of each of the K superclasses, where K is the size of the superclass set. The final prediction of the tool detector is the index of the superclass with the highest probability and, for frame i of video j, is calculated as:

ŷ_i^j = argmax_{k ∈ {1,…,K}} g(f(x_i^j))_k (2)
3.4 Class Imbalance

Class imbalance has been a well-studied area in machine learning for a long time (Buda et al., 2018)

. It is known that in skewed datasets, the classifier’s decision is inclined towards the majority classes. Therefore, it is always beneficial to have a uniform distribution for the classes during training. Two major approaches have been proposed in the literature to deal with imbalanced datasets.

One approach is called cost sensitive and is mainly based on class re-weighting. In this method, the outputs of the classes or the loss function during training are weighted based on the frequency of the classes. Although this approach works in some cases, the choice of the weights might not depend solely on the distribution of the data, since the complexity of the input is not known before training. Thus, class weights are another set of hyper-parameters that needs to be determined.

Another solution to an imbalanced dataset is to change the distribution in the input. This can be accomplished using over-sampling for the minority classes and under-sampling for the majority classes. However, in ML classification, finding a balancing criterion for re-sampling is challenging (Charte et al., 2015), since a change in the number of samples for one class might affect other classes as well.

The number of samples for each tool before balancing is shown in Table 1. In order to overcome this issue, we perform under-sampling to obtain a uniform distribution over the combinations of the classes. The main advantage of under-sampling over other re-sampling methods is that it can also be applied to avoid the overfitting caused by the high correlation between neighboring frames of a laparoscopic video. Therefore, we try different under-sampling rates to find the smallest training set that does not sacrifice performance.

Tool Train Validation
Bipolar 631 331
Clipper 878 315
Grasper 10367 6571
Hook 14130 7454
Irrigator 953 131
Scissors 411 158
Specimen Bag 1504 483
no tools 2759 1888
total 23421 12512
Table 1: Number of frames for each tool in M2CAI

Since this approach will not guarantee balance, a cost-sensitive weighting approach can be used along with an ML loss, prior to the LP decision layer; nonetheless, we empirically found that this doesn’t affect the performance of the ML classifier.
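The LP-based under-sampling could be sketched as follows; `undersample_uniform` is a hypothetical helper, and the random selection policy within each superclass is our assumption:

```python
import random
from collections import Counter

def undersample_uniform(frames, superclasses, per_class, seed=0):
    """Under-sample so each superclass keeps at most `per_class`
    frames, approximating a uniform superclass distribution."""
    rng = random.Random(seed)
    by_class = {}
    for f, k in zip(frames, superclasses):
        by_class.setdefault(k, []).append(f)
    kept = []
    for k in sorted(by_class):
        pool = by_class[k]
        rng.shuffle(pool)  # random selection within the class
        kept.extend((f, k) for f in pool[:per_class])
    return kept
```

Classes with fewer than `per_class` samples keep everything they have, which is why the result is only approximately uniform unless over-sampling is added.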

Figure 3 shows the relationship among the tools after re-sampling. It can be seen that the LP-based balancing method not only tends towards a uniform distribution in the superclass space, but also improves the balance of the dataset in the single-class space (with the exception of Grasper, which can be used with all the other tools).

(a) before balancing
(b) after balancing
Figure 3: The chord diagram for the relationship between the tools before and after balancing based on the tools’ co-occurrences

3.5 Training

Since our tool detector is decomposed into an ML classifier f and an MC decision model g, the requirements of both models need to be considered during training. In order to accomplish this, the model is trained using both ML and MC loss functions.

We propose a joint training paradigm for optimizing the ML and MC losses as a multitask learning approach. To do so, two optimizers are defined based on the two losses, with separate hyper-parameters such as the learning rate and the sets of trainable weights. Using this technique, the features are extracted based on the final prediction of the model.

Having the vector of the confidence scores s_i^j, the ML loss is the sigmoid cross-entropy and is formulated as:

L_ML = −(1/N) Σ_{(i,j)∈T} Σ_{c=1}^{|L|} [ y_{i,c}^j log s_{i,c}^j + (1 − y_{i,c}^j) log(1 − s_{i,c}^j) ] (3)

where y_i^j is the correct label vector for frame i of video j, N is the total number of frames and T is the training set.

The Softmax CE loss function for the decision model is formulated as:

L_MC = −(1/N) Σ_{(i,j)∈T} log g(f(x_i^j))_{k_i^j} (4)

where k_i^j is the index of the correct superclass for frame i of video j.

The total loss function is the sum of the two losses and is formulated as:

L = L_MC + λ L_ML (5)

where λ is a constant weight for adjusting the impact of the two loss functions. The training is performed in an end-to-end fashion using the backpropagation through time method.
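For a single frame, the two losses and their weighted sum can be sketched in NumPy. This is an illustrative per-sample version, not the paper's TensorFlow code; the function names and the logit parameterization of the MC loss are our assumptions:

```python
import numpy as np

def sigmoid_ce(y_true, s):
    """ML loss: sigmoid cross-entropy over the per-tool scores."""
    eps = 1e-12
    s = np.clip(s, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(s) + (1 - y_true) * np.log(1 - s))

def softmax_ce(k_true, logits):
    """MC loss: softmax cross-entropy over the superclass logits."""
    z = logits - logits.max()                 # numerical stability
    log_p = z - np.log(np.exp(z).sum())
    return -log_p[k_true]

def total_loss(y_true, s, k_true, logits, lam=1.0):
    """Joint multitask loss: L = L_MC + lam * L_ML."""
    return softmax_ce(k_true, logits) + lam * sigmoid_ce(y_true, s)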

3.6 Post-processing

The final decision of the RCNN model from the previous section is made based on the extracted spatio-temporal features from a short sequence of frames. In other words, the model benefits from a short-term memory using the correlation among the neighboring frames. However, due to the high under-sampling rate for the balanced training set, this method might not produce a smooth prediction over the entire duration of the laparoscopic videos. In order to deal with this issue, we model the order in the usage of the tools with an RNN over all the frames of each video (Namazi et al., 2018).

Due to memory constraints, the sequence of final predictions from equation (2) of the RCNN, [ŷ_1^j, …, ŷ_{N_j}^j] for each video j, is selected as the input for the post-processing RNN. Since not all the videos have the same length, the shorter videos are padded with the no-tool class.

Our post-processing occurs offline, after the surgery is finished. Therefore, future frames can also be used along with past frames to improve the classification results of the current frame. In order to accomplish this, a bi-directional RNN is employed, which consists of two RNNs for the forward and backward sequences. The backward sequence is simply the reverse of the forward one. The outputs of the bi-RNN are concatenated and fed to a fully connected layer for the final prediction (Fig. 1.c).
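A bi-directional pass of this kind can be sketched as follows; the `step_fwd`/`step_bwd` callables stand in for the two GRUs and are hypothetical:

```python
import numpy as np

def bidirectional_pass(seq, step_fwd, step_bwd, h0):
    """Run a forward and a backward RNN over the whole prediction
    sequence and concatenate their hidden states per frame."""
    fwd, h = [], h0
    for x in seq:               # past -> future
        h = step_fwd(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(seq):     # future -> past
        h = step_bwd(x, h)
        bwd.append(h)
    bwd.reverse()               # realign with the forward pass
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Each per-frame output thus carries context from both the beginning and the end of the video, which is what makes the offline correction possible.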

Since the input frames for the bi-RNN are in a specific order, it's not possible to balance the input through re-sampling. Therefore, class re-weighting is performed to compensate for the minority classes. The class weights are chosen to be proportional to the inverse of the frequency of the superclasses. The loss function is:

L_post = −(1/N) Σ_{(i,j)∈T} w_{k_i^j} log p_i^j(k_i^j) (6)

where w_k is the weight for the superclass k and p_i^j is the superclass probability vector produced by the bi-RNN, which has 64 hidden units.
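The inverse-frequency weighting can be sketched as follows (the normalization to unit mean is our assumption; the paper only states proportionality):

```python
import numpy as np

def inverse_freq_weights(class_counts):
    """Superclass weights proportional to inverse frequency,
    normalized so that the mean weight is 1."""
    counts = np.asarray(class_counts, dtype=float)
    w = 1.0 / counts
    return w / w.mean()
```

A superclass three times rarer than another thus contributes three times more per-frame weight to the loss.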

4 Results

In this section, the performance of the different parts of the proposed tool detection model on M2CAI dataset is validated through numerous experiments using the appropriate metrics.

We selected Tensorflow (Abadi et al., 2016) for all of the experiments. The CNN in all the experiments was Inception V1 (Szegedy et al., 2015). In order to achieve better generalization, extensive data augmentation, such as random cropping, horizontal and vertical flipping, rotation, and random changes in brightness, contrast, saturation, and hue, was performed during training. The initial learning rate was 0.001 with a decay rate of 0.7 after 5 epochs, and the results were taken after 100 epochs. The batch size was 32 for training the CNN models and 40 for the RNN-based models. All the experiments were conducted using an Nvidia TITAN XP GPU. The source code of the project is available on Github.

4.1 Metrics

Since the proposed model is MC, the corresponding evaluation metrics were chosen. Due to the high imbalance of the validation dataset, accuracy alone is not sufficient to evaluate the proposed model. Therefore, we used the F1 score to compare the performance of different models through both per-class (macro) and overall (micro) metrics:

F1-macro = (1/|C|) Σ_c 2 P_c R_c / (P_c + R_c),   F1-micro = 2 P R / (P + R)

where P_c, R_c and P, R are the per-class precision/recall and the overall precision/recall respectively and are calculated as:

P_c = TP_c / M_c,   R_c = TP_c / N_c,   P = Σ_c TP_c / Σ_c M_c,   R = Σ_c TP_c / Σ_c N_c

where TP_c, M_c and N_c are the number of correctly predicted frames for class c, the total number of frames predicted as class c, and the total number of frames of class c, respectively. Only frames with all the tools predicted correctly are considered exact matches.

In order to evaluate the RCNN model f, we used an ML metric, mean Average Precision (mAP), which is the mean of the average precision (a weighted average of precision over recall at different thresholds) for all 7 tools.
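The macro and micro F1 computations can be sketched from per-class counts as follows (an illustrative helper using the standard definitions; names are ours):

```python
import numpy as np

def f1_scores(tp, pred, true):
    """Macro and micro F1 from per-class counts.
    tp[c]:   frames of class c predicted correctly;
    pred[c]: frames predicted as class c;
    true[c]: frames whose true class is c."""
    tp, pred, true = (np.asarray(a, dtype=float) for a in (tp, pred, true))
    p_c = tp / np.maximum(pred, 1.0)
    r_c = tp / np.maximum(true, 1.0)
    f1_c = 2 * p_c * r_c / np.maximum(p_c + r_c, 1e-12)
    macro = f1_c.mean()                         # average of per-class F1
    P, R = tp.sum() / pred.sum(), tp.sum() / true.sum()
    micro = 2 * P * R / (P + R)                 # pooled over all classes
    return macro, micro
```

Macro-F1 treats every class equally, so it penalizes poor performance on rare tools such as scissors, while micro-F1 is dominated by the frequent ones.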

4.2 CNN Results

In the first experiments, we assumed that the classifier is a CNN alone and the decision model is applied to the resulting scores from the CNN. Since the dataset was labeled only for one frame per second (out of 25 frames/sec), the unlabeled frames could also be used for training, as long as the tools remain the same between two consecutive labeled frames. We used this unlabeled data to balance the training set according to the LPs. The CNN was trained with the loss function (3), with an output layer of size 7. Table 2 shows the results of our CNN with different training set sizes for the tools listed in Table 1.

Total Frames Balanced Acc(%) mAP(%) F1-macro(%)
23k No 77.23 61.02 58.48
150k Yes 75.90 71.15 70.49
75k Yes 74.78 77.24 74.81
25k Yes 75.40 78.58 74.64
6k Yes 74.36 78.22 74.43
3k Yes 73.10 73.69 70.85
Table 2: Results for the multi-label classification of the CNN

As was to be expected, the unbalanced training set results shown in the first row of Table 2 has the lowest performance on all the metrics. The high exact match accuracy (Acc) of 77.23% and the lower results on per-class metrics, such as F1-macro and mAP show that the model correctly predicted the majority classes (grasper and hook) but has poor performance for the less used tools such as scissors.

In order to balance the datasets, the following specific steps were taken: 1) the 15 most frequent superclasses were selected and the original frames were re-sampled to have a uniform distribution in the set of label-sets. The number of frames for each superclass was randomly selected to be 10,000, 5,000, 1,666, 400 and 200. 2) Multiple copies of some frames were added to the final set in the first two training sets, because fewer frames were available for some tools, such as scissors; this accomplished the intended over-sampling. 3) Similarly, under-sampling was performed in at least one class in all sets, and in all classes in the last two sets, because too many frames were available for some tools.

Under these conditions, we can discuss the results presented in rows 2 through 6 in Table 2. While the exact match accuracy is the highest in the 150K set, it has the lowest score on the per-class metrics. The likely reason is the high over-sampling rate, which causes overfitting for the less frequently occurring classes. On the other hand, a very high under-sampling rate in the 3K set results in lower accuracy, likely due to the lack of informative samples.

The best per-class results are for the 25K and 6K sets, versus the 150K and 75K sets, owing to the lower correlation among the CNN's training inputs. We used the 6K dataset rather than the 25K for the rest of the experiments, because adding the RNN and decision model to the selected CNN increases the size of the model (RCNN-LP) and, with it, the chance of overfitting.

In order to evaluate the effect of utilizing the co-occurrence of different surgical tools, we tested the LP method as the primary classifier, as well as the decision model, using different training strategies. The configurations for each experiment are shown in Table 3.

Exp. num Loss function FC1 size FC2 size Trainable weights Training method
1 (4) 15 - all -
2 (150k) (4) 15 - all -
3 (3)/(4) 15 15 CNN+FC1/ FC2 Sequential
4 (3)/(4) 7 15 CNN+FC1/ CNN+FC2 Alternate
5 (3)/(4) 7 15 CNN+FC1/ all Alternate
6 (3)/(4) 7 15 CNN+FC1/ FC2 Sequential
7 (5) 7 15 all Joint
Table 3: Setup configurations for training the multiclass CNN

In sequential training, the CNN was trained first and the decision model was added on top of the trained model, while the CNN weights remained unchanged. In alternate training, the trainable weights change with the loss at every other step. The joint training method is explained in the previous section. We used MC metrics: exact match accuracy, micro and macro F1, and average per-class precision and recall. The results are shown in Table 4 and the precision and recall for each tool are shown in Table 5.

Exp. num Acc(%) F1-macro(%) F1-micro(%) Mean P(%) Mean R(%)
1 70.01 69.14 84.57 72.90 67.98
2 76.13 73.77 87.89 86.08 67.24
3 73.18 74.30 86.92 79.65 70.80
4 74.42 75.70 87.75 82.37 71.48
5 72.44 75.23 86.42 87.60 67.25
6 74.97 75.47 88.04 80.67 73.21
7 76.31 78.32 88.53 78.48 78.95
Table 4: Results for the multiclass CNNs
Exp. Bipolar Clipper Grasper Hook Irrigator Scissors Specimen bag
1 71.2/66.0 72.7/58.4 90.3/70.9 92.8/90.6 56.7/65.5 61.9/32.9 64.6/91.4
2 76.5/35.7 85.4/52.0 92.0/80.3 95.2/90.0 85.8/74.6 93.8/48.1 73.6/89.8
3 84.4/71.5 78.0/57.4 91.0/75.8 94.6/90.9 57.0/53.2 79.6/54.4 72.8/92.1
4 81.5/70.8 80.7/60.0 89.9/79.8 95.3/90.1 63.3/56.5 88.7/50.0 77.0/93.0
5 91.7/56.5 81.2/53.6 91.3/75.9 97.5/86.1 78.2/59.0 93.1/51.8 79.9/87.5
6 75.1/75.1 69.2/59.3 90.0/81.5 95.2/90.3 72.8/70.4 87.8/41.1 74.3/94.4
7 83.9/72.6 72.6/74.2 91.1/81.2 94.3/91.4 59.7/75.4 71.6/63.9 75.8/93.7
Table 5: Precision/Recall (P(%)/R(%)) of each tool for the multiclass CNNs

In the first two experiments, the LP method was used directly to map the video frames to the corresponding superclass. To accomplish this, the features extracted by the CNN were connected to an FC layer of size 15 and the network was trained with the loss function in Equation 4. We selected the balanced training sets from the previous experiments with 6K and 150K samples. It can be seen from experiments 1 and 2 (Table 4), which correspond to the 6K and 150K sample sets respectively, that both accuracy and F1 scores increase when the training set is larger. The precision and recall in Table 5 also show improvements in almost all classes. However, compared with the results from Table 2, we observe only minor improvements in accuracy and F1 when using 150K frames with the LP classifier, while the metrics decrease with the smaller training set. Considering that both training sets were balanced based on LP, this observation suggests that the LP-based classifier needs more examples to achieve reasonable performance. This is because, in an LP classifier, the superclasses are treated as separate classes whose features differ from those of the corresponding single-label classes, so the classifier requires more training examples to learn the discriminating features. This is also supported by the relatively close precision/recall for grasper and hook, which have more unique frames (due to a lower under-sampling rate), across the two experiments in Table 5.

In experiment 3, the ML loss was tried instead of the MC loss for training the LP classifier with 15 superclasses, and the decision model was added on top and trained sequentially. As shown in Table 4, the per-class F1 score for experiment 3 improves compared to experiment 1, while the exact match accuracy is lower. This is probably because the model is still not aware that a superclass is a combination of multiple classes.

In experiments 4, 5 and 6, the CNN was trained using the ML loss in Equation 3 with 7 classes, and the decision layer was added on top of the confidence scores. We evaluated the model using different training strategies. All three of these experiments produced better results than experiment 3, likely because the model can learn the pattern of the 7 tools more easily with the ML loss than it can learn the patterns of the 15 combination classes.

The point of experiments 4 and 5 was to evaluate the effect of the decision model on training the feature extractor and the ML classifier. In both experiments, the decision model and the CNN were trained alternately. The weights of the CNN were frozen in experiment 6, while in experiments 4 and 5 they were trained at each step. Therefore, in experiment 6, the role of the decision model was simply to use the co-occurrence information to find the correct superclasses from the confidence scores of a trained model. The results show improvement in F1 scores in all three experiments compared with the results from Table 2. This is because, with LP as the decision model, the co-occurrence of surgical tools in each frame is considered directly in the classification method, without learning separate patterns for superclasses.
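The paper's decision model is a trained FC layer; as a simpler, non-learned stand-in, the way co-occurrence information restricts predictions to observed tool combinations can be illustrated by scoring each superclass against the per-tool confidence scores (names, combinations, and the Bernoulli log-likelihood scoring are all assumptions for illustration):

```python
import numpy as np

def decide_superclass(scores, superclasses):
    """Pick the most likely valid tool combination given per-tool confidence
    scores, by scoring each observed combination's on/off pattern with its
    Bernoulli log-likelihood under the scores.

    scores: per-tool probabilities in [0, 1].
    superclasses: list of tool-index tuples seen in training.
    """
    scores = np.clip(scores, 1e-6, 1 - 1e-6)
    best, best_ll = None, -np.inf
    for combo in superclasses:
        mask = np.zeros_like(scores)
        mask[list(combo)] = 1.0
        ll = np.sum(mask * np.log(scores) + (1 - mask) * np.log(1 - scores))
        if ll > best_ll:
            best, best_ll = combo, ll
    return best

# Hypothetical combinations: (grasper,), (hook,), (grasper, hook),
# with tool indices 0 and 3 for grasper and hook.
combos = [(0,), (3,), (0, 3)]
scores = np.array([0.9, 0.1, 0.05, 0.8, 0.1, 0.1, 0.1])  # grasper & hook high
pred = decide_superclass(scores, combos)
```

A combination never seen in training can never be predicted, which is exactly how the LP decision model exploits the co-occurrence pattern.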

In experiment 7, the loss is the weighted sum of the ML and MC loss functions and the training was performed on all the weights of the model. The end-to-end training of the CNN-LP produces significantly better results than all other training methods, such as sequential and alternate training, because all parts of the model are trained simultaneously to reach better confidence scores and hence a better final decision.
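A minimal numpy sketch of such a weighted-sum objective, assuming the ML term is a per-tool sigmoid cross-entropy and the MC term a softmax cross-entropy over the 15 superclasses (the exact forms of Equations 3-5 and the weight `alpha` are assumptions here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_loss(ml_logits, mc_logits, tool_targets, superclass_target, alpha=0.5):
    """Weighted sum of the multilabel and multiclass loss terms,
    as in the joint training of experiment 7."""
    # Multilabel term: binary cross-entropy over the per-tool outputs.
    p = sigmoid(ml_logits)
    ml = -np.mean(tool_targets * np.log(p + 1e-12)
                  + (1 - tool_targets) * np.log(1 - p + 1e-12))
    # Multiclass term: softmax cross-entropy over the superclasses.
    z = mc_logits - mc_logits.max()
    log_softmax = z - np.log(np.sum(np.exp(z)))
    mc = -log_softmax[superclass_target]
    return alpha * ml + (1 - alpha) * mc

# Near-perfect logits for a frame containing tool 0 / superclass 0
# give a loss close to zero.
loss = joint_loss(np.array([10.0, -10.0]), np.array([10.0, -10.0, -10.0]),
                  np.array([1.0, 0.0]), 0)
```

Since a single scalar loss is optimized, gradients flow through both heads at every step, which is the "Joint" row of Table 3.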

4.3 LapTool-Net Results

In this section, the performance of the proposed model is evaluated after considering the spatio-temporal features using an RNN. Similar to the previous section, we tested the model before and after adding the decision model. The dataset for training is the 6K balanced set and all the models were trained end-to-end. For training the RCNN model, we used 5 frames at a time (the current frame and the 4 previous frames) with an inter-frame interval of 5, which results in a total span of 20 frames between the first and the last frame. The RCNN model was trained with a Stochastic Gradient Descent (SGD) optimizer. The data augmentation for the post-processing model included adding random noise to the input and randomly dropping frames to change the duration of the sequences.
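The temporal sampling described above can be written as a small helper (the function name is hypothetical, and clamping at the start of the video is an assumption not stated in the paper):

```python
def rcnn_window(t, n_frames=5, interval=5):
    """Indices of the frames fed to the RCNN for a prediction at frame t:
    the current frame and the 4 preceding ones, spaced 5 frames apart,
    spanning 20 frames in total (clamped at the start of the video).
    """
    return [max(0, t - i * interval) for i in range(n_frames - 1, -1, -1)]
```

For example, `rcnn_window(100)` yields the five indices from frame 80 to frame 100.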

Table 6 shows the results of the proposed LapTool-Net. For ease of comparison, we have copied the results from the previous section for the CNN with and without the LP decision model. It can be seen that considering the temporal features through the RCNN model improved the exact match accuracy and F1 scores. The higher performance of LapTool-Net is due to the bi-directional RNN's use of frames from both the past and the future of the current frame, as well as of the long-term order in which the tools are used.

Acc(%) F1-macro(%) F1-micro(%)
CNN 74.36 74.43 87.70
CNN-LP 76.31 78.32 88.53
RCNN 77.51 81.95 89.54
RCNN-LP 78.58 84.89 89.79
LapTool-Net 80.96 89.11 91.35
Table 6: Final results for the proposed model

The precision, recall and F1 score for each of the tools are shown in Table 7. Compared with the results from Table 5, we can see that the F1 scores for the clipper and scissors have significantly increased, because there is a high correlation between the usage of these tools and the tasks, i.e. the order of occurrence of the tools (e.g. cutting only happens after clipping is completed). The lowest performance is for the irrigator, probably because of the irregular pattern in its use (it is only used for bleeding and coagulation, which can occur at any time during the surgery). The higher overall recall is likely a result of the class re-weighting method. We believe the performance could improve with a better choice of the weights.

Tool Precision(%) Recall(%) F1(%)
Bipolar 82 95 88
Clipper 85 98 91
Grasper 89 88 89
Hook 94 94 94
Irrigator 74 91 82
Scissors 82 99 90
Specimen bag 92 89 91
Mean 85 93 89
Table 7: The precision, recall and F1 score of each tool for LapTool-Net

In order to localize the predicted tools, the attention maps were visualized using the Grad-CAM method (Selvaraju et al., 2017). The results for some of the frames are shown in Figure 4. To avoid confusion in frames that contain multiple tools, only the class activation map of a single tool is shown, based on the prediction of the model. The results show that the attention visualization of the proposed model can also be used to reliably identify the location of each tool.

Figure 4: The visualization of the class activation maps for some examples, based on the prediction of the model. Panels: (a) Grasper, (b) Grasper/Hook, (c) Grasper/Clipper, (d) Grasper/Scissors, (e) Grasper/Hook, (f) Grasper/Bipolar, (g) Grasper, (h) Hook, (i) Grasper, (j) Scissors, (k) Hook, (l) Bipolar.
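A minimal numpy sketch of the Grad-CAM computation (Selvaraju et al., 2017) used for these maps; the shapes and names are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM attention map: weight each feature map by the spatial
    average of its gradient w.r.t. the class score, sum over channels,
    and keep only positive evidence (ReLU), normalized to [0, 1].

    feature_maps, gradients: arrays of shape (channels, H, W).
    """
    weights = gradients.mean(axis=(1, 2))              # global-average-pool grads
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0)                           # ReLU
    if cam.max() > 0:
        cam /= cam.max()                               # normalize
    return cam

# Tiny synthetic check: only the first channel has a positive gradient,
# so the map reduces to that channel (after normalization).
cam = grad_cam(np.stack([np.ones((2, 2)), 2 * np.ones((2, 2))]),
               np.stack([np.ones((2, 2)), np.zeros((2, 2))]))
```

In practice the map is upsampled to the input resolution and overlaid on the frame to highlight the tool.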

4.4 Comparison with Current Work

In order to validate the proposed model, we compared it with previously published research on the M2CAI dataset. The results are shown in Table 8. Since all the methods reported their results using ML metrics such as mAP, we compared our ML classifier, the RCNN model, along with the final model. Our model outperformed previous methods by a significant margin, even while choosing a relatively shallower backbone (Inception-V1) and using less than 25% of the labeled images.

Method CNN mAP(%) F1-Macro(%)
LapTool-Net Inception-V1 - 89.11
RCNN (ours) Inception-V1 87.88 81.95
(Hu et al., 2017) Resnet-101 (He et al., 2016) 86.9 -
(Sahu et al., 2017) Alexnet 65 -
(Wang et al., 2017) Inception-V3 (Szegedy et al., 2016) 63.8 -
(Sahu et al., 2016) Alexnet 61.5 -
(Twinanda et al., 2016) Alexnet 52.5 -
Table 8: Comparison of tool presence detection methods on M2CAI

5 Conclusion and Future Direction

The observation by surgical residents of which surgical instruments are used, and for how long, in laparoscopic procedures gives great insight into how the surgery is performed. While identifying the tools in a recorded video of surgery is a trivial albeit tedious task for an average human, there are certain challenges in detecting the tools using computer vision algorithms. To tackle these challenges, in this paper we proposed a novel deep learning system, called LapTool-Net, for automatically detecting the presence of surgical tools in every frame of a laparoscopic video. The main feature of the proposed RCNN model is its context-awareness, i.e. the model learns the short-term and long-term patterns of tool usage by utilizing the correlation between the usage of the tools with each other and with the surgical steps, which follow a specific order. To achieve this goal, an LP-based model is used as a decision layer for the ML classifier and the training is performed in an end-to-end fashion. The advantage of this paradigm over a direct LP classifier is that training can be accomplished with a smaller dataset, owing to having fewer classes and avoiding learning separate (and probably not useful) patterns for the superclasses. Furthermore, the order of occurrence of the tools is extracted by training a bi-RNN on the final predictions of a trained RCNN model. To overcome the high imbalance in the occurrence of the tools, we used under-sampling based on the tools' combinations and the LP model. In addition to providing a balanced dataset, the high under-sampling rate reduces the generalization error by avoiding overfitting, which is the main challenge in tool detection due to the high correlation among video frames. Our method outperformed all previously published results on the M2CAI dataset, while using less than 1% of the total frames in the training set.

While our model was designed based on prior knowledge of the cholecystectomy procedure, it doesn't require any domain-specific knowledge from experts and can be effectively applied to videos captured from laparoscopic, or even other forms of, surgery. Also, the relatively small dataset after under-sampling suggests that the labeling process could be accomplished faster by using fewer frames (e.g. one frame every 5 seconds). Moreover, the simple architecture of the proposed LP-based classifier makes it easy to use with other proposed models such as (Al Hajj et al., 2018) and (Hu et al., 2017), or with weakly supervised models [38] to localize the tools in the frames. The offline design can be useful in generating summary reports, assessments, procedure ratings, etc. Finally, the proposed model in online mode has a processing time of less than 0.01 seconds/frame, which makes it suitable for real-time applications such as feedback generation during surgery.

We plan on implementing a few ways to improve the performance of the proposed model. Firstly, the CNN can be replaced by a deeper and more accurate model. In particular, we will use the Inception-Resnet-V2 (Szegedy et al., 2017) for the CNN and the cholec80 dataset (Twinanda et al., 2017a) for training. Secondly, since the RNN doesn’t extract the unique motion features of the tools, it can be replaced by a 3D CNN. The other way to improve the results is to choose better hyper-parameters, especially the class weights for balancing.

In the future, we will investigate applying the findings in this paper to designing a semi-supervised learning based model (Cheplygina et al., 2019), using only a fraction of the videos being labeled.


Acknowledgments

This work was supported by a Joseph Seeger Surgical Foundation award from the Baylor University Medical Center at Dallas.

The authors would like to thank NVIDIA Inc. for donating the TITAN XP GPU through the GPU grant program.


References

  • Abadi et al. [2016] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. Technical report, Google, 2016.
  • Abdulbaki Alshirbaji et al. [2018] Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, and Knut Möller. Surgical Tool Classification in Laparoscopic Videos Using Convolutional Neural Network. Current Directions in Biomedical Engineering, 4(1):407–410, 9 2018. ISSN 2364-5504. doi: 10.1515/cdbme-2018-0097.
  • Al Hajj et al. [2018] Hassan Al Hajj, Mathieu Lamard, Pierre-Henri Conze, Béatrice Cochener, and Gwenolé Quellec. Monitoring tool usage in surgery videos using boosted convolutional and recurrent neural networks. Medical Image Analysis, 47:203–218, 7 2018. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2018.05.001.
  • Al Hajj et al. [2019] Hassan Al Hajj, Mathieu Lamard, Pierre-Henri Conze, Soumali Roychowdhury, Xiaowei Hu, Gabija Maršalkaitė, Odysseas Zisimopoulos, Muneer Ahmad Dedmari, Fenqiang Zhao, Jonas Prellberg, Manish Sahu, Adrian Galdran, Teresa Araújo, Duc My Vo, Chandan Panda, Navdeep Dahiya, Satoshi Kondo, Zhengbing Bian, Arash Vahdat, Jonas Bialopetravičius, Evangello Flouty, Chenhui Qiu, Sabrina Dill, Anirban Mukhopadhyay, Pedro Costa, Guilherme Aresta, Senthil Ramamurthy, Sang-Woong Lee, Aurélio Campilho, Stefan Zachow, Shunren Xia, Sailesh Conjeti, Danail Stoyanov, Jogundas Armaitis, Pheng-Ann Heng, William G. Macready, Béatrice Cochener, and Gwenolé Quellec. CATARACTS: Challenge on automatic tool annotation for cataRACT surgery. Medical Image Analysis, 52:24–41, 2 2019. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2018.11.008.
  • Allan et al. [2013] M. Allan, S. Ourselin, S. Thompson, D. J. Hawkes, J. Kelly, and D. Stoyanov. Toward Detection and Localization of Instruments in Minimally Invasive Surgery. IEEE Transactions on Biomedical Engineering, 60(4):1050–1058, 4 2013. ISSN 0018-9294. doi: 10.1109/TBME.2012.2229278.
  • Allan et al. [2018] M. Allan, S. Ourselin, D. J. Hawkes, J. D. Kelly, and D. Stoyanov. 3-D Pose Estimation of Articulated Instruments in Robotic Minimally Invasive Surgery. IEEE Transactions on Medical Imaging, 37(5):1204–1213, 5 2018. ISSN 0278-0062. doi: 10.1109/TMI.2018.2794439.
  • Antico et al. [2019] Maria Antico, Fumio Sasazawa, Liao Wu, Anjali Jaiprakash, Jonathan Roberts, Ross Crawford, Ajay K. Pandey, and Davide Fontanarosa. Ultrasound guidance in minimally invasive robotic procedures. Medical Image Analysis, 54:149–167, 5 2019. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2019.01.002.
  • Ballantyne [2002] Garth H Ballantyne. The pitfalls of laparoscopic surgery: challenges for robotics and telerobotic surgery. Surgical laparoscopy, endoscopy & percutaneous techniques, 12(1):1–5, 2 2002. ISSN 1530-4515.
  • Bouget et al. [2017] David Bouget, Max Allan, Danail Stoyanov, and Pierre Jannin. Vision-based and marker-less surgical tool detection and tracking: a review of the literature. Medical Image Analysis, 35:633–654, 1 2017. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2016.09.003.
  • Buda et al. [2018] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 10 2018. ISSN 0893-6080. doi: 10.1016/J.NEUNET.2018.07.011.
  • Charte et al. [2015] Francisco Charte, Antonio J. Rivera, María J. del Jesus, and Francisco Herrera. Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163:3–16, 9 2015. ISSN 0925-2312. doi: 10.1016/J.NEUCOM.2014.08.091.
  • Cheplygina et al. [2019] Veronika Cheplygina, Marleen de Bruijne, and Josien P.W. Pluim. Not-so-supervised: A survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Medical Image Analysis, 54:280–296, 5 2019. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2019.03.009.
  • Cho et al. [2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 6 2014.
  • Du et al. [2016] Xiaofei Du, Maximilian Allan, Alessio Dore, Sebastien Ourselin, David Hawkes, John D. Kelly, and Danail Stoyanov. Combined 2D and 3D tracking of surgical instruments for minimally invasive and robotic-assisted surgery. International Journal of Computer Assisted Radiology and Surgery, 11(6):1109–1119, 6 2016. ISSN 1861-6410. doi: 10.1007/s11548-016-1393-4.
  • Elfring et al. [2010] Robert Elfring, Matías de la Fuente, and Klaus Radermacher. Assessment of optical localizer accuracy for computer aided surgery systems. Computer Aided Surgery, 15(1-3):1–12, 2 2010. ISSN 1092-9088. doi: 10.3109/10929081003647239.
  • Gibaja and Ventura [2015] Eva Gibaja and Sebastián Ventura. A Tutorial on Multilabel Learning. ACM Computing Surveys, 47(3):1–38, 4 2015. ISSN 03600300. doi: 10.1145/2716262.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 6 2016. ISBN 978-1-4673-8851-1. doi: 10.1109/CVPR.2016.90.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 11 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735.
  • Hu et al. [2017] Xiaowei Hu, Lequan Yu, Hao Chen, Jing Qin, and Pheng-Ann Heng. AGNet: Attention-Guided Network for Surgical Tool Presence Detection. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 186–194. Springer, Cham, 2017. doi: 10.1007/978-3-319-67558-9{_}22.
  • Jin et al. [2018a] Amy Jin, Serena Yeung, Jeffrey Jopling, Jonathan Krause, Dan Azagury, Arnold Milstein, and Li Fei-Fei. Tool Detection and Operative Skill Assessment in Surgical Videos Using Region-Based Convolutional Neural Networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 691–699. IEEE, 3 2018a. ISBN 978-1-5386-4886-5. doi: 10.1109/WACV.2018.00081.
  • Jin et al. [2018b] Yueming Jin, Qi Dou, Hao Chen, Lequan Yu, Jing Qin, Chi-Wing Fu, and Pheng-Ann Heng. SV-RCNet: Workflow Recognition From Surgical Videos Using Recurrent Convolutional Network. IEEE Transactions on Medical Imaging, 37(5):1114–1126, 5 2018b. ISSN 0278-0062. doi: 10.1109/TMI.2017.2787657.
  • Karpathy et al. [2014] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Fei Fei Li. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014. ISBN 9781479951178. doi: 10.1109/CVPR.2014.223.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances In Neural Information Processing Systems, pages 1097–1105, 2012.
  • Litjens et al. [2017] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, and Clara I. Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 12 2017. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2017.07.005.
  • Mishra et al. [2017] Kaustuv Mishra, Rachana Sathish, and Debdoot Sheet. Learning Latent Temporal Connectionism of Deep Residual Visual Abstractions for Identifying Surgical Tools in Laparoscopy Procedures. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2233–2240. IEEE, 7 2017. ISBN 978-1-5386-0733-6. doi: 10.1109/CVPRW.2017.277.
  • Namazi et al. [2018] Babak Namazi, Ganesh Sankaranarayanan, and Venkat Devarajan. Automatic Detection of Surgical Phases in Laparoscopic Videos. In Proceedings on the International Conference in Artificial Intelligence (ICAI), pages 124–130, 2018.
  • Prellberg and Kramer [2018] Jonas Prellberg and Oliver Kramer. Multi-label Classification of Surgical Tools with Convolutional Neural Networks. 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2018. doi: 10.1109/IJCNN.2018.8489647. URL https://cataracts.grand-challenge.org
  • Read et al. [2011] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85:333–359, 2011. doi: 10.1007/s10994-011-5256-5.
  • Reiter et al. [2012] Austin Reiter, Peter K. Allen, and Tao Zhao. Feature Classification for Tracking Articulated Surgical Tools. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2012), pages 592–600. Springer, Berlin, Heidelberg, 2012. doi: 10.1007/978-3-642-33418-4{_}73.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015. ISBN 0162-8828 VO - PP. doi: 10.1109/TPAMI.2016.2577031.
  • Sahu et al. [2016] Manish Sahu, Anirban Mukhopadhyay, Angelika Szengel, and Stefan Zachow. Tool and Phase recognition using contextual CNN features. Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages 186–194, 2016.
  • Sahu et al. [2017] Manish Sahu, Anirban Mukhopadhyay, Angelika Szengel, and Stefan Zachow. Addressing multi-label imbalance problem of surgical tool detection using CNN. International Journal of Computer Assisted Radiology and Surgery, 12(6):1013–1020, 6 2017. ISSN 1861-6410. doi: 10.1007/s11548-017-1565-x.
  • Sarikaya et al. [2017] Duygu Sarikaya, Jason J. Corso, and Khurshid A. Guru. Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection. IEEE Transactions on Medical Imaging, 36(7):1542–1549, 7 2017. ISSN 0278-0062. doi: 10.1109/TMI.2017.2665671.
  • Selvaraju et al. [2017] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626. IEEE, 10 2017. ISBN 978-1-5386-1032-9. doi: 10.1109/ICCV.2017.74.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, pages 568–576. MIT Press, 2014.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 07-12-June, pages 1–9, 2015. ISBN 9781467369640. doi: 10.1109/CVPR.2015.7298594.
  • Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826. IEEE, 6 2016. ISBN 978-1-4673-8851-1. doi: 10.1109/CVPR.2016.308.
  • Szegedy et al. [2017] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2 2017.
  • Twinanda et al. [2016] Andru P. Twinanda, Didier Mutter, Jacques Marescaux, Michel de Mathelin, and Nicolas Padoy. Single- and Multi-Task Architectures for Tool Presence Detection Challenge at M2CAI 2016. arXiv preprint, 10 2016.
  • Twinanda et al. [2017a] Andru P. Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel de Mathelin, and Nicolas Padoy. EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. IEEE Transactions on Medical Imaging, 36(1):86–97, 1 2017a. ISSN 0278-0062. doi: 10.1109/TMI.2016.2593957.
  • Twinanda [2017b] Andru Putra Twinanda. Vision-based Approaches for Surgical Activity Recognition Using Laparoscopic and RGBD Videos. PhD thesis, University of Strasbourg, 2017b.
  • Velanovich [2000] V Velanovich. Laparoscopic vs open surgery. Surgical Endoscopy, 14(1):16–21, 1 2000. ISSN 0930-2794. doi: 10.1007/s004649900003.
  • Wang et al. [2017] Sheng Wang, Ashwin Raju, and Junzhou Huang. Deep learning based multi-label classification for surgical tool presence detection in laparoscopic videos. In Proceedings - International Symposium on Biomedical Imaging, pages 620–623, 2017. ISBN 9781509011711. doi: 10.1109/ISBI.2017.7950597.
  • Wesierski and Jezierska [2018] Daniel Wesierski and Anna Jezierska. Instrument detection and pose estimation with rigid part mixtures model in video-assisted surgeries. Medical Image Analysis, 46:244–265, 5 2018. ISSN 1361-8415. doi: 10.1016/J.MEDIA.2018.03.012.
  • Zia et al. [2016] Aneeq Zia, Daniel Castro, and Irfan Essa. Fine-tuning Deep Architectures for Surgical Tool Detection. In Workshop and Challenges on Modeling and Monitoring of Computer Assisted Interventions (M2CAI), 2016. URL
  • Zisimopoulos et al. [2017] Odysseas Zisimopoulos, Evangello Flouty, Mark Stacey, Sam Muscroft, Petros Giataganas, Jean Nehme, Andre Chow, and Danail Stoyanov. Can surgical simulation be used to train detection and classification of neural networks? Healthcare technology letters, 4(5):216–222, 10 2017. ISSN 2053-3713. doi: 10.1049/htl.2017.0064.