The recognition of the surgical workflow has been identified as a key research area in surgical data science, as it enables the development of intra- and post-operative context-aware decision support tools fostering both surgical safety and efficiency. Pioneering work in surgical workflow recognition has mostly focused on phase recognition from endoscopic video [12, 1, 4, 22, 25, 7] and from ceiling-mounted cameras [21, 2], gesture recognition from robotic data (kinematic [6, 5], video [24, 11], system events [15]) and event recognition, such as the presence of smoke or bleeding [13].
In this paper, we focus on recognizing fine-grained activities representing the instrument-tissue interactions in endoscopic videos. These interactions are modeled as triplets ⟨instrument, verb, target⟩. Triplets represent the used instrument, the performed action, and the anatomy acted upon, as proposed in existing surgical ontologies [17, 10]. The target anatomy, while more challenging to annotate, adds substantial semantics to the recognized action/instrument. Triplet information has already been used to recognize phases [10]; however, to the best of our knowledge, this is the first work aiming at recognizing triplets directly from the video data. The fine-grained nature of the triplets also makes this recognition task very challenging. For comparison, the action recognition task introduced within the Endovis challenge at MICCAI 2019 targeted the recognition of only 4 verbs (grasp, hold, cut, clip).
To perform this work, we present a new dataset, called CholecT40, containing 135K action triplets annotated on 40 cholecystectomy videos from the public Cholec80 dataset [22]. The triplets belong to 128 action triplet classes, composed of 6 instruments, 8 verbs, and 19 target classes. Examples of such action triplets are ⟨grasper, retract, gallbladder⟩, ⟨scissor, cut, cystic duct⟩, ⟨hook, coagulate, liver bed⟩ (see also Fig. 1).
To design our recognition model, we build a multitask learning (MTL) network with three branches for instrument, verb and target recognition. We also observe that triplets are instrument-centric: an action is only performed if an instrument is present. Indeed, clinically an action can only occur if a hand is manipulating the instrument. We therefore introduce a new module, called class activation guide (CAG), which uses the weak localization information from the instrument activation maps to guide the recognition of the verbs and targets. The idea is similar to [8], which uses the human's ROI produced by Faster R-CNN to inform the model on the likely location of the target. Other related works from the computer vision community [23, 19, 20] rely heavily on the overlap of the subject-object bounding boxes to learn the interactions. However, in addition to the fact that our work targets triplets, our approach differs in that it does not rely on any spatial annotations in the dataset, which are expensive to generate.
Since instrument, verb, and target are multi-label classes, another challenge is to model their associations within the triplets. As will be shown in the experiments, naively assigning an ID to each triplet and classifying the IDs is not effective, due to the large number of combinatorial possibilities. In [23, 19, 20] mentioned above, the human is considered to be the only possible subject of interaction. Hence, in those works data association requires only bipartite matching to match verbs to objects. This is solvable by using the outer product of the detected object's logits and the detected verb's logits to form a 2D matrix of interaction at test time. The complexity of data association increases, however, with a triplet. Solving a triplet relationship is a tripartite graph matching problem, which is an NP-hard optimisation problem. In this work, inspired by , we therefore propose a 3D interaction space to recognize the triplets. Unlike , where the data association is not learned, our interaction space learns the triplet relationships.
In summary, the contributions of this work are as follows:
- We propose the first approach to recognize surgical actions as triplets of ⟨instrument, verb, target⟩ directly from surgical videos.
- We present a large endoscopic action triplet dataset, CholecT40, for this task.
- We develop a novel deep learning model that uses weak localization information from tool prediction to guide verb and target detection.
- We introduce a trainable 3D interaction space to learn the relationships within the triplets.
2 Cholecystectomy Action Triplet Dataset
To encourage progress towards the recognition of instrument-tissue interactions in laparoscopic surgery, we generated a dataset consisting of 40 videos from Cholec80 [22] annotated with action triplet information. We call this dataset CholecT40. The cholecystectomy recordings were first annotated by a surgeon using the software Surgery Workflow Toolbox-Annotate from the B-com institute. For each identified action, the surgeon marks the start and end frames, then labels the instrument, the verb and the target. Any change in the triplet configuration marks the end of the current action and the beginning of a different one. This first step was followed by a mediation on the annotations and a class grouping carried out by another clinician. The resulting action triplets span 128 classes encompassing 6 instruments, 8 verbs, and 19 target classes. For our experiments, we downsample the videos to 1 fps, yielding a total of 83.2K frames annotated with 135K action-triplet instances. Table 1 shows the frequency of occurrence of the instruments, verbs and targets in the dataset. When a tool is idle, the verb and the target are both set to null. Additional statistics on the co-occurrence distribution of the triplets are presented in the supplementary material. The video dataset is randomly split into training (25 videos, 50.6K frames, 82.4K triplets), validation (5 videos, 10.2K frames, 15.9K triplets) and testing (10 videos, 22.5K frames, 37.1K triplets) sets.
To recognize the instrument-tissue interactions in the CholecT40 dataset, we build a new deep learning model, called Tripnet, by following a multitask learning (MTL) strategy. The principal novelty of this model is its use of the instrument’s class activation guide and 3D interaction space to learn the relationships between the components of the action triplets.
3.0.1 Multitask Learning:
Following this observation, we build an MTL network with three branches for the instrument (I), verb (V), and target (T) recognition tasks. The instrument branch is a two-layer convolutional network trained for instrument classification. It uses global max pooling (GMP) to learn the class activation maps (CAM) of the instruments for their weak localization, as suggested in [18]. Similarly, the verb and target branches learn the verb and target classifications, each using two convolutional layers and one fully connected (FC) layer. All three branches share the same ResNet-18 backbone for feature extraction.
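The role of global max pooling in the instrument branch can be illustrated with a minimal sketch. It assumes that the branch outputs one activation map per instrument class; the function names and the toy values are hypothetical and are not taken from the Tripnet implementation. Pooling the maximum of each map gives a class logit, and the position of that maximum gives a weak localization cue.

```python
from typing import List, Tuple

Cam = List[List[float]]  # one H x W class activation map


def global_max_pool(cam: Cam) -> Tuple[float, Tuple[int, int]]:
    """Return the peak activation of a map and its (row, col) position."""
    best_val = float("-inf")
    best_pos = (0, 0)
    for r, row in enumerate(cam):
        for c, val in enumerate(row):
            if val > best_val:
                best_val, best_pos = val, (r, c)
    return best_val, best_pos


def instrument_logits_and_peaks(cams: List[Cam]):
    """Apply GMP to each per-class map: logits for classification, peaks for weak localization."""
    pooled = [global_max_pool(cam) for cam in cams]
    logits = [val for val, _ in pooled]
    peaks = [pos for _, pos in pooled]
    return logits, peaks


# Toy example (hypothetical values): two instrument classes on a 2 x 3 grid.
cams = [
    [[0.1, 2.5, 0.3], [0.0, 0.2, 0.1]],        # class 0: strong response at (0, 1)
    [[-1.0, -0.5, -0.2], [-0.3, -0.9, -0.4]],  # class 1: weak response everywhere
]
logits, peaks = instrument_logits_and_peaks(cams)
print(logits, peaks)
```

Because only the image-level class labels supervise the pooled logits, the maps can be trained without any bounding box annotation, which is the property the instrument branch relies on.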
3.0.2 Class Activation Guide:
The pose of the instruments is indicative of their interactions with the tissues. However, there is no bounding box annotation in the dataset that could be used to learn how to crop the action’s locations, as done in [8, 23, 19, 20]. We therefore hypothesize that the instrument’s CAM from the instrument branch, learnable in a weakly supervised manner, has sufficient information to direct the verb and target detection branches towards the likely region of interest of the actions. For convenience, we regroup the three branches of the MTL into two subnets: the instrument subnet and the verb-target subnet, as illustrated in Fig. 2a. The verb-target subnet is then transformed to a class activation guide (CAG) unit, as shown in Fig. 2b. It receives the instrument’s CAM as additional input. This CAM input is then concatenated with the verb and target features, concurrently, to guide and condition the model search space of the verb and target on the instrument appearance cue.
3.0.3 3D Interaction Space:
Recognizing the correct action triplets involves associating the right components using the raw output vectors, also called logits, of the instrument, verb and target branches. In the existing work , where the data association problem involves only the object-verb pair, the outer product of their logits is used to form a 2D matrix of component interaction at test time. In a similar manner, we propose a 3D interaction space for associating the triplets, as shown in Fig. 2c. Unlike in , where the data association is not learned by the trained model, we model a trainable interaction space. Given the $m$-logits ($a$), $n$-logits ($b$) and $p$-logits ($c$) for I, V and T respectively, we learn the triplets $y$ using a 3D projection function $\Psi$ as follows:

$$y = \Psi(\alpha a, \beta b, \gamma c),$$

where $\alpha$, $\beta$, $\gamma$ are the learnable weight vectors for projecting I, V and T to the 3D space and $\Psi$ is an outer product operation. This gives an $m \times n \times p$ grid of logits with the three axes representing the three components of the triplets. For all $a_i \in a$, $b_j \in b$, $c_k \in c$, the 3D point $y_{i,j,k}$ represents a possible triplet. A 3D point with a probability above a threshold is considered a valid triplet. In practice, there are more 3D points in the space than valid triplets in the CholecT40 dataset. Therefore, we mask out the invalid points, identified from the training set, at both training and test time.
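The interaction space can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the names `a`, `b`, `c`, `alpha`, `beta`, `gamma` are hypothetical, the weight vectors are applied element-wise, and fixed toy values stand in for what would be learned parameters. With the class counts given for CholecT40, the full grid would have 6 x 8 x 19 = 912 points for 128 valid triplet classes, which is why masking is needed.

```python
import math
from typing import List, Set, Tuple

Triplet = Tuple[int, int, int]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def interaction_space(a, b, c, alpha, beta, gamma):
    """Outer product of the weighted logits: an m x n x p grid of triplet logits."""
    wa = [w * x for w, x in zip(alpha, a)]
    wb = [w * x for w, x in zip(beta, b)]
    wc = [w * x for w, x in zip(gamma, c)]
    return [[[x * y * z for z in wc] for y in wb] for x in wa]


def valid_triplets(grid, valid: Set[Triplet], threshold: float) -> List[Triplet]:
    """Keep the unmasked grid points whose probability exceeds the threshold."""
    found = []
    for i, plane in enumerate(grid):
        for j, row in enumerate(plane):
            for k, logit in enumerate(row):
                if (i, j, k) in valid and sigmoid(logit) > threshold:
                    found.append((i, j, k))
    return found


# Toy example (hypothetical values): 2 instruments, 2 verbs, 2 targets.
a, b, c = [2.0, 0.1], [1.0, 0.2], [3.0, -1.0]
ones = [1.0, 1.0]  # stand-in for learned weight vectors
grid = interaction_space(a, b, c, ones, ones, ones)

# Triplets seen in the (hypothetical) training set; every other point is masked out.
valid = {(0, 0, 0), (1, 0, 0), (0, 0, 1)}
predicted = valid_triplets(grid, valid, threshold=0.7)
print(predicted)
```

In this toy run the point (0, 1, 0) has a probability above the threshold but is discarded because it is not in the valid set, while (1, 0, 0) is valid but falls below the threshold.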
3.0.4 Proposed Model:
The proposed network, called Tripnet, is shown in Fig. 2(a): it integrates the CAG unit and the 3D interaction space within the MTL model. The whole model is trained end-to-end using a warm-up parameter, which allows the instrument subnet to learn some semantics for a few epochs before guiding the verb-target subnet with instrument cues.
4.0.1 Implementation Details:
We perform our experiments on CholecT40. During training, we employ three types of data augmentation (rotation, horizontal flipping and patch masking) with no image preprocessing. The model is trained on images resized to . All the individual tasks are trained for multi-label classification using the weighted sigmoid cross-entropy with logits as loss function, regularized by an $L_2$ norm with weight decay. The class weights are calculated as in . The ResNet-18 backbone is pretrained on ImageNet. All the experimented models are trained using learning rates with exponential decay and initialized with the values  for the subnets, backbone, and 3D interaction space, respectively. The learning rates and other hyperparameters are tuned on the validation set using grid search. Our network is implemented using TensorFlow and trained on GeForce GTX 1080 Ti GPUs.
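A weighted sigmoid cross-entropy on logits can be written as follows. This is a minimal sketch under assumptions: it weights only the positive term of each class, the weights and toy values are hypothetical, and it does not reproduce the class-weight computation referenced above. The softplus identity is used to keep the computation numerically stable for large logits.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_sigmoid_ce(
    logits: Sequence[float],
    labels: Sequence[float],
    pos_weights: Sequence[float],
) -> float:
    """Mean over classes of -(w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x)))."""
    total = 0.0
    for x, y, w in zip(logits, labels, pos_weights):
        # -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x)
        total += w * y * softplus(-x) + (1.0 - y) * softplus(x)
    return total / len(logits)


# Toy multi-label example (hypothetical values): one positive class weighted 2x, one negative class.
logits = [0.0, 0.0]
labels = [1.0, 0.0]
pos_weights = [2.0, 1.0]
loss = weighted_sigmoid_ce(logits, labels, pos_weights)
print(loss)
```

Raising the weight of a rare class increases the penalty for missing it, which is the usual motivation for class weighting on imbalanced multi-label data.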
4.0.2 Tasks and Metrics:
To evaluate the capacity of a model to correctly recognize a triplet and its components, we use two types of metrics:
- Instrument detection performance: This measures the average precision (AP) of detecting the correct instruments, as the area under the precision-recall curve per instrument ($AP_I$).
- Triplet recognition performance: This measures the AP of recognizing the instrument-tissue interactions by looking at different sets of triplet components. We use three metrics: the instrument-verb ($AP_{IV}$), instrument-target ($AP_{IT}$), and instrument-verb-target ($AP_{IVT}$) metrics. All the listed components need to be correct during the AP computation. $AP_{IVT}$ evaluates the recognition of the complete triplets.
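As an illustration of the AP computation, the sketch below uses the common non-interpolated estimate of the area under the precision-recall curve: the mean of the precision values at the rank of each positive sample. The evaluation code of the paper may use a different estimator, and the function names and toy scores are hypothetical.

```python
import math
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP: mean precision at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # AP is undefined for a class that has no positive sample.
    return sum(precisions) / len(precisions) if precisions else float("nan")


def mean_average_precision(per_class_scores, per_class_labels) -> float:
    """Average the per-class APs, skipping classes without positives."""
    aps = [average_precision(s, l) for s, l in zip(per_class_scores, per_class_labels)]
    aps = [ap for ap in aps if not math.isnan(ap)]
    return sum(aps) / len(aps)


# Toy example (hypothetical values): frame-level scores for two classes.
ap = average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
m_ap = mean_average_precision(
    [[0.9, 0.8, 0.3, 0.1], [0.2, 0.7]],
    [[1, 0, 1, 0], [0, 1]],
)
print(ap, m_ap)
```

For a triplet-level metric, a label would count as positive only when all the listed components of the triplet are correct, as stated above.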
4.0.3 Baselines:
We build two baseline models. The naive CNN baseline is a ResNet-18 backbone with two additional 3x3 convolutional layers and a fully connected (FC) classification layer with one unit per triplet class (128 units). The naive model learns the action triplets using their IDs, without any consideration of the components that constitute the triplets. We therefore also include an MTL baseline built with the I, V and T branches described in Section 3. The outputs of the three branches are concatenated and fed to an FC layer to learn the triplets. For fair comparison, the two baselines share the same backbone as Tripnet.
4.0.4 Quantitative Results:
Table 2 presents the AP results for the instrument detection across all triplets. The results show that the naive model does not understand the triplet components. This comes from the fact that it is designed to learn the triplets using their IDs: two different triplets sharing the same instrument or verb still have different IDs. On the other hand, the MTL and Tripnet networks, which both model the triplet components, show competitive performance on instrument detection. Moreover, Tripnet outperforms the MTL baseline by  mean AP. This can be attributed to its use of the CAG unit and the 3D interaction space to learn better semantic information about the instrument behaviors.
The triplet recognition performance is presented in Table 3. The naive CNN model again has the worst performance on the instrument-verb, instrument-target, and instrument-verb-target metrics, as expected from the previous results. The MTL baseline model, on the other hand, performs only slightly above the naive model despite its high instrument detection performance in Table 2. This is because the MTL baseline model, after learning the components of the triplets, dilutes this semantic information by concatenating the outputs and feeding them to an FC layer. However, Tripnet improves over the MTL baseline by leveraging the instrument cue from the CAG unit. It also learns better triplet association by increasing the  by  on average. Tripnet outperformed all the baselines in instrument-tissue interaction recognition by a minimum of . In general, it can be observed that it is easier to learn the instrument-verb components than the instrument-target components. This is likely due to the fact that (a) a verb has a more direct association to the instrument creating the action, (b) the dataset contains many more target classes than verb classes, and (c) many anatomical structures in the abdomen are difficult to discriminate for non-medical experts.
While the action recognition performance appears to be low, it follows the same pattern as other models in the computer vision literature on action datasets of even lesser complexity. For instance, on the HICO-DET dataset ,  achieves ,  achieves and  achieves action recognition AP, also known as . In fact, the current state-of-the-art performance on HICO-DET dataset is as reported on the leaderboard server. Similarly, the winner of the MICCAI 2019 subchallenge on action recognition, involving only four verb classes, scores F1-score. This shows the challenging nature of fine-grained action recognition.
4.0.5 Ablation Studies:
Table 4 presents an ablation study of the novel components of the Tripnet model. The CAG unit improves the and by approximately and , respectively, justifying the need for using instrument cues in the verb and target recognition. We also observe that learning the instrument-tissue interactions is better with a trainable 3D projection than with either the untrained 3D space or with an FC-layer. This results in a large improvement of the . We record the best performance in all four metrics by combining the CAG unit and the trained 3D interaction space. The two units complement each other and improve the results across all metrics.
4.0.6 Qualitative Results:
To better appreciate the performance of the proposed model in understanding instrument-tissue interactions, we overlay the predictions on several surgical images in Fig. 4. The qualitative results show that Tripnet not only improves on the performance of the baseline models, but also accurately localizes the regions of interest of the actions. We observe that the majority of incorrect predictions are due to a single incorrect triplet component. Instruments are usually correctly predicted and localized. As can be seen in the complete statistics provided in the supplementary material, it is however not straightforward to predict the verb/target directly from the instrument, due to the multiple possible associations. More qualitative results are included in the supplementary material.
In this work, we tackle the task of recognizing action triplets directly from surgical videos. Our overarching goal is to detect the instruments and learn their interactions with the tissues during laparoscopic procedures. To this end, we present a new dataset, which consists of 135K action triplets over 40 videos. For recognition, we propose a novel model that relies on instrument class activation maps to learn the verbs and targets. We also introduce a trainable 3D interaction space for learning the ⟨instrument, verb, target⟩ associations within the triplets. Experiments show that our model outperforms the baselines by a substantial margin in all the metrics, thereby demonstrating the effectiveness of the proposed approach.
This work was supported by French state funds managed within the Investissements d'Avenir program by BPI France (project CONDOR) and by the ANR (references ANR-11-LABX-0004 and ANR-16-CE33-0009). The authors would also like to thank the IHU and IRCAD research teams for their help with the data annotation during the CONDOR project.
References
- [1] Blum, T., Feußner, H., Navab, N.: Modeling and segmentation of surgical workflow from laparoscopic video. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 400–407 (2010)
- [2] Chakraborty, I., Elgammal, A., Burd, R.S.: Video based activity recognition in trauma resuscitation. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). pp. 1–8 (2013)
- [3] Chao, Y.W., Liu, Y., Liu, X., Zeng, H., Deng, J.: Learning to detect human-object interactions. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 381–389 (2018)
- [4] Dergachyova, O., Bouget, D., Huaulmé, A., Morandi, X., Jannin, P.: Automatic data-driven real-time segmentation and recognition of surgical workflow. International journal of computer assisted radiology and surgery 11(6), 1081–1089 (2016)
- [5] DiPietro, R., Ahmidi, N., Malpani, A., Waldram, M., Lee, G.I., Lee, M.R., Vedula, S.S., Hager, G.D.: Segmenting and classifying activities in robot-assisted surgery with recurrent neural networks. International journal of computer assisted radiology and surgery 14(11), 2005–2020 (2019)
- [6] DiPietro, R., Lea, C., Malpani, A., Ahmidi, N., Vedula, S.S., Lee, G.I., Lee, M.R., Hager, G.D.: Recognizing surgical activities with recurrent neural networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 551–558 (2016)
- [7] Funke, I., Jenke, A., Mees, S.T., Weitz, J., Speidel, S., Bodenstedt, S.: Temporal coherence-based self-supervised learning for laparoscopic workflow analysis. In: OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, pp. 85–93 (2018)
- [8] Gkioxari, G., Girshick, R., Dollár, P., He, K.: Detecting and recognizing human-object interactions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8359–8367 (2018)
- [9] Jin, Y., Li, H., Dou, Q., Chen, H., Qin, J., Fu, C.W., Heng, P.A.: Multi-task recurrent convolutional network with correlation loss for surgical video analysis. Medical image analysis 59, 101572 (2020)
- [10] Katić, D., Julliard, C., Wekerle, A.L., Kenngott, H., Müller-Stich, B.P., Dillmann, R., Speidel, S., Jannin, P., Gibaud, B.: Lapontospm: an ontology for laparoscopic surgeries and its application to surgical phase recognition. International journal of computer assisted radiology and surgery 10(9), 1427–1434 (2015)
- [11] Kitaguchi, D., Takeshita, N., Matsuzaki, H., Takano, H., Owada, Y., Enomoto, T., Oda, T., Miura, H., Yamanashi, T., Watanabe, M., et al.: Real-time automatic surgical phase recognition in laparoscopic sigmoidectomy using the convolutional neural network-based deep learning approach. Surgical Endoscopy pp. 1–8 (2019)
- [12] Lo, B.P., Darzi, A., Yang, G.Z.: Episode classification for the analysis of tissue/instrument interaction with multiple visual cues. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 230–237 (2003)
- [13] Loukas, C., Georgiou, E.: Smoke detection in endoscopic surgery videos: a first step towards retrieval of semantic events. The International Journal of Medical Robotics and Computer Assisted Surgery 11(1), 80–94 (2015)
- [14] Maier-Hein, L., Vedula, S., Speidel, S., Navab, N., Kikinis, R., Park, A., Eisenmann, M., Feussner, H., Forestier, G., Giannarou, S., et al.: Surgical data science: Enabling next-generation surgery. Nature Biomedical Engineering 1, 691–696 (2017)
- [15] Malpani, A., Lea, C., Chen, C.C.G., Hager, G.D.: System events: readily accessible features for surgical phase detection. International journal of computer assisted radiology and surgery 11(6), 1201–1209 (2016)
- [16] Mondal, S.S., Sathish, R., Sheet, D.: Multitask learning of temporal connectionism in convolutional networks using a joint distribution loss function to simultaneously identify tools and phase in surgical videos. arXiv preprint arXiv:1905.08315 (2019)
- [17] Neumuth, T., Strauß, G., Meixensberger, J., Lemke, H.U., Burgert, O.: Acquisition of process descriptions from surgical interventions. In: International Conference on Database and Expert Systems Applications. pp. 602–611 (2006)
- [18] Nwoye, C.I., Mutter, D., Marescaux, J., Padoy, N.: Weakly supervised convolutional LSTM approach for tool tracking in laparoscopic videos. International journal of computer assisted radiology and surgery 14(6), 1059–1067 (2019)
- [19] Qi, S., Wang, W., Jia, B., Shen, J., Zhu, S.C.: Learning human-object interactions by graph parsing neural networks. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 401–417 (2018)
- [20] Shen, L., Yeung, S., Hoffman, J., Mori, G., Fei-Fei, L.: Scaling human-object interaction recognition through zero-shot learning. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1568–1576 (2018)
- [21] Twinanda, A.P., Alkan, E.O., Gangi, A., de Mathelin, M., Padoy, N.: Data-driven spatio-temporal RGBD feature encoding for action recognition in operating rooms. International journal of computer assisted radiology and surgery 10(6), 737–747 (2015)
- [22] Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N.: Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging 36(1), 86–97 (2017)
- [23] Xu, B., Wong, Y., Li, J., Zhao, Q., Kankanhalli, M.S.: Learning to detect human-object interactions with knowledge. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
- [24] Zia, A., Hung, A., Essa, I., Jarc, A.: Surgical activity recognition in robot-assisted radical prostatectomy using deep learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 273–280 (2018)
- [25] Zisimopoulos, O., Flouty, E., Luengo, I., Giataganas, P., Nehme, J., Chow, A., Stoyanov, D.: Deepphase: surgical phase recognition in cataracts videos. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 265–272 (2018)
===== Supplementary Material =====
Accepted at the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020.
Appendix I: Co-occurrence Distribution of the Triplets