Modeling and recognition of surgical activities pose an interesting research problem as the community addresses the need for assistance and guidance through automation kumar. Although a number of recent works have studied automatic recognition of surgical activities, the generalizability of these works remains a challenge. Moreover, there is a growing need for representations with greater expressive power that can be used not only to recognize surgical activities but also in the control of autonomous systems.
Frame-based overall image cues have been widely used for automatic recognition of surgical activities. Although these studies, coupled with recent advances in machine learning and particularly deep learning, have achieved high accuracy, a major setback is that their performance is limited to the dataset they are modeled on. The generalizability across different tasks and different datasets remains a challenge. On the other hand, kinematic data captured from the surgeon and patient side manipulators, and derived trajectories, are also limited, as the publicly released kinematic datasets do not provide the kinematics of the end effectors and may not be sensitive enough.
Although frame-based image cues provide us with context such as the surgical setting, objects, and anatomical structures operated on, they are quite prone to overfitting, so generalizability across different tasks and different datasets remains a challenge. Mitchell mitchell gives an example to communicate this problem by studying a deep neural network trained to classify images on whether they contain an animal or not. She finds that, in part, the network learned to classify images with blurry backgrounds as “contains an animal”, as the dataset had a high number of wildlife photographs that focused on the animal subjects and blurred out the background. Although this network performed well, the representations learned were highly dependent on the background, and they were not representative of animal features that generalize across other datasets. Similarly, in terms of surgical activities, placing a Tie Knot might occur during a task of “suturing on tissue” and also during the more specific and challenging task of “Urethrovesical Anastomosis (UVA)” duygu, which involves stitching and reconnecting two anatomical structures together. If we rely heavily on image cues of the surgical scene, these surgical activities would have very different representations. In order to overcome the challenge of generalizability across different tasks and different datasets, we need to define more generic representations of surgical activities that are robust to scene variation.
To move towards more generalizable models that are robust to scene variation, many approaches using various modalities such as appearance, optical flow, depth, and skeleton have been proposed for activity recognition. The pose-based joint and skeleton representation suffers relatively little from intra-class variance when compared to image cues yao. Representations of human body joints and skeletons, and their dynamics, have been widely studied in activity recognition. These representations, and their trajectories, are robust to illumination change and scene variation. Although pose estimation of surgical tools has been studied Bouget2017 ; Kurmann2017 ; Rieke2015, these representations have not yet been used in surgical activity and gesture recognition. To our knowledge, our paper is the first to use representations of joints and skeletons in surgical activity recognition.
Shi et al. twostreamgcn state that previous methods in pose-based action recognition use rule-based parsing techniques to manually group joints based on the structural connectivity of the human body; the skeleton is structured as a sequence of joint-coordinate vectors to be fed into Recurrent Neural Networks (RNN) or Convolutional Neural Networks (CNN) krizhevsky. However, representing the skeleton data as a vector sequence or a 2D grid cannot fully express the spatial temporal dynamics between joints. Recently, graph convolutional networks (GCNs) have been successfully adopted for this problem to exploit the natural graph structure of the skeleton data.
In this paper, we introduce a modality independent of the scene, therefore robust to scene variation, based on spatial temporal graph representations that could be used either individually, in cascades or as a complementary modality for surgical activity recognition. To show the effectiveness of the modality we introduce, we propose to model and recognize surgical activities in surgical videos by first defining the graph representations of the surgical tools, and then learning hierarchical temporal relationships between these joints over time using Spatial Temporal Graph Convolutional Networks (ST-GCN) stgcn . Figure 1 shows an overview of spatial temporal graph representations of surgical tools.
2 Related Work
2.1 Surgical Activity Recognition
Ahmidi et al. jigsaws_benchmark conducted a comparative benchmark study on the recognition of gestures on the JIGSAWS dataset jigsaws. In this study, three main methods are chosen to classify surgical gestures: Bag of Spatio-Temporal Features (BoF), Linear Dynamical System (LDS) lin2005 ; lin2006 ; Leong2007 ; Varadarajan2009 ; Varadarajan2011, and a composite Gaussian Mixture Model-Hidden Markov Model (GMM-HMM). For BoF, features with high spatial and temporal texture variation are extracted with Space-Time Interest Points (STIP) Laptev2005. These cuboid features are then combined with additional features such as histograms of oriented gradients (HOG) Dalal2005 and histograms of optical flow (HOF) Dalal2006. A codebook is created, and the dimensionality of these visual representations is reduced via clustering. An SVM classifier is trained on the videos' histograms of codebook words. An LDS on both the image intensities and kinematic data is proposed for the same problem. In this approach, the video frames are modeled as the output of an LDS, then the pairwise distances between the LDS models are measured. Finally, a classifier is trained to predict the class of the gesture frames. In the same work, the composite GMM-HMM models each gesture as an elementary HMM where each state corresponds to one Gaussian Mixture Model (GMM) on kinematic data lingling. Studies that primarily use kinematic data have also been suggested. Ahmidi et al. Ahmidi2013 proposed segmenting and recognizing surgical gestures using similarity metrics on a temporal model of surgical tool motion trajectories defined with descriptive curve coding, which transforms them into a coded string. In addition to the benchmark, more recent works have also proposed using both video and kinematic data Tao2013 ; Zappella2013. Lea et al. Colin2015 proposed to use both modalities to perform segmentation and recognition using object cues and higher-order temporal relationships between action transitions, using a variation of the Conditional Random Field.
More recently, deep learning architectures krizhevsky have been proposed for this open research problem. DiPietro et al. DiPietro2016 proposed using Recurrent Neural Networks (RNN) trained on kinematic data for surgical gesture classification on the JIGSAWS dataset. Sarikaya et al. Sarikaya2018 proposed a multi-modal convolutional recurrent neural network architecture whose inputs are video data and motion cues (optical flow flow). They proposed to jointly learn surgical tasks and gestures with a multi-task learning approach. Their motivation for using motion cues as a joint modality is similar to ours: generalized models that are more robust to scene variation. However, the performance of optical flow can be affected by camera zoom and motion. Lea et al. Colin2016 proposed the Temporal Convolutional Network (TCN), which hierarchically captures temporal relationships at low, intermediate, and high-level time-scales for the segmentation of surgical activities, and Convolutional Action Primitives for multimodal time-series data including video data and kinematics to address the same problem. Funke et al. Funke2019 train 3D convolutional neural networks to learn spatiotemporal features. 3D CNNs apply 3D convolutions on the 3D temporal representation of the video instead of stacking 2D convolutions at each time frame; however, they are known to be difficult to train due to the explosion in the number of parameters, and they only marginally improve on the frame-based models that use image cues Wang2013. Other works addressed recognition of higher-level activities such as surgical phases using deep learning models Jin2018 ; Cadene2016 ; Twinanda2016.
2.2 Representations of Joints and Skeletons
Representations of human body joints and skeletons, and their dynamics, have been widely used in open research problems relating to video understanding and activity recognition. These representations, and their trajectories, are robust to illumination change and scene variation, and they are easy to obtain using depth sensors or pose estimation algorithms stgcn ; Shotton2011 ; Cao2017. Skeleton and joint representations of the hand have also been receiving attention in egocentric cameras and augmented reality systems, where interaction with the real world and timely response are crucial Tekin2019 ; Lepetit2019. Although pose estimation of surgical tools has been studied Bouget2017 ; Kurmann2017 ; Rieke2015, these representations have not yet been used for surgical activity recognition.
The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) jigsaws provides a public benchmark surgical activity dataset. In this video dataset, surgeons with varying expertise perform surgical tasks on the da Vinci Surgical System (dVSS). The dataset includes video data captured from endoscopic cameras of the dVSS at and at a resolution of . The videos are recorded during the performance of the tasks Suturing, Needle Passing and Knot Tying, and the dataset provides gesture labels, which are the smallest action units where the movement is intentional and is carried out towards achieving a specific goal such as Reaching for needle with right hand. The gestures form a common vocabulary for small action segments that the higher-level tasks can be broken down into. We perform our experiments on the subset of JIGSAWS for the Suturing task which provides different gestures.
3.2 Annotation of Surgical Tool Poses
For the pose annotation of surgical tools, we labeled a subset of the JIGSAWS dataset and then trained a deep residual network (ResNet50) ResNet to estimate the surgical tool poses in the rest of the dataset. We first extracted frames of each video at frames per second (fps), and then we clustered these frames using a simple clustering algorithm based on frame similarity. We picked the most distinguishable frames based on these clusters to use for annotation. In other words, for each video we labeled only frames. Before moving to annotation, we first defined joints to capture the structure of a surgical grasper tool. Figure 4 shows these joints, which include the arm, the joint that connects the arm and the tool, the tool and its end effectors (grasper pair). We used DeepLabCut DeepLabCut to mark the 2D coordinates of these joints on the video frames. DeepLabCut provides tools to define a skeleton and its joints, a graphical user interface (GUI) to label these joints, and efficient methods for automatic pose estimation based on transfer learning with deep neural networks. Using these deep learning based methods, it is possible to obtain results that match human labeling accuracy with minimal training data (typically- frames).
We trained the ResNet50 network on the annotated data, starting the learning process with transferred weights learned from ImageNet. Then, using the learned model, we estimated the pose coordinates for the rest of the frames. We automatically removed some outliers based on the pose coordinates in consecutive frames. Please note that we intentionally used frames from the same videos in both training and testing in this step, for efficient labeling with minimal effort. Automatic pose estimation of robotic and laparoscopic tools is one of the better solved problems in surgical video understanding, as existing methods achieve high accuracy Bouget2017 ; Kurmann2017 ; Rieke2015. These methods are also more generalizable, as the tools used are usually standard. It is therefore possible to train a model and estimate poses in an unseen video using state-of-the-art models that have achieved high pose estimation accuracy. However, since pose estimation is not the focus of our paper, we opted for an efficient, quick solution, and there is room for improvement.
3.3 Preprocessing videos as Input to ST-GCN
We used the pose estimations and confidence scores of each joint, collected using our trained annotation network, to define our spatial temporal graph representations. So, for each frame we have the pose estimations and the confidence scores as input. We defined video segments $V_t = (v_{t-15}, \ldots, v_{t-1}, v_t)$ of consecutive frames at fps which equals seconds. We set the gesture label of each segment to the gesture label of the frame at time . We collected these video segments in a sliding window manner with a step size of frames, and we used these segments as data input to our network. For the initial segments, we pad the beginning of the video segment by copying the first frame until the segment reaches the size of frames.
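The sliding-window segmentation with front padding described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `make_segments`, the default window of 16 frames (read off the index range t−15 to t) and the `step` parameter are assumptions, and `frames` stands for any per-frame pose representation.

```python
from typing import List, Sequence, Tuple


def make_segments(frames: Sequence, labels: Sequence,
                  window: int = 16, step: int = 1) -> List[Tuple[list, object]]:
    """Collect (segment, label) pairs with a sliding window ending at frame t.

    `window` and `step` are illustrative parameters (assumptions); each
    segment takes the gesture label of its last frame.
    """
    segments = []
    for t in range(0, len(frames), step):
        start = t - window + 1
        if start >= 0:
            segment = list(frames[start:t + 1])
        else:
            # Not enough past frames yet: pad the beginning of the segment
            # by copying the first frame until the window is full.
            segment = [frames[0]] * (-start) + list(frames[:t + 1])
        segments.append((segment, labels[t]))
    return segments


if __name__ == "__main__":
    toy_frames = list(range(5))               # stand-ins for per-frame poses
    toy_labels = ["G1", "G1", "G2", "G2", "G3"]
    for segment, label in make_segments(toy_frames, toy_labels, window=3, step=2):
        print(segment, label)
```

Because the padding copies the first frame, a prediction is available from the very first frame of a video rather than only after a full window has been observed.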
4 Material and Methods
4.0.1 Spatial Temporal Graph Construction
For each video segment of ( seconds) consecutive frames, we construct an undirected spatial temporal graph G = (V, E) to form a hierarchical representation of the joints over temporal sequences of frames. First, for each frame, we define a node for each joint. Each node keeps the 2D coordinates of the corresponding joint, as well as the estimation confidence of that joint. We construct the spatial graphs by connecting these nodes with edges according to the connectivity of the surgical tool structure (Figure 4). Then, for the temporal part, we connect each joint to the same joint in the consecutive frame, forming inter-frame edges that represent the trajectory of the joint over time. The final representation of the spatial temporal graph can be seen in Figures 5 and 1.
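As a rough sketch of this construction, the snippet below builds the edge set of an undirected spatial temporal graph from a list of intra-frame tool edges. The joint count and the connectivity used in the example are hypothetical placeholders (the actual skeleton is the one defined in Figure 4), and `build_st_graph` is an illustrative name.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def build_st_graph(num_frames: int, num_joints: int,
                   tool_edges: List[Edge]) -> Tuple[int, Set[Edge]]:
    """Return (number of nodes, undirected edge set).

    Node (t, j), joint j in frame t, is stored as the integer
    t * num_joints + j. Each undirected edge is stored once as (low, high).
    """
    def node(t: int, j: int) -> int:
        return t * num_joints + j

    edges: Set[Edge] = set()
    for t in range(num_frames):
        # Spatial (intra-frame) edges follow the tool's connectivity.
        for a, b in tool_edges:
            u, v = sorted((node(t, a), node(t, b)))
            edges.add((u, v))
        # Temporal (inter-frame) edges link each joint to itself in the next frame.
        if t + 1 < num_frames:
            for j in range(num_joints):
                edges.add((node(t, j), node(t + 1, j)))
    return num_frames * num_joints, edges


if __name__ == "__main__":
    # Hypothetical 5-joint tool: arm(0) - connector(1) - tool(2) - two grasper tips(3, 4).
    hypothetical_edges = [(0, 1), (1, 2), (2, 3), (2, 4)]
    n_nodes, st_edges = build_st_graph(3, 5, hypothetical_edges)
    print(n_nodes, len(st_edges))
```

With T frames, J joints and S spatial edges per frame, the graph has T·J nodes and T·S + (T−1)·J edges, which is a quick sanity check on any implementation.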
4.0.2 Spatial Temporal Graph Convolution Network
After a spatial graph based on the connections of joints of the surgical tool and the temporal edges between corresponding joints in consecutive frames are defined, a distance-based sampling function is proposed for constructing the graph convolutional layer, which is then employed to build the final spatial temporal graph convolutional network (ST-GCN) stgcn ; twostreamgcn . ST-GCN then applies multiple layers of spatial temporal convolutions on the neighbouring spatial and temporal nodes in the spatial temporal graph input, similar to the workings of Convolutional Neural Networks (CNN) assuming an image as a regular 2D grid graph.
The pose estimations and the adjacency matrix of the skeleton of the surgical tools are used as input to our ST-GCN. A batch normalization layer is applied to normalize the data. The ST-GCN model is composed of 9 layers of spatial temporal graph convolution operators (ST-GCN units). In order to apply convolutions, ST-GCN proposes partitioning strategies for constructing convolution operations on graphs. We use spatial configuration partitioning, where the nodes are labeled according to their distances to the gravity center of the surgical tool skeleton. The gravity center of the surgical tool skeleton is chosen as the average coordinate of all joints in the skeleton at a frame. According to this partitioning strategy, nodes that are spatially closer to the skeleton gravity center than the node where the convolution is applied (root) have shorter distances, while nodes that are further away have longer distances. We also adopt the learnable edge importance weighting described by Yan et al. stgcn. Using these convolutions, hierarchical representations are learned that capture the spatial and temporal dynamics of surgical activities. Following multiple layers of graph convolutions and pooling, a softmax layer is applied, which gives a probability distribution over the surgical gesture labels Jasani2019.
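The spatial configuration partitioning can be illustrated for a single frame as below. This is a sketch under assumptions: the label convention (0 for the root itself, 1 for neighbours closer to the gravity center than the root, 2 for the remaining neighbours) follows the usual ST-GCN description, ties are placed in the last group for simplicity, and the function name and toy skeleton are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def partition_labels(joints: List[Point],
                     neighbors: Dict[int, List[int]]) -> Dict[Tuple[int, int], int]:
    """Label each (root, neighbour) pair for one frame.

    0: the root node itself
    1: neighbour closer to the gravity center than the root
    2: neighbour at the same distance or further away (simplifying assumption)
    """
    # Gravity center = average coordinate of all joints in the frame.
    cx = sum(x for x, _ in joints) / len(joints)
    cy = sum(y for _, y in joints) / len(joints)
    dist = [math.hypot(x - cx, y - cy) for x, y in joints]

    labels: Dict[Tuple[int, int], int] = {}
    for root, nbrs in neighbors.items():
        labels[(root, root)] = 0
        for n in nbrs:
            labels[(root, n)] = 1 if dist[n] < dist[root] else 2
    return labels


if __name__ == "__main__":
    # Hypothetical 5-joint chain laid out on a line.
    toy_joints = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
    toy_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    print(partition_labels(toy_joints, toy_neighbors))
```

Each label group then receives its own weight matrix in the graph convolution, so the network can treat joints moving towards and away from the tool's center differently.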
5 Experiments and Evaluation
We carried out our experiments with a TITAN X (Pascal architecture) GPU and an Intel Xeon (R) CPU E5 3.50 GHz x8 with 31.2 GiB of memory. All experiments are conducted with the PyTorch deep learning framework pytorch. We train the ST-GCN for epochs with the stochastic gradient descent (SGD) optimization algorithm with a base learning rate of , and we decrease the learning rate using a step approach by dividing it by at every epochs. We set the weight decay to . In order to avoid overfitting, we use random dropout with probability. We also perform data augmentation: first, we perform random affine transformations, which apply random combinations of different angle, translation and scaling factors to the skeleton sequences of all consecutive frames. Second, we randomly sample fragments from the skeleton sequences of consecutive frames.
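A random affine augmentation of a skeleton sequence might look like the sketch below. It is only one possible reading of the description: a single rotation, scale and translation is drawn per segment and applied to every frame, and the parameter ranges and the name `random_affine` are assumptions, not values from the paper.

```python
import math
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def random_affine(sequence: List[List[Point]],
                  max_angle_deg: float = 10.0,
                  max_shift: float = 0.1,
                  scale_range: Tuple[float, float] = (0.9, 1.1),
                  rng: Optional[random.Random] = None) -> List[List[Point]]:
    """Apply one randomly drawn rotation, scaling and translation to all frames.

    All ranges are illustrative defaults (assumptions). Coordinates are
    assumed to be normalized so that a shift of 0.1 is meaningful.
    """
    rng = rng or random.Random()
    theta = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    scale = rng.uniform(*scale_range)
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [
        [(scale * (cos_t * x - sin_t * y) + tx,
          scale * (sin_t * x + cos_t * y) + ty) for x, y in frame]
        for frame in sequence
    ]


if __name__ == "__main__":
    toy_sequence = [[(0.0, 0.0), (1.0, 0.0)], [(0.1, 0.0), (1.1, 0.0)]]
    print(random_affine(toy_sequence, rng=random.Random(0)))
```

Using the same transform for every frame of a segment keeps the joint trajectories consistent over time; drawing a new transform per frame would inject artificial motion.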
For testing, we use the Leave-one-user-out (LOUO) experimental split which is provided by the JIGSAWS dataset. In the LOUO setup for cross-validation, there are eight folds, each one consisting of data from one of the eight subjects. The LOUO setup can be used to evaluate the robustness of a method when a subject is not previously seen in the training data. We test our model for activity recognition following the LOUO split for the eight folds, then average our accuracy. We predict the gesture label at every frames that is, frames per second. We compared the results of our model with the JIGSAWS Benchmark studies that use kinematic data and/or video data as input, and the more recent Convolutional Neural Network based studies that use video data (frame-based image cues) (Table 1).
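The LOUO protocol can be sketched as a small helper that builds one fold per subject and averages the per-fold accuracies. JIGSAWS ships its own split files, so this is only an illustration of the idea; the trial identifiers and function names below are hypothetical.

```python
from typing import Dict, List, Sequence


def louo_folds(subject_of_trial: Dict[str, str]) -> Dict[str, Dict[str, List[str]]]:
    """Build one fold per subject: that subject's trials are held out for testing."""
    folds = {}
    for held_out in sorted(set(subject_of_trial.values())):
        folds[held_out] = {
            "train": sorted(t for t, s in subject_of_trial.items() if s != held_out),
            "test": sorted(t for t, s in subject_of_trial.items() if s == held_out),
        }
    return folds


def mean_accuracy(per_fold_accuracy: Sequence[float]) -> float:
    """Average the accuracies obtained on the individual folds."""
    return sum(per_fold_accuracy) / len(per_fold_accuracy)


if __name__ == "__main__":
    # Hypothetical trial-to-subject mapping, for illustration only.
    toy_trials = {"trial_B001": "B", "trial_B002": "B",
                  "trial_C001": "C", "trial_D001": "D"}
    for subject, split in louo_folds(toy_trials).items():
        print(subject, split)
```

Since no trial of the held-out subject appears in the training set of its fold, the averaged accuracy reflects performance on previously unseen users.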
We perform all our experiments on the “Suturing” task subset of the JIGSAWS dataset, where the chance baseline for gesture recognition is (there are different gestures available). Our results demonstrate average accuracy on this dataset which suggests a significant improvement. Our experimental results show that learned spatial temporal graph representations of surgical videos perform well in terms of recognizing low-level surgical activities (gestures) even when used individually.
| Method | Delay | Accuracy (%) |
| JIGSAWS Benchmark jigsaws_benchmark | | |
| GMM-HMM (76 dim. kinematic data) | | 73.95 |
| CNN based models (Evaluation at 10 fps) | | |
| S-CNN (video) Colin2016 | 1 s | 74.0 |
| ST-CNN (video) Colin2016 | 10 s | 77.7 |
| 2D ResNet-18 (video) ResNet | 0 s | 79.5 |
| 3D CNN (K) + window (video) Funke2019 | 3 s | 84.3 |
| ST-GCN (Evaluation at 10 fps) | | |
| Ours (2D joint pose estimations (X,Y coordinates)) | 0 s | 68* |
For the initial segments, we pad the beginning of the video segment by copying the first frame until it reaches the size of frames, instead of delaying the prediction until the required number of frames is reached.
* Our results should be viewed while keeping in mind that our scope is to define more generic representations of surgical activities that are robust to scene variation and that can potentially better generalize over different tasks and across datasets. We introduce a new modality based on hierarchical spatial temporal graph representations of surgical tool structures and joints, and the spatial and temporal dynamics of these joints. The only input we use is the pose estimations of the surgical tool joints, that is 2D joint pose estimations (X,Y coordinates) for each surgical tool and the estimation confidence scores of each joint. These learned representations can be used either individually, in cascades or as a complementary modality in surgical activity recognition.
It should also be noted that there is room for improvement in our model at a couple of steps. First, we labeled only frames for each video while creating our training dataset annotations. With more annotated data and a state-of-the-art surgical tool pose estimation algorithm, we could achieve higher accuracy for pose estimation. Second, since a separate validation split is not provided by the JIGSAWS dataset, we were not able to tune our hyperparameters; instead, we opted for minimizing the training loss using the Stochastic Gradient Descent (SGD) optimization algorithm with a learning rate that decreases by a factor of every epochs.
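The step-wise learning rate decay mentioned here amounts to dividing the base rate by a fixed factor once per block of epochs. The sketch below shows the arithmetic only; `step_lr` is an illustrative name and the values in the example are hypothetical, not the ones used in the experiments.

```python
def step_lr(base_lr: float, epoch: int, step_size: int, factor: float) -> float:
    """Learning rate at a given (0-indexed) epoch under step decay.

    The rate is divided by `factor` once for every completed block of
    `step_size` epochs.
    """
    return base_lr / (factor ** (epoch // step_size))


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the schedule.
    for epoch in (0, 9, 10, 25):
        print(epoch, step_lr(base_lr=0.1, epoch=epoch, step_size=10, factor=10.0))
```

In PyTorch the same behaviour is available through `torch.optim.lr_scheduler.StepLR`, whose `gamma` argument is the multiplicative factor, i.e. the reciprocal of `factor` above.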
Modeling and recognition of surgical activities pose an interesting research problem as the community addresses the need for assistance and guidance through automation. Although a number of recent works have studied automatic recognition of surgical activities, the generalizability of these works across different tasks and different datasets remains a challenge. In order to overcome this challenge, we need to define more generic representations of surgical activities that are robust to scene variation. Kinematic data captured from the surgeon and patient side manipulators, and derived trajectories, are also limited, as the publicly released kinematic datasets do not provide the kinematics of the end effectors and may not be sensitive enough. The pose-based joint and skeleton representations, on the other hand, suffer relatively little from intra-class variance when compared to image cues yao. Although pose estimation of surgical tools has been studied Bouget2017 ; Kurmann2017 ; Rieke2015, these representations have not yet been used in surgical activity and gesture recognition.
In this paper, we introduce a modality independent of the scene, therefore robust to scene variation, based on spatial temporal graph representations of surgical tool structures and joints. To our knowledge, our paper is the first to use representations of joints and skeletons in surgical activity recognition. To show the effectiveness of the modality we introduce, we propose to model and recognize surgical activities in surgical videos. We first construct a spatial temporal graph on pose estimation of surgical tool joints to form a hierarchical representation of these joints over temporal sequences of frames. We then learn hierarchical temporal relationships between these joints over time using Spatial Temporal Graph Convolutional Networks (ST-GCN) stgcn by exploiting the natural graph structure of skeleton data and the structural connectivities of joints.
Our experimental results show that learned spatial temporal graph representations of surgical videos perform well in terms of recognizing low-level surgical activities (gestures) even when used individually. We evaluate our model on the “Suturing” task of the JIGSAWS dataset, where the chance baseline for gesture recognition is (there are different gestures available). Our results demonstrate average accuracy on this dataset which suggests a significant improvement. These learned spatial temporal graph representations can be used either individually, in cascades or as a complementary modality in surgical activity recognition, and therefore provide a benchmark for future studies. Moreover, the expressive power of these graph representations can potentially be used in the control of autonomous systems, bridging the gap between recognition and control.
Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent This article does not contain patient data.
Acknowledgements. This work was supported by French state funds managed within the Investissements d’Avenir program by BPI France (project CONDOR).
- (1) Kumar S, Ahmidi N, Hager G, Singhal P, Corso JJ, Krovi V (2015) Surgical performance assessment, ASME Dynamics Systems and Control Magazine, 3(3):7-10
- (2) Mitchell M (2019) Artificial Intelligence: A Guide for Thinking Humans, Farrar, Straus and Giroux
- (3) Sarikaya D, Corso JJ, Guru KA (2017) Detection and localization of robotic tools in robot-assisted surgery videos using deep neural networks for region proposal and detection. IEEE Transactions on Medical Imaging 36(7):1542-1549
- (4) Yan S, Xiong Y, Lin D (2018) Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition, The Association for the Advancement of Artificial Intelligence (AAAI)
- (5) Ahmidi N, Tao L, Sefati S, Gao Y, Lea C, Haro BB, Zappella L, Khudanpur S, Vidal R, Hager GD (2017) A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery. IEEE Transactions on Biomedical Engineering
- (6) Lin HC, Shafran I, Murphy TE, Okamura AM, Yuh DD, Hager GD (2005) Automatic detection and segmentation of robot-assisted surgical motions. In: Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 3749:802-810
- (7) Lin HC, Shafran I, Yuh D, Hager GD (2006) Towards automatic skill evaluation: detection and segmentation of robot-assisted surgical motions. Computer Aided Surgery 11:220-230
- (8) Leong JJH, Nicolaou M, Atallah L, Mylonas GP, Darzi AW, Yang GZ (2006) HMM assessment of quality of movement trajectory in laparoscopic surgery. In: Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 4190:752-759
- (9) Yang GZ, Varadarajan B, Reiley C, Lin H, Khudanpur S, Hager G (2009) Data-derived models for segmentation with application to surgical assessment and training. In: Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 5761:426-434
- (10) Varadarajan B (2011) Learning and inference algorithms for dynamical system models of dextrous motion. PhD thesis, Johns Hopkins University
- (11) Laptev I (2005) On space-time interest points, International Journal of Computer Vision (IJCV), 64(2):107-123
- (12) Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1:886-893
- (13) Dalal N, Triggs B, Schmid C. (2006) Human detection using oriented histograms of flow and appearance, European Conference on Computer Vision (ECCV), 428-441
- (14) Tao L, Elhamifar E, Khudanpur S, Hager GD, Vidal R (2012) Sparse hidden markov models for surgical gesture classification and skill evaluation. In: Proc. International Conference on Information Processing in Computer-Assisted Interventions (IPCAI) 167-177.
- (15) Ahmidi N, Gao Y, Vedula SS, Khudanpur S, Vidal R, Hager GD (2013) String Motif-Based Description of Tool Motion for Detecting Skill and Gestures in Robotic Surgery, Medical Image Computing and Computer-Assisted Interventions (MICCAI), 26-33
- (16) Gao Y, Vedula SS, Reiley CE, Ahmidi N, Varadarajan B, Lin HC, Tao L, Zappella L, Bejar B, Yuh DD, Chen CCG, Vidal R, Khudanpur S, Hager GD (2014) The JHU-ISI gesture and skill assessment working set (JIGSAWS): a surgical activity dataset for human motion modeling. In: Proc. Modeling Monitor. Comput. Assist. Interventions (MCAI)
- (17) Tao L, Zappella L, Hager GD, Vidal R (2013) Surgical Gesture Segmentation and Recognition, Medical Image Computing and Computer-Assisted Interventions (MICCAI), 8151:339-346
- (18) Zappella L, Béjar B, Hager GD, Vidal R (2013) Surgical gesture classification from video and kinematic data, Medical Image Analysis 17(7):732-745
- (19) Lea C, Hager GD, Vidal R (2015) An Improved Model for Segmentation and Recognition of Fine-Grained Activities with Application to Surgical Training Tasks, IEEE Winter Conference on Applications of Computer Vision, 1123-1129
- (20) Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proc. Adv. Neural Information Processing Systems (NIPS) 1-2
- (21) DiPietro R, Lea C, Malpani A, Ahmidi N, Vedula SS, Lee GI, Lee MR, Hager GD (2016) Recognizing surgical activities with recurrent neural networks. In: Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 551-558
- (22) Sarikaya D, Khurshid AG, Corso JJ (2018) Joint Surgical Gesture and Task Classification with Multi-Task and Multimodal Learning, arXiv(cs.CV):1805.00721
- (23) Brox T, Bruhn A, Papenberg N, Weickert J (2004) High accuracy optical flow estimation based on a theory for warping, European Conference on Computer Vision (ECCV) 25-36
- (24) Lea C, Vidal R, Reiter A, Hager GD, Hua G, Jégou H (2016) Temporal Convolutional Networks: A Unified Approach to Action Segmentation, European Conference on Computer Vision (ECCV) Workshops, 47-54
- (25) Funke I, Bodenstedt S, Oehme F, von Bechtolsheim F, Weitz J, Speidel S (2019) Using 3D Convolutional Neural Networks to Learn Spatiotemporal Features for Automatic Surgical Gesture Recognition in Video, International Conference on Medical Image Computing and Computer-Assisted Interventions (MICCAI), 467-475
- (26) Wang H, Schmid C (2013) Action recognition with improved trajectories, International Conference on Computer Vision
- (27) Jin Y, Dou Q, Chen H, Yu L, Qin J, Fu C, Heng P (2018), SV-RCNet: Workflow Recognition From Surgical Videos Using Recurrent Convolutional Network, IEEE Transactions on Medical Imaging (TMI), 37(5):1114-1126
- (28) Cadéne R, Robert T, Thome N, Cord M (2016) M2CAI workflow challenge: convolutional neural networks with time smoothing and hidden markov model for video frames classification. Computing Research Repository (CoRR) abs:1610.05541
- (29) Twinanda AP, Mutter D, Marescaux J, Mathelin M, Padoy N (2016) Single and multi-task architectures for surgical workflow challenge. In: Proc. Workshop and Challenges on Modeling and Monitoring of Computer Assisted Interventions (M2CAI) at Medical Image Computing and Computer-Assisted Interventions (MICCAI)
- (30) Shotton J, Sharp T, Kipman A, Fitzgibbon A, Finocchio M, Blake A, Cook M, Moore R (2011) Real-time human pose recognition in parts from single depth images, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)
- (31) Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2d pose estimation using part affinity fields, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)
- (32) Tekin B, Bogo F, Pollefeys M (2019) H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions, Computer Vision and Pattern Recognition (CVPR)
- (33) Oberweger M, Wohlhart P, Lepetit V (2019) Generalized Feedback Loop for Joint Hand-Object Pose Estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)
- (34) Bouget D, Allan M, Stoyanov D, Jannin P (2017) Vision-based and marker-less surgical tool detection and tracking: a review of the literature, Medical image analysis, 35:633-654
- (35) Kurmann T, Márquez-Neila P, Du X, Fua P, Stoyanov D, Wolf S, Sznitman R (2017) Simultaneous Recognition and Pose Estimation of Instruments in Minimally Invasive Surgery, CoRR abs/1710.06668, arXiv:1710.06668
- (36) Rieke N, Tan DJ, Alsheakhali M, Tombari F, di San Filippo CA, Belagiannis V, Eslami A, Navab N (2015) Surgical Tool Tracking and Pose Estimation in Retinal Microsurgery, Medical Image Computing and Computer-Assisted Interventions (MICCAI), 266-273
- (37) Yao A, Gall J, Van Gool L (2012) Coupled action recognition and pose estimation from multiple views, International Journal of Computer Vision (IJCV), 100(1):16-37
- (38) He K, Zhang X, Ren S, Sun J (2015) Deep Residual Learning for Image Recognition, CoRR abs/1512.03385, arXiv:1512.03385
- (39) Mathis A, Mamidanna P, Cury KM, Abe T, Murthy VN, Mathis MW, Bethge M (2018) DeepLabCut: markerless pose estimation of user-defined body parts with deep learning, Nature Neuroscience
- (40) Kim UK, Lee DH, Moon H, Koo J, Choi H (2014) Design and realization of grasper-integrated force sensor for minimally invasive robotic surgery, IEEE/RSJ International Conference on Intelligent Robots and Systems, 4321-4326
- (41) Shi L, Zhang Y, Cheng J, Lu H (2019) Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)
- (42) Jasani B, Mazagonwalla A (2019) Skeleton based Zero Shot Action Recognition in Joint Pose-Language Semantic Space, ArXiv:abs/1911.11344
- (43) Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library Advances in Neural Information Processing Systems 32, NEURIPS, 8024-8035