Video understanding of robot-assisted surgery (RAS) videos is an active research area. Modeling the gestures and skill levels of surgeons addresses needs identified by the community, such as automation and early assessment of technical competence for surgeons in training. We approach video understanding of RAS videos as the problem of modeling the motions of surgeons. By analyzing scene and object features, motion, low-level surgical gestures, and the transitions among gestures, we can model the activities taking place during surgical tasks lingling . The insights drawn may be applied to effective skill acquisition, objective skill assessment, real-time feedback, and human-robot collaborative surgeries duygu .
We propose to model low-level surgical gestures (recurring common activity segments, as described by Gao et al. jigsaws ) and the surgical tasks composed of these gestures with a multimodal, multi-task learning approach. We use two modalities, visual features and motion cues (optical flow), and learn the temporal dynamics jointly. We argue that surgical tasks are better modeled by visual features, which are determined by the objects in the scene, while low-level gestures are better modeled by motion cues. For example, the gesture Positioning the needle might take place in both the Suturing and Needle Passing tasks jigsaws . Or, at a higher level of actions, placing a Tie Knot might occur both during suturing on a tissue model and during the more specific and challenging task of Urethrovesical Anastomosis (UVA) duygu , which involves stitching two parts back together. If we relied only on visual features of the objects and scenes, these smaller activity segments would have very different representations. Motion cues of low-level gestures, on the other hand, are independent of object and scene features. This helps us better classify common low-level gestures that are generic in nature and that reoccur in different tasks with different objects and under different settings, and it also helps reduce overfitting. However, we believe visual features remain complementary to motion cues: the gesture Positioning the needle might take place in both Suturing and Needle Passing, but it is unlikely to occur in Knot Tying, where a needle is not used.
In this paper, we address the problem of simultaneously classifying common low-level gestures and surgical tasks. While surgical task recognition, which is characteristically defined by the scene and the objects, is a relatively easy task, low-level gesture recognition remains a challenge. We focus on this challenging task by exploiting the complex relationships between visual and motion cues, and the relationships between surgical activities at different levels of complexity. We propose a novel architecture that simultaneously classifies common low-level gestures and surgical tasks by making use of these joint relationships, combining shared and task-specific representations with a multi-task learning approach. Moreover, our architecture supports multimodal learning and is trained on both visual features and motion cues to achieve better performance. An overview of our method is shown in Figure 1. We use RGB frames as input for visual features and the RGB representation of the optical flow of the same frames for motion cues. We extract high-level features from these inputs using convolutional neural networks that we train on each task separately, and use them as input to our recurrent joint model. After we convolve the two streams of input modality pairs, we concatenate the higher-level features and use them as input to our recurrent model, which learns the temporal dynamics of consecutive frames.
2 Related Work
Surgical activity recognition is an active research area that receives considerable attention from the computer vision and medical imaging communities. The potential of autonomous or human-robot collaborative surgeries, automated real-time feedback, and guidance and navigation during surgical tasks is exciting for the community. The methods addressing this open research problem range from SVM classification with Linear Discriminant Analysis (LDA) lin2005 ; lin2006 and Gaussian mixture models (GMMs) leong ; yang ; varadarajan to, more recently, convolutional and recurrent neural networks m2cai_smooth ; colin ; m2cai_lstm .
Ahmidi et al. jigsaws_benchmark present a comparative benchmark study on gesture recognition on the JIGSAWS dataset. In this study, three main methods are chosen to classify surgical gestures: Bag of Spatio-Temporal Features (BoF), Linear Dynamical Systems (LDS), and a composite Gaussian Mixture Model-Hidden Markov Model (GMM-HMM). HMMs are often used in studies that incorporate additional sensor modalities sensor1 ; sensor2 , and to classify high-level surgical activities sensor3 .
Deep neural networks have been introduced to activity recognition tasks simonyan ; lrcn . In the medical field, however, these advances have only very recently been explored. DiPietro et al. colin propose using Recurrent Neural Networks (RNNs) trained on kinematic data for surgical gesture classification on the JIGSAWS dataset. Other recent studies focus on classifying surgery phases, a higher level of surgical activity that comprises a sequence of different tasks. A recent work by Cadéne et al. m2cai_smooth uses deep residual networks to extract visual features of the video frames and then applies temporal smoothing with averaging; the authors finally model the transitions between surgery phases with an HMM. Twinanda et al. m2cai_lstm classify surgery phases by first extracting visual features of video frames via a CNN and then passing them to an SVM to compute the confidence of a frame belonging to a surgery phase. These confidences are used as inputs to an HMM and, separately, to an LSTM network.
HMMs are often used in surgical activity recognition as they capture the temporal dynamics of gestures and the relationships between surgical phases. Although some studies rely on visual features of surgical video frames, a large number of promising studies rely heavily on kinematic data. The problem with capturing kinematic data is that it requires additional equipment, and because it is sampled over discrete periods, it might miss important motion information. In the latter case, the kinematic data must be preprocessed by interpolation to estimate the missing motion information. Kinematic data also requires equipment-specific preprocessing based on the attributes of the device that captured it. Visual features, on the other hand, when used alone, are highly dependent on objects and scenes, and might miss important motion cues.
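The interpolation step mentioned above can be sketched with a minimal numpy example; the function name and the NaN marker for missing samples are our own illustrative conventions, not part of any dataset format:

```python
import numpy as np

def interpolate_kinematics(t, values):
    """Fill gaps in a 1-D kinematic signal by linear interpolation
    between the surrounding valid readings.

    t      -- sample timestamps, shape (N,)
    values -- signal with missing samples marked as NaN, shape (N,)
    """
    t = np.asarray(t, dtype=float)
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    filled = values.copy()
    # Estimate missing readings from the valid neighbors on either side.
    filled[missing] = np.interp(t[missing], t[~missing], values[~missing])
    return filled

signal = [0.0, np.nan, 2.0, np.nan, 4.0]
print(interpolate_kinematics([0, 1, 2, 3, 4], signal))  # [0. 1. 2. 3. 4.]
```

Linear interpolation is only one choice; higher-order schemes may suit fast tool motions better.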
In this paper, we address the video understanding problem of activity classification in Robot-assisted Surgery (RAS) videos. We simultaneously classify common low-level gestures and surgical tasks. We focus on modeling the common low-level gestures that are generic and that reoccur in different tasks. We believe that the low-level gestures are better modeled by motion cues, while the surgical tasks are better modeled by the visual features that are characteristically defined by the objects in the scene. However, we also believe that visual features are complementary to the motion cues, which are independent of object and scene features.
Many recent studies have shown that exploiting relationships across different tasks, jointly reasoning over multiple tasks ubernet ; mingsheng ; yao , and taking advantage of a combination of shared and task-specific representations crossstitch perform substantially better than their single-task counterparts.
We propose using recurrent neural networks rnn , specifically long short-term memory (LSTM) networks lstm , to model complex temporal dynamics and sequences. We aim to model the sequences of gestures occurring in surgical task videos with a model that is deep over the temporal dimension. Our architecture is based on the principles of LRCN lrcn , a specialized LSTM network that jointly learns temporal dynamics and visual features extracted by convolutional network models. Our model supports multimodal learning and takes as input RGB frames and the RGB representation of the optical flow of the same frames, capturing motion cues. Our architecture simultaneously classifies low-level gestures and surgical tasks with a multi-task learning approach, making use of joint relationships and combining shared and task-specific representations. In this sense, our model is truly end-to-end and novel in supporting multimodal and multi-task learning for surgical low-level gesture classification.
3 Dataset
The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) jigsaws is a public benchmark surgical activity dataset. In this video dataset, surgeons with varying expertise perform surgical tasks on the daVinci Surgical System (dVSS) (Figure 3). The dataset includes video data recorded during the performance of three tasks, Suturing, Needle Passing and Knot Tying, and provides low-level gesture labels: the smallest action units in which the movement is intentional and carried out towards a specific goal. A surgical gesture is a small yet meaningful and purposeful segment of activity into which the tasks can be broken, and the gestures reoccur in various tasks. For example, the gesture G2: Positioning needle can take place in both the Suturing and Needle Passing tasks, although the objects and the scene in these two tasks are quite different; while Suturing involves placing sutures on a tissue model, Needle Passing takes place on a set of metal hoops jigsaws . The gestures form a common vocabulary of small action segments that reoccur across tasks. We focus on recognizing these low-level gestures even when they reoccur across different tasks with different accompanying objects and on different sets.
We first clipped the task videos into gesture clips using the annotation files provided with the dataset: we converted the frame information to time and clipped the tasks into gesture segments. We then extracted frames per second from these videos and resized the frames to . We computed the optical flow of the clipped videos with the method of Brox et al. flow and transformed this information into RGB representation images. Although there are gestures in the defined gesture vocabulary, no video data is available for : Pulling Suture with right hand, so we only have gesture action labels. Most studies on the JIGSAWS dataset use only the gestures through colin ; jigsaws_benchmark , resulting in gestures; we, however, experiment on all of the available gesture labels. We rename our data of 14 gestures, excluding the empty set of . For the remainder of this paper we will refer to them as .
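The conversion of a dense flow field into an RGB image can be sketched with the common HSV convention (direction mapped to hue, magnitude to value); the exact mapping used in our pipeline may differ, so treat this as illustrative:

```python
import numpy as np
import colorsys

def flow_to_rgb(u, v):
    """Visualize a dense optical-flow field (u, v) as an RGB image.

    Convention (one common choice, assumed here): flow direction -> hue,
    normalized flow magnitude -> value, saturation fixed at 1.
    """
    mag = np.hypot(u, v)
    ang = (np.arctan2(v, u) + np.pi) / (2 * np.pi)   # direction in [0, 1]
    val = mag / mag.max() if mag.max() > 0 else mag  # normalized magnitude
    h, w = u.shape
    rgb = np.zeros((h, w, 3))
    for i in range(h):
        for j in range(w):
            rgb[i, j] = colorsys.hsv_to_rgb(ang[i, j], 1.0, val[i, j])
    return rgb

u = np.array([[1.0, 0.0], [0.0, -1.0]])
v = np.array([[0.0, 1.0], [0.0, 0.0]])
img = flow_to_rgb(u, v)
print(img.shape)  # (2, 2, 3)
```

Encoding flow as RGB lets a standard image CNN consume motion information without architectural changes.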
4 Material and Methods
Neural networks typically assume that all inputs are independent of each other. However, with sequential tasks and time series, making a better prediction requires considering previous computations and the temporal dynamics between sequential inputs. Recurrent Neural Networks rnn , feed-forward networks that unroll over time where each time step is equivalent to a layer, are able to make use of sequential information. In these networks, the hidden unit activations from earlier inputs of a sequence feed into the network along with the current inputs, making it possible to learn sequential, time-series models and temporal dynamics. In RNNs, learning is done through back-propagation on the unfolded network, called Back-Propagation Through Time (BPTT). The total cost function is the sum of the error function over time, and at each time step the weights are adjusted according to this error. As with feed-forward neural networks, vanishing gradients are a problem, and in recurrent networks they shrink exponentially through time, which makes these networks difficult to train. In our work, we specifically use Long Short-Term Memory (LSTM) networks lstm , a type of recurrent neural network. LSTMs address the vanishing gradients problem by introducing the concept of a memory unit: a gate mechanism decides when to forget and when to remember hidden states for future time steps. With this feature, LSTMs are able to learn long-term dependencies.
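The gate mechanism can be made concrete with a minimal single-step LSTM cell in numpy; the gate ordering inside `W` is our own illustrative convention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4*H, D+H), b has shape (4*H,);
    rows are ordered [forget; input; output; candidate] by our convention."""
    H = h_prev.size
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[:H])        # forget gate: keep or drop old memory
    i = sigmoid(z[H:2*H])     # input gate: admit new information
    o = sigmoid(z[2*H:3*H])   # output gate: expose memory as hidden state
    g = np.tanh(z[3*H:])      # candidate memory content
    c = f * c_prev + i * g    # updated memory cell
    h = o * np.tanh(c)        # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 2
W = rng.standard_normal((4 * H, D + H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((5, D)):  # run a 5-step sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)  # (2,)
```

Because the forget gate multiplies the cell state rather than squashing it through an activation at every step, gradients can survive over many time steps.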
LRCN, introduced by Donahue et al. lrcn , incorporates a deep hierarchical visual feature extractor with a model that can learn long-term temporal dynamics. In this model, a deep hierarchical visual feature extractor, specifically a Convolutional Neural Network (CNN), takes each visual input and extracts its features. This feature transformation of CNN activations produces a fixed-length vector representation of the visual inputs, which is then passed into a recurrent learning module based on an LSTM. The recurrent model maps inputs to a sequence of hidden states and outputs. To predict the distribution at time step t, we obtain a vector of class probabilities across our vocabulary by applying softmax to a linear prediction layer on the outputs of the sequential model.
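The per-timestep prediction described above is simply a linear layer followed by a softmax; a minimal sketch, in which the 3-class `W_out` is an illustrative placeholder:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def predict_step(h_t, W_out, b_out):
    """Class probabilities at time step t from the LSTM output h_t."""
    return softmax(W_out @ h_t + b_out)

h_t = np.array([0.5, -0.2, 1.0])
W_out = np.eye(3)            # illustrative 3-class linear layer
p = predict_step(h_t, W_out, np.zeros(3))
print(np.isclose(p.sum(), 1.0))  # True
```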
In the LRCN model for activity recognition, the sequential visual inputs are first individually processed with CNNs; the convolutional layer activations are then fed into a single-layer LSTM with 256 hidden units that learns the frames' temporal dynamics. Predictions are made by averaging scores across all frames over a fixed vocabulary of actions. In their study, Donahue et al. lrcn train two separate LRCN networks, one on RGB frames and one on RGB representations of the flow in the same frames; decisions are made at a final stage where the network predictions are averaged with fixed weights.
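This final-stage averaging amounts to a fixed-weight combination of the two networks' class scores; a minimal sketch, with the 0.5 weight as an illustrative placeholder rather than LRCN's actual value:

```python
import numpy as np

def late_fusion(p_rgb, p_flow, w_rgb=0.5):
    """Combine the class scores of the RGB and flow networks with a
    fixed weight (the 0.5 default here is illustrative)."""
    return w_rgb * p_rgb + (1 - w_rgb) * p_flow

p_rgb = np.array([0.7, 0.2, 0.1])
p_flow = np.array([0.1, 0.6, 0.3])
fused = late_fusion(p_rgb, p_flow)
print(fused.argmax())
```

Note the contrast with our approach below: here the streams only meet at the score level, whereas we fuse their feature representations before the recurrent model.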
4.1 Our Model
We propose a novel architecture based on the principles of LRCN; however, our architecture supports multimodal learning and jointly learns temporal dynamics over rich representations of visual features and motion cues. We use two modalities of the video as input: RGB frames and the RGB representation of the optical flow of the same frames. We extract higher-level features of the individual frames using CNNs that we train on the separate tasks. After we convolve the two streams of input pairs, we concatenate the CNN features and use them as input to our recurrent model, which learns the temporal dynamics of consecutive frames. Our architecture simultaneously classifies common low-level gestures and surgical tasks with a multi-task learning approach. Please see Figure 1 for an overview of our model.
We draw our input data, RGB frames and RGB optical flow representations of the same frames, from a joint data layer and convolve both inputs in parallel convolutional neural network (CNN) streams. We concatenate the activations after the 5th convolutional layers. Before pooling this fusion of representations, we apply another convolutional layer that decreases the dimension of the representations. We then apply a fully connected layer and feed its output to the LSTM layer along with the sequence clip markers. In our model, we define two different tasks and losses: one for the surgical task labels, Knot Tying, Suturing and Needle Passing, and another for the 14 different common low-level gesture labels.
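The fusion and the two task-specific heads can be sketched shape-wise; every dimension below (channel counts, spatial size, head widths beyond the stated 3 tasks and 14 gestures) is an illustrative placeholder, not our actual layer configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative conv5 activations of the two streams (channels x H x W).
rgb_conv5  = rng.standard_normal((256, 6, 6))
flow_conv5 = rng.standard_normal((256, 6, 6))

# Fuse by concatenating along the channel axis, then reduce the channel
# dimension with a 1x1-convolution-style channel-mixing matrix.
fused = np.concatenate([rgb_conv5, flow_conv5], axis=0)   # (512, 6, 6)
reduce_w = rng.standard_normal((256, 512)) * 0.01
reduced = np.einsum('oc,chw->ohw', reduce_w, fused)       # (256, 6, 6)

# Flatten into the shared feature the recurrent model would consume at
# each time step, then apply two task-specific linear heads.
feat = reduced.reshape(-1)
task_head    = rng.standard_normal((3, feat.size)) * 0.01   # 3 surgical tasks
gesture_head = rng.standard_normal((14, feat.size)) * 0.01  # 14 gestures
print((task_head @ feat).shape, (gesture_head @ feat).shape)  # (3,) (14,)
```

The shared trunk up to `feat` is what couples the two losses; only the heads are task-specific.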
The convolutional body of our architecture is similar to the architectures proposed by Zeiler and Fergus ZF and AlexNet krizhevsky by Krizhevsky et al., with convolutional layers, then another one that comes after fusing the activations of these convolutional layers, followed by a fully connected layer. We first train two CNNs, one each for RGB and flow, for a maximum of iterations; these classify the visual inputs based on individual frames, without the notion of temporal dynamics or recurrent networks. We initialize these networks by transferring weights from a model pre-trained on the 1.2M-image ILSVRC-2012 dataset Imagenetsub for a strong initialization and to prevent overfitting. We then train two LSTM networks, one each for RGB and flow, with the weights transferred from the individual frame models; we train the flow LSTM and RGB LSTM networks for a maximum of iterations. Finally, we define our joint model by combining the weights from the convolutional and fully connected layers of the two separate LSTM models and transferring them to our model. We perform stochastic gradient descent optimization. We set our learning rate at , weight decay at , and our learning policy to dynamically decrease the learning rate by a factor of every iterations, for all training models. We train the LSTM network for a maximum of iterations. Additionally, we clip gradients at a threshold of 15 for the LSTM models. The weights for both tasks are equal, so both contribute equally to the prediction. Given time and resource constraints, we chose all of the mentioned hyperparameters based on our experimental observations, starting from the earlier works of Zeiler and Fergus ZF , Krizhevsky et al. krizhevsky and Donahue et al. lrcn . Further optimization of these hyperparameters would be possible using grid search.
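The step learning-rate policy and gradient clipping can be sketched as follows; the decay factor and step size are illustrative placeholders (the actual values are not reproduced here), while the clipping threshold of 15 is as stated above:

```python
import numpy as np

def step_lr(base_lr, it, gamma=0.1, step=10000):
    """Step policy: multiply the learning rate by gamma every `step`
    iterations (gamma and step are illustrative placeholders)."""
    return base_lr * gamma ** (it // step)

def clip_gradient(grad, threshold=15.0):
    """Rescale the gradient when its L2 norm exceeds the threshold;
    the threshold of 15 matches the setting stated above."""
    norm = np.linalg.norm(grad)
    return grad * (threshold / norm) if norm > threshold else grad

lr = step_lr(0.01, 25000)                     # decayed twice: 0.01 * 0.1**2
print(np.linalg.norm(clip_gradient(np.full(100, 3.0))))  # 15.0
```

Clipping by rescaling preserves the gradient direction, which matters for the LSTM's long unrolled backward pass.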
We train our joint model end-to-end with video clips longer than frames. We augment our data by clipping, mirroring, and cropping the frames in order to prevent overfitting. At each time step, our model predicts both the common low-level gesture and the surgical task label across our vocabulary of gestures and tasks. To classify a whole segment of video, we average the predictions over consecutive frames for each task; this averaging yields a single label for the whole video segment. To elaborate, we test our trained model on the multiple clips of frame length that we extract with a stride of frames from each video. To label the whole video segment, we average the predictions of individual frames and then average across all clips extracted from the segment.
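This two-level averaging at test time can be sketched as follows; `clip_len` and `stride` are illustrative parameters, as the actual values are not reproduced here:

```python
import numpy as np

def segment_prediction(frame_probs, clip_len, stride):
    """Label a whole video segment: average per-frame class probabilities
    within each clip, then average across all clips extracted with a
    given stride, and take the argmax."""
    T = frame_probs.shape[0]
    clip_scores = [frame_probs[s:s + clip_len].mean(axis=0)
                   for s in range(0, T - clip_len + 1, stride)]
    return np.mean(clip_scores, axis=0).argmax()

# Toy example: 8 frames, 3 classes; class 2 dominates every frame.
probs = np.tile([0.1, 0.2, 0.7], (8, 1))
print(segment_prediction(probs, clip_len=4, stride=2))  # 2
```

Averaging over overlapping clips smooths out per-frame noise before the final label is committed.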
5 Experiments and Evaluation
We evaluate our novel architecture on the JIGSAWS dataset, customized to our problem as discussed in the Dataset section. We create 6 splits for both the individual frame and the LSTM models. We first create randomized lists of videos for training and testing the LSTM models, then extract the frames of these videos and randomize them again for the individual frame models. We train our model on a fixed random set of gesture video segments and use the rest for testing. This results in around gesture frames sampled for training and for testing. The video segments vary in length from just over frames to , with most ranging from to . We set our clip frame length based on the shorter videos; however, there is room for improvement in this selection. During training we resize both RGB and flow frames to and augment the data by taking crops and mirroring. We use a setting of multiple GHz CPU processors and a Geforce GTX with compute capability .
Our experimental results show that our multimodal, multi-task approach is superior to an architecture that classifies the surgical tasks and the common low-level gestures separately on visual cues and motion cues, respectively (Table 2). While the conventional approach reaches a Mean Average Precision (MAP) of only 29.13%, our architecture reaches a MAP of 50.83% for 3 tasks and 14 possible gesture labels, an improvement of 21.7%. We observed robust improvements with a median of 22.5%, ranging from 18% to 25.1%, across all six split experiments.
In training, every iterations take about 2 hours and 10 minutes. For a video of 30 frames, the test time is around 7-8 seconds (excluding the preprocessing time required to extract optical flow, which takes a few seconds per generic image as reported by Brox et al. flow ). Our code does not take full advantage of the GPU and parallelization; a more efficient implementation could be obtained by modifying it to fully exploit the GPU and parallel computation.
The JIGSAWS benchmark studies jigsaws_benchmark ; colin train and test on gestures belonging to only one type of surgical task: they train on the set of gestures that occur during a specific surgical task (e.g. Suturing) and also test on that same task. Their experimentation is limited, as it trains models only on the gestures that take place in a specific task rather than on the whole vocabulary of gestures. Their Suturing experiments are done on gesture labels, while Knot Tying and Needle Passing are done on only gestures. Our approach differs greatly: we focus on recognizing a specific gesture whether it takes place in Suturing or Knot Tying. Therefore, our experiments are done on the whole dataset, across multiple surgical tasks, and recognize all common low-level gesture labels in our gesture vocabulary.
6 Conclusion
We propose a novel architecture to simultaneously classify common low-level gestures and surgical tasks in Robot-assisted Surgery (RAS) video segments. Our architecture classifies the gestures and surgical tasks with a multi-task learning approach, making use of joint relationships and combining shared and task-specific representations. It is multimodal, using two modalities of the video as input: RGB frames and the RGB representation of the optical flow of the same frames, capturing motion cues. After we convolve the two streams of input pairs, we concatenate the extracted CNN features and use this fusion of higher-level representations as input to our recurrent model, which learns the temporal dynamics of gestures and surgical tasks. We focus on recognizing common low-level gestures even when they occur across different tasks; in this manner, our work differs greatly from the benchmark studies on JIGSAWS jigsaws_benchmark ; colin . Our multimodal, multi-task learning approach is superior to an architecture that classifies the tasks and the gestures separately on visual and motion cues, respectively.
Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent This article does not contain patient data.
- (1) Tao L, Elhamifar E, Khudanpur S, Hager GD, Vidal R (2012) Sparse hidden markov models for surgical gesture classification and skill evaluation. In: Proc. International Conference on Information Processing in Computer-Assisted Interventions (IPCAI) 167-177.
- (2) Sarikaya D, Corso JJ, Guru KA (2017) Detection and localization of robotic tools in robot-assisted surgery videos using deep neural networks for region proposal and detection. IEEE Transactions on Medical Imaging 36(7):1542-1549
- (3) Gao Y, Vedula SS, Reiley CE, Ahmidi N, Varadarajan B, Lin HC, Tao L, Zappella L, Bejar B, Yuh DD, Chen CCG, Vidal R, Khudanpur S, Hager GD (2014) The JHU-ISI gesture and skill assessment working set (JIGSAWS): a surgical activity dataset for human motion modeling. In: Proc. Modeling Monitor. Comput. Assist. Interventions (MCAI)
- (4) Lin HC, Shafran I, Murphy TE, Okamura AM, Yuh DD, Hager GD (2005) Automatic detection and segmentation of robot-assisted surgical motions. In: Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 3749:802-810
- (5) Lin HC, Shafran I, Yuh D, Hager GD (2006) Towards automatic skill evaluation: detection and segmentation of robot-assisted surgical motions. Computer Aided Surgery 11:220-230
- (6) Leong JJH, Nicolaou M, Atallah L, Mylonas GP, Darzi AW, Yang GZ (2006) HMM assessment of quality of movement trajectory in laparoscopic surgery. In: Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 4190:752-759
- (7) Yang GZ , Varadarajan B, Reiley C, Lin H, Khudanpur S, Hager G (2009) Data-derived models for segmentation with application to surgical assessment and training. In: Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 5761:426-434
- (8) Varadarajan B (2011) Learning and inference algorithms for dynamical system models of dextrous motion. PhD thesis, Johns Hopkins University
- (9) Cadéne R, Robert T, Thome N, Cord M (2016) M2CAI workflow challenge: convolutional neural networks with time smoothing and hidden markov model for video frames classification. Computing Research Repository (CoRR) abs:1610.05541
- (10) DiPietro R, Lea C, Malpani A, Ahmidi N, Vedula SS, Lee GI, Lee MR, Hager HG (2016) Recognizing surgical activities with recurrent neural networks. In: Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI) 551-558
- (11) Twinanda AP, Mutter D, Marescaux J, Mathelin M, Padoy N (2016) Single and multi-task architectures for surgical workflow challenge. In: Proc. Workshop and Challenges on Modeling and Monitoring of Computer Assisted Interventions (M2CAI) at Medical Image Computing and Computer-Assisted Intervention (MICCAI)
- (12) Ahmidi N, Tao L, Sefati S, Gao Y, Lea C, Haro BB, Zappella L, Khudanpur S, Vidal R, Hager GD (2017) A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery. IEEE Transactions on Biomedical Engineering
- (13) Meißner C, Meixensberger J, Pretschner A, Neumuth T (2014) Sensor-based surgical activity recognition in unconstrained environments. Minimally Invasive Therapy and Allied Technologies 23(4)
- (14) Bardram JE, Doryab A, Jensen RM, Lange PM, Nielsen KLG, Petersen ST (2011) Phase recognition during surgical procedures using embedded and body-worn Sensors. In: Proc. IEEE International Conference on Pervasive Computing and Communications (PerCom) 45-53
- (15) Bouarfa L, Jonker PP, Dankelman J (2011) Discovery of high-level tasks in the operating room. Journal of Biomedical Informatics 44(3)455:462
- (16) Simonyan K, Zisserman A (2014) Two-Stream Convolutional Networks for Action Recognition in Videos. In: Proc. Neural Information Processing Systems (NIPS) 568-576
- (17) Donahue J, Hendricks LA, Rohrbach M, Venugopalan S, Guadarrama S, Saenko K, Darrell T (2017) Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(4):677-691
- (18) Graves A, Jaitly N (2014) Towards End-To-End Speech Recognition with Recurrent Neural Networks. In: Proc. International Conference on Machine Learning 1764-1772
- (19) Graves A (2013) Generating Sequences With Recurrent Neural Networks. Computing Research Repository (CoRR) abs:1308.0850
- (20) Kokkinos I (2016) UberNet: Training a ’Universal’ Convolutional Neural Network for Low-, Mid-, and High-Level Vision using Diverse Datasets and Limited Memory. Computing Research Repository (CoRR), abs:1609.02132
- (21) Long M, Wang J (2015) Learning Multiple Tasks with Deep Relationship Networks. Computing Research Repository (CoRR) abs:1506.02117
- (23) Misra I, Shrivastava A, Gupta A, Hebert M (2016) Cross-stitch Networks for Multi-Task Learning. In: Proc. Computer Vision and Pattern Recognition (CVPR)
- (24) Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735-1780
- (25) Brox T, Bruhn A, Papenberg N, Weickert J (2004) High accuracy optical flow estimation based on a theory for warping. In: Proc. Eur. Conf. Comput. Vis. (ECCV) 25-36
- (26) Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: Proc. Eur. Conf. Comput. Vis. (ECCV) 818-833
- (27) Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Proc. Adv. Neural Information Processing Systems (NIPS) 1-2
- (28) Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3)