Recognizing Surgical Activities with Recurrent Neural Networks
We apply recurrent neural networks to the task of recognizing surgical activities from robot kinematics. Prior work in this area focuses on recognizing short, low-level activities, or gestures, and has been based on variants of hidden Markov models and conditional random fields. In contrast, we work on recognizing both gestures and longer, higher-level activities, or maneuvers, and we model the mapping from kinematics to gestures/maneuvers with recurrent neural networks. To our knowledge, we are the first to apply recurrent neural networks to this task. Using a single model and a single set of hyperparameters, we match state-of-the-art performance for gesture recognition and advance state-of-the-art performance for maneuver recognition, in terms of both accuracy and edit distance. Code is available at https://github.com/rdipietro/miccai2016surgicalactivityrec.
Automated surgical-activity recognition is a valuable precursor for higher-level goals such as objective surgical-skill assessment and for providing targeted feedback to trainees. Previous research on automated surgical-activity recognition has focused on gestures within a surgical task [10], [15], [9], [13]. Gestures are atomic segments of activity that typically last for a few seconds, such as grasping a needle. In contrast, maneuvers are composed of a sequence of gestures and represent higher-level segments of activity, such as tying a knot. We believe that targeted feedback for maneuvers is meaningful and consistent with the subjective feedback that faculty surgeons currently provide to trainees.
Here we focus on jointly segmenting and classifying surgical activities. Other work in this area has focused on variants of hidden Markov models (HMMs) and conditional random fields (CRFs) [10], [15], [9], [13]. HMM- and CRF-based methods often define unary (label-input) and pairwise (label-label) energy terms, and during inference find a global label configuration that minimizes overall energy. Here we put emphasis on the unary terms and note that defining unaries that are both general and meaningful is a difficult task. For example, of the works above, the unaries of [10] are perhaps most general: they are computed using learned convolutional filters. However, we note that even these unaries depend only on inputs from fairly local neighborhoods in time.

In this work, we use recurrent neural networks (RNNs), and in particular long short-term memory (LSTM), to map kinematics to labels. Rather than operating only on local neighborhoods in time, LSTM maintains a memory cell and learns when to write to memory, when to reset memory, and when to read from memory, forming unaries that in principle depend on all inputs. In fact, we will rely only on these unary terms, or in other words assume that labels are independent given the sequence of kinematics. Despite this, we will see that predicted labels are smooth over time with no post-processing. Further, using a single model and a single set of hyperparameters, we match state-of-the-art performance for gesture recognition and improve over state-of-the-art performance for maneuver recognition, in terms of both accuracy and edit distance.

The goal of this work is to use $n_x$ kinematic signals over time to label every time step with one of $n_y$ surgical activities. An individual sequence of length $T$ is composed of kinematic inputs $\{x_t\}$, with each $x_t \in \mathbb{R}^{n_x}$, and a collection of one-hot encoded activity labels $\{y_t\}$, with each $y_t \in \{0, 1\}^{n_y}$. (For example, if we have classes 1, 2, and 3, then the one-hot encoding of label 2 is $(0, 1, 0)^T$.) We aim to learn a mapping from $\{x_t\}$ to $\{y_t\}$ in a supervised fashion that generalizes to users that were absent from the training set. In this work, we use recurrent neural networks to discriminatively model $p(y_t \mid x_1, x_2, \ldots, x_t)$ for all $t$ when operating online and $p(y_t \mid x_1, x_2, \ldots, x_T)$ for all $t$ when operating offline.

Though not yet as ubiquitous as their feedforward counterparts, RNNs have been applied successfully to many diverse sequence-modeling tasks, from text-to-handwriting generation [6] to machine translation [14].
A generic RNN is shown in Figure 1(a). An RNN maintains a hidden state $\tilde{h}_t$, and at each time step $t$, the nonlinear block uses the previous hidden state $\tilde{h}_{t-1}$ and the current input $x_t$ to produce a new hidden state $\tilde{h}_t$ and an output $h_t$.
If we use the nonlinear block shown in Figure 1(b), we end up with a specific and simple model: a vanilla RNN with one hidden layer. The recursive equation for a vanilla RNN, which can be read off precisely from Figure 1(b), is
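$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b) \qquad (1)$$
Here, $W_x$, $W_h$, and $b$ are free parameters that are shared over time. For the vanilla RNN, we have $\tilde{h}_t = h_t$. The height of $h_t$ is a hyperparameter and is referred to as the number of hidden units.
In the case of multiclass classification, we use a linear layer to transform $h_t$ to appropriate size $n_y$ and apply a softmax to obtain a vector of class probabilities:
$$\tilde{y}_t = W_y h_t + b_y \qquad (2)$$
$$\hat{y}_t = \mathrm{softmax}(\tilde{y}_t) \qquad (3)$$
where $\mathrm{softmax}(x) = \exp(x) / \sum_i \exp(x_i)$.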
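As a rough illustration of the recursion and softmax output described above, the following NumPy sketch runs a one-layer tanh RNN over a toy sequence. It assumes the standard form $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$ followed by a linear layer and a softmax. The function names, the zero initial state, the random weights, and the toy dimensions (14 kinematic signals and 10 classes, matching the JIGSAWS setup described later) are illustrative assumptions and not the authors' implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def vanilla_rnn_forward(x, W_x, W_h, b, W_y, b_y):
    """Map inputs x of shape [T, n_x] to class probabilities of shape [T, n_y]."""
    h = np.zeros(W_h.shape[0])  # assumption: zero initial hidden state
    y_hat = []
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h + b)   # hidden-state recursion
        y_hat.append(softmax(W_y @ h + b_y))   # linear layer followed by softmax
    return np.stack(y_hat)

# Toy dimensions (illustrative): 14 kinematic signals, 8 hidden units, 10 classes.
rng = np.random.default_rng(0)
T, n_x, n_h, n_y = 6, 14, 8, 10
x = rng.normal(size=(T, n_x))
W_x = rng.normal(scale=0.1, size=(n_h, n_x))
W_h = rng.normal(scale=0.1, size=(n_h, n_h))
b = np.zeros(n_h)
W_y = rng.normal(scale=0.1, size=(n_y, n_h))
b_y = np.zeros(n_y)

y_hat = vanilla_rnn_forward(x, W_x, W_h, b, W_y, b_y)
predicted_classes = y_hat.argmax(axis=1)  # one predicted label per time step
```

Each row of `y_hat` is a probability vector over classes, so taking the arg max per row yields a frame-level labeling of the sequence.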
RNNs traditionally propagate information forward in time, forming predictions using only past and present inputs. Bidirectional RNNs [12] can improve performance when operating offline by using future inputs as well. This essentially consists of running one RNN in the forward direction and one RNN in the backward direction, concatenating hidden states, and computing outputs jointly.
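The bidirectional idea can be sketched in a few lines of NumPy: run one recurrence left to right, run a second one over the time-reversed sequence, re-align the second set of states, and concatenate. The helper names, the use of a plain tanh recurrence (rather than LSTM), and the toy sizes are assumptions made for illustration only.

```python
import numpy as np

def rnn_states(x, W_x, W_h, b):
    """Hidden states [T, n_h] of a tanh RNN run left to right over x [T, n_x]."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional_states(x, fwd_params, bwd_params):
    """Concatenate forward-in-time and backward-in-time hidden states per step."""
    h_fwd = rnn_states(x, *fwd_params)
    # Run over the reversed sequence, then flip back so row t matches time t.
    h_bwd = rnn_states(x[::-1], *bwd_params)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
T, n_x, n_h = 5, 14, 4  # toy sizes (illustrative)

def init_params():
    return (rng.normal(scale=0.1, size=(n_h, n_x)),
            rng.normal(scale=0.1, size=(n_h, n_h)),
            np.zeros(n_h))

fwd, bwd = init_params(), init_params()
x = rng.normal(size=(T, n_x))
h = bidirectional_states(x, fwd, bwd)  # shape [T, 2 * n_h]
```

In this sketch the first `n_h` columns at time t depend only on inputs up to t, while the last `n_h` columns depend on inputs from t onward, which is why this variant is restricted to offline use.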
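Vanilla RNNs are very difficult to train because of what is known as the vanishing gradient problem [1]. LSTM [8] was specifically designed to overcome this problem and has since become one of the most widely used RNN architectures. The recursive equations for the LSTM block used in this work are
$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + W_{ic} c_{t-1} + b_i) \qquad (4)$$
$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + W_{fc} c_{t-1} + b_f) \qquad (5)$$
$$\tilde{c}_t = \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c) \qquad (6)$$
$$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1} \qquad (7)$$
$$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + W_{oc} c_t + b_o) \qquad (8)$$
$$h_t = o_t \odot \tanh(c_t) \qquad (9)$$
where $\odot$ represents elementwise multiplication and $\sigma(x) = 1 / (1 + \exp(-x))$. All matrices $W$ and all biases $b$ are free parameters that are shared across time.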
LSTM maintains a memory over time and learns when to write to memory, when to reset memory, and when to read from memory [5]. In the context of the generic RNN, the output is $h_t$, and the hidden state $\tilde{h}_t$ is the concatenation of $c_t$ and $h_t$. $c_t$ is the memory cell and is updated at each time step to be a linear combination of $\tilde{c}_t$ and $c_{t-1}$, with proportions governed by the input gate $i_t$ and the forget gate $f_t$. $h_t$, the output, is a nonlinear version of $c_t$ that is filtered by the output gate $o_t$. Note that all elements of the gates $i_t$, $f_t$, and $o_t$ lie between 0 and 1.
This version of LSTM, unlike the original, has forget gates and peephole connections, which let the input, forget, and output gates depend on the memory cell. Forget gates are a standard part of modern LSTM [7], and we include peephole connections because they have been found to improve performance when precise timing is required [4]. All weight matrices are full except the peephole matrices $W_{ic}$, $W_{fc}$, and $W_{oc}$, which by convention are restricted to be diagonal.
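A single step of an LSTM block with forget gates and diagonal peephole connections might look like the NumPy sketch below. It assumes the commonly used peephole formulation, in which the input and forget gates see the previous cell state and the output gate sees the updated cell state; the parameter names, the dictionary layout, and the random initialization are hypothetical and are not taken from the released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step. Peephole weights w_ic, w_fc, w_oc are vectors, i.e. the
    diagonals of the peephole matrices, so they act by elementwise product."""
    i = sigmoid(p["W_ix"] @ x_t + p["W_ih"] @ h_prev + p["w_ic"] * c_prev + p["b_i"])
    f = sigmoid(p["W_fx"] @ x_t + p["W_fh"] @ h_prev + p["w_fc"] * c_prev + p["b_f"])
    c_tilde = np.tanh(p["W_cx"] @ x_t + p["W_ch"] @ h_prev + p["b_c"])
    c = i * c_tilde + f * c_prev            # write new content, keep part of the old
    o = sigmoid(p["W_ox"] @ x_t + p["W_oh"] @ h_prev + p["w_oc"] * c + p["b_o"])
    h = o * np.tanh(c)                      # read from memory through the output gate
    return h, c

def init_params(n_x, n_h, rng):
    """Small random weights and zero biases (an arbitrary choice for the demo)."""
    p = {}
    for g in "ifco":
        p[f"W_{g}x"] = rng.normal(scale=0.1, size=(n_h, n_x))
        p[f"W_{g}h"] = rng.normal(scale=0.1, size=(n_h, n_h))
        p[f"b_{g}"] = np.zeros(n_h)
    for g in "ifo":
        p[f"w_{g}c"] = rng.normal(scale=0.1, size=n_h)
    return p

rng = np.random.default_rng(0)
n_x, n_h, T = 14, 8, 20  # toy sizes (illustrative)
p = init_params(n_x, n_h, rng)
h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(T, n_x)):
    h, c = lstm_step(x_t, h, c, p)
```

Storing the peephole weights as vectors keeps the diagonal restriction explicit and avoids multiplying by mostly-zero matrices.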
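Because we assume every $y_t$ is independent of all other $y_{t'}$ given $x_1, \ldots, x_T$, maximizing the log likelihood of our data is equivalent to minimizing the overall cross entropy between the true labels $y_t$ and the predicted labels $\hat{y}_t$. The global loss for an individual sequence is therefore
$$l_{\mathrm{seq}} = -\sum_{t=1}^{T} \sum_{i=1}^{n_y} y_{t,i} \log \hat{y}_{t,i}$$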
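For concreteness, the per-sequence cross entropy can be computed as in the short sketch below. The function name, the small `eps` guard against `log(0)`, and the toy labels and predictions are illustrative assumptions.

```python
import numpy as np

def sequence_cross_entropy(y, y_hat, eps=1e-12):
    """Cross entropy summed over time steps.
    y: one-hot labels [T, n_y]; y_hat: predicted probabilities [T, n_y]."""
    return float(-np.sum(y * np.log(y_hat + eps)))

# Toy sequence with T = 3 time steps and 3 classes (made-up numbers).
y = np.array([[0, 1, 0],
              [0, 1, 0],
              [1, 0, 0]], dtype=float)
y_hat = np.array([[0.1, 0.8, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.7, 0.2, 0.1]])
loss = sequence_cross_entropy(y, y_hat)
```

Because the labels are one-hot, only the predicted probability of the true class contributes at each time step, so the loss here equals the negative log of the product 0.8 × 0.6 × 0.7.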
All experiments in this paper use standard stochastic gradient descent to minimize loss. Although the loss is non-convex, it has repeatedly been observed empirically that ending up in a poor local optimum is unlikely. Gradients can be obtained efficiently using backpropagation [11]. In practice, one can build a computation graph out of fundamental operations, each with known local gradients, and then apply the chain rule to compute overall gradients with respect to all free parameters. Frameworks such as Theano and Google TensorFlow let the user specify these computation graphs symbolically and relieve the user of computing overall gradients manually.
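Once gradients are obtained for a particular free parameter $p$, we take a small step in the direction opposite to that of the gradient: with $\eta$ being the learning rate,
$$p \leftarrow p - \eta \, \frac{\partial l_{\mathrm{seq}}}{\partial p}$$
where $l_{\mathrm{seq}}$ denotes the loss for a sequence.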
The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) [2] is a public benchmark surgical activity dataset recorded using the da Vinci. JIGSAWS contains synchronized video and kinematic data from a standard 4-throw suturing task performed by eight subjects with varying skill levels. All subjects performed about 5 trials, resulting in a total of 39 trials. We use the same measurements and activity labels as the current state-of-the-art method [10]. Measurements are position ($x$, $y$, $z$), velocity ($v_x$, $v_y$, $v_z$), and gripper angle ($\theta$) for each of the left and right slave manipulators, and the surgical activity at each time step is one of ten different gestures.
The Minimally Invasive Surgical Training and Innovation Center - Science of Learning (MISTIC-SL) dataset, also recorded using the da Vinci, includes 49 right-handed trials performed by 15 surgeons with varying skill levels. We follow [3] and use a subset of 39 right-handed trials for all experiments. All trials consist of a suture throw followed by a surgeon’s knot, eight more suture throws, and another surgeon’s knot. We used the same kinematic measurements as for JIGSAWS, and the surgical activity at each time step is one of 4 maneuvers: suture throw (ST), knot tying (KT), grasp pull run suture (GPRS), and inter-maneuver segment (IMS). It is not possible for us to release this dataset at this time, though we hope we will be able to release it in the future.
JIGSAWS has a standardized leave-one-user-out evaluation setup: for the $i$th run, train using all users except $i$ and test on user $i$. All results in this paper are averaged over the 8 runs, one per user. We follow the same strategy for MISTIC-SL, averaging over 11 runs, one for each user that does not appear in the validation set, as explained below.
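The leave-one-user-out protocol amounts to one train/test split per user, which can be sketched as follows. The `(user_id, trial_id)` representation and the toy user names are hypothetical and do not reflect the actual dataset file layout.

```python
def leave_one_user_out(trials):
    """Yield (held_out_user, train_trials, test_trials), one run per user.
    trials: list of (user_id, trial_id) pairs."""
    users = sorted({user for user, _ in trials})
    for held_out in users:
        train = [t for t in trials if t[0] != held_out]
        test = [t for t in trials if t[0] == held_out]
        yield held_out, train, test

# Toy example: 3 hypothetical users with 2 trials each.
trials = [(user, k) for user in ("B", "C", "D") for k in (1, 2)]
splits = list(leave_one_user_out(trials))
```

Reported metrics would then be averaged over the runs in `splits`, so that every user is tested exactly once by a model that never saw that user's trials.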
We include accuracy and edit distance (Levenshtein distance) as performance metrics. Accuracy is the percentage of correctly classified frames, measuring performance without taking temporal consistency into account. In contrast, edit distance is the number of operations needed to transform predicted segment-level labels into ground-truth segment-level labels, here normalized for each dataset using the maximum number (over all sequences) of segment-level labels.
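Both metrics can be sketched in plain Python as below. Frame-level labels are first collapsed into segment-level labels, and the Levenshtein distance between the two segment sequences is then divided by a dataset-level maximum segment count. The function names are illustrative, and `max_segments` is passed in as a parameter because the exact normalizer used in the released code is not reproduced here.

```python
from itertools import groupby

def to_segments(frame_labels):
    """Collapse runs of identical frame labels, e.g. A A B B A -> A B A."""
    return [label for label, _ in groupby(frame_labels)]

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, a_i in enumerate(a, start=1):
        curr = [i]
        for j, b_j in enumerate(b, start=1):
            cost = 0 if a_i == b_j else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def accuracy(pred, truth):
    """Percentage of frames whose predicted label matches the ground truth."""
    return 100.0 * sum(p == t for p, t in zip(pred, truth)) / len(truth)

def normalized_edit_distance(pred, truth, max_segments):
    """Segment-level edit distance as a percentage of max_segments."""
    return 100.0 * levenshtein(to_segments(pred), to_segments(truth)) / max_segments

# Toy 10-frame example using the maneuver abbreviations from the text.
truth = ["ST"] * 4 + ["IMS"] * 2 + ["KT"] * 4
pred = ["ST"] * 3 + ["IMS"] + ["ST"] + ["IMS"] + ["KT"] * 4

acc = accuracy(pred, truth)
edit = normalized_edit_distance(pred, truth, max_segments=4)  # 4 is a made-up normalizer
```

In this toy case only two frames are wrong, yet the prediction contains two spurious segments, which illustrates why edit distance complements frame-level accuracy as a measure of temporal consistency.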
Here we include the most relevant details regarding hyperparameter selection and training; other details are fully specified in code, available at https://github.com/rdipietro/miccai2016surgicalactivityrec.
For each run we train for a total of approximately 80 epochs, maintaining a learning rate of 1.0 for the first 40 epochs and then halving the learning rate every 5 epochs for the rest of training. Using a small batch size is important; we found that otherwise the lack of stochasticity let us converge to bad local optima. We use a batch size of 5 sequences for all experiments.
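The learning-rate schedule described above can be written as a small function. This sketch assumes 0-indexed epochs and assumes that the first halving takes effect at epoch 40; the text does not say whether the first halving occurs at epoch 40 or epoch 45, so that boundary is a guess, and the function and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=1.0, constant_epochs=40, halve_every=5):
    """Learning rate for a 0-indexed epoch: constant at first, then halved periodically."""
    if epoch < constant_epochs:
        return base_lr
    # Assumption: the first halving applies starting at epoch `constant_epochs`.
    num_halvings = (epoch - constant_epochs) // halve_every + 1
    return base_lr * 0.5 ** num_halvings

# Schedule for an 80-epoch run.
schedule = [learning_rate(epoch) for epoch in range(80)]
```

Under these assumptions the rate is 1.0 for epochs 0 to 39, 0.5 for epochs 40 to 44, 0.25 for epochs 45 to 49, and so on until the end of training.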
Because JIGSAWS has a fixed leave-one-user-out test setup, with all users appearing in the test set exactly once, it is not possible to use JIGSAWS for hyperparameter selection without inadvertently training on the test set. We therefore choose all hyperparameters using a small MISTIC-SL validation set consisting of 4 users (those with only one trial each), and we use the resulting hyperparameters for both JIGSAWS experiments and MISTIC-SL experiments. We performed a grid search over the number of RNN hidden layers (1 or 2), the number of hidden units per layer (64, 128, 256, 512, or 1024), and whether dropout [16] is used (with ). 1 hidden layer of 1024 units, with dropout, resulted in the lowest edit distance and simultaneously yielded high accuracy. These hyperparameters were used for all experiments.
Using a modern GPU, training takes about 1 hour for any particular JIGSAWS run and about 10 hours for any particular MISTICSL run (MISTICSL sequences are approximately 10x longer than JIGSAWS sequences). We note, however, that RNN inference is fast, with a running time that scales linearly with sequence length. At test time, it took the bidirectional RNN approximately 1 second of compute time per minute of sequence (300 time steps).
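| Method | JIGSAWS Accuracy (%) | JIGSAWS Edit Dist. (%) | MISTIC-SL Accuracy (%) | MISTIC-SL Edit Dist. (%) |
| --- | --- | --- | --- | --- |
| MsM-CRF [15] | 72.6 | - | - | - |
| SDSDL [13] | 78.7 | - | - | - |
| SC-CRF [9] | 80.3 | - | - | - |
| LC-SC-CRF [10] | 82.5 ± 5.4 | 14.8 ± 9.4 | 81.7 ± 6.2 | 29.7 ± 6.8 |
| Forward LSTM | 80.5 ± 6.2 | 19.8 ± 8.7 | 87.8 ± 3.7 | 33.9 ± 13.3 |
| Bidir. LSTM | 83.3 ± 5.7 | 14.6 ± 9.6 | 89.5 ± 4.0 | 19.5 ± 5.2 |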
Table 1 shows results for both JIGSAWS (gesture recognition) and MISTIC-SL (maneuver recognition). A forward LSTM and a bidirectional LSTM are compared to the Markov/semi-Markov conditional random field (MsM-CRF), Shared Discriminative Sparse Dictionary Learning (SDSDL), Skip-Chain CRF (SC-CRF), and Latent-Convolutional Skip-Chain CRF (LC-SC-CRF). We note that the LC-SC-CRF results were computed by the original author, using the same MISTIC-SL validation set for hyperparameter selection.
We include standard deviations where possible, though we note that they largely describe the user-to-user variations in the datasets. (Some users are exceptionally challenging, regardless of the method.) We also carried out statistical-significance testing using a paired-sample permutation test ($p$-value of 0.05). This test suggests that the accuracy and edit-distance differences between the bidirectional LSTM and LC-SC-CRF are insignificant in the case of JIGSAWS but are significant in the case of MISTIC-SL. We also remark that even the forward LSTM is competitive here, despite being the only algorithm that can run online.

Qualitative results are shown in Figure 3 for the trials with highest, median, and lowest accuracies for each dataset. We note that the predicted label sequences are smooth, despite the fact that we assumed that labels are independent given the sequence of kinematics.
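The paired-sample permutation test used for significance testing above can be illustrated with an exact sign-flip version, which is feasible when there are only a handful of paired runs. The text does not state which test statistic or variant was used, so the statistic chosen here (absolute sum of paired differences), the function name, and the per-user scores are all assumptions; the numbers are made up and are not results from the paper.

```python
from itertools import product

def paired_permutation_test(a, b):
    """Exact two-sided paired permutation test.
    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every sign assignment is enumerated and compared with the observed one."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    return count / total

# Hypothetical per-user scores for two methods (NOT taken from the paper).
method_a = [0.90, 0.85, 0.88, 0.92, 0.80, 0.87, 0.91, 0.86]
method_b = [0.84, 0.80, 0.85, 0.86, 0.79, 0.83, 0.84, 0.80]
p_value = paired_permutation_test(method_a, method_b)
significant = p_value < 0.05
```

With 8 paired runs there are only 256 sign assignments, so exhaustive enumeration is cheap and no random sampling is needed.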


In this work we performed joint segmentation and classification of surgical activities from robot kinematics. Unlike prior work, we focused on high-level maneuver prediction in addition to low-level gesture prediction, and we modeled the mapping from inputs to labels with recurrent neural networks instead of with HMM- or CRF-based methods. Using a single model and a single set of hyperparameters, we matched state-of-the-art performance for JIGSAWS (gesture recognition) and advanced state-of-the-art performance for MISTIC-SL (maneuver recognition), in the latter case increasing accuracy from 81.7% to 89.5% and decreasing normalized edit distance from 29.7% to 19.5%.
Lea, C., Hager, G.D., Vidal, R.: An improved model for segmentation and recognition of fine-grained activities with application to surgical training tasks. In: 2015 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1123–1129. IEEE (2015)
Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11), 2673–2681 (1997)