Surgical gestures are the basic elements of every surgical process. Recognizing which surgical gesture is being performed is crucial for understanding the current surgical situation and for providing meaningful computer assistance to the surgeon. Automatic surgical gesture recognition also offers new possibilities for surgical training. For example, it may enable a computer-assisted surgical training system to observe whether gestures are performed in the correct order or to identify the gestures with which a trainee struggles the most.
Especially appealing is the exploitation of ubiquitous video feeds for surgical gesture recognition, such as the feed of the laparoscopic camera, which displays the surgical field in conventional and robot-assisted minimally invasive surgery.
The problem of video-based surgical gesture recognition is formalized as follows: a video of length T is a sequence of video frames (f_1, ..., f_T). The problem is to predict the gesture g_t ∈ G performed at time t for each 1 ≤ t ≤ T, where G = {G_1, ..., G_K} is the set of surgical gestures. Variations of surgical gesture recognition differ in the amount of information that is available to obtain an estimate of the current gesture g_t: only the current video frame f_t (frame-wise recognition), only the frames up until the current timestep, i.e., f_1, ..., f_t (on-line recognition), or the complete video f_1, ..., f_T (off-line recognition).
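The three variants differ only in which frames an estimator may observe at time t. A toy sketch of that distinction (function and variable names are ours, purely illustrative):

```python
# Illustrates which frames are available to each recognition variant.

def observable_frames(frames, t, mode):
    """Return the frames an estimator may use to predict the gesture at time t.

    frames: full video as a list of frames f_1, ..., f_T (0-indexed here)
    t:      current timestep, 0 <= t < len(frames)
    mode:   'frame' (frame-wise), 'online' (on-line), or 'offline' (off-line)
    """
    if mode == "frame":      # frame-wise: only the current frame f_t
        return frames[t:t + 1]
    if mode == "online":     # on-line: frames f_1, ..., f_t
        return frames[:t + 1]
    if mode == "offline":    # off-line: the complete video f_1, ..., f_T
        return frames[:]
    raise ValueError(mode)

video = ["f%d" % i for i in range(1, 9)]        # toy video with T = 8 frames
print(observable_frames(video, 3, "frame"))     # ['f4']
print(observable_frames(video, 3, "online"))    # ['f1', 'f2', 'f3', 'f4']
print(len(observable_frames(video, 3, "offline")))  # 8
```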
The main challenge in video-based surgical gesture recognition is the high dimensionality, high level of redundancy, and high complexity of video data. State-of-the-art methods tackle the problem by transforming video frames into feature representations, which are fed into temporal models that infer the sequence of gestures from the input sequence. These temporal models have been improved continuously over the last years, starting with variants of Hidden Markov Models and Conditional Random Fields [12, 13], followed by Recurrent Neural Networks [3], Temporal Convolutional Networks (TCN) [8, 10], and Deep Reinforcement Learning (RL) [11]. To obtain feature representations from video frames, early approaches compute bag-of-features histograms from feature descriptors extracted around space-time interest points or dense trajectories. More recently, Convolutional Neural Networks (CNNs
) became a popular tool for visual feature extraction. For example, Lea et al. train a CNN (S-CNN) for frame-wise gesture recognition [9] and use the latent video frame encodings as feature representations, which are further processed by a TCN for gesture recognition [10]. A TCN combines 1D convolutional filters with pooling and channel-wise normalization layers to hierarchically capture temporal relationships at low-, intermediate-, and high-level time scales. Features extracted from individual video frames cannot represent the dynamics in surgical video, i.e., changes between adjacent frames. To alleviate this problem, Lea et al. [9] propose adding a number of difference images to the input fed to the S-CNN. For timestep t, difference images are calculated within a window of 2 seconds around frame f_t. Also, they suggest using a spatiotemporal CNN (ST-CNN) [9], which applies a large temporal 1D convolutional filter to the latent activations obtained by a S-CNN.
In contrast, we propose to use a 3D CNN to learn spatiotemporal features from stacks of consecutive video frames, thus modeling the temporal evolution of video frames directly. To the best of our knowledge, we are the first to design a 3D CNN for surgical gesture recognition that predicts gesture labels for consecutive frames of surgical video. An evaluation on the suturing task of the publicly available JIGSAWS [1] dataset demonstrates the superiority of our approach compared to 2D CNNs that estimate surgical gestures based on spatial features extracted from individual video frames. Averaging the dense predictions of the 3D CNN over time even achieves compelling frame-wise gesture recognition accuracies of over 84%. Source code can be accessed at https://gitlab.com/nct_tso_public/surgical_gesture_recognition.
In the following, we detail the architecture and training procedure of the proposed 3D CNN for video-based surgical gesture recognition.
2.1 Network Architecture
Ji et al. [6] proposed 3D CNNs as a natural extension of well-known (2D) CNNs. While 2D CNNs apply 2D convolutions and 2D pooling kernels to extract features along the spatial dimensions of a video frame, 3D CNNs apply 3D convolutions and 3D pooling kernels to extract features along the spatial and temporal dimensions of a stack of video frames. Recently, Carreira et al. [2] suggested creating 3D CNN architectures by inflating established deep 2D CNN architectures along the temporal dimension. This basically means that all kernels are expanded into their cubic counterparts.
The proposed 3D CNN for surgical gesture recognition is based on 3D ResNet-18 [4], which is created by inflating an 18-layer residual network [5]. Input to the network are stacks of 16 consecutive video frames at reduced spatial resolution (as proposed in [4]). More precisely, to obtain an estimate of the gesture being performed at time t, we feed the video snippet (f_{t-15}, ..., f_t) to the network. Because we process the video at 5 fps, the network can refer to the previous three seconds of video in order to infer g_t. At this point, we abstain from feeding future video frames to the network so that the method remains applicable for online gesture recognition.
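The snippet construction can be sketched as simple index arithmetic; clamping indices at the start of the video (repeating the first frame) is our assumption and is not specified in the text:

```python
# Sketch of assembling the frame indices for the 16-frame input snippet
# (f_{t-15}, ..., f_t) of a video processed at 5 fps.

def snippet_indices(t, length=16):
    """Frame indices for the snippet ending at timestep t (0-indexed).

    Indices before the start of the video are clamped to 0, i.e., the first
    frame is repeated -- a common padding choice, assumed here.
    """
    return [max(0, i) for i in range(t - length + 1, t + 1)]

print(snippet_indices(20))  # [5, 6, ..., 20]: the previous ~3 s at 5 fps
print(snippet_indices(2))   # first frame repeated for early timesteps
```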
The original 3D ResNet-18 architecture is designed to predict one distinct action label per video snippet using a one-hot encoding. In contrast, surgical gesture recognition is a dense labeling problem, where each frame f_i of a video snippet has a distinct label g_i. This means that one video snippet may contain frames that belong to different gestures. To account for this, we adapt our network to output dense gesture label estimates (p_{t-15}, ..., p_t), where K denotes the number of distinct surgical gestures and each p_i has K components. The j-th component of p_i is the estimate for gesture label G_j, obtained at time i.
Specifically, we adapt the max pooling layer of 3D ResNet-18 so that downsampling is only performed along the spatial dimensions. The feature maps after the final average pooling layer thus retain a temporal dimension, which is upsampled to the 16 output timesteps using a transposed 1D convolution with kernel size 11 and stride 5.
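The upsampling arithmetic can be checked against the standard transposed-convolution length formula; the temporal input length of 2 used below is an assumption on our part, chosen because kernel size 11 and stride 5 map it exactly to 16 output timesteps:

```python
def conv_transpose1d_out_len(l_in, kernel, stride, padding=0):
    """Output length of a 1D transposed convolution (PyTorch convention,
    no output padding or dilation): (L_in - 1) * stride - 2 * padding + kernel."""
    return (l_in - 1) * stride - 2 * padding + kernel

# Assumed: a temporal feature length of 2 after the final average pooling,
# which kernel size 11 and stride 5 expand to the 16 output timesteps.
assert conv_transpose1d_out_len(2, kernel=11, stride=5) == 16
```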
An overview of the network architecture is given in table 1. The input is downsampled in the initial convolutional and max pooling layers and then passed through a number of residual blocks. When convolutions are applied with stride 2 to downsample feature maps, the number of feature maps is doubled. For details on residual blocks, please see the original papers [4, 5].
2.2 Network Training
We train our 3D CNN on video snippets (f_{t-15}, ..., f_t) to predict the corresponding ground truth gesture labels (g_{t-15}, ..., g_t). To this end, we minimize the loss L = Σ_i w_i · CE(p_i, g_i), where CE denotes the cross-entropy loss. We found it beneficial to penalize the errors made on more recent predictions harder and therefore train with weighting factors w_i that increase with i.
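A minimal sketch of such a weighted loss; the linear weighting and the toy probability inputs are illustrative assumptions, not the paper's exact factors:

```python
import math

def weighted_cross_entropy(probs, labels, weights):
    """L = sum_i w_i * CE(p_i, g_i): per-timestep cross-entropy, weighted so
    that errors on the most recent predictions are penalized harder."""
    return sum(w * -math.log(p[g]) for p, g, w in zip(probs, labels, weights))

# Toy snippet of 2 timesteps over 3 gestures; weights grow toward the present.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # predicted gesture distributions
labels = [0, 1]                              # ground truth gesture indices
weights = [1.0, 2.0]                         # illustrative linear weighting
loss = weighted_cross_entropy(probs, labels, weights)
```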
Because of their large number of parameters, 3D CNNs are difficult to train, especially on small datasets .
Thus, it is important to begin training from a suitable initialization of network parameters. We investigate two approaches for network initialization:
Initializing the network with parameters obtained by training on Kinetics, one of the largest human action datasets available so far. For this, a publicly available pretrained 3D ResNet-18 model (https://github.com/kenshohara/3D-ResNets-PyTorch#pre-trained-models) is used.
Bootstrapping network parameters from an ImageNet-pretrained 2D ResNet-18 model that was further trained on individual video frames to perform frame-wise gesture recognition.
As described in [2], the 3D filters of the 3D ResNet-18 are initialized by repeating the weights of the corresponding 2D filters along the temporal dimension and then dividing them by the number of repetitions.
During training, we sample video snippets at random temporal positions from the training videos. Per epoch, we sample about 3000 snippets in a class-balanced manner, which means that we ensure that each gesture is represented equally in the set of sampled snippets. For data augmentation, we use scale jittering and corner cropping as proposed in [14]. Here, all frames within one training snippet are augmented in the same manner. We train the 3D CNN for 250 epochs using the Adam optimizer [7] with a batch size of 32. The learning rate is divided by a factor of 5 every 50 epochs. Our 3D CNN implementation is based on code provided by [4] (https://github.com/kenshohara/3D-ResNets-PyTorch).
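The 2D-to-3D weight bootstrapping described above can be sketched with NumPy; filter shapes are illustrative:

```python
import numpy as np

def inflate_2d_filter(w2d, n_frames):
    """Bootstrap a 3D conv filter from a 2D one, I3D-style: repeat the 2D
    weights n_frames times along a new temporal axis and divide by n_frames,
    so the response on a temporally constant input is preserved."""
    return np.repeat(w2d[np.newaxis, ...], n_frames, axis=0) / n_frames

w2d = np.random.rand(3, 3)        # a single 3x3 spatial kernel
w3d = inflate_2d_filter(w2d, 3)   # inflated to a 3x3x3 spatiotemporal kernel
assert w3d.shape == (3, 3, 3)
# Summing the inflated filter over time recovers the original 2D filter:
assert np.allclose(w3d.sum(axis=0), w2d)
```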
We evaluate our approach on 39 videos of robot-assisted suturing tasks performed on a bench-top model, which are taken from the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) [1]. The recorded tasks were performed by eight participants with varying surgical experience. The videos were annotated with surgical gestures such as positioning the tip of the needle or pushing the needle through the tissue. In total, ten different gestures are used. We follow the leave-one-user-out (LOUO) setup for cross-validation as defined in [1]. Thus, for each experiment, we train one model per left-out user.
We report the following evaluation metrics:
- Frame-wise accuracy, i.e., the ratio of correctly predicted gesture labels in a video.
- Average F1 score, i.e., the F1 score averaged over gesture classes.
- Edit score, as proposed in [10], which employs the Levenshtein distance to assess the quality of predicted gesture segments.
- Segmental F1 score with threshold 10% (F1@10), as proposed in [8]. Here, a predicted gesture segment is considered a true positive if its intersection with the corresponding ground truth segment is over 10%, and the F1 score is calculated from the total number of true positives, false positives, and false negatives.
For each experiment, evaluation metrics are calculated for every video in the dataset and then averaged.

Table 2. Recognition results (in %).

| Method | Look ahead | Acc | Avg. F1 | Edit | F1@10 |
| --- | --- | --- | --- | --- | --- |
| Evaluation at 5 fps | | | | | |
| 2D ResNet-18 | 0 s | 79.9 | 73.3 | 41.4 | 55.4 |
| 3D CNN (B) | 0 s | 79.9 | 73.7 | 64.0 | 75.2 |
| 3D CNN (K) | 0 s | 81.8 | 75.8 | 58.7 | 71.1 |
| 3D CNN (B) + window | 3 s | 84.0 | 78.4 | 80.7 | 87.2 |
| 3D CNN (K) + window | 3 s | 84.2 | 78.4 | 80.0 | 87.1 |
| Evaluation at 10 fps | | | | | |
| S-CNN [9] | 1 s | 74.0 | – | 37.7 | – |
| S-CNN + TCN, causal | 1 s | 76.8 | 71.5 | 57.3 | 69.6 |
| S-CNN + TCN | 3 s | 76.1 | 69.9 | 68.2 | 77.9 |
| ST-CNN [9] | 10 s | 77.7 | – | 68.0 | – |
| S-CNN + TCN [8, 10] | 22.5 s | 81.4 | 77.6 | 84.9 | 89.6 |
| S-CNN + TCN + Deep RL [11] | – | 81.4 | – | 88.0 | 92.0 |
| 2D ResNet-18 | 0 s | 79.5 | 73.1 | 30.6 | 44.2 |
| 3D CNN (B) | 0 s | 79.5 | 73.6 | 49.5 | 62.8 |
| 3D CNN (K) | 0 s | 81.3 | 75.1 | 46.3 | 60.1 |
| 3D CNN (B) + window | 3 s | 84.0 | 78.6 | 80.6 | 87.0 |
| 3D CNN (K) + window | 3 s | 84.3 | 78.6 | 80.0 | 87.0 |

As a baseline experiment, we train a 2D ResNet-18 [5], i.e., the 2D counterpart of the proposed 3D CNN, for frame-wise gesture recognition. Here, we follow the training procedure described in section 2.2, except that we train on video snippets of size 1, i.e., individual video frames. The 2D ResNet-18 is initialized with ImageNet-pretrained weights.
Additionally, we perform two experiments in which we train the proposed 3D CNN for surgical gesture recognition: one where we initialize the 3D CNN with Kinetics-pretrained weights (3D CNN (K)) and one where we bootstrap weights from a pretrained 2D ResNet-18 as described in section 2.2 (3D CNN (B)). To account for the stochastic nature of CNN optimization, we repeat the three experiments four times and report the averaged results. For the 3D CNN (B) experiments, we initialize the models in each repetition by bootstrapping weights from the corresponding 2D ResNet-18 models (with respect to the LOUO splits) that were trained during the same repetition of the baseline experiment.

We evaluate the trained 3D CNN models either snippet-wise or in combination with a sliding window (+ window). For snippet-wise evaluation, the estimated gesture label at time t is simply obtained from p_t. With the sliding window approach, we accumulate the dense predictions of the 3D CNN over time, which yields the overall estimate for the gesture at time t. To obtain this estimate, information of 15 future timesteps is used, which corresponds to the next three seconds of video.

To make comparisons to prior studies possible, we additionally evaluate the 2D ResNet-18 and the 3D CNN models at 10 fps. This means that we extract video snippets at 10 Hz, instead of 5 Hz, from the video. For the 3D CNNs, the individual snippets still consist of 16 frames sampled at 5 fps. To apply the sliding window approach, we temporally upsample the dense predictions accordingly.

The experimental results are listed in table 2. For comparison, we state the results of some previous methods that were described in section 1. Further experiments can be found in the supplementary document. S-CNN + TCN refers to the method where spatial features are extracted from video frames using a S-CNN and fed to a TCN that predicts surgical gestures [8, 10].
Here, the results were reproduced using the ED-TCN architecture described in [8] with 2 layers. For causal evaluation, temporal filters are applied over the frames up to time t only, instead of a window centered around t. We use source code provided by the authors of [8, 10] (https://github.com/colincsl/TemporalConvolutionalNetworks). The reported results are averaged over four LOUO cross-validation runs.
As can be seen in table 2, the proposed variant of 3D ResNet-18 for snippet-wise gesture recognition yields comparable or better frame-wise evaluation results (accuracy and average F1) and considerably better segment-based evaluation results (edit score and F1@10) compared to its 2D counterpart. This demonstrates the benefit of modeling several consecutive video frames to capture the temporal evolution of video.
Accumulating the 3D CNN predictions using a sliding window with a duration of three seconds provides a further boost to recognition performance. Not only does the sliding window approach produce better gesture segments, it also improves frame-wise accuracies. Considering future video snippets most likely helps to resolve ambiguities in individual snippets.
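This accumulation can be sketched as follows; averaging all overlapping estimates per frame is our reading of "accumulate", and the names and shapes are illustrative:

```python
# Sketch of accumulating dense per-snippet predictions over time. Each snippet
# ending at timestep t yields one probability vector per covered frame; at
# every frame we average all estimates that cover it.

def accumulate_predictions(dense_preds, snippet_len, num_frames, num_classes):
    """dense_preds maps a snippet end time t to a list of snippet_len
    probability vectors for frames t - snippet_len + 1, ..., t.
    Returns one averaged probability vector per frame."""
    sums = [[0.0] * num_classes for _ in range(num_frames)]
    counts = [0] * num_frames
    for t, snippet in dense_preds.items():
        for offset, p in enumerate(snippet):
            frame = t - snippet_len + 1 + offset
            if 0 <= frame < num_frames:
                sums[frame] = [a + b for a, b in zip(sums[frame], p)]
                counts[frame] += 1
    return [[v / max(c, 1) for v in row] for row, c in zip(sums, counts)]

# Toy example: snippets of length 2 over a 3-frame video with 2 gestures.
preds = accumulate_predictions(
    {1: [[1.0, 0.0], [0.0, 1.0]], 2: [[0.0, 1.0], [1.0, 0.0]]},
    snippet_len=2, num_frames=3, num_classes=2)
```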
Minor differences can be observed between the two network initialization variants, Kinetics pretraining (K) and 2D weight bootstrapping (B): while pretraining on Kinetics yields higher frame-wise accuracies, weight bootstrapping yields better gesture segments. In combination with the sliding window, the differences are marginal.
When testing at 10 fps instead of 5 fps, we observe a notable degradation of the segment-based measures for both the 2D ResNet-18 and the 3D variants. Most likely, the high evaluation frequency enhances noise in the gesture predictions, which is penalized by the edit score and the F1@10 metric. For the 3D CNNs, this effect can be alleviated by filtering with the sliding window.
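To illustrate how such prediction noise is penalized, a minimal sketch of a segment-level edit score (segment collapsing plus a normalized Levenshtein distance; the normalization follows common usage and may differ in detail from the exact metric in [10]):

```python
def segments(labels):
    """Collapse frame-wise labels into the sequence of segment labels."""
    return [g for i, g in enumerate(labels) if i == 0 or g != labels[i - 1]]

def levenshtein(a, b):
    """Edit distance between two sequences (rolling-array DP)."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
    return d[len(b)]

def edit_score(pred, truth):
    p, t = segments(pred), segments(truth)
    return 100.0 * (1 - levenshtein(p, t) / max(len(p), len(t)))

truth = [0, 0, 1, 1, 2]
noisy = [0, 1, 0, 1, 2]          # one flicker creates two spurious segments
print(edit_score(truth, truth))  # 100.0
print(edit_score(noisy, truth))  # 60.0: heavy penalty despite one bad frame
```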
Compared to the ST-CNN, the 3D CNN yields considerably better results with regard to all evaluation metrics when evaluated with the sliding window approach. Apparently, for the given task, modeling spatiotemporal features in video snippets achieves better results than modeling spatial and temporal information separately, as is the case for the ST-CNN.
In combination with the sliding window, the proposed 3D CNN also outperforms the state-of-the-art methods S-CNN + TCN and S-CNN + TCN + Deep RL in terms of accuracy and average F1. These methods apply very long temporal filters, while the proposed approach only processes a few seconds of video to estimate the current gesture. Thus, it is surprising that the quality of gesture segments, as measured by edit score and F1@10, is almost equal.
Note that the proposed method operates with a delay of only 3 seconds and can therefore provide information, such as feedback in a surgical training scenario, in a more timely manner than methods with a longer look ahead time.
We present a 3D CNN to predict dense gesture labels for surgical video. The conducted experiments demonstrate the benefits of using an inherently spatiotemporal model to extract features from consecutive video frames. Future work will investigate options for combining spatiotemporal feature extractors with models that capture high-level temporal dependencies, such as LSTMs or TCNs.
The authors thank Colin Lea for sharing code and precomputed S-CNN features to reproduce results from [8, 10], as well as the Helmholtz-Zentrum Dresden-Rossendorf (HZDR) for granting access to their GPU cluster.
-  Ahmidi, N., Tao, L., Sefati, S., Gao, Y., Lea, C., Haro, B.B., et al.: A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery. IEEE Trans Biomed Eng 64(9), 2025–2041 (2017)
-  Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR. pp. 4724–4733. IEEE (2017)
-  DiPietro, R., Lea, C., Malpani, A., Ahmidi, N., Vedula, S.S., Lee, G.I., et al.: Recognizing surgical activities with recurrent neural networks. In: MICCAI. pp. 551–558. Springer, Cham (2016)
-  Hara, K., Kataoka, H., Satoh, Y.: Learning spatio-temporal features with 3D residual networks for action recognition. In: ICCV-W. pp. 3154–3160. IEEE (2017)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778. IEEE (2016)
-  Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1), 221–231 (2013)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
-  Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: CVPR. pp. 156–165. IEEE (2017)
-  Lea, C., Reiter, A., Vidal, R., Hager, G.D.: Segmental spatiotemporal CNNs for fine-grained action segmentation. In: ECCV. pp. 36–52. Springer, Cham (2016)
-  Lea, C., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks: A unified approach to action segmentation. In: ECCV-W. pp. 47–54. Springer, Cham (2016)
-  Liu, D., Jiang, T.: Deep reinforcement learning for surgical gesture segmentation and classification. In: MICCAI. pp. 247–255. Springer, Cham (2018)
-  Tao, L., Elhamifar, E., Khudanpur, S., Hager, G.D., Vidal, R.: Sparse hidden markov models for surgical gesture classification and skill evaluation. In: IPCAI. pp. 167–177. Springer, Berlin, Heidelberg (2012)
-  Tao, L., Zappella, L., Hager, G.D., Vidal, R.: Surgical gesture segmentation and recognition. In: MICCAI. pp. 339–346. Springer, Berlin, Heidelberg (2013)
-  Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., et al.: Temporal segment networks: Towards good practices for deep action recognition. In: ECCV. pp. 20–36. Springer, Cham (2016)