Robotic surgery has made tremendous progress in recent years within the medical field. Compared with traditional practice, surgical robots nowadays enable surgeons to remotely control thin rods with intricate instruments and video cameras. This has been applied to many types of surgery including tumor resection, microsurgical blood vessel reconstruction, and organ transplantation. Because of the smaller incisions and decreased blood loss during the operation, patients generally experience less trauma and recover faster. The precision, stability, and flexibility of the robotic devices also allow surgeons to perform complex procedures that would otherwise have been difficult or impossible.
The digitization of surgical vision and control has made this field ripe for the next AI revolution. While there are many algorithmic and data challenges, the hardware is ready. In fact, there is a push for rapid technological advancements: new robotic systems are competing to make surgeons see the targeted anatomy better and perform procedures that are ever more delicate. At least nine surgical devices and platforms have been approved by the FDA and more are under review (Peters et al., 2018). Accompanying this trend is the development of simulation platforms for training surgeons with similar console input, mimic exercises in virtual reality, and scoring systems for grading skills.
A clear aspiration in the field is to build autonomous surgical robots. This, however, will remain out of reach for some time due to the complexity of instrument-environment interaction and soft tissue dynamics. Most AI efforts in surgery have so far been incremental, aimed at building up our understanding of the data recorded from robotic surgery platforms (Zhou et al., 2019). A growing number of studies have shown that machine learning can help evaluate surgeon skills or delineate task structures from video and motion data. These studies typically focus on a benchmark dataset, JIGSAWS (Gao et al., 2014), with three representative surgical tasks performed by eight surgeons binned into three skill levels: novice, intermediate, and expert. The fact that these surgeons had wide practice gaps (from less than 10 hours of robotic operation time for novices to over 100 hours for experts) made the classification task well defined.
In this study, we focus on a new dataset created by surgical fellows at UIC performing training exercises with similar levels of high proficiency. We explore the possibility of surgeon and task prediction in this challenging setting without video data. We demonstrate that, with a simple encoding of the motion sequences, we can predict the surgeon with 83% accuracy and the exercise with 79% accuracy using kinematic data alone (Section 4.3). Our efficient binary encoding of motion characteristics outperforms models trained on raw features and enables us to convert the models to spiking neural networks with sparse input and no performance loss.
2. Background and Related Work
While AI has revolutionized many fields, its impact in surgery has so far been multifaceted but incremental (Hashimoto et al., 2018). Medical images are processed with deep learning to help pre-operative planning and intra-operative guidance (Zhou et al., 2019). Surgical videos are analyzed to assess human proficiency (Funke et al., 2019), recognize gestures (Gao et al., 2020; Huang et al., 2018; Sarikaya and Jannin, 2019), and segment surgical instruments (Ni et al., 2019; Qin et al., 2020) and human anatomy (Allan et al., 2020; Hashimoto et al., 2018). Likewise, kinematic logs recorded from clinical surgical devices have been used in skill assessment (Ismail Fawaz et al., 2018; Nguyen et al., 2019; Zhou et al., 2015) and gesture recognition (Krishnan et al., 2018; Sefati et al., 2015; Sarikaya and Jannin, 2020) via convolutional, recurrent, and graph neural networks. Many of these studies focused on the aforementioned JIGSAWS dataset and achieved high classification accuracy.
There has been much effort to convert commonly used classical networks into spiking neural networks (SNN) (Esser et al., 2016; Diehl et al., 2016), motivated by their energy efficiency and biological plausibility. Recent successes in applying SNNs to event-driven data (Blouw et al., 2018; Stromatias et al., 2017) reveal their potential for handling difference-encoded time-series data in surgery. We hypothesize that advances in novel neuromorphic hardware architectures and SNN design will be needed to enable breakthroughs in AI-advanced robotic surgery.
Surgical fellows at UIC have endeavoured to create a dataset of paired video and kinematic sequences recorded from surgical simulation exercises. These procedures were performed on MIMIC’s FlexVR portable 3D standalone surgical system. In this study, we focus on a subset of 120 exercises performed by 4 surgical fellows on 4 representative tasks that train hand-eye coordination, ambidexterity, and fine motor control. For each exercise, there are on average 4000 timesteps of kinematic data describing instrument positions, orientations, and gripper angles, corresponding to videos 0.5 to 6 minutes long. The data was stratified by both surgeon and exercise to maintain a fair class balance and avoid any data leakage.
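As an illustration of the stratification described above, the following is a minimal sketch, not the procedure actually used in this study. It assumes that whole exercises (rather than individual windows) are assigned to the train or test side, so that no recording contributes to both, and that each (surgeon, task) pair is sampled at the same rate. The function name `stratified_split`, the `test_fraction` value, and the toy labels are all hypothetical.

```python
import numpy as np

def stratified_split(surgeons, tasks, test_fraction=0.25, seed=0):
    """Split exercise indices into train/test, stratified by (surgeon, task)."""
    surgeons = np.asarray(surgeons)
    tasks = np.asarray(tasks)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for s in np.unique(surgeons):
        for t in np.unique(tasks):
            # All exercises recorded by surgeon s on task t.
            group = np.flatnonzero((surgeons == s) & (tasks == t))
            if group.size == 0:
                continue
            rng.shuffle(group)
            # Hold out at least one exercise per group when the group has several.
            n_test = max(1, int(round(test_fraction * group.size))) if group.size > 1 else 0
            test_idx.extend(group[:n_test].tolist())
            train_idx.extend(group[n_test:].tolist())
    return np.array(sorted(train_idx)), np.array(sorted(test_idx))

# Toy labels (hypothetical): 4 fellows x 4 tasks x 4 repetitions = 64 exercises.
surgeons = np.repeat(["A", "B", "C", "D"], 16)
tasks = np.tile(np.repeat(["t1", "t2", "t3", "t4"], 4), 4)
train_idx, test_idx = stratified_split(surgeons, tasks)
```

Splitting at the exercise level before any windowing is one way to keep segments of the same recording from appearing on both sides of the split; other schemes are possible.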
3.2. Sparse encoding of kinematic sequence
Using the MIMIC simulator dataset described in Section 3.1, deep learning models were trained to predict two ground truth labels, surgeon and task, extracted from the kinematic logs. The logs begin as 30 positional recordings per second describing the location and configuration of the camera and apertures. The positional data at each timestep is converted to the difference, or movement, between consecutive timesteps. While the accuracy of the models trained on these movements was good, we ultimately convert the movements to binary events: any movement above a small threshold is coded as an event for that feature. While this results in an overall loss of information, the motivation is twofold: (1) the capture rate is high enough that the micro-movements between subsequent timesteps may be primarily operator or sensor noise, and (2) encoding the data stream as binary spikes dramatically increases the sparsity of the data, potentially greatly reducing memory and computation footprints and directly enabling an efficient spike-time encoding for use with neuromorphic hardware.
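The two-step encoding (positions to movements, movements to binary events) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact preprocessing used here: the threshold value is hypothetical, and the sketch assumes the threshold is applied to the absolute movement so that motion in either direction counts as an event.

```python
import numpy as np

def encode_events(positions, threshold=1e-3):
    """Convert a (timesteps, features) position log into binary movement events."""
    positions = np.asarray(positions, dtype=float)
    # Movement between consecutive timesteps; the result has one row fewer.
    movement = np.diff(positions, axis=0)
    # Assumption: an event fires when the absolute movement exceeds the threshold.
    return (np.abs(movement) > threshold).astype(np.uint8)

# Toy log with 5 timesteps and 2 features (hypothetical values).
log = np.array([
    [0.0000, 1.0],
    [0.0005, 1.0],   # sub-threshold jitter on feature 0
    [0.0100, 1.0],   # real movement on feature 0
    [0.0100, 0.5],   # real movement on feature 1
    [0.0100, 0.5],   # no movement
])
events = encode_events(log)
sparsity = 1.0 - events.mean()   # fraction of entries that are zero
```

In this toy log only two of the eight entries are events, which shows how sub-threshold jitter is discarded and how the encoded stream becomes sparse.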
3.3. LSTM and CNN models
We trained a long short-term memory (LSTM) recurrent neural network (RNN) consisting of two bidirectional LSTM layers followed by two dense layers and a final dense softmax output. For comparison, we developed a convolutional neural network (CNN) with a single one-dimensional convolutional layer followed by two dense layers and a softmax output. The numbers of neurons used in the layers are [128, 64, 64, 16] and [128, 128, 16], respectively. Additional convolutional or dense layers, or an increased number of neurons or filters, did not improve accuracy. Dropout and batch normalization were employed between layers.
3.4. Conversion to spiking neural networks
The conversion from classical neural networks to spiking neural networks (SNN) is made with Nengo (Bekolay et al., 2014), a neural simulator based on the flexible Neural Engineering Framework that supports multiple backends, including neuromorphic hardware such as the Intel Loihi chip (Davies et al., 2018), with minimal change in code. It also has a deep learning extension module, NengoDL (Rasmussen, 2018), which is a bridge between Keras and Nengo. Similar to other classical-to-spiking conversion software such as SNN Toolbox (Rueckauer et al., 2017), NengoDL provides a converter that turns a deep neural network model in Keras into a native Nengo model by replacing Keras layers with equivalent implementations wired with Nengo neurons.
We used the built-in converter in NengoDL to convert the aforementioned deep neural networks to spiking neural network models. The converter uses a differentiable approximation of the non-differentiable spiking functions in neurons during the training phase and replaces it with the original function at inference (Hunsberger and Eliasmith, 2016). The dense models were converted to native Nengo models without modification. The input data of the convolutional model was flattened, as Nengo nodes only accept one-dimensional inputs. Because NengoDL lacks recurrent support, the LSTM layers cannot be directly converted. As a workaround, a hybrid SNN model was created with the LSTM layers executed in Keras and the rest in Nengo.
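The idea of training with a smooth rate function and running inference with spikes can be illustrated with a small simulation. The sketch below is not NengoDL code and makes its own assumptions: it uses a rectified linear rate function as the training-time stand-in and a simple integrate-and-fire neuron with a unit threshold as the spiking counterpart. The neuron type, the 1 ms timestep, and the function names are hypothetical choices for illustration; the neuron model used in this study is not specified in the text.

```python
import numpy as np

def relu_rate(current):
    """Rate-based rectified linear response in Hz (usable with gradient descent)."""
    return np.maximum(current, 0.0)

def spiking_relu(current, duration=1.0, dt=0.001):
    """Simulate an integrate-and-fire rectified linear neuron; return its mean rate in Hz."""
    voltage = 0.0
    spikes = 0
    for _ in range(int(round(duration / dt))):
        # Integrate the rectified input current over one timestep.
        voltage += max(current, 0.0) * dt
        # Emit one spike per unit threshold crossed, then subtract those crossings.
        n = int(voltage)
        spikes += n
        voltage -= n
    return spikes / duration

for current in (-20.0, 50.0, 123.4):
    print(current, float(relu_rate(current)), spiking_relu(current))
```

Averaged over a sufficiently long window, the spike count tracks the rate function closely, which is why swapping the rate neuron for its spiking version at inference can preserve accuracy.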
4.1. Task and Surgeon Prediction
As a baseline, the best-performing traditional learning algorithm, LightGBM (LGBM; Ke et al., 2017), does not outperform any of the deep learning methods on the event-encoded data. However, the baseline model does outperform all but the LSTM on the raw kinematic motion data. For the deep learning approaches, accuracy improved with the lossy conversion to binary event sequences. This supports our hypothesis that neuromorphic, and particularly recurrent, approaches will offer an advantage for real-time event-based surgical data.
From the confusion matrix shown in Figure 2, it is clear that most of the misclassifications are between the Ring & Rail and Thread the Rings tasks. These tasks frequently require wrist rotation, direct handoffs of objects, and manual camera movement. The Pick and Place task does not require any of these key movements. Note that while the data was stratified by surgeon and exercise, the total number of sequences varies slightly due to differences in completion time.
4.2. Visualization of kinematic actions
Following the input encoding and model training in the previous sections, visualizations of the latent-layer kinematic actions are generated as follows:
The 20-dimensional, 40-step time-series kinematic sequences are propagated through the network up to the final 16-neuron hidden dense layer. This is the layer immediately preceding the output of the model.
The resulting 16-dimensional activations are projected to 2D using t-SNE (van der Maaten and Hinton, 2008).
These 2D coordinates are plotted and colored according to the ground truth label used to train the model.
We can observe in 2D the separability of the classes, as well as clusters of similar actions shared across two or more classes. This is particularly noticeable in the surgeon prediction visualization shown in Figure 5, which has a noticeable cluster for each combination of the surgical fellows. Also of note is the difference in the spread of the surgeons’ movements. Fellow A’s actions are fairly tightly distributed, while Fellows C and D’s movements are spread farther apart despite their having completed the same surgical tasks. The fact that our model appears to capture motion signatures of different surgical fellows of similar proficiency is promising. This indicates that neuromorphic chips deployed at the edge could potentially be useful in providing personalized assistance in surgery and training.
4.3. Neuromorphic approach
We compared the performance of the base and converted SNN models on the binary-encoded dataset. As shown in Tables 1 and 2, the best-performing model is the LSTM, which achieves 79% test accuracy for surgery task prediction and 83% for surgeon prediction. Meanwhile, the convolutional neural network model and the fully connected model show accuracy drops of roughly 10% and 7%, respectively. Among the converted models, both the hybrid LSTM model and the native Nengo models achieve roughly the same accuracy as their DNN counterparts, suggesting no noticeable loss of performance in the spiking neural network conversion.
4.4. Feature Importance
To identify the kinematic features that distinguish different surgeons, we examined the contribution of each feature to prediction accuracy. A traditional method is to query machine learning models directly for feature importance. However, this method would give the importance of a feature instance at a specific timestamp instead of its influence over the whole time series. To address this issue, we repeatedly removed one feature from the input data and ran the top-performing LSTM models with identical configurations. The resulting accuracy changes are shown in Fig. 6 for all single features. The results show that certain features play an essential role in surgeon and task prediction. For example, Pitch and Roll actions on the left arm are top features for both surgeon and task classification. This could be explained by the nature of the exercises: Pick and Place requires almost no rotation, while in Thread the Rings, the surgeons tend to pick, rotate, and pierce the needle through the ring with the left arm. There appears to be no standout feature that has an outsized impact, which again suggests that the model has learned representations of motion characteristics from the sequence of actions.
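The leave-one-feature-out procedure can be sketched as follows. This is a hedged illustration, not the code used for Fig. 6: a nearest-centroid classifier stands in for the LSTM so that the example stays self-contained, the data is synthetic, and the names `feature_ablation` and `nearest_centroid_accuracy` are hypothetical. The only part meant to mirror the text is the loop that drops one feature channel at a time, refits the same model configuration, and records the change in accuracy.

```python
import numpy as np

def nearest_centroid_accuracy(x_train, y_train, x_test, y_test):
    """Stand-in model: flatten each sequence and classify by nearest class centroid."""
    x_train = x_train.reshape(len(x_train), -1)
    x_test = x_test.reshape(len(x_test), -1)
    classes = np.unique(y_train)
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((x_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y_test).mean())

def feature_ablation(x_train, y_train, x_test, y_test, fit_and_score):
    """Accuracy change when each feature channel is removed and the model is refit."""
    baseline = fit_and_score(x_train, y_train, x_test, y_test)
    n_features = x_train.shape[-1]
    changes = np.empty(n_features)
    for f in range(n_features):
        keep = [i for i in range(n_features) if i != f]
        acc = fit_and_score(x_train[..., keep], y_train, x_test[..., keep], y_test)
        changes[f] = acc - baseline
    return baseline, changes

# Synthetic binary event sequences: (samples, 40 timesteps, 3 features).
rng = np.random.default_rng(0)

def make_split(n_per_class):
    x = rng.integers(0, 2, size=(2 * n_per_class, 40, 3)).astype(float)
    y = np.repeat([0, 1], n_per_class)
    x[:, :, 0] = y[:, None]   # only feature 0 carries the label; the rest is noise
    return x, y

x_train, y_train = make_split(50)
x_test, y_test = make_split(50)
baseline, changes = feature_ablation(
    x_train, y_train, x_test, y_test, nearest_centroid_accuracy
)
```

In this toy setup, removing the informative channel lowers accuracy sharply while removing a noise channel leaves it unchanged. Because each feature is removed from the entire sequence, the measured change reflects its influence over the whole time series rather than at a single timestep.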
5. Future work
We are excited by recent advancements in neuromorphic hardware and algorithm design and by their applicability to open challenges in robotic surgery. Given our success with RNNs on short sequences, we are keen to explore Applied Brain Research’s LMU (Voelker et al., 2019) networks for long-term memory associations in very long sequences. Additionally, we must explore the potential of models and techniques for domain adaptation, leveraging data from programmable simulators for clinical application. Following the success of our event encoding scheme for this novel time-series kinematic dataset and the conversion to SNN with no accuracy loss, we must investigate the computation speed and other advantages of real-time on-chip processing for surgical tasks using neuromorphic hardware.
Acknowledgements. Part of this work is supported by the Laboratory Directed Research & Development Program at Argonne National Laboratory. We thank the surgical fellows, Dr. Alberto Mangano, Dr. Gabriela Aguiluz, Dr. Roberto Bustos, Dr. Valentina Valle and Dr. Yevhen Pavelko from Dr. Pier Cristoforo Giulianotti’s robotic surgery team, for helping create and continuing to expand this dataset. We would also like to show gratitude to Mimic Technologies Inc. for loaning the FlexVR simulation device. We are also grateful to Nathan Wycoff for discussion and valuable comments on the manuscript.
- Allan et al. (2020). 2018 robotic scene segmentation challenge.
- Bekolay et al. (2014). Nengo: a Python tool for building large-scale functional brain models. Frontiers in Neuroinformatics 7 (48), pp. 1–13.
- Blouw et al. (2018). Benchmarking keyword spotting efficiency on neuromorphic hardware.
- Davies et al. (2018). Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38 (1), pp. 82–99.
- Diehl et al. (2016). Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware.
- Esser et al. (2016). Convolutional networks for fast, energy-efficient neuromorphic computing. PNAS 113, pp. 11441–11446.
- Funke et al. (2019). Video-based surgical skill assessment using 3D convolutional neural networks. International Journal of Computer Assisted Radiology and Surgery 14 (7), pp. 1217–1225.
- Gao et al. (2020). Automatic gesture recognition in robot-assisted surgery with reinforcement learning and tree search.
- Gao et al. (2014). JHU-ISI gesture and skill assessment working set (JIGSAWS): a surgical activity dataset for human motion modeling. In MICCAI Workshop: M2CAI, Vol. 3, pp. 3.
- Hashimoto et al. (2018). Artificial intelligence in surgery: promises and perils. Annals of Surgery 268, pp. 1.
- Huang et al. (2018). Neural task graphs: generalizing to unseen tasks from a single video demonstration.
- Hunsberger and Eliasmith (2016). Training spiking deep networks for neuromorphic hardware.
- Ismail Fawaz et al. (2018). Evaluating surgical skills from kinematic data using convolutional neural networks. In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), pp. 214–221.
- Ke et al. (2017). LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3146–3154.
- Krishnan et al. (2018). Transition state clustering: unsupervised surgical trajectory segmentation for robot learning. In Robotics Research: Volume 2, A. Bicchi and W. Burgard (Eds.), pp. 91–110.
- Nguyen et al. (2019). Surgical skill levels: classification and analysis using deep neural network model and motion signals. Computer Methods and Programs in Biomedicine 177, pp. 1–8.
- Ni et al. (2019). RASNet: segmentation for tracking surgical instruments in surgical videos using refined attention segmentation network.
- Peters et al. (2018). Review of emerging surgical robotic technology. Surgical Endoscopy 32 (4), pp. 1636–1655.
- Qin et al. (2020). Towards better surgical instrument segmentation in endoscopic vision: multi-angle feature aggregation and contour supervision.
- Rasmussen (2018). NengoDL: combining deep learning and neuromorphic modelling methods. arXiv 1805.11144, pp. 1–22.
- Rueckauer et al. (2017). Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in Neuroscience 11.
- Sarikaya and Jannin (2019). Surgical gesture recognition with optical flow only.
- Sarikaya and Jannin (2020). Towards generalizable surgical activity recognition using spatial temporal graph convolutional networks.
- Sefati et al. (2015). Learning shared, discriminative dictionaries for surgical gesture segmentation and classification.
- Stromatias et al. (2017). An event-driven classifier for spiking neural networks fed with synthetic or dynamic vision sensor data. Frontiers in Neuroscience 11.
- van der Maaten and Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
- Voelker et al. (2019). Legendre memory units: continuous-time representation in recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 15544–15553.
- Zhou et al. (2015). Learning deep features for discriminative localization.
- Zhou et al. (2019). Artificial intelligence in surgery. arXiv preprint arXiv:2001.00627.