In the field of surgical robotics research, the development of autonomous and semi-autonomous robotic surgical systems is among the most popular emerging topics [moustris2011evolution]. Such systems allow robot-assisted surgery (RAS) to go beyond teleoperation and assist surgeons in many ways, including autonomous procedures, user interface (UI) integration, and advisory information [chalasani2018computational, dimaio2018interactive]. One prerequisite for these applications is the perception of the current state of the surgical task being performed. These states include the actions performed or the changes in the environment observed by the system. For instance, during suturing, the system needs to know whether the needle is visible from the endoscopic view before providing more advanced applications such as advising on the needle position or autonomous suturing. Additionally, the recognition of higher-level surgical states, or surgical phases, has a wide range of applications in post-operative analysis and surgical skill evaluation [zia_miccai_2018].
The recognition and segmentation of the robot’s current action is one of the main pillars of the surgical state estimation process. Many models have been developed for the segmentation and recognition of fine-grained surgical actions that last for a few seconds, such as cutting [lea2016learning, dipietro2016recognizing, menegozzo2019surgical, mavroudi2018end], as well as surgical phases that last for up to 10 minutes, such as bladder dissection [yu2018learning, zia2017temporal, yengera2018less]. The recognition of fine-grained surgical states is particularly challenging due to their short duration and frequent state transitions. Most work in this field has focused on developing models using only one type of input data, such as kinematics or vision. Some studies have focused on learning based on robot kinematics, using models such as Hidden Markov Models [tao2012sparse, rosen2006generalized, volkov2017machine] and Conditional Random Fields (CRF) [tao2013surgical]. Zappella et al. proposed methods of modeling surgical video clips for single-action classification [zappella2013surgical]. The Transition State Clustering (TSC) and Gaussian Mixture Model methods provide unsupervised or weakly-supervised methods for surgical trajectory segmentation [krishnan2018transition, van2019weakly]. More recently, deep learning methods have come to define the state-of-the-art, such as Temporal Convolutional Networks (TCN) [lea2016temporal], Time Delay Neural Networks (TDNN) [menegozzo2019surgical], and Long Short-Term Memory (LSTM) [dipietro2016recognizing, dipietro2019segmenting].
Instead of using robot kinematics data, vision-based methods have been developed based on Convolutional Neural Networks (CNN). Vision-based models in RAS use the vision data that is readily available from the endoscopic view. Concatenating spatial features on the temporal axis with spatio-temporal CNNs (ST-CNN) has been explored in [lea2016segmental]. Jin et al. introduced the post-processing of predictions using prior knowledge inference [jin2017sv]. TCN can also be applied to vision data for action segmentation, taking the encoding of a spatial CNN as input [lea2016temporal]. Ding et al. proposed a hybrid TCN-BiLSTM network [ding2017tricornet]. The limitation shared by single-input action recognition models is that the representative vision and kinematics features of different states differ widely, so that some states are more readily distinguished through one type of input data than through another.
Compared to action recognition datasets such as ActivityNet [caba2015activitynet], RAS data enjoys the luxury of having synchronized vision, system events, and robot kinematics data. Attempts to incorporate multiple types of input data have so far focused on using derived values as additional variables to a single model. Lea et al. measured two scene-based features in JIGSAWS as additional variables to the robot kinematics data in their Latent Convolutional Skip-Chain CRF (LC-SC-CRF) model [lea2016learning]. Zia et al. collected the robot kinematics and system events data from RAS to perform surgical phase recognition [zia2017temporal]. While these attempts have been shown to improve model accuracy, to the best of the authors’ knowledge, there is yet to be a unified method that incorporates multiple data sources directly for fine-grained surgical state estimation.
In addition to robot actions, the finite state machine (FSM) of a surgical task should also include the environmental changes observed by the robot. These non-action states were omitted in popular surgical action segmentation datasets such as JIGSAWS [ahmidi2017dataset] and Cholec80 [twinanda2016endonet]; however, they are important for applications such as autonomous procedures. They are also challenging to recognize, as some non-action states may not be well reflected in a single-source dataset.
Contributions: In this paper, we propose a unified approach to fine-grained state estimation in RAS using multiple types of input data collected from the da Vinci® surgical system. The input data we use includes the endoscopic video, robot kinematics, and the system events of the surgical system. Our goal is to achieve real-time fine-grained state estimation of the surgical task being performed. To re-emphasize, we refer to fine-grained states as states that last on the scale of seconds. Our main contributions include:
Implement a unified state estimation model that incorporates vision-, kinematics-, and event-based state estimation results;
Improve the frame-wise state estimation accuracy of state-of-the-art methods by up to 11% through the incorporation of multiple sources of data;
Demonstrate the advantages of a multi-input state estimation model through the comparison of single-input models’ performances in recognizing states with different representative features or levels of granularity in a complex and realistic surgical task.
We evaluated the performance of our model using JIGSAWS and a new RIOUS (robotic intra-operative ultrasound) dataset we developed. RIOUS consists of phantom and porcine experiments on a da Vinci® Xi surgical system (Fig. 1). Compared to JIGSAWS, which is relatively simple as it only contains dry-lab tasks with no camera motion or non-action annotations, the RIOUS dataset better resembles real-world surgical tasks. This is because the RIOUS dataset contains dry-lab, cadaveric, and in-vivo experiments, as well as camera movements and annotations of both action and non-action states. (All in-vivo experiments were performed on porcine models under an Institutional Animal Care and Use Committee (IACUC) approved protocol.) We evaluated the accuracy of multiple state estimation models in the recognition of states with different representative features. Each model has its respective strengths and weaknesses, which explains the superior performance of our unified approach to state estimation.
Our proposed model (Fig. 2) consists of four single-source state estimation models: one vision-based model, two kinematics-based models, and one event-based model. Their outputs are fed to a fusion model that makes a comprehensive inference. In this section, we discuss each individual model as well as the fusion model that combines their outputs.
2.1 Vision-based Method
The vision-based state estimation model is a CNN-TCN model [lea2016temporal] that takes the endoscopic camera stream as the input in the form of a series of video frames. The CNN architecture we deploy is VGG16 [simonyan2014very]. The spatial CNN component serves as a feature extractor and maps each RGB image to a vector $X \in \mathbb{R}^{F_0}$, where $F_0$ is the number of features. $X$ is then fed to the TCN component, which is an encoder-decoder network (Fig. 3). At time step $t$, the input vector is denoted by $X_t$ for $0 < t \le T$. For the $l$-th 1-D convolutional layer ($l \in \{1, \dots, L\}$), $F_l$ filters of kernel size $k$ are applied along the temporal axis to capture the temporal progress of the input data. $T_l$ is the number of time steps in the $l$-th layer. In each layer, the filters are parameterized by a weight tensor $W^{(l)} \in \mathbb{R}^{F_l \times k \times F_{l-1}}$ and a bias vector $b^{(l)} \in \mathbb{R}^{F_l}$. The raw output activation vector for the $l$-th layer at time $t$, $E_t^{(l)} \in \mathbb{R}^{F_l}$, is calculated from a subsection of the normalized activation matrix from the previous layer, $\hat{E}^{(l-1)} \in \mathbb{R}^{F_{l-1} \times T_{l-1}}$:
$$E_t^{(l)} = f\left(W^{(l)} * \hat{E}^{(l-1)}_{t:t+k-1} + b^{(l)}\right),$$
where $f(\cdot)$ is a Rectified Linear Unit (ReLU) [nair2010rectified]. The pooling layer is followed by a normalization layer, which normalizes the activation vector at time $t$, $E_t^{(l)}$, using its highest value:
$$\hat{E}_t^{(l)} = \frac{E_t^{(l)}}{\max\left(E_t^{(l)}\right) + \epsilon},$$
where $\epsilon$ is a small number to ensure non-zero denominators, and $\hat{E}_t^{(l)}$ is the normalized output activation vector. In the decoder part, an upsampling layer that repeats each data point twice precedes each temporal convolutional and normalization layer. The output vector is calculated and normalized in the same manner as in the encoder part. The state estimation at frame $t$ is done by a time-distributed fully-connected layer with softmax to normalize the logits.
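As a rough illustration of the encoder computation described above, the following pure-Python sketch applies one temporal convolution with a ReLU and then normalizes each time step by its largest activation. The function names, the "valid" window covering time steps `t` to `t + k - 1`, and the default `eps=1e-5` are assumptions made for illustration; pooling, upsampling, and the softmax layer are omitted.

```python
def temporal_conv_relu(E_prev, W, b):
    """One temporal convolution followed by a ReLU.

    E_prev: list of T vectors (each of length F_in), already normalized.
    W:      nested list of shape F_out x k x F_in (filter weights).
    b:      list of F_out biases.
    Returns T - k + 1 activation vectors, each of length F_out.
    """
    k = len(W[0])
    f_in = len(E_prev[0])
    out = []
    for t in range(len(E_prev) - k + 1):
        vec = []
        for i in range(len(W)):
            s = b[i]
            for dt in range(k):           # slide over the temporal window
                for c in range(f_in):     # sum over input features
                    s += W[i][dt][c] * E_prev[t + dt][c]
            vec.append(max(0.0, s))       # ReLU
        out.append(vec)
    return out


def max_normalize(E, eps=1e-5):
    """Divide each time step's activation vector by its largest entry."""
    return [[v / (max(vec) + eps) for v in vec] for vec in E]


# Toy usage: one input feature, one filter of kernel size 2.
activations = temporal_conv_relu([[1.0], [2.0], [3.0]], [[[1.0], [1.0]]], [0.0])
normalized = max_normalize(activations)
```

Because the ReLU output is non-negative, the largest entry is never negative, so a positive `eps` is enough to keep the denominator away from zero.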
: The training of the CNN feature extractor starts with the VGG16 network initialized with ImageNet pre-trained weights. We fine-tune the weights by training with one fully-connected layer on top of the VGG16 model for state estimation. The feature vector. We use with , and for the JIGSAWS suturing dataset and for the RIOUS dataset. For training, we use the cross-entropy loss with the Adam optimization algorithm [kingma2014adam].
For our application of real-time state estimation, the model can only use the information from the current and preceding time steps; therefore, for the RIOUS dataset, we assume a causal setting and pad the temporal input with zeros on the left side before the convolutional layer and crop data points on the right side afterwards.
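The pad-left, crop-right idea can be sketched in a few lines of pure Python. The sketch assumes an odd kernel size `k`, a centered ("same") convolution, and a padding amount of `k // 2`; these choices and the function names are illustrative assumptions, not values taken from the model.

```python
def same_conv1d(x, w):
    """Centered 1-D cross-correlation with implicit zero padding (odd len(w))."""
    half = len(w) // 2
    out = []
    for t in range(len(x)):
        s = 0.0
        for j, wj in enumerate(w):
            idx = t + j - half
            if 0 <= idx < len(x):
                s += wj * x[idx]
        out.append(s)
    return out


def causal_conv1d(x, w):
    """Pad zeros on the left, convolve, then crop on the right.

    With p = len(w) // 2 zeros of padding, output[t] only depends on
    x[t - 2p .. t], i.e. on the current and preceding time steps.
    """
    p = len(w) // 2
    padded = [0.0] * p + list(x)
    y = same_conv1d(padded, w)
    return y[:len(x)]  # crop p data points on the right


# Toy usage: changing the last (future) sample leaves earlier outputs untouched.
a = causal_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0])
b = causal_conv1d([1.0, 2.0, 3.0, 99.0], [1.0, 1.0, 1.0])
```

Shifting the receptive field this way keeps the output the same length as the input while guaranteeing that no future frame influences the current estimate.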
2.2 Kinematics-based Methods
We incorporate both a forward LSTM and a TCN to better capture states of different durations: the TCN applies temporal convolution to learn local temporal dependencies, whereas the LSTM is able to capture longer-term progress in the data. The LSTM is not constrained to learn only from nearby data on the temporal axis. Rather, it maintains a memory cell and learns when to read, write, or reset the memory [gers2000recurrent]. It has been shown that LSTM-based approaches exceed the state-of-the-art performance in longer-duration action recognition [dipietro2016recognizing]. Although the bi-directional LSTM model yields a higher accuracy [dipietro2016recognizing], it is not applicable to the real-time state estimation task, where no future data is available; therefore, we use a forward LSTM with forget gates and peephole connections [gers2000recurrent].
: For the LSTM model, we perform a grid search over the initial learning rate (0.5 or 1.0), the number of hidden layers (1 or 2), the number of hidden units per layer (256, 512, 1024, or 2048), and the dropout probability (0 or 0.5). The optimized set of parameters is 1 hidden layer with 1024 hidden units and 0.5 dropout probability for JIGSAWS, and 512 hidden units for the RIOUS dataset. The optimized initial learning rate is 1.0. For the TCN model, we mostly follow the same protocol as the vision-based TCN model described earlier. We usewith . The feature vector for the kinematics data , where for the JIGSAWS suturing dataset and for the RIOUS dataset.
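The LSTM grid search above covers 2 × 2 × 4 × 2 = 32 configurations. A minimal sketch of enumerating that grid is given below; the dictionary keys, the helper names, and `toy_score` are hypothetical. In practice the scoring callable would train an LSTM with the given configuration and return its validation accuracy, which is not shown here.

```python
from itertools import product

# Search space as listed in the text; the key names are illustrative.
LSTM_GRID = {
    "learning_rate": [0.5, 1.0],
    "hidden_layers": [1, 2],
    "hidden_units": [256, 512, 1024, 2048],
    "dropout": [0.0, 0.5],
}


def iter_configs(grid):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, evaluate):
    """Return the configuration with the highest score from `evaluate`."""
    return max(iter_configs(grid), key=evaluate)


def toy_score(cfg):
    """Stand-in for 'train and validate'; peaks at an arbitrary configuration."""
    return (-abs(cfg["hidden_units"] - 1024)
            - abs(cfg["dropout"] - 0.5)
            - abs(cfg["learning_rate"] - 1.0)
            - cfg["hidden_layers"])


best = grid_search(LSTM_GRID, toy_score)
```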
2.3 Event-based Method
We performed a grid search over the parameters of each model and evaluated each model’s performance using the Area Under the Receiver Operating Characteristic Curve (ROC AUC) score [bradley1997use]. The evaluation process was iterated 200 times, with an early stopping criterion of score improvement under . At each iteration, we recorded the best-performing model with replacement. The top three models that were selected most frequently are included, and the final state estimation result is the mean of each model’s prediction. The three top-performing models for our RIOUS dataset are RF (=500, min_samples_split=2), SVM (penalty=, kernel=linear, =2, multi_class=crammer_singer), and RF (=400, min_samples_split=3).
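The selection-and-averaging step described above can be sketched as follows: count how often each candidate was the best model across iterations, keep the three most frequent, and average their per-state predictions. The model labels and helper names are hypothetical, and the training and ROC AUC scoring that produce the per-iteration winners are not shown.

```python
from collections import Counter


def top_models(winners, k=3):
    """winners: label of the best-scoring model at each iteration."""
    return [name for name, _ in Counter(winners).most_common(k)]


def mean_prediction(predictions):
    """Average several per-state probability vectors element-wise."""
    n = len(predictions)
    return [sum(p[j] for p in predictions) / n
            for j in range(len(predictions[0]))]


# Toy usage with made-up labels and two possible states.
winners = ["rf_a"] * 5 + ["svm"] * 3 + ["rf_b"] * 2 + ["other"]
selected = top_models(winners)
fused = mean_prediction([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
```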
2.4 Fusion of Multiple Models
The individual state estimation models have their respective strengths and weaknesses, since different states have inherent features that make them easier to recognize from one type of data than from the other(s). For instance, the ‘transferring needle from left to right’ state in the JIGSAWS suturing dataset can be distinctly characterized by the sequential opening and closing of the left and right needle drivers, which is captured by the kinematics data.
We therefore use a weighted voting method that incorporates the prediction vectors of all models. At time $t$, let $Y_t \in \mathbb{R}^{n \times m}$, where $n$ is the number of models and $m$ is the total number of possible states in a dataset. Row vector $Y_t^{i}$ is the output vector of the $i$-th model at time $t$ and $\sum_{j=1}^{m} Y_t^{ij} = 1$. The overall probability for the system to be in the $j$-th state at time $t$, according to the $n$ models, is then
$$P_t^{j} = \sum_{i=1}^{n} \alpha_{ij} Y_t^{ij},$$
where $\alpha_{ij}$ is the weighting factor for the $i$-th model predicting the $j$-th state.
$\alpha_{ij}$ is calculated from the diagnostic odds ratio (OR) derived from the $i$-th model’s accuracy in recognizing each state in the training data:
$$\mathrm{OR}_{ij} = \frac{TP_{ij} \cdot TN_{ij}}{FP_{ij} \cdot FN_{ij} + \delta},$$
where the $(i, j)$-th components of TP, TN, FP, FN are the number of true positives, true negatives, false positives, and false negatives of the $i$-th model on recognizing the $j$-th state, respectively. $\delta$ is a placeholder such that the denominator is not zero. $\alpha_{ij}$ is normalized proportionally to $\mathrm{OR}_{ij}$ such that $\sum_{i=1}^{n} \alpha_{ij} = 1$. The comprehensive estimate of state at time $t$ is then made by
$$S_t = \arg\max_{j} P_t^{j}.$$
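The weighted voting scheme can be illustrated with a small pure-Python sketch. It assumes that the placeholder is added to the product FP·FN in the denominator, that the weights are normalized across models for each state, and that a state with no usable odds ratio falls back to uniform weights; these details and the function names are assumptions made for illustration.

```python
def odds_ratio_weights(tp, tn, fp, fn, delta=1e-6):
    """Per-model, per-state weights from diagnostic odds ratios.

    tp, tn, fp, fn: n_models x n_states nested lists of training counts.
    Returns weights whose columns (one per state) sum to 1.
    """
    n, m = len(tp), len(tp[0])
    ratios = [[tp[i][j] * tn[i][j] / (fp[i][j] * fn[i][j] + delta)
               for j in range(m)] for i in range(n)]
    weights = [[0.0] * m for _ in range(n)]
    for j in range(m):
        total = sum(ratios[i][j] for i in range(n))
        for i in range(n):
            # Fall back to uniform weights if every ratio is zero.
            weights[i][j] = ratios[i][j] / total if total > 0 else 1.0 / n
    return weights


def fuse(Y_t, weights):
    """Return the index of the state with the highest weighted vote."""
    n, m = len(Y_t), len(Y_t[0])
    scores = [sum(weights[i][j] * Y_t[i][j] for i in range(n)) for j in range(m)]
    return max(range(m), key=lambda j: scores[j])


# Toy usage: model 0 is reliable on state 0, model 1 on state 1.
good, bad = 90, 10
alpha = odds_ratio_weights(
    tp=[[good, bad], [bad, good]], tn=[[good, bad], [bad, good]],
    fp=[[bad, good], [good, bad]], fn=[[bad, good], [good, bad]],
)
state = fuse([[0.4, 0.6], [0.9, 0.1]], alpha)
```

In the toy example each model's vote for a state counts almost entirely when that model was reliable on that state during training, and is nearly ignored otherwise.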
3 Experimental Evaluations
We used two datasets to evaluate our models: JIGSAWS and RIOUS datasets (Table I).
JIGSAWS: The JIGSAWS dataset consists of three types of finely-annotated RAS tasks captured by an endoscope [ahmidi2017dataset]. These tasks are performed in a benchtop setting. The dataset contains synchronized video and kinematics data. We used the suturing dataset of JIGSAWS, which has 39 trials recorded at 30 Hz, each lasting around 1.5 minutes and containing close to 20 action instances. There are 9 possible actions (Fig. 4a). The kinematics variables we used include the end-effector positions, velocities, and gripper angles of the patient-side manipulator (PSM). The raw kinematics data uses a rotation matrix to represent the end-effector’s orientation. To reduce data dimensionality, we converted the rotation matrix (9 variables) to Euler angles (3 variables).
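A conversion from a rotation matrix to Euler angles can be sketched as below. The Z-Y-X (yaw, pitch, roll) convention used here is an assumption, since the text does not state which convention was applied, and the sketch does not handle the gimbal-lock case where the pitch is ±90°.

```python
import math


def euler_zyx_to_matrix(yaw, pitch, roll):
    """Build R = Rz(yaw) * Ry(pitch) * Rx(roll) as a 3x3 nested list."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def matrix_to_euler_zyx(R):
    """Recover (yaw, pitch, roll) from R, assuming |pitch| < 90 degrees."""
    pitch = math.asin(max(-1.0, min(1.0, -R[2][0])))  # clamp for safety
    roll = math.atan2(R[2][1], R[2][2])
    yaw = math.atan2(R[1][0], R[0][0])
    return yaw, pitch, roll


# Round trip: 9 matrix entries are reduced to 3 angles and back.
angles = matrix_to_euler_zyx(euler_zyx_to_matrix(0.3, -0.5, 1.2))
```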
RIOUS: To explore the full potential of our unified model, we collected a robotic intra-operative ultrasound (RIOUS) dataset on a da Vinci® Xi surgical system at Intuitive Surgical Inc. (Sunnyvale, CA), in which we performed ultrasound scanning on both phantom and porcine kidneys. In RAS, using a drop-in ultrasound probe to scan the organs is a common technique practiced by surgeons to localize underlying anatomical structures, including tumors and vasculature. Real-time state estimation of this task allows us to develop smart-assist technologies for surgeons and enables supervised autonomous techniques to perform such tasks.
|JIGSAWS Suturing Dataset|
|Action ID|Description|Duration (s)|
|G1|Reaching for the needle with right hand|2.2|
|G2|Positioning the tip of the needle|3.4|
|G3|Pushing needle through the tissue|9.0|
|G4|Transferring needle from left to right|4.5|
|G5|Moving to center with needle in grip|3.0|
|G6|Pulling suture with left hand|4.8|
|G8|Using right hand to help tighten suture|3.1|
|G9|Dropping suture and moving to end points|7.3|
|RIOUS Dataset|
|State ID|Description|Duration (s)|
|S1|Probe released, out of endoscopic view|17.3|
|S2|Probe released, in endoscopic view|10.6|
|S3|Reaching for probe|4.1|
|S5|Lifting probe up|2.2|
|S6|Carrying probe to tissue surface|2.3|
The RIOUS dataset contains 30 trials performed by 5 users who have no RAS experience but are familiar with the da Vinci® surgical system. Each trial is around 5 minutes long and contains roughly 80 action instances. 26 trials were performed on a phantom kidney in a dry-lab setting and 4 were performed on a porcine kidney in an operating room setting. The data is annotated with eight states (Fig. 4b). Two of the four arms were used, one holding an endoscope and the other holding a pair of Prograsp™ forceps. The ultrasound machine used is the bk5000 with a robotic drop-in probe from BK Medical Holding Company, Inc. Both video and kinematics entries were synchronized and down-sampled to 30 Hz. The kinematics variables we used include the instrument’s end-effector positions, velocities, and gripper angles, as well as the endoscope positions. We used the same pre-processing method as for the suturing kinematics data. We also collected data on six system events from the da Vinci® surgical system, including camera follow, instrument follow, surgeon head in/out of the console, master clutch for the hand controller, and two ultrasound probe events. The ultrasound probe events detect whether the probe is being held by the forceps and whether the probe is in contact with the tissue, respectively. All events are represented as binary on/off time series.
We use two evaluation metrics for our state estimation model: the frame-wise state estimation accuracy and the edit distance. The frame-wise accuracy is the percentage of correctly recognized frames, which is measured without taking temporal consistency into account. This is because the model has knowledge of only the current and preceding data entries in the real-time state estimation setting. The edit distance, or Levenshtein distance [levenshtein1966binary], measures the number of operations (insertion, deletion, and substitution) needed to convert the inferred segment-level sequence of states to the ground truth. We normalize the edit distance following [lea2016learning, dipietro2016recognizing]. We evaluate both datasets using Leave One User Out as described in [gao2014jhu]. For the ultrasound imaging task, we assume a causal setting, in which the models only have knowledge of the current and preceding time steps. This mimics the real-time state estimation application of our model, in which the robot cannot foresee the future. For the JIGSAWS suturing task, we assume a non-causal setting for more direct comparisons with the reported accuracy of the state-of-the-art methods. The edit distance is therefore only used for JIGSAWS.
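Both metrics can be sketched in a few lines of pure Python. The normalization used here, `(1 - distance / max(segment counts)) * 100`, is an assumption about the cited normalization, and the function names are illustrative.

```python
def frame_accuracy(pred, truth):
    """Fraction of frames whose predicted state matches the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def to_segments(frames):
    """Collapse consecutive repeated labels into a segment-level sequence."""
    segments = []
    for state in frames:
        if not segments or segments[-1] != state:
            segments.append(state)
    return segments


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def edit_score(pred, truth):
    """Segment-level edit distance mapped to a 0-100 score (higher is better)."""
    p, g = to_segments(pred), to_segments(truth)
    return (1 - levenshtein(p, g) / max(len(p), len(g))) * 100


# Toy usage: a prediction that flickers between states 0 and 1.
pred = [0, 1, 0, 1, 2]
truth = [0, 0, 1, 1, 2]
acc = frame_accuracy(pred, truth)
score = edit_score(pred, truth)
```

The toy example shows why the two metrics complement each other: a flickering prediction can keep a moderate frame-wise accuracy while its segment-level score is penalized for the spurious transitions.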
4 Results and Discussions
Table II compares the performance of the state-of-the-art surgical state estimation models with an ablated version of our model (Fusion-KV), consisting of the kinematics- and vision-based models as well as the fusion model. Table III compares the performance of our full fusion model (Fusion-KVE) and Fusion-KV with their single-source components using the RIOUS dataset. In Fig. 5, we show an example of state estimation results of our fusion models and their components for a string of ultrasound imaging sequences. Fig. 6 shows the weight matrix distributions used in our fusion models. A large weighting factor indicates that the corresponding model performs well in estimating the corresponding state during training.
In Table II, Fusion-KV achieves a frame-wise accuracy of 86.3% and an edit distance score of 87.2 for the JIGSAWS suturing dataset, both improving on the state-of-the-art surgical state estimation models. For the RIOUS dataset (Table III), Fusion-KVE achieves a frame-wise accuracy of 89.4%, an improvement of 11% compared to the best-performing single-input model. Fusion-KV also achieves a higher accuracy than the single-input models.
A closer look at the state sequences inferred by the various models and at their weighting factors, shown in Fig. 5 and Fig. 6, reveals the key aspects of improvement of our method. Although kinematics-based state estimation models generally have a higher frame-wise accuracy than vision-based models (Tables II and III), which are very sensitive to camera movements, each model has its respective strengths and weaknesses. For instance, at around 200s of the illustrated sequence in Fig. 5, both kinematics-based models show a consecutive block of errors where the models fail to recognize the ‘probe released and in endoscopic view’ state. Considering the relatively random robotic motions in this state, this is to be expected. The low weighting factors of both kinematics-based models in estimating this state, as shown in Fig. 6, also support this observation. On the other hand, the vision-based model correctly estimates this state, since the state is more visually distinguishable. When incorporating both vision- and kinematics-based methods, our fusion models perform weighted voting based on the training accuracy of each model. In this example, the weighting factor for the vision-based model is higher than those of the kinematics-based models; therefore, our fusion models are able to correctly estimate the current state of the surgical task. In other states where the robotic motions are more consistent but the vision data is less distinguishable, the kinematics-based models have higher weighting factors.
|Table II|
|Method|Input data type|Accuracy (%)|Edit Dist.|
|Table III|
|Method|Input data type|Accuracy (%)|
The incorporation of system events further improves the accuracy of our fusion model. Comparing Fusion-KV and Fusion-KVE, we observe fewer errors; many are corrected where the weighting factor for the event-based model is high, such as in states with shorter duration or frequent camera movements. At around 250s to 300s of the presented sequence, frequent state transitions can be observed. Fusion-KVE is able to estimate the states more accurately and shows fewer fluctuations compared to the other models. The event-based model is less sensitive to environmental noise, as the events are collected directly from the surgical system. Additionally, when state transitions are frequent, models that rely solely on the temporal dependencies of the input data, such as TCN and LSTM, are less accurate. As the event-based model does not take temporal correlations into consideration, incorporating this data source reduces the fluctuation in state estimation results, especially when state transitions are frequent or the duration of each state is short.
The average duration of each state in both the JIGSAWS suturing dataset and the RIOUS dataset varies significantly, as shown in Table I. To better capture states of different durations, we implemented two kinematics-based state estimation models: TCN and forward LSTM. Fig. 6 supports our decision. When the average duration of a state is long, the LSTM-based model has a higher weighting factor. Similarly, the TCN-based model has a higher weighting factor for shorter-duration states.
As mentioned before, the RIOUS dataset is more complex than JIGSAWS and resembles real-world surgical tasks more closely. It is therefore harder for a single-input state estimation model to capture well. Furthermore, our application of real-time state estimation limits the amount of data available to the model. Although running multiple state estimation models at the same time inevitably requires more computing power, our fusion state estimation model is robust on complex and realistic surgical tasks such as ultrasound imaging and achieves a superior frame-wise accuracy.
5 Conclusions and Future Work
In this paper, we introduce a unified approach to fine-grained state estimation for various surgical tasks using multiple sources of input data from the da Vinci® Xi surgical system. Our models (including Fusion-KV and Fusion-KVE) improve on the state-of-the-art performance for both the JIGSAWS suturing dataset and the RIOUS dataset. Fusion-KVE, which takes advantage of the system events (absent in the JIGSAWS dataset), further improves on Fusion-KV. Our RIOUS dataset is more complex than JIGSAWS and resembles real-world surgical tasks, with dry-lab, cadaveric, and in-vivo experiments, as well as camera movements and annotations of both action and non-action states. Our unified model proves its robustness on complex and realistic surgical tasks by achieving a superior frame-wise accuracy even in a causal setting, where the model has knowledge of only the current and preceding time steps.
We show how different types of input data (vision, kinematics, and system events) have their respective strengths and weaknesses in the recognition of fine-grained states. The fine-grained state estimation of surgical tasks is challenging due to the short and varying durations of the states and the frequent state transitions. We show that by incorporating multiple types of input data, we are able to extract richer information during training and more accurately estimate the states in a surgical setting. A possible next step of our work would be to use the weighting factor matrix for boosting methods to train the unified state estimation model more efficiently. In the future, we also plan to apply this state estimation framework to applications such as smart-assist technologies and supervised autonomy for surgical subtasks.
This work was funded by Intuitive Surgical, Inc. We would like to thank Dr. Azad Shademan and Dr. Pourya Shirazian for their support of this research.