Predicting the Timing of Camera Movements From the Kinematics of Instruments in Robotic-Assisted Surgery Using Artificial Neural Networks

09/23/2021 ∙ by Hanna Kossowsky, et al. ∙ Ben-Gurion University of the Negev

Robotic-assisted surgeries benefit both surgeons and patients; however, surgeons frequently need to adjust the endoscopic camera to achieve good viewpoints. Simultaneously controlling the camera and the surgical instruments is impossible, and consequently, these camera adjustments repeatedly interrupt the surgery. Autonomous camera control could help overcome this challenge, but most existing systems are reactive, e.g., by having the camera follow the surgical instruments. We propose a predictive approach for anticipating when camera movements will occur using artificial neural networks. We used the kinematic data of the surgical instruments, which were recorded during robotic-assisted surgical training on porcine models. We split the data into segments, and labeled each either as a segment that immediately precedes a camera movement, or one that does not. Due to the large class imbalance, we trained an ensemble of networks, each on a balanced subset of the training data. We found that the instruments' kinematic data can be used to predict when camera movements will occur, and evaluated the performance on different segment durations and ensemble sizes. We also studied how far in advance an upcoming camera movement can be predicted, and found that predicting a camera movement 0.25, 0.5, and 1 second before it occurred achieved 98%, 94%, and 84% of the accuracy of predicting an imminent camera movement, respectively. This indicates that camera movement events can be predicted early enough to leave time for computing and executing an autonomous camera movement, and suggests that an autonomous camera controller for RAMIS may one day be feasible.


I Introduction

Robotic-assisted minimally invasive surgeries (RAMIS) have gained popularity over the past decades [25, 32, 10, 20]. During RAMIS, e.g., with the Da Vinci surgical system (Intuitive Surgical Inc., Sunnyvale, California), the surgeon uses Master Tool Manipulators to control a camera arm and Patient Side Manipulators (PSMs), on which surgical instruments are mounted; the surgeon uses a foot pedal to switch between their control [21]. RAMIS offer many advantages to both patients and surgeons compared to open or laparoscopic surgeries. Patients benefit from less postoperative pain, shorter hospital stays, reduced complication rates, and less tissue damage and blood loss [25, 42, 38]. Surgeons benefit from reduced tremors compared to open surgeries [32], as well as more degrees of freedom, better visualization, and improved ergonomics compared to laparoscopic surgeries [31, 32].

Despite the many benefits offered by RAMIS, they are not without disadvantages, such as increased operative time [25, 1] and the need for special training [25]. Furthermore, unlike in open surgeries, in RAMIS the visual feedback is limited; only a specific region of the surgical scene is visible to the surgeon, who controls an endoscopic camera arm to adjust the viewpoint [39]. Studies have shown that the camera is moved frequently, especially by expert surgeons [28], and the value of good camera control has been demonstrated in several works [36, 7, 49, 28]. Poor visualization and sub-optimal viewpoints, on the other hand, can have detrimental effects; they can lower surgeons' awareness of the surgical environment and lead to an increase in surgical errors, causing injuries and compromising patient safety [16, 18, 39, 35]. Simultaneously controlling the camera and the surgical instruments is not possible in RAMIS [39, 9], requiring surgeons to release control of the instruments every time they wish to re-position the camera [14, 9, 39]. Such manual re-positioning can be time consuming and challenging [25, 14], and can lead to disruptions in the flow of the surgery [15] and to distractions [13].

A potential solution to this challenge is the automation of the camera movements [51, 8]. One approach for such automation is tracking the movements of the surgical instruments and moving the camera such that the instruments are always in the field of view [35, 15, 13, 51, 54, 37, 52], or in the surgeons' preferred area [33]. While this method has been shown to be both feasible and beneficial, it assumes the instruments to be the only deciding factor in the surgical scene, while in reality, additional features likely affect the optimal viewpoint [43]. Other approaches are based on tracking the surgeon's eyes [3], head [9], or head and upper-body motions [19] to control the camera. These methods necessitate adding tracking systems to the autonomous camera controller. An additional commonality among many of these methods is that the camera constantly moves to follow the instruments, eyes, or head, whereas in RAMIS, the camera is often stationary and moved only at specific times. Additionally, these tracking methods focus on reactive, rather than predictive, camera automation.

Several works have investigated the potential for predictive camera control. One such study positioned the camera in accordance with anticipated instrument trajectories that were predicted using Markov models [50]. Furthermore, several works have designed algorithms for predicting camera viewpoints and movements using data collected in dry-lab tasks [2, 29, 43]. For example, in [2], participants performed a pick-and-place task while controlling simulated camera movements with their heads; inverse reinforcement learning was then used to predict the camera movements. However, the dry-lab tasks in these studies [2, 29, 43] are more structured and less complex than surgical procedures. To the best of our knowledge, no works to date have focused on the prediction of camera movements in real and complex surgical environments.

Because it is currently unknown what exactly drives surgeons' camera movements and their timing, an alternative method that can be used to predict camera movements in RAMIS is artificial neural networks. Artificial neural networks have been used previously in various works in RAMIS; for example, in the automatic segmentation of surgical procedures into gestures [27, 22, 41, 53], the classification of surgeons' skill levels [17], and the estimation of contact forces [4]. However, to the best of our knowledge, predicting camera movements from RAMIS data using artificial neural networks has not been demonstrated to date.

The kinematics of the instruments reflect the surgeons' hand movements. We posit that this information can be indicative of upcoming camera movements, as head, eye, and hand movements have been shown to be related [5]. Additionally, reactive systems in which the camera follows the instruments have been shown to improve participants' performance in dry-lab tasks, further supporting the hypothesis that the instrument and camera movements may be related [35, 15, 13, 51, 54]. If this is the case, it will enable the use of simpler and smaller artificial neural networks than those that would be required if the video stream were used as input.

The problem of predicting the camera movements in RAMIS can be divided into two sub-problems: (1) predicting the timing and (2) predicting the trajectories of the camera movements. In this study, we use artificial neural networks to predict the timing of the camera movements from the kinematics of the surgical instruments, recorded during surgical training using the Da Vinci surgical system. We approached this as a classification problem: we split the kinematic data of the instruments into segments, labeled each segment either as one that immediately precedes a camera movement or as one that does not, and trained neural networks to classify segments accordingly.

A critical challenge when predicting the timing of the camera movements is that the data is highly imbalanced: there are many more kinematic segments that do not immediately precede a camera movement than segments that do. Imbalanced data impairs the ability of neural networks to learn, as they tend to misclassify minority class samples as belonging to the majority class, due to the latter's increased prior probability [30]. One solution is under-sampling the majority class, such that the network is trained on a balanced subset of the data. However, this solution discards much of the available data, and the subset of samples drawn from the majority class may not be representative. Another solution, used in sentiment classification [55] and credit scoring [47], is to train an ensemble of networks, each on a balanced subset of the data, and to then combine the predictions of all the networks. Each of the networks receives all the minority class samples and a different subset of majority class samples, equal in size to the minority class.

In this study, we use an ensemble of neural networks to analyze kinematic data that was recorded during surgical training on porcine models with the Da Vinci surgical system. Our results indicate that the timing of the camera movements can be predicted from the kinematic data. We examine the performance of the ensemble for several ensemble sizes and segment durations. Furthermore, in addition to the prediction of imminent camera movements, we examine advance prediction of upcoming camera movements, which could open the possibility of online prediction of camera movement timing during RAMIS. Finally, we discuss directions for future improvements of our ensemble.

The contributions of our work are as follows:

  • This work serves as a proof of concept for the possibility of predicting the timing of camera movements in RAMIS, and shows that this is possible using solely the kinematics of the surgical instruments, with no video input.

  • We examine several kinematic segment durations and demonstrate which contains the most relevant information for the prediction of camera movements in RAMIS.

  • We show that, in addition to predicting imminent camera movements, advance prediction is possible, indicating that online prediction of the timing of camera movements in RAMIS may be feasible.

  • We show that predicting camera movement events is possible in unstructured and complex data recorded during surgical training using the Da Vinci surgical system.

II Methods

II-A Data

The data used in this work are recordings of surgical training on a porcine model using the Da Vinci surgical system, provided to us through a collaboration agreement with Intuitive Surgical Inc. The protocol, titled "Computer Enhanced Minimally Invasive Surgery - Surgeon and Staff Training", was approved on July 31, 2019. The dataset consists of recorded procedures performed by seven surgeons (three experts and four novices), each performing two tasks, Uterine Horn Dissection and Simulated Cuff Closure. That is, a total of 14 recorded surgical tasks were used in this work.

We used the kinematic data of the three surgical instruments that were used in these tasks. The instruments were mounted on the three PSMs, which were sampled at 50 Hz. As in [11, 27], the kinematic data we used were the endpoint position in the three Cartesian axes, the three linear velocities, and the gripper angle of each PSM. There were therefore a total of 21 features per sample. The velocity was numerically calculated by differentiating the position with respect to time. We then filtered the position and velocity data using second-order zero-phase Butterworth filters, with cutoff frequencies of 5 Hz and 8 Hz, respectively. Finally, we normalized each of the features by subtracting its mean and dividing by its standard deviation (z-score normalization).
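
As a minimal sketch of this preprocessing pipeline (assuming NumPy/SciPy, a (T, 9) array of PSM endpoint positions, a (T, 3) array of gripper angles, and the 50 Hz sampling rate; function and variable names are illustrative and not the authors' code):

    import numpy as np
    from scipy.signal import butter, filtfilt

    FS = 50.0  # PSM sampling rate [Hz]

    def zero_phase_lowpass(x, cutoff_hz, fs=FS, order=2):
        # Second-order zero-phase Butterworth low-pass filter, applied along the time axis.
        b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
        return filtfilt(b, a, x, axis=0)

    def preprocess(positions, gripper_angles, fs=FS):
        # positions: (T, 9) endpoint positions of the three PSMs in x, y, z.
        # gripper_angles: (T, 3) gripper angle of each PSM.
        velocities = np.gradient(positions, 1.0 / fs, axis=0)        # differentiate w.r.t. time
        positions = zero_phase_lowpass(positions, cutoff_hz=5.0)     # 5 Hz cutoff for position
        velocities = zero_phase_lowpass(velocities, cutoff_hz=8.0)   # 8 Hz cutoff for velocity
        features = np.hstack([positions, velocities, gripper_angles])  # 21 features per sample
        return (features - features.mean(axis=0)) / features.std(axis=0)  # z-score normalization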

II-A1 Imminent Camera Movement Prediction

To label the data, we used the camera arm's endpoint Cartesian position to identify the timestamps in which the camera moved, and in which it was stationary. Fig. 1(a) shows an example of the camera's endpoint position in one of the axes, where the timestamps identified as those in which the camera moved are highlighted in gray. We then split the kinematic data of the PSMs into segments, and labeled each segment either as immediately preceding a camera movement (designated a before camera movement segment) or not immediately preceding a camera movement (designated a not before camera movement segment). Hence, the inputs to the neural networks were kinematic data, comprised of 21 features, each with a length of N samples. Note that the kinematic data from timestamps during which there were camera movements were not included in the segments. Fig. 1(b) shows an example of the position of one of the PSMs in one of the axes (i.e., one of the 21 kinematic features), with a loose illustration of the splitting of the kinematic data into segments. The overlap between every two consecutive segments was N-1 samples.

We aimed to find the segment duration that would lead to the highest success in the prediction of the timing of camera movements, and therefore created several versions of the segmented kinematic data, each with a different segment length, N = 25, 50, 100 and 200 samples, corresponding to segment durations of 0.5, 1, 2 and 4 seconds. Fig. 1(c) illustrates a segment (network input), comprised of 21 features, each of length N.
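
A sketch of this segmentation and labeling, assuming a boolean array marking the timestamps identified as camera movements and the (T, 21) feature matrix from the preprocessing sketch above (names are illustrative):

    import numpy as np

    SEGMENT_LENGTHS = {0.5: 25, 1.0: 50, 2.0: 100, 4.0: 200}  # duration [s] -> N samples at 50 Hz

    def make_segments(features, camera_moving, n=50):
        # Slide a window of n samples with stride 1 (overlap of n-1 between consecutive windows).
        # Label 1 ("before camera movement") if the sample immediately following the window is a
        # camera-movement timestamp, else 0. Windows containing camera-movement samples are
        # skipped, since data recorded during camera movements were not included in the segments.
        X, y = [], []
        for end in range(n, features.shape[0]):
            window = slice(end - n, end)
            if camera_moving[window].any():
                continue
            X.append(features[window])
            y.append(int(camera_moving[end]))
        return np.stack(X), np.array(y)

    # e.g., n = SEGMENT_LENGTHS[1.0] for the 1 s segment duration.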

Fig. 1: Data Preparation. (a) The position of the camera in one of the axes. The gray shaded areas indicate the timestamps in which camera movements were detected. (b) The position of one of the PSMs in one of the axes relative to the camera position shown in (a). Note that the instrument position is recorded relative to the camera, leading to the different position values immediately before and after a camera movement. The splitting of the kinematic data into segments is illustrated, where segments that immediately precede a camera movement are indicated in red, and those that do not are black. Data from timestamps during which there was a camera movement were not included in the segments. (c) An illustration of a sample (network input). The colored rectangles represent the 21 kinematic features, seven for each of the three PSMs. The length of each feature is N samples. p_{i,j} represents the position of the ith PSM in the j axis, where i = 1, 2, 3 and j = x, y, z. Similarly, v_{i,j} represents the velocity, and θ_i the gripper angle.

We created test, validation and train sets for each of the four segment durations. The test sets were created by randomly selecting 15% of the before camera movement segments, and an equivalent number of not before camera movement segments, thus achieving a balanced test set of 130 segments in total. This process was repeated using the remaining data to create validation sets, containing 112 segments. The remaining, unbalanced data were our potential train sets. All the before camera movement segments in the train set were used to train each of the networks in the ensemble, while a different subset of not before camera movement segments was used for each network (Fig. 2(a)). This allowed us to use a larger portion of the data, while training each of the networks on a balanced dataset of 534 segments.
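
A sketch of drawing one balanced training subset per network in the ensemble (Fig. 2(a)), assuming labeled segments X_train and y_train produced by the segmentation step; the function and variable names are illustrative rather than the authors' implementation:

    import numpy as np

    def balanced_subsets(X_train, y_train, n_networks, seed=0):
        # Every network receives all minority-class ("before camera movement") segments and a
        # different random, equally sized subset of majority-class segments.
        rng = np.random.default_rng(seed)
        pos = np.flatnonzero(y_train == 1)   # before-camera-movement segments
        neg = np.flatnonzero(y_train == 0)   # not-before-camera-movement segments
        for _ in range(n_networks):
            neg_subset = rng.choice(neg, size=len(pos), replace=False)
            order = rng.permutation(np.concatenate([pos, neg_subset]))
            yield X_train[order], y_train[order]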

As a subset of the not before camera movement segments was used to train each of the networks in the ensemble, the segments were not fed into the networks in a time-consecutive order. Furthermore, to prevent overfitting and biases, random shuffling was used, such that the order of the segments was different in each training epoch. Hence, the networks were trained to receive a kinematic segment and recognize whether or not it would be immediately followed by a camera movement. Such a trained network could then potentially be used in RAMIS; the network could continuously receive data segments and indicate those that should be followed by a camera movement.

II-A2 Advance Camera Movement Prediction

After finding the segment duration that yielded the best results, we examined the possibility of predicting that a camera movement would occur in a certain amount of time, rather than immediately following the kinematic segment. This is important for opening the possibility of online execution of an automated camera movement following the prediction that a camera movement should occur. We therefore examined the performance of the ensemble when trained to predict that a camera movement would occur in 0.25, 0.5 and 1 second, and compared these results to the ability of the ensemble to predict an imminent camera movement.
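
Under the same assumptions as the segmentation sketch above, advance prediction only changes where the label is read: the label index is shifted forward by the desired lead time (roughly 13, 25, and 50 samples at 50 Hz for 0.25, 0.5, and 1 s). The variant below is illustrative, not the authors' exact implementation:

    import numpy as np

    def make_segments_advance(features, camera_moving, n=50, lead_s=0.25, fs=50):
        # Label 1 if a camera movement occurs lead_s seconds after the end of the segment,
        # rather than immediately after it.
        lead = int(round(lead_s * fs))
        X, y = [], []
        for end in range(n, features.shape[0] - lead):
            window = slice(end - n, end)
            if camera_moving[window].any():
                continue
            X.append(features[window])
            y.append(int(camera_moving[end + lead]))
        return np.stack(X), np.array(y)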

Fig. 2: Ensemble of networks. (a) Illustration of the training process. Each of the networks in the ensemble received all the before camera movement segments in the training set and an equivalent number of not before camera movement segments from the training set. (b) Illustration of the prediction stage. Each of the networks in the ensemble received the unknown segment and outputted its prediction (0 for a not before camera movement segment or 1 for a before camera movement segment). The prediction of the ensemble was determined using a majority vote of the predictions of all of the networks in the ensemble.

II-B Networks

II-B1 Network Evaluation

To evaluate the ensembles’ performances, we used three metrics defined using the confusion matrix (Table I): accuracy, True Positive Rate (TPR), and True Negative Rate (TNR).

True Class \ Predicted Class    Not Before Camera Movement    Before Camera Movement
Not Before Camera Movement      TN                            FP
Before Camera Movement          FN                            TP
TABLE I: Confusion Matrix

The accuracy is defined as the number of correctly classified segments divided by the total number of classified segments:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

The additional two metrics reflect the ability of the ensemble to detect a before camera movement segment when there is one (TPR), or a not before camera movement segment when there is no imminent camera movement (TNR):

TPR = TP / (TP + FN)    (2)
TNR = TN / (TN + FP)    (3)
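
These three metrics can be computed directly from binary label arrays; a small NumPy sketch mirroring Eqs. (1)-(3), with illustrative names:

    import numpy as np

    def evaluate(y_true, y_pred):
        # 1 = before camera movement, 0 = not before camera movement.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        return {"accuracy": (tp + tn) / (tp + tn + fp + fn),  # Eq. (1)
                "TPR": tp / (tp + fn),                        # Eq. (2)
                "TNR": tn / (tn + fp)}                        # Eq. (3)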

II-B2 Ensemble Size and Segment Duration

Each of the networks in the ensemble was an LSTM network [24], as this architecture is able to learn sequential data due to its memory unit. Based on [27] and several pilots, we began with a two-LSTM-layer network. To choose the ensemble size, we examined the performance of nine network architectures: each LSTM layer could be comprised of either 100, 300 or 500 neurons. To stabilize the learning process and reduce over-fitting, we added a dropout layer with a probability of 0.2 after each LSTM layer, and a batch normalization layer [26] between the two LSTM-dropout blocks. We used the Adam optimizer with a learning rate of 0.0001, and decayed the learning rate with every epoch according to:

lr_n = k · lr_{n-1}    (4)

where lr_n is the learning rate in epoch n, and k is the decay rate.
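
The decay of Eq. (4) can be implemented, for example, as a per-epoch learning-rate schedule; the sketch below assumes TensorFlow/Keras (the framework is not stated in the paper) and an illustrative decay rate k:

    import tensorflow as tf

    def exponential_decay(k, lr0=1e-4):
        # Eq. (4): the learning rate in epoch n is k times that of epoch n-1, i.e. lr0 * k**n.
        return tf.keras.callbacks.LearningRateScheduler(lambda epoch, lr: lr0 * (k ** epoch))

    # Passed to model.fit(..., callbacks=[exponential_decay(k=0.99)]) during training.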

For small ensemble sizes, each additional network has a larger effect on the performance of the ensemble, and this effect decreases as the ensemble size increases. We tested several ensemble sizes, and the number of composing networks was selected such that a stable performance of the ensemble was achieved [47]. To combine the outputs of the networks in the ensemble, we used majority voting [23]: each of the networks predicted the class of the segment, and the output of the ensemble was the class predicted by the majority of the networks (Fig. 2(b)).
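
A sketch of the majority vote over the trained networks (Fig. 2(b)), assuming each model outputs a sigmoid probability for the before camera movement class (illustrative, not the authors' code):

    import numpy as np

    def ensemble_predict(models, X, threshold=0.5):
        # Each network votes 0 or 1 for every segment; the ensemble output is the class
        # predicted by the majority of the networks (ties cannot occur with an odd ensemble size).
        votes = np.stack([(m.predict(X).ravel() > threshold).astype(int) for m in models])
        return (votes.mean(axis=0) > 0.5).astype(int)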

After choosing the ensemble size, we then proceeded to assess which of the segment durations would lead to the best performance of the ensemble. We continued with the previously described nine networks for this stage. We examined the accuracy, TPR and TNR of the ensembles for the four different segment durations, and selected the duration that produced the best results when predicting when camera movements would occur.

II-B3 Hyperparameter Tuning

After choosing the ensemble size and segment duration, we tuned the hyperparameters of the networks in the ensemble. We tested the use of one or two LSTM layers, the number of neurons in each layer (100, 300, 500, 700, 900 and 1100), dropout after each LSTM layer (0.1, 0.2 and 0.3), recurrent dropout in each LSTM layer (0.0 and 0.2), the number of batch normalization layers (zero, one and two), learning rate (0.001 and 0.0001), learning rate decay rate (0.90, 0.99 and 1), batch size (32, 64, 128 and 256), and L2 regularization (0.1, 0.01, 0.001 and 0.0001).

The weights of the LSTM layers were initialized by drawing their values from a normal distribution scaled according to the number of neurons in the layer. We monitored the validation set loss and used early stopping to end the training if the loss did not decrease for three consecutive epochs. We additionally evaluated the performance of the ensemble when using L1 regularization and when normalizing the data according to its maximal value, but both of these options yielded poorer results and we do not report them here. We also assessed the performance when averaging the outputs of the networks in the ensemble instead of majority voting, but this too led to poorer performance.
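
A sketch of the per-network training loop with this early-stopping rule, again assuming Keras and reusing the exponential_decay callback sketched earlier; build_model stands for any of the candidate architectures described above, and the batch size and epoch count are illustrative:

    from tensorflow.keras.callbacks import EarlyStopping

    def train_ensemble(build_model, subsets, X_val, y_val, decay_rate=0.99, epochs=100):
        # Train one network per balanced subset; stop a run when the validation loss
        # has not decreased for three consecutive epochs.
        models = []
        for X_sub, y_sub in subsets:
            model = build_model()
            model.fit(X_sub, y_sub,
                      validation_data=(X_val, y_val),
                      epochs=epochs, batch_size=128,
                      callbacks=[EarlyStopping(monitor="val_loss", patience=3),
                                 exponential_decay(k=decay_rate)],
                      verbose=0)
            models.append(model)
        return models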

III Results

We chose the ensemble size, segment duration and network hyperparameters by training the described networks on the train set and testing them on the validation set. We ran each of the networks 10 times with different random seeds to ensure that the results did not vary greatly between the different seeds. After arriving at our final ensemble architecture, we tested its performance on the yet untouched test set, and present these results at the end of this section.

III-A Ensemble Size

We evaluated nine ensemble architectures to choose the ensemble size. Fig. 3 shows the accuracy of one of the ensembles as a function of the ensemble size, for each of the four segment durations. For smaller ensemble sizes, every added network affected the accuracy of the ensemble. As the ensemble size increased, this effect became smaller, such that for ensemble sizes of 15 and higher, the accuracy was relatively stable. This was observed for all nine ensembles, and similar stabilization was observed in the TPR and TNR values as well. We therefore selected an ensemble size of 15 networks.

Fig. 3: Accuracy as a function of ensemble size. This graph shows the accuracy of the ensemble as a function of the number of networks in the ensemble. Each color and symbol represents the performance on one of the four segment durations: 0.5 s – red stars; 1 s – blue squares; 2 s – green circles; and 4 s – yellow triangles. The dashed gray line marks the ensemble size of 15, beyond which the addition of further networks had a relatively small effect on the performance.

III-B Segment Duration

After we determined the ensemble size, we focused our attention on the selection of the segment duration. Fig. 4 shows the average performance of the 10 evaluations of the nine ensembles, each of which was comprised of 15 networks, for each segment duration. All the standard deviations of the 10 repetitions were smaller than 0.1. As shown in Fig. 4(a), the accuracies achieved for the segment durations of 0.5 s and 1 s were higher than those for 2 s and 4 s. When comparing the 0.5 s and 1 s segments, we saw that eight of the nine ensembles scored higher for the 1 s segments. Next, Fig. 4(b) revealed that the TPR values of the 0.5 s segments were higher than those of the 1 s segments. Similar to the accuracy, the TPR values for the 0.5 s and 1 s segments were higher than those for 2 s and 4 s. When examining Fig. 4(c), we saw that the TNR values of the 1 s segments were higher than those of the 0.5 s segments. We also noted that there were cases in which the 2 s segments had the highest TNR; however, as this duration was inferior in both the accuracy and the TPR, it was not chosen.

To choose between the 0.5 s and the 1 s segment durations, we took all three metrics into account. The first consideration was that in the majority of the cases, the 1 s segments had higher accuracy values. The second consideration was that we aimed to design an ensemble that would be able to detect both when there would be, and when there would not be, a camera movement. Hence, the TPR and TNR values were of equal importance. By comparing Fig. 4(b) and Fig. 4(c), we saw that the TPR and TNR values for the 1 s segments were similar, whereas the 0.5 s segments had a high TPR but a low TNR. We therefore selected the 1 s segments. Had we found either the shortest or the longest segment duration to be the best of the four, we would have expanded our segment duration options, but this was not the case.

Fig. 4: Model performance as a function of the segment duration. The abscissa is the segment duration, and the ordinate is the average (a) accuracy, (b) True Positive Rate, and (c) True Negative Rate. Each color and symbol represents one of the nine different ensembles, which differ in the sizes of the two LSTM layers of the networks of which the ensemble is composed (as shown in the legend in (a)).

III-C Hyperparameter Tuning

We examined the accuracy, TPR and TNR of each of the network architectures and arrived at the final hyperparameters. Our final network architecture was two LSTM layers, the first with 1100 neurons, and the second with 300. Each of these layers had a recurrent dropout of 0.0, and was followed by a dropout layer with a probability of 0.2. One batch normalization layer was placed between the two LSTM-dropout blocks. L2 regularization was used in both LSTM layers, with a value of 0.001. The learning rate was 0.0001, with a decay rate of 0.99, and the batch size was 128.
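
For concreteness, a sketch of this final architecture, assuming the models were built with TensorFlow/Keras (the framework is not stated in the paper); build_model is the illustrative constructor referenced in the training-loop sketch above:

    from tensorflow.keras import Sequential, layers, optimizers, regularizers

    def build_model(n_timesteps=50, n_features=21):
        # Two LSTM layers (1100 and 300 neurons), each followed by dropout of 0.2, with one batch
        # normalization layer between the two LSTM-dropout blocks, L2 regularization of 0.001 on
        # both LSTM layers, recurrent dropout of 0.0, and a sigmoid output for the binary label.
        model = Sequential([
            layers.LSTM(1100, return_sequences=True, recurrent_dropout=0.0,
                        kernel_regularizer=regularizers.l2(0.001)),
            layers.Dropout(0.2),
            layers.BatchNormalization(),
            layers.LSTM(300, recurrent_dropout=0.0,
                        kernel_regularizer=regularizers.l2(0.001)),
            layers.Dropout(0.2),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.build(input_shape=(None, n_timesteps, n_features))  # segments: N samples x 21 features
        model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                      loss="binary_crossentropy", metrics=["accuracy"])
        return model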

III-D Imminent Camera Movement Prediction

The average performance of the 10 runs of the ensemble comprised of these networks on the test set was an accuracy of 0.72, a TPR of 0.74, and a TNR of 0.70 in predicting the timing of the endoscopic camera movements. The confusion matrix of one of the 10 runs is presented in Table II.

True Class \ Predicted Class    Not Before Camera Movement    Before Camera Movement
Not Before Camera Movement      45                            20
Before Camera Movement          16                            49
TABLE II: Test Set Confusion Matrix

Fig. 5 presents examples of the classification of several segments from the test set. Fig. 5(a) shows three segments that were correctly classified as not before camera movement segments, Fig. 5(b) shows segments that were incorrectly classified as not before camera movement segments, Fig. 5(c) shows segments that were incorrectly classified as before camera movement segments, and Fig. 5(d) shows segments that were correctly classified as before camera movement segments. When examining the classification results, we found no clear visual distinction between those classified as before camera movement segments and those classified as not before camera movement segments. Specifically, the instruments were not idle before camera movements, and did not appear to exhibit any visible patterns of preparation towards the upcoming camera movements. It is likely that the ensemble learned more complex relations between all 21 features, rather than something that could be visible from just the Cartesian positions of one of the instruments.

Fig. 5: Classification Examples. This figure shows examples of segments correctly and incorrectly classified as before camera movement segments and not before camera movement segments. The displayed signals are the position of one of the PSMs, recorded relative to the camera. Each plot shows three examples of segments, and the color indicates the direction of the movement, beginning with the light shades, and ending with the dark. (a) True Negative, (b) False Negative, (c) False Positive, and (d) True Positive.

III-E Advance Camera Movement Prediction

After arriving at the ensemble size, the segment duration and the hyperparameters that led to the best performance when predicting the timing of the camera movements, we compared the ensemble's ability to predict an imminent camera movement with its ability to predict an upcoming camera movement in advance. We trained three additional ensembles to predict camera movements 0.25, 0.5 and 1 second in advance. The architecture of these ensembles was identical to that which predicted imminent camera movements, as we were interested in assessing the cost in performance caused by advance prediction. Table III shows the performance of the ensemble when predicting an upcoming camera movement 0.25, 0.5 and 1 second ahead of time, relative to the imminent prediction of a camera movement, in terms of accuracy, TPR and TNR. The values shown in Table III are the averages over the 10 runs; here too, the standard deviations of the 10 repetitions with different random seeds were smaller than 0.1.

Metric \ Time [s]    0.25    0.5     1
Accuracy             98%     94%     84%
TPR                  92%     89%     80%
TNR                  100%    99%     88%
TABLE III: Advance Prediction Performance in Percentages Relative to Imminent Prediction

IV Discussion

In this work we developed an ensemble of neural networks to predict when camera movements will occur in RAMIS using the kinematic data of the three PSMs. We did this by splitting the data into segments, and labeling each either as a segment that immediately precedes a camera movement, or one that does not. Due to the large imbalance between the number of before camera movement segments and not before camera movement segments, we used an ensemble of networks. We showed that the kinematic data can indeed be used to predict camera movements, and found which segment duration and ensemble size were best for the task. Additionally, we found that advance prediction of camera movements in RAMIS is possible.

The accuracy we achieved on the test set was 0.72, which does indeed show that the kinematic data is indicative of camera movements, as chance level in this case is 0.50. However, these results leave room for improvement. There are several potential explanations for this performance level. One reason is that our dataset was very small; more data would likely improve the ensemble's performance. Available and labeled surgical data is a common bottleneck [34, 44, 46, 48], and therefore poses a challenge when using deep learning to analyze surgical data. However, this work is the first to use RAMIS data to predict the timing of the camera movements, and demonstrates that it is possible to do so using the kinematic data of the surgical instruments. As the availability of data increases, so does the potential for better performance. Therefore, this work serves as a proof of concept, and the performance would need to be improved before use in RAMIS.

An additional factor that may have contributed to the performance level is that features other than the instruments likely affect the surgeon's decision about when to move the camera [43]. Moreover, this dataset was comprised of seven surgeons at two expertise levels (novices and experts), each completing two different procedures (Uterine Horn Dissection and Simulated Cuff Closure). More data would additionally allow for training networks separately on different procedures, which may be characterized by different movements. The camera movements also likely differ between expertise levels; for example, experts have been shown to make smaller and more frequent camera movements than novices [28]. More training data would also allow for training networks separately for different surgical expertise levels, which may also improve the performance of the ensemble.

The fact that the kinematics of the instruments can be indicative of camera movements is consistent with previous works that designed systems in which the camera followed the instruments [35, 15, 13, 51, 54]. These works found that the autonomous systems were preferred by users and improved users' performance in dry-lab tasks. For example, [35] designed an algorithm that followed the instruments, such that they remained in the center of the field of view, and demonstrated, using a pick-and-place task, that autonomous camera control led to higher accuracy and shorter task times than manual camera control. However, similar to these works, ours too has the drawback of assuming the instruments to be the only deciding factor in RAMIS camera movements. Therefore, in our future work, we may find it beneficial to add the visual information to the network, such that it can take the entire surgical scene into account. However, unlike previous works in which the camera tracks the surgical instruments, our predictive method does not assume a specific control strategy to be optimal. That is, having the camera follow the instruments assumes a specific method for the camera control. In contrast, using recordings of RAMIS performed by surgeons to train networks to predict camera movements does not assume that the instruments constantly need to be followed. Rather, the ensemble learns to imitate the surgeons based on a combination of the 21 kinematic features.

In this work, we found the use of an ensemble to be an appropriate solution for the class imbalance. This is in accordance with [55] and [47], which both used an ensemble of models with unbalanced datasets. Furthermore, by comparing the performance of the ensemble with that of each of the individual networks composing it, we found that the ensemble led to better and more consistent results than using only under-sampling to deal with the class imbalance (i.e., training only a single network on a subset of the data). We note that we explored the possibility of under-sampling the majority class to different ratios relative to the minority class, and of accounting for the imbalance in the loss calculation; however, we found that these led to lower TPR values than the ensemble.

Commonly, ensembles are comprised of different models [55, 47, 45]. This can be beneficial, as different models may perform better in different aspects of the problem [40]. In our work, the ensemble was comprised of networks with the same architecture, similar to [6]. We chose not to use different networks in our ensemble due to the very small size of our dataset. We posited that examining many possible network combinations and choosing the one with the highest performance on the validation set might not lead to the best performance on the test set, but rather might be tailored to the small validation set.

After creating ensembles comprised of 15 networks, we examined four segment durations to find which would achieve the best results. The segment durations we tested were 0.5, 1, 2, and 4 seconds. We found that the 1 s segment duration yielded the best results. We hypothesize that the 0.5 s duration may not have contained enough information to differentiate between before camera movement and not before camera movement segments. The before camera movement and not before camera movement segments may also have been too similar in the cases of the 2 s and 4 s durations. That is, these longer segments may have contained information indicative of camera movements, but this information may have occupied only part of the segment, while the rest of the segment could have been unrelated to a potential upcoming camera movement. If this were the case, the added information, which was unrelated to the camera movements, could have made it harder for the network to differentiate between before camera movement and not before camera movement segments, leading to the observed decrease in performance.

The 1 s segment duration may appear short at first glance. However, works that have decomposed surgical tasks (e.g., suturing) into the separate gestures of which they are comprised have shown that there are gestures that fall in the range between one and two seconds, and some are even slightly shorter than one second [27, 11]. This indicates that meaningful surgical data can be included in data segments of one second. This is further supported by the fact that reaction times in surgery have been shown to be under half a second [12, 56], and therefore information indicating an upcoming camera movement can be included in a one-second segment. These studies showed that the reaction times were generally above 300 ms, further supporting the idea that the 0.5 s segments likely did not contain enough information to be indicative of camera movements.

We examined the accuracy of our ensemble; however, it was also important to separately assess the ability of the ensemble to recognize an upcoming camera movement when there was one (as quantified by the TPR), as we are interested in a system that may be able to initiate a camera movement at the appropriate time. Of equal importance is the ability of the ensemble to predict no imminent camera movement when there is none (as quantified by the TNR). If the system were to move the camera at an inappropriate time, it could disorient and confuse the surgeon, as well as cause critical features or instruments to fall outside the field of view. The ensemble achieved a TPR of 0.74 and a TNR of 0.70 on the test set. Similar to the accuracy, these metrics need to be further improved for use in RAMIS; however, they are above chance level, and show that the ensemble learned to recognize both before camera movement and not before camera movement segments, as desired.

The end goal of our work is to develop an autonomous camera controller. We trained an ensemble to recognize whether or not a camera movement will follow a segment of kinematic data. Hence, no data from after the initiation of the camera movement was used by our ensemble. This is in contrast to other studies that inputted both past and future RAMIS data points into neural networks in classification problems [27, 22, 11]. Therefore, our proposed ensemble can potentially be used for online prediction of upcoming camera movements. The ensemble was trained to receive a segment of kinematic data and output whether or not the camera should move. Hence, segments of kinematic data can potentially be fed into the trained ensemble throughout RAMIS, and segments that should be followed by a camera movement can be recognized online.

Nevertheless, when designing such an autonomous controller, instructing the camera to move immediately would not be feasible. Therefore, we examined the possibility of predicting that a camera movement will happen in a certain amount of time rather than in the next sample. We found that advance prediction of camera movements is possible. Predicting a camera movement 0.25 s in advance led to only a slight decrease in performance. Predicting camera movements 0.5 s in advance led to a slightly larger decrease in performance; however, here too, all three metrics were close to, or higher than, 90% of the performance in the case of imminent prediction. Predicting camera movements 1 s in advance led to a larger decrease in performance, such that the accuracy of predicting a camera movement 1 s in advance was 84% of that of imminent prediction, i.e., an accuracy level of 0.6. Based on these results, we posit that predicting a camera movement 1 s in advance may lead to too large a decrease in the reliability of the prediction. However, we also note that a larger dataset would allow for a more precise characterization of the limit of advance prediction. Even with the small dataset, our results demonstrate that camera movements can be predicted slightly in advance, with only a small decrease in performance, supporting the possibility of one day performing online camera movement prediction in RAMIS.

V Conclusions

In this work we used an ensemble of LSTM neural networks to analyze recordings of surgical tasks performed with the Da Vinci surgical system, and demonstrated that the kinematic data of the surgical instruments can be used to predict when camera movements will occur. We found the segment duration and ensemble size that produced the best results and tuned the hyperparameters of the networks in the ensemble. Furthermore, we showed that, in addition to the prediction of imminent camera movements, the camera movements can be predicted in advance as well. Our findings may be the first step towards designing a predictive autonomous camera controller, which would allow robotic surgeons to benefit from optimal viewpoints without necessitating the frequent camera movements they currently have to make manually.

Acknowledgments

We would like to thank Intuitive Surgical Inc. for providing us with the RAMIS data and Anthony Jarc for his valuable assistance. Additionally, we thank Yarden Sharon for her valuable insights on the manuscript.

References

  • [1] A. P. Advincula, X. Xu, S. Goudeau, and S. B. Ransom (2007-11) Robot-assisted laparoscopic myomectomy versus abdominal myomectomy: A comparison of short-term surgical outcomes and immediate costs. Journal of Minimally Invasive Gynecology 14 (6), pp. 698–705 (en). External Links: ISSN 15534650, Document Cited by: §I.
  • [2] A. S. Agrawal (2018) Automating endoscopic camera motion for teleoperated minimally invasive surgery using inverse reinforcement learning. Cited by: §I.
  • [3] S. Ali, L. A. Reisner, B. King, A. Cao, G. Auner, M. Klein, and A. K. Pandya (2008) Eye gaze tracking for endoscopic camera positioning: an application of a hardware/software interface developed to automate aesop.. Studies in health technology and informatics 132, pp. 4–7. Cited by: §I.
  • [4] A. I. Aviles, S. M. Alsaleh, P. Sobrevilla, and A. Casals (2015) Force-feedback sensory substitution using supervised recurrent learning for robotic-assisted surgery. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1–4. Cited by: §I.
  • [5] H. Carnahan and R. G. Marteniuk (1991-06) The Temporal Organization of Hand, Eye, and Head Movements during Reaching and Pointing. Journal of Motor Behavior 23 (2), pp. 109–119 (en). External Links: ISSN 0022-2895, 1940-1027, Document Cited by: §I.
  • [6] B. Cheng, W. Wu, D. Tao, S. Mei, T. Mao, and J. Cheng (2020) Random cropping ensemble neural network for image classification in a robotic arm grasping system. IEEE Transactions on Instrumentation and Measurement. Cited by: §IV.
  • [7] J. Conrad, A. Shah, C. Divino, S. Schluender, B. Gurland, E. Shlasko, and A. Szold (2006) The role of mental rotation and memory scanning on the performance of laparoscopic skills. Surgical Endoscopy and Other Interventional Techniques 20 (3), pp. 504–510. Cited by: §I.
  • [8] T. Da Col, A. Mariani, A. Deguet, A. Menciassi, P. Kazanzides, and E. De Momi (2020) SCAN: system for camera autonomous navigation in robotic-assisted surgery. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2996–3002. Cited by: §I.
  • [9] T. Dardona, S. Eslamian, L. A. Reisner, and A. Pandya (2019-04) Remote Presence: Development and Usability Evaluation of a Head-Mounted Display for Camera Control on the da Vinci Surgical System. Robotics 8 (2), pp. 31 (en). External Links: ISSN 2218-6581, Document Cited by: §I, §I.
  • [10] B. Davies (2000-01) A review of robotics in surgery. Proc Inst Mech Eng H 214 (1), pp. 129–140 (en). External Links: ISSN 0954-4119, 2041-3033, Document Cited by: §I.
  • [11] R. DiPietro, C. Lea, A. Malpani, N. Ahmidi, S. S. Vedula, G. I. Lee, M. R. Lee, and G. D. Hager (2016) Recognizing surgical activities with recurrent neural networks. In International conference on medical image computing and computer-assisted intervention, pp. 551–558. Cited by: §II-A, §IV, §IV.
  • [12] L. L. Drag, L. A. Bieliauskas, S. A. Langenecker, and L. J. Greenfield (2010) Cognitive functioning, retirement status, and age: results from the cognitive changes and retirement among senior surgeons study. Journal of the American College of Surgeons 211 (3), pp. 303–307. Cited by: §IV.
  • [13] S. Eslamian, L. A. Reisner, B. W. King, and A. K. Pandya (2016) Towards the implementation of an autonomous camera algorithm on the da vinci platform.. In MMVR, pp. 118–123. Cited by: §I, §I, §I, §IV.
  • [14] S. Eslamian, L. A. Reisner, and A. K. Pandya (2020) Development and evaluation of an autonomous camera control algorithm on the da vinci surgical system. The International Journal of Medical Robotics and Computer Assisted Surgery 16 (2), pp. e2036. Cited by: §I.
  • [15] S. Eslamian, L. Reisner, B. King, and A. Pandya (2017) An Autonomous Camera System using the da Vinci Research Kit. pp. 2 (en). Cited by: §I, §I, §I, §IV.
  • [16] P. J. Fabri and J. L. Zayas-Castro (2008) Human error, not communication and systems, underlies surgical complications. Surgery 144 (4), pp. 557–565. Cited by: §I.
  • [17] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P. Muller (2019-09) Accurate and interpretable evaluation of surgical skills from kinematic data using fully convolutional neural networks. Int J CARS 14 (9), pp. 1611–1617 (en). Note: arXiv: 1908.07319 External Links: ISSN 1861-6410, 1861-6429, Document Cited by: §I.
  • [18] A. G. Gallagher, M. Al-Akash, N. E. Seymour, and R. M. Satava (2009) An ergonomic analysis of the effects of camera rotation on laparoscopic performance. Surgical endoscopy 23 (12), pp. 2684. Cited by: §I.
  • [19] S. Gao, Z. Ma, R. Tsumura, J. Kaminski, L. Fichera, and H. K. Zhang (2021) Augmented immersive telemedicine through camera view manipulation controlled by head motions. In Medical Imaging 2021: Image-Guided Procedures, Robotic Interventions, and Modeling, Vol. 11598, pp. 1159815. Cited by: §I.
  • [20] S. Giri and D. K. Sarkar (2012-06) Current Status of Robotic Surgery. Indian J Surg 74 (3), pp. 242–247 (en). External Links: ISSN 0972-2068, 0973-9793, Document Cited by: §I.
  • [21] G. S. Guthart and J. K. Salisbury (2000) The intuitive/sup tm/telesurgery system: overview and application. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), Vol. 1, pp. 618–621. Cited by: §I.
  • [22] D. A. Hashimoto, G. Rosman, E. R. Witkowski, C. Stafford, A. J. Navarette-Welton, D. W. Rattner, K. D. Lillemoe, D. L. Rus, and O. R. Meireles (2019) Computer vision analysis of intraoperative video: automated recognition of operative steps in laparoscopic sleeve gastrectomy. Annals of surgery 270 (3), pp. 414–421. Cited by: §I, §IV.
  • [23] D. Hernández-Lobato, G. MartíNez-MuñOz, and A. Suárez (2013) How large should ensembles of classifiers be?. Pattern Recognition 46 (5), pp. 1323–1336. Cited by: §II-B2.
  • [24] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §II-B2.
  • [25] A. Hussain, A. Malik, M. U. Halim, and A. M. Ali (2014-11) The use of robotics in surgery: a review. Int J Clin Pract 68 (11), pp. 1376–1382 (en). External Links: ISSN 13685031, Document Cited by: §I, §I.
  • [26] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §II-B2.
  • [27] D. Itzkovich, Y. Sharon, A. Jarc, Y. Refaely, and I. Nisky (2019) Using augmentation to improve the robustness to rotation of deep learning segmentation in robotic-assisted surgical data. In 2019 International Conference on Robotics and Automation (ICRA), pp. 5068–5075. Cited by: §I, §II-A, §II-B2, §IV, §IV.
  • [28] A. M. Jarc and M. J. Curet (2017) Viewpoint matters: objective performance metrics for surgeon endoscope control during robot-assisted surgery. Surgical endoscopy 31 (3), pp. 1192–1202. Cited by: §I, §IV.
  • [29] J. J. Ji, S. Krishnan, V. Patel, D. Fer, and K. Goldberg (2018) Learning 2d surgical camera motion from demonstrations. In 2018 IEEE 14th International Conference on Automation Science and Engineering (CASE), pp. 35–42. Cited by: §I.
  • [30] J. M. Johnson and T. M. Khoshgoftaar (2019) Survey on deep learning with class imbalance. Journal of Big Data 6 (1), pp. 27. Cited by: §I.
  • [31] H. G. Kenngott, B. P. Müller‐Stich, M. A. Reiter, J. Rassweiler, and C. N. Gutt (2008-01) Robotic suturing: Technique and benefit in advanced laparoscopic surgery. Minimally Invasive Therapy & Allied Technologies 17 (3), pp. 160–167 (en). External Links: ISSN 1364-5706, 1365-2931, Document Cited by: §I.
  • [32] A. R. Lanfranco, A. E. Castellanos, J. P. Desai, and W. C. Meyers (2004-01) Robotic Surgery: A Current Perspective. Annals of Surgery 239 (1), pp. 14–21 (en). External Links: ISSN 0003-4932, Document Cited by: §I.
  • [33] B. Li, B. Lu, Y. Lu, Q. Dou, and Y. Liu (2020) Data-driven holistic framework for automated laparoscope optimal view control with learning-based depth perception. arXiv preprint arXiv:2011.11241. Cited by: §I.
  • [34] J. Lu, A. Jayakumari, F. Richter, Y. Li, and M. C. Yip (2020) Super deep: a surgical perception framework for robotic tissue manipulation using deep learning for feature extraction. arXiv preprint arXiv:2003.03472. Cited by: §IV.
  • [35] A. Mariani, G. Colaci, T. Da Col, N. Sanna, E. Vendrame, A. Menciassi, and E. De Momi (2020-04) An Experimental Comparison Towards Autonomous Camera Navigation to Optimize Training in Robot Assisted Surgery. IEEE Robot. Autom. Lett. 5 (2), pp. 1461–1467 (en). External Links: ISSN 2377-3766, 2377-3774, Document Cited by: §I, §I, §I, §IV.
  • [36] M. Medina (1997) Image rotation and reversal-major obstacles in learning intracorporeal suturing and knot-tying. JSLS: Journal of the Society of Laparoendoscopic Surgeons 1 (4), pp. 331. Cited by: §I.
  • [37] C. Molnár, T. D. Nagy, R. N. Elek, and T. Haidegger (2020) Visual servoing-based camera control for the da vinci surgical system. In 2020 IEEE 18th International Symposium on Intelligent Systems and Informatics (SISY), pp. 107–112. Cited by: §I.
  • [38] J. W. Motkoski, Fang Wei Yang, S. H. H. Lwu, and G. R. Sutherland (2013-04) Toward Robot-Assisted Neurosurgical Lasers. IEEE Trans. Biomed. Eng. 60 (4), pp. 892–898 (en). External Links: ISSN 0018-9294, 1558-2531, Document Cited by: §I.
  • [39] A. Pandya, L. Reisner, B. King, N. Lucas, A. Composto, M. Klein, and R. Ellis (2014-08) A Review of Camera Viewpoint Automation in Robotic and Laparoscopic Surgery. Robotics 3 (3), pp. 310–329 (en). External Links: ISSN 2218-6581, Document Cited by: §I.
  • [40] M. P. Perrone and L. N. Cooper (1992) When networks disagree: ensemble methods for hybrid neural networks. Technical report, Brown University, Providence, RI, Institute for Brain and Neural Systems. Cited by: §IV.
  • [41] S. Ramesh, D. Dall’Alba, C. Gonzalez, T. Yu, P. Mascagni, D. Mutter, J. Marescaux, P. Fiorini, and N. Padoy (2021) Multi-task temporal convolutional networks for joint recognition of surgical phases and steps in gastric bypass procedures. arXiv preprint arXiv:2102.12218. Cited by: §I.
  • [42] M. Reza, S. Maeso, J. Blasco, and E. Andradas (2010) Meta-analysis of observational studies on the safety and effectiveness of robotic gynaecological surgery. British Journal of Surgery 97 (12), pp. 1772–1783. Cited by: §I.
  • [43] I. Rivas-Blanco, C. J. Perez-del-Pulgar, C. López-Casado, E. Bauzano, and V. F. Muñoz (2019) Transferring know-how for an autonomous camera robotic assistant. Electronics 8 (2), pp. 224. Cited by: §I, §I, §IV.
  • [44] T. Ross, D. Zimmerer, A. Vemuri, F. Isensee, M. Wiesenfarth, S. Bodenstedt, F. Both, P. Kessler, M. Wagner, B. Müller, et al. (2018) Exploiting the potential of unlabeled endoscopic video data with self-supervised learning. International journal of computer assisted radiology and surgery 13 (6), pp. 925–933. Cited by: §IV.
  • [45] S. Tatinati, K. C. Veluvolu, S. Hong, and K. Nazarpour (2014) Real-time prediction of respiratory motion traces for radiotherapy with ensemble learning. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 4204–4207. Cited by: §IV.
  • [46] B. van Amsterdam, M. J. Clarkson, and D. Stoyanov (2020) Multi-task recurrent neural network for surgical gesture recognition and progress prediction. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 1380–1386. Cited by: §IV.
  • [47] H. Wang, Q. Xu, and L. Zhou (2015) Large unbalanced credit scoring using lasso-logistic regression ensemble. PloS one 10 (2), pp. e0117844. Cited by: §I, §II-B2, §IV, §IV.
  • [48] Z. Wang and A. M. Fey (2018) SATR-dl: improving surgical skill assessment and task recognition in robot-assisted surgery with deep neural networks. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1793–1796. Cited by: §IV.
  • [49] L. W. Way, L. Stewart, W. Gantert, K. Liu, C. M. Lee, K. Whang, and J. G. Hunter (2003) Causes and prevention of laparoscopic bile duct injuries: analysis of 252 cases from a human factors and cognitive psychology perspective. Annals of surgery 237 (4), pp. 460. Cited by: §I.
  • [50] O. Weede, H. Monnich, B. Muller, and H. Worn (2011-05) An intelligent and autonomous endoscopic guidance system for minimally invasive surgery. In 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, pp. 5762–5768 (en). External Links: ISBN 978-1-61284-386-5, Document Cited by: §I.
  • [51] P. J. M. Wijsman, I. A. M. J. Broeders, H. J. Brenkman, A. Szold, A. Forgione, H. W. R. Schreuder, E. C. J. Consten, W. A. Draaisma, P. M. Verheijen, J. P. Ruurda, and Y. Kaufman (2018-05) First experience with THE AUTOLAP™ SYSTEM: an image-based robotic camera steering device. Surg Endosc 32 (5), pp. 2560–2566 (en). External Links: ISSN 0930-2794, 1432-2218, Document Cited by: §I, §I, §IV.
  • [52] P. Wijsman, F. Voskens, L. Molenaar, C. van‘t Hullenaar, E. Consten, W. Draaisma, and I. Broeders (2021) Efficiency in image-guided robotic and conventional camera steering: a prospective randomized controlled trial. Surgical endoscopy, pp. 1–7. Cited by: §I.
  • [53] J. Y. Wu, A. Tamhane, P. Kazanzides, and M. Unberath (2021) Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery. International Journal of Computer Assisted Radiology and Surgery, pp. 1–9. Cited by: §I.
  • [54] B. Yang, W. Chen, Z. Wang, Y. Lu, J. Mao, H. Wang, and Y. Liu (2019-11) Adaptive FOV Control of Laparoscopes With Programmable Composed Constraints. IEEE Trans. Med. Robot. Bionics 1 (4), pp. 206–217 (en). External Links: ISSN 2576-3202, Document Cited by: §I, §I, §IV.
  • [55] D. Zhang, J. Ma, J. Yi, X. Niu, and X. Xu (2015) An ensemble method for unbalanced sentiment classification. In 2015 11th International Conference on Natural Computation (ICNC), pp. 440–445. Cited by: §I, §IV, §IV.
  • [56] B. Zheng, Z. Janmohamed, and C. MacKenzie (2003) Reaction times and the decision-making process in endoscopic surgery. Surgical Endoscopy and Other Interventional Techniques 17 (9), pp. 1475–1480. Cited by: §IV.