Intra-operative anticipation of instrument usage is a necessary component for context-aware assistance in surgery, e.g. for instrument preparation or semi-automation of robotic tasks. However, the sparsity of instrument occurrences in long videos poses a challenge. Current approaches are limited as they assume knowledge of the timing of future actions or require dense temporal segmentations during training and inference. We propose a novel learning task for anticipating instrument usage in laparoscopic videos that overcomes these limitations. During training, only sparse instrument annotations are required, and inference is performed solely on image data. We train a probabilistic model to address the uncertainty associated with future events. Our approach outperforms several baselines and is competitive with a variant using richer annotations. We demonstrate the model's ability to quantify task-relevant uncertainties. To the best of our knowledge, we are the first to propose a method for anticipating instruments in surgery.
Anticipating the usage of surgical instruments before they appear is a highly useful task for various applications in computer-assisted surgery. It represents a large step towards understanding surgical workflow in order to provide context-aware assistance. For instance, it enables more efficient instrument preparation [maier2017surgical]. For semi-autonomous robot assistance, instrument anticipation can facilitate the identification of events that trigger the usage of certain tools and can eventually help a robotic system decide when to intervene. Anticipating the use of instruments such as the irrigator can further enable early detection and anticipation of complications like bleeding.
The proposed applications require continuous anticipation estimates in long surgeries. Many instruments occur only rarely and briefly, i.e. only sparse annotations are available. Nevertheless, a useful anticipation framework should react only to these sparse occurrences and remain idle otherwise. Our approach addresses these requirements and is thus applicable to real-world scenarios. We train a neural network to regress the remaining time until the occurrence of a specific instrument within a given future horizon. Our uncertainty-quantification framework addresses the uncertainty associated with future events. This enables the identification of trigger events for instruments by measuring decreases in the uncertainty associated with anticipation estimates (Fig. 1).
Various works have investigated short-horizon anticipation of human actions [damen2018scaling, gao2017red, jain2016recurrent, vondrick2016anticipating]. However, these methods are only designed to predict actions in the immediate future based on a single frame [damen2018scaling, vondrick2016anticipating] or short sequences of frames [gao2017red]. Most importantly, they are trained and evaluated on artificially constructed scenarios where a new action is assumed to occur within typically one second and only its correct category is unknown. In surgery, however, the challenge rather lies in correctly timing predictions given the video stream of a whole surgical procedure. Our task definition considers long video sequences with sparse instrument usage and encourages models to react only to relevant cues. Jain et al. [jain2016recurrent] address this by adding the default activity 'drive straight' for the anticipation of driver activities. For our task, we propose a similar category for the case when no instrument is used within the defined horizon.
While methods for long-horizon anticipation exist, they require dense, rich action segmentations of the observed sequence during inference and mostly do not use visual features for anticipation [abu2019uncertainty, abu2018will, du2016recurrent, ke2019time, mehrasa2019variational]. This is not applicable to our task, since the information required for anticipating the usage of some instruments relies heavily on visual information. For instance, the usage of the irrigator is often triggered by bleedings and cannot be predicted solely from instrument signals. Some methods utilize visual features but nevertheless require dense action labels [mahmud2017joint, zhong2018time]. However, these labels are tedious to define and annotate and therefore not applicable to many real-world scenarios, especially surgical applications. In contrast, we propose to predict the remaining time until the occurrence of sparse events rather than dense action segmentations. During training, only sparse instrument annotations are required and inference is done solely on image data. This aids data annotation since instrument occurrence does not require complex definitions or expert knowledge [maier2014can].
The uncertainty associated with future events has been addressed by some approaches [abu2019uncertainty, vondrick2016anticipating]. Similar to Farha et al. [abu2019uncertainty], we do so by learning a probabilistic prediction model. Bayesian deep learning through Monte-Carlo dropout provides a framework for estimating uncertainties in deep neural networks [gal2016dropout]. Kendall et al. [kendall2017uncertainties] identify the model and the data as two relevant sources of uncertainty in machine learning. Several approaches have been proposed for estimating these quantities [gal2017deep, kwon2020uncertainty, shridhar2018uncertainty, wang2019aleatoric]. We evaluate our model's ability to quantify task-relevant uncertainties using these insights. The contributions of this work can be summarized as follows:
- To the best of our knowledge, we are the first to propose a method for anticipating the use of surgical instruments for context-aware assistance.
- We reformulate the general task of action anticipation to alleviate limitations of current approaches regarding their applicability to real-world problems.
- Our model outperforms several baseline methods and gives competitive results by learning only from sparse annotations.
- We evaluate our model's ability to quantify uncertainties relevant to the task and show that we can improve performance by filtering uncertain predictions.
- We demonstrate the potential of our model to identify trigger events for surgical instruments through uncertainty quantification.
We define surgical instrument anticipation as a regression task of predicting the remaining time until the occurrence of one of several surgical instruments within a future horizon of h minutes. Given a frame x from a set of recorded surgeries and an instrument i, the ground truth for the regression task is defined as

r_i(x) = min(t_i(x), h),   (Eq. 1)

where t_i(x) is the true remaining time in minutes until the occurrence of i, with t_i(x) = 0 for frames where i is present. The target value is truncated at h minutes, since we assume that instruments cannot be anticipated accurately for arbitrarily long intervals. This design choice encourages the network to react only when the usage of an instrument in the foreseeable future is likely and to predict a constant otherwise. As opposed to current definitions of anticipation tasks, we do not assume an imminent action or rely on dense action segmentations.
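As a rough illustration, this target construction can be sketched as follows; the frame rate and sequence values are illustrative placeholders, not the paper's configuration:

```python
# Sketch of the anticipation ground truth: for each frame, the target is
# the remaining time until the instrument next appears, truncated at the
# horizon h (in minutes). `fps` is an assumed, illustrative frame rate.

def anticipation_targets(presence, horizon, fps=1.0):
    """presence: per-frame 0/1 flags for one instrument.
    Returns r(x) = min(t(x), horizon) in minutes for every frame."""
    n = len(presence)
    targets = [horizon] * n          # background value when nothing is upcoming
    next_occurrence = None           # index of next presence, found by scanning backwards
    for idx in range(n - 1, -1, -1):
        if presence[idx]:
            next_occurrence = idx
        if next_occurrence is not None:
            minutes_until = (next_occurrence - idx) / (fps * 60.0)
            targets[idx] = min(minutes_until, horizon)
    return targets
```

Frames after the last occurrence keep the constant background value h, matching the idea that the network should remain idle there.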
For regularization, we add a similar classification objective to predict one of three categories, which correspond to the instrument appearing within the next h minutes, the instrument being present, and a background category when neither is the case. In Section 3.2, we discuss the benefits of this regularization task for uncertainty quantification.
Due to the inherent ambiguity of future events, anticipation tasks are challenging and benefit from estimating uncertainty scores alongside model predictions. Bayesian neural networks enable uncertainty quantification by estimating likelihoods for predictions rather than point estimates [kendall2017uncertainties]. Given data X with labels Y, Bayesian neural networks place a prior distribution over the parameters w, which results in the posterior p(w | X, Y). Since the integration over the parameter space makes learning intractable, variational inference [graves2011practical] is often used to approximate the posterior by minimizing the Kullback-Leibler divergence to a tractable distribution q(w). Gal et al. [gal2016dropout] have shown that this is equivalent to training a network with dropout if q(w) is in the family of binomial dropout distributions.
During inference, we draw T = 10 parameter samples w_t ~ q(w) to approximate the predictive expectation for regression variables and the predictive posterior for classification variables (Eq. 2 & 3):

E[y] ≈ (1/T) Σ_t f_reg(x; w_t),   p(y = c | x) ≈ (1/T) Σ_t f_softmax(x; w_t),

where f_reg and f_softmax are the network's regression and softmax outputs parametrized by w_t [kendall2017uncertainties].
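A minimal sketch of this sampling scheme, with a toy stand-in for the network's stochastic forward pass (all values and function names are illustrative, not the paper's implementation):

```python
import numpy as np

# MC-dropout inference sketch: T stochastic forward passes approximate the
# predictive expectation (regression) and predictive posterior (classification).
# `stochastic_forward` stands in for a network evaluated with a freshly
# sampled dropout mask; its outputs here are toy values.

rng = np.random.default_rng(0)

def stochastic_forward(x):
    reg = 1.5 + 0.1 * rng.standard_normal()                     # toy regression output
    logits = np.array([2.0, 0.5, -1.0]) + 0.1 * rng.standard_normal(3)
    return reg, logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mc_predict(x, T=10):
    regs, probs = [], []
    for _ in range(T):
        reg, logits = stochastic_forward(x)
        regs.append(reg)
        probs.append(softmax(logits))
    # average over parameter samples: predictive expectation and posterior
    return np.mean(regs), np.mean(probs, axis=0)

reg_mean, class_post = mc_predict(None, T=10)
```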
We estimate uncertainties through the predictive variance. Kendall et al. [kendall2017uncertainties] argue that predictive uncertainty can be divided into aleatoric (data) uncertainty, which originates from missing information in the data, and epistemic (model) uncertainty, which is caused by the model’s lack of knowledge about the data. Intuitively, uncertainty regarding future events corresponds best to aleatoric uncertainty but presumably affects model variations (epistemic) as well.
For regression, we follow Kendall et al.'s approach and estimate epistemic uncertainty as the predictive variance over parameter samples (Eq. 4), which captures noise in the model. We omit aleatoric uncertainty for regression, since it was not effective for our task. For classification variables, we follow Kwon et al. [kwon2020uncertainty] (Eq. 5). The epistemic term captures noise in the model by estimating the variance of the softmax outputs over parameter samples, while the variance of the multinomial softmax distribution in the aleatoric term captures inherent noise independent of the parameter samples. In the classification case, uncertainties are averaged over all classes.
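The two decompositions can be sketched as follows; this is a simplified, class-averaged version over sampled softmax vectors, written for illustration rather than as the authors' implementation:

```python
import numpy as np

# Uncertainty estimates from T stochastic forward passes.
# Regression (epistemic, Kendall-style): variance of the sampled outputs.
# Classification (Kwon-style): the aleatoric term averages the per-sample
# multinomial variance p_t * (1 - p_t); the epistemic term is the variance
# of the softmax vectors around their mean. Both are averaged over classes.

def regression_epistemic(samples):
    samples = np.asarray(samples, float)
    return np.mean(samples ** 2) - np.mean(samples) ** 2

def classification_uncertainties(probs):
    probs = np.asarray(probs, float)              # shape (T, num_classes)
    p_bar = probs.mean(axis=0)
    aleatoric = np.mean(probs * (1.0 - probs), axis=0)
    epistemic = np.mean((probs - p_bar) ** 2, axis=0)
    return aleatoric.mean(), epistemic.mean()
```

Intuitively, confident one-hot samples yield zero aleatoric uncertainty, while disagreement between samples drives the epistemic term.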
The model (suppl. material) consists of a Bayesian AlexNet-style convolutional network [krizhevsky2012imagenet] and a Bayesian Long Short-Term Memory network (LSTM) [hochreiter1997long]. We sample dropout masks with a fixed dropout rate once per video and per parameter sample, and reuse the same masks at each time step, as proposed by Gal et al. [gal2016theoretically] for recurrent architectures. The AlexNet backbone has proven effective in a similar setting [bodenstedt2019active] and empirically gave the best results. State-of-the-art architectures such as ResNet [he2016deep] performed poorly, as they appeared to learn from future frames through the batch statistics in batch-normalization layers [ioffe2015batch]. Further, the AlexNet can be trained from scratch, which is beneficial when introducing dropout layers. The code is published at https://www.gitlab.com/nct_tso_public/ins_ant.
We train on the Cholec80 dataset [twinanda2016endonet] of 80 recorded cholecystectomies and anticipate sparsely used instruments which are associated with specific tasks in the surgical workflow, i.e. bipolar, scissors, clipper, irrigator and specimen bag, each of which appears in only a small fraction of frames. Grasper and hook are dropped, as they are used almost constantly during procedures and are hence not of interest for anticipation. We extract frames at a fixed frame rate, resize them to a fixed width and height, process batches of sequential frames and accumulate gradients over three batches. We use 60 videos for training and 20 for testing. We train for 100 epochs using the Adam optimizer. The loss (Eq. 6) is composed of smooth L1 [twinanda2018rsdnet] for the primary regression task, cross entropy (CE) for the regularizing classification task, and L2-regularization over the parameter estimates of the approximate distribution q(w), with fixed weighting coefficients for the individual terms.
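A hedged sketch of such a combined loss in NumPy; the coefficients `lam_cls` and `lam_l2` are hypothetical stand-ins, not the paper's values:

```python
import numpy as np

# Combined training loss sketch: smooth L1 on the regression output,
# cross-entropy on the auxiliary classification head, plus L2 weight
# regularization. Coefficients are illustrative placeholders.

def smooth_l1(pred, target):
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).mean()

def cross_entropy(probs, labels, eps=1e-12):
    # probs: (N, C) softmax outputs, labels: (N,) integer class indices
    return -np.log(probs[np.arange(len(labels)), labels] + eps).mean()

def total_loss(reg_pred, reg_target, cls_probs, cls_labels, weights,
               lam_cls=1.0, lam_l2=1e-4):   # hypothetical coefficients
    return (smooth_l1(reg_pred, reg_target)
            + lam_cls * cross_entropy(cls_probs, cls_labels)
            + lam_l2 * sum((w ** 2).sum() for w in weights))
```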
[Table 1: results for horizons of 2, 3, 5 and 7 minutes]
We evaluate frame-wise based on a weighted mean absolute error (wMAE). We average the MAE of 'anticipating' frames (0 < r_i(x) < h) and 'background' frames (r_i(x) = h) to compensate for the imbalance in the data. As instruments are not always predictable, a low recall does not necessarily indicate poor performance, making precision metrics popular for anticipation [gao2017red, vondrick2016anticipating]. We capture the idea of precision in the pMAE as the MAE of predictions which lie strictly inside the horizon, i.e. when the model is anticipating i. The thresholds defining these predictions are chosen for robustness against small variations during 'background' predictions.
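The wMAE can be sketched as follows, assuming targets constructed as in Eq. 1:

```python
import numpy as np

# Weighted MAE sketch: MAE over 'anticipating' frames (0 < target < h) and
# over 'background' frames (target == h) are computed separately and
# averaged, so the rare anticipating frames are not drowned out by the
# dominant background class.

def wMAE(pred, target, h):
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    anticipating = (target > 0) & (target < h)
    background = target == h
    mae_ant = np.abs(pred[anticipating] - target[anticipating]).mean()
    mae_bg = np.abs(pred[background] - target[background]).mean()
    return 0.5 * (mae_ant + mae_bg)
```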
Since our task is not comparable to current anticipation methods, we compare to two histogram-based baselines. For instrument i, each bin accumulates the occurrences of i within the corresponding temporal segments of all training videos. If the bin count exceeds a learned threshold, we assume that i occurs regularly in the corresponding video segment and generate anticipation values according to Eq. 1. Using 1000 bins, the thresholds are optimized to achieve the best training performance w.r.t. our main metric wMAE. For MeanHist, segments are expanded to the mean video duration. For OracleHist, we expand the segments to the real video duration at train and test time. This is a strong baseline, as instrument usage correlates strongly with the progress of surgery, which is not known beforehand. See Fig. 4 for a visual overview of the baseline construction.
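A simplified sketch of the histogram construction (the bin count and threshold here are illustrative, not the paper's 1000-bin, wMAE-optimized setup):

```python
# Histogram baseline sketch: each training video is split into a fixed
# number of bins over normalized video progress, instrument occurrences
# are accumulated per bin, and bins whose count exceeds a threshold are
# treated as segments where the instrument is assumed to occur.

def histogram_baseline(train_presences, n_bins=10):
    """train_presences: list of per-frame 0/1 lists, one per video.
    Returns per-bin occurrence counts over normalized progress."""
    counts = [0] * n_bins
    for presence in train_presences:
        n = len(presence)
        for idx, p in enumerate(presence):
            if p:
                counts[min(int(idx / n * n_bins), n_bins - 1)] += 1
    return counts

def predicted_segments(counts, threshold):
    # bins where the instrument is assumed to occur regularly
    return [b for b, c in enumerate(counts) if c > threshold]
```

At test time, these segments would be mapped back to the mean video duration (MeanHist) or the true duration (OracleHist) before generating anticipation values.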
Additionally, we compare our model to a variant simultaneously trained on dense surgical phase segmentations [twinanda2016endonet] to investigate whether our model achieves competitive results using only sparse instrument labels. Surgical phases strongly correlate with instrument usage [twinanda2016endonet] and have been shown to be beneficial for the related task of predicting the remaining surgery duration [twinanda2018rsdnet]. Finally, we compare to a non-Bayesian variant of our model without dropout to show that the Bayesian setting does not lead to a decline in performance.
We train models on horizons of 2, 3, 5 and 7 minutes and repeat runs four times for Ours and Ours non-Bayes. and twice for Ours+Phase. In all settings, our methods outperform MeanHist by a large margin (Table 1). Compared to the offline baseline OracleHist, we achieve lower pMAE and comparable wMAE errors, even though knowledge about the duration of a procedure provides strong information regarding the occurrence of instruments. Further, there is no visible difference in performance with and without surgical phase regularization. This suggests that our approach performs well by learning only from sparse instrument labels. The Bayesian setting also does not lead to a drop in performance while adding the advantage of uncertainty estimation. In the instrument-wise evaluation (Table 2), we outperform MeanHist for all instruments and OracleHist for all except the specimen bag. However, this instrument is easy to anticipate when the procedure duration is known, since it is always used toward the end. For instrument-wise errors of the other horizons, see the supplementary material.
We analyze the model's ability to quantify uncertainties. For all experiments, we consider predictions which indicate that the model is anticipating an instrument, i.e. predictions lying strictly inside the future horizon.
Uncertainty quantification enables identification of events which trigger the usage of instruments. We evaluate the model using the known event-trigger relationship of scissors and clipper in cholecystectomies, where the cystic duct is first clipped and subsequently cut. Hence, we expect lower uncertainty for anticipating scissors when a clipper is visible. Fig. 2 supports this hypothesis for epistemic uncertainty during regression. However, the difference is marginal and most likely not sufficient for identifying trigger events. Even though clipper occurrence makes usage of the scissors foreseeable, predicting the exact time of occurrence is challenging and contains noise. Uncertainties for classification are more discriminative. The classification objective eliminates the need for exact timing and enables high-certainty class predictions. Both epistemic and aleatoric estimates are promising while the latter seems to be more discriminative. This is consistent with our hypothesis that uncertainty regarding future events corresponds best to aleatoric uncertainty but induces epistemic uncertainty as well.
We assess the quality of uncertainty estimates through correlations with erroneous predictions, as high uncertainty should result in higher errors. For regression (Fig. 3, center), we observe the highest Pearson correlation coefficients (PCC) for scissors, clipper and specimen bag. These instruments are presumably the most predictable, since they empirically yield the best results (Table 2) and are correlated with specific surgical phases ('Clipping & Cutting' and 'Gallbladder Packaging') [twinanda2016endonet]. Irrigator and bipolar yield less reliable predictions as they are used more dynamically. For classification (Fig. 3, right), the median aleatoric uncertainty of true positive predictions for the 'anticipating' class is almost consistently lower than for false positives. Scissors and specimen bag show the largest margin. We can reduce precision errors (pMAE) by filtering uncertain predictions, shown in Fig. 3 (left) for epistemic regression uncertainty. As expected, the decrease is steeper for instruments with higher PCC.
We propose a novel task and model for uncertainty-aware anticipation of intra-operative instrument usage. Limitations of current anticipation methods are addressed by enabling anticipation of sparse events in long videos. Our approach outperforms several baselines and matches a model variant using richer annotations, which indicates that sparse annotations suffice for this task. Since uncertainty estimation is useful for both anticipation tasks and intra-operative applications, we employ a probabilistic model. We demonstrate the model’s ability to quantify task-relevant uncertainties by investigating error-uncertainty correlations and show that we can reduce errors by filtering uncertain predictions. Using a known example, we illustrate the model’s potential for identifying trigger events for instruments, which could be useful for robotic applications. Future work could investigate more effective methods for uncertainty quantification where aleatoric uncertainty is especially interesting due to its link to future events.
Rethinking Anticipation Tasks: Uncertainty-aware Anticipation of Sparse Surgical Instrument Usage for Context-aware Assistance (Supplementary Material) D. Rivoir et al.
[Supplementary tables: instrument-wise errors for horizons of 2, 5 and 7 minutes]