The aim of computer-assisted surgery (CAS) is to provide the surgeon with the right type of assistance at the right moment. For many applications in CAS, such as providing the position of a tumor, specifying the most probable tool required next by the surgeon or determining the remaining duration of surgery, analyzing the surgical workflow is a prerequisite. Surgical workflow analysis comprises methods for perceiving and understanding surgical processes in the operating room, generally via data collected from sensors or from human input lalys2014surgical. Since laparoscopic surgeries are performed using an endoscopic camera, a video stream is always available during surgery, making it the obvious choice as input sensor data for workflow analysis.
Several state-of-the-art methods for video-based surgical workflow analysis utilize convolutional neural networks (CNNs) for interpreting the video stream aksamentov2017deep ; chen2018endo3d ; jin2018sv ; twinanda2017endonet ; zisimopoulos2018deepphase. Deep neural networks, such as CNNs, have a high number of parameters that have to be determined during training, which requires a large amount of annotated data. For many tasks in surgical workflow analysis, expert knowledge is often required for labeling data, making it difficult to obtain a sufficient amount of annotations. Motivated by the fact that data without annotations is often readily available, multiple methods for pretraining CNNs using unlabeled data for solving surgical workflow related tasks have been recently proposed bodenstedt2017unsupervised ; funke2018temporal ; ross2018exploiting ; yengera2018less. These methods generally exploit information inherent in the unlabeled data to solve an auxiliary task related to the actual problem. Recently, crowdsourcing-based approaches have been used to successfully create annotations for simple surgical workflow related tasks in laparoscopy, such as tool segmentation maier2014can ; maier2016crowd, locating point correspondences maier2015crowdtruth and assessing skills deal2016crowd. More complex tasks, such as surgical phase segmentation, require more task-specific background knowledge, which generally only domain experts, such as surgeons, possess. Often these experts have limited resources for labeling such data, making it difficult to acquire large, annotated data sets.
A system that could instead actively ask for expert labels only on certain examples, e.g. examples with a high uncertainty, would reduce the total annotation effort and make collecting large, annotated datasets for surgical workflow analysis more feasible. Such a system is called an active learning system cohn1996active. During active learning, an initial model is trained using a small amount of labeled data, the initial training set. An acquisition function then determines through a metric, such as uncertainty, which data points should be labeled next. A new model is then trained on the extended training data gal2017Active.
Recently, new methods for estimating uncertainties on the predictions of deep neural networks, such as Deep Bayesian Networks (DBN), have been developed gal2016uncertainty. Seeing that such estimates can be used for active learning has motivated Gal et al. gal2017Active to formulate acquisition functions based on DBNs.
In this paper, we investigate if an active learning system based on DBNs can successfully guide the annotation process for image- and video-based surgical workflow related tasks and thereby reduce the number of required labels. For this, we first modify the framework proposed in Gal et al. gal2017Active for laparoscopic instrument presence detection and phase segmentation. Namely, our main contributions are the following:
- Propose a solution for multi-class annotations with DBN-based active learning
- Propose a recurrent network for DBN-based active learning with laparoscopic videos
- Extend the previous network to allow partial annotation of videos
- Evaluate and compare the proposed methods using the publicly available Cholec80 dataset twinanda2017endonet
To the best of our knowledge, we are the first to apply DBN-based active learning to annotate data related to surgical workflow. Furthermore, as far as we are aware, this is the first work that utilizes DBN-based active learning for video annotation.
In this section, we introduce methods for image-based and video-based active learning for surgical workflow analysis tasks. The basis of our image-based active learning system is a standard CNN that is transformed into a DBN (section 2.1). This serves as the basis for performing DBN-based active learning on single video frames. To allow active learning on video data, the DBN is further extended into a recurrent DBN (section 2.2). To use the likelihoods of the DBN to select which data points should be labeled next, several different metrics are possible, which are described in section 2.3.
2.1 Bayesian Network
A standard CNN, based on the AlexNet architecture Alexnet and pretrained on ImageNet (see fig. 1), serves as the foundation of the proposed system for active learning. Active learning requires a method for gauging which unlabeled training examples are "difficult" for the current model, e.g. when given an input $x$, a (softmax) output $y$ and training data $\mathcal{D}_{train}$, determining the likelihood $p(y = c \mid x, \mathcal{D}_{train})$ of label $c$. While neural networks generally do not output a binary class prediction, but instead a fuzzy prediction, e.g. through a sigmoid or a softmax non-linearity, it has been found that these outputs are not well calibrated and are therefore not suitable as probability estimates guo2017calibration.
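DBNs, on the other hand, are a mathematically grounded concept for estimating likelihoods for predictions gal2016uncertainty. DBNs are deep neural networks with a prior probability distribution, such as a Gaussian prior, placed over the weights $\omega$ of the network: $\omega \sim p(\omega)$. The likelihood of a classification is then defined as

$$p(y = c \mid x, \omega) = \mathrm{softmax}\left(f^{\omega}(x)\right)_c,$$

where $f^{\omega}(x)$ is the output of the network, which depends on the weights $\omega$. Inference in DBNs requires the posterior $p(\omega \mid \mathcal{D}_{train})$ given the training data $\mathcal{D}_{train}$, which is extremely difficult to infer. Instead, the posterior can be approximated by a distribution $q^{*}_{\theta}(\omega)$ through Monte Carlo dropout, which is done by performing random dropout on every weight layer during training and testing. Monte Carlo dropout can be shown to be equivalent to performing approximate variational inference, which minimizes the Kullback-Leibler divergence to the true posterior:

$$p(y = c \mid x, \mathcal{D}_{train}) = \int p(y = c \mid x, \omega)\, p(\omega \mid \mathcal{D}_{train})\, d\omega \approx \int p(y = c \mid x, \omega)\, q^{*}_{\theta}(\omega)\, d\omega \approx \frac{1}{T} \sum_{t=1}^{T} p(y = c \mid x, \hat{\omega}_t)$$

with $\hat{\omega}_t \sim q^{*}_{\theta}(\omega)$, where $q_{\theta}(\omega)$ is the dropout distribution gal2016uncertainty. In other words, to determine the likelihood of a classification of a sample $x$ during testing, we classify the sample $T$ times using Monte Carlo dropout and average the outputs of the softmax.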
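The averaging step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy linear classifier, the dropout applied to the inputs only, and the names `stochastic_forward` and `mc_dropout_predict` are hypothetical stand-ins, whereas the network in this work is an AlexNet-based CNN with dropout on every weight layer.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(x, weights, drop_prob, rng):
    """One forward pass of a toy linear classifier (an assumption, not AlexNet).

    Dropout stays active at test time, which makes every pass stochastic.
    """
    keep = 1.0 - drop_prob
    # Sample a Bernoulli mask and rescale the surviving inputs (inverted dropout).
    dropped = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * xi for w, xi in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(x, weights, num_passes=50, drop_prob=0.5, seed=0):
    """Average the softmax outputs of `num_passes` stochastic forward passes."""
    rng = random.Random(seed)
    samples = [
        stochastic_forward(x, weights, drop_prob, rng) for _ in range(num_passes)
    ]
    num_classes = len(weights)
    mean = [sum(s[c] for s in samples) / num_passes for c in range(num_classes)]
    # The individual samples are returned as well, since uncertainty metrics
    # need the spread of the predictions and not only their mean.
    return mean, samples
```

The returned mean approximates the likelihood of each class, while the individual samples capture how much the predictions disagree between passes.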
The previous CNN that has been extended into a DBN can be seen in fig. 1. By applying task-specific classification layers to the network, predictions and their likelihood can be estimated.
2.2 Recurrent Bayesian Network
Many tasks in surgical workflow analysis, such as phase segmentation, require that frames are viewed in the context of an entire video or at least in the context of previous frames. Recurrent neural networks (RNN) make such an analysis possible by introducing recurrence into the topology of a network. This allows information from previous frames to contribute to future predictions.
Long short-term memory units (LSTM), a more complex form of the RNN, can learn to strategically remember, but also forget, information from previously seen inputs, while avoiding the problem of vanishing gradients common to RNNs hochreiter1997long. Combining CNNs with LSTMs makes video-based workflow analysis possible using exclusively deep neural networks bodenstedt2017unsupervised ; chen2018endo3d ; funke2018temporal ; yengera2018less.
By applying the paradigm described in section 2.1, we can extend the topology of a CNN-LSTM based on AlexNet Alexnet (see fig. 2) into that of a Bayesian CNN-LSTM (see fig. 2). One approach to perform inference with this network would be to naively apply random dropouts independently to each weight layer for every element in a given sequence. Multiple studies, however, indicate that such a naive dropout has negative effects on RNNs, such as added noise and a disruption of dynamics gal2016theoretically. As an alternative, the authors in gal2016theoretically propose a theoretically grounded variant of dropout for LSTMs. The idea is to sample dropout masks for each layer in the recurrent DBN at the beginning of each sequence and to use the same mask for each time-step (see fig. 3). The naive approach would be equivalent to sampling new masks at every time-step.
This recurrent DBN makes video-based classification possible, while simultaneously allowing likelihood estimations for each classification.
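The difference between the shared-mask variant and the naive variant can be illustrated with a toy recurrent network. The sketch below is a simplification under stated assumptions: it uses a plain tanh RNN instead of an LSTM and applies dropout only to the recurrent connection, and the names `sample_mask` and `run_rnn` are hypothetical.

```python
import math
import random


def sample_mask(size, drop_prob, rng):
    """Bernoulli dropout mask with inverted-dropout scaling."""
    keep = 1.0 - drop_prob
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(size)]


def run_rnn(sequence, w_in, w_rec, drop_prob, rng, share_mask=True):
    """Toy RNN: h_t = tanh(w_in * x_t + w_rec @ (mask * h_{t-1})).

    share_mask=True  -> one mask per sequence, reused at every time step.
    share_mask=False -> a new mask at every time step (the naive variant).
    Returns the hidden states and the masks that were applied.
    """
    hidden_size = len(w_in)
    h = [0.0] * hidden_size
    mask = sample_mask(hidden_size, drop_prob, rng)  # sampled once per sequence
    states, masks = [], []
    for x_t in sequence:
        if not share_mask:
            mask = sample_mask(hidden_size, drop_prob, rng)  # resampled per step
        h_dropped = [m * v for m, v in zip(mask, h)]
        h = [
            math.tanh(
                w_in[j] * x_t
                + sum(w_rec[j][k] * h_dropped[k] for k in range(hidden_size))
            )
            for j in range(hidden_size)
        ]
        states.append(h)
        masks.append(list(mask))
    return states, masks
```

With `share_mask=True` the same hidden units are dropped throughout the sequence, so the recurrent dynamics are not perturbed by fresh noise at every step.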
2.3 Acquisition Functions
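Given a DBN $\mathcal{M}$ with weights $\omega$ and a pool $\mathcal{D}_{pool}$ of unlabeled data points, the active learning framework uses an acquisition function $a(x, \mathcal{M})$, with $x \in \mathcal{D}_{pool}$, to determine which data points show high levels of uncertainty. The following criterion is used to select which data points should be labeled next:

$$x^{*} = \underset{x \in \mathcal{D}_{pool}}{\arg\max}\; a(x, \mathcal{M})$$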
The authors in gal2017Active propose multiple acquisition functions that have to be evaluated experimentally for their suitability in active learning for surgical workflow tasks.
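Selecting the next queries then amounts to ranking the pool by acquisition score. A minimal sketch, assuming the scores have already been computed for every pool element; the function name `select_queries` is illustrative only.

```python
def select_queries(pool_scores, num_queries):
    """Return the pool indices with the highest acquisition scores.

    pool_scores[i] is the (scalar) uncertainty of the i-th unlabeled data point.
    """
    ranked = sorted(
        range(len(pool_scores)), key=lambda i: pool_scores[i], reverse=True
    )
    return ranked[:num_queries]
```

For example, `select_queries([0.1, 0.9, 0.4, 0.7], 2)` picks the second and fourth data points, which carry the two largest scores.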
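One simple metric for measuring the uncertainty of a given prediction is to compute the variance of the $T$ different likelihoods contributing to the posterior:

$$\mathrm{Var}_c(x) = \frac{1}{T} \sum_{t=1}^{T} \left( p(y = c \mid x, \hat{\omega}_t) - \bar{p}_c \right)^2$$

with $\bar{p}_c = \frac{1}{T} \sum_{t=1}^{T} p(y = c \mid x, \hat{\omega}_t)$, where $\hat{\omega}_t$ denotes the weights sampled in the $t$-th Monte Carlo dropout pass. Variance measures how the likelihood predictions are spread around their arithmetic mean. Here we assume that a large spread corresponds to a large amount of uncertainty.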
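A minimal sketch of this metric, assuming the stochastic predictions are available as a list of per-class likelihood vectors (one vector per Monte Carlo dropout pass); the name `predictive_variance` is hypothetical.

```python
def predictive_variance(samples):
    """Per-class variance of the stochastic predictions.

    samples[t][c] is the likelihood of class c in the t-th forward pass.
    """
    num_passes = len(samples)
    num_classes = len(samples[0])
    means = [sum(s[c] for s in samples) / num_passes for c in range(num_classes)]
    return [
        sum((s[c] - means[c]) ** 2 for s in samples) / num_passes
        for c in range(num_classes)
    ]
```

Identical predictions in every pass yield a variance of zero, while passes that disagree strongly yield a large variance.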
Variation Ratio (VR)
Similarly to variance, the variation ratio also measures the spread of the predictions, in this case around the mode, i.e. the most commonly predicted class:

$$\mathrm{VR}(x) = 1 - \frac{f_m}{T},$$

where $f_m$ is the frequency of the mode in the $T$ predictions.
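The variation ratio can be sketched as follows, assuming each stochastic forward pass votes for its most likely class; the name `variation_ratio` is illustrative.

```python
from collections import Counter


def variation_ratio(samples):
    """1 minus the fraction of forward passes that agree with the modal class.

    samples[t][c] is the likelihood of class c in the t-th forward pass.
    """
    # Each pass votes for the class with the highest likelihood.
    votes = [max(range(len(s)), key=lambda c: s[c]) for s in samples]
    mode_count = Counter(votes).most_common(1)[0][1]
    return 1.0 - mode_count / len(samples)
```

If three of four passes predict the same class, the variation ratio is 0.25; if all passes agree, it is 0.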
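A further possibility for measuring the uncertainty of the posterior likelihood is using predictive entropy from information theory:

$$\mathbb{H}[y \mid x, \mathcal{D}_{train}] = - \sum_{c} p(y = c \mid x, \mathcal{D}_{train}) \log p(y = c \mid x, \mathcal{D}_{train})$$

Here, $\mathbb{H}$ reaches its maximum when the likelihood of all possible classes becomes equal. Its minimum (zero) is reached when the likelihood of a single class is equal to one.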
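As a small worked example of these two extremes, assume 7 classes (as for the surgical phases in Cholec80) and the natural logarithm; the base of the logarithm is an assumption, since it is not stated in the text.

```latex
\begin{align*}
p(y = c \mid x, \mathcal{D}_{train}) = \tfrac{1}{7} \;\; \forall c
  &\;\Rightarrow\; \mathbb{H} = -\sum_{c=1}^{7} \tfrac{1}{7} \log \tfrac{1}{7} = \log 7 \approx 1.946, \\
p(y = c^{*} \mid x, \mathcal{D}_{train}) = 1
  &\;\Rightarrow\; \mathbb{H} = -1 \cdot \log 1 = 0 \quad (\text{using } 0 \log 0 := 0).
\end{align*}
```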
Mutual Information (MI)
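An extension of predictive entropy is to examine the mutual information between the posterior and the likelihoods of the predictions:

$$\mathbb{I}[y, \omega \mid x, \mathcal{D}_{train}] = \mathbb{H}[y \mid x, \mathcal{D}_{train}] - \mathbb{E}_{p(\omega \mid \mathcal{D}_{train})}\left[ \mathbb{H}[y \mid x, \omega] \right],$$

where $\mathbb{H}$ denotes the entropy.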
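A minimal sketch of how the mutual information could be estimated from Monte Carlo dropout samples, assuming the expectation over the posterior is replaced by the average over the sampled forward passes; the names `entropy` and `mutual_information` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; terms with zero probability contribute zero."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mutual_information(samples):
    """Entropy of the mean prediction minus the mean entropy of the samples.

    samples[t][c] is the likelihood of class c in the t-th forward pass.
    """
    num_passes = len(samples)
    num_classes = len(samples[0])
    mean = [sum(s[c] for s in samples) / num_passes for c in range(num_classes)]
    expected_entropy = sum(entropy(s) for s in samples) / num_passes
    return entropy(mean) - expected_entropy
```

The estimate is large when the individual passes are each confident but disagree with one another, and close to zero when every pass returns the same distribution, even if that distribution is itself uncertain.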
To evaluate the suitability of the DBNs described in the previous section for active learning in workflow analysis tasks, we examine two different applications. In section 3.1 we extend a DBN to perform active learning for laparoscopic instrument presence detection and in section 3.2 a recurrent DBN is used to perform active learning for surgical phase segmentation. For both tasks, the publicly available Cholec80 dataset twinanda2017endonet is used. It consists of 80 videos from laparoscopic cholecystectomies, in which surgical instrument presence and surgical phases have been annotated.
3.1 Instrument Presence
The last layer of the DBN consists of 7 units, one for each instrument type in Cholec80. Since multiple instruments of different types can be visible in the same video frame, a sigmoid nonlinearity is used on the final layer instead of a softmax. During training, we use a weighted Dice loss milletari2016v as cost function.
As this is a multi-class problem, computing the uncertainty of the prediction for a given image $x$ with any one of the acquisition functions $a$ outlined in section 2.3 will not return a scalar value, but instead a 7-dimensional vector $a(x, \mathcal{M}) = (a_1, \dots, a_7)$ containing the uncertainty of each class. To reduce this vector to a scalar, we examine the suitability of two different methods for aggregation:

$$a_{\mathrm{mean}}(x, \mathcal{M}) = \frac{1}{7} \sum_{c=1}^{7} a_c, \qquad a_{\max}(x, \mathcal{M}) = \max_{c} a_c$$

The frames with the highest uncertainty from the pool of unlabeled data $\mathcal{D}_{pool}$ are then selected for annotation. $a_{\mathrm{mean}}$ would here favor frames with a high uncertainty across all classes, while $a_{\max}$ would favor frames in which one class shows a large amount of uncertainty, regardless of the uncertainties of the other classes.
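The effect of the aggregation method on the ranking can be illustrated with a short sketch. It assumes the two aggregation methods are the mean and the maximum over the per-class uncertainties; the function names are hypothetical.

```python
def aggregate_mean(class_uncertainties):
    """Average uncertainty over all classes."""
    return sum(class_uncertainties) / len(class_uncertainties)


def aggregate_max(class_uncertainties):
    """Uncertainty of the single most uncertain class."""
    return max(class_uncertainties)


def rank_frames(frame_uncertainties, aggregate):
    """Return frame indices sorted from most to least uncertain."""
    scores = [aggregate(u) for u in frame_uncertainties]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Frame 0: moderate uncertainty on all 7 instrument classes.
# Frame 1: high uncertainty on a single class only.
frames = [[0.3] * 7, [0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
by_mean = rank_frames(frames, aggregate_mean)
by_max = rank_frames(frames, aggregate_max)
```

In this toy example the mean ranks frame 0 first, whereas the maximum ranks frame 1 first, which mirrors the two selection behaviors described above.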
3.2 Phase Segmentation
To allow active learning for surgical phase segmentation, we extend the recurrent DBN proposed in section 2.2 by adding a fully connected output layer (see fig. 4). The layer has 7 output units, one for each surgical phase in Cholec80, and uses a softmax nonlinearity. During training, we use cross-entropy as cost function.
We assume that judging the current surgical phase from a single video frame is difficult and prone to ambiguities. We therefore opt to query annotations for temporally connected segments. For this, we propose two methods for selecting the next queries from the pool of unlabeled data $\mathcal{D}_{pool}$.
A naive approach would be to determine which unlabeled video from $\mathcal{D}_{pool}$ has the largest amount of uncertainty and ask an expert to annotate this video completely. Given a video, we compute the uncertainty for each frame and aggregate these uncertainties using either their mean ($a_{\mathrm{mean}}$) or their maximum ($a_{\max}$), where $a_{\mathrm{mean}}$ would favor videos with a high overall uncertainty and $a_{\max}$ would tend to select videos where a single frame exhibits a high uncertainty.
Annotating an entire video is a time-consuming process, which is also difficult to parallelize. Instead it would be preferable to select the most uncertain parts of a video and query an expert to just annotate these.
To accomplish this, we divide each video into 5-minute-long segments. During active learning, we determine the uncertainty of each segment in the same manner as described above for an entire video. We then query only for an annotation of the most uncertain segments according to either the mean ($a_{\mathrm{mean}}$) or the maximum ($a_{\max}$) of the frame-wise uncertainties.
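A minimal sketch of segment-based querying, assuming frame-wise uncertainties are already available per video and that videos are sampled at one frame per second, so that a 5-minute segment spans 300 frames. The function names and the dictionary layout are illustrative assumptions.

```python
def split_into_segments(frame_scores, segment_length=300):
    """Cut a list of frame-wise uncertainties into consecutive segments.

    The last segment may be shorter than `segment_length`.
    """
    return [
        frame_scores[i:i + segment_length]
        for i in range(0, len(frame_scores), segment_length)
    ]


def most_uncertain_segments(videos, num_queries, aggregate=max, segment_length=300):
    """Rank all segments of all videos and return the top `num_queries`.

    videos maps a video id to its list of frame-wise uncertainties.
    Returns (video_id, segment_index) pairs, regardless of the source video.
    """
    candidates = []
    for video_id, frame_scores in videos.items():
        segments = split_into_segments(frame_scores, segment_length)
        for seg_idx, segment in enumerate(segments):
            candidates.append((aggregate(segment), video_id, seg_idx))
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [(video_id, seg_idx) for _, video_id, seg_idx in candidates[:num_queries]]
```

Passing a mean function as `aggregate` instead of `max` switches between the two aggregation methods without changing the rest of the selection.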
This leads to incompletely labeled videos during training, which is a problem as the recurrent nature of the network requires that each sequence be trained from the beginning. To account for this, we slightly modify the cost function. Given the output $\hat{y}_i$ and the correct label $y_i$ of frame $i$ and the cost $C(\hat{y}_i, y_i)$, we now define $C'(\hat{y}_i, y_i) = \alpha_i \, C(\hat{y}_i, y_i)$ with $\alpha_i \in \{0, 1\}$, depending on whether frame $i$ is annotated or not. This causes frames whose label is unknown at this point to be excluded from the overall cost, while still preserving their influence on the predictions of the annotated frames.
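The masked cost can be sketched as follows, assuming cross-entropy as the per-frame cost and `None` as the marker for a frame without annotation. Averaging over the annotated frames only is an additional assumption of this sketch, and the name `masked_cross_entropy` is hypothetical.

```python
import math


def masked_cross_entropy(predictions, labels):
    """Cross-entropy over a sequence that ignores frames without a label.

    predictions[i] is the softmax output for frame i.
    labels[i] is the class index of frame i, or None if it is not annotated.
    """
    total, annotated = 0.0, 0
    for probs, label in zip(predictions, labels):
        if label is None:
            continue  # masking weight of 0: the frame adds nothing to the cost
        total += -math.log(probs[label])  # masking weight of 1
        annotated += 1
    # Normalizing by the number of annotated frames is an assumption.
    return total / annotated if annotated else 0.0
```

The unlabeled frames are still passed through the recurrent network, so they continue to shape the hidden state that the annotated frames are predicted from.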
As previously stated, we evaluate our proposed active learning methods for surgical workflow analysis tasks (instrument presence and phase segmentation) on the Cholec80 dataset. For this, we first divide the dataset in 4 subsets of 20 videos each, as outlined in funke2018temporal . Each video was sampled at a rate of one frame per second and each frame was downsampled to a resolution of pixels.
During evaluation, we proceed in an identical manner for both instrument presence detection and phase segmentation. We begin by dividing the 4 subsets into a training data set (subsets 1-3) and a testing data set (subset 4). We then select the first 6 videos (10%) from the training data set and define the remaining 54 videos as the pool $\mathcal{D}_{pool}$. The 6 videos and their annotations are used to train an initial DBN using the Adam optimizer kingma2014adam. We train for 100 epochs or until the training cost falls below a fixed threshold. After training, we note the performance on the test data, namely F1-score and accuracy, and proceed to select data points from $\mathcal{D}_{pool}$ using one of the acquisition functions in section 2.3, aggregated using either the mean ($a_{\mathrm{mean}}$) or the maximum ($a_{\max}$). New data points are then selected until a further 10% of the training data set has been annotated. We then train the model again from scratch using all the available annotated data. The process is repeated until 60% of the training data set has been annotated, noting the performance on the test data after each training run is completed.
As baseline, we use a fifth acquisition function that selects new data points at random. For each task, the baseline is computed 4 times and the results are averaged.
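The evaluation protocol can be summarized as a loop. The sketch below is schematic: `train_fn` and `score_fn` are placeholders for training a DBN from scratch and for evaluating an aggregated acquisition function, and all names are hypothetical.

```python
import random


def active_learning_rounds(pool_ids, initial_ids, score_fn, train_fn,
                           step_fraction=0.1, stop_fraction=0.6):
    """Alternate between retraining from scratch and querying new labels.

    score_fn(model, item) returns the uncertainty of an unlabeled item.
    train_fn(labeled) returns a model trained on the labeled items.
    """
    total = len(pool_ids) + len(initial_ids)
    step = max(1, round(step_fraction * total))
    target = round(stop_fraction * total)
    labeled, pool = list(initial_ids), list(pool_ids)
    history = []
    while True:
        model = train_fn(labeled)  # retrain from scratch on all labels so far
        history.append((len(labeled), model))
        if len(labeled) >= target or not pool:
            break
        ranked = sorted(pool, key=lambda item: score_fn(model, item), reverse=True)
        queried = set(ranked[:step])
        labeled.extend(ranked[:step])
        pool = [item for item in pool if item not in queried]
    return labeled, history


def make_random_score(seed=0):
    """Baseline acquisition function: a random score for every data point."""
    rng = random.Random(seed)
    return lambda model, item: rng.random()
```

With 60 training videos, 6 of them labeled initially and a step of 10%, the loop trains six models, at 10%, 20%, and so on up to 60% of annotated data.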
4.1 Instrument Presence
For the instrument presence task, the DBN was trained with a learning rate of , a L2-norm based weight decay of and a batch size of 128. Data points in the pool $\mathcal{D}_{pool}$ for this task were essentially every frame in the training data, meaning we did not incorporate any knowledge about the structure of the original videos while querying for new frames. The results of the active learning process with the different acquisition functions in comparison to the baseline can be found in table 1.
Table 1: Results of the active learning process for instrument presence detection. Each row gives the fraction of the training data set that has been annotated.
| 20% | 79% (92%) | 80% (92%) | 79% (92%) | 80% (92%) | 80% (92%) | 80% (92%) | 79% (92%) | 80% (92%) | 79% (92%) |
| 30% | 82% (93%) | 83% (93%) | 80% (92%) | 81% (93%) | 81% (93%) | 82% (93%) | 82% (93%) | 83% (93%) | 80% (92%) |
| 40% | 83% (93%) | 83% (94%) | 81% (93%) | 81% (93%) | 82% (93%) | 83% (94%) | 82% (93%) | 83% (93%) | 81% (93%) |
| 50% | 83% (94%) | 84% (94%) | 82% (93%) | 82% (93%) | 83% (93%) | 83% (94%) | 83% (93%) | 83% (93%) | 82% (93%) |
| 60% | 83% (94%) | 84% (94%) | 83% (93%) | 83% (93%) | 83% (94%) | 84% (94%) | 83% (93%) | 83% (93%) | 83% (93%) |
4.2 Phase Segmentation
For the phase segmentation task, the recurrent DBN was trained with a learning rate of , a L2-norm based weight decay of and a batch size of 128. The definition of a data point in the pool $\mathcal{D}_{pool}$ varied depending on the querying method.
For video-based selection, each video represented a data point, meaning that during the query phases of the active learning process, the videos with the highest uncertainty, according to the acquisition function used, were annotated. The results can be seen in table 2.
For segment-based selection, each video in the training dataset was divided into 5-minute-long segments, each segment being a data point in the pool $\mathcal{D}_{pool}$. During the query phase, the most uncertain segments, according to the acquisition function used, were selected for annotation, regardless of the video from which they originated. The results can be seen in table 3.
Table 2: Results of the active learning process for phase segmentation with video-based selection. Each row gives the fraction of the training data set that has been annotated.
| 20% | 66% (77%) | 68% (77%) | 67% (79%) | 68% (76%) | 71% (80%) | 66% (78%) | 67% (81%) | 65% (73%) | 64% (76%) |
| 30% | 68% (78%) | 76% (85%) | 69% (81%) | 73% (82%) | 72% (83%) | 71% (80%) | 75% (84%) | 70% (81%) | 67% (79%) |
| 40% | 73% (80%) | 79% (88%) | 74% (84%) | 79% (87%) | 75% (82%) | 75% (86%) | 76% (83%) | 78% (87%) | 74% (84%) |
| 50% | 77% (87%) | 78% (86%) | 81% (90%) | 82% (90%) | 77% (85%) | 80% (88%) | 77% (85%) | 78% (88%) | 77% (84%) |
| 60% | 80% (87%) | 80% (87%) | 81% (90%) | 82% (91%) | 79% (89%) | 80% (89%) | 80% (88%) | 80% (90%) | 80% (86%) |
Table 3: Results of the active learning process for phase segmentation with segment-based selection. Each row gives the fraction of the training data set that has been annotated.
| 20% | 62% (78%) | 59% (75%) | 71% (82%) | 64% (76%) | 68% (79%) | 67% (74%) | 63% (74%) | 71% (78%) | 62% (75%) |
| 30% | 70% (83%) | 68% (79%) | 76% (85%) | 74% (85%) | 79% (87%) | 74% (85%) | 73% (87%) | 73% (85%) | 73% (84%) |
| 40% | 75% (89%) | 75% (87%) | 79% (87%) | 81% (88%) | 80% (88%) | 79% (87%) | 72% (80%) | 79% (88%) | 76% (86%) |
| 50% | 80% (89%) | 77% (86%) | 81% (89%) | 81% (90%) | 83% (89%) | 81% (90%) | 78% (86%) | 81% (88%) | 79% (88%) |
| 60% | 81% (91%) | 84% (91%) | 85% (91%) | 80% (91%) | 83% (90%) | 83% (92%) | 81% (91%) | 84% (92%) | 79% (89%) |
As tables 1, 2 and 3 show, the DBN-based acquisition functions for active learning generally outperform a baseline based on randomly selecting the next data points. In the case of the instrument presence task, the methods based on mean aggregation ($a_{\mathrm{mean}}$) seem to outperform their counterparts based on maximum aggregation ($a_{\max}$), indicating that selecting frames on which the uncertainty is spread among multiple classes is the better strategy. Especially the combination of $a_{\mathrm{mean}}$ and the variance-based acquisition function seems to be the method of choice for this task as it consistently achieves the highest performance.
Similarly, in the phase segmentation task using video-based selection, the methods based on mean aggregation ($a_{\mathrm{mean}}$) generally also outperform their counterparts based on maximum aggregation ($a_{\max}$). The variance-based acquisition function also performs well on this task, as does the variation ratio-based method, which actually outperforms the other methods at 50% and 60%.
Interestingly, in the case of the phase segmentation task using segment-based selection, the methods based on maximum aggregation ($a_{\max}$) seem to be preferable. This indicates that segments containing large peaks of uncertainty seem to add more information than segments with a more distributed uncertainty. Here, the variation ratio and the entropy-based methods seem to perform best. Furthermore, it can be noted that the segment-based methods generally produce similar results to the video-based methods with less annotated data, meaning that partially annotating videos seems to be a valid strategy.
Overall it can be noted that the mutual information-based acquisition function, while not providing the best results on any task, seems to perform consistently well on all tasks, indicating that it might be the best choice when examining a new problem.
In this paper, we presented, to the best of our knowledge, the first DBN-based active learning approach for annotating data related to surgical workflow tasks. Our focus, in particular, was on instrument presence detection and workflow analysis. We also presented the first DBN-based active learning approach for video annotation. Furthermore, we showed that our approach for selecting the next data points for annotation outperforms a random baseline and we were able to demonstrate that partially annotating videos is a valid strategy for training CNNs for surgical workflow segmentation.
Even though the results seem promising, we see potential for improving performance. The step size of 10% for selecting data points might not be optimal as it could encourage unnecessary redundancy in the data, as it can be assumed that similar images have a similar uncertainty. Opting for a smaller step size might mitigate this problem. Furthermore, incorporating a form of similarity measure in the acquisition functions might also be appropriate.
- (1) Aksamentov, I., Twinanda, A.P., Mutter, D., Marescaux, J., Padoy, N.: Deep neural networks predict remaining surgery duration from cholecystectomy videos. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 586–593. Springer (2017)
- (2) Bodenstedt, S., Wagner, M., Katić, D., Mietkowski, P., Mayer, B., Kenngott, H., Müller-Stich, B., Dillmann, R., Speidel, S.: Unsupervised temporal context learning using convolutional neural networks for laparoscopic workflow analysis. arXiv preprint arXiv:1702.03684 (2017)
- (3) Chen, W., Feng, J., Lu, J., Zhou, J.: Endo3d: Online workflow analysis for endoscopic surgeries based on 3d cnn and lstm. In: OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, pp. 97–107. Springer (2018)
- (4) Cohn, D.A., Ghahramani, Z., Jordan, M.I.: Active learning with statistical models. Journal of artificial intelligence research 4, 129–145 (1996)
- (5) Deal, S.B., Lendvay, T.S., Haque, M.I., Brand, T., Comstock, B., Warren, J., Alseidi, A.: Crowd-sourced assessment of technical skills: an opportunity for improvement in the assessment of laparoscopic surgical skills. The American Journal of Surgery 211(2), 398–404 (2016)
- (6) Funke, I., Jenke, A., Mees, S.T., Weitz, J., Speidel, S., Bodenstedt, S.: Temporal coherence-based self-supervised learning for laparoscopic workflow analysis. In: OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, p. 85. Springer (2018)
- (7) Gal, Y.: Uncertainty in deep learning. University of Cambridge (2016)
- (8) Gal, Y., Ghahramani, Z.: A theoretically grounded application of dropout in recurrent neural networks. In: Advances in neural information processing systems, pp. 1019–1027 (2016)
- (9) Gal, Y., Islam, R., Ghahramani, Z.: Deep bayesian active learning with image data. In: Proceedings of the 34th International Conference on Machine Learning (ICML-17) (2017)
- (10) Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. ICML (2017)
- (11) Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
- (12) Jin, Y., Dou, Q., Chen, H., Yu, L., Qin, J., Fu, C.W., Heng, P.A.: Sv-rcnet: Workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging 37(5), 1114–1126 (2018)
- (13) Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- (14) Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: F. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger (eds.) Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)
- (15) Lalys, F., Jannin, P.: Surgical process modelling: a review. International journal of computer assisted radiology and surgery 9(3), 495–511 (2014)
- (16) Maier-Hein, L., Kondermann, D., Roß, T., Mersmann, S., Heim, E., Bodenstedt, S., Kenngott, H.G., Sanchez, A., Wagner, M., Preukschas, A., et al.: Crowdtruth validation: a new paradigm for validating algorithms that rely on image correspondences. International journal of computer assisted radiology and surgery 10(8), 1201–1212 (2015)
- (17) Maier-Hein, L., Mersmann, S., Kondermann, D., Bodenstedt, S., Sanchez, A., Stock, C., Kenngott, H.G., Eisenmann, M., Speidel, S.: Can masses of non-experts train highly accurate image classifiers? In: International conference on medical image computing and computer-assisted intervention, pp. 438–445. Springer (2014)
- (18) Maier-Hein, L., Ross, T., Gröhl, J., Glocker, B., Bodenstedt, S., Stock, C., Heim, E., Götz, M., Wirkert, S., Kenngott, H., et al.: Crowd-algorithm collaboration for large-scale endoscopic image annotation with confidence. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 616–623. Springer (2016)
- (19) Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3D Vision (3DV), 2016 Fourth International Conference on, pp. 565–571. IEEE (2016)
- (20) Ross, T., Zimmerer, D., Vemuri, A., Isensee, F., Wiesenfarth, M., Bodenstedt, S., Both, F., Kessler, P., Wagner, M., Müller, B., et al.: Exploiting the potential of unlabeled endoscopic video data with self-supervised learning. International journal of computer assisted radiology and surgery pp. 1–9 (2018)
- (21) Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N.: Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE transactions on medical imaging 36(1), 86–97 (2017)
- (22) Yengera, G., Mutter, D., Marescaux, J., Padoy, N.: Less is more: Surgical phase recognition with less annotations through self-supervised pre-training of cnn-lstm networks. arXiv preprint arXiv:1805.08569 (2018)
- (23) Zisimopoulos, O., Flouty, E., Luengo, I., Giataganas, P., Nehme, J., Chow, A., Stoyanov, D.: Deepphase: Surgical phase recognition in cataracts videos. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 265–272. Springer (2018)