Time in the operating room (OR) and the time of the operating staff are cost-intensive hospital resources and have to be allocated precisely. Planning the usage of the OR cannot be static, as a procedure that takes longer than previously estimated can cause the following surgeries to be pushed back or even canceled, thereby inconveniencing both the patient and the surgical team. On the other hand, if a procedure finishes early, the OR stays unused, incurring unnecessary idle time for the surgical personnel. To prevent this from occurring, OR schedulers need to dynamically update timetables. This is complicated by the unpredictability of the surgical workflow, which makes it difficult to estimate the duration of procedures beforehand. Therefore, the OR schedulers have to be constantly kept informed of the progress of ongoing interventions. For this, they have to periodically inquire into the status of interventions, resulting in highly subjective predictions and avoidable interruptions of procedures.
Integrated ORs are becoming more prevalent in hospitals, making it possible to access data streams from surgical devices such as cameras, insufflators, lights, etc. during interventions. Such data streams can provide information that enables context-aware assistance, such as automatically and continuously predicting the progress of an ongoing intervention. The endoscopic video stream in particular, via which laparoscopic interventions are performed, contains a large supply of information. Workflow analysis methods can be used to segment interventions into surgical phases. Tool usage is often employed to determine the current surgical phase Blum2010 ; Dergachyova2016 ; Katic2014 ; Padoy2012632 , which provides an indicator of the progress of surgery. Convolutional neural networks (CNNs) have also been used to determine the surgical phase directly from the endoscopic video stream Bodenstedt2017 ; Lea2016 ; TwinandaSMMMP16 .
Surgical phase detection methods can be used to approximate the duration of surgical procedures, but these methods generally require a sufficient number of labeled examples as training input. Furthermore, since phase models are generally specific to a certain type of intervention, multiple detectors would need to be trained. Therefore, using a phase-based method as a general solution to determine the remaining duration of surgeries would require an infeasibly large amount of labeled training data. In Guedon2016 , the authors propose a system that determines the remaining time of surgery during laparoscopic cholecystectomies without surgical phases, directly from the usage of the electrosurgical device. A recurrent CNN for predicting the remaining time of surgery directly from endoscopic video of cholecystectomies is presented in aksamentov2017deep .
In this paper, we propose and evaluate three methods, based on recurrent CNNs, for directly predicting and refining the duration of laparoscopic interventions. These methods do not require labeled training data and function for different types of laparoscopic procedures. The first method uses surgical device data collected from the OR as input, the second endoscopic image data. The third method is the combination of the previous methods. The evaluation of the methods is performed on a dataset containing 80 laparoscopic surgeries of varying types and on the Cholec80 dataset TwinandaSMMMP16 . To our knowledge, our approach is the first method to predict the duration of laparoscopic interventions based on a combination of unlabeled vision and surgical device data.
A requirement for predicting the remaining duration of laparoscopic procedures is information regarding the current state of the surgery. As the endoscopic video stream serves as the basis for the actions of the surgeon, we assume that it contains sufficient information on the state of the procedure, though extracting quantitative information is challenging due to the high dimensionality of the data stream. Here, we propose a recurrent CNN that predicts the duration of surgery from an endoscopic video stream. Furthermore, we propose a variation of this recurrent CNN that explicitly performs tool presence detection and utilizes this information to further enhance its predictions.
On the other hand, integrated ORs are starting to become more prevalent in hospitals. These ORs make it possible to access data streams, in the form of time series, from other surgical devices. These time series are, in contrast to the video stream, more compact and easier to quantify, but contain a smaller amount of information. We hypothesize that both streams, video and device data, contain complementary information, meaning that combining the two should increase prediction accuracy. To evaluate this claim, we propose a fusion of the two streams into one recurrent CNN (figure 1).
2.1 V-Net: Vision based estimation
The topology of our recurrent CNN for predicting the duration of surgery from the endoscopic video, V-Net, is identical to the network for phase segmentation proposed in Bodenstedt2017 , which consists of an AlexNet-style network Alexnet (image features in figure 1) combined with a gated recurrent unit (GRU) ChoMBB14 to incorporate temporal information. Only the final layer differs: we use a fully connected layer with a single output and a sigmoid activation function. We pretrain the layers preceding the GRU with the method proposed in Bodenstedt2017 . As input for V-Net, we sample the videos at a rate of one frame per second and downsample each image to a resolution of pixels. This reduces data size and training time.
Furthermore, we provide V-Net with information on the type of laparoscopic surgery being performed (procedure type in figure 1). For this, we categorize our dataset into 5 general types of laparoscopic surgery (table 1). We assign each frame a number between 0 and 1 as its label, i.e. the label of frame $t$ from a video consisting of $T$ frames is $t/T$. During inference, we can then directly compute the duration prediction $T_{pred}$: $T_{pred} = t / p_t$, where $p_t$ is the predicted progress of the procedure.
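The labeling scheme and the inference step can be illustrated with a minimal sketch. It assumes 1-based frame indices and, because videos are sampled at one frame per second, treats the frame index as the elapsed time in seconds. The function names and the `eps` guard against a near-zero network output are illustrative assumptions, not part of the described method.

```python
def progress_labels(num_frames):
    """Label frame t (1-based) of a video with num_frames frames as t / num_frames."""
    return [t / num_frames for t in range(1, num_frames + 1)]


def predict_duration(elapsed_seconds, predicted_progress, eps=1e-6):
    """Extrapolate the total duration from elapsed time and predicted progress.

    eps is a hypothetical safeguard against division by zero when the
    network output is close to 0; it is not taken from the paper.
    """
    return elapsed_seconds / max(predicted_progress, eps)


# A four-frame toy video: progress rises linearly to 1.
labels = progress_labels(4)

# 30 minutes elapsed at a predicted progress of 25% extrapolates to 2 hours.
estimate = predict_duration(elapsed_seconds=1800, predicted_progress=0.25)
```

Because the label depends only on the frame index and the video length, no manual annotation is needed to generate training targets.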
|ID||Surgery type||Samples in dataset|
|2||Upper Gastrointestinal and Bariatric||11|
Since the layers preceding the GRU are pretrained, we only optimize the weights of the newly added layers. For this, we use Adam kingma2014adam with a learning rate of and a sigmoid cross-entropy loss. The network is trained for 50 iterations.
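As an illustration of the loss, the following sketch evaluates a sigmoid cross-entropy for a single frame with a progress label in [0, 1]. It uses the numerically stable formulation on the pre-activation value (logit) that is common in deep learning libraries; whether the original implementation uses exactly this formulation is an assumption, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, evaluated in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_cross_entropy(logit, target):
    """Cross-entropy between sigmoid(logit) and a soft target in [0, 1].

    Equivalent to -target*log(s) - (1-target)*log(1-s) with s = sigmoid(logit),
    rewritten to avoid overflow for large |logit|.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# Progress label 0.3 (frame 30 of 100) against a network output of 0.5.
loss = sigmoid_cross_entropy(logit=0.0, target=0.3)
```

For a soft target, this loss is smallest when the sigmoid output equals the label, which is why it can be used for a progress value rather than only for binary classes.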
2.2 VI-Net: Vision and tool presence based estimation
Seeing that surgical tool presence in the endoscopic video is an important cue for progress of surgery TwinandaSMMMP16 , we hypothesize that explicitly including information on the types of tools currently visible would increase the performance of predicting procedure duration.
To identify the tools currently in use in the endoscopic video stream, we modify a pretrained ResNet-152 he2016deep for tool presence detection. Here, we replace the pretrained fully connected output layer with another fully connected layer, consisting of 12 nodes (one per tool type) with sigmoid nonlinearities. The weights of the conv5_x layers and of the new fully connected layer are fine-tuned using a dataset of 24 colorectal laparoscopies in which the presence of 12 possible surgical tools has been labeled in one frame per minute. Adam kingma2014adam with a learning rate of and binary cross-entropy are used.
To utilize the tool presence information for predicting duration of surgery, we extend the architecture of V-Net into VI-Net, to additionally accept information on tool presence (tool presence in figure 1). VI-Net is trained in the same manner as V-Net.
2.3 VID-Net: incorporating surgical device data
To incorporate the data stream provided by surgical devices, we extend VI-Net with a further input (device data in figure 1). The input consists of 14 values, each representing a different signal from a surgical device (a list of the devices and signals used can be found in table 2). Analogous to the video stream, we select one second as the size of each time step. For signals with a data rate higher than 1 Hz, we discard all values except the most recent one. For signals with a lower data rate, we use the most recent value, even if it is older than one second.
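The alignment of a device signal to one-second time steps can be sketched as a last-value-hold resampling. The sketch below assumes each signal arrives as a time-sorted list of `(timestamp_seconds, value)` pairs; the function name and the `initial` value used before the first sample arrives are hypothetical choices that the text does not specify.

```python
def resample_last_value(samples, duration_seconds, initial=0.0):
    """Return one value per one-second step using the most recent sample.

    samples: list of (timestamp_seconds, value), sorted by timestamp.
    initial: hypothetical placeholder used until the first sample arrives.
    """
    resampled = []
    last = initial
    i = 0
    for step in range(1, duration_seconds + 1):
        # Consume every sample up to the end of this step; only the newest survives.
        while i < len(samples) and samples[i][0] <= step:
            last = samples[i][1]
            i += 1
        resampled.append(last)
    return resampled


# Faster than 1 Hz: older values within a step are discarded.
fast = resample_last_value([(0.2, 1), (0.6, 2), (1.0, 3), (1.5, 4)], 3)

# Slower than 1 Hz: the most recent value is repeated until a new one arrives.
slow = resample_last_value([(0.5, 10), (2.5, 20)], 4)
```

Both cases reduce to the same rule, so a single routine covers signals of any data rate.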
VID-Net is trained in the same manner as the previous two networks.
|Device||Signal|
|Insufflator||Current gas flow rate|
|Insufflator||Target gas flow rate|
|Insufflator||Current gas pressure|
|Insufflator||Target gas pressure|
|Insufflator||Used gas volume|
|Insufflator||Gas supply pressure|
|OR lights||All lights off?|
|OR lights||Intensity light 1|
|OR lights||Intensity light 2|
|Endoscopic light source||Intensity|
|Endoscopic camera||White balance|
3 Experiments and results
The basis of our evaluation is a dataset, containing recordings of 80 laparoscopic surgeries of different procedure types (MultiType). For each surgery, the dataset contains the endoscopic video stream and data collected from different surgical devices as listed in table 2. The procedures were all recorded in the same OR using the integrated operating room system OR1™ (Karl Storz GmbH & Co KG, Tuttlingen, Germany). The average procedure length in the dataset is 105.6 minutes. The datasets used for pretraining V-Net and training the ResNet-152 for tool presence detection do not overlap with this dataset.
To evaluate the three proposed methods, we divide the dataset into four sets of equal size and perform four leave-one-set-out evaluations for each method. While dividing the dataset, we ensure that the different types of surgery are distributed evenly across the sets.
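One possible way to obtain such a balanced division is sketched below: procedures are grouped by type, shuffled, and dealt out to the sets in round-robin order. This is an illustrative assumption about how the split could be built, not a description of the procedure actually used; the function names, the seed, and the per-type counts in the example are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(procedure_types, num_folds=4, seed=0):
    """Split procedure indices into folds with similar type distributions.

    procedure_types[i] is the type ID of procedure i.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for idx, ptype in enumerate(procedure_types):
        by_type[ptype].append(idx)

    folds = [[] for _ in range(num_folds)]
    position = 0  # running counter keeps the fold sizes equal across types
    for ptype in sorted(by_type):
        members = by_type[ptype]
        rng.shuffle(members)
        for idx in members:
            folds[position % num_folds].append(idx)
            position += 1
    return folds


def leave_one_set_out(folds):
    """Yield (train_indices, test_indices) with each fold held out once."""
    for held_out in range(len(folds)):
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, folds[held_out]


# Hypothetical type counts for 80 procedures in 4 categories.
types = [1] * 40 + [2] * 11 + [3] * 20 + [4] * 9
folds = stratified_folds(types)
splits = list(leave_one_set_out(folds))
```

With 80 procedures and four folds, every evaluation trains on 60 procedures and tests on the remaining 20.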
During testing, we compute the duration prediction $T_{pred}$ at each frame $t$: $T_{pred} = t / p_t$, where $p_t$ is the predicted progress of the procedure. With $T_{pred}$, we can compute the absolute duration prediction error in seconds, $|T - T_{pred}|$, and the duration prediction error relative to the length $T$ of each procedure, $|T - T_{pred}| / T$. The relative error gives a more appropriate impression of how well each method can predict procedure duration. For each method, we provide both the absolute and the relative average duration prediction error (see tables (a) and (b)). To measure how the error progresses during the course of a procedure, we compute the average error during each quarter of a given procedure (Q1-Q4). As a naive baseline, we provide the duration prediction error that would occur if the average procedure duration over the training data were used as the value for $T_{pred}$. We also provide a type-based baseline, where we instead compute the average procedure duration separately for each procedure category in table 1.
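The evaluation measures can be summarized in a short sketch. It assumes one prediction per second, so that the number of predictions equals the procedure length in seconds, and it splits each procedure into four contiguous quarters by frame index. The function names and the `eps` guard are illustrative assumptions.

```python
def duration_errors(predicted_progress, eps=1e-6):
    """Per-frame (absolute, relative) duration errors for one procedure.

    predicted_progress[t - 1] is the predicted progress at second t.
    """
    total = len(predicted_progress)  # true procedure length in seconds
    errors = []
    for t, progress in enumerate(predicted_progress, start=1):
        predicted_total = t / max(progress, eps)
        abs_err = abs(total - predicted_total)
        errors.append((abs_err, abs_err / total))
    return errors


def quarter_means(values):
    """Average of values over each quarter (Q1-Q4); needs at least 4 values."""
    n = len(values)
    bounds = [k * n // 4 for k in range(5)]
    return [sum(values[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]


def naive_baseline_error(train_durations, test_duration):
    """Error when the mean training duration is used as the prediction."""
    mean_duration = sum(train_durations) / len(train_durations)
    abs_err = abs(test_duration - mean_duration)
    return abs_err, abs_err / test_duration


# A toy procedure of 4 seconds with a constant predicted progress of 0.5.
errors = duration_errors([0.5, 0.5, 0.5, 0.5])
relative_by_quarter = quarter_means([rel for _, rel in errors])
```

The type-based baseline follows from `naive_baseline_error` by passing only the training durations of the same procedure category.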
For further evaluation, V-Net and VI-Net are applied to the publicly available Cholec80 dataset TwinandaSMMMP16 , which consists of 80 videos of laparoscopic cholecystectomies. We divide the dataset into four sets of equal size and similar average procedure length and perform four leave-one-set-out evaluations for each method. The absolute and the relative average duration prediction errors can be found in tables (a) and (b). Only one baseline is available, as the dataset consists of a single procedure type. The dataset contains no surgical device data, meaning VID-Net cannot be used.
The results on MultiType show that all three of the presented methods outperform the baseline methods. Furthermore, VID-Net provides more accurate results than the other two networks, demonstrating that surgical device data does indeed contain complementary information. Table 5 shows that VID-Net performs consistently on the first three procedure categories. The general laparoscopic category includes diagnostic laparoscopies, which are significantly shorter (min) than the average procedure in the dataset, explaining the drop in performance on this category. The last category contains only singular cases, which differ from the other categories, making predictions difficult.
On Cholec80, both methods outperform the baseline, while VI-Net also outperformed V-Net. A direct comparison with the results in aksamentov2017deep is not possible, as the authors use a private dataset for testing.
In this paper, we presented, to our knowledge, the first approach for online procedure duration prediction using unlabeled endoscopic video data and surgical device data in a laparoscopic setting. On MultiType, VID-Net showed an overall average prediction error of 37% and a halftime error of about 28%, which are lower than the results from the baselines. Furthermore, we showed that a method incorporating both vision and device data performs better than methods based only on vision. The evaluation showed that the presented methods currently produce larger than average errors on procedures with short length (shorter than 15 min) and on singular cases. We assume this is due to a lack of training data as our dataset comprises mostly longer operations.
On Cholec80, VI-Net achieved an average prediction error as well as a halftime error of 23%, outperforming the baseline. This difference in performance on the two datasets can be attributed to the fact that MultiType contains multiple procedure types and also has a higher variance in procedure duration. The fact that the proposed methods perform significantly better on a simple operation such as cholecystectomy than on the more complex and diverse operations in the MultiType dataset leads us to conclude that the proposed methods themselves are sound, but that more training data is required for more complex cases.
As our results indicate that combining vision and device data provides more information on the progress of surgery, we assume that data available to the anesthetist, such as heart rate, blood pressure and drug doses, would provide even more valuable insights.
- (1) Aksamentov, I., Twinanda, A.P., Mutter, D., Marescaux, J., Padoy, N.: Deep neural networks predict remaining surgery duration from cholecystectomy videos. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 586–593. Springer (2017)
- (2) Blum, T., Feußner, H., Navab, N.: Modeling and segmentation of surgical workflow from laparoscopic video. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 400–407. Springer (2010)
- (3) Bodenstedt, S., Wagner, M., Katić, D., Mietkowski, P., Mayer, B., Kenngott, H., Müller-Stich, B., Dillmann, R., Speidel, S.: Unsupervised temporal context learning using convolutional neural networks for laparoscopic workflow analysis. ArXiv e-prints (2017)
- (4) Cho, K., van Merrienboer, B., Bahdanau, D., Bengio, Y.: On the properties of neural machine translation: Encoder–decoder approaches. In: Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111. Association for Computational Linguistics (2014). DOI 10.3115/v1/W14-4012. URL http://aclweb.org/anthology/W14-4012
- (5) Dergachyova, O., Bouget, D., Huaulmé, A., Morandi, X., Jannin, P.: Automatic data-driven real-time segmentation and recognition of surgical workflow. International journal of computer assisted radiology and surgery 11(6), 1081–1089 (2016)
- (6) Guédon, A.C.P., Paalvast, M., Meeuwsen, F.C., Tax, D.M.J., van Dijke, A.P., Wauben, L.S.G.L., van der Elst, M., Dankelman, J., van den Dobbelsteen, J.J.: ‘it is time to prepare the next patient’ real-time prediction of procedure duration in laparoscopic cholecystectomies. Journal of Medical Systems 40(12), 271 (2016). DOI 10.1007/s10916-016-0631-1
- (7) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
- (8) Katić, D., Wekerle, A.L., Gärtner, F., Kenngott, H., Müller-Stich, B.P., Dillmann, R., Speidel, S.: Knowledge-driven formalization of laparoscopic surgeries for rule-based intraoperative context-aware assistance. In: International Conference on Information Processing in Computer-Assisted Interventions, pp. 158–167. Springer (2014)
- (9) Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- (10) Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)
- (11) Lea, C., Choi, J.H., Reiter, A., Hager, G.: Surgical phase recognition: From instrumented ors to hospitals around the world. M2CAI 2016 (2016)
- (12) Padoy, N., Blum, T., Ahmadi, S.A., Feussner, H., Berger, M.O., Navab, N.: Statistical modeling and recognition of surgical workflow. Medical image analysis 16(3), 632–641 (2012)
- (13) Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., de Mathelin, M., Padoy, N.: Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE transactions on medical imaging 36(1), 86–97 (2017)