Comparative evaluation of instrument segmentation and tracking methods in minimally invasive surgery

05/07/2018 · Sebastian Bodenstedt, et al.

Intraoperative segmentation and tracking of minimally invasive instruments is a prerequisite for computer- and robotic-assisted surgery. Since additional hardware like tracking systems or the robot encoders are cumbersome and lack accuracy, surgical vision is evolving as a promising technique for segmenting and tracking the instruments using only the endoscopic images. However, what is missing so far are common image data sets for consistent evaluation and benchmarking of algorithms against each other. This paper presents a comparative validation study of different vision-based methods for instrument segmentation and tracking in the context of robotic as well as conventional laparoscopic surgery. The contribution of the paper is twofold: we introduce a comprehensive validation data set that was provided to the study participants and present the results of the comparative validation study. Based on the results of the validation study, we arrive at the conclusion that modern deep learning approaches outperform other methods in instrument segmentation tasks, but the results are still not perfect. Furthermore, we show that merging results from different methods significantly increases accuracy in comparison to the best stand-alone method. On the other hand, the results of the instrument tracking task show that this is still an open challenge, especially during challenging scenarios in conventional laparoscopic surgery.


1 Introduction

Minimally invasive surgery using cameras to observe the internal anatomy is the preferred approach for many surgical procedures. This technique reduces the operative trauma, speeds recovery and shortens hospitalization. However, such operations are highly complex and the surgeon must deal with difficult hand-eye coordination, restricted mobility and a narrow field of view Bernhardt201766 . Surgeons' capabilities can be enhanced with computer- and robotic-assisted surgical systems cleary2010image . Such systems provide additional patient-specific information during surgery, e.g. by visualizing hidden risk and target structures based on preoperative planning data. Intraoperative localization of minimally invasive instruments is a prerequisite for such systems. The pose of the instrument is crucial for, e.g., measuring the distance to risk structures Bernhardt201766 , automation of surgical skills chen2016virtual or assessing the skill level of a surgeon vedula2016objective . Since additional hardware like tracking systems, instrument markers or the robot encoders are cumbersome and lack accuracy, surgical vision is evolving as a promising technique to localize the instruments using solely the endoscopic images. Image-based localization can be split into segmentation and tracking of the instruments in the endoscopic view.

Image-based instrument segmentation and tracking have received increased attention in different minimally invasive scenarios. A recent paper by Bouget et al. provides an in-depth review of different instrument detection and tracking algorithms bouget2017 . However, what is missing so far are common image data sets for consistent evaluation and benchmarking of algorithms against each other.

In this paper, we present a comparative validation of different vision-based state-of-the-art methods for instrument segmentation as well as tracking in the context of minimally invasive surgery (figure 1). The data is based on the sub-challenge Instrument segmentation and tracking (https://endovissub-instrument.grand-challenge.org/), part of the Endoscopic Vision Challenge (http://endovis.grand-challenge.org) at the international conference on Medical Image Computing and Computer Assisted Intervention.

The contribution of the paper is twofold: we introduce a comprehensive validation data set that was provided to the study participants and present the results of the comparative validation.

Two important surgical application scenarios were identified: robotic as well as conventional laparoscopic surgery. Both scenarios pose different challenges with respect to the instruments: articulated instruments in robotic surgery and rigid instruments in conventional laparoscopic surgery. Corresponding validation data was generated for both scenarios and consisted of endoscopic ex-vivo images with articulated robotic instruments as well as in-vivo images with rigid laparoscopic instruments. The data was split into training and test data for the segmentation as well as the tracking task. All data used in this paper is publicly available at http://open-cas.org/?q=node/31.

Figure 1: The two tasks during the Instrument segmentation and tracking sub-challenge.

The paper is organized as follows. Sections 2 and 3 briefly review the segmentation and tracking methods that participated in the study. Section 4 describes the validation data sets, followed by the comparative validation of both tasks in sections 5 and 6. We present the study results in section 7 and discuss our findings in section 8. Finally, section 9 provides a conclusion.

2 Instrument segmentation methods

This section briefly reviews the basic working principle of the different segmentation methods that participated in the comparison study. Most of the approaches investigated were based either on random forests (RF) (2.3, 2.6) or on convolutional neural network (CNN) techniques (2.1, 2.2, 2.4, 2.5), with the exception of 2.7.

2.1 Seg-Jhu

The approach pakhomov2017deep is based on the CNN described in xie2015holistically . The architecture, which was originally designed for edge detection problems, is similar to FCN-8s (fully convolutional network) as described in Long_2015_CVPR and is adapted for the task of binary segmentation. A deep supervision approach is used for training. Based on experiments, the proposed architecture shows comparable or better performance on surgical instrument segmentation across all data sets compared to the FCN-8s network Long_2015_CVPR .

2.2 Seg-Kit-Cnn

The first approach from Karlsruhe Institute of Technology (KIT) uses an FCN Long_2015_CVPR for semantic segmentation of the instruments and background. This type of neural network consists of two parts: an encoder, which is trained on an image classification task, and a decoder, which upsamples the result to the required original resolution. The encoder is initialized with VGG-16 Simonyan2014 ; the decoder is a single layer of two upsampling filters, one for each of the two classes to be recognized. The FCN is trained end-to-end with the mean softmax cross entropy between logits and labels as training objective. Each weight is regularized with an L2 loss.
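
To make the encoder-decoder structure concrete, the following is a minimal PyTorch sketch of such a binary FCN, assuming a VGG-16 feature encoder and a single transposed-convolution decoder with two output channels; the layer sizes, learning rate and weight decay are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a binary FCN: VGG-16 encoder plus one upsampling
# (transposed convolution) decoder layer with two filters, one per class.
# Layer sizes and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class BinaryFCN(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: convolutional part of VGG-16 (in practice initialized
        # with ImageNet-pretrained weights).
        self.encoder = vgg16(weights=None).features
        # Decoder: one transposed convolution with two filters that
        # upsamples the 1/32-resolution features back to the input size.
        self.decoder = nn.ConvTranspose2d(512, 2, kernel_size=64,
                                          stride=32, padding=16, bias=False)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = BinaryFCN()
criterion = nn.CrossEntropyLoss()                 # mean softmax cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            weight_decay=1e-4)    # L2 regularization

images = torch.randn(2, 3, 224, 224)              # dummy batch
labels = torch.randint(0, 2, (2, 224, 224))       # per-pixel class labels
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```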

2.3 Seg-Kit-Rf

The second approach from KIT segments the instruments in the endoscopic images based on a feature vector for each pixel consisting of values from multiple color spaces, such as HSV, RGB, LAB and Opponent, and gradient information Bodenstedt2016a . Using the supplied masks, an RF classifier is trained with OpenCV to distinguish instrument pixels from background pixels. For real-time online segmentation, a GPU-based RF is used. After the classification step, contours are located and fused if they are in close proximity to one another. Contours whose size exceeds a certain threshold are then returned as instrument candidates.
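
A rough sketch of such a per-pixel colour/gradient feature vector and random forest classifier is shown below; it uses scikit-learn's RandomForestClassifier in place of the OpenCV (CPU/GPU) implementations mentioned in the text, and the file names, exact feature set, forest parameters and contour-area threshold are hypothetical.

```python
# Sketch of per-pixel colour-space features plus a random-forest classifier
# in the spirit of SEG-KIT-RF (scikit-learn stands in for OpenCV's RF).
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pixel_features(bgr):
    """Stack colour values from several colour spaces plus gradient magnitude."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    grad = cv2.magnitude(gx, gy)
    feats = np.dstack([bgr, hsv, lab, grad]).astype(np.float32)
    return feats.reshape(-1, feats.shape[-1])

image = cv2.imread("frame.png")            # hypothetical training frame
mask = cv2.imread("mask.png", 0) > 0       # reference instrument mask

X = pixel_features(image)
y = mask.reshape(-1).astype(np.uint8)

clf = RandomForestClassifier(n_estimators=50, max_depth=15, n_jobs=-1)
clf.fit(X, y)

# At test time (shown on the same frame only for illustration): classify every
# pixel, then keep sufficiently large contours as instrument candidates.
pred = clf.predict(pixel_features(image)).reshape(mask.shape).astype(np.uint8)
contours, _ = cv2.findContours(pred * 255, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
candidates = [c for c in contours if cv2.contourArea(c) > 500]
```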

2.4 Seg-Ub

The method developed by the University of Bern (UB) uses a CNN for classifying pixels as instrument or background. An AlexNet NIPS2012_4824 CNN architecture is used with a two-neuron output in the last fully connected layer. During training, real-time data augmentation is applied: the patches are randomly rotated, scaled, mirrored and illumination adjusted. Every training batch contained 70% background patches and 30% instrument patches. As the computational requirements of CNNs are relatively large, only a raster of pixels with a stride of 7 pixels in both image axes is used. After classification, the pixels are grouped using DBSCAN clustering Ester96 . Using the information from the perimeter of the clustered regions, an alpha shape edelsbrunner1983 is created. The region that is enclosed by the alpha shape is used as the final segmentation result. A combination of the Caffe Deep Learning Framework jia2014caffe and Matlab was used to implement the described method.
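
The sketch below illustrates the patch-raster classification and DBSCAN grouping; the patch classifier is stubbed out, the patch size, raster origin and DBSCAN parameters are assumptions, and a convex hull stands in for the alpha shape used by the authors.

```python
# Sketch of the SEG-UB-style patch raster and DBSCAN grouping (illustrative
# parameters; the AlexNet classifier is replaced by a dummy decision rule and
# the alpha shape by a convex hull).
import numpy as np
import cv2
from sklearn.cluster import DBSCAN

def classify_patch(patch):
    # Placeholder for the patch classifier (instrument vs. background).
    return patch.mean() > 100  # dummy decision rule

image = cv2.imread("frame.png", 0)        # hypothetical grayscale frame
stride, half = 7, 16                      # stride of 7 px, 32x32 patches (assumed)
points = []
for y in range(half, image.shape[0] - half, stride):
    for x in range(half, image.shape[1] - half, stride):
        patch = image[y - half:y + half, x - half:x + half]
        if classify_patch(patch):
            points.append((x, y))
points = np.array(points)

mask = np.zeros(image.shape, np.uint8)
if len(points):
    # Group instrument pixels with DBSCAN and build one region per cluster.
    labels = DBSCAN(eps=15, min_samples=5).fit_predict(points)
    for k in set(labels) - {-1}:
        hull = cv2.convexHull(points[labels == k].astype(np.int32))
        cv2.fillConvexPoly(mask, hull, 255)
```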

2.5 Seg-Ucl-Cnn

The first approach from University College London (UCL) is a real-time segmentation method that combines deep learning and optical flow tracking Herrera17 . Fully convolutional networks (FCNs) are able to produce accurate segmentations of deformable surgical instruments. However, even with state-of-the-art hardware, the inference time of FCNs exceeds the frame rate of surgical endoscopic videos. SEG-UCL-CNN leverages the fact that optical flow can track displacements of surgical tools at high speed to propagate FCN segmentations in real time. A parallel pipeline is used in which an FCN runs asynchronously, segmenting only a small proportion of the frames. New frames are used as input for the FCN inference process only when the neural network is idle. The output of the FCN is stored in synchronized memory. In parallel, for every frame of the video, previously detected keypoints from stored frames are matched to detected keypoints in the current to-be-segmented frame and used to estimate an affine transformation with RANSAC. The segmentation mask of the stored frame is warped with the estimated affine transformation to produce the final segmentation for the frame with less than one inter-frame delay.
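
A simplified sketch of the mask-propagation step is given below, assuming ORB keypoints, brute-force matching and a RANSAC-estimated affine transform; the original method's exact detector and the FCN itself are not reproduced here.

```python
# Sketch of propagating a stored FCN segmentation to the current frame via
# keypoint matching and a RANSAC-estimated affine transform.
import cv2
import numpy as np

orb = cv2.ORB_create(1000)

def propagate_mask(stored_frame, stored_mask, current_frame):
    kp1, des1 = orb.detectAndCompute(stored_frame, None)
    kp2, des2 = orb.detectAndCompute(current_frame, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Affine transform between the stored and the current frame, made robust
    # to outlier matches via RANSAC.
    M, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)

    # Warp the stored FCN mask into the current frame.
    h, w = current_frame.shape[:2]
    return cv2.warpAffine(stored_mask, M, (w, h), flags=cv2.INTER_NEAREST)
```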

2.6 Seg-Ucl-Rf

The second approach of Allan et al. from UCL is based on an RF classification Allan2013a . They use variable importance to select four color features, hue, saturation and opponent 1 and 2, which the experiments demonstrated provide the most discriminative power. The RF is trained on these features to classify the pixels in the test set with no post-processing stages. In all of the experiments, the OpenCV CPU RF implementation was used.

2.7 Seg-Uga

The method by Agustinos et al. from the Université Grenoble Alpes (UGA) uses color and shape information Agustinos2016 to segment the instruments. Based on the CIELab color space, a grayscale image composed of the a and b channels, corresponding to the chromaticity, is computed. Afterwards, a post-processing step consisting of binarization with automatic Otsu thresholding and a skeletonization using a simple distance transform felzenszwalb2004 , followed by an erosion, is performed. A contour detection algorithm suzuki1985 is then used to extract the extreme outer contour of each region as an oriented bounding box. Bounding boxes that do not satisfy a specific shape constraint (width/length ratio of at least 2) are eliminated. For each candidate, a Frangi filter frangi1998 and a Hough transform are used to highlight the instrument edges inside the box. False candidates are eliminated based on the relative orientation and position of the detected lines.
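
The following OpenCV sketch outlines the colour-and-shape filtering stage: a chromaticity image from the CIELab a/b channels, Otsu binarization, erosion, contour extraction and the width/length shape test. How the a and b channels are combined, as well as all thresholds, are assumptions, and the distance-transform skeletonization and the Frangi/Hough line verification are omitted.

```python
# Rough sketch of the SEG-UGA colour-and-shape filtering (illustrative
# thresholds; skeletonization and Frangi/Hough verification omitted).
import cv2
import numpy as np

bgr = cv2.imread("frame.png")              # hypothetical input frame
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
a = lab[..., 1].astype(np.float32)
b = lab[..., 2].astype(np.float32)

# Grayscale chromaticity image built from the a and b channels (one possible
# combination; the exact formula used by the authors is not specified here).
chroma = cv2.normalize(cv2.magnitude(a - 128, b - 128), None, 0, 255,
                       cv2.NORM_MINMAX).astype(np.uint8)

_, binary = cv2.threshold(chroma, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
binary = cv2.erode(binary, np.ones((3, 3), np.uint8))

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
candidates = []
for c in contours:
    (cx, cy), (w, h), angle = cv2.minAreaRect(c)
    if min(w, h) == 0:
        continue
    # Keep only elongated regions (length at least twice the width).
    if max(w, h) / min(w, h) >= 2.0:
        candidates.append(cv2.minAreaRect(c))
```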

3 Instrument tracking methods

Figure 2: Validation data set examples. (a) Robotic instruments. (b) Conventional instruments. (c)-(f): Challenges in the conventional data set: overlapping instruments (c), smoke (d), bleeding (e), mesh (f).

This section briefly reviews the basic working principle of the different tracking methods that participated in the comparison study. The following approaches are based on the segmentation methods that were described in the previous section.

3.1 Tra-Kit

The approach from KIT utilizes the segmentation method outlined in section 2.3 to initialize the tracking. For each detected contour, a principal component analysis is performed to find the axis and center of the tool. The furthest point from the image border, which lies on both the axis and the contour, is used as instrument tip. Furthermore, a bounding box around each detected instrument is calculated. On each bounding box, features to track are detected by randomly sampling edges located with the Canny edge detector. Using Lucas-Kanade optical flow, a vector describing the motion of each feature to its position in the next frame is computed. Afterwards, a consensus over all directional vectors is formed to locate the position of the instrument tip in the next frame.
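
A condensed sketch of this tracking step is shown below: the instrument axis from a principal component analysis of the contour points, randomly sampled Canny edge features inside the bounding box, and a median consensus over the Lucas-Kanade flow vectors. Function names, the number of sampled features and the Canny thresholds are illustrative assumptions.

```python
# Sketch of a TRA-KIT-style tracking step (illustrative parameters).
import cv2
import numpy as np

def instrument_axis(contour):
    """Centre and main axis direction of a segmented instrument contour."""
    pts = contour.reshape(-1, 2).astype(np.float32)
    centre = pts.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov((pts - centre).T))
    return centre, eigvecs[:, np.argmax(eigvals)]

def track_tip(prev_gray, next_gray, bbox, tip):
    """prev_gray/next_gray: 8-bit grayscale frames; bbox: (x, y, w, h)."""
    x, y, w, h = bbox
    edges = cv2.Canny(prev_gray[y:y + h, x:x + w], 50, 150)
    ys, xs = np.nonzero(edges)
    # Randomly sample edge points inside the bounding box as features to track.
    idx = np.random.choice(len(xs), size=min(100, len(xs)), replace=False)
    p0 = np.float32(np.column_stack([xs[idx] + x, ys[idx] + y])).reshape(-1, 1, 2)

    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)
    flow = (p1 - p0).reshape(-1, 2)[status.ravel() == 1]

    # Consensus over the motion vectors: shift the tip by the median flow.
    return tip + np.median(flow, axis=0)
```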

3.2 Tra-Ucl-Mod

The method proposed by Allan et al. Allan2015 from UCL tracks the 2D tip location and instrument orientation by projecting the output of a 3D tracker. This tracker is based on aligning a CAD model projection with the output of the RF described in 2.6, where the 3D pose is recovered as the model transformation that generates the best projection. This is formulated within a multi-region level set framework in which the estimates are combined with frame-to-frame optical flow tracking.

3.3 Tra-Ucl-Ol

The second approach from UCL Du2015 treats the tracking problem as a classification task, with the object model updated over time using online learning techniques. However, such methods are prone to include background information in the object appearance or lack the ability to estimate scale changes. To address this problem, Patch-based Adaptive Weighting with Segmentation and Scale (PAWSS) is used. A simple colour-based segmentation model suppresses the background information. Furthermore, multi-scale samples are extracted online, which allows the tracker to handle incremental and abrupt scale variations between frames. As the method is based on online learning, it only needs a bounding box initialization of the object on the first frame and then tracks the object for the whole sequence. The assumption of this kind of method is that the object stays in view throughout the sequence, so it cannot handle out-of-view situations.

3.4 Tra-Uga

The tracking approach from UGA Agustinos2016 assumes that an instrument does not undergo large displacements between two successive images. In the initial step (first image), the instruments are located as described in section 2.7. In the following images, the candidate bounding boxes are detected, but the instrument search is refined only inside the bounding box best compatible with the position/orientation of the instrument in the previous image. Knowing the border of an instrument, the position of its tip is detected in the Frangi image along its central axis: the pixel along this line with the maximum grey level in the Frangi image is considered the tip.

4 Validation data

To compare the different segmentation and tracking methods from the participants, we identified two important surgical application scenarios that capture the wide variety of instrument types and backgrounds generally encountered in laparoscopic images. The first scenario is in the context of robotic laparoscopic surgery and includes articulated instruments, the second scenario deals with rigid instruments in conventional laparoscopic surgery. Corresponding image validation data was generated for both scenarios, covering the different instrument-related challenges (figure 2).

4.1 Robotic laparoscopic instruments

The articulated robotic instrument data set D-ROB originates from ex-vivo 2D recordings using the da Vinci surgical system. In total, 6 videos from ex-vivo studies with varying background, including a Large Needle Driver and Curved Scissors, were provided (table 1). The instruments show typical poses and articulation in robotic surgery, including occlusions, though artifacts such as smoke and bleeding were not included in the data. All data was provided as video files. Task-specific reference data includes annotated masks for different instrument parts as well as the 2D instrument pose.

4.2 Conventional laparoscopic instruments

The conventional laparoscopic instrument data set D-CONV consists of 6 different in-vivo 2D recordings from complete laparoscopic colorectal surgeries (table 1). The images reflect typical challenges in endoscopic vision like overlapping instruments, smoke, bleeding and external materials like meshes. In total, the set contains seven different instruments that typically occur in laparoscopic surgeries, including hook, atraumatic grasper, ligasure, stapler, scissors and scalpel. From these 6 recordings, we extracted single frames for the segmentation task as well as video sequences for tracking. As reference data, annotated masks containing pixel-wise labels of different instrument parts as well as the 2D instrument pose were provided, depending on the task.

          Type      # of videos   Size      Frame size   # types of instruments
D-ROB     ex-vivo   6             1 min                  2
D-CONV    in-vivo   6             197 min                7
Table 1: Validation data for robotic/conventional laparoscopic segmentation/tracking.

5 Comparative study: Instrument segmentation

The following sections provide a description of the training and test data for the segmentation task, the reference method and the validation criterion.

5.1 Training and test data

             #Seq   Size        Seq ID   Train   Test
D-ROB-SEG    6      60 s        1–4      75%     25%
                                5–6      -       100%
D-CONV-SEG   6      50 frames   1–4      80%     20%
                                5–6      -       100%
Table 2: Training and test data for robotic/conventional instrument segmentation.

As described in section 4, the validation data originated from recordings of ex-vivo robotic (D-ROB) and in-vivo conventional (D-CONV) laparoscopic surgeries.

The training data includes RGB frames as well as two types of annotated masks as reference data (figure 3). The first mask contains pixel-wise labels for the different instrument parts (shaft/manipulator) and the background. Furthermore, an additional mask with labels for each instrument is available if more than one instrument is in the scene. For the test data no masks are provided. A detailed overview including training and test data is given in table 2.

For the one minute robotic sequences, an annotation is provided for every frame. Since the frames for conventional segmentation originate from complete surgeries, single frames had to be selected in a standardized manner. For the selection, each in-vivo recording was divided into 50 segments, from which 10 frames per segment were randomly extracted. From these 10 frames, one was manually selected to guarantee a certain image quality. This resulted in a validation set of 50 images per recording. Furthermore, the validation data set for conventional laparoscopic surgery was categorized according to challenging situations including overlapping instruments (D-CONV-SEGOverlap), bleeding (D-CONV-SEGBlood), smoke (D-CONV-SEGSmoke) and meshes (D-CONV-SEGMesh) in the scene, and the validation measures were calculated separately for these subsets. These challenges were taken into consideration while dividing the validation set into training and testing data.

Participants were advised to use the training data in a leave-one-sequence-out fashion: when testing on the additional images of the sequences (1-4) provided for training, it was not allowed to include the same sequence in the training set. For the new sequences (5-6), the whole training data could be used. Participants uploaded pixel-wise instrument segmentation results for each frame in the test data set. The results were in the form of binary masks, indicating instrument or background.

Figure 3: Training data example including the RGB image, the mask for instrument parts and the mask for instrument types.

5.2 Reference method and associated error

Reference data is provided as annotated masks, which were generated differently according to the surgical application scenario. Although the reference annotation aims to be as accurate as possible, errors are associated with the generation and are discussed accordingly.

5.2.1 Robotic laparoscopic instruments

The pixel-wise labeling was generated through backprojection of a 3D CAD model with hand corrected robotic kinematics. The CAD model used per-vertex material properties, which allowed the mask values for both the shaft and metal head to be written directly to the corresponding pixels. As these values were provided in video files, errors due to video compression could arise and the values had to be thresholded to the nearest expected value.

5.2.2 Conventional laparoscopic instruments

The reference annotation was generated via crowdsourcing. In a first step, the Pallas Ludens crowdsourcing platform (Pallas Ludens GmbH, Heidelberg, Germany) was used to segment all endoscopic instruments using a similar approach as in maier2014a . Next, human observers (non-experts) went through all annotations and made manual corrections if necessary. Finally, the challenge organizers double-checked the quality of all annotations.

5.3 Validation criteria

To evaluate the quality of a submitted segmentation result, several measures were calculated between the reference (R) and the predicted segmentation (P) of a given method, provided as binary masks. The criteria used included typical classification metrics like precision, recall and accuracy as well as the dice similarity coefficient (DSC) powers2011evaluation . The DSC is a typical measure to evaluate the segmentation result by calculating the overlap of the predicted (P) and reference (R) segmentation. It is defined as:

DSC(P, R) = \frac{2\,|P \cap R|}{|P| + |R|}    (1)

In the case of a binary segmentation, the DSC is identical to the F1 score, which is a combination of precision and recall.
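
A minimal sketch of these criteria, computed from a predicted binary mask P and a reference mask R, could look as follows (NumPy only; edge cases such as empty masks are not handled).

```python
# Segmentation criteria used in the study, computed from binary masks.
import numpy as np

def segmentation_metrics(P, R):
    P, R = P.astype(bool), R.astype(bool)
    tp = np.logical_and(P, R).sum()
    fp = np.logical_and(P, ~R).sum()
    fn = np.logical_and(~P, R).sum()
    tn = np.logical_and(~P, ~R).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    dsc = 2 * tp / (2 * tp + fp + fn)   # equals the F1 score for binary masks
    return dict(DSC=dsc, precision=precision, recall=recall, accuracy=accuracy)
```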

6 Comparative study: Instrument tracking

The following sections provide a description of the training and test data for the tracking task, the reference method and the validation criterion.

6.1 Training and test data

             #Seq   Size   Mask          Seq ID   Train   Test
D-ROB-TRA    6      60 s   every frame   1–4      75%     25%
                                         5–6      -       100%
D-CONV-TRA   6      60 s   1 fps         1–4      75%     25%
                                         5–6      -       100%
Table 3: Training and test data for robotic/conventional instrument tracking.
Figure 4: Example of data provided for instrument tracking. Left shows a mask for a conventional laparoscopic instrument with the point to track in white and the instrument axis in black. Right shows the data provided for the robotic instruments, in yellow the point to track, in red the head axis, in blue the shaft axis and in green the clasper angle.

The validation data for tracking originates from the same recordings already described in the previous paragraphs of section 4. In addition to the RGB frames and the two types of annotated masks (section 5.1), a csv file with pixel coordinates of the center point and the normalized axis vector of the shaft for each instrument was provided as reference. For the robotic instruments, the normalized axis vector of the instrument head as well as the angle between the claspers is given as well (figure 4). For the test data no reference was provided. The center point is defined as the intersection between the shaft and the metal manipulator on the instrument axis.

For the robotic data set, an annotation is provided for every frame. For the conventional data set, only one frame per second is annotated.

Furthermore, the sequences in the validation data set for both conventional laparoscopic and robotic surgery were categorized according to challenging situations. For conventional laparoscopic surgery, the sequences contained challenges such as multiple instruments (D-CONV-TRAMultiple), multiple occurrences of instrument occlusions (D-CONV-TRAOcclusion), blood (D-CONV-TRABlood), smoke (D-CONV-TRASmoke) and surgical objects such as meshes and clips (D-CONV-TRAObjects). The robotic dataset contained sequences with multiple instruments (D-ROB-TRAMultiple). The validation measures were calculated separately for these subsets.

A detailed overview including training and test data is given in table 3. Participants were advised to use the training data in a leave-one-sequence-out fashion: when testing on the additional frames of the sequences (1-4) provided for training, it was not allowed to include the same sequence in the training set. For the new sequences (5-6), the whole training data could be used.

Participants uploaded the pixel coordinates of the center point and the axis vector of the shaft for each instrument in each frame of the test dataset. For the robotic dataset, the axis vector of the head as well as the clasper angle were provided as well.

6.2 Reference method and associated error

Reference data was provided as annotated masks as described in section 5.2, as well as coordinates of specific instrument parts, which were generated differently according to the surgical application scenario.

6.2.1 Robotic laparoscopic instruments

Using hand corrected kinematics provided by the da Vinci Research Kit (DVRK), the center point, shaft axis, head axis and clasper angle were computed for each frame by projecting their location relative to the 3D model reference frame onto the camera sensor using a pinhole projection, where the camera parameters were computed using the Matlab calibration toolbox. To define the location of the points in the 3D model reference frame, we manually estimated the coordinates of the 3D tracked point using modeling software and generated the tracked axes and clasper angles from the robot kinematic coordinate frames directly.
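
The sketch below illustrates the projection step with OpenCV's pinhole model; the intrinsics, distortion coefficients and the example pose are hypothetical placeholders for the values obtained from the Matlab calibration and the DVRK kinematics.

```python
# Sketch of projecting a 3D point on the instrument model into the image with
# a pinhole camera model (hypothetical camera parameters and pose).
import numpy as np
import cv2

K = np.array([[1000.0, 0.0, 640.0],        # example intrinsics
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                          # assume negligible distortion

def project_point(p_model, rvec, tvec):
    """Project a 3D point given the model-to-camera pose (rvec, tvec)."""
    img_pts, _ = cv2.projectPoints(np.float32([p_model]), rvec, tvec, K, dist)
    return img_pts.reshape(2)

# Example: instrument centre point expressed in the 3D model reference frame.
uv = project_point([0.0, 0.0, 0.1],
                   rvec=np.zeros(3), tvec=np.array([0.0, 0.0, 0.2]))
```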

6.2.2 Conventional laparoscopic instruments

Given the results provided via crowdsourcing (section 5.2), the position of the center point and axis were determined in an automated fashion. The center point is defined as the intersection between the shaft and the metal manipulator on the instrument axis. First, the segmented tip and shaft regions were used to locate the border between the shaft and the metal manipulator. We then determined the center and principal axis of the instrument section. Using a line going through the center point with the principal axis as direction, we used the intersection of this line with the border as center point. The principal axis was used as instrument axis.

6.3 Validation criterion

To evaluate the quality of a submitted tracking result, several distance measures were calculated between the reference and the predicted position and axis. To assess the accuracy of the 2D tracked center point, we compute the Euclidean distance between the predicted center point and the ground truth center point for each tool.

We also compute the angular distance between each of the predicted angles for the shaft, wrist and claspers and the corresponding ground-truth angles.
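
A minimal sketch of these two distance measures is given below; treating the axes as undirected (taking the absolute dot product) is an assumption.

```python
# Tracking criteria: Euclidean distance of the tracked centre point and
# angular distance between predicted and reference axis directions.
import numpy as np

def point_error(p_pred, p_gt):
    """Euclidean distance between predicted and ground-truth centre points."""
    return float(np.linalg.norm(np.asarray(p_pred) - np.asarray(p_gt)))

def angular_error(v_pred, v_gt):
    """Angle (degrees) between two axis directions, treated as undirected."""
    v1 = np.asarray(v_pred, dtype=float) / np.linalg.norm(v_pred)
    v2 = np.asarray(v_gt, dtype=float) / np.linalg.norm(v_gt)
    return float(np.degrees(np.arccos(np.clip(abs(np.dot(v1, v2)), 0.0, 1.0))))
```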

7 Results

7.1 Instrument segmentation

For the instrument segmentation, we present two types of results. First we evaluate the performance of the previously presented methods on both the D-CONV-SEG and the D-ROB-SEG validation sets. Based on the results of the single methods on these validation sets, we formulate the hypothesis that merging the segmentations of multiple separate methods should provide a measurable improvement. We propose two different methods for merging different segmentations and evaluate the performance of these on the validation sets.

For each dataset, we computed the metrics introduced in section 5.3 for all frames and present the averages for each dataset and its respective subsets. The DSC was used to rank the different methods.

To determine the statistical significance of the difference in DSC between the highest ranking method and each of the lower ranking methods, we performed a Wilcoxon signed-rank test Wilcoxon45 to compare the distributions of the DSC values of two methods on a given dataset. In the following tables, a difference that is significant at the stricter level (p < 0.01) is coded with a green cell background, a difference that is significant only at the weaker level with a yellow cell background, and a non-significant difference with a red cell background.
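
As a sketch, such a comparison can be run with SciPy on the per-frame DSC values of two methods; the threshold shown corresponds to the stricter of the two significance levels.

```python
# Wilcoxon signed-rank test on paired per-frame DSC values of two methods.
from scipy.stats import wilcoxon

def compare_methods(dsc_best, dsc_other, alpha=0.01):
    """dsc_best and dsc_other: per-frame DSC values on the same test frames."""
    statistic, p_value = wilcoxon(dsc_best, dsc_other)
    return p_value, p_value < alpha   # True would correspond to a green cell
```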

7.1.1 Single results

D-Conv-Seg

For the D-CONV-SEG dataset, we present the ranked results of the submitted methods on each subset. In table 4(a) the performances of the methods on the entire test dataset are listed, while table 4(b) lists the performances on the different challenge subsets.

D-CONV-SEGAll DSC Prec. Rec. Acc.
SEG-KIT-CNN 0.88 0.86 0.90 0.98
SEG-UB 0.84 0.78 0.94 0.97
SEG-UCL-CNN 0.82 0.81 0.88 0.97
SEG-JHU 0.82 0.83 0.85 0.97
SEG-UGA 0.66 0.94 0.55 0.95
SEG-KIT-RF 0.50 0.74 0.44 0.93
SEG-UCL-RF 0.42 0.74 0.35 0.93
(a)
D-CONV-SEGBlood D-CONV-SEGMesh D-CONV-SEGOverlap D-CONV-SEGSmoke
DSC Prec. Rec. Acc. DSC Prec. Rec. Acc. DSC Prec. Rec. Acc. DSC Prec. Rec. Acc.
SEG-JHU 0.73 0.75 0.73 0.97 0.75 0.64 0.97 0.95 0.86 0.87 0.86 0.96 0.80 0.83 0.81 0.97
SEG-KIT-CNN 0.85 0.87 0.84 0.98 0.83 0.76 0.94 0.97 0.86 0.88 0.85 0.96 0.86 0.86 0.87 0.98
SEG-KIT-RF 0.31 0.51 0.23 0.94 0.47 0.51 0.52 0.91 0.52 0.82 0.42 0.89 0.41 0.60 0.37 0.93
SEG-UB 0.76 0.66 0.94 0.96 0.76 0.66 0.96 0.95 0.84 0.79 0.91 0.95 0.80 0.71 0.94 0.97
SEG-UCL-CNN 0.77 0.74 0.82 0.97 0.79 0.71 0.91 0.96 0.83 0.81 0.88 0.95 0.77 0.74 0.85 0.96
SEG-UCL-RF 0.30 0.67 0.24 0.95 0.42 0.62 0.42 0.91 0.46 0.83 0.37 0.88 0.38 0.56 0.37 0.92
SEG-UGA 0.63 0.98 0.50 0.97 0.68 0.80 0.68 0.93 0.69 0.94 0.58 0.93 0.72 0.95 0.59 0.96
(b)
Table 4: The ranked results of the segmentation methods on all the subsets of D-CONV-SEG. The statistical significance of the difference in ranking is color-coded: green indicates a difference significant at p < 0.01, yellow a difference significant only at the weaker level, and red a non-significant difference.

It is noticeable that SEG-KIT-CNN outperforms the other methods in all subsets and significantly outperforms the competition on 3 of 5 subsets. Furthermore, a split between the method types can be seen in all subsets, as the top four ranks are always taken by the CNN-based methods.

D-Rob-Seg

For the D-ROB-SEG dataset, we also present the ranked results of each method (table 5). Here, SEG-JHU significantly outperforms all the other submitted methods. While, as for D-CONV-SEG, a CNN-based method achieves the highest DSC, a split between CNN-based and non-CNN-based methods cannot be observed here.

D-ROB-SEGAll DSC Prec. Rec. Acc.
SEG-JHU 0.88 0.84 0.92 0.97
SEG-UCL-CNN 0.86 0.83 0.90 0.97
SEG-UCL-RF 0.85 0.87 0.83 0.96
SEG-KIT-CNN 0.81 0.86 0.77 0.96
SEG-KIT-RF 0.78 0.86 0.72 0.95
SEG-UGA 0.78 0.95 0.66 0.96
Table 5: The ranked results of all segmentation methods on the D-ROB-SEG dataset. The statistical significance of the difference in ranking is color-coded: green indicates a difference significant at p < 0.01.

7.1.2 Merged results

While on each dataset one method significantly outperforms the competition (SEG-KIT-CNN on D-CONV-SEG and SEG-JHU on D-ROB-SEG), we hypothesize that not all the information contained in the results of the other methods is necessarily contained in the leading method. We therefore propose to merge multiple segmentation results in order to determine whether this could improve upon the results of the leading method.

An obvious and naive approach to merge multiple segmentations would be via majority voting (MV). Here, each method casts a vote on whether a given pixel belongs to the object of interest (a laparoscopic instrument) or to the background. The label that the majority of methods selected is then assigned to the pixel. The drawback of majority voting is that the vote of each method is weighted equally, not taking into account that the performance or quality of a given segmentation might vary. In medical image segmentation, the STAPLE algorithm Warfield04 is often used to merge multiple segmentations from experts and/or algorithms. STAPLE uses an expectation-maximization algorithm to assess the quality of each segmentation according to the spatial distribution of structures and regional homogeneity. The assessed quality is then used as a weight while merging. To determine whether a combination of segmentation methods can outperform the highest ranking method, we iterated through all possible combinations of the segmentation results of all methods and compared the results.
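
A minimal sketch of the majority-voting fusion of binary masks is shown below; the STAPLE fusion additionally estimates a per-method weight via expectation maximization (an implementation is available, for example, in SimpleITK) and is not reproduced here.

```python
# Majority voting over binary segmentation masks from several methods.
import numpy as np

def majority_vote(masks):
    """masks: list of binary arrays of identical shape, one per method."""
    votes = np.sum([m.astype(np.uint8) > 0 for m in masks], axis=0)
    return (votes * 2 > len(masks)).astype(np.uint8)   # strict majority
```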

In the following sections, we present the three highest ranked combinations of segmentation methods for each subset. Furthermore, we calculated the significance of the difference in DSC between each combination and the highest ranked single method. To abbreviate the names of the merged methods in the following results, each method is assigned an ID number (table 6).

ID Method
1 SEG-JHU
2 SEG-KIT-CNN
3 SEG-KIT-RF
4 SEG-UB
5 SEG-UCL-CNN
6 SEG-UCL-RF
7 SEG-UGA
Table 6: To simplify the results presented in the following section, we assigned each presented method an ID number.
D-Conv-Seg

In table 7 we present the ranked results on each subset of the segmentations merged using majority voting. On D-CONV-SEGAll, the 3 highest ranking combinations outperform SEG-KIT-CNN significantly, as do the top 2 combinations on D-CONV-SEGBlood and D-CONV-SEGSmoke. It is also interesting to note that SEG-KIT-CNN is included in the 3 highest ranking combinations for every subset.

D-CONV-SEGAll DSC Prec. Rec. Acc.
2 + 4 + 7 0.89 0.90 0.89 0.98
1 + 2 + 4 0.89 0.88 0.92 0.98
1 + 2 + 4 + 7 0.88 0.87 0.92 0.98
SEG-KIT-CNN [2] 0.88 0.86 0.90 0.98
(a)
D-CONV-SEGAll DSC Prec. Rec. Acc.
2 + 4 + 7 0.89 0.90 0.89 0.98
1 + 2 + 4 0.89 0.88 0.92 0.98
2 + 4 0.89 0.91 0.87 0.98
SEG-KIT-CNN [2] 0.88 0.86 0.90 0.98
(b)
Table 7: The ranked results of the merged segmentations computed using majority voting (a) and STAPLE (b) on all the subsets of D-CONV-SEG. The statistical significance of the difference in ranking is color-coded: green indicates a difference significant at p < 0.01, yellow a difference significant only at the weaker level, and red a non-significant difference.

The results of the STAPLE-based combinations on each subset of D-CONV-SEG are presented in table 7(b). Similarly to majority voting, SEG-KIT-CNN is also included in the 3 highest ranking combinations for every subset.

In table 8 and figure 5, we compare the performance of the highest ranked single and merged methods. While both merged methods outperform the highest ranked single method significantly, there is no significant difference in the performance of majority voting and STAPLE (see table 8).

D-CONV-SEGAll Mean SD Min. Max.
MV[2 + 4 + 7] 0.89 0.08 0.42 0.97
STAPLE[2 + 4 + 7] 0.89 0.08 0.42 0.97
SEG-KIT-CNN 0.88 0.08 0.57 0.97
Table 8: Comparison of the DSC values and their range for the single and merged methods on D-CONV-SEGAll. The color-coded significance shown here is between the results of majority voting and STAPLE; red indicates that the difference is not statistically significant.
Figure 5: Comparison of all the DSC values and their range of the single and merged methods in D-CONV-SEGAll.
D-Rob-Seg

The results for majority voting on D-ROB-SEG can be found in table 9(a). Here it can be seen that the 3 highest ranked combinations with majority voting outperform the highest ranked single method. It is also noteworthy that the highest ranked single method on D-ROB-SEG (SEG-JHU) is included in all of the combinations.

D-ROB-SEGAll DSC Prec. Rec. Acc.
1 + 4 + 6 0.89 0.87 0.92 0.97
1 + 2 + 3 + 4 + 6 0.89 0.90 0.89 0.97
1 + 2 + 4 + 6 + 7 0.89 0.90 0.88 0.97
SEG-JHU [1] 0.88 0.84 0.92 0.97
(a)
D-ROB-SEGAll DSC Prec. Rec. Acc.
1 + 4 + 6 + 7 0.89 0.87 0.92 0.97
1 + 4 + 6 0.89 0.87 0.92 0.97
1 + 2 + 4 + 6 + 7 0.89 0.85 0.93 0.97
SEG-JHU [1] 0.88 0.84 0.92 0.97
(b)
Table 9: The ranked results of the merged segmentations computed using majority voting (a) and STAPLE (b) on all the subsets of D-ROB-SEG. The statistical significance of the difference in ranking is color-coded: green indicates a difference significant at p < 0.01.

In table 9(b), the results for the STAPLE-based combinations are listed. Similarly to majority voting, these combinations also outperform the single methods.

While both majority voting and STAPLE outperform the highest ranked single method, a significant difference between the two combination methods cannot be observed (see table 10 and figure 6).

D-ROB-SEGAll Mean SD Min. Max.
MV[1 + 4 + 6] 0.89 0.05 0.31 0.97
STAPLE[1 + 4 + 6 + 7] 0.89 0.05 0.31 0.97
SEG-JHU 0.88 0.07 0.00 0.97
Table 10: Comparison of the DSC values and their range for the single and merged methods on D-ROB-SEG. The color-coded significance shown here is between the results of majority voting and STAPLE; red indicates that the difference is not statistically significant.
Figure 6: Comparison of all the DSC values and their range of the single and merged methods on D-ROB-SEG.

7.2 Instrument tracking

We present results for the instrument tracking by evaluating the performance of all methods on both the D-CONV-TRA and D-ROB-TRA validation sets and, similarly to the segmentation task, by testing the hypothesis that a better tracker can be built by combining the tracking output of each method into a single tracker. We merge the results from the trackers by taking the mean parameter prediction over all methods for each frame. For this to improve the tracking accuracy, the errors should be symmetrically distributed around the true value with similar magnitude. To counter the situation where the errors are not symmetrically distributed, we discard outliers by computing the mean distance between each of the measurements for a frame and discarding the largest if it is greater than two times the second largest. The rationale is that we assign higher confidence to the tracking results when there is greater consensus between the methods. We again use the Wilcoxon signed-rank test to assess the statistical significance of the performance differences between the methods and use the same color coding as in section 7.1.
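
The per-frame fusion rule can be sketched as follows for the tracked centre point; the function name and the restriction to the 2D point are illustrative, and the same rule would be applied to the angular parameters.

```python
# Merge per-frame centre-point predictions of several trackers: discard a
# clear outlier (mean distance to the others more than twice the second
# largest), then average the remaining predictions.
import numpy as np

def merge_predictions(points):
    """points: (n_methods, 2) array of predicted centre points for one frame."""
    points = np.asarray(points, dtype=float)
    if len(points) >= 3:
        # Mean distance of each prediction to all other predictions.
        d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
        mean_d = d.sum(axis=1) / (len(points) - 1)
        order = np.argsort(mean_d)
        if mean_d[order[-1]] > 2 * mean_d[order[-2]]:
            points = np.delete(points, order[-1], axis=0)
    return points.mean(axis=0)
```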

7.2.1 D-Conv-Tra

The images in the D-CONV-TRA validation set are highly challenging, with complex shadows, numerous occlusions from tissue, out-of-view situations and fast motion (see figure 2). Entries to this dataset were made using the TRA-UCL-OL, TRA-KIT and TRA-UGA methods, and the results on all the challenge subsets are shown in table 11. The high tracked point errors are often caused by periods of complete tracking failure, when the method tracks a feature in the background rather than merely tracking the wrong part of the instrument. Numerous frames in the submissions to the rigid tracking data failed to make any prediction for the instrument position, despite there being an instrument present in the image. In cases where this happened, we assigned a fixed penalty to the prediction of half the length of the frame diagonal for translation and 90 degrees for rotation.

The results show that the merged method achieves the highest accuracy for both the tracked point and the shaft angle. It is likely that this is caused by the TRA-KIT and TRA-UGA methods providing slight accuracy improvements during frames when TRA-UCL-OL fails to obtain a good estimate of the instrument position and orientation.

D-CONV-TRAAll T.P. (pix) S.D.
TRA-MERGED 84.7
TRA-UCL-OL 96.8
TRA-KIT 178.9
TRA-UGA 217.9
(a)
D-CONV-TRABlood D-CONV-TRAMultiple D-CONV-TRAObjects D-CONV-TRAOcclusion D-CONV-TRASmoke
T.P. (pix) S.D. T.P. (pix) S.D. T.P. (pix) S.D. T.P. (pix) S.D. T.P. (pix) S.D.
TRA-KIT
TRA-MERGED
TRA-UCL-OL
TRA-UGA
(b)
Table 11: Comparison of the tracking accuracy of each method, averaged across all of the D-CONV-TRA subsets. The statistical significance of the difference in ranking is color-coded: green indicates a significance value of p < 0.01. T.P. refers to the tracked point and S.D. refers to the shaft direction.

7.2.2 D-Rob-Tra

Table 12 shows the tracking error of each method on the subsets of the D-ROB-TRA dataset for the instrument shaft angle, tracked point and clasper angle. As no submitted method was capable of tracking the clasper opening angle, we omit this degree of freedom from the results. The results in this tracking task were more balanced, as most methods were able to track for extended sequences and there were few obvious large failures. This is reflected in the lack of improvement from the merged method. As in the D-CONV-TRA dataset, TRA-UCL-OL provided the best point tracking, but the TRA-UGA method provided the highest shaft angle accuracy. This is likely because TRA-UGA works well only when its fairly strict assumptions about low-level image features hold, which is frequently not the case in the much more challenging D-CONV-TRA data. The TRA-UCL-MOD method provided the best wrist direction estimate, as it was the only method that used a full 3D model to perform the tracking.

D-ROB-TRAAll T.P. (pix) S.D. W.D.
TRA-UCL-OL
TRA-MERGED
TRA-UGA
TRA-UCL-MOD
TRA-KIT
(a)
D-ROB-TRAMultiple T.P. (pix) S.D. W.D.
TRA-UCL-OL
TRA-MERGED
TRA-UGA
TRA-UCL-MOD
TRA-KIT
(b)
Table 12: Comparison of the tracking accuracy of each method, averaged across all of the D-ROB-TRA subsets. The statistical significance of the difference in ranking is color-coded: green indicates a significance value of p < 0.01. T.P. refers to the tracked point, S.D. refers to the shaft direction and W.D. refers to the wrist direction.

8 Discussion

8.1 Instrument Segmentation

For the instrument segmentation challenge, the methods were evaluated based on their performance on two datasets. Overall, on the D-CONV-SEG dataset the segmentation method SEG-KIT-CNN achieved the highest performance based on the DSC, while on the D-ROB-SEG dataset SEG-JHU significantly outperformed the other methods. While the performance of the highest ranked methods was similar on both datasets (DSC of 0.88), the range of the DSC differed considerably between the sets, with D-ROB-SEG having a narrower range. Furthermore, the highest ranked method on each dataset performed significantly worse on the other dataset.

This can be explained by the fact that the D-CONV-SEG dataset was collected from videos of actual laparoscopic operations, while the D-ROB-SEG dataset was collected from ex-vivo organs under controlled conditions. D-ROB-SEG therefore contains less variance, as only a small selection of surgical instruments were used in the videos and neither the endoscope optics nor the lighting conditions changed. Also D-ROB-SEG did not include challenges such as smoke and blood. While the kinematics of the robot allowed automated annotation of the instruments, resulting in more training data for the robotic dataset in contrast to the laparoscopic images that required manual annotation, the similar performance of the highest ranked methods on both datasets seems to suggest that D-CONV-SEG contained a sufficient amount of annotated images. It is still possible though that the performance of some methods could have improved with more annotated training examples.

It should be noted that on both datasets methods based on CNNs were the highest performing methods. On D-CONV-SEG, all the CNN-based methods perform significantly better than the remaining methods, while on D-ROB-SEG one of the RF-based methods was among the top three. Furthermore, it can also be seen that on D-CONV-SEG the performance of the RF-based methods degraded significantly when confronted with images that contained challenges such as blood or smoke. A possible explanation for the difference in performance between the CNN-based and the non-CNN-based methods is that the latter, especially the RF-based methods, relied on manually selected features that describe the local pixel neighborhood, while the CNN-based methods used features that had been learned for the specific task and operate from a more global perspective. These features allowed the CNN-based methods to label each pixel based on information collected from a larger region than its immediate neighbors. While this allowed for a higher performance, the non-CNN-based methods benefit from a run-time closer to real-time than the CNN-based methods. While all methods performed somewhat similarly in terms of precision, the recall showed a large spread. In other words, false positives were rare; instead, the methods differed in the amount of correct instrument pixels found.

The DSC ranges of the different CNN-based methods on D-CONV-SEG (0.82–0.88) and on D-ROB-SEG (0.81–0.88) are considerably wide, prompting the question of how these methods differ. Three of the CNNs (SEG-JHU, SEG-KIT-CNN and SEG-UCL-CNN) are based on FCNs, the current state of the art. While SEG-JHU and SEG-UCL-CNN perform similarly on D-CONV-SEG, SEG-KIT-CNN outperforms the other two. The difference between SEG-KIT-CNN and the other two lies in weight regularization and in using pretrained weights for part of the network, which seems to improve performance, at least on D-CONV-SEG. On D-ROB-SEG, SEG-KIT-CNN actually performs significantly worse than the other CNNs. The method SEG-UB is not based on an FCN, but on a patch-based CNN. While this is not the current state of the art, SEG-UB still outperforms two of the FCN-based methods on D-CONV-SEG. This can be attributed to the data augmentation used by the method, which none of the other CNN-based methods employed. These results suggest that combining an FCN with pretraining, data augmentation and regularization increases the DSC.

The results show that merging the segmentation results of different methods improves overall performance. The merged methods always perform better, and the results are often significantly better than those of the best single method. On both datasets, the highest ranked single methods achieved a DSC of 0.88, while the highest ranked merged methods achieved a DSC of 0.89. When comparing the manner of merging the results, whether with majority voting or STAPLE, no significant difference was found.

8.2 Instrument Tracking

As for the segmentation challenge, two datasets were also used for the instrument tracking challenge to evaluate the performance of the different methods. On both D-CONV-TRA and D-ROB-TRA, TRA-UCL-OL significantly outperforms the other single methods in terms of locating the laparoscopic instruments. The performance of the different methods on D-CONV-TRA differs enormously, which can be attributed to the large variations in the dataset. On D-ROB-TRA, the different methods provided similar performances, except TRA-KIT, whose error was twice as large as that of the other methods. This can be attributed to the point that was tracked: TRA-KIT tracked the instrument tip over time, while the other methods tracked the tool center point, which was also the annotated point.

Seeing that TRA-UCL-OL was the only submission based on a machine-learning algorithm, rather than hand-crafted image processing techniques, leads us to carefully suggest that this is the more promising direction for future research, though more submissions with more methodological crossover would be required to draw a more general conclusion. Tracking the articulation of the tools appears to be a much more challenging problem, as all methods produced large angular errors on both D-CONV-TRA and D-ROB-TRA.

The results show that merging the results of the different tracking methods was beneficial on D-CONV-TRA, as the merged method outperformed the single methods in both locating the instrument tip and finding the correct instrument angle. On D-ROB-TRA, merging multiple trackers appears to be less effective, which can be attributed to TRA-KIT tracking a different point than the other methods.

One open problem in comparing tracking results is how to appropriately score, or rather punish, tracking failure. Ignoring frames in which tracking failed would actually improve performance metrics, while adding a constant error might drastically reduce metrics.

9 Conclusion

In this paper, we presented the results of an evaluation of multiple state-of-the-art methods for segmentation and tracking of laparoscopic instruments based on data from the Endoscopic Vision sub-challenge on Instrument segmentation and tracking. For this challenge, a validation data set was generated for two tasks, segmentation and tracking of surgical tools, in two settings: robotic and conventional laparoscopic surgery. Our results indicate that while modern deep learning approaches outperform other methods in instrument segmentation tasks, the results are still not perfect. Furthermore, no single method covered all the information contained in the others, as merging the segmentation results from different methods achieved a higher performance than any single method alone.

The results from the tracking task show that this is still an open challenge, especially during challenging scenarios in conventional laparoscopic surgery. Here, acquiring more annotated data might be the key to improving the results of the machine-learning-based tracking methods, but acquiring large quantities of training data is challenging. In the conventional laparoscopic setting, tracking also benefited from merging results, though this was not the case in the robotic setting, as one method tracked a different part of the instruments than the other methods.

References
