CaDIS: Cataract Dataset for Image Segmentation

06/27/2019 ∙ by Evangello Flouty, et al.

Video signals provide a wealth of information about surgical procedures and are the main sensory cue for surgeons. Video processing and understanding can be used to empower computer assisted interventions (CAI) as well as the development of detailed post-operative analysis of the surgical intervention. A fundamental building block of such capabilities is the ability to understand and segment video into semantic labels that differentiate and localize tissue types and different instruments. Deep learning has advanced semantic segmentation techniques dramatically in recent years but is fundamentally reliant on the availability of labelled datasets used to train models. In this paper, we introduce a high quality semantically segmented dataset for cataract surgery, annotated on top of an available video dataset. To the best of our knowledge, this dataset has the highest quality annotation in surgical data to date. We introduce the dataset and then show the automatic segmentation performance of state-of-the-art models on that dataset as a benchmark.




I Introduction

Computer assisted interventions (CAI) have the potential to enhance surgeons’ capabilities through better clinical information fusion, navigation and visualization [5]. Currently, CAI systems are used mainly as tools for preoperative planning [1] and for translating such plans into the procedure through surgical navigation [2][3]. There are possibilities to develop CAI further with more advanced deformable navigation capabilities, better imaging and robotic instrumentation [4]. More advanced CAI systems rely on using computer vision to make effective use of the video signal that surgeons themselves use to perform procedures. Data driven machine learning techniques, and deep learning in particular, have been immensely influential in recent vision advances as well as in medical image computing and analysis. To take advantage of such advances for CAI in procedures using surgical cameras (e.g. laparoscopy, endoscopy, microsurgery), it is necessary to establish data repositories and labels that facilitate the training of vision models and subsequent benchmarking.


Surgical video datasets for CAI have emerged over the past decade and contributed to algorithm development. Notable examples include Cholec80 and Cholec120 [16], RMIT [7] and data associated with the EndoVis challenge hosted by MICCAI [8]. The labels for such data can include per-frame annotations of surgical instrument presence or surgical phase, which are weak labels for more detailed tasks such as semantic segmentation. Pixel-level annotations could facilitate more advanced and effective applications in CAI to help image guided interventions [17][18], support pre-operative surgical planning [19], increase the efficiency and workflow of medical robots [20], estimate tool use or motion for post-procedural analytics [20][21][22], automate diagnostic readouts [23][24], or enhance surgical training [25]. While data availability is growing through the use of digital surgical cameras in endoscopy, laparoscopy and microsurgery, and systems for managing confidentiality, regulation and ethics are well established, annotation and data labels are still a major challenge for CAI.

Fig. 1: Example image and semantic segmentation labels from the CaDIS dataset, which we present in this paper. The image on the left is a frame from the CATARACTS dataset [14], while the image on the right shows the ground truth semantic segmentation labels.

Recently, the CATARACTS challenge presented 50 annotated surgical videos from a surgical microscope. Labels for the data describe the surgical tools that are present within each video frame [14] and the phase of the frame, which is subject to a procedural workflow segmentation [15]. In this paper, we introduce a semantic segmentation dataset built on top of the CATARACTS data. We demonstrate how this dataset can be used to train state-of-the-art deep learning frameworks for semantically segmenting unseen cataract data. We believe this contribution will underpin the development of CAI techniques based on vision.


We use the training videos from the CATARACTS challenge to generate our dataset. The videos were collected during cataract surgery, which is one of the most common surgical procedures performed across the globe. All videos were recorded during phacoemulsification procedures, the most commonly performed approach to cataract surgery [9][10], using a camera mounted on the surgical microscope and focused directly on the patient’s eye. Even though cataract surgery is less prone to complications, even a small improvement in risk mitigation can have a big impact, given that over 20 million cases were recorded in 2010 [11]. In addition, a study on medical malpractice claims related to cataract surgery reveals that 76.28% of the 118 claims are intraoperative allegations [12], and another study reports the rate of a certain intraoperative complication (posterior capsular rent) separately for experienced surgeons and for residents [13]. We therefore generate this dataset to foster more research on developing CAI systems for cataract surgery, which can potentially reduce risks and improve the workflow.

The CATARACTS challenge training set includes 25 videos that have around 500K frames in total. Because pixel-level labeling is time-consuming and the change among consecutive frames is subtle, we use the ground-truth tool and phase information to select frames that contain tools and are evenly distributed across the different phases. To this end, we use the phase annotation from [15] to split the videos into 14 surgical phases. We then randomly select a maximum of 20 frames per phase such that the frames are at least three seconds apart and each contains a tool. As a result, we collect around 200 frames per video and 4738 frames in total. The images are also downsampled to half their original resolution.
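The selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the frame record schema (`index`, `phase`, `has_tool`), the 30 fps frame rate, the greedy random strategy and the function name `select_frames` are all assumptions made for the example.

```python
import random


def select_frames(frames, fps=30.0, max_per_phase=20, min_gap_s=3.0, seed=0):
    """Pick up to `max_per_phase` tool-containing frames per phase for one video.

    `frames` is a list of dicts with the hypothetical keys 'index', 'phase'
    and 'has_tool'. Every selected frame is at least `min_gap_s` seconds away
    from every other selected frame of the same video.
    """
    rng = random.Random(seed)
    min_gap = int(round(min_gap_s * fps))  # minimum spacing in frames

    # Group the indices of frames that contain a tool by surgical phase.
    by_phase = {}
    for frame in frames:
        if frame["has_tool"]:
            by_phase.setdefault(frame["phase"], []).append(frame["index"])

    selected = {}
    taken = []  # every index selected so far, across all phases
    for phase in sorted(by_phase):
        candidates = by_phase[phase][:]
        rng.shuffle(candidates)  # random order, then greedy acceptance
        kept = []
        for idx in candidates:
            if len(kept) == max_per_phase:
                break
            if all(abs(idx - other) >= min_gap for other in taken):
                kept.append(idx)
                taken.append(idx)
        selected[phase] = sorted(kept)
    return selected


# Toy video: two phases of 3000 frames each, with a tool in most frames.
toy_video = [
    {"index": i, "phase": i // 3000, "has_tool": i % 7 != 0}
    for i in range(6000)
]
picked = select_frames(toy_video)
```

With 14 phases and at most 20 frames per phase, a video can contribute at most 280 frames under this rule, which is compatible with the roughly 200 frames per video reported above.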

The dataset includes 36 different semantic classes: 28 surgical tool classes, 5 anatomy classes, and 3 miscellaneous classes. The surgical tool classes also include tool handles: when a handle appears in an image, it is given a class ID different from that of its tool. Additionally, the markers (ink) created by the ”Biomarker” (a surgical tool) were given their own class ID. The classes are presented in Table I. Each image used in CaDIS carries the video and frame number from the original CATARACTS dataset, so that the source video and the time at which the frame was extracted can be traced. Figure 2(a) presents the distribution of the number of class instances per video, while Figure 2(b) shows the total number of videos in which at least one instance of a class appears. While some classes, like the pupil and the iris, are present across an entire video, other classes, like rare tools, appear in only a few frames across different videos. This makes the dataset highly unbalanced and accurate tool detection therefore more challenging. Furthermore, there are other visual challenges due to the high inter-class similarity among instruments. For example, Figure 3 shows four different types of cannulas, which look very similar. Each of these cannulas is used to perform a different action, such as injecting material, moving tissue or breaking tissue in the eye. As the cataract surgical phases are defined based on these actions, distinguishing among the different types of cannulas is crucially important for any CAI system.

Tools and Handles         | Anatomy   | Misc.
8. Hydro. Cannula         | 0. Pupil  | 1. Surgical Tape
9. Visco. Cannula         | 4. Iris   | 2. Hand
10,23. Cap. Cystotome     | 5. Eyelid | 3. Eye Retractors
11,21. Rycroft Cannula    | 6. Skin   |
12. Bonn Forceps          | 7. Cornea |
13,31. Primary Knife      |           |
14,22. Phaco. Handpiece   |           |
15,25. Lens Injector      |           |
16,19. A/I Handpiece      |           |
17,24. Secondary Knife    |           |
18. Micromanipulator      |           |
20. Cap. Forceps          |           |
26. Water Sprayer         |           |
27. Suture Needle         |           |
28. Needle Holder         |           |
29. Charleux Cannula      |           |
30. Vannas Scissors       |           |
32. Viter. Handpiece      |           |
33. Mendez Ring           |           |
34. Biomarker             |           |
35. Marker                |           |
Table I: New dataset classes. The number preceding each class is the ID of the class used in our experiments. Where two numbers precede the class, the first is the tool ID and the second is the ID of the respective tool’s handle. We labeled each tool’s tip separately from its handle so that the handles, which have no surgical value, can be isolated from the dataset if needed. More details about the dataset can be found in the Appendix.
Fig. 2: Boxplot (a) shows the average and spread of the number of instances of each class in one video. Barplot (b) shows the number of videos in which each class appears at least once. The plots show that the data is heavily biased towards the anatomy classes, which appear in almost all videos and have a high number of instances per video.
Fig. 3: Four different types of cannulas appearing in this dataset. These cannulas look very similar, but they are used to perform different actions during cataract surgery. The top-left picture shows a hydrodissection cannula hydrating the lens for smoother hydrodissection. The top-right image shows a viscoelastic cannula that ensures the proper pressure in the anterior chamber. The bottom-left image shows a Rycroft cannula hydrating the incision wounds to hasten their healing. Finally, the bottom-right picture shows a capsulorhexis cystotome that is used to tear the anterior capsule, allowing the surgeon to access the damaged lens.

III Experimental Setup

In order to increase the usability of our dataset, we set up three different experiments, each with its own particular goal but all with the same underlying objective of advancing the state of the art in CAI systems. The three experimental setups, along with the specific task each relates to, are described as follows:

III-A Experiment I: Anatomical Understanding

One of the most crucial properties a CAI system could have is the ability to understand the different anatomy in real time, along with its interaction with surgical instruments. This would enable an overall understanding of the state of the procedure and would allow the health of different anatomical landmarks to be monitored at any moment. Additionally, these insights could help to develop systems for scoring a surgeon’s interaction with the patient’s anatomy. To this end, the first experimental setup is composed by merging all instrument classes into a single class, simplifying the challenging instrument distinction task and allowing the algorithms to focus on discriminating between the different anatomical landmarks and the surgical instruments as a whole. The resulting dataset for this setup contains 5 anatomical labels, 3 miscellaneous labels and an extra label containing all instrument and instrument-handle classes.

III-B Experiment II: Instrument Identification

It is known that a variety of surgical processes are highly correlated with specific instrument usage, and thus identifying and distinguishing different surgical instruments is a key component in understanding surgical actions and preventing potential misuse and risks. Semantic segmentation of instruments can also help to draw an accurate profile of instrument usage across the operation, which, along with the instrument trajectories, can support both intra-operative risk estimation systems and the post-operative generation of operation notes and analytics. In addition, this would potentially allow a surgeon’s instrument handling skills to be scored and feedback to be reported when tremor or abrupt movements are detected. To validate systems capable of discriminating between different instruments, we merge all the anatomical and miscellaneous labels into a single class, yielding a dataset with 21 instrument classes (each instrument merged with its corresponding handle into one class) and one extra class including everything else.

III-C Experiment III: Surgical Understanding

To get a full understanding of the surgery and the relation between the surgeon and the patient, it is necessary not only to identify and distinguish anatomy and instrumentation, but also to study the interaction between them. This can provide a basis for developing surgical risk estimation systems, where a surgeon’s actions can be directly correlated with patient anatomy and measured to estimate the intensity of the actions or the risk they pose to the patient. Excelling at detecting instrument-to-anatomy interaction would allow early warning and risk prevention systems to arise, ultimately making surgery safer for all. We merged each instrument label with its corresponding handle label (e.g. merging ”Rycroft cannula handle” with the ”Rycroft cannula” class), resulting in a dataset with 21 instrument classes, 5 anatomical classes and 3 miscellaneous classes, for a total of 29 labels that allow usage and interactions among them to be studied.
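The three label sets can be derived from the 36 class IDs of Table I with a simple lookup. The sketch below is one possible way to do it; the numbering of the merged classes and the function names `build_mapping` and `remap_mask` are assumptions made for illustration, since the text does not specify how the merged labels are numbered.

```python
# Class IDs taken from Table I.
ANATOMY = [0, 4, 5, 6, 7]
MISC = [1, 2, 3]
# Handle ID -> ID of the tool it belongs to.
HANDLE_TO_TOOL = {23: 10, 21: 11, 31: 13, 22: 14, 25: 15, 19: 16, 24: 17}
# Tool IDs once handle IDs are set aside (21 instrument classes).
TOOLS = [i for i in range(8, 36) if i not in HANDLE_TO_TOOL]


def build_mapping(experiment):
    """Map each original class ID to a new ID for Experiment 1, 2 or 3.

    The new IDs are contiguous and start at 0; their order is an assumption.
    """
    kept = sorted(ANATOMY + MISC)  # the 8 non-instrument classes
    mapping = {}
    if experiment == 1:
        # Anatomy and misc. keep separate labels, all instruments share one.
        for new_id, old_id in enumerate(kept):
            mapping[old_id] = new_id
        for old_id in range(8, 36):
            mapping[old_id] = len(kept)
    elif experiment == 2:
        # Everything that is not an instrument collapses into class 0.
        for old_id in kept:
            mapping[old_id] = 0
        for new_id, tool in enumerate(TOOLS, start=1):
            mapping[tool] = new_id
    elif experiment == 3:
        # Anatomy, misc. and instruments all keep separate labels.
        for new_id, old_id in enumerate(kept + TOOLS):
            mapping[old_id] = new_id
    else:
        raise ValueError("experiment must be 1, 2 or 3")
    if experiment in (2, 3):
        # Handles take the label of their tool.
        for handle, tool in HANDLE_TO_TOOL.items():
            mapping[handle] = mapping[tool]
    return mapping


def remap_mask(mask, mapping):
    """Apply a mapping to a mask given as a list of rows of class IDs."""
    return [[mapping[pixel] for pixel in row] for row in mask]


class_counts = {e: len(set(build_mapping(e).values())) for e in (1, 2, 3)}
```

Counting the distinct output labels gives 9, 22 and 29 classes for Experiments I, II and III, matching the class counts stated for the three setups.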

Fig. 4: Sample results of Experiment I. From left to right: real image, ground truth, DeepLab v3+, PSP, and UPerNet results. All models were trained on 9 classes.
Fig. 5: Sample of the data used in Experiment I. As shown in the figure, all tool classes have been merged into one class.

III-D Comparative study

We benchmark our three experimental setups against state-of-the-art algorithms to provide strong baselines for future approaches to compare against. First, we include FCN-VGG [30] (the original Caffe implementation was used), one of the first works to introduce fully convolutional networks (FCNs) for image segmentation. Introduced in 2014, it highlighted the potential of FCNs by improving on the previous results by 20% (mean IoU). Since then, fully convolutional networks have dominated the semantic segmentation field, which continues to advance rapidly. PSPNet [28] (2016) introduced the Pyramid Pooling Module (PPM) to make use of both global and local contextual information to better learn about class co-occurrence in a single image, which in turn improves object segmentation accuracy. This is achieved by joining multiple sub-regional pooling layers that combine features at different scales. UPerNet [29] builds on top of PSPNet by adding a Feature Pyramid Network (FPN), a module that extracts features at different scales of the encoder, which allows a richer representation to be built by combining information at multiple image scales. Last, DeepLab v3+ [32] was recently introduced as an extension of DeepLab (v2 [33] and v3 [34]) that uses a modified Xception [35] encoder, known to increase object detection performance, and combines it with atrous convolutions with different dilation rates to achieve better contextual predictions without losing image resolution. The atrous convolution enables DeepLab v3+ to benefit from long-range contextual information while preserving fine boundary information.
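To make the notion of an atrous (dilated) convolution concrete, the following one-dimensional toy may help. It is a simplified sketch and not DeepLab's implementation: the function names are hypothetical, there is no padding, and a single layer is considered. A kernel of size k with dilation rate r spans r(k - 1) + 1 input samples, so context grows without adding weights, but the samples between the taps are not read by that window.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """Unpadded 1-D convolution (cross-correlation, as is usual in deep
    learning) whose kernel taps are `rate` samples apart."""
    span = rate * (len(kernel) - 1)  # distance between first and last tap
    return [
        sum(w * signal[start + j * rate] for j, w in enumerate(kernel))
        for start in range(len(signal) - span)
    ]


def effective_kernel_size(kernel_size, rate):
    """Number of input samples spanned by one dilated kernel."""
    return rate * (kernel_size - 1) + 1


ramp = list(range(10))
dense = atrous_conv1d(ramp, [1, 1, 1], rate=1)    # taps at i, i+1, i+2
dilated = atrous_conv1d(ramp, [1, 1, 1], rate=2)  # taps at i, i+2, i+4

# A one-sample-wide structure: windows whose taps straddle it do not see it.
thin = [0, 0, 0, 9, 0, 0, 0]
thin_dense = atrous_conv1d(thin, [1, 1, 1], rate=1)
thin_dilated = atrous_conv1d(thin, [1, 1, 1], rate=2)
```

In the `thin` example, every dense window that overlaps the spike responds to it, while with rate 2 the windows starting at positions 0 and 2 skip over it. Real networks stack many layers and rates, so this only illustrates the sampling pattern itself.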

All network weights were initialized with the original weights pretrained on the Pascal VOC 2012 dataset [26], and the networks were trained on a system with two NVIDIA GTX 1080 Ti GPUs. Hyperparameters for each of the networks, including the learning rate, were empirically selected to achieve the best performance across the three experiments. FCN-VGG was trained with an effective batch size of 8 (achieved by using an iteration size of 4 and a batch size of 2), while PSPNet, UPerNet and DeepLab v3+ were trained with a batch size of 4. Note that the networks are trained in their original frameworks and thus their hyperparameters are on different scales. All networks were trained with cross-entropy loss for 100 epochs, and for each experiment the model with the highest mIoU is reported.
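The effective batch size of 8 for FCN-VGG comes from accumulating gradients over 4 iterations of 2 images each. The toy below illustrates why this is equivalent to one batch of 8 when the accumulated gradient is averaged; it uses a hypothetical one-parameter least-squares model rather than a segmentation network, and all names are assumptions made for the example.

```python
def mse_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over a batch."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)


def accumulated_grad(w, samples, batch_size, iter_size):
    """Average the gradients of `iter_size` consecutive mini-batches."""
    total = 0.0
    for step in range(iter_size):
        chunk = samples[step * batch_size:(step + 1) * batch_size]
        total += mse_grad(w, chunk)
    return total / iter_size


# Eight toy (x, y) samples.
samples = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0),
           (5.0, 9.5), (6.0, 12.5), (7.0, 14.0), (8.0, 15.5)]

full_batch = mse_grad(0.5, samples)                                   # batch of 8
accumulated = accumulated_grad(0.5, samples, batch_size=2, iter_size=4)
```

Because the mini-batches have equal size, the mean of their mean gradients equals the mean gradient over all 8 samples, up to floating-point rounding. Layers whose behaviour depends on the batch itself, such as batch normalization, would not be covered by this equivalence.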

In order to establish a common evaluation protocol, we split the dataset for all the experiments into training, validation and test sets. The validation set contains all frames from videos 5, 7 and 16, for a total of 542 frames. Similarly, the test set includes all the frames from videos 2, 12 and 19, for a total of 614 frames. The frames from the remaining videos form the training set, for a total of 3582 frames.
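A video-level split like this one can be expressed in a few lines. The sketch below assumes that the 25 training videos are numbered 1 to 25 and that each frame is identified by a `(video_id, frame_number)` pair; both are assumptions made for illustration, as are the function names.

```python
VAL_VIDEOS = {5, 7, 16}
TEST_VIDEOS = {2, 12, 19}
# Frame counts per split as reported in the text.
SPLIT_SIZES = {"train": 3582, "val": 542, "test": 614}


def split_of(video_id):
    """Return the split a video belongs to."""
    if video_id in VAL_VIDEOS:
        return "val"
    if video_id in TEST_VIDEOS:
        return "test"
    return "train"


def split_frames(frames):
    """Group (video_id, frame_number) pairs by split."""
    splits = {"train": [], "val": [], "test": []}
    for video_id, frame_number in frames:
        splits[split_of(video_id)].append((video_id, frame_number))
    return splits


# Number of videos per split, assuming video IDs 1..25.
videos_per_split = {"train": 0, "val": 0, "test": 0}
for video_id in range(1, 26):
    videos_per_split[split_of(video_id)] += 1
```

Splitting by video rather than by frame keeps frames of the same surgery out of both training and evaluation. The three reported split sizes add up to the 4738 frames of the full dataset.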

Fig. 6: Sample results of Experiment II. From left to right: real image, ground truth, DeepLab v3+, PSP, and UPerNet results. All models were trained on 22 classes.
Fig. 7: Sample of the data used in Experiment II. As shown in the figure, all non-tool classes have been merged into one class.

In order to evaluate the above-presented models, we use two metrics. The first metric is pixel accuracy, which is computed as the percentage of correctly classified pixels and is defined as:

$$\text{Pixel Acc.} = \frac{\sum_{c=1}^{C} |GT_c \cap Pred_c|}{\sum_{c=1}^{C} |GT_c|}$$

where $GT_c$ and $Pred_c$ stand for the sets of pixels assigned to class $c$ in the ground truth and in the predictions respectively, $C$ is the number of classes, and $\cap$ indicates the intersection operation. The second metric is the mean Intersection over Union (IoU). The mean IoU tends to penalize incorrect detections more than pixel accuracy does, because it considers the intersection between prediction and ground truth as well as the incorrect predictions and the missed ground truth pixels. Mean IoU is defined as:

$$\text{Mean IoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{|GT_c \cap Pred_c|}{|GT_c \cup Pred_c|}$$

where $\cup$ denotes the union operation.
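Both metrics are straightforward to compute from flattened label maps. The following is a minimal sketch rather than the evaluation code used for the benchmark; in particular, skipping classes that occur in neither the ground truth nor the prediction is an assumption, and the function names are hypothetical.

```python
def pixel_accuracy(gt, pred):
    """Fraction of pixels whose predicted class equals the ground truth."""
    correct = sum(1 for g, p in zip(gt, pred) if g == p)
    return correct / len(gt)


def mean_iou(gt, pred, num_classes):
    """Mean over classes of |GT_c and Pred_c| / |GT_c or Pred_c|.

    Classes absent from both `gt` and `pred` are skipped (an assumption).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for g, p in zip(gt, pred) if g == c and p == c)
        union = sum(1 for g, p in zip(gt, pred) if g == c or p == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Six pixels, three classes, one mistake (a class-0 pixel predicted as 1).
gt = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 1, 2]
acc = pixel_accuracy(gt, pred)   # 5 correct pixels out of 6
miou = mean_iou(gt, pred, 3)     # per-class IoU: 2/3, 2/3 and 1
```

In this toy case the single error lowers the IoU of both class 0 (a missed pixel) and class 1 (a false positive), which is the sense in which mean IoU penalizes mistakes more heavily than pixel accuracy.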

IV Results

IV-A Experiment I - Anatomical Understanding

In this experiment, all instrument classes found in Table I were merged into one class, leading to a total of 9 classes. This experiment focuses on the anatomy classes, to help CAI applications evaluate anatomy during surgery and thus support risk avoidance and the evaluation of surgical skill. A sample of the data can be found in Figure 5. Qualitative and quantitative results can be seen in Figure 4 and Table II respectively.

Model       | Mean IoU | Pixel Acc.
PSP         | 69.75%   | 91.49%
DeepLab v3+ | 68.61%   | 91.30%
UPerNet     | 69.57%   | 91.40%
Table II: Segmentation performance of the different networks in Experiment I with 9 classes.

Table II shows that all models perform similarly, with the PSP model achieving the highest performance. All models achieve a similar mean IoU in detecting the instrument class. The slight improvement of UPerNet and PSP mainly comes from segmenting the Surgical Tape, the Eye Retractors and the Cornea. Figure 4 shows example frames and the corresponding color-coded ground truth annotations along with qualitative results from each model. As can be seen in this figure, PSP and UPerNet segment the Eye Retractors better. We believe that merging all instruments into one class, together with their high similarity to the ”Eye Retractors”, led the DeepLab model to further confuse these classes. All the models perform similarly in segmenting the anatomical classes.

As shown in Table II, PSP and UPerNet outperformed DeepLab in this experimental setup. It is interesting to see how DeepLab failed to recognize small regions in this experiment. As shown in the first example of Figure 4, the gap between the forceps is more defined with PSP and UPerNet than with DeepLab. We believe this is the result of the atrous convolutions, where local spatial information can be skipped. DeepLab also confused light reflection on the pupil with a surgical tool: in the second example of Figure 4, the reflection of light on the eye was classified as a surgical tool by DeepLab but not by the remaining networks. We can also see better tool segmentation with UPerNet than with the other networks in the same example. PSP and UPerNet segmented the cannulas better in the third example. Finally, DeepLab confused whether the surgical tool or the retractor is on top, whereas PSP and UPerNet clearly identified the correct order, as seen in the last example of Figure 4. Atrous convolution might have merged information from both the tool and the retractor, leading DeepLab to confuse which one is on top.

Even though both PSP and UPerNet outperform DeepLab, the DeepLab model segments and recognises cannulas better. Merging the tool masks into one mask helped the PSP and UPerNet networks achieve better overall performance, especially on the surgical tool class, leading both networks to outperform DeepLab in this experimental setup. This experiment also shows that DeepLab has difficulty detecting and correctly classifying thin regions within an image. This suggests that traditional convolutional kernels are more effective than atrous convolution kernels when it comes to segmenting thin classes.

IV-B Experiment II - Instrument Identification

In this experiment, all the anatomy and misc. classes found in Table I were merged into one class, leading to a total of 22 classes. This experiment focuses on surgical instruments during cataract surgery, to pave the way towards tool usage, tool tracking and cross-tool interaction applications. A sample of the data is presented in Figure 7.

Model       | Mean IoU | Pixel Acc.
PSP         | 44.38%   | 98.87%
DeepLab v3+ | 44.22%   | 98.86%
UPerNet     | 45.16%   | 98.85%
Table III: Segmentation performance of the different networks in Experiment II with 22 classes.

Table III presents the performance of the models in this setup. All the models perform similarly in detecting the merged anatomy class and achieve a Pixel Acc. of about 98.9%. Pixel Acc. is high for all models because the anatomy class is the dominant one and the models perform well on this class.

In this experimental setup, UPerNet outperforms the other two networks with a mean IoU of 45.16%, as shown in Table III. Having all anatomy and misc. classes combined allows the networks to focus on detecting the instruments. Despite this, the networks still struggled to differentiate among the different cannulas. The knives are the instruments most accurately segmented by all models, and the UPerNet model performed significantly better than both PSP and DeepLab v3+ on this class. The architectural difference between PSP and UPerNet helps explain why. As opposed to PSP, UPerNet uses un-fused features at the different levels of the PPM to classify pixels. PSP, however, fuses all the features from the PPM and uses the concatenated scales to infer the pixel class. Using concatenated features along with individual scale features helped UPerNet achieve better performance than the other networks. Figure 6 presents several frames along with the color-coded ground truth and segmentation results. One can observe that when an instrument is segmented, the instrument class is often correctly predicted.

Comparing Tables II and III indicates that the pixel accuracy increases when all anatomy classes are merged together. The higher pixel accuracy values arise because the merged anatomy class accounts for more than 89.96% of the dataset and therefore dominates the pixel accuracy metric. On the other hand, the mean IoU drops significantly compared to Table II. This clearly highlights the difficulty of segmenting and distinguishing among the different instrument classes.
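A small worked example may make this effect concrete. It is an illustration under stated assumptions, not a result from the experiments: suppose the merged anatomy class covers a fraction $p = 0.8996$ of the $N$ evaluated pixels, all $C = 22$ classes occur in the evaluated images, and a degenerate model labels every pixel as the merged class.

```latex
\begin{align*}
\text{Pixel Acc.} &= \frac{pN}{N} = p = 0.8996 \approx 89.96\% \\
\text{IoU}_{\text{merged}} &= \frac{pN}{N} = p,
  \qquad \text{IoU}_{c} = 0 \ \text{for each of the other } C - 1 \text{ classes} \\
\text{Mean IoU} &= \frac{p}{C} = \frac{0.8996}{22} \approx 4.09\%
\end{align*}
```

Under these assumptions a model that never detects an instrument would still score close to 90% pixel accuracy while its mean IoU stays near 4%, which is why the mean IoU is the more informative metric for this setup.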

IV-C Experiment III - Surgical Understanding

This experiment is targeted towards detecting the different class types together (tools and handles, anatomy and miscellaneous). As only a small fraction of the dataset belongs to tool handle classes, tool tip and handle classes are merged where applicable. As a result, we obtain 29 different classes. A sample of the training data is shown in Figure 8.

Fig. 8: Sample of the data used in Experiment III. As shown, each tool and Anatomical tissue has its own label and mask.
Model       | Mean IoU | Pixel Acc.
PSP         | 52.66%   | 91.62%
DeepLab v3+ | 51.91%   | 91.37%
UPerNet     | 49.43%   | 90.81%
VGG         | 20.61%   | 77.84%
Table IV: Segmentation performance of the different networks in Experiment III with 29 classes.
Fig. 9: Sample test images of Experiment III. From left to right: real image, ground truth, DeepLab v3+, PSP, UPerNet, and VGG results. All models were trained on 29 classes.

The evaluation results are presented in Table IV. As in the other experimental setups, the Pixel Acc. metric is heavily biased towards the anatomical classes, and all models except VGG achieve a similar value of around 91%. The mean IoU indicates that DeepLab and PSP perform significantly better than UPerNet. DeepLab segments the forceps more accurately than PSP and UPerNet, and both DeepLab and PSP detect the cannulas better than UPerNet. We also experimented with the VGG network, but its performance was dramatically lower than that of the other models (20.61% mean IoU).

Qualitative results are shown in Figure 9. It can be seen in the second and fourth rows that DeepLab detected and segmented the cannulas better. In the final row of Figure 9, we can see that DeepLab had the best detection among all networks, with only the bottom of the primary tool confused. However, PSP and UPerNet failed to achieve smooth detections of the two tools present. More confusion of the primary tool can be seen with UPerNet.

V Conclusion

In this paper, we have developed and provided a new dataset for semantic segmentation in cataract surgery, building on top of the CATARACTS dataset. In order to construct a more challenging dataset, we use the CATARACTS challenge annotations to select frames in which at least one surgical instrument is present. The generated dataset is conveniently organized into three different experimental setups that address three challenges of CAI systems: anatomical understanding, instrument identification and tracking, and understanding of the interactions between surgical instruments and anatomical landmarks. To validate the applicability of the dataset and establish strong baselines, state-of-the-art neural network models were trained and evaluated on all three experimental setups. Validation and test sets were created from separate videos to better assess the quality and generalization of the trained models on unseen test data. To the best of our knowledge, this is the first dataset that offers such high quality, fine surgical annotation. We believe this dataset will assist the community with their research and also foster new applications in the surgical domain.

The first experiment was created by merging all surgical tool classes into one class, while preserving all the anatomy and misc. classes, resulting in a dataset with 9 classes. PSP and UPerNet outperformed DeepLab, and showed that current state-of-the-art models are able to learn accurate anatomical representations in cataract surgery, achieving high segmentation accuracy.
In the second experiment, all surgical tool masks preserved their labels while the remaining anatomical and miscellaneous classes were merged into a single class, leading to a dataset of 22 classes. This experiment helps in understanding the potential of semantic segmentation for instrument identification and tracking in CAI systems, and would help to understand the correlation between different instruments in cataract surgery. This experiment showed the challenges of accurately differentiating different instruments; for example, multiple cannulas with different surgical functions look very much alike.

In the last experiment, all 29 classes, each with its own mask, were used for training. This experimental setup exposes the whole dataset in an effort to find relations among all 29 classes. Segmenting 29 classes is a difficult challenge, but DeepLab showed how deep neural networks can perform well on image segmentation with a difficult dataset such as the one proposed here. On the other hand, a simpler network like VGG was not able to perform well, with a low m-IoU of about 20%. We believe that VGG could perform better with hyper-parameter optimisation and, more importantly, with pixel-level balancing of all classes, but for a fair comparison all networks were trained with the same training loops.

The Appendix shows the big difference in the number of pixel instances between anatomy and surgical tools, which explains why VGG was not able to overcome the instrument identification problem without further hyperparameter optimisation and class-balancing sampling. According to Tables II, III and IV, PSP outperformed the other models in Experiments I and III with 69.75% and 52.66% m-IoU respectively, while UPerNet outperformed the remaining models in Experiment II with 45.16% m-IoU. The results show that the models segment the non-tool classes better than the tool classes. Cannulas are difficult to distinguish, especially with only the visual data from a single frame. We believe that introducing temporal models such as Long Short-Term Memory (LSTM) recurrent neural networks [36] would significantly help to distinguish between them. Each cannula has a specific task in the surgery: some scratch and tear tissue, others inject. Therefore, activity recognition would significantly help in differentiating between these tricky surgical tools, and models that help recognize activity can also help with this challenge.


We would like to thank Ellie Jaram and Nunzia Lombardo for their efforts in annotating the dataset to the highest quality.


  • [1] Zeng, Canjun, et al. ”A combination of three-dimensional printing and computer-assisted virtual surgical procedure for preoperative planning of acetabular fracture reduction.” Injury 47.10 (2016): 2223-2227.
  • [2] Tonutti, Michele, et al. ”A machine learning approach for real-time modelling of tissue deformation in image-guided neurosurgery.” Artificial intelligence in medicine 80 (2017): 39-47.

  • [3] Kaus, Michael R., et al. ”Automated segmentation of MR images of brain tumors.” Radiology 218.2 (2001): 586-591.
  • [4] Kassahun, Yohannes, et al. ”Surgical robotics beyond enhanced dexterity instrumentation: a survey of machine learning techniques and their role in intelligent and autonomous surgical actions.” International journal of computer assisted radiology and surgery 11.4 (2016): 553-568.
  • [5] Maier-Hein, L., et al. ”Surgical Data Science: Enabling Next-Generation Surgery.” Nature Biomedical Engineering 1 (2017): 691-696.

  • [6] Vedula, S. S. et al. ”Objective Assessment of Surgical Technical Skill and Competency in the Operating Room”, Annu. Rev. Biomed. Eng., 19(1), 301–325 (2017)
  • [7] Sznitman, Raphael, et al. ”Data-driven visual tracking in retinal microsurgery.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Berlin, Heidelberg, 2012.
  • [8] Allan, Max, et al. ”2017 Robotic Instrument Segmentation Challenge.” arXiv preprint arXiv:1902.06426 (2019).
  • [9] Eunbi, Kim et al, ”Studies on the Cornea and Lens”, Springer, p. 4, ISBN 9781493919352
  • [10] Hasler, Pascal , ”Essential Principles of Phacoemulsification”, JP Medical Ltd, ISBN 9789962678618
  • [11] World Health Organization, ”Priority Eye Diseases”, Retrieved from on 16, April 2019.
  • [12] Kim, Judy E et al. “Medical malpractice claims related to cataract surgery complicated by retained lens fragments (an American Ophthalmological Society thesis)” Transactions of the American Ophthalmological Society vol. 110 (2012): 94-116.
  • [13] Chakrabarti, Arup and Nazneen Nazm. “Posterior capsular rent: Prevention and management” Indian journal of ophthalmology vol. 65,12 (2017): 1359-1369.
  • [14] Al Hajj, Hassan, et al. ”CATARACTS: Challenge on Automatic Tool Annotation for cataRACT Surgery.” Medical image analysis (2018).
  • [15] Zisimopoulos, Odysseas, et al. ”Deepphase: Surgical phase recognition in cataracts videos.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2018.
  • [16] Twinanda, Andru P., et al. ”Endonet: A deep architecture for recognition tasks on laparoscopic videos.” IEEE transactions on medical imaging 36.1 (2017): 86-97.
  • [17] Kovler, I., et al. ”Haptic computer-assisted patient-specific preoperative planning for orthopedic fractures surgery.” International journal of computer assisted radiology and surgery 10.10 (2015): 1535-1546.
  • [18] Pfeiffer, Micha, et al. ”Learning soft tissue behavior of organs for surgical navigation with convolutional neural networks.” International journal of computer assisted radiology and surgery (2019): 1-9.

  • [19] Ozdemir, Firat, and Orcun Goksel. ”Extending Pretrained Segmentation Networks with Additional Anatomical Structures.” Int J CARS (2019) 14: 1187.
  • [20] García-Peraza-Herrera, Luis C., et al. ”Real-time segmentation of non-rigid surgical tools based on deep learning and tracking.” International Workshop on Computer-Assisted and Robotic Endoscopy. Springer, Cham, 2016.
  • [21] Fuentes-Hurtado, Félix, et al. ”EasyLabels: weak labels for scene segmentation in laparoscopic videos.” International journal of computer assisted radiology and surgery (2019): 1-11.
  • [22] Allan, Max, et al. ”Image based surgical instrument pose estimation with multi-class labelling and optical flow.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2015.
  • [23] Suetens, Paul, et al. ”Image segmentation: methods and applications in diagnostic radiology and nuclear medicine.” European journal of radiology 17.1 (1993): 14-21.
  • [24] Bouget, David, et al. ”Semantic segmentation and detection of mediastinal lymph nodes and anatomical structures in CT data for lung cancer staging.” International journal of computer assisted radiology and surgery (2019): 1-10.
  • [25] Engelhardt, Sandy, et al. ”Improving Surgical Training Phantoms by Hyperrealism: Deep Unpaired Image-to-Image Translation from Real Surgeries.” arXiv preprint arXiv:1806.03627 (2018).
  • [26] Everingham, Mark, et al. ”The pascal visual object classes (voc) challenge.” International journal of computer vision 88.2 (2010): 303-338.
  • [27] Chen, Liang-Chieh, et al. ”Encoder-decoder with atrous separable convolution for semantic image segmentation.” arXiv preprint arXiv:1802.02611 (2018).
  • [28] Zhao, Hengshuang, et al. ”Pyramid scene parsing network.” IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2017.

  • [29] Xiao, Tete, et al. ”Unified Perceptual Parsing for Scene Understanding.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.

  • [30] Long, Jonathan, et al. ”Fully convolutional networks for semantic segmentation.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
  • [31] He, Kaiming, et al. ”Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
  • [32] Chen, Liang-Chieh, et al. ”Encoder-decoder with atrous separable convolution for semantic image segmentation.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.
  • [33] Chen, Liang-Chieh, et al. ”Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.” IEEE transactions on pattern analysis and machine intelligence 40.4 (2017): 834-848.
  • [34] Chen, Liang-Chieh, et al. ”Rethinking atrous convolution for semantic image segmentation.” arXiv preprint arXiv:1706.05587 2017.
  • [35] Chollet, Francois. ”Xception: Deep Learning with Depthwise Separable Convolutions.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
  • [36] H. Sak et al. “Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling,” Proc. Interspeech, 2014.

VI Appendix

This section presents more details about the dataset used in this paper. We extracted 4738 images from the original CATARACTS dataset [14], as explained in Section II. This appendix contains a table describing the extracted dataset after manual segmentation.

| ID | Class name | Instances (count) | Instances per video (mean) | Instances per video (std.) | Pixels (count) | Pixels per image (mean) | Pixels per image (std.) |
|---|---|---|---|---|---|---|---|
| 0 | Pupil | 4732 | 182.0 | 40.28 | 413.9M | 87.47K | 25.12K |
| 1 | Tape | 3901 | 150.04 | 60.96 | 155.2M | 39.78K | 46.39K |
| 2 | Hand | 767 | 29.5 | 36.1 | 18.46M | 24068.72 | 31554.36 |
| 3 | Retractors | 3513 | 135.12 | 68.76 | 13.99M | 3983.59 | 3599.72 |
| 4 | Iris | 4735 | 182.12 | 40.31 | 275.6M | 58207.97 | 23511.3 |
| 5 | Eyelid | 181 | 6.96 | 34.81 | 5.909M | 32647.89 | 11795.42 |
| 6 | Skin | 4732 | 182.0 | 40.44 | 314.4M | 66441.51 | 44291.9 |
| 7 | Cornea | 4738 | 182.23 | 40.37 | 1.199B | 253251.54 | 56012.11 |
| 8 | Hydrodissection Cannula | 470 | 18.08 | 5.95 | 3.1M | 6606.28 | 2823.04 |
| 9 | Viscoelastic Cannula | 604 | 23.23 | 10.68 | 2.24M | 3714.87 | 1967.61 |
| 10 | Capsulorhexis Cystotome | 445 | 17.12 | 5.59 | 2.21M | 4979.81 | 1766.25 |
| 11 | Rycroft Cannula | 436 | 16.77 | 5.51 | 1.55M | 3565.49 | 1596.32 |
| 12 | Bonn Forceps | 518 | 19.92 | 28.42 | 5.81M | 11224.29 | 11259.28 |
| 13 | Primary Incision Knife | 480 | 18.46 | 32.97 | 3.79M | 7905.07 | 9235.29 |
| 14 | Phacoemulsifier Handpiece | 459 | 17.65 | 5.2 | 4.46M | 9729.5 | 3805.18 |
| 15 | Lens Injector | 783 | 30.12 | 11.39 | 8.78M | 11225.34 | 4890.55 |
| 16 | Irrigation/Aspiration | 410 | 15.77 | 5.37 | 8.11M | 19791.46 | 12494.24 |
| 17 | Secondary Incision Knife | 319 | 12.27 | 5.86 | 2.62M | 8238.67 | 4503.81 |
| 18 | Micromanipulator | 644 | 24.77 | 7.55 | 4.81M | 7472.54 | 4378.98 |
| 19 | Irrigation/Aspiration Handle | 98 | 3.77 | 6.8 | 1.28M | 13102.61 | 15808.41 |
| 20 | Capsulorhexis Forceps | 119 | 3.27 | 4.88 | 1.51M | 10452.58 | 13018.31 |
| 21 | Rycroft Cannula Handle | 85 | 2.73 | 5.02 | 888.4K | 16098.61 | 15205.75 |
| 22 | Phacoemulsifier Handle | 21 | 3.35 | 4.89 | 1.14M | 5086.28 | 5635.15 |

(The Images columns, Example and Spatial Avg., contain pictures that are not reproduced here.)
Table V: Supplementary information about the extracted dataset. The first and second columns contain the class ID and name, respectively. The third column contains the number of instances of the class in the dataset. The fourth and fifth columns show the mean and standard deviation of the number of instances per video. The sixth column shows the total number of pixels belonging to the class across the entire dataset. The seventh and eighth columns show the mean and standard deviation of the number of pixels per image. The ninth column gives an example of the class; we recommend zooming in on the examples. Finally, the tenth column shows the spatial average of the class within an image: red areas mark where a pixel is less likely to belong to the class when the class is present in the image, whereas blue areas mark where it is highly likely.

| ID | Class name | Instances (count) | Instances per video (mean) | Instances per video (std.) | Pixels (count) | Pixels per image (mean) | Pixels per image (std.) |
|---|---|---|---|---|---|---|---|
| 23 | Capsulorhexis Cystotome Handle | 87 | 5.27 | 4.75 | 442.5K | 9638.29 | 9166.58 |
| 24 | Secondary Knife Handle | 137 | 4.58 | 6.77 | 1.32M | 12754.99 | 9855.47 |
| 25 | Lens Injector Handle | 40 | 1.54 | 4.23 | 706.8K | 17670.78 | 10646.16 |
| 26 | Water Sprayer | 4 | 0.15 | 0.77 | 8.32K | 2081.25 | 354.67 |
| 27 | Suture Needle | 32 | 1.23 | 3.2 | 25.67K | 802.44 | 320.59 |
| 28 | Needle Holder | 12 | 0.46 | 2.31 | 373.88K | 31156.83 | 6175.01 |
| 29 | Charleux Cannula | 19 | 0.73 | 2.57 | 99.33K | 5227.89 | 2695.77 |
| 30 | Vannas Scissors | 1 | 0.04 | 0.19 | 19.86K | 19862.0 | 0.0 |
| 31 | Primary Knife Handle | 3 | 0.12 | 0.42 | 7.18K | 2395.33 | 1158.03 |
| 32 | Vitrectomy Handpiece | 17 | 0.65 | 3.27 | 248.83K | 14637.41 | 4982.82 |
| 33 | Mendez Ring | 14 | 0.54 | 2.69 | 1.78M | 127421.29 | 49674.11 |
| 34 | Biomarker | 8 | 0.31 | 1.54 | 114.94K | 14367.88 | 9613.42 |
| 35 | Markers | 179 | 6.88 | 34.42 | 1.22M | 6862.12 | 1929.6 |
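
The per-class statistics described in the caption of Table V could be computed from the label masks roughly as follows. This is a minimal sketch, not the authors' code: the function names `class_statistics` and `spatial_average` are hypothetical, masks are assumed to be 2-D integer arrays of class IDs with a common shape, an "instance" is assumed to mean one image in which the class appears, pixel means are assumed to be taken over images that contain the class, and the population standard deviation (NumPy's default) is assumed.

```python
import numpy as np


def class_statistics(masks, video_ids, num_classes):
    """Per-class instance and pixel statistics (assumed definitions, see lead-in).

    masks:     sequence of 2-D integer arrays holding one class ID per pixel
    video_ids: sequence giving the source video of each mask
    """
    videos = sorted(set(video_ids))
    video_index = {v: i for i, v in enumerate(videos)}

    # per_image_pixels[i, c] = number of pixels of class c in image i
    per_image_pixels = np.stack([
        np.bincount(np.asarray(m).ravel(), minlength=num_classes)[:num_classes]
        for m in masks
    ])
    present = per_image_pixels > 0  # class c appears in image i

    # per_video_instances[v, c] = number of images of video v containing class c
    per_video_instances = np.zeros((len(videos), num_classes), dtype=np.int64)
    for i, v in enumerate(video_ids):
        per_video_instances[video_index[v]] += present[i]

    stats = {}
    for c in range(num_classes):
        # Pixel counts restricted to images where the class is present
        pix = per_image_pixels[present[:, c], c]
        stats[c] = {
            "instances": int(present[:, c].sum()),
            "instances_mean": float(per_video_instances[:, c].mean()),
            "instances_std": float(per_video_instances[:, c].std()),
            "pixels": int(pix.sum()),
            "pixels_mean": float(pix.mean()) if pix.size else 0.0,
            "pixels_std": float(pix.std()) if pix.size else 0.0,
        }
    return stats


def spatial_average(masks, class_id):
    """Fraction of images (among those containing the class) in which each
    pixel location belongs to the class; returns None if the class never appears."""
    hits = [np.asarray(m) == class_id for m in masks]
    hits = [h for h in hits if h.any()]
    if not hits:
        return None
    return np.mean(hits, axis=0)
```

Calling `class_statistics(masks, video_ids, 36)` on the full set of masks would yield one dictionary per class ID, and `spatial_average(masks, class_id)` returns a map with values in [0, 1] that could be rendered with a colour map of the reader's choice.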