Endoscopy is a widely used clinical procedure for the early detection of numerous cancers (e.g., nasopharyngeal, oesophageal adenocarcinoma, gastric, colorectal and bladder cancers), for therapeutic procedures and for minimally invasive surgery (e.g., laparoscopy). The procedure uses an endoscope: a long, thin, rigid or flexible tube with a light source and a camera at the tip, which allows the inside of affected organs to be visualized on a screen. A major drawback of the resulting video frames is their heavy corruption by multiple imaging artifacts (e.g., pixel saturation, motion blur, defocus blur, specular reflections, bubbles, fluid and debris). These artifacts not only make it difficult to visualize the underlying tissue during diagnosis but also adversely affect post-analysis methods. Many notable quantitative analyses that help to significantly improve clinical care are compromised, such as video mosaicking for visualizing an extended field-of-view for follow-up, 3D surface reconstruction for aiding surgical planning and key video-frame retrieval for reporting. The accurate detection of artifacts in clinical endoscopy is thus a critical bottleneck whose solution will transform and greatly accelerate the development of effective quantitative clinical endoscopic analysis across all diseases, organs and modalities. This was first identified in our previous work Ali_arXiv2019. To comprehensively address the artifact detection problem and stimulate academic discussion, we established the Endoscopy Artifact Detection challenge (EAD; details of this challenge can be found at https://ead2019.grand-challenge.org) as an initiative to discover the limitations of existing state-of-the-art computer vision methods and to stimulate the development of new algorithms in the following key problem areas inherent to all video endoscopy.
1.1 Multi-class artifact detection
Existing endoscopy workflows mainly detect a single artifact class, which is insufficient to obtain high-quality frame restoration suitable for quantitative analysis of the entire clinical video. In general, the same video frame can be corrupted by multiple artifacts; for example, motion blur, specular reflections and low contrast can all be present in the same frame. Further, not all artifact types corrupt the frame to an equal extent. Thus, unless the multiple artifacts present in a frame are known together with their precise spatial locations, clinically relevant frame restoration quality cannot be guaranteed. Another advantage of class-specific detection is that frame quality assessment can be better guided, minimising the number of frames that are discarded during automated video analysis and maximising the use of information within each video.
1.2 Multi-class artifact region segmentation
Frame artifacts typically have irregular, non-rectangular shapes. Consequently, their extent is overestimated by bounding box detections. The development of accurate semantic segmentation methods that precisely delineate the boundaries of each detected frame artifact enables optimized restoration of video frames without sacrificing information.
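To make the overestimation concrete, the following minimal sketch computes the tight bounding box of a binary artifact mask and the fraction of that box actually covered by the artifact. The function name `bbox_fill_ratio` and the toy mask are illustrative assumptions and are not part of the challenge tooling.

```python
def bbox_fill_ratio(mask):
    """Return the tight box (x0, y0, x1, y1) of a binary mask and the
    fraction of box pixels that are artifact pixels.

    `mask` is a list of rows of 0/1 values (1 = artifact pixel).
    Hypothetical helper for illustration only.
    """
    coords = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask contains no artifact pixels")
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    y0, y1, x0, x1 = min(ys), max(ys), min(xs), max(xs)
    box_area = (y1 - y0 + 1) * (x1 - x0 + 1)
    return (x0, y0, x1, y1), len(coords) / box_area


# Toy diagonal streak: 4 artifact pixels inside a 4x4 tight box.
mask = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
box, ratio = bbox_fill_ratio(mask)
print(box, ratio)  # (0, 0, 3, 3) 0.25
```

In this toy case only a quarter of the box is artifact, so a restoration step driven by the box alone would touch three times as many clean pixels as corrupted ones.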
1.3 Multi-class artifact generalisation
It is important for algorithms to avoid biases induced by the use of specific training datasets. Additionally, it is well known that generating expert annotations is time-consuming and infeasible for many institutions. In this challenge, we encourage participants to develop machine learning algorithms that can be applied across different endoscopic datasets worldwide, based on our large combined dataset collected from 6 different institutions.
With the EAD Challenge we aimed to establish a first large and comprehensive dataset for “Endoscopy artifact detection” (see Fig. 1). The provided data was assembled from 6 different centers worldwide: John Radcliffe Hospital, Oxford, UK; ICL Cancer Institute, Nancy, France; Ambroise Paré Hospital of Boulogne-Billancourt, Paris, France; Instituto Oncologico Veneto, Padova, Italy; University Hospital Vaudois, Lausanne, Switzerland and the Botkin Clinical City Hospital, Moscow. This unique endoscopic video frame dataset is multi-tissue (gastroscopy, cystoscopy, gastro-oesophageal, colonoscopy), multi-modal (white light, fluorescence, and narrow band imaging), inter-patient and encompasses multiple populations (UK, France, Russia, and Switzerland). Videos were collected from patients on a first-come-first-served basis at Oxford and with randomized sampling at the French centres, while only cancer patients were selected at the Moscow centre. Videos at these centres were acquired with standard imaging protocols using endoscopes built by different companies: Olympus, Biospec, and Karl Storz. The dataset was built by randomly mixing the collected data with no exclusion criteria. All images have been carefully anonymised, and no patient information should be visible in this data. Comprehensive open-source software tools for this dataset (https://sharibox.github.io/EAD2019/) have been made available to assist the participants.
2.1 Gold standard
The clinical relevance of the challenge problem was first identified. During this step, 7 common imaging artifact types (see Fig. 1) were suggested by 2 expert clinicians, who performed bounding box labelling of these artifacts on a small dataset (100 frames). These frames were then taken as a reference by 2 experienced postdoctoral fellows to produce bounding box annotations for the remaining train-test dataset. Finally, further validation by 2 experts (clinical endoscopists) was carried out to ensure the reference standard. The ground-truth labels were randomly sampled (1 per 20 frames) during this process.
To maximise consistency in annotation labels between annotators, a few rules were determined, as described below. In the final score, we additionally penalized annotator variance in IoU (intersection over union), because bounding boxes inevitably vary between annotators, particularly for more subjective artifact types such as blur and contrast.
We used the open-source VIA annotation tool for semantic segmentation dutta2019vgg. For bounding box annotation, we used an in-house tool based on Python, Qt and OpenCV.
For the same region, multiple boxes were annotated if the region belonged to more than 1 class.
Minimal box sizes were used to describe the artifact region; e.g., if many specular reflections are present in an image, multiple small boxes are used instead of one large box to capture the natural size of the artifact.
Each artifact type was determined to be distinctive and general across endoscopy datasets.
Variation in bounding box annotations is accounted for by weighting the final score in the multi-class artifact detection challenge (0.6*mAP + 0.4*IoU), as IoU (intersection over union) is likely to vary more than mAP (mean average precision) across individual annotators.
Variation in the semantic class labels of masks for semantic segmentation was found not to be significant. Further, we do not consider the contrast and blur classes, which are poorly defined.
Examples of bounding box annotations for detection are shown in Fig. 2. It can be observed that multiple boxes are annotated for several small specular areas, whereas contrast, blur and instrument cover relatively larger areas. Due to the overlap between two or more classes, the annotation by experts varied. This was minimized by following the detailed annotation protocol above. For semantic segmentation, a larger area mask was preferentially used to delineate locally very cluttered small specularity artifacts (see Fig. 3).
2.2 Training and testing data
The training dataset for detection consists of 2147 annotated frames in total over all 7 artifact classes. All algorithms were evaluated online (https://ead2019.grand-challenge.org/evaluation/results/) using the evaluation metrics discussed in Section 3 on a test set of 195 frames (10% of training data). During annotation we found that most frames were much more affected by specularity, imaging artifact and bubbles than by other artifact classes. We tried to keep the ratio of class types as similar as possible between the training and test datasets. Artifact class distributions for the detection and generalization datasets are provided in Fig. 4.
For semantic segmentation, we released 475 annotated frames for 5 different classes: specularity, saturation, artifact, bubbles and instrument. For test data, 122 frames were annotated for online evaluation of participants' algorithms. All data are available online (EAD2019Dataset).
The training dataset for generalization is the same as that for detection; however, the test data for generalization uses a previously withheld dataset provided by a sixth institution that is not present in any other training or test data released for the detection and segmentation tasks. The generalization test data consisted of 52 images, and the task was to detect all 7 artifact classes, as in the detection task.
3 Evaluation criteria
The challenge problems fall into three distinct categories. For each, well-defined evaluation metrics already exist and are used by the wider imaging community; we use them for evaluation here.
3.1 Detection score
Metrics used for multi-class artifact detection:
IoU - intersection over union. This metric measures the overlap between two bounding boxes $A$ and $B$ as the ratio of the overlapped area to the total area occupied by the two boxes (see Fig. 6):
$$\mathrm{IoU} = \frac{A \cap B}{A \cup B} \tag{1}$$
where $\cap$, $\cup$ denote the intersection and union respectively. In terms of numbers of true positives (TP), false positives (FP) and false negatives (FN), IoU (aka Jaccard J) can be defined as:
$$\mathrm{IoU}/J = \frac{TP}{TP + FP + FN} \tag{2}$$
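The area-ratio and count-based views of IoU described above can be sketched as follows. This is a minimal illustration and not the challenge's evaluation code; the function names and the `(x_min, y_min, x_max, y_max)` box convention are assumptions made for this example.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    The coordinate convention is an assumption for this sketch.
    """
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def jaccard_from_counts(tp, fp, fn):
    """Jaccard index J = TP / (TP + FP + FN)."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


a = (0, 0, 2, 2)
b = (1, 1, 3, 3)
print(box_iou(a, b))                 # 1/7, about 0.143
print(jaccard_from_counts(1, 3, 3))  # same value, counting unit cells
```

The two calls agree because the boxes share one unit cell (TP), while three cells belong only to the first box (FP) and three only to the second (FN).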
mAP - mean average precision of detected artifacts, with precision ($p$) defined as $p = \frac{TP}{TP + FP}$ and recall ($r$) as $r = \frac{TP}{TP + FN}$. This metric measures the ability of an object detector to accurately retrieve all instances of the ground truth bounding boxes. The higher the mAP, the better the performance. Average precision (AP) is computed as the Area Under Curve (AUC) of the precision-recall curve of detection sampled at all unique recall values ($r_1, r_2, \ldots$) whenever the maximum precision value drops:
$$\mathrm{AP} = \sum_{n} (r_{n+1} - r_n)\, p_{\mathrm{interp}}(r_{n+1}) \tag{3}$$
with $p_{\mathrm{interp}}(r_{n+1}) = \max_{\tilde{r} \ge r_{n+1}} p(\tilde{r})$. Here, $p(r_n)$ denotes the precision value at a given recall value. This definition ensures monotonically decreasing precision. The mAP is the mean of AP over all artifact classes $i$ for $N = 7$ classes given as
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i \tag{4}$$
This definition was popularised in the PASCAL VOC challenge pascal-voc-2012. The calculation is illustrated in Fig. 6. An IoU threshold was used to call a “match” between a predicted and a ground-truth detection.
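As a rough illustration of the AP computation described above, the sketch below builds the monotonically decreasing precision envelope and sums the area under it at each recall step, in the style of PASCAL VOC all-points interpolation. It assumes that precision-recall pairs have already been obtained by matching ranked detections to ground truth; the function names and toy values are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve.

    `recalls` must be sorted in non-decreasing order and `precisions[i]`
    is the precision measured at `recalls[i]`.
    """
    if len(recalls) != len(precisions):
        raise ValueError("recalls and precisions must have equal length")
    # Sentinels so the curve spans recall 0 to 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Envelope: precision at r[i] becomes the max precision at any recall >= r[i].
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas over each recall step.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of per-class AP values given as {class_name: AP}."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Precision dips to 0.5 at recall 0.5, then recovers to 0.75 at recall 0.75;
# the envelope replaces the dip with 0.75.
print(average_precision([0.25, 0.5, 0.75], [1.0, 0.5, 0.75]))        # 0.625
print(mean_average_precision({"specularity": 0.5, "bubbles": 0.25}))  # 0.375
```

Recall never reaches 1 in the toy curve, so the final step from 0.75 to 1.0 contributes zero area.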
Participants were ranked on a final mean score $\mathrm{score}_d$, a weighted score of mAP and IoU represented as:
$$\mathrm{score}_d = 0.6 \times \mathrm{mAP} + 0.4 \times \mathrm{IoU} \tag{5}$$
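A minimal sketch of this weighted detection score follows, using the 0.6 and 0.4 weights stated in the annotation protocol of Section 2.1. The function name and the example values are illustrative assumptions and not results from the challenge.

```python
def detection_score(map_value, iou_value, w_map=0.6, w_iou=0.4):
    """Weighted detection score: w_map * mAP + w_iou * IoU.

    The default weights follow the 0.6*mAP + 0.4*IoU weighting given in the text.
    """
    for name, v in (("mAP", map_value), ("IoU", iou_value)):
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {v}")
    return w_map * map_value + w_iou * iou_value


# Hypothetical submission with mAP = 0.5 and mean IoU = 0.25.
print(round(detection_score(0.5, 0.25), 4))  # 0.4
```

Because mAP carries the larger weight, a method with a higher mAP can outrank one with tighter boxes, which matches the stated aim of reducing the influence of annotator-dependent box variation.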
3.2 Semantic segmentation score
Metrics widely used for multi-class semantic segmentation of artifacts were used for scoring semantic segmentation. They comprise:
Dice similarity coefficient (DSC) or F1-score
Jaccard Index (J) or IoU
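The general formula to compute the $F_\beta$-score is:
$$F_\beta = (1 + \beta^2) \cdot \frac{p \cdot r}{(\beta^2 \cdot p) + r} \tag{6}$$
With $\beta = 1$ and $\beta = 2$ in Eq. (6) one can compute F1-score (DSC) and F2-error respectively. Participants were ranked on a final weighted score of the above metrics.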
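The segmentation metrics above can be sketched from pixel-wise counts as shown below. The sketch uses the standard F-beta definition, $F_\beta = (1+\beta^2)\,p\,r / (\beta^2 p + r)$, rewritten in terms of TP, FP and FN. The flat 0/1 mask representation and the function names are assumptions for illustration, and the challenge's final weighting of these metrics is not reproduced here.

```python
def confusion_counts(pred, truth):
    """Pixel-wise (TP, FP, FN) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score; beta=1 gives the F1-score (Dice), beta=2 gives F2."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def jaccard(tp, fp, fn):
    """Jaccard index (IoU) from pixel counts."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


pred = [1, 1, 1, 1, 0, 0]
truth = [1, 1, 0, 0, 1, 0]
tp, fp, fn = confusion_counts(pred, truth)  # (2, 2, 1)
print(f_beta(tp, fp, fn, beta=1))  # 4/7, about 0.571
print(f_beta(tp, fp, fn, beta=2))  # 0.625
print(jaccard(tp, fp, fn))         # 0.4
```

In this toy case recall (2/3) exceeds precision (1/2), so F2, which weights recall more heavily, comes out higher than F1.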
3.3 Generalization score
For the generalization task (task 3), the multi-class artifact detection mAP was estimated on a dataset from a sixth institution that was not included in the training or test data of the detection and segmentation tasks. Generalization was evaluated based on the deviation between the mAP on the detection and generalization test datasets for strictly the same model parameters. Participants were ranked based on a score gap of generalization defined as:
$$\mathrm{dev}_g = |\mathrm{mAP}_d - \mathrm{mAP}_g| \tag{8}$$
where $\mathrm{mAP}_d$ and $\mathrm{mAP}_g$ denote the mAP on the detection and generalization test datasets respectively. It is worth noting that the highest $\mathrm{mAP}_g$ with $\mathrm{dev}_g \to 0$ is required for winning the competition.
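The generalization gap can be sketched as the absolute deviation between the two mAP values obtained with the same model. The function name and the example numbers below are hypothetical, and how the gap and the generalization mAP were combined into a final ranking is not specified here, so the sketch only reports both quantities.

```python
def generalization_gap(map_detection, map_generalization):
    """Absolute deviation between detection-set and generalization-set mAP."""
    return abs(map_detection - map_generalization)


# Hypothetical teams: (mAP on detection test set, mAP on generalization test set).
teams = {"team_a": (0.5, 0.25), "team_b": (0.375, 0.375)}
for name, (map_d, map_g) in teams.items():
    print(name, map_g, generalization_gap(map_d, map_g))
# team_a 0.25 0.25
# team_b 0.375 0.0
```

In this toy case team_b has both the smaller gap and the higher generalization mAP, which is the combination the text describes as desirable.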
We would like to acknowledge James Meakin and his team for providing us with an online framework to host our challenge at grand-challenge.org. We would also like to thank the Medical Image Analysis Network (MedIAN) and Cancer Research UK (CRUK) for co-sponsoring our workshop event at IEEE ISBI 2019, Venice, Italy. We are grateful to our colleagues, especially Mariia Dmitrieva, Korsuk Sirinukunwattana, Soumya Gupta, Ka Ho Tam and Joel Lefebvre, who helped during our annotation protocol study.