ADAM Challenge: Detecting Age-related Macular Degeneration from Fundus Images

02/16/2022
by   Huihui Fang, et al.
Baidu, Inc.
Sun Yat-sen University

Age-related macular degeneration (AMD) is the leading cause of visual impairment among the elderly worldwide. Early detection of AMD is of great importance, as the vision loss it causes is irreversible and permanent. Color fundus photography is the most cost-effective imaging modality for screening retinal disorders. Recently, deep learning-based algorithms have been developed for fundus image analysis and automatic AMD detection. However, a comprehensive annotated dataset and a standard evaluation benchmark are still missing. To address this issue, we organized the Automatic Detection challenge on Age-related Macular degeneration (ADAM) for the first time, held as a satellite event of the ISBI 2020 conference. The ADAM challenge consisted of four tasks covering the main topics in detecting AMD from fundus images: classification of AMD, detection and segmentation of the optic disc, localization of the fovea, and detection and segmentation of lesions. The ADAM challenge released a comprehensive dataset of 1200 fundus images with AMD category labels, pixel-wise segmentation masks of the full optic disc and lesions (drusen, exudate, hemorrhage, scar, and other), as well as the location coordinates of the macular fovea. A uniform evaluation framework was built to enable a fair comparison of different models. During the ADAM challenge, 610 results were submitted for online evaluation, and finally 11 teams participated in the onsite challenge. This paper introduces the challenge, the dataset, and the evaluation methods, summarizes the methods of the participating teams, and analyzes their results for each task. In particular, we observed that ensemble strategies and clinical prior knowledge can further improve the performance of deep learning models.

1 Introduction

The macula, located in the posterior pole of the retina, is closely related to fine vision and color vision. Once lesions appear in this region, people suffer from vision decline, dark shadows, or dysmorphia. Age-related macular degeneration (AMD) is a degenerative disorder of the macular region. It mainly occurs in people older than 45 years (haines2006functional). The etiology of AMD is not fully understood; it may be related to multiple factors, including genetics, chronic photic damage, and nutritional disorders (ambatiMechanismsAgeRelatedMacular2012; zajkac2015dry; rapalli2019nanotherapies). AMD can be divided into early, middle, and advanced stages according to clinical characteristics (ferris2013clinical). Early and middle AMD mainly manifest as drusen and pigmentary abnormalities, and patients usually have normal or nearly normal vision. Advanced AMD has two types: geographic atrophy (also known as dry AMD) and neovascular AMD (also known as wet AMD) (lim2012age). In advanced dry AMD, chorioretinal atrophy appears, leading to impaired central vision. Wet AMD is characterized by active neovascularization under the retinal pigment epithelium, which subsequently causes exudates, hemorrhage, and scarring, and will eventually cause irreversible damage to the photoreceptors and rapid vision loss if left untreated (maruko2007clinical). Early diagnosis of AMD is therefore crucial to treatment and prognosis. Fundus photography is one of the basic examinations and is significant for fundus screening. Ophthalmologists observe the state of the macular region in fundus photographs to determine whether there are lesions such as drusen, exudates, hemorrhage, scar, geographic atrophy, or neovascularization. However, manual analysis of fundus images is experience-dependent and time-consuming.

Figure 1: Illustration of the four tasks of the ADAM challenge: classification of AMD and non-AMD, detection and segmentation of the optic disc, localization of the fovea, and detection and segmentation of lesions. Note: there are five types of lesions (drusen, exudate, hemorrhage, scar, other) to be segmented in the challenge; the sample shown here has only three of them.

With the development of image processing technology, automatic fundus image analysis methods have appeared (fundus_survey). For example, optic disc and cup segmentation methods using region growing or level sets (anitha2014region; thakur2019optic) and a fovea localization method using a concentric circular sectional symmetry measure (guoRobustFoveaLocalization2020) can present the fundus structures automatically, intuitively, and quantitatively. In addition, Akram et al. (akram2013automated) used a feature vector of area, compactness, average boundary intensity, minimum and maximum boundary intensity, etc. to represent suspected drusen regions and used a support vector machine to identify real drusen. García et al. (garcia2010assessment) extracted a set of color and shape features from the image and utilized four classifiers to detect hemorrhages and microaneurysms. Mookiah et al. (mookiah2014decision) proposed an automated AMD detection system using the discrete wavelet transform (DWT), in which they extracted the first four-order statistical moments, energy, entropy, and Gini index-based features from the DWT coefficients as image features. These lesion segmentation and AMD detection methods can also provide valuable diagnostic information. However, the above methods mostly used hand-crafted features designed and extracted based on texture, grayscale, or the position of fundus structures in the image, which limited their use of the image information. Recently, owing to the rapid development of deep learning, methods for structure analysis (Liu2019DeepAMD; wangPatchBasedOutputSpace2019; tabassumCDEDNetJointSegmentation2020; jiangJointRCNNRegionBasedConvolutional2020; xieEndtoEndFoveaLocalisation2020; maiyaRethinkingRetinalLandmark2020; li2021applications), lesion extraction (vangrinsvenFastConvolutionalNeural2016; tanAutomatedSegmentationExudates2017; orlandoEnsembleDeepLearning2018; guoLSegEndtoendUnified2019; playoutNovelWeaklySupervised2019; engelbertsAutomaticSegmentationDrusen2019), and disease prediction (burlinaAutomatedGradingAgeRelated2017; grassmannDeepLearningAlgorithm2018; Fu2018DiscAware; pengDeepSeeNetDeepLearning2019; He2020CABNet; shamshad2022transformers) based on fundus images can take advantage of the features learned by deep networks, which are abstract and contain more image information. However, supervised learning requires a large amount of annotated data. Unlike natural image analysis tasks, for which annotated data are easy to obtain, deep learning-based medical image analysis has been limited by the lack of annotated data. In addition, related research is usually conducted on a single aspect, such as optic disc segmentation, fovea localization, or disease classification, and there is no unified system for evaluating these methods.

To address the above issues, we introduced the Automatic Detection challenge on Age-related Macular degeneration (ADAM) for the first time. This challenge (https://amd.grand-challenge.org) followed the successes of the REFUGE (orlandoREFUGEChallengeUnified2020a) and AGE (fuAGEChallengeAngle2020a) challenges and was held in conjunction with the International Symposium on Biomedical Imaging (ISBI) 2020. According to the Age-Related Eye Disease Study (AREDS) (davis2005age), the diagnosis of AMD requires not only detecting the lesions but also analyzing their size and location to determine the severity of the disease. AREDS proposed a grid and standard circles, based on the size of the optic disc and the location of the fovea, which are used to assess the size, area, and location of the lesions. Hence, we believe that automatic identification of the fundus structures (optic disc and fovea) is equally important to the clinic, because it can assist in assessing the severity of the disease. The clinical goal of our challenge is therefore not only the detection of AMD, but also the recognition of fundus structures and lesions. In addition to studying automatic AMD detection, our challenge designed three basic tasks that can assist AMD detection, namely detection and segmentation of the optic disc, localization of the fovea, and detection and segmentation of lesions (as shown in Fig. 1). In the ADAM challenge, a large dataset of 1200 fundus images was released, and the samples in this dataset have the following professional annotations: an AMD classification label, a segmentation mask of the optic disc, the location coordinates of the macular fovea, and five segmentation masks of the lesions related to drusen, exudates, hemorrhage, scar, and others. In addition, to standardize and fairly compare the results obtained by different methods, a unified evaluation framework was provided. In our challenge, the AMD classification task classifies the samples into AMD and non-AMD rather than making a detailed grading of early, middle, and advanced stages. Similar to but different from Burlina's method (burlina2017automated), all samples diagnosed as AMD, including early, middle, and advanced AMD, were considered as the AMD category, and the other samples, which were not diagnosed as AMD, were classified as the non-AMD category. In this paper, we introduce the ADAM challenge and the released dataset in detail, summarize the methodologies of the participating teams, and report their performances. Moreover, we discuss the impact of ensemble methods, additional datasets, and medical prior knowledge on the performance of the deep learning-based models. Finally, we discuss the clinical significance of deep learning methods for structure analysis and disease prediction from fundus images.

2 The ADAM challenge

The ADAM challenge focuses on the investigation and development of algorithms associated with the diagnosis of AMD and the analysis of fundus photographs. Hence, we released a large dataset of 1200 annotated retinal fundus images. In addition, to evaluate and compare the results of different algorithms, we designed an evaluation framework. The challenge consisted of a preliminary (online stage) and a final (onsite stage). During the preliminary, we released a training set and an online set for model learning and evaluation, respectively. There were 610 results submitted for online evaluation, and 11 teams were invited to the final. In the final, an onsite set was released to test the models.

Set  Num.  AMD/Non-AMD  With/Without Full Disc  With/Without Fovea  With/Without Drusen  With/Without Exudate  With/Without Hemorrhage  With/Without Scar  With/Without Other Lesions
Train 400 89/311 270/130 396/4 61/339 38/362 19/381 13/387 17/383
Online 400 89/311 265/135 400/0 49/351 53/347 30/370 34/366 10/390
Onsite 400 89/311 286/114 394/6 44/356 39/361 23/377 21/379 2/398
Total 1200 267/933 821/379 1190/10 154/1046 130/1070 72/1128 68/1132 29/1171
Table 1: Summary of ADAM challenge dataset

2.1 ADAM dataset

The ADAM dataset consists of 1200 retinal fundus images stored in JPEG format with 8 bits per color channel, provided by Zhongshan Ophthalmic Center, Sun Yat-sen University, China. The fundus images were acquired using a Zeiss Visucam 500 fundus camera with a resolution of 2124×2056 pixels (824 images) and a Canon CR-2 device with a resolution of 1444×1444 pixels (376 images). The collection process strictly followed the operating procedures of the fundus cameras. The examinations were performed in a standardized darkroom with the patients sitting upright. The center of the field of view of the fundus images is the optic disc, the macula, or the midpoint between the optic disc and the macula. These fundus images correspond to Chinese patients (47% female) aged 18-79 years visiting the Zhongshan Ophthalmic Center Clinic, and we manually selected only high-quality images of 1200 patients' left or right eyes. For anonymization, we removed the personal information from every image. The dataset was divided into three parts: a training set (400 images), an online set (400 images), and an onsite set (400 images). Details of the ADAM dataset are shown in Table 1. The ADAM challenge dataset is the first to contain comprehensive labels for AMD analysis. In addition to the AMD category labels, the dataset also provides pixel-wise segmentation masks of the full optic disc and the lesions, as well as the location coordinates of the macular fovea (as shown in Fig. 2).

Figure 2: Illustration of the reference standards for all tasks. The first row shows an AMD sample, the second row shows a non-AMD sample. (A) original fundus image; (B) optic disc segmentation mask (blue); no full optic disc is present in this non-AMD sample; (C) fovea location coordinate (pink cross) annotation; (D) lesion segmentation masks (light blue: drusen, green: exudate, red: hemorrhage, grey: others); there is no scar in these two samples.

The reference standard for AMD classification was obtained from the clinical diagnosis, which considers the medical record text information, such as medical history and physical examination results, and the imaging information, including fundus images and optical coherence tomography. The AMD category contains the samples diagnosed as early, middle, or advanced AMD (ferris2013clinical); the non-AMD category contains the samples that do not have AMD but may have other fundus disorders. In the training, online, and onsite sets, the AMD samples each account for 22.25% (89 images). The numbers of early/middle/advanced-dry/advanced-wet AMD samples in these three sets are 28/7/4/50, 18/7/3/61, and 24/9/3/53, respectively. It is worth noting that the proportion of AMD samples in our dataset, especially the proportion of wet AMD samples, is higher than in the real world, mainly because patients who come to the hospital are more likely to have serious ocular disease. Besides, to balance the sample sizes of the AMD and non-AMD categories for model training, we also supplemented the AMD category with additional samples. Furthermore, other retinal diseases, such as diabetic retinopathy, myopia, and glaucoma, can be found in certain AMD and non-AMD samples. Initial manual pixel-wise annotations of the full optic disc and the five lesions (drusen, exudate, hemorrhage, scar, and other lesions) were provided by 7 independent ophthalmologists, who reviewed and delineated the targets in all images without access to any patient information or knowledge of disease prevalence in the data. The majority vote of the 7 annotations was used as the single disc or lesion segmentation mask per image, followed by a quality check by a senior specialist. When errors in the annotations were observed, this additional specialist analyzed each of the 7 segmentations, removed those considered to have failed in his/her opinion, and repeated the majority voting with the remaining ones. Similarly, the initial coordinates of the fovea were obtained by 7 independent experts, the final coordinate was determined by averaging the 7 annotations, and a senior specialist performed a quality check afterward. When errors in the final coordinates were observed, the correction procedure was similar to that used for the segmentation annotations.
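
The fusion of the seven expert annotations described above (pixel-wise majority voting for the masks, coordinate averaging for the fovea) can be sketched as follows. This is only an illustration of the procedure; the mask shapes, coordinate values, and helper names are invented for the example and are not the organizers' actual tooling.

```python
import numpy as np

def majority_vote_mask(masks):
    """Fuse several binary masks of the same shape by pixel-wise majority voting."""
    stack = np.stack([np.asarray(m, dtype=np.uint8) for m in masks], axis=0)
    return (stack.sum(axis=0) * 2 > len(masks)).astype(np.uint8)

def average_fovea(coords):
    """Fuse several (x, y) fovea annotations by averaging them."""
    return np.asarray(coords, dtype=float).mean(axis=0)

# Illustrative use with 7 annotators:
expert_masks = [np.random.randint(0, 2, (64, 64)) for _ in range(7)]
expert_coords = [(1061.0 + np.random.randn(), 1028.0 + np.random.randn()) for _ in range(7)]
fused_mask = majority_vote_mask(expert_masks)   # a pixel is kept if at least 4 of 7 experts marked it
fused_xy = average_fovea(expert_coords)         # mean of the 7 coordinate pairs
```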

2.2 Challenge Evaluation

The ADAM challenge designed specific evaluation metrics and performance ranking rules for every task. After the teams submitted their predicted results, a unified evaluation method was used to assess the different models, which ensured the fairness of the evaluation. It is worth noting that, since the evaluation metrics of the different tasks have different dimensions, we did not compute the overall ranking directly from the numerical results. Instead, we first ranked the teams within each task, which transfers the evaluation to a unified measurement system, and then weighted these ranks to obtain the overall ranking.

Task1: Classification of AMD and non-AMD images

Participants provided, for each image, the estimated probability/risk of the image belonging to a patient diagnosed with AMD (value from 0.0 to 1.0). The classification results were compared to the clinical AMD conclusion. A receiver operating characteristic (ROC) curve was created across all testing images and the area under the curve (AUC) was calculated. Each team received a rank (1=best) based on the obtained AUC value.
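
As an illustration, the AUC for this task can be computed with scikit-learn as sketched below; the labels and probabilities are made-up placeholders, not challenge data.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative labels (1 = AMD, 0 = non-AMD) and submitted per-image probabilities.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.92, 0.15, 0.40, 0.77, 0.05, 0.61, 0.33, 0.48]

auc = roc_auc_score(y_true, y_score)                # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print(f"AUC = {auc:.4f}")
```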

Task 2: Detection and segmentation of optic disc

Participants provided the segmentation results as one image per testing image, with the segmented pixels labeled in the same way as in the reference standard (png files with 0: optic disc, 255: elsewhere). If the optic disc was not fully presented in the image, the pixel-wise labels in the segmentation should all be 255. Please note that the segmentation evaluation was first calculated on every sample containing a full optic disc and then averaged over these samples, whereas the detection evaluation was calculated over all test samples. The Dice coefficient was calculated as the segmentation evaluation metric in the ADAM challenge, following the evaluation method of previous challenges (PALM; orlandoREFUGEChallengeUnified2020a):

Dice = \frac{2\,|P \cap G|}{|P| + |G|} = \frac{2\,TP}{2\,TP + FP + FN}    (1)

where |P| and |G| represent the numbers of pixels in the prediction and the ground truth, and |P ∩ G| represents the number of overlapping pixels between them. TP, FP, and FN denote the numbers of true-positive, false-positive, and false-negative pixels in each image, respectively. Each team received a rank (1=best) based on the mean Dice value over the testing images.

In addition, an optic disc was assumed to be detected in an image if its segmentation had pixels labeled as the optic disc. Accordingly, the F1 score was calculated as the detection evaluation metric:

F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}    (2)

where Precision and Recall are the precision and recall of the detection results over the testing images, and TP, FP, and FN respectively represent the numbers of true-positive, false-positive, and false-negative detections in the testing set. Although Dice and F1 share the same algebraic form, the true positives, false positives, and false negatives have different meanings in our paper: in Dice they are computed at the pixel level for segmentation evaluation, whereas in the F1 score they are computed at the image level for detection evaluation. Each team received a rank (1=best) based on the obtained F1 score. Because segmentation results provide more information in the clinic, we set a higher weight for the Dice rank. The final ranking of optic disc detection and segmentation was then determined by the following equation:

R_{OD} = w_{seg} \cdot R_{Dice} + w_{det} \cdot R_{F_1}    (3)

where R_{Dice} and R_{F_1} are the team's ranks on the two metrics and w_{seg} > w_{det} are the corresponding weights, with the segmentation rank weighted more heavily.
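
A minimal sketch of how the two metrics could be computed from binary masks is given below; the variable names and the commented usage lines are illustrative assumptions, not part of the official evaluation toolkit.

```python
import numpy as np

def dice_coefficient(pred_mask, gt_mask):
    """Pixel-level Dice between two binary masks (Eq. 1)."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 1.0

def detection_f1(pred_present, gt_present):
    """Image-level F1 for detection (Eq. 2); inputs are per-image booleans."""
    pred = np.asarray(pred_present, dtype=bool)
    gt = np.asarray(gt_present, dtype=bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 1.0

# Segmentation: averaged only over images that truly contain a full optic disc.
# mean_dice = np.mean([dice_coefficient(p, g) for p, g in pairs_with_disc])
# Detection: computed over all test images; an image counts as "disc present"
# if its submitted mask contains any disc-labeled pixel.
# f1 = detection_f1([m.any() for m in pred_masks], [m.any() for m in gt_masks])
```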

Task 3: Localization of fovea.

The localization results submitted by participants contain the X and Y coordinates; if the fovea was invisible in the given image, both coordinates were set to 0. Submitted fovea localization results were compared to the reference standard. The evaluation criterion was the average Euclidean distance between the estimates and the ground truth (lower is better). Each team received a rank (1=best) based on this measure.
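
A small sketch of this criterion, assuming the predicted and reference fovea coordinates are available as (x, y) pairs (the numbers below are placeholders):

```python
import numpy as np

def mean_euclidean_distance(pred_xy, gt_xy):
    """Average Euclidean distance (in pixels) between predicted and reference
    fovea coordinates; lower is better."""
    pred = np.asarray(pred_xy, dtype=float)   # shape (N, 2)
    gt = np.asarray(gt_xy, dtype=float)       # shape (N, 2)
    return float(np.linalg.norm(pred - gt, axis=1).mean())

# Illustrative use:
print(mean_euclidean_distance([[1060, 1028], [900, 975]], [[1052, 1030], [910, 980]]))
```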

Rank Team Additional Dataset Architecture Ensemble Loss
1 VUNO EYE TEAM - EfficientNet Self-ensemble, Concatenate 15 finding feature maps CE
2 ForbiddenFruit ODIR EfficientNet, DenseNet Ensemble 5 models with designed formation CE
3 Zasti_AI - EfficientNet, Inception-ResNet, ResNext, SENet Ensemble 8 network predictions using averaging of posterior probabilities CE
4 Muenai_Tim - EfficientNet Self-ensemble 3 local minimal loss models using averaging method CE
5 ADAM TEAM - Xception, Inception-v3, ResNet50, DenseNet101 Ensemble 3 models, using averaging method CE
6 WWW - EfficientNet - Weighted CE
7 XxlzT - Autoencoder with ResNet50 as backbone - CE
8 TeamTiger - ResNet-101 - CE
9 Airamatrix ODIR EfficientNet-B4 - CE
Table 2: A brief summary of the methods provided by the participating teams on AMD classification task.
Team AUC Rank
VUNO EYE TEAM 0.9714 1
ForbiddenFruit 0.9592 2
Zasti_AI 0.9581 3
Muenai_Tim 0.9399 4
ADAM-TEAM 0.9287 5
WWW 0.9178 6
XxlzT 0.9097 7
TeamTiger 0.9086 8
Airamatrix 0.8847 9
Table 3: Evaluation of the AMD classification task

Task 4: Detection and segmentation of lesions.

Participants were asked to provide five lesion segmentation results, one per lesion type. Similar to task 2, the segmentation result of each lesion was submitted as one image per testing image, with the segmented pixels labeled in the same way as in the reference standard (png files with 0: lesion, 255: elsewhere). Also as in task 2, the segmentation evaluation was calculated only on the samples that actually contain the lesion, whereas the detection evaluation was calculated on all test samples. The Dice coefficient was calculated as the segmentation evaluation measure for each lesion independently. Each team received a rank (1=best) for each lesion based on the mean Dice over the testing images in which the corresponding lesion was present. In addition, a specific lesion was assumed to be detected in an image if the submitted segmentation mask had pixels labeled as that lesion. Accordingly, the F1 score was calculated as the detection evaluation measure, and each team received a rank (1=best) for each lesion based on it. For each lesion l, where l denotes drusen, exudate, hemorrhage, scar, or other, an evaluation score was then determined by weighting the two individual ranks, S_l = w_{seg} \cdot R_{Dice,l} + w_{det} \cdot R_{F_1,l}. The final ranking score was computed as:

S_{lesion} = \sum_{l} S_l    (4)

which then determines the lesion detection and segmentation leaderboard. The team with the smallest value was ranked #1.

The final ranking of the ADAM challenge was calculated by the following equation:

R_{final} = w_1 \cdot R_{task1} + w_2 \cdot R_{task2} + w_3 \cdot R_{task3} + w_4 \cdot R_{task4}    (5)

where R_{task1}, R_{task2}, R_{task3}, and R_{task4} represent the team's ranks on the aforementioned four leaderboards and w_1 to w_4 are the task weights. In the clinic, the information about lesions is the most essential evidence for diagnosis, so the highest weight was given to R_{task4}. The AMD classification rank received a high weight because its results can provide direct diagnostic recommendations, and the ranks of the two tasks for fundus structure analysis received the lowest weights. This weighted sum then determined the final leaderboard (1=best). In case of a tie, the rank on the classification leaderboard had preference.

After the online evaluation, 11 teams attended the final onsite challenge during ISBI 2020. The onsite dataset was released to these teams, and the final results had to be submitted within 6 hours. The final ranking considered both the online and onsite evaluation rankings:

R = w_{online} \cdot R_{online} + w_{onsite} \cdot R_{onsite},  with  w_{onsite} > w_{online}    (6)

where a higher weight was assigned to the onsite ranking because results on the onsite dataset better reflect the real performance of the models (orlandoREFUGEChallengeUnified2020a; fuAGEChallengeAngle2020a). During the preliminary, teams could see their performance on the online dataset through the leaderboard and adjust their models to work well on that dataset.
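
The rank-then-weight aggregation used throughout Eqs. (3)-(6) can be sketched as follows; the weights and ranks in the example are placeholders chosen only for illustration, not the actual challenge weights.

```python
def weighted_rank(per_task_ranks, weights):
    """Combine per-task ranks (1 = best) into an overall score; lower is better.

    per_task_ranks: dict mapping task name -> a team's rank on that task
    weights:        dict mapping task name -> weight of that task
    """
    return sum(weights[task] * rank for task, rank in per_task_ranks.items())

# Placeholder weights only: lesions weighted highest, structure tasks lowest.
weights = {"classification": 0.3, "optic_disc": 0.1, "fovea": 0.1, "lesions": 0.5}
team_ranks = {"classification": 2, "optic_disc": 1, "fovea": 3, "lesions": 2}
print(weighted_rank(team_ranks, weights))  # 2.0 -> used to sort the final leaderboard
```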

Figure 3: ROCs of the AMD classification results
Rank Team Additional Dataset Architecture Ensemble Strategy Loss
1 Airamatrix - FCN with ResNet50 as encoder - Multi-task: jointly training with fovea segmentation CE
2 XxlzT REFUGE,IDRiD ResNet for classification, U-Net for coarse segmentation, DeepLab-v3+ for fine segmentation - Multi-stage CE
3 Forbidden Fruit - FPN with EfficientNet -B0 and -B2 as encoder Ensemble 2 models by averaging their predictions - Focal loss, Dice loss
4 WWW RIGA, IDRiD EfficientNet for classification, U-Net for segmentation - Multi-stage CE
5 VUNO EYE TEAM RIGA, IDRiD, REFUGE, PALM U-Net with EfficientNet as encoder Self-ensemble 5 models at different epochs with majority voting Multi-task: jointly training with OD segmentation based on vessel mask CE
6 TeamTiger - U-Net with EfficientNet B7 as encoder - - Jaccard loss
7 ADAM-TEAM - U-Net with Inception-v3, EfficientNet-B3, ResNet50, DenseNet101 as encoder Ensemble 4 models using averaging method - BCE, Dice loss
8 Zasti_AI REFUGE U-Net with ResNet as encoder - - CE
8 Muenai_Tim - EfficientNet-B0 for classification, U-Net++ and EfficientNet-B7 for segmentation Ensemble 2 models for segmentation Multi-stage BCE, Dice loss
Table 4: A brief summary of the challenge methods on detection and segmentation of optic disc task.
Team Dice F1 Rank
XxlzT 0.9486 0.9913 1
Airamatrix 0.9475 0.9862 2
ForbiddenFruit 0.9420 0.9912 3
WWW 0.9445 0.9793 4
TeamTiger 0.9429 0.9792 5
VUNO EYE TEAM 0.9370 0.9894 6
ADAM-TEAM 0.9417 0.9675 7
Zasti_AI 0.9020 0.9843 8
Muenai_Tim 0.9294 0.9737 8
Table 5: Evaluation of the optic disc detection and segmentation task, in terms of Dice for segmentation and F1 for detection.
Figure 4: The optic disc segmentation results. (A) original fundus image, (B) the edges of the segmented region obtained by the 8 teams, (C)-(E) the results of the segmentation of the top 3 teams (i.e., (C) XxlzT, (D) Airamatrix, (E) ForbiddenFruit) compared with the ground-truth (green: true positive, blue: false negative, red: false positive).

3 Results

In this section, we summarize the methods and results of the participating teams in the ADAM challenge. Since the CHING WEI WANG (NUST) team used a commercial Auto AI platform, we do not discuss this team. In addition, the Voxelcloud team is only discussed in task 3 because they participated only in this task. The detailed leaderboards can be accessed on the ADAM challenge website at https://amd.grand-challenge.org/final-LeaderBoard/.

3.1 Classification of AMD and non-AMD

The purpose of this task is to determine whether an image belongs to the AMD category. A brief summary of the methods adopted by the participating teams is shown in Table 2. These approaches can be divided into two categories according to whether or not they use an ensemble strategy (zhouEnsemblingNeuralNetworks2002). Ensemble methods are among the most common approaches used by winners of computer vision competitions (nguyenFacialExpressionRecognition2019; khenedFullyConvolutionalMultiscale2019; sekiguchiEnsembleLearningHuman2020). Such methods can improve the performance of a single model by training multiple models and combining their results. Details of the methods are available in the supplemental material.

The evaluation of the AMD classification task (task 1) is shown in Table 3. All teams except Airamatrix obtained an AUC over 0.9 on the onsite dataset, as shown in Fig. 3. In line with the high AUC results, the ROC curves of these classification results are all located near the top left corner. At low false positive rates, only the TPRs of the XxlzT, ADAM-TEAM, and Airamatrix teams were below 0.9, indicating that the false positives of most teams could be kept relatively low, meeting the clinical requirements.

Rank Team Additional Dataset Architecture Ensemble Strategy Loss
1 VUNO EYE TEAM IDRiD, REFUGE, PALM U-Net with EfficientNet as encoder Ensemble of 2 models by averaging Add vessel information, segmentation framework CE
2 Forbidden Fruit - Segmentation: three FPNs with EfficientNet-B0, -B1 and -B2 as encoders; Regression: VGG-19, Inception-v3 and ResNet-v2-50 For both segmentation and regression: ensemble of 3 models Segmentation and regression CE, MSE
3 Voxelcloud IDRiD, ARIA, Proprietary dataset Nested U-Net Ensemble 20 models by averaging Segmentation and regression; add vessel information Dice loss, L2 loss
4 Airamatrix - FCN-ResNet50 - Joint training with OD segmentation CE
5 Zasti_AI - GAN - Distance map generation MSE
6 WWW - Segmentation: 2 U-Net, Mask-RCNN; Regression: ResNet Ensemble different models by averaging Segmentation and regression CE,MSE
7 Muenai_Tim - EfficientNet-B0, -B7 - Classification, regression CE, MAE
9 TeamTiger - EfficientNet-B7 - Regression MSE
10 ADAM-TEAM - YOLO-v2 - Using the common object detection method IOU loss, BCE loss, MSE
11 XxlzT REFUGE, IDRiD Faster RCNN - Using the common object detection method BCE, Smooth L1 loss
Table 6: A brief summary of the challenge methods on localization of fovea task.
Figure 5: Fovea localization results of the participating teams and the enlarged display.
Team ED Rank
VUNO EYE TEAM 18.5538 1
ForbiddenFruit 19.7074 2
Voxelcloud Team 25.2316 3
Airamatrix 26.1720 4
Zasti_AI 28.8555 5
WWW 36.3596 6
Muenai_Tim 69.0398 7
TeamTiger 192.6720 9
ADAM-TEAM 205.7886 10
XxlzT 284.0335 11
Table 7: Evaluation of the fovea localization task, in terms of Euclidean Distance (ED).

3.2 Detection and segmentation of optic disc

The purpose of this task is to detect whether a complete optic disc is present and, if so, to segment it at pixel level. The methods used by the participating teams are summarized in Table 4. We divided these solutions into three categories: multi-step approaches that first detect and then segment the optic disc, approaches that directly segment the optic disc, and approaches based on a multi-task joint training strategy. Further details can be found in the supplemental material. The evaluation results are shown in Table 5. From the table, we can see that XxlzT obtained the best scores in terms of both Dice and F1 (Dice = 0.9486, F1 = 0.9913). According to the ranking rules, the second and third-ranked teams are Airamatrix and ForbiddenFruit. In addition, Mann-Whitney U hypothesis tests were performed on these top three teams to assess the statistical significance of the differences in their Dice values. We found that the XxlzT team significantly outperformed the Airamatrix and ForbiddenFruit teams.

Fig. 4 shows the edges of the predicted optic disc of all participating teams and the segmentations of the top 3 teams (XxlzT, Airamatrix, and ForbiddenFruit) compared with the ground truth. The first two rows show AMD images, and the last two rows show non-AMD images. As can be seen from Fig. 4(B), all teams can segment the optic disc in both AMD and non-AMD samples. As can be seen from row 3, when the contrast of the optic disc edge decreases, the segmentations degrade to some extent, while the segmentations of the top 3 teams remain relatively robust.
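
The statistical comparison mentioned above can be reproduced in outline with SciPy; the per-image Dice lists below are placeholders, not the actual challenge results.

```python
from scipy.stats import mannwhitneyu

# Per-image Dice values of two teams on the same test images (placeholder numbers).
dice_team_a = [0.95, 0.96, 0.94, 0.97, 0.93, 0.96]
dice_team_b = [0.94, 0.95, 0.92, 0.96, 0.91, 0.93]

stat, p_value = mannwhitneyu(dice_team_a, dice_team_b, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")  # a small p-value indicates a significant difference
```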

Rank Team Additional Dataset Architecture Ensemble Post-processing Loss
1 VUNO EYE TEAM - U-Net with EfficientNet as encoder Self-ensemble: concatenate 15 finding feature maps as encoder output - CE
2 Zasti_AI - U-Net with Residual blocks as encoder - Lesion area constraint CE
3 WWW - DeepLab-v3 with ResNet Ensemble 2 models Region fill CE
4 Airamatrix DIARETDB1 DeepLab-v3 with Xception - - CE
5 Forbidden Fruit - FPN with EfficientNet Ensemble 2 models with designed formation AMD score constraint Focal loss, Dice loss
6 Muenai_Tim - Nested U-Net, FPN, DeepLab-v3 with EfficientNet-B7 and -B3 as backbone Ensemble different models using majority voting - CE, Dice loss
8 ADAM-TEAM - U-Net with Inception-v3, EfficientNet-B3, ResNet50, DenseNet101 as encoder Ensemble different models using averaging method Contour filling CE
9 TeamTiger - U-Net with EfficientNet-B0 as encoder - - Jaccard loss
10 XxlzT - ResNet50 for classification, DeepLab-v3+ for segmentation - - BCE
Table 8: A brief summary of the challenge methods on the detection and segmentation of lesions task.
Team Drusen Exudate Hemorrhage Scar Other Rank
Dice F1 Dice F1 Dice F1 Dice F1 Dice F1
VUNO EYE TEAM 0.4838 0.6316 0.4154 0.5688 0.4303 0.7307 0.4051 0.7027 0.2852 0.0714 1
Zasti_AI 0.5549 0.4972 0.4337 0.4965 0.2400 0.3614 0.5466 0.4598 0.2668 0.0290 2
WWW 0.4836 0.4018 0.3174 0.5581 0.2190 0.6000 0.5807 0.7273 0.0344 0.1176 3
Airamatrix 0.3518 0.5674 0.2606 0.4673 0.2257 0.2466 0.4080 0.6500 0.6906 0.1818 4
ForbiddenFruit 0.4007 0.5443 0.2866 0.5155 0.2079 0.8293 0.5639 0.8511 0.1224 0.0085 5
Muenai_Tim 0.4483 0.5535 0.2140 0.4634 0.2164 0.6038 0.4248 0.6957 0.1595 0.0910 6
ADAM-TEAM 0.3260 0.1986 0.3256 0.1785 0.1815 0.1093 0.4038 0.1002 0.0698 0.0100 8
TeamTiger 0.3260 0.1982 0.3256 0.1777 0.1815 0.1087 0.4038 0.0998 0.0698 0.0100 9
XxlzT 0.0157 0.0556 0.1593 0.4096 0.0186 0.0690 0.1014 0.2963 0.0000 0.0000 10
Table 9: Evaluation of the detection and segmentation of lesions task, in terms of Dice and F1.
Figure 6: Illustration of the lesion segmentation results of the top 3 teams. (A) original fundus images with the edges of the segmented regions; (B)-(D) the segmentation masks of the top 3 teams compared with the ground truth (green: true positive, blue: false negative, red: false positive).

3.3 Localization of fovea

The purpose of this task is to predict the coordinates of the fovea. Table 6 shows the methods of the participating teams for this task. According to the type of network, these methods can be divided into regression network-based, generative network-based, segmentation network-based, and object detection network-based approaches. Further details about these methods are provided in the supplemental material. The evaluation in terms of ED is shown in Table 7. The VUNO EYE TEAM achieved the best performance with an ED of 18.5538 pixels. The ForbiddenFruit and Voxelcloud teams achieved the second and third best performances with EDs of 19.7074 and 25.2316 pixels, respectively. The statistical significance of the differences in performance of the top 3 teams was assessed by Mann-Whitney U hypothesis tests. The differences in the ED values achieved by the VUNO EYE TEAM were statistically significant with respect to the Voxelcloud team, but not with respect to the ForbiddenFruit team. Fig. 5 shows four examples of the fovea localization results of the participating teams. Samples (A) and (B) are AMD images, and the remaining samples are non-AMD images. As can be seen, the 10 participating teams obtain better fovea localization results when the macular region has higher contrast and cleaner texture, regardless of whether the image is AMD or non-AMD.

3.4 Detection and segmentation of lesions

The objective of this task is to detect whether the images contain drusen, exudate, hemorrhage, scar, or other lesions; detected lesions should then be segmented at pixel level. A summary of the methods provided by the participating teams is shown in Table 8. Among these methods, U-Net, DeepLab-v3, and FPN are the most commonly used networks. The interested reader can refer to the supplemental material for further details. Table 9 provides the detailed evaluation in terms of Dice and F1. VUNO EYE TEAM, Zasti_AI, and WWW achieved the top three performances according to the ranking rule in Section 2.2. In particular, for drusen and exudate segmentation, Zasti_AI obtained the best results with Dice values of 0.5549 and 0.4337, while for drusen and exudate detection, VUNO EYE TEAM achieved the best scores with F1 values of 0.6316 and 0.5688. This illustrates that the method of VUNO EYE TEAM had advantages in detecting drusen and exudate lesions from the fundus images, while the detailed segmentation of these lesions could still be improved. For hemorrhage segmentation and detection, VUNO EYE TEAM had the best performance with a Dice of 0.4303 and an F1 of 0.7307. For scar segmentation and for the other lesion category, the WWW and Airamatrix teams obtained the best performance, respectively.

Fig. 6 shows the lesion segmentation results of the top 3 teams. Fig. 6(A) shows five samples with the edges of the segmented regions. Figs. 6(B-D) show the segmentations of the VUNO EYE TEAM, Zasti_AI, and WWW teams compared with the ground truth. The five columns correspond to the segmentation results of drusen, exudate, hemorrhage, scar, and others. Green, red, and blue regions are true-positive, false-positive, and false-negative pixels, respectively. From the first column, we can see that the Zasti_AI team produced the best result, while VUNO EYE TEAM and WWW produced more false-positive pixels. For the exudate segmentation in the figure, WWW provided more true-positive pixels but also more false-positive pixels, while VUNO EYE TEAM and Zasti_AI produced more false-negative pixels. The hemorrhage lesions are widely distributed across the fundus image, and all three methods failed to extract the hemorrhage completely; in the shown sample, the Zasti_AI result also contained some false-positive pixels. For the scar lesion, the result of the WWW team was the best, which is consistent with the quantitative results in Table 9. From the results for the other lesion category, we can see that all top three teams produced false-negative pixels, especially the WWW team.

4 Discussion

In general, training datasets are important for deep models with supervised learning because their size and distribution directly affect the prediction and generalization abilities of the model (shorten2019survey; geirhos2020shortcut). In addition, clinical knowledge is crucial to the application of deep learning in medical image processing, and ensembling is an effective performance-enhancing strategy for deep learning applications. Hence, in this section, we discuss the impact of ensemble methods, imbalanced classes in the dataset, additional datasets, and clinical prior knowledge on the ADAM challenge and clinical practice.

4.1 Discussion of ensemble

AMD Classification: AUC 0.9702
Disc Detection and Segmentation: Dice 0.9519, F1 0.9930
Fovea Localization: ED 17.8173
Lesion Detection and Segmentation:
  Drusen: Dice 0.5338, F1 0.5714
  Exudate: Dice 0.3874, F1 0.6263
  Hemorrhage: Dice 0.3121, F1 0.6667
  Scar: Dice 0.5314, F1 0.8000
  Others: Dice 0.1481, F1 0.0909
Table 10: Ensemble results of the top 3 teams in each task

Each neural network model has its own parameters and characteristics, and different models produce different errors on the same test set; that is, there is variance between the models. Ensemble methods exploit the complementary advantages of different models to offset this variance by training multiple models and combining their predictions, so that the results are less sensitive to the training details and to chance (brownlee2018ensemble). The models in an ensemble are generally constructed in one of three ways: first, using the same model with different training data, as adopted by VUNO EYE TEAM in tasks 1 and 4; second, using different models with the same training data, as in the methods designed by the ForbiddenFruit team in task 1 and VUNO EYE TEAM in task 2; and third, using the same model with the same training data but with different training details. The first construction method requires more training data, and the third may cause highly correlated errors since the models learn similar mapping functions; hence, most participating teams chose the second method in the challenge. After obtaining the predictions of each model on the test set, the main strategies for combining them are averaging and majority voting. Task-specific combination strategies are also encouraged, such as the one designed by the ForbiddenFruit team in task 1.
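
As a minimal sketch of the averaging strategy mentioned above, applicable both to scalar classification probabilities and to pixel-wise segmentation probabilities, with illustrative numbers only:

```python
import numpy as np

def ensemble_average(predictions):
    """Average per-model predictions: scalar AMD probabilities or pixel-wise
    probability maps (all inputs must share the same shape)."""
    return np.mean(np.stack([np.asarray(p, dtype=float) for p in predictions]), axis=0)

# Classification: average the scalar AMD probabilities of three models.
amd_score = ensemble_average([0.81, 0.74, 0.90])
print(round(float(amd_score), 4))
# Segmentation: average the pixel-wise probability maps and then binarize,
# or apply pixel-wise majority voting to the binarized masks of each model
# (the same voting scheme sketched for the annotation fusion in Section 2.1).
```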

In the classification task of our challenge, the top 5 teams all used ensemble methods, and the top 3 teams used at least 5 models. It is worth noting that the 1st team used a proprietary dataset with 15 types of lesion annotations and trained 15 models whose extracted lesion features are associated with AMD, thus enabling better AMD classification by ensembling these models. In the fovea localization task, ensembles were used by the top 3 teams: the 1st team used 2 models, the 2nd team used 6 models, and the 3rd team used 20 models. The first and third methods were supplemented with additional vascular information, while the second method relied only on ensemble strategies. It can be inferred that, once the number of models exceeds a certain level, the performance of the ensemble decreases as the number increases, so an appropriate number of models should be selected according to the task and training data. In addition to the ensemble strategy, the ability of the designed model is very important. For example, in the optic disc segmentation task, the 1st team exploited the relationship between the positions of the optic disc and the macula in fundus images, and the 2nd team adopted a coarse-to-fine strategy. Although the 3rd team used an ensemble, their models segmented the disc directly from the original image, so the effect was not as good as the above two strategies.

We designed an experiment to combine the results of the top 3 teams in each task: the ensemble strategy for the AMD classification and fovea localization tasks is averaging the probabilities and coordinates, and the ensemble strategy for the disc and lesion segmentation tasks is majority voting. Table 10 shows the experimental results; in the original table, green values indicate that the ensemble result is better than that of the 1st team, and red values indicate the opposite. It can be seen that the base models used for the ensemble are very important to the result, because poorly performing models can interfere with well-performing models during ensembling, making the final result worse.

Teams           Early AMD (24)   Middle AMD (9)   Advanced AMD-dry (3)   Advanced AMD-wet (53)
VUNO EYE TEAM   0.9159           0.9943           0.9861                 0.9917
ForbiddenFruit  0.9090           0.9964           0.9743                 0.9748
Zasti_AI        0.8901           0.9882           0.9914                 0.9818
Muenai_Tim      0.8268           0.9877           0.97535                0.9809
ADAM-TEAM       0.8284           0.9775           0.9753                 0.9632
WWW             0.8261           0.9578           0.9603                 0.9502
XxlzT           0.7661           0.9557           0.9700                 0.9635
TeamTiger       0.8312           0.9500           0.9539                 0.9341
Airamatrix      0.7186           0.9023           0.9855                 0.9511
Table 11: AMD Classification (AMD and non-AMD classes) results of the samples with different severities.

4.2 Discussion of imbalanced classes

Additional dataset | Tasks | Teams | Annotators | Devices | Involved ocular diseases
ODIR | AMD classification | ForbiddenFruit, Airamatrix | - | Canon, Zeiss, Kowa | diabetic retinopathy, glaucoma, cataract, AMD, pathologic myopia, hypertensive retinopathy
REFUGE1 | optic disc segmentation, fovea localization | XxlzT, VUNO EYE TEAM, Zasti_AI | 7 | Zeiss | glaucoma
IDRiD | optic disc segmentation, fovea localization | XxlzT, WWW, VUNO EYE TEAM, Voxelcloud | - | Kowa | diabetic retinopathy
RIGA | optic disc segmentation | WWW, VUNO EYE TEAM | 6 | Topcon, Canon | glaucoma
PALM | optic disc segmentation, fovea localization | VUNO EYE TEAM | 7 | - | high myopia
ARIA | fovea localization | Voxelcloud | - | Zeiss | AMD, diabetic retinopathy
DIARETDB1 | lesion segmentation (exudate) | Airamatrix | 4 | Zeiss | diabetic retinopathy
fundus 10k | lesion segmentation (scar) | Airamatrix | - | - | diabetic retinopathy
Table 12: Introduction of the additional datasets used by the participating teams in the four tasks.

To examine the binary AMD classification performance of each model on samples of different severities, we mixed the AMD samples of each severity with the non-AMD samples and ran binary AMD classification experiments. The results in terms of AUC are shown in Table 11. As can be seen from the table, compared with the middle, advanced dry, and advanced wet AMD samples, the early AMD samples were difficult to recognize as AMD due to the lack of obvious pathological features. Teams should pay more attention to these difficult samples, which can be achieved, for example, by increasing the loss weights of such samples. Since our task was to divide the samples into AMD and non-AMD categories, correctly identifying middle AMD and advanced dry AMD as the AMD category was not hampered by their small sample sizes.
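
One way to pay more attention to the difficult samples, as suggested above, is to up-weight their contribution to the loss. The PyTorch snippet below is a hedged sketch with arbitrary weights and random logits; it is not a method used by any participating team.

```python
import torch
import torch.nn as nn

# Per-sample weights: e.g., give early-AMD images a larger loss weight than the rest.
logits = torch.randn(4, 2, requires_grad=True)        # stand-in for model outputs (batch of 4)
labels = torch.tensor([1, 0, 1, 0])                    # 1 = AMD, 0 = non-AMD
sample_weights = torch.tensor([2.0, 1.0, 1.0, 1.0])    # first image is a hard (early-AMD) case

criterion = nn.CrossEntropyLoss(reduction="none")      # keep the per-sample losses
loss = (criterion(logits, labels) * sample_weights).mean()
loss.backward()  # in a real training loop this back-propagates through the network
```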

In the dataset released in this challenge, besides the imbalance of AMD samples across severities, it can be seen from Table 1 that the distribution of samples with specific lesions was also imbalanced. In the lesion detection and segmentation task, the target lesions were subdivided into five categories: drusen, exudate, hemorrhage, scar, and others, resulting in a small number of positive samples for each category. As can be seen from the comparison of Tables 9 and 5, the Dice of the optic disc segmentation results obtained by all teams in the final was above 0.9, and the F1 score of the detection results was above 0.96. In contrast, in the lesion detection and segmentation task, the highest Dice value was 0.6906, obtained by the Airamatrix team for segmentation of the "other" lesions, and the highest F1 score was 0.8293, obtained by the ForbiddenFruit team for detection of hemorrhage. In the segmentation and detection of the "other" lesion category, which has the fewest positive samples in the dataset, the XxlzT team did not detect any correct positive cases, so its F1 and Dice values were 0. There are three main reasons for these poor performances: 1) there are few positive samples for the models to learn from; 2) the characteristics of different lesions are similar and difficult to distinguish; and 3) the lesion morphology is irregular. In subsequent studies, we will supplement the positive samples of the lesions to alleviate the sample imbalance. In addition, data augmentation can be applied to the small classes, and a patch-based model training strategy can also be adopted for this task.

4.3 Discussion of additional datasets

Table 12 shows the additional datasets used by the participating teams in our challenge. The table lists the devices, the ocular diseases involved, and the number of annotators, all of which may influence the effectiveness of machine learning models. As can be seen from the table, the ocular diseases involved in these datasets include not only AMD but also diabetic retinopathy, glaucoma, cataract, pathologic myopia, hypertensive retinopathy, etc. Half of the additional datasets did not provide information about the annotators, in particular their number, which affects the quality of the gold standard and, in turn, the model learning. The data collection devices come from four manufacturers: Canon, Zeiss, Kowa, and Topcon. Fig. 7 shows the data distribution of the datasets used in each task. In the figure, red and blue points respectively represent the training set and the onsite set of the ADAM dataset, and the datasets corresponding to the other colors can be found in the legend on the right side of Fig. 7. It can be seen from the figure that the additional datasets expand the distribution range of the training data, which is conducive to improving the generalization ability of the models. However, the distribution of the online set is similar to that of the original training set, so it is difficult to use our online set to reflect the generalization ability of the models. Comparisons of the models trained with and without additional datasets follow.

Figure 7: Data distribution of the datasets used in each task by t-SNE dimension reduction (van2008visualizing). (A) Classification of AMD and non-AMD task; (B) Detection and segmentation of optic disc task; (C) Localization of fovea task; (D) Detection and segmentation of lesions task.
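
A distribution plot of the kind shown in Fig. 7 can be produced along the following lines; the feature extraction step (random placeholder features here) is a simplifying assumption, since the paper does not specify which image features were embedded.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# One feature row per image (e.g., flattened resized images or CNN embeddings)
# and a source label per image; both are placeholders here.
features = np.random.rand(300, 512)
sources = np.repeat(["ADAM-train", "ADAM-onsite", "ODIR"], 100)

embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)

for name in np.unique(sources):
    idx = sources == name
    plt.scatter(embedded[idx, 0], embedded[idx, 1], s=6, label=name)
plt.legend()
plt.title("t-SNE of image features, colored by source dataset")
plt.show()
```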

In the AMD classification task, the ForbiddenFruit team used the additional ODIR dataset (https://odir2019.grand-challenge.org/dataset/). The ROC curves of the classification results obtained by the ForbiddenFruit team with and without this dataset are shown in Fig. 8, and the corresponding AUC values are 0.9592 and 0.9579. We can see that the use of the additional dataset had very little effect for the ForbiddenFruit team in the AMD classification task. In the optic disc detection and segmentation task, XxlzT, WWW, VUNO EYE TEAM, and Zasti_AI used additional datasets, such as IDRiD (porwalIDRiDDiabeticRetinopathy2020a), REFUGE (orlandoREFUGEChallengeUnified2020a), RIGA (almazroaRetinalFundusImages2018), and PALM (PALM). The evaluation of the models obtained by these four teams trained with and without the additional datasets is shown in Fig. 9. The mean Dice coefficients of the segmentations of the XxlzT, WWW, VUNO EYE TEAM, and Zasti_AI teams without the additional datasets decreased by 1.02%, 1.98%, 0.61%, and 1.94%, respectively. In fovea localization, VUNO EYE TEAM and Voxelcloud used additional datasets including IDRiD, REFUGE, PALM, and ARIA (chea2021classification), and Voxelcloud also used a proprietary dataset. As can be seen from Fig. 10, the ED results of VUNO EYE TEAM and the Voxelcloud team without the additional datasets are 24.9309 and 48.8327 pixels, i.e., the errors increased by 34.37% and 93.54% compared with the models trained with the additional datasets, respectively. In the lesion detection and segmentation task, the Airamatrix team used the additional datasets DIARETDB1 (https://www.it.lut.fi/project/imageret/diaretdb1/) and fundus10k (https://github.com/li-xirong/fundus10k) when dealing with the exudate and scar lesions. Table 13 shows the performance of the models trained under the two conditions. For exudate and scar segmentation and detection, the Dice values of the models without the additional datasets were reduced by 1.2% and 1.8%, and the F1 scores were reduced by 2.74% and 6.97%, respectively.

Figure 8: ROC curves of the classification results obtained by Forbiddenfruit team training their model with and without additional dataset ODIR.
Figure 9: The Dice evaluation results of the optic disc segmentation models obtained by the four teams trained with and without the additional dataset.
Figure 10: The ED evaluation results of fovea localization predicted by the models trained with and without additional datasets by VUNO EYE TEAM and Voxelcloud team.
Additional datasets   Exudate           Scar
                      Dice      F1      Dice      F1
With 0.2606 0.4673 0.4080 0.6500
Without 0.2574 0.4545 0.4006 0.6047
Table 13: Evaluation of the results obtained by Airametrix team training the model with and without the additional datasets in the detection and segmentation of exudate and scar.

The experimental results indicate that, on the challenge online set, the use of additional datasets did not sharply improve model performance except in the fovea localization task. Additional datasets can expand the distribution of the training data, as shown in Fig. 7; however, the data distribution of our test set was similar to that of the training set, so the results on this set cannot reflect the generalization ability of the models. Besides enriching the distribution of the existing training set, Fig. 7 also shows that the additional datasets may supplement the number of samples within the distribution of the existing training set, which may help improve predictions on samples with such a distribution. However, for the AMD classification and optic disc segmentation tasks, the original training data are already sufficient for the models to find good parameters, so the additional data play only a small role. In contrast, the additional data are important for the fovea localization task, because it is a regression task, which is more sensitive to the loss value during training than a classification task. For example, prediction probabilities of 0.7 and 0.99 can be regarded as the same category in a classification task, whereas they differ by 0.29 in a regression task. Hence, compared with a classification task, a regression task requires a larger sample size. This explains our experimental observation that, in the fovea localization task, the models trained with additional datasets achieved the most obvious improvement. For the lesion detection and segmentation task, there were very few positive samples in the original training set. In theory, adding extra datasets should improve model performance, but the experimental results did not show this. We speculate that the main reason is that different datasets have inconsistent labeling standards for lesions. Therefore, in future work, we need to collect a large amount of widely distributed data, acquired by different devices and annotated with uniform labeling standards, together with more complete disease statistics, so as to support the development of algorithms with strong generalization ability and of diagnostic methods for cases with multiple ocular disorders.

4.4 Clinical discussion

Although the approaches used by the participating teams were all based on deep learning frameworks, many teams incorporated clinical prior knowledge for specific tasks. For example, the Airamatrix team designed a joint learning method for optic disc segmentation and fovea localization by using the relationship between the positions of the optic disc and the macula in fundus images, and their optic disc segmentation results ranked first. In the fovea localization task, the VUNO EYE TEAM (1st) and the Voxelcloud team (3rd) used additional fundus vessel information in their networks, considering that the shape of the vessels provides a strong prior for fovea localization. In addition, the presence of lesions is closely associated with AMD, so the ForbiddenFruit team incorporated the predicted disease risk probability into the segmentation process. Such clinical prior knowledge can guide or assist the network to learn better for specific tasks. Therefore, an evolving branch of deep learning is to explore how clinical knowledge can be used to enable models to learn features more accurately and thus achieve better performance.

The tasks of optic disc detection and segmentation, fovea localization, and lesion detection and segmentation designed by the ADAM challenge are the basis on which clinicians diagnose AMD. In the optic disc detection and segmentation task, the best performance was obtained by the XxlzT team with a Dice coefficient of 0.9486 and an F1 score of 0.9913. In the fovea localization task, VUNO EYE TEAM provided the best performance with an ED of 18.6 pixels, which is about one-tenth of the radius of the optic disc. These results can provide ophthalmologists with useful information about the position of the fundus structures. The lesion detection and segmentation results obtained by the learning methods are still far from clinical application, due to the small number of samples and the variable characteristics of the lesions. However, for the AMD classification task, thanks to the plentiful features provided by the large dataset and the strong performance of deep learning-based methods, the best method, provided by the VUNO EYE TEAM, achieved an AUC of 0.9714, which can assist decision making to a certain extent. This is a promising start, but further improvements are needed before these methods can be used in clinics. In the future, we hope that automatic AMD detection methods will consider the size and location of the actual lesions with respect to the AREDS grid and standard circles, which would make the automatic AMD diagnosis results more interpretable.

Table 14 compares the manual annotations of optic disc segmentation and fovea localization provided by the 7 independent ophthalmologists with the ground truths. As can be seen from the table, in the optic disc segmentation task, the disc regions labeled by user 7 were closest to the ground truth: all optic discs present in the 400 images were correctly detected, and the Dice value reached 0.9439. The disc masks labeled by user 3 differed the most from the ground truth, with a Dice value of 0.7469. Compared with Table 5, we can see that the automated methods proposed by the XxlzT and Airamatrix teams achieved better results than the manual labeling of user 7, and all automatic methods proposed by the finalists produced better segmentation results than user 3. In the fovea localization task, the best manual labeling was provided by user 3 with an ED of 21.1393 pixels, and the worst by user 5 with an ED of 62.8048 pixels. Compared with Table 7, the automated methods proposed by the VUNO EYE TEAM and ForbiddenFruit teams achieved better results than the manual labeling of user 3, and the methods of the top six finalists produced better localization results than user 5. This demonstrates that machine-learning models can achieve the same or even higher accuracy than ophthalmologists in the optic disc segmentation and fovea localization tasks.

         Disc Segmentation          Fovea Localization
         Score       DICE           ED (pixels)
user1    0.9982      0.8685         30.8106
user2    1.0000      0.8991         28.0984
user3    0.9965      0.7469         21.1393
user4    1.0000      0.8979         26.3184
user5    0.9982      0.9068         62.8048
user6    0.9982      0.8553         29.9480
user7    1.0000      0.9439         46.5350
Table 14: The comparison of the manual labels annotated by 7 ophthalmologists and the ground truth in the tasks of optic disc segmentation and fovea localization.

5 Conclusion

In this paper, we summarized the methods, results, and findings of the ADAM challenge. We analyzed and compared the performances of the participating teams on each task in the onsite edition of the ADAM challenge at ISBI 2020. Deep learning-based methods play an important role in fundus image analysis, and the corresponding results show promising performance in the optic disc segmentation task, the fovea localization task, and the AMD classification task. We observed that the ensemble strategy utilized by many teams improves performance by integrating multiple models. Moreover, clinical prior knowledge, such as the spatial relationship among blood vessels, optic disc, and fovea in fundus images, and the positive correlation between lesion detection and AMD classification, can be used to design models and enhance their performance. In addition, combining segmentation and regression networks can also improve segmentation or localization results. We also analyzed the performances of models trained with only the ADAM dataset and of those trained with additional datasets. However, since the distribution of the online set is similar to that of the original training set, it is difficult to use our online set to assess the generalization ability of the models trained with additional datasets.

The ADAM challenge encourages researchers to take a more holistic approach to analyzing the structures related to AMD and predicting AMD from fundus images. A large dataset with comprehensive labels of fundus structures, lesions, and disease categories, as well as an evaluation framework, is publicly accessible through the website at https://amd.grand-challenge.org/. However, medical data are considered sensitive and are subject to particularly strict rules. To be able to trace the use of our data, we require researchers to first apply to join the ADAM challenge and to provide accurate and complete personal information (name, institution, and e-mail). After the participation request has been approved, researchers can freely download the data at https://amd.grand-challenge.org/download/. If there is any problem with the download process, they can email the organizers. Future participants are welcome to use the dataset to develop more robust and novel algorithms and to submit their results on the challenge website.

Appendix A. Summary of Challenge Solutions

This appendix provides detailed descriptions of the methods of the participating teams in four tasks.

Appendix A.1 Classification of AMD and non-AMD images task

A brief summary of the methods adopted by the participating teams is shown in Table 2 of the main manuscript. The top 5 teams in this classification task utilized ensemble methods. Among them, the 1st and the 4th teams adopted a self-ensemble strategy, meaning that the integrated models had the same structure but different parameters. In particular, the VUNO EYE TEAM (1st) trained EfficientNet models (tanEfficientNetRethinkingModel2019) on a dataset with 15 types of lesion labels (hemorrhage, hard exudate, cotton wool patch, drusen, retinal pigmentary change, vascular abnormality, membrane, fluid accumulation, chorioretinal atrophy, choroidal lesion, myelinated nerve fiber, retinal nerve fiber layer defect, glaucomatous disc change, non-glaucomatous disc change, and macular hole), obtaining 15 finding models (sonDevelopmentValidationDeep2020). They then produced the AMD classification result by combining the feature maps from the 15 models and applying a fully connected layer with a Sigmoid activation function (as shown in Fig. 11). The Muenai_Tim team (4th) also used the EfficientNet architecture, which has been widely successful in computer vision and is memory efficient (tanEfficientNetRethinkingModel2019). They realized the self-ensemble by integrating the models with the lowest local losses encountered during training.

Figure 11: The framework of the VUNO EYE TEAM for AMD classification, where 15 finding networks were utilized to extract the image features.
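The fusion step described above, concatenating features from the finding networks and applying a fully connected layer with a Sigmoid, can be sketched as follows. This is a minimal illustration only: the class name, feature dimension, and the assumption of pooled feature vectors are ours, not the team's code.

```python
import torch
import torch.nn as nn

class FeatureFusionHead(nn.Module):
    """Sketch of fusing per-finding features into a single AMD probability."""

    def __init__(self, n_models=15, feat_dim=1280):
        super().__init__()
        # One fully connected layer on the concatenated features, as in Fig. 11.
        self.fc = nn.Linear(n_models * feat_dim, 1)

    def forward(self, feature_list):
        # feature_list: 15 tensors of shape (batch, feat_dim), one per finding network
        fused = torch.cat(feature_list, dim=1)
        return torch.sigmoid(self.fc(fused))  # AMD probability per image
```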

The other three teams took advantage of ensembles of multiple different models. The main difference among these approaches is the fusion strategy. A common ensemble strategy is averaging. The Zasti_AI team (3rd) and the ADAM-TEAM (5th) both adopted the averaging strategy to ensemble the outputs of several standard deep learning models (as shown in Fig. 12). The Zasti_AI team built 4 models with EfficientNet architectures (tanEfficientNetRethinkingModel2019), 2 models with ResNeXt structures (xieAggregatedResidualTransformations2017), 1 model with Inception-ResNet (szegedyInceptionv4InceptionResNetImpact2017), and 1 model with the SENet architecture (huSqueezeandExcitationNetworks2018). The ADAM-TEAM built 4 models with the Inception-v3, Xception, ResNet-50 (heDeepResidualLearning2016), and DenseNet-101 architectures, respectively. In these two methods, each model contributed equally to the final prediction. In contrast, the ForbiddenFruit team (2nd) designed an ensemble function in which different models contribute differently. They ensembled 5 models, namely EfficientNet-B2, -B4, -B3, -B7 (tanEfficientNetRethinkingModel2019) and DenseNet-201 (huangDenselyConnectedConvolutional2017), using the following function:

\[
\hat{y}(x) = \frac{\sum_{i=1}^{N} w_i\, f_i(x)}{\sum_{i=1}^{N} w_i} \qquad (7)
\]

where $x$ is the input image, $f_i$ represents the $i$-th model, and $w_i$ (positive integers) are the ensemble weights. The number of models $N$ is 5, and the weights were found experimentally: 3 for EfficientNet-B2, 2 for EfficientNet-B4, and 1 for each of the other three models.
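A minimal sketch of this weighted averaging, assuming each model outputs a scalar AMD probability (the probabilities below are illustrative, the weights are those reported above):

```python
import numpy as np

def weighted_ensemble(probs, weights):
    """Weighted average of per-model AMD probabilities (Eq. 7)."""
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * probs) / np.sum(weights))

# Five hypothetical model outputs combined with the reported weights (3, 2, 1, 1, 1).
p_amd = weighted_ensemble([0.91, 0.88, 0.80, 0.76, 0.83], [3, 2, 1, 1, 1])
```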

Figure 12: The average ensemble framework of the Zasti_AI team and the ADAM-TEAM for AMD classification task.

The remaining four teams designed their frameworks without an ensemble strategy. The TeamTiger team (8th) directly utilized ResNet-101 for classification. To obtain more discriminative feature maps, the other teams developed strategies to improve the feature extraction modules. The WWW (6th) and Airamatrix (9th) teams pre-trained an EfficientNet-B7 and an EfficientNet-B4 on ImageNet, respectively, and then fine-tuned them on the clinical dataset. The XxlzT team (7th) proposed a self-supervised module to obtain the parameters of the encoder used to extract image features. As shown in Fig. 13, they extracted the grayscale image from the fundus image and used an encoder-decoder framework to generate the color information, which was then superimposed on the grayscale image to reconstruct the fundus image. In the classification task, they used the encoder with these shared parameters to extract features from the fundus images, followed by a fully connected layer to perform the classification.
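The self-supervised step can be illustrated with a minimal colorization-style pretext model, assuming a simple convolutional encoder-decoder; the layer sizes below are illustrative and do not reflect the XxlzT team's actual architecture.

```python
import torch
import torch.nn as nn

class ColorizationPretext(nn.Module):
    """Predict color information from a grayscale fundus image; the trained
    encoder is later reused for AMD classification."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, gray):
        # Generate the color component and superimpose it on the grayscale input.
        color = self.decoder(self.encoder(gray))
        return color + gray
```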

In addition, the training datasets of 7 teams were derived entirely from the ADAM dataset, while those of the ForbiddenFruit and Airamatrix teams additionally contained the ODIR dataset (https://odir2019.grand-challenge.org/dataset/). The loss functions adopted by all teams are based on cross-entropy (CE); in particular, the WWW team used a weighted cross-entropy loss (cuiClassBalancedLossBased2019) to address the data imbalance problem.
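In PyTorch, such a class-weighted cross-entropy can be set up as below; the weight values are illustrative, not those used by the WWW team.

```python
import torch
import torch.nn as nn

# Up-weight the minority (AMD) class; the values here are illustrative.
class_weights = torch.tensor([1.0, 4.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)            # batch of 8, classes: non-AMD / AMD
labels = torch.randint(0, 2, (8,))
loss = criterion(logits, labels)
```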

Figure 13: The framework of the XxlzT team for AMD classification, where the two encoders share the same parameters.

Appendix A.2 Detection and segmentation of optic disc task

The methods used by the participating teams are summarized in Table 4 of the main manuscript. Since this task involved both detection and segmentation of the optic disc, three teams adopted a multi-step strategy that first classifies the fundus images as containing or not containing a complete optic disc, and then segments the complete optic disc region. For the classification step, the XxlzT team (2nd) utilized ResNet, while the WWW team (4th) and the Muenai_Tim team (8th) used EfficientNet-B7 and EfficientNet-B0, respectively. In the segmentation step, all three teams adopted U-Net (ronnebergerUNetConvolutionalNetworks2015); for more precise results, the XxlzT team further refined the initial segmentation with a DeepLabV3 network (chenRethinkingAtrousConvolution2017), and the Muenai_Tim team used an ensemble of the U-Net variant U-Net++ and EfficientNet-B7.

Figure 14: The rule for estimating integrity of optic disc designed by the ForbiddenFruit team. (A) fundus image, (B) the camera’s field of view (red) and optic disc segmentation (yellow), (C) enlarged figure of the optic disc region and the intersecting length of the segmented optic disc and the edge of the camera’s field of view.

Half of the participating teams segmented the optic disc directly. Among them, three teams adopted U-Net; the differences between these methods lie in the encoder structure of the U-Net and the ensemble strategy. Specifically, the TeamTiger team (6th) and the Zasti_AI team (8th) adopted EfficientNet-B7 and ResNet architectures as encoders, respectively. Moreover, the Zasti_AI team used an adversarial training setting (shankaranarayanaJointOpticDisc2017) to improve the segmentation accuracy. The ADAM-TEAM (7th) used Inception-v3, ResNet-50, EfficientNet-B3, and DenseNet-101 as encoders in U-Net based architectures, and determined the segmentation result by averaging the predictions of these four models. Instead of the U-Net architecture, the ForbiddenFruit team (3rd) utilized two Feature Pyramid Networks (FPN) (linFeaturePyramidNetworks2017), one with EfficientNet-B0 and the other with EfficientNet-B2 as encoder. The outputs of the two FPNs were averaged and the resulting probability map was thresholded at 0.5. After segmentation, these four teams designed post-processing steps to remove incomplete optic discs. For example, the ForbiddenFruit team judged whether the optic disc was intact by measuring the intersection between the segmentation result and the border of the camera's field of view: as illustrated in Fig. 14, if this intersection was too long relative to the height of the segmentation mask, the mask was removed. The Zasti_AI team kept only the largest connected component in the segmentation map during post-processing and discarded it if its area was smaller than a threshold.
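A minimal sketch of that connected-component post-processing is shown below; the probability threshold and minimum area are illustrative, as the actual values used by the Zasti_AI team are not reported.

```python
import numpy as np
from scipy import ndimage

def keep_largest_disc(prob_map, threshold=0.5, min_area=500):
    """Keep the largest connected component of a thresholded disc map;
    discard it entirely if its area is below min_area."""
    binary = prob_map > threshold
    labeled, n = ndimage.label(binary)
    if n == 0:
        return np.zeros_like(binary, dtype=np.uint8)
    areas = ndimage.sum(binary, labeled, index=range(1, n + 1))
    largest = int(np.argmax(areas)) + 1
    if areas[largest - 1] < min_area:
        return np.zeros_like(binary, dtype=np.uint8)
    return (labeled == largest).astype(np.uint8)
```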

Notably, two teams incorporated other clinical information and used a multi-task strategy to segment the optic disc. Specifically, the Airamatrix team (1st) designed a framework to solve both the optic disc segmentation and fovea localization tasks. They converted the localization task into a segmentation task, so a joint training strategy could make the model segment the optic disc and fovea regions simultaneously. Fully convolutional networks (FCNs) with a ResNet-50 backbone were used for optic disc and fovea segmentation. In addition, they applied erosion to the detected masks to further improve the accuracy. The VUNO EYE TEAM (5th) incorporated the blood vessels during training. They took a fundus image and a vessel image as inputs, and the network consisted of two branches (see Fig. 15): one branch, an EfficientNet-B4, processed the fundus image, and the other branch, an EfficientNet-B0, operated on the vessel image. The penultimate feature maps of the fundus branch were concatenated to those of the vessel branch. In the decoder module, they up-scaled the feature maps using 1×1 convolutions, depth-wise separable convolutions (howardMobileNetsEfficientConvolutional2017), swish activation functions (ramachandranSearchingActivationFunctions2017), and depth-wise concatenation, and then scaled the feature maps to the same size as the input. The final segmentation layer is generated by a 1×1 convolution followed by a Sigmoid function. They imposed a loss weight of 0.1 on the vessel branch, as the vessel shape gives a good indication of the optic disc region, while the loss weight of the last layer was set to 1. In addition, the VUNO EYE TEAM used a snapshot ensemble approach to integrate the models obtained at epochs 62, 77, 93, 109, and 124 during training.
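The branch weighting can be expressed as a simple weighted sum of losses. This is a sketch under the assumption that both outputs are supervised with binary cross-entropy, which the description does not state explicitly; function and variable names are ours.

```python
import torch.nn as nn

bce = nn.BCELoss()

def joint_loss(final_pred, final_gt, vessel_pred, vessel_gt,
               w_final=1.0, w_vessel=0.1):
    """Weighted sum of the final-layer loss and the auxiliary vessel-branch
    loss, using the 1.0 / 0.1 weighting reported above."""
    return w_final * bce(final_pred, final_gt) + w_vessel * bce(vessel_pred, vessel_gt)
```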

In this task, half of the participating teams used additional datasets. In detail, three teams (XxlzT, VUNO EYE TEAM, and Zasti_AI) used the REFUGE dataset (orlandoREFUGEChallengeUnified2020a), three teams (XxlzT, WWW, and VUNO EYE TEAM) used the IDRiD dataset (porwalIDRiDDiabeticRetinopathy2020a), and two teams (WWW and VUNO EYE TEAM) used the RIGA dataset (almazroaRetinalFundusImages2018). The VUNO EYE TEAM also used the PALM dataset (fuIChallengePALMPAthoLogicMyopia2019). As for the loss functions, since the segmentation task can be regarded as a pixel-wise binary classification task, 6 teams used the BCE loss. In addition, the ADAM-TEAM and the Muenai_Tim team used the Dice loss, the ForbiddenFruit team used the Dice loss and Focal loss, and the TeamTiger team used the Jaccard loss.

Appendix A.3 Localization of fovea task

Figure 15: The framework of the VUNO EYE TEAM for the optic disc segmentation task, which takes the vessel segmentation mask as an additional input.

Table 6 of the main manuscript summarizes the methods of the participating teams in this task. The regression-based methods predicted the x and y coordinates directly. The TeamTiger team (8th) and the Muenai_Tim team (7th) directly used EfficientNet-B7 to predict the coordinates; before the regression, the Muenai_Tim team additionally used EfficientNet-B0 to determine whether the fundus image contained the fovea, and set the coordinates to (0, 0) if it did not.

The Zasti_AI team (5th) created distance maps using Euclidean distance transforms, converting the coordinate regression problem into an image generation problem: a generative adversarial network (shankaranarayanaJointOpticDisc2017) generated the distance map from the fundus image. Finally, they clustered the top one percent of intensities, segmented out the largest cluster, and took its centroid as the fovea coordinate.
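A minimal sketch of that post-processing, assuming the network output is a 2D map whose highest intensities surround the fovea (variable names and the connected-component implementation are illustrative):

```python
import numpy as np
from scipy import ndimage

def fovea_from_distance_map(dist_map):
    """Threshold the top 1% of intensities, keep the largest cluster,
    and return its centroid as (x, y)."""
    candidates = dist_map >= np.percentile(dist_map, 99)
    labeled, n = ndimage.label(candidates)
    if n == 0:
        return None
    sizes = ndimage.sum(candidates, labeled, index=range(1, n + 1))
    largest = int(np.argmax(sizes)) + 1
    cy, cx = ndimage.center_of_mass(labeled == largest)
    return float(cx), float(cy)
```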

The VUNO EYE TEAM (1st) designed a single-pixel segmentation mask for the fovea, together with two deviation masks along the x and y axes to handle the size mismatch between the output features and the input images. They then used a U-Net to predict these three maps. Specifically, they utilized the same network architecture designed for task 2, except that the last layer consisted of a confidence map, an x-offset map, and a y-offset map. Similar to the loss function in task 2, they also applied losses to the offset maps generated from the final feature maps of the vessel branch.

In addition to the coordinate information, six teams exploited region information about the fovea. Four of them trained segmentation networks guided by binary fovea masks, where the foreground was a circle centered at the fovea with a radius set according to the optic disc radius or the image height. The other two teams converted the coordinate prediction problem into an object detection problem by creating detection box labels from the given coordinates. In detail, the ForbiddenFruit team (2nd) designed a regression branch and a segmentation branch. In the regression branch, three models were built with VGG-19, Inception-v3, and ResNet-50, respectively. Besides, an additional Inception-v3 model was trained with a more realistic default location, (1.25, 0.5), for images in which the fovea is invisible. When the latter model disagreed with the others by more than 0.5 along the x coordinate, the fovea was considered invisible; otherwise, the predictions of the four models were averaged to form the final regression result. In the segmentation branch, a circle centered on the fovea, with a diameter equal to 5% of the image height, was used as ground truth, and three FPNs with EfficientNet-B0, -B1, and -B2 encoders were trained. At inference, images were first processed by the segmentation branch: if a fovea was detected, its centroid was used as the fovea location prediction; if not, the image was processed by the regression branch for a less precise but more robust estimate.
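The segmentation-first decision rule can be sketched as below, assuming the segmentation branch outputs a binary mask and the regression branch an (x, y) pair; the disagreement check among the regression models is omitted for brevity.

```python
import numpy as np

def predict_fovea(seg_mask, reg_xy):
    """Use the segmentation centroid when a fovea region is found,
    otherwise fall back to the regression estimate."""
    ys, xs = np.nonzero(seg_mask)
    if len(xs) > 0:
        return float(xs.mean()), float(ys.mean())
    return reg_xy
```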

Figure 16: The multiple-block regression strategy proposed by the WWW team, which divides the image into multiple blocks and one-hot-encodes each block.

The WWW team (6th) also proposed a framework with a regression branch and a segmentation branch. In the regression branch, they divided the image into multiple blocks and one-hot-encoded each block, as shown in Fig. 16. They then used a ResNet-50 to determine which block contains the fovea, and repeated the same prediction within the selected block; after three iterations they obtained the final coordinate value. In the segmentation branch, they trained two U-Net models, one taking the original image at a size of 1024×1024 and the other taking its local differential filter image as input, and additionally trained a Mask R-CNN at an image size of 512×512 to retain better spatial information; the centroid of the segmentation was taken as the prediction. Finally, the result for the fovea localization task was obtained by averaging the four results produced by the above models.
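The coarse-to-fine block search can be sketched as follows; `predict_block` stands in for the trained ResNet-50 classifier, and the 4×4 grid is an illustrative choice since the exact grid size is not reported.

```python
def locate_fovea_by_blocks(image, predict_block, n_blocks=4, steps=3):
    """Iteratively zoom into the block predicted to contain the fovea and
    return the centre of the final block as the coordinate estimate."""
    h, w = image.shape[:2]
    y0, x0 = 0, 0
    for _ in range(steps):
        # Index of the predicted cell in the current n_blocks x n_blocks grid.
        idx = predict_block(image[y0:y0 + h, x0:x0 + w])
        row, col = divmod(idx, n_blocks)
        h, w = h // n_blocks, w // n_blocks
        y0, x0 = y0 + row * h, x0 + col * w
    return x0 + w / 2.0, y0 + h / 2.0
```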

The Voxelcloud team (3rd) considered that a rough fovea location could be estimated from the corresponding vessel mask. They therefore proposed a fovea localization regression framework containing an image-stream and a vessel-stream structure, as shown in Fig. 17 (C). The image-stream and the vessel-stream take the fundus image and the vessel segmentation mask as inputs, respectively, and they trained six and four different models using 6-fold and 4-fold cross-validation. The architectures of these ten models were based on Nested U-Net. Finally, the fovea regression probability maps are averaged to generate ensemble score maps. During testing, if the maximal value in the ensemble score map of the image-stream is greater than 0.05, the fovea location is computed from the image-stream ensemble; otherwise, it is computed from the vessel-stream ensemble if the maximal value in its ensemble score map exceeds 0.4; if neither condition holds, the fovea is considered absent from the image.
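That decision rule can be sketched as follows, taking the peak of the chosen score map as the location estimate; the team's exact decoding of the score map may differ.

```python
import numpy as np

def two_stream_decision(img_score_map, vessel_score_map, t_img=0.05, t_vessel=0.4):
    """Prefer the image-stream ensemble, fall back to the vessel-stream,
    else report that the fovea is absent."""
    if img_score_map.max() > t_img:
        y, x = np.unravel_index(np.argmax(img_score_map), img_score_map.shape)
    elif vessel_score_map.max() > t_vessel:
        y, x = np.unravel_index(np.argmax(vessel_score_map), vessel_score_map.shape)
    else:
        return None
    return float(x), float(y)
```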

In addition to the supplementary vessel information, the Airamatrix team (4th) considered the relationship between the optic disc and the fovea, and thus performed fovea segmentation jointly with optic disc segmentation. In their framework, the encoder contains a ResNet-50 model with identity blocks, while the decoder is the same as in the FCN-8s network. The fovea was localized by taking the centroid coordinates of the fovea segmentation mask. The ADAM-TEAM (10th) and the XxlzT team (11th) both manually created fovea region labels from the given coordinates and adopted common object detection frameworks, YOLO-v2 and Faster R-CNN, to detect target boxes covering the fovea region; finally, they converted the output boxes into the corresponding x and y coordinates.

Figure 17: The framework of the Voxelcloud team. (A) fundus image with a fovea point, (B) vessel segmentation mask with a fovea point, (C) the fovea segmentation framework proposed by the Voxelcloud team containing an image-stream and a vessel-stream.

As introduced above, two teams used vessel segmentation information. In detail, the VUNO EYE TEAM used a GAN-based model to segment the retinal vessels; similarly, the Voxelcloud team generated the vessel masks with a U-Net with GAN regularization. The top 3 teams and the 6th team used ensemble methods, combining different models to achieve better results. In the training process, the VUNO EYE TEAM, Voxelcloud, and ADAM-TEAM used additional datasets, including IDRiD, REFUGE, PALM, and ARIA (chea2021classification); Voxelcloud also used a proprietary dataset. The losses for training the regression networks were the common MSE and MAE losses, and those for the segmentation networks were the common BCE losses. The ADAM-TEAM and XxlzT team used the losses built into the YOLO-v2 and Faster R-CNN frameworks, such as the IoU loss and Smooth L1 loss.

Appendix A.4 Detection and segmentation of lesions task

The methods of the participating teams are summarized in Table 8 of the main manuscript. Five teams considered the U-Net architecture or its variants. The VUNO EYE TEAM (1st), the Zasti_AI team (2nd), the ADAM-TEAM (7th), and the TeamTiger team (8th) utilized the U-Net architecture with different encoders. Specifically, the Zasti_AI team used residual blocks, the TeamTiger team used EfficientNet-B0, and the ADAM-TEAM used Inception-v3, EfficientNet-B3, ResNet-50, and DenseNet-101 as encoders, respectively. Moreover, the ADAM-TEAM averaged the predictions of its models to improve the segmentation performance. For fine segmentation of tiny lesions, the TeamTiger team extracted patches (256×256) from each fundus image with a stride of 30 percent for training. In addition, the VUNO EYE TEAM designed both the encoder and the decoder of the U-Net: they utilized the finding network from the AMD classification task as the encoder and adopted a decoder consisting of depth-wise separable convolutions. For each lesion segmentation task, similar to task 1, they integrated 15 models with different parameters. Apart from the traditional U-Net architecture, the Muenai_Tim team (6th) used a nested U-Net structure (zhouUNetNestedUNet2018), which incorporates the dense skip pathways of DenseNet. They also utilized FPN and DeepLab-v3 architectures to build feature extraction models, and finally ensembled these different models.
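The overlapping patch extraction can be sketched as below, assuming "a stride of 30 percent" refers to 30% of the patch size (an assumption; the text does not state what the percentage refers to).

```python
def extract_patches(image, patch=256, stride_frac=0.3):
    """Tile a fundus image into overlapping patches for lesion segmentation."""
    stride = max(1, int(patch * stride_frac))
    h, w = image.shape[:2]
    patches = []
    for y in range(0, max(h - patch, 0) + 1, stride):
        for x in range(0, max(w - patch, 0) + 1, stride):
            patches.append(image[y:y + patch, x:x + patch])
    return patches
```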

The WWW (3rd), Airamatrix (4th), and XxlzT (10th) teams built their models on the DeepLab-v3 architecture, which combines the advantages of spatial pyramid pooling and an encoder-decoder structure for semantic segmentation. The Airamatrix team used Xception as the backbone. The XxlzT team first trained a ResNet-50 classification network to determine whether a lesion was present in the image, and then, for images with lesions, applied a DeepLab-v3 framework based on ResNet-101. The WWW team fused the predictions of two models (with 480 and 512 input sizes) to form the final segmentation result. In addition, to enhance the details available for learning, the WWW team first computed the mean image over all training images and subtracted it from every image. Second, they dilated each segmentation map with an 11×11 kernel to enlarge the objects, so that small objects in the ground truth would not be eliminated when the images were down-sampled. In this way, small objects in the images could still be detected.
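A minimal sketch of that preprocessing, assuming the images are provided as a stack of equally sized arrays (function and variable names are illustrative):

```python
import numpy as np
from scipy import ndimage

def preprocess_and_dilate(images, masks, kernel=11):
    """Subtract the mean training image from every image, and dilate each
    lesion mask with an 11x11 structuring element so tiny lesions survive
    down-sampling."""
    mean_img = np.mean(images, axis=0)
    norm_images = [img - mean_img for img in images]
    struct = np.ones((kernel, kernel), dtype=bool)
    dilated_masks = [ndimage.binary_dilation(m > 0, structure=struct).astype(np.uint8)
                     for m in masks]
    return norm_images, dilated_masks
```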

The remaining team, ForbiddenFruit (5th), adopted the FPN architecture and ensembled two FPNs for each lesion type except the other-lesions category. The encoders were EfficientNet-B1 (input size 320×320) and -B2 (320×320) for drusen segmentation, -B2 (256×256) and -B1 (256×256) for exudate segmentation, -B5 (384×384) and -B2 (320×320) for hemorrhage segmentation, and -B1 (256×256) and -B1 (384×384) for scar segmentation. For the other-lesions segmentation, a single FPN with an EfficientNet-B1 encoder (256×256) was used.

Four teams designed post-processing steps for the lesion segmentation task. The Zasti_AI team discarded predictions whose lesion area was below an empirically determined threshold. The ForbiddenFruit team took advantage of the AMD score in the post-processing step: all detections in images with an AMD score below a lesion-specific threshold were removed. The WWW and ADAM-TEAM teams used region filling and contour filling to produce better prediction results. In addition, for training, only the Airamatrix team used additional datasets, DiaretDB1 (https://www.it.lut.fi/project/imageret/diaretdb1/) and fundus10k (https://github.com/li-xirong/fundus10k), when dealing with exudate and scar lesions.

References