1 Introduction
Retinopathy of prematurity (ROP) is an abnormal development of blood vessels in the retina of prematurely born infants or infants with low birth weight [3]. ROP can lead to permanent visual impairment and is one of the leading causes of infant blindness globally. Higher neonatal survival rates have significantly increased the number of premature infants; consequently, ROP cases have risen sharply. An estimated nineteen million children are visually impaired worldwide [1], and ROP accounts for six to eighteen percent of childhood blindness [4]. Early treatment has been shown to be effective for ROP [2]. Therefore, it is crucial that at-risk infants receive timely retinal examinations so that potential ROP is detected early.
Early detection of ROP faces significant challenges. The most imposing one is the dire lack of experienced ophthalmologists for ROP screening, even in developed countries. This challenge is compounded by the limited awareness of ROP even among medical professionals, and by infants' inability to participate actively in medical diagnosis.
Deep learning has made significant progress in image classification and pattern recognition, and has shown great potential for applications to medical images, such as the detection of diabetic retinopathy (DR) [5]. However, the development of high-performance deep learning methodology for medical imaging has two critical requirements. First, it requires collections of large datasets with tens of thousands of abnormal (positive) cases. Second, clinical validation datasets for evaluating final performance require multiple grades for each image to ensure the consistency of the grading result. For instance, DR is a well-recognized complication among tens of millions of diabetic patients, and the available datasets with both positive and negative samples from DR diagnosis are on the order of several hundred thousand images. In the case of ROP, however, it is infeasible to meet these two key requirements. Due to the limited awareness and expertise among medical professionals, ROP datasets are limited and extremely imbalanced in terms of the ratio between negative and positive images (see Section 2). More importantly, clinical screening for ROP often requires an unusually high sensitivity level, i.e., higher than the medical standard of 95%. This is to ensure that few positive ROP cases are missed for infants.
Our work.
We formulate the problem of identifying ROP from retinal fundus images in an optimization framework, and adopt neural network techniques to solve this optimization problem.
Our study consists of two stages. First, we use a shallow convolutional neural network called ROPBaseCNN. This ROPBaseCNN-based model works well and achieves over 91 percent in both specificity and sensitivity on the first dataset Data_0, collected from a single data source. To increase the robustness of the model, more ROP data (Data_1) are collected from multiple data sources, and a deeper neural network, ROPResCNN, is developed. This updated ROPResCNN-based model overcomes the overfitting and vanishing/exploding-gradient problems.
Our experiments demonstrate that the ROPResCNN-based model dominates both human experts and the ROPBaseCNN-based model, by a wide margin. It shows impressive performance on the combined Data_0 and Data_1: a perfect score on sensitivity, excellent scores of specificity (96%) and precision (96%), and across-the-board improvement of roughly 10% when compared with experienced ophthalmologists. Most importantly, it reduces human errors by over 66% in all categories, and in particular completely eliminates the error in the category of sensitivity, the most critical requirement for diagnosis of ROP.
In addition to excellent experimental results, our study shows that as the network gets deeper, significant features can be extracted for a better understanding of ROP. For instance, in spite of the limited and imbalanced data, the ROPResCNN-based model succeeds in learning and capturing explicitly a well-known indicator for the medical diagnosis of ROP.
2 ROP: Data Collection, Augmentation, and Processing
2.1 Data Collection
To develop a model for ROP detection, ROP retinal fundus images were retrospectively collected from the Affiliated Eye Hospital of Nanchang University, which is an AAA (i.e., the highest ranked) hospital in China. All images were deidentified according to patient privacy protection policy, and ethics review was approved by the ethical committee of the university.
Two datasets were used. The first de-identified dataset, Data_0, consists of random samples of ROP images taken at the hospital between 2013 and 2018. A single type of fundus camera, Clarity Retcam3, was used with a 130° field of view. All operators had gone through professional training. Data_0 includes 2021 negative and 382 positive retinal fundus images. Images in this dataset share a common characteristic: the boundary between the vascular and the nonvascular areas is clear and the color difference is obvious. As shown in Figure 1, there is a clear white dividing line, called the demarcation line, between the vascular and the nonvascular areas of the peripheral retina. In the early stage of ROP, this demarcation line gets thicker until a ridge occurs. As the ridge gets thicker, proliferation of abnormal blood vessels causes the retinal blood vessels to expand, eventually leading to ROP. The appearance of thickened ridges is the main indicator used by ophthalmologists to diagnose ROP. Note that there is no such thickened ridge in the negative sample.
The second de-identified dataset, Data_1, consists of 461 negative and 498 positive retinal fundus images. A variety of 130°-field cameras were used, including CLARITY Retcam3, SUOER SW8000, and MEDSO ORTHOCONE RSB002. This set of data is characterized by the similar appearance of the vascular and nonvascular areas, with similar colors; consequently, the boundary between the vascular and the nonvascular areas is much less clear than in Data_0. Figure 2 shows a negative sample and a positive sample from Data_1.
All images were graded by ophthalmologists for ROP severity and image quality using an annotation tool, which was designed by ophthalmologists and implemented by us. ROP severity was graded as positive or negative. Image quality was assessed by graders, with images of adequate quality considered gradable. The reliability of the grading was assessed by four prominent ophthalmologists. For each retinal fundus image, the final grading result was the one for which the diagnosis from the hospital agreed with the majority of the diagnoses from these ophthalmologists.
2.2 Data Processing and Balancing
The datasets are imbalanced; for instance, the Data_0 dataset contains roughly five times more negative samples than positive ones. Consequently, the training process may be significantly biased towards the class with more samples. Data imbalance is very common among medical data. There are several approaches to address it, including undersampling [6], resampling and fine-tuning [7], oversampling [11], and weight balance and class balance [13].
To mitigate the imbalance problem, we design a hybrid method with a combination of several techniques:

Data enhancement: all samples in the dataset are first enhanced by brightness adjustment and random flipping. (See Figure 3). Afterwards, all images are resized to 300×300 pixels to match the network input.

Tuning sampling ratio and class weights: we use different class weights in the cross-entropy loss function in our optimization framework, to be introduced in the next section. We oversample the enhanced positive samples and resample the enhanced negative samples, so that the numbers of positive and negative samples in each sample batch are kept proportional to the inverse of their class weights. We experiment with different ratios through grid search on the validation set, and eventually fix the ratio of positive to negative samples used in the training process.
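The batch-balancing step above can be sketched as follows. This is a minimal NumPy illustration, not the production pipeline: the class weights and batch size here are illustrative, since the actual ratio is selected by grid search on the validation set.

```python
import numpy as np

def batch_indices(labels, batch_size, class_weights, rng):
    """Draw a batch whose per-class counts are proportional to the
    inverse of the class weights (illustrative weights; the tuned
    ratio in the paper comes from a validation-set grid search)."""
    inv = {c: 1.0 / w for c, w in class_weights.items()}
    total = sum(inv.values())
    idx = []
    for c, share in inv.items():
        n_c = int(round(batch_size * share / total))
        pool = np.flatnonzero(labels == c)
        # Oversample (replace=True) when the class pool is smaller than n_c.
        idx.extend(rng.choice(pool, size=n_c, replace=len(pool) < n_c))
    return np.array(idx)

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 10)  # imbalanced toy labels
batch = batch_indices(labels, 32, {0: 1.0, 1: 1.0}, rng)
print(len(batch))  # 32
```

With equal class weights, each batch contains equal numbers of positives and negatives, regardless of the raw class frequencies.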
This data processing and balance strategy is used throughout our study.
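The data enhancement step (brightness adjustment and random flipping) can be sketched as below. The brightness factor range is an assumed hyperparameter for illustration only; the source does not state the exact adjustment factors.

```python
import numpy as np

def augment(image, rng, brightness_range=(0.8, 1.2)):
    """Brightness adjustment followed by a random horizontal flip.
    `brightness_range` is an assumed hyperparameter."""
    # Scale pixel intensities by a random factor and clip to the valid range.
    factor = rng.uniform(*brightness_range)
    out = np.clip(image.astype(np.float32) * factor, 0.0, 255.0)
    # Flip left-right with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    return out

rng = np.random.default_rng(0)
img = np.full((300, 300, 3), 128, dtype=np.uint8)
aug = augment(img, rng)
print(aug.shape)  # (300, 300, 3)
```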
3 Problem and Optimization
Problem formulation.
We formulate the problem of detecting ROP as a binary classification problem, where the positive images and the negative images are labeled as $1$ and $0$, respectively. That is, given a fundus image, instead of labelling the image as either $1$ or $0$, we assign a score, a probability between $0$ and $1$, to the input image. The higher the score, the higher the probability that the image has ROP (i.e., is ROP positive). When assigning the label for the input image, if the probability is higher than $0.5$, the image is labelled as positive; otherwise it is negative. This is a natural choice for a neural network, which requires the output to be a continuous variable.

Now, suppose the probability is parametrized by $\theta$ and denoted $p_\theta(x)$ for an input image $x$. This set of parameters $\theta$ can be interpreted as various factors contributing to the probability of having ROP. The training stage then minimizes the cross-entropy loss function over the set of parameters $\theta$. That is, denoting the distribution of the pair $(x, y)$ of the image and the 0-1 label by $\mathcal{D}$, the training process solves the following optimization problem,

$$\min_\theta\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[-y\log p_\theta(x) - (1-y)\log\left(1-p_\theta(x)\right)\right].$$
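A minimal NumPy sketch of the scoring and thresholding just described; the 0/1 labels and a 0.5 threshold are the standard convention assumed here.

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Cross-entropy loss over a batch of probabilities p and 0/1 labels y."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

def predict_label(p, threshold=0.5):
    """Map the network's probability score to a positive/negative label."""
    return (p > threshold).astype(int)

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
print(predict_label(p))  # [1 0 1]
```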
Given the limited amount of available data and hence the possible issue of overfitting, we add a kernel regularization. In particular, we adopt an $\ell_2$ regularization on the weight matrices of the fully connected layers. For each fully connected layer with weight matrix $W$, we add the regularization term $\lambda\|W\|_2^2$ to the loss function, where $\lambda$ is a hyperparameter adjusting the scale of the regularization. Finally, we adjust the loss function by the class weights $(w_0, w_1)$ from the data processing stage, so that the final optimization problem is to minimize the following regularized cross-entropy loss,

$$\min_\theta\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[-w_1\, y\log p_\theta(x) - w_0\,(1-y)\log\left(1-p_\theta(x)\right)\right] + \lambda\sum_{\ell}\|W_\ell\|_2^2. \tag{1}$$
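The regularized, class-weighted loss can be computed as follows. This sketch assumes an $\ell_2$ kernel penalty; the class weights, $\lambda$, and weight matrix are illustrative values, not those used in training.

```python
import numpy as np

def regularized_loss(y, p, weights, lam, w_matrices, eps=1e-7):
    """Class-weighted cross entropy plus an L2 penalty on the fully
    connected weight matrices. `weights[1]` scales positive terms,
    `weights[0]` negative terms; `lam` is the regularization scale."""
    p = np.clip(p, eps, 1 - eps)
    ce = -weights[1] * y * np.log(p) - weights[0] * (1 - y) * np.log(1 - p)
    reg = lam * sum(float(np.sum(W ** 2)) for W in w_matrices)
    return float(np.mean(ce)) + reg

y = np.array([1, 0])
p = np.array([0.8, 0.3])
W = [np.ones((2, 2))]  # toy fully connected weight matrix
print(round(regularized_loss(y, p, {0: 1.0, 1: 2.0}, 0.01, W), 4))  # 0.4415
```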
Optimization.
We use the Adam algorithm [9] to solve our optimization problem (1). The Adam algorithm combines a momentum method with an adaptive learning-rate method, and uses first- and second-order moment estimates to optimize the neural network, which is specified in detail in Section 4. It is more efficient than the vanilla stochastic gradient descent algorithm.
The parameters of the Adam algorithm used here are the learning rate and the exponential decay rates $\beta_1$ and $\beta_2$ for the first and second moments, respectively.
To train our models more efficiently, we adjust the learning rate according to the validation loss. More specifically, the learning rate is reduced by 20% when the validation loss does not improve for a set number of consecutive epochs.
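The reduce-on-plateau rule can be sketched in a few lines of plain Python; the patience value here is an assumed parameter, not one stated in the source.

```python
class PlateauScheduler:
    """Reduce the learning rate by 20% (factor 0.8) when the validation
    loss stops improving for `patience` consecutive epochs. The patience
    value is an assumed parameter."""

    def __init__(self, lr, patience=3, factor=0.8):
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best, self.wait = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0  # improvement: reset counter
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor          # plateau: cut the rate
                self.wait = 0
        return self.lr

sched = PlateauScheduler(lr=1e-3, patience=2)
for loss in [0.9, 0.8, 0.85, 0.84]:  # no improvement for two epochs
    lr = sched.step(loss)
print(round(lr, 6))  # 0.0008
```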
4 Convolutional Neural Network and Architectures
The network adopted here is the convolutional neural network (CNN) [10]. CNN is specialized for processing gridlike data including images. A CNN consists of the feature extractor and the decoder. The feature extractor has several convolutional layers and pooling layers. It captures the basic features such as lines and corners in the first few layers and extracts more advanced features such as the indicators of ROP in the later layers. The extracted features are then fed into the decoder part. The decoder is a set of fully connected layers. The decoder uses the extracted features to predict the target variable. In our problem, the probability of an input image having ROP is the target variable. We experiment with two different architectures for CNN: ROPBaseCNN and ROPResCNN.
Figure 4: The architecture of ROPBaseCNN.

Type of layer     Parameters
Input             shape=(300,300,3)
Convolution       filters=32, kernel size= , strides=(2,2), activation=ReLU
Max pooling       pool size= , strides=(2,2)
Convolution       filters=64, kernel size= , strides=(2,2), activation=ReLU
Max pooling       pool size= , strides=(2,2)
Dropout           dropping probability=0.25
Flatten           none
Fully connected   neurons=128, activation=ReLU, kernel regularizer
Dropout           dropping probability=0.5
Fully connected   neurons=64, activation=ReLU, kernel regularizer
Output (Dense)    shape=(1), activation=sigmoid
Table 5: Performance of the ROPBaseCNN-based and ROPResCNN-based models.

Model         ROPBaseCNN   ROPBaseCNN        ROPResCNN
Train data    Data_0       Data_0 + Data_1   Data_0 + Data_1
Test data     Data_0       Data_0 + Data_1   Data_0 + Data_1
Precision     0.9479       0.8131            0.96
Sensitivity   0.91         0.7891            1.0
Specificity   0.9135       0.9335            0.96
Accuracy      0.93         0.8948            0.98
F1 score      0.9286       0.8009            0.98
The architecture of ROPBaseCNN.
For dataset Data_0 with only 2403 samples, complicated models are prone to overfitting the training data, so we first adopt a shallow CNN model with only five layers: two convolution layers and three fully connected layers. To further prevent overfitting, we add dropout layers [12] in the decoder part and a kernel regularization for the fully connected layers. The architecture of this shallow CNN, named ROPBaseCNN, is summarized in Figure 4.
Combined with the aforementioned data processing strategy, the accuracy of ROP detection under ROPBaseCNN for dataset Data_0 is 93%. This model appears unstable, however, in that its performance deteriorates on the combined datasets Data_0 and Data_1. (See Table 5).
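As a sanity check on this architecture, the spatial sizes implied by the listed strides can be traced layer by layer. This assumes 'same' padding (the padding mode and kernel sizes do not appear in the table, so both are assumptions here).

```python
import math

def out_side(side, stride):
    """Spatial size after a layer with the given stride, assuming
    'same' padding (an assumption; the table only lists the strides)."""
    return math.ceil(side / stride)

side = 300
for layer in ["conv1", "pool1", "conv2", "pool2"]:  # each uses stride 2
    side = out_side(side, 2)
    print(layer, side)  # 150, 75, 38, 19
# The last conv layer has 64 filters, so the flattened feature vector is:
print("flatten size:", side * side * 64)  # 23104
```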
The architecture of ROPResCNN.
The architecture of ROPResCNN combines a pretrained ResNet50 [8] model (with the top removed) with a global average pooling layer and a fully connected layer as the output layer. The weights of ResNet50 are used as the initial point of the optimization.
With the limited amount of ROP data, training a deep neural network from a random initialization is difficult. Instead, we adopt the pretrained network weights, which help accelerate the training process because ResNet50 is capable of capturing basic and important features for general image classification. Moreover, with this pretrained network, we manage to avoid well-known issues in deep networks such as vanishing/exploding gradients. Additionally, we use global average pooling at the end of the pretrained residual network, reducing the dimension from 3D to 1D; global pooling thus outputs one response for every feature map. At the end, a dense layer with one neuron and the sigmoid activation function aggregates these responses and outputs the probability.
Note that ROPResCNN favors more convolutional layers instead of fully connected ones. With global average pooling for dimension reduction, empirically it is not necessary to add regularization for ROPResCNN.
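The global-average-pooling output head can be sketched as below. The feature-map shape and the weights here are illustrative placeholders, not the trained values.

```python
import numpy as np

def gap_head(feature_map, w, b):
    """Global average pooling over the spatial dimensions followed by a
    single sigmoid neuron, as in the ROPResCNN output head."""
    pooled = feature_map.mean(axis=(0, 1))   # one response per feature map
    logit = float(pooled @ w + b)            # single dense neuron
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid -> probability of ROP

rng = np.random.default_rng(0)
fmap = rng.standard_normal((10, 10, 2048))   # illustrative top-level feature maps
prob = gap_head(fmap, rng.standard_normal(2048) * 0.01, 0.0)
print(0.0 <= prob <= 1.0)  # True
```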
5 Implementation Results
Data_0 is used for ROPBaseCNN and split into three sets: the training set has 187 positive and 990 negative samples, the validation set has 80 positive and 425 negative samples, and the testing set has 115 positive and 606 negative samples with held-out class labels. A combination of Data_0 and Data_1 is used for ROPResCNN and split into three sets: training (431 positive and 1216 negative samples), validation (185 positive and 521 negative samples), and testing (264 positive and 745 negative samples with held-out class labels).
Evaluation metrics.
We use the following standard metrics to evaluate the performance of our models: precision, sensitivity, specificity, accuracy, and the F1 score,

$$\text{Precision} = \frac{TP}{TP+FP}, \qquad \text{Sensitivity} = \frac{TP}{TP+FN}, \qquad \text{Specificity} = \frac{TN}{TN+FP},$$
$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}, \qquad \text{F1} = \frac{2\cdot\text{Precision}\cdot\text{Sensitivity}}{\text{Precision}+\text{Sensitivity}},$$

where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. The error reduction of a model relative to human experts is calculated as $(E_{\text{human}} - E_{\text{model}})/E_{\text{human}}$, where $E$ denotes the error rate in a given category.
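These metrics can be computed directly from the confusion-matrix counts; the counts below are toy values for illustration only.

```python
def metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics used to evaluate the models."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, specificity, accuracy, f1

# Toy counts: 90 of 100 positives found, 95 of 100 negatives rejected.
p, se, sp, acc, f1 = metrics(tp=90, tn=95, fp=5, fn=10)
print(round(p, 3), round(se, 3), round(sp, 3), round(acc, 3))  # 0.947 0.9 0.95 0.925
```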
Training on single GPU.
A single GTX 1080 GPU with 8GB of memory is used for training both the ROPBaseCNN-based model and the ROPResCNN-based model. With appropriate data processing, one GPU turns out to be sufficient for the training process with 3000 samples per epoch. For ROPBaseCNN, the batch size is 32 and training is stopped after 25 epochs; for ROPResCNN, the batch size is 64 and training is stopped after 30 epochs.
Evaluations.
The results of the ROPBaseCNN-based model with Data_0 are summarized in the first column of Table 5; the results of the ROPBaseCNN-based model with the combined datasets Data_0 and Data_1 are summarized in the second column; and the third column shows the results of the ROPResCNN-based model with both Data_0 and Data_1.
We take 200 infants' retinal fundus images with grading results confirmed by ophthalmologists, and compare the ophthalmologists' results against those generated by our models. Figure 6 gives the detailed performance comparison. We see that the ROPBaseCNN-based model manages to achieve performance comparable to experienced ophthalmologists, especially in terms of precision and specificity. However, its performance is not robust: its excellent scores on Data_0 vanish on the combined Data_0 and Data_1.
The ROPResCNN-based model dominates both human experts and the ROPBaseCNN-based model, by a wide margin. It shows impressive performance on the combined Data_0 and Data_1: a perfect score on sensitivity, excellent scores in specificity (96%) and precision (96%), and across-the-board improvement of roughly 10% when compared with experienced ophthalmologists. Most importantly, it reduces human errors by over 66% in all categories, and in particular completely eliminates the error in the category of sensitivity, the most critical requirement for diagnosis of ROP.
Feature map.
The feature map from ROPBaseCNN, shown in Figure 7, captures an implicit indicator of ROP: abnormal blood vessel growth. However, this disorder in the retinal fundus image is not used by ophthalmologists as a standard indicator for diagnosing ROP.
The feature map from ROPResCNN demonstrates that the ROPResCNN-based model succeeds in learning and capturing explicitly the well-accepted indicator for the medical diagnosis of ROP: the thickened ridge. (See Figure 8).
6 Summary
Our study shows that models using state-of-the-art CNNs for general image classification can provide accurate and early detection of ROP, with a perfect sensitivity score and excellent scores in specificity and precision. Beyond diagnosis, our study shows that deep neural network techniques can be powerful tools for extracting significant features for a better understanding of ROP.
Broader impact
To the best of the authors' knowledge, a) researchers at the intersection of deep learning and medical imaging can potentially benefit from this work; b) no particular group of people in society is expected to be put at a disadvantage by this work; c) the consequences of a failure of the system are limited; d) data bias is outside the scope of this work's potential influence; and e) a proper understanding of the models helps reduce the risk of misdiagnosis.
References
 [1] (2013) Preterm-associated visual impairment and estimates of retinopathy of prematurity at regional and global levels for 2010. Pediatric Research 74 (Suppl 1), pp. 35–49.
 [2] (2003) Revised indications for the treatment of retinopathy of prematurity: results of the Early Treatment for Retinopathy of Prematurity randomized trial. Archives of Ophthalmology 121 (12), pp. 1684–1694.
 [3] (2018) Screening examination of premature infants for retinopathy of prematurity. Pediatrics 142 (6).
 [4] (1997) Retinopathy of prematurity in middle-income countries. The Lancet 350 (9070), pp. 12–14.
 [5] (2016) Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 (22), pp. 2402–2410.
 [6] (2016) Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications 73.
 [7] (2017) Brain tumor segmentation with deep neural networks. Medical Image Analysis 35, pp. 18–31.
 [8] (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
 [9] (2014) Adam: a method for stochastic optimization. International Conference on Learning Representations.
 [10] (1999) Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision.
 [11] (1998) Data mining for direct marketing: problems and solutions. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD'98), pp. 73–79.
 [12] (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (56), pp. 1929–1958.
 [13] (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering 18 (1), pp. 63–77.