Blended Multi-Modal Deep ConvNet Features for Diabetic Retinopathy Severity Prediction

by J. D. Bodapati, et al.

Diabetic Retinopathy (DR) is one of the major causes of visual impairment and blindness across the world. It is usually found in patients who have suffered from diabetes for a long period. The major focus of this work is to derive an optimal representation of retinal images that helps to improve the performance of DR recognition models. To obtain this representation, features extracted from multiple pre-trained ConvNet models are blended using the proposed multi-modal fusion module. These final representations are used to train a Deep Neural Network (DNN) for DR identification and severity level prediction. As each ConvNet extracts different features, fusing them using 1D pooling and cross pooling leads to a better representation than using features extracted from a single ConvNet. Experimental studies on the benchmark Kaggle APTOS 2019 contest dataset reveal that the model trained on the proposed blended feature representations is superior to the existing methods. In addition, we notice that cross average pooling based fusion of features from Xception and VGG16 is the most appropriate for DR recognition. With the proposed model, we achieve an accuracy of 97.41% for DR identification and an accuracy of 81.7% for severity level prediction. Another interesting observation is that a DNN with dropout at the input layer converges more quickly when trained using blended features, compared to the same model trained using uni-modal deep features.




1 Introduction

Diabetic Retinopathy (DR) is an adverse effect of Diabetes Mellitus (DM) Cheung et al. (2008) that leads to permanent blindness in humans. It is usually caused by damage to the blood vessels that nourish the light-sensitive tissue called the retina. As per statistics Flaxman et al., DR is the fifth leading cause of blindness across the globe. According to the World Health Organization (WHO), around 382 million people were suffering from DR as of 2013, and this number may rise to 592 million by 2025. It is possible to save many people from going blind if DR is identified in the early stages. Small lesions are formed in the eyes of DR-affected people, and the type of lesions formed decides the level of severity of DR. Figure 1(a) shows the types of lesions, which include Micro Aneurysms (MA), Exudates, Hemorrhages, Cotton Wool Spots and improperly grown blood vessels on the retina.

Figure 1: Samples of DR-affected fundus images: (a) types of lesions formed, (b) levels of severity

DR can be categorised into five different stages Gulshan et al. (2016): No DR (Class-0), Mild DR (Class-1), Moderate DR (Class-2), Severe DR (Class-3) and Proliferative DR (Class-4). Sample retinal images with different severity levels of DR are shown in Figure 1(b). Mild DR is the early stage, during which the formation of Micro Aneurysms (MA) can be observed. As the disease progresses to the Moderate stage, swelling of blood vessels can be found, which leads to blurred vision. During the later Non-Proliferative DR (NPDR) stage, abnormal growth of blood vessels can be noticed. This stage is severe because a large number of blood vessels are blocked. Proliferative DR (PDR) is the advanced stage of DR. During this stage, retinal detachment along with a large retinal break can be observed, which leads to complete vision loss Williams et al. (2004).

In traditional DR diagnosis approaches, manual grading of the retinal scan is required to identify the presence or absence of retinopathy. If DR is confirmed as positive, further diagnosis is recommended to identify the severity level of the disease. This kind of diagnosis is quite expensive and time consuming as it demands human expertise. If DR identification is automated, diagnosis of the disease becomes affordable to many people. In the recent past, several machine learning tools have been introduced to address this problem.

Early approaches to DR identification, where the presence or absence of DR is revealed, focus on spotting Hard Exudates (HEs). A dynamic threshold based Support Vector Machine (SVM) is used to segment HEs in the retinal images Long et al. (2019). Fuzzy C-means is used to detect HEs and an SVM is used to identify the severity level of the disease, making the system more sophisticated Haloi et al. (2015). SVM based classifiers are adapted to find cotton wool spots in the retinal images.

With the introduction of deep learning, the focus of researchers has shifted from spotting HEs to spotting MAs. A two-step CNN is introduced to segment MAs in the given retinal scans Noushin et al. (2019). Another CNN architecture, trained using a selective sampling approach, is proposed to detect hemorrhages Grinsven et al. (2016). A max-out activation is introduced to improve the performance of a DNN model, with MA detection in DR as the application Haloi (2015). Recently, a bounding box based approach was introduced to identify the region of interest in the retinal images Srivastava et al. (2017). Though a good number of methods are available in the literature, they are either sub-optimal or complex. Hence there is a need for a solution that is simple and robust.

The objective of this work is to design a simple and robust deep learning-based approach to recognize DR from the given retinal images. The major focus of this work is to obtain a better feature representation of the retinal images, which ultimately leads to a better model. To accomplish this, we propose uni-modal and multi-modal approaches. Initially, for the given retinal images, deep features are extracted from different pre-trained ConvNets like VGG16, NASNet, Xception and Inception ResNetV2. In the uni-modal approach, features extracted from a single pre-trained ConvNet give the final feature representation. In the multi-modal approach, our idea is to blend the deep features extracted from multiple ConvNets to get the final feature representation. We propose different pooling based approaches to blend multiple deep features. To check the efficiency of our feature representation, a Deep Neural Network (DNN) architecture is proposed for identification of DR (task1) and for recognition of the severity level of DR (task2). We observe that in the multi-modal approach, blending deep features from Xception and Inception ResNet V2 outperforms others in both the tasks. Another interesting observation is that there is a drop in the number of false positives, which is most desirable. Experimental studies on the benchmark APTOS 2019 dataset reveal that a DNN model trained on our blended feature representations gives superior performance compared to the existing methods.

Following are the major contributions of the proposed work:

  • Effectiveness of the uni-modal feature representation is verified.

  • A blended multi-modal feature representation approach is introduced.

  • Different pool based approaches are proposed to blend deep features.

  • A DNN architecture with dropout at the input layer is proposed to test the efficiency of the proposed uni-modal and blended multi-modal feature representations.

  • The APTOS 2019 benchmark dataset is used to compare the performance of the proposed approach with existing models.

2 Related Work

In the recent past, machine learning models have become very popular for solving various problems like image classification Bodapati and Veeranjaneyulu (2019), text processing Bodapati et al. (2019), real-time fault diagnosis Zhuo et al. (2018) and healthcare Xia et al. (2020); Moreira et al. (2019). It is very common to use ML algorithms to address disease prediction Gadekallu et al. (2020); Patel et al. (2020); Reddy et al. (2019).

In this section we report various conventional models available in the literature for the task of DR recognition. In Wu et al. (2013), an easy-to-remember scientific approach is introduced for DR severity identification. In Akram et al. (2013), the authors present a hybrid classifier that uses both GMM and SVM as an ensemble model to improve accuracy. The same approach is modified by augmenting the feature set with the shape, intensity, and statistics of the affected region Akram et al. (2014). A random forest-based approach is proposed in Casanova et al. (2014); Verma et al. (2011), and segmentation based approaches are proposed in Welikala et al. (2014). In Welikala et al. (2015), a genetic algorithm-based feature extraction method is introduced. Different shallow classifiers such as GMM, KNN, SVM, and AdaBoost are analysed in Roychowdhury et al. (2013) to differentiate lesions from non-lesions. A hybrid feature extraction based approach is used in Mookiah et al. (2013).

Next, we introduce deep learning models available in the literature for the task of DR severity identification. A large dataset consisting of 128,175 retinal images is used to train a deep CNN. In Porter et al. (2019), data augmentation is used to generate additional data for training a CNN architecture. Fuzzy models are used in Rahim et al. (2016), where a hybrid model based on fuzzy logic, the Hough transform and several extraction methods is implemented. A combination of fuzzy C-means and deep CNN architectures is used in Dutta et al. (2018). A Siamese Convolutional Neural Network is used in Zeng et al. (2019) to detect diabetic retinopathy.

With the introduction of deep learning models, the focus has shifted to deep feature based models. In Mateen et al. (2019), the authors use features extracted from different layers of a pre-trained ConvNet such as VGG19 and apply PCA and SVD to those features for dimension reduction Reddy et al. (2020), to avoid over-fitting. In the former case, the models are not robust; in the latter case, the models are robust but need large datasets for training. A PCA based firefly model Bhattacharya et al. (2020) along with a deep neural network is used for DR detection in Gadekallu et al. (2020), with the UCI repository used for the experiments.

The performance of any ML algorithm depends on the features extracted from the given data. Conventional ML models need a separate algorithm (such as GIST, HOG or SIFT) for feature learning, which gives a global or local representation of the images. Features extracted in this way are known as handcrafted features. Until the arrival of deep learning models, handcrafted features were dominant and widely used for feature extraction.

2.1 Deep ConvNets for feature extraction and transfer learning

Deep learning models Jindal et al. (2018); Vinayakumar et al. (2020); Alazab et al. (2020) learn the essential characteristics of the input images. This capability makes them representation models: they can represent the data efficiently and reduce the need for an additional feature extraction phase in which features are handcrafted. The deeper layers of a CNN model can represent the entire input more efficiently than the early layers.

The downside of deep learning models is that they need enormous amounts of data for training, which is usually scarce for most real-time applications. This problem can be addressed by transfer learning, where the knowledge gained by one deep learning model is transferred to other models. To this end, deep pre-trained CNN models such as VGG16 and ResNet152 are available for transfer learning. Pre-trained models are models that have been trained on large amounts of data, and the weights learned during that training can be applied to similar kinds of tasks.

There are different types of pre-trained models, which are trained on large scale datasets such as ImageNet, which consists of more than a million images. Popular pre-trained deep CNN models include VGG16, VGG19, ResNet152, InceptionV3, Xception, NASNet, Inception ResNet V2 and DarkNet. The models used in this work are briefly described below:

  • Visual Geometry Group (VGG16):

    VGG16 is a deep ConvNet trained on 14 million images belonging to 1000 different classes, and it topped the leader board in the ILSVRC (ImageNet) challenge. In this architecture, 3×3 filters with stride 1 and same padding are used for the convolution operations, and 2×2 windows with stride 2 are used for the max-pooling operations across the network. At the end of the architecture, two fully connected dense layers of 4096 neurons each are followed by a soft-max layer.

  • Neural Architecture Search Network (NASNet): This is a special kind of deep CNN that searches for a better architectural building block on small datasets like CIFAR10 and transfers it to larger datasets like ImageNet. It has a better regularisation mechanism called Scheduled Drop Path, which significantly improves generalisation.

  • Xception: Xception is another deep ConvNet architecture, built on depth-wise separable convolution operations, that outperformed ResNet and InceptionV3 in the ILSVRC challenge.

  • Inception ResNetV2:

    This is popularly known as InceptionV4, as it combines two different architectures, InceptionV3 and ResNet152. It has both inception modules and residual connections, which boost the performance of the model.

Deep neural networks give excellent performance only when trained with extensive data. If the training data is not sufficient, DNN models tend to overfit. Deep Convolutional Neural Networks are introduced in Simonyan and Zisserman (2014) for the task of large-scale image recognition. Xception, a deep CNN, is developed using depthwise separable convolutions to improve performance Chollet (2017). A flexible architecture is defined in Zoph et al. (2018), which can search for a better convolutional cell with a better regularisation mechanism. All these models are trained on the ImageNet dataset for the ILSVRC challenge.

Our objective is to create a robust and efficient model to recognise DR with limited datasets and limited computational resources. To achieve this, we rely on transfer learning and use various pre-trained ConvNets to extract deep features. We use the knowledge of these models to extract the most prominent features of colour fundus images. A deep neural network with dropout at the input layer is trained to detect DR and classify its severity level. Introducing dropout at the input layer makes the deep neural network less prone to over-fitting.

3 Proposed Methodology

In this work, our objective is to develop a robust and efficient model to automate DR diagnosis. We focus on the extraction of deep features that are most descriptive and discriminative, which ultimately improves the performance of DR recognition. In order to get an optimal representation, features are extracted from multiple pre-trained CNN architectures and are blended using pooling based approaches. These final representations are used to train a Deep Neural Network with dropout at the input layer. The proposed model has three different modules: feature extraction, model training, and evaluation.

Figure 2: Architectures of various pre-trained models along with an indication of the layers from which features are extracted

3.1 Feature Extraction

Performance of any machine learning model is highly influenced by the feature representations and the same is applicable to models used for DR recognition. With this motivation, we propose two different approaches (uni-modal and multi-modal) to extract optimal features from the given retinal images.

In the proposed work, initial representations of the retinal images are obtained from the pre-trained VGG16, NASNet, Xception and Inception ResNetV2 models. As each pre-trained model expects input images of a different size, the given retinal images are resized according to the input dimensions accepted by each model. For example, when VGG16 is used, images are resized to 224×224×3. These resized retinal images are fed to the pre-trained models after removing the soft-max layer and freezing the rest of the layers. Activation outputs from the penultimate layers form the basis for the proposed feature extraction module. For each retinal image, deep features are extracted from the pre-trained ConvNets as follows:

  • Each of the first (fc1) and second (fc2) fully connected layers of VGG16 produces a feature vector of 4096 dimensions

  • The final global average pooling layer of NASNet, Xception and InceptionResNetV2 gives feature vectors of size 4032, 2048 and 1536 respectively

Figure 2 gives the architectural details of the pre-trained VGG16, NASNet, Xception and InceptionResNetV2 and pointers are marked at the feature extraction layers. These features form the input to the proposed uni-modal and blended multi-modal approaches to obtain the optimal feature representations of the retinal images.
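The extraction step for the two networks used later in the fusion module might look like the following minimal sketch. It assumes the Keras implementations of VGG16 and Xception with ImageNet weights, which the paper does not name explicitly. The layer names `fc1` and `fc2` are those of the Keras VGG16 model, the Xception input size of 299 is the Keras default rather than a value from the text, and the `EXTRACTORS` table and `build_extractors` helper are hypothetical names introduced for illustration.

```python
# Feature sizes are the ones quoted in the text. The Xception input size is
# the Keras default and is an assumption, not something the paper states.
EXTRACTORS = {
    "vgg16_fc1": {"input_size": 224, "dim": 4096},
    "vgg16_fc2": {"input_size": 224, "dim": 4096},
    "xception": {"input_size": 299, "dim": 2048},
}


def build_extractors():
    """Return frozen Keras models that map an image batch to deep features."""
    # Imported lazily: TensorFlow is heavy and downloads weights on first use.
    import tensorflow as tf

    vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
    vgg.trainable = False  # freeze all layers, as described in the text
    fc1 = tf.keras.Model(vgg.input, vgg.get_layer("fc1").output)
    fc2 = tf.keras.Model(vgg.input, vgg.get_layer("fc2").output)

    # include_top=False with pooling="avg" stops at global average pooling.
    xcp = tf.keras.applications.Xception(
        weights="imagenet", include_top=False, pooling="avg"
    )
    xcp.trainable = False
    return {"vgg16_fc1": fc1, "vgg16_fc2": fc2, "xception": xcp}
```

Calling `build_extractors()["xception"].predict(batch)` on a batch of resized, preprocessed images would then return one 2048-dimensional vector per image. Any preprocessing applied before that call is likewise an assumption, since the text only describes resizing.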

3.2 Uni-modal deep feature extraction:

In this approach, deep features are extracted from the final layers of one of the pre-trained ConvNets (VGG16, NASNet, Xception, Inception ResNetV2) to get the global representation of the retinal images. These deep features are fed to classification models for DR identification and severity recognition. We propose to use a DNN architecture with dropout at the input layer for both tasks. Figure 3 gives the details of the different stages involved in the DR recognition process that uses uni-modal deep ConvNet features.

Figure 3: Stages involved in uni-modal deep feature based DR recognition

3.3 Blended (multi-modal) deep feature extraction:

Unlike uni-modal approaches, multi-modal approaches use deep features extracted from multiple ConvNets, which are blended using fusion techniques. The features obtained from different pre-trained models provide different representations of the retinal images, as the models follow different architectures and are trained on different datasets. A stronger representation can be obtained by blending features from multiple ConvNets, as the features of one ConvNet complement the features from the other ConvNets involved in the process.

Figure 4: Stages involved in blended deep feature based DR recognition

We propose various pooling approaches to fuse the deep features extracted from multiple pre-trained ConvNets. The final blended deep features provide a more descriptive and discriminative representation of the retinal images. These blended features are fed to the classification models for DR identification or severity recognition. Figure 4 gives the details of the different stages involved in the DR recognition process that uses blended multi-modal deep ConvNet features. The proposed blended multi-modal feature extraction module uses features from both fully connected layers of VGG16 (fc1 and fc2) and from the global average pooling layer of Xception as input. The rationale behind choosing features from VGG16 and Xception over others is twofold. In VGG16, each feature map of the final convolution block learns the presence of different lesions in the retinal images. Xception learns correlations across the 2-D space, and as a result each feature map provides a comprehensive representation of the entire retinal scan. Figure 5 visualizes the feature maps obtained from the final convolution blocks of the VGG16 and Xception models when a retinal image is passed to these models.

Figure 5: Visualization of the feature maps of the final convolution blocks of VGG16 and Xception models on passing retinal image as input

3.3.1 Approaches to blend deep features from multiple ConvNets

In this work, two different pooling based approaches (1-D pooling and cross pooling) are proposed to fuse the multi-modal deep features extracted from VGG16 (fc1, fc2) and Xception. 1-D pooling is used to select prominent local features from each region of the VGG16 features, whereas cross pooling aggregates the prominent features obtained by 1-D pooling with the global representation of Xception.

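1-D pooling based fusion takes one feature vector $X$ as input and produces another feature vector $\hat{X}$, where $X \in \mathbb{R}^{d}$ and $\hat{X} \in \mathbb{R}^{\hat{d}}$. $\hat{X}$ is a reduced representation of $X$, where $\hat{d} = d/k$ and $k$ is the size of the pooling window. Each feature element $\hat{x}_i$, $i = 1, \dots, \hat{d}$, of the output vector is computed using one of the following three approaches:

Max pooling: $\hat{x}_i = \max_{(i-1)k < j \le ik} x_j$
Sum pooling: $\hat{x}_i = \sum_{j=(i-1)k+1}^{ik} x_j$
Average pooling: $\hat{x}_i = \frac{1}{k} \sum_{j=(i-1)k+1}^{ik} x_j$

In cross pooling based feature fusion, two different feature vectors $X$, $Y$ are passed as input, and another feature vector $Z$ is produced, where $X, Y, Z \in \mathbb{R}^{d}$. Each feature element $z_i$, $i = 1, \dots, d$, of the output vector is computed using one of the following three approaches:

Max pooling: $z_i = \max(x_i, y_i)$
Sum pooling: $z_i = x_i + y_i$
Average pooling: $z_i = (x_i + y_i)/2$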
1-D pooling is applied independently to the features extracted from the fc1 and fc2 layers of VGG16. The cross pooling approach is then applied to the resulting pooled features. This feature vector is merged with the features extracted from Xception, again using cross pooling. The fusion module produces deep blended features, which are used to train the proposed DNN model. Figure 6 shows the proposed architecture of the deep feature fusion approach used to blend features from different ConvNets. As the final feature vector is a blended version of the local and global representations of the retinal images, it provides strong features. The accompanying algorithm gives the sequence of steps involved in blended multi-modal feature fusion based DR recognition.
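The fusion sequence described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the pooling window of 2 is chosen here so that the 4096-dimensional VGG16 vectors match the 2048-dimensional Xception vector, and using the same max, sum or average mode at every step is also an assumption. The function names `pool_1d`, `cross_pool` and `blend` are hypothetical.

```python
import numpy as np


def pool_1d(x, window=2, mode="avg"):
    """Reduce one feature vector by pooling non-overlapping windows."""
    x = np.asarray(x, dtype=float)
    if x.size % window != 0:
        raise ValueError("vector length must be a multiple of the window size")
    blocks = x.reshape(-1, window)  # one row per pooling window
    if mode == "max":
        return blocks.max(axis=1)
    if mode == "sum":
        return blocks.sum(axis=1)
    if mode == "avg":
        return blocks.mean(axis=1)
    raise ValueError(f"unknown pooling mode: {mode}")


def cross_pool(x, y, mode="avg"):
    """Fuse two equal-length feature vectors element by element."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("cross pooling needs vectors of equal length")
    if mode == "max":
        return np.maximum(x, y)
    if mode == "sum":
        return x + y
    if mode == "avg":
        return (x + y) / 2.0
    raise ValueError(f"unknown pooling mode: {mode}")


def blend(fc1, fc2, xception, window=2, mode="avg"):
    """1-D pool fc1 and fc2, cross pool them, then cross pool with Xception."""
    vgg = cross_pool(pool_1d(fc1, window, mode), pool_1d(fc2, window, mode), mode)
    return cross_pool(vgg, xception, mode)


# Random stand-ins for real deep features (assumed sizes from the text).
rng = np.random.default_rng(0)
fc1, fc2 = rng.random(4096), rng.random(4096)
xcp = rng.random(2048)
blended = blend(fc1, fc2, xcp)
print(blended.shape)  # (2048,)
```

With these assumptions the blended vector has the same length as the Xception features, so the classifier input size does not grow as more feature sets are fused.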

Figure 6: Approaches for Fusion of features extracted from Deep ConvNets


3.4 Model Training and Evaluation:

During this phase, we train the ML model with the deep blended pre-trained features. We prefer to use a Deep Neural Network (DNN) model for training. For the DR identification task, as it is a simple binary classification task, a DNN with two hidden layers of 256 and 128 units respectively, with ReLU activation, is used.

For the DR severity classification task, a DNN with three hidden layers of 512, 256 and 128 units respectively, with ReLU activation, is used. In both DNNs, a dropout of 0.2 is applied at the input layer to keep the model from over-fitting, which helps the model become robust. Figure 7 represents the architecture of the proposed approach for model training and evaluation.
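To make the severity-classification architecture concrete, here is a minimal NumPy sketch of its forward pass. The hidden sizes and the 0.2 input dropout come from the text. The 2048-dimensional input, the He-style initialisation and the 5-way softmax output are assumptions made for illustration, and `init_dnn` and `forward` are hypothetical names.

```python
import numpy as np


def init_dnn(input_dim, hidden=(512, 256, 128), n_out=5, seed=0):
    """Create (weights, bias) pairs for each dense layer."""
    rng = np.random.default_rng(seed)
    sizes = [input_dim, *hidden, n_out]
    # He-style initialisation; the paper does not state an initialiser.
    return [
        (rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, x, drop_rate=0.2, training=False, rng=None):
    """Input dropout, ReLU hidden layers, softmax output (assumed)."""
    x = np.asarray(x, dtype=float)
    if training:
        # Inverted dropout applied to the input features only.
        rng = rng or np.random.default_rng()
        keep = rng.random(x.shape) >= drop_rate
        x = x * keep / (1.0 - drop_rate)
    for w, b in params[:-1]:
        x = np.maximum(x @ w + b, 0.0)  # ReLU
    w, b = params[-1]
    logits = x @ w + b
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


params = init_dnn(input_dim=2048)
probs = forward(params, np.ones((4, 2048)))
print(probs.shape)  # (4, 5)
```

The identification network would follow the same pattern with `hidden=(256, 128)` and a single sigmoid output, which is again an assumption about the output layer.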

Figure 7: Training and Evaluation of DNN model for identification and recognition of DR

4 Experimental Results

In this section, we provide details of the experimental studies carried out to understand the efficiency of the proposed blended multi-modal deep feature representation.

4.1 Dataset Summary

For the experimental studies, the APTOS 2019 Kaggle benchmark dataset, available as part of the blindness detection challenge, is used. This is a large dataset of retinal images taken using fundus photography under a variety of imaging conditions. The images are graded manually on a scale of 0 to 4 (0 - No DR, 1 - Mild, 2 - Moderate, 3 - Severe, 4 - Proliferative DR) to indicate different severity levels.

Severity Level | # Samples
Class-0 (Normal) | 1805
Class-1 (Mild Stage) | 370
Class-2 (Moderate Stage) | 999
Class-3 (Severe Stage) | 193
Class-4 (Proliferative Stage) | 295
Total | 3662
Table 1: Dataset Summary of APTOS 2019 dataset

Table 1 gives the number of retinal images available in the dataset under each level of severity. We can observe that the dataset is imbalanced, with a large number of normal images and very few images in Class-3. In all the experiments, 80% of the data is used for training and the remaining 20% is used for validation.
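The imbalance and the split sizes can be checked with a short calculation on the counts from Table 1. The rounding used for the 80/20 split is an assumption, since the text does not say how the fractional image is handled.

```python
# Class counts copied from Table 1.
counts = {0: 1805, 1: 370, 2: 999, 3: 193, 4: 295}
total = sum(counts.values())

for grade, n in counts.items():
    print(f"Class-{grade}: {n:4d} images ({100 * n / total:.1f}%)")

# 80/20 split; rounding to the nearest image is an assumption.
n_train = round(0.8 * total)
n_val = total - n_train
print(total, n_train, n_val)

# Ratio between the largest and smallest class.
imbalance = max(counts.values()) / min(counts.values())
```

The normal class is roughly nine times larger than Class-3, which is worth keeping in mind when reading per-class metrics for the severity task.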

4.2 Performance Measures:

For the evaluation of the proposed model, we report different measures: Accuracy, Precision, Recall, and F1 Score. In addition, we use the Kappa statistic, which compares the observed accuracy with the expected accuracy. The Kappa statistic is calculated as

$\kappa = \dfrac{\text{observed accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}}$

Observed accuracy is defined as the fraction of samples that are correctly classified. Expected accuracy is defined as the accuracy that a classifier would be expected to achieve by chance; it depends on the number of examples of each class and on the number of examples that the classifier assigns to each class.
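The kappa computation can be sketched as follows, taking expected accuracy to be the chance agreement implied by the row and column totals of the confusion matrix. This is the standard Cohen's kappa, assumed here to be what the authors use. The confusion matrix in the example is hypothetical and not taken from the paper.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, cols: predicted)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed accuracy: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Expected accuracy: agreement expected by chance from the marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary confusion matrix, not taken from the paper.
cm = [[45, 5],
      [5, 45]]
print(round(cohen_kappa(cm), 2))  # 0.8
```

For a balanced binary problem the expected accuracy is close to 0.5, so an accuracy of 90% corresponds to a kappa of about 0.8. The same relation explains why the kappa values reported for task1 sit a few points below the accuracies.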

4.3 DR Identification and Severity level Prediction:

The experiments carried out in this work are divided into two different tasks. In task1, the presence or absence of DR is identified, whereas in task2 the severity level is predicted for the given retinal image.

4.3.1 Task1 - DR Identification:

In this task, given the retinal image of a diabetic patient, we need to check whether the person is affected by retinopathy or not. DR identification is a binary classification task, so binary cross entropy is used as the loss, and the Adam optimiser is used to optimise the objective function. The dataset contains images belonging to 5 different classes, as shown in Table 1, and is not directly suitable for a binary classification task. Merging all the DR-affected images into a single class gives 1857 positively labeled images, and the remaining 1805 normal images are labeled as negative.
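The relabelling for task1 amounts to mapping grade 0 to the negative class and grades 1 to 4 to the positive class. The sketch below rebuilds a label list from the counts in Table 1 purely to check the totals. A real pipeline would read the grades from the dataset's label file instead.

```python
from collections import Counter

# Class counts copied from Table 1.
grade_counts = {0: 1805, 1: 370, 2: 999, 3: 193, 4: 295}

# Stand-in label list built from the counts.
grades = [g for g, n in grade_counts.items() for _ in range(n)]

# Any grade above 0 is treated as DR present.
binary = [1 if g > 0 else 0 for g in grades]
binary_counts = Counter(binary)
print(binary_counts[1], binary_counts[0])  # 1857 1805
```

After merging, the two classes are almost balanced, unlike the five-class problem.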

4.3.2 Task2 - Severity level Prediction:

The objective of task1 is to identify the presence or absence of DR, given a retinal image. While treating DR-affected patients, mere identification of DR is not sufficient, and understanding the level of severity helps to provide better treatment. Hence we treat severity level identification as a separate task that assigns the given retinal image to one of the 5 severity levels. Categorical cross entropy is used as the loss, and the Adam optimiser is used to optimise the objective function.
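For reference, categorical cross entropy over five severity classes can be written as a small NumPy function. This is a generic sketch of the loss rather than the authors' training code. The predicted probabilities in the example are made up, and `categorical_cross_entropy` is a hypothetical helper name.

```python
import numpy as np


def categorical_cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]  # probability of the true grade
    return float(-np.log(picked).mean())


# Made-up predictions for two images whose true grades are 0 and 2.
probs = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.60, 0.10, 0.10],
])
labels = np.array([0, 2])
loss = categorical_cross_entropy(probs, labels)
```

The loss shrinks towards zero as the probability assigned to the correct grade approaches one, and a uniform prediction over five classes gives a loss of ln 5.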

4.4 Experimental studies to show the representative nature of uni-modal features for task1

This experiment is carried out to understand how efficiently retinal images are represented using uni-modal features that are obtained directly from a single pre-trained ConvNet. Models like VGG16, Xception, NASNet, and Inception ResNetV2 are considered for extracting uni-modal features. For classification, models like the Naïve Bayes classifier, logistic regression, decision tree, k-Nearest Neighbours (KNN) classifier, Multi Layered Perceptron (MLP), Support Vector Machine (SVM) and Deep Neural Network (DNN) are used.

Model | Accuracy | Precision | Recall | F1 Score | Kappa Statistic
Logistic Regression | 97.13 | 97 | 97 | 97 | 94.27
KNN | 95.36 | 96 | 95 | 95 | 90.73
Naive Bayes | 77.08 | 82 | 77 | 76 | 54.45
Decision Tree | 91.27 | 91 | 91 | 91 | 82.52
MLP | 96.45 | 97 | 96 | 96 | 92.91
SVM (linear) | 96.58 | 97 | 97 | 97 | 93.17
SVM (RBF) | 96.86 | 97 | 97 | 97 | 93.73
DNN | 97.32 | 98 | 98 | 98 | 94.63
Table 2: Performance of ML algorithms on Task1 using features from fc2 layer of VGG16

Tables 2 and 3 show the performance of different ML models on the DR identification task when the retinal images are represented with features extracted from the second fully connected layer (fc2) of VGG16 and from Xception, respectively. From these results we conclude that the DNN outperforms the rest of the ML models irrespective of the features used. Hence we use the DNN model alone in the rest of the experiments.


Model | Accuracy | Precision | Recall | F1 Score | Kappa Statistic
Logistic Regression | 96.45 | 96 | 96 | 96 | 93
KNN | 95.5 | 96 | 95 | 95 | 91
Naive Bayes | 82.95 | 84 | 83 | 83 | 65.9
Decision Tree | 87.59 | 88 | 88 | 88 | 75.17
MLP | 96 | 96 | 96 | 96 | 91.89
SVM (linear) | 96.18 | 96 | 96 | 96 | 92.36
SVM (RBF) | 97.4 | 97 | 97 | 97 | 94.82
DNN | 97.41 | 97 | 97 | 97 | 94.82
Table 3: Performance of ML algorithms on Task1 using features from Xception

Model | Accuracy | Precision | Recall | F1 Score | Kappa Statistic
VGG16-fc1 | 97.27 | 97 | 98 | 97 | 95.12
VGG16-fc2 | 97.32 | 98 | 98 | 98 | 94.63
NASNet | 97.14 | 97 | 97 | 97 | 94.27
Xception | 97.41 | 97 | 97 | 97 | 94.82
Inception ResNetV2 | 97.34 | 97 | 97 | 97 | 94.54
Table 4: Task1 performance using DNN trained on different uni-modal features

Table 4 shows the representative power of uni-modal features extracted from different pre-trained models. It is clear from the results that the performance of the DNN model varies depending on the uni-modal features used. This suggests that each pre-trained model extracts a different set of features from the retinal images. The features extracted from Xception yield the best accuracy for the diabetic retinopathy identification task. Only a nominal difference in accuracy and kappa score can be observed between the models trained using different uni-modal features.


Model | # epochs | Loss | Accuracy
VGG16-fc1 | 65 | 0.0024 | 97.27
VGG16-fc2 | 67 | 0.0139 | 97.32
NASNet | 37 | 0.0310 | 97.14
Xception | 16 | 0.0213 | 97.41
Inception ResNet V2 | 19 | 0.0815 | 97.34
Table 5: Task1 - Comparison of the DNN model in terms of loss and number of epochs when trained on different uni-modal features

For a better understanding of the representative nature of different uni-modal features, the loss and the number of epochs taken by the DNN models to converge are reported in Table 5. We can observe that the model trained using VGG16-fc1 features reaches the minimum loss compared to the rest of the models. In terms of convergence, Xception takes only 16 epochs, followed by Inception ResNetV2 with 19 epochs.

To summarize the experiments on the DR identification task, features extracted from Xception, VGG16-fc2 and Inception ResNetV2 yield nearly the same accuracy, with only nominal differences. However, the model trained on VGG16-fc1 features gives a better kappa score than the others. We can also observe that the model trained on VGG16-fc2 features gives better precision, recall and F1 scores. Regardless of the type of uni-modal features used, the DNN consistently outperforms the rest of the models, especially in terms of kappa scores. The reason for the superior performance of the models trained using VGG16 and Xception features is that these networks are good at extracting the lesion information that is useful to discriminate the DR-affected images from those that are not affected.

4.5 Experimental studies to show the representative nature of uni-modal features for task2

We run a set of experiments to understand the nature of uni-modal features for severity prediction of DR. Task2 is more challenging compared to task1 as it involves multiple classes. DNN model with dropout at the input layer is used with different uni-modal features.

Type of Uni-modal features | Accuracy | Precision | Recall | F1 Score | Kappa Statistic
VGG16-fc1 | 80.06 | 80 | 81 | 80 | 70.02
VGG16-fc2 | 79.81 | 79 | 80 | 79 | 68.88
NASNet | 76.4 | 75 | 76 | 75 | 63.87
Xception | 78.99 | 78 | 79 | 78 | 67.67
Inception ResNetV2 | 79.73 | 78 | 78 | 78 | 67.67
Table 6: Task2 performance using DNN trained on different uni-modal features

Based on the results reported in Table 6, we can observe the same trend that was observed in task1. The lower scores obtained for task2 reflect the complexity of severity prediction. The model trained on VGG16-fc1 features performs better than the rest of the models, and this holds across all the metrics.

Model | # epochs | Loss | Accuracy
VGG16-fc1 | 76 | 0.3623 | 80.06
VGG16-fc2 | 79 | 0.3986 | 79.81
NASNet | 37 | 0.5612 | 76.39
Xception | 23 | 0.4175 | 78.99
Inception ResNet V2 | 89 | 0.382 | 79.73
Table 7: Task2 - Comparison of the DNN model in terms of loss and number of epochs when trained on different uni-modal features

From Table 7 it is clear that among all the pre-trained features, VGG16-fc1 yields superior performance with minimum loss. However, Xception converges in fewer epochs than the other models.

4.6 Performance evaluation of the proposed blended multi-modal features

A clue from the experiments on uni-modal features is that different pre-trained models extract different sets of features from the retinal images. If we use multiple deep features extracted from different models, they complement each other and help to improve the scores. To benefit from more than one set of uni-modal features, we propose a blended multi-modal feature representation. This section is dedicated to showing the representative power of the proposed feature representation with an application to DR identification and severity level prediction.

In addition, we apply the proposed pooling methods to blend the features from multiple pre-trained models. Initially, we blend features from the first and second fully connected layers of VGG16. Then we extend this to the fusion of three different feature sets: the fc1 and fc2 layers of VGG16, and Xception.
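The pooling-based blending can be illustrated with a small sketch. It assumes that the feature vectors being fused already have the same length and that fusion is applied element-wise; the fusion module itself is defined earlier in the paper and may differ in detail (for example, in how features of different sizes are aligned). The function name `blend_features` and the toy vectors are hypothetical.

```python
def blend_features(feature_vectors, mode="avg"):
    """Fuse equal-length feature vectors by element-wise pooling.

    mode is one of "max", "sum" or "avg".
    Assumption: all vectors were already brought to a common length.
    """
    if not feature_vectors:
        raise ValueError("need at least one feature vector")
    length = len(feature_vectors[0])
    if any(len(v) != length for v in feature_vectors):
        raise ValueError("all feature vectors must have the same length")

    # Each column groups the values that the models produced at one position.
    columns = zip(*feature_vectors)
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "avg":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Toy stand-ins for the VGG16-fc1 and VGG16-fc2 activations of one image.
fc1 = [0.0, 2.0, 4.0, 1.0]
fc2 = [1.0, 0.0, 2.0, 3.0]
print(blend_features([fc1, fc2], "max"))  # [1.0, 2.0, 4.0, 3.0]
print(blend_features([fc1, fc2], "avg"))  # [0.5, 1.0, 3.0, 2.0]
print(blend_features([fc1, fc2], "sum"))  # [1.0, 2.0, 6.0, 4.0]
```

Under this assumption, passing a third vector of the same length to the same call covers the three-modality case.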

4.6.1 Blended Multi-Modal deep features for task1

Modalities Pooling Accuracy Kappa Statistic Epochs Loss
VGG16-fc1 and VGG16-fc2 Max-pooling 96.12 91.89 68 0.0352
VGG16-fc1 and VGG16-fc2 Avg-pooling 97.39 94.61 51 0.0293
VGG16-fc1 and VGG16-fc2 Sum-pooling 95.5 91 64 0.0419
VGG16-fc1, VGG16-fc2 and Xception Max-pooling 96.85 92.6 69 0.0314
VGG16-fc1, VGG16-fc2 and Xception Avg-pooling 97.92 94.93 43 0.0201
VGG16-fc1, VGG16-fc2 and Xception Sum-pooling 96.1 92.31 56 0.0396
Table 8: DNN with blended multi-modal features with different fusions for Task1

We examine the effect of blending deep features extracted from multiple pre-trained models on the DR identification task. In addition, we evaluate the proposed maximum, sum and average pooling approaches for blending multiple deep features.

From Table 8, we can observe that average pooling based fusion works better for DR detection than the other pooling approaches. Using average fusion, the models trained on multi-modal features achieve superior performance in terms of accuracy and kappa statistic. In addition, the model converges faster (43 epochs with three modalities and 51 epochs with two) and attains the minimum loss. The accuracy obtained by the model trained using multi-modal features is significantly better than that of the models trained on uni-modal features.

4.6.2 Blended Multi-Modal deep features for task2

From the previous experiments we understand that the models trained on multi-modal features give better performance than those trained on uni-modal features in the context of DR identification, which is a simple binary task. To verify that the proposed blended representation also performs well on the more complex multi-class classification task, we apply it to the severity prediction task.

Modalities Pooling Accuracy Kappa Statistic Epochs Loss
VGG16-fc1 and VGG16-fc2 Maximum 78.06 66.87 72 0.4176
VGG16-fc1 and VGG16-fc2 Average 80.34 69.21 62 0.2987
VGG16-fc1 and VGG16-fc2 Sum 76.8 65.64 68 0.5693
VGG16-fc1, VGG16-fc2 and Xception Maximum 79.25 67.29 74 0.3986
VGG16-fc1, VGG16-fc2 and Xception Average 80.96 70.9 54 0.2619
VGG16-fc1, VGG16-fc2 and Xception Sum 77.12 66.42 61 0.4782
Table 9: DNN with blended multi-modal features with different fusions for Task2

From Table 9, we can see that average pooling based fusion of multiple deep features works better for DR severity prediction. Compared to the blended features from VGG16-fc1 and VGG16-fc2, the blended features from VGG16-fc1, VGG16-fc2 and Xception give a better representation. For severity prediction too, the model that uses the average pooling approach for fusion converges faster, with better accuracy and kappa score, than the other fusion approaches.

Figure 8: Confusion matrix for the severity prediction task.

4.7 Comparison of proposed Blended feature extraction with existing methods

In this experiment we compare the proposed DNN with dropout at the input layer, trained using the proposed blended multi-modal deep feature representation, with the existing models in the literature for DR prediction. We compare the proposed model with the models used in Gargeya and Leng (2017) and Kassani et al. (2019). From Table 10 we can see that the proposed method gives an accuracy of 80.96%, which is significantly better than the existing models in the literature. Compared to the existing models, the proposed DNN model is simple, with only 3 hidden layers of 512, 256 and 128 units respectively. The confusion matrix in Figure 8 shows the misclassifications produced by the proposed model when applied to the DR severity prediction task. From the figure we can see that most of the proliferative DR images are predicted as moderate.
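The classifier described above (three hidden layers of 512, 256 and 128 units, with dropout applied at the input) can be sketched as a plain forward pass. Only the layer sizes and the position of the dropout come from the text; the ReLU activations, softmax output, dropout rate of 0.2, weight initialization, the 5 output classes and the small input dimension used in the demo are assumptions made for illustration, and all function names are hypothetical.

```python
import math
import random

HIDDEN_UNITS = (512, 256, 128)  # hidden layer sizes stated in the text


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (illustrative only)."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def input_dropout(x, rate, rng, training):
    """Inverted dropout on the input features; a no-op at inference time."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def build_dnn(input_dim, n_classes, rng):
    sizes = (input_dim, *HIDDEN_UNITS, n_classes)
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(x, layers, rng, dropout_rate=0.2, training=False):
    """Dropout at the input, ReLU hidden layers, softmax output (assumed)."""
    h = input_dropout(x, dropout_rate, rng, training)
    for layer in layers[:-1]:
        h = relu(dense(h, layer))
    return softmax(dense(h, layers[-1]))


rng = random.Random(0)
# input_dim=32 keeps the demo fast; real blended features would be much longer.
layers = build_dnn(input_dim=32, n_classes=5, rng=rng)
features = [rng.random() for _ in range(32)]
probs = forward(features, layers, rng)
print(len(probs), round(sum(probs), 6))
```

The weights here are random, so the output only demonstrates the shape of the computation, not a trained model.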

As the final feature vector is a blended version of the local and global representations of the retinal images, the final representation provides strong features. The reason for the improvement in the performance of the proposed model is that each feature map of the final convolution block of VGG16 learns the presence of different lesions in the retinal images, while Xception learns a comprehensive representation of the entire retinal scan. Combining the deep features from VGG16 and Xception gives a compact representation that captures a holistic view of DR images.

Model Accuracy (%)
DR detection using Deep Learning Gargeya and Leng (2017) 57.2
DR Classification Using Xception Kassani et al. (2019) 79.59
DR Classification Using InceptionV3 Kassani et al. (2019) 78.72
DR Classification Using MobileNet Kassani et al. (2019) 79.01
DR Classification Using ResNet50 Kassani et al. (2019) 74.64
Blended features + DNN (proposed) 80.96
Table 10: Comparison of the proposed method with existing methods

5 Conclusion

The major objective of this work is to acquire a compact and comprehensive representation of retinal images, as the feature representations extracted from retinal images significantly influence the performance of DR prediction. Initially, we extract features from the deep pre-trained VGG16-fc1, VGG16-fc2 and Xception models. The VGG16 model learns the lesions and Xception learns the global representation of the images. Then the features from multiple ConvNets are blended to get the final prominent representation of colour fundus images. The final representation is obtained by pooling the representations from VGG16 and Xception. A DNN model is trained using these blended features for the task of Diabetic Retinopathy severity level prediction. The proposed DNN model with dropout at the input avoids over-fitting and converges faster. Our experiments on the benchmark APTOS 2019 dataset show the superiority of the proposed model when compared to the existing models. Among the proposed pooling approaches, average pooling used to fuse the features extracted from the penultimate layers of multiple pre-trained ConvNets gives better performance, with minimum loss in fewer epochs, than the others.

Conceptualization, J.D.B.; methodology, S.N.S; software, S.H.; validation, M.B., P.K.R.M. and O.J.; formal analysis, O.J.; investigation, P.K.R.M.; resources, J.D.B.; writing–original draft preparation, J.D.B.; writing–review and editing, N.V.; visualization, S.N.S.; supervision, M.B.; project administration, S.H.; funding acquisition, O.J. All authors have read and agreed to the published version of the manuscript.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) under Grant NRF-2018R1C1B5045013.

The authors declare no conflict of interest.



  • Cheung et al. (2008) Cheung, N.; Rogers, S.L.; Donaghue, K.C.; Jenkins, A.J.; Tikellis, G.; Wong, T.Y. Retinal arteriolar dilation predicts retinopathy in adolescents with type 1 diabetes. Diabetes Care 2008, 31, 1842–1846.
  • (2) Flaxman, S.; Bourne, R.; Resnikoff, S.; Ackland, P.; Braithwaite, T.; Cicinelli, M.; Das, A.; Jonas, J.; Keeffe, J.; Kempen, J.; et al. Global causes of blindness and distance vision impairment 1990-2020: A systematic review and meta-analysis. Lancet Glob. Health 2017, 5, e1221–e1234.
  • Gulshan et al. (2016) Gulshan, V.; Peng, L.; Coram, M.; Stumpe, M.C.; Wu, D.; Narayanaswamy, A.; Venugopalan, S.; Widner, K.; Madams, T.; Cuadros, J.; et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016, 316, 2402–2410.
  • Williams et al. (2004) Williams, R.; Airey, M.; Baxter, H.; Forrester, J.M.; Kennedy-Martin, T.; Girach, A. Epidemiology of diabetic retinopathy and macular oedema: A systematic review. Eye 2004, 18, 963–983.
  • Long et al. (2019) Long, S.; Huang, X.; Chen, Z.; Pardhan, S.; Zheng, D. Automatic detection of hard exudates in color retinal images using dynamic threshold and SVM classification: Algorithm development and evaluation. BioMed Res. Int. 2019, 2019, 3926930.
  • Haloi et al. (2015) Haloi, M.; Dandapat, S.; Sinha, R. A Gaussian scale space approach for exudates detection, classification and severity prediction. arXiv 2015, arXiv:1505.00737.
  • Noushin et al. (2019) Noushin, E.; Pourreza, M.; Masoudi, K.; Ghiasi Shirazi, E. Microaneurysm detection in fundus images using a two step convolution neural network. Biomed. Eng. Online 2019, 18, 67.
  • Grinsven et al. (2016) Grinsven, M.; Ginneken, B.; Hoyng, C.; Theelen, T.; Sanchez, C. Fast convolution neural network training using selective data sampling. IEEE Trans. Med. Imaging 2016, 35, 1273–1284.
  • Haloi (2015) Haloi, M. Improved microaneurysm detection using deep neural networks. arXiv 2015, arXiv:1505.04424.
  • Srivastava et al. (2017) Srivastava, R.; Duan, L.; Wong, D.W.; Liu, J.; Wong, T.Y. Detecting retinal microaneurysms and hemorrhages with robustness to the presence of blood vessels. Comput. Methods Programs Biomed. 2017, 138, 83–91.
  • Bodapati and Veeranjaneyulu (2019) Bodapati, J.D.; Veeranjaneyulu, N. Feature Extraction and Classification Using Deep Convolutional Neural Networks. J. Cyber Secur. Mobil. 2019, 8, 261–276.
  • Bodapati et al. (2019) Bodapati, J.D.; Veeranjaneyulu, N.; Shaik, S. Sentiment Analysis from Movie Reviews Using LSTMs. Ingénierie des Systèmes d’Information 2019, 24, 125–129.
  • Zhuo et al. (2018) Zhuo, P.; Zhu, Y.; Wu, W.; Shu, J.; Xia, T. Real-Time Fault Diagnosis for Gas Turbine Blade Based on Output-Hidden Feedback Elman Neural Network. J. Shanghai Jiaotong Univ. (Sci.) 2018, 23, 95–102.
  • Xia et al. (2020) Xia, T.; Song, Y.; Zheng, Y.; Pan, E.; Xi, L. An ensemble framework based on convolutional bi-directional LSTM with multiple time windows for remaining useful life estimation. Comput. Ind. 2020, 115, 103182.
  • Moreira et al. (2019) Moreira, M.W.; Rodrigues, J.J.; Korotaev, V.; Al-Muhtadi, J.; Kumar, N. A comprehensive review on smart decision support systems for health care. IEEE Syst. J. 2019, 13, 3536–3545.
  • Gadekallu et al. (2020) Gadekallu, T.R.; Khare, N.; Bhattacharya, S.; Singh, S.; Maddikunta, P.K.R.; Srivastava, G. Deep neural networks to predict diabetic retinopathy. J. Ambient Intell. Humaniz. Comput. 2020. doi:10.1007/s12652-020-01963-7.
  • Patel et al. (2020) Patel, H.; Singh Rajput, D.; Thippa Reddy, G.; Iwendi, C.; Kashif Bashir, A.; Jo, O. A review on classification of imbalanced data for wireless sensor networks. Int. J. Distrib. Sens. Netw. 2020, 16, 1550147720916404.
  • Reddy et al. (2019) Reddy, G.T.; Reddy, M.P.K.; Lakshmanna, K.; Rajput, D.S.; Kaluri, R.; Srivastava, G. Hybrid genetic algorithm and a fuzzy logic classifier for heart disease diagnosis. Evol. Intell. 2019, 13, 185–196.
  • Wu et al. (2013) Wu, L.; Fernandez-Loaiza, P.; Sauma, J.; Hernandez-Bogantes, E.; Masis, M. Classification of diabetic retinopathy and diabetic macular edema. World J. Diabetes 2013, 4, 290.
  • Akram et al. (2013) Akram, M.U.; Khalid, S.; Khan, S.A. Identification and classification of microaneurysms for early detection of diabetic retinopathy. Pattern Recognit. 2013, 46, 107–116.
  • Akram et al. (2014) Akram, M.U.; Khalid, S.; Tariq, A.; Khan, S.A.; Azam, F. Detection and classification of retinal lesions for grading of diabetic retinopathy. Comput. Biol. Med. 2014, 45, 161–171.
  • Casanova et al. (2014) Casanova, R.; Saldana, S.; Chew, E.Y.; Danis, R.P.; Greven, C.M.; Ambrosius, W.T. Application of random forests methods to diabetic retinopathy classification analyses. PLoS ONE 2014, 9, e98587.
  • Verma et al. (2011) Verma, K.; Deep, P.; Ramakrishnan, A. Detection and classification of diabetic retinopathy using retinal images. In Proceedings of the 2011 Annual IEEE India Conference, Hyderabad, India, 16–18 December 2011; pp. 1–6.
  • Welikala et al. (2014) Welikala, R.; Dehmeshki, J.; Hoppe, A.; Tah, V.; Mann, S.; Williamson, T.H.; Barman, S. Automated detection of proliferative diabetic retinopathy using a modified line operator and dual classification. Comput. Methods Programs Biomed. 2014, 114, 247–261.
  • Welikala et al. (2015) Welikala, R.A.; Fraz, M.M.; Dehmeshki, J.; Hoppe, A.; Tah, V.; Mann, S.; Williamson, T.H.; Barman, S.A. Genetic algorithm based feature selection combined with dual classification for the automated detection of proliferative diabetic retinopathy. Comput. Med. Imaging Graph. 2015, 43, 64–77.
  • Roychowdhury et al. (2013) Roychowdhury, S.; Koozekanani, D.D.; Parhi, K.K. DREAM: Diabetic retinopathy analysis using machine learning. IEEE J. Biomed. Health Inform. 2013, 18, 1717–1728.
  • Mookiah et al. (2013) Mookiah, M.R.K.; Acharya, U.R.; Martis, R.J.; Chua, C.K.; Lim, C.M.; Ng, E.; Laude, A. Evolutionary algorithm based classifier parameter tuning for automatic diabetic retinopathy grading: A hybrid feature extraction approach. Knowl.-Based Syst. 2013, 39, 9–22.
  • Porter et al. (2019) Porter, L.F.; Saptarshi, N.; Fang, Y.; Rathi, S.; Den Hollander, A.I.; De Jong, E.K.; Clark, S.J.; Bishop, P.N.; Olsen, T.W.; Liloglou, T.; et al. Whole-genome methylation profiling of the retinal pigment epithelium of individuals with age-related macular degeneration reveals differential methylation of the SKI, GTF2H4, and TNXB genes. Clin. Epigenetics 2019, 11, 6.
  • Rahim et al. (2016) Rahim, S.S.; Jayne, C.; Palade, V.; Shuttleworth, J. Automatic detection of microaneurysms in colour fundus images for diabetic retinopathy screening. Neural Comput. Appl. 2016, 27, 1149–1164.
  • Dutta et al. (2018) Dutta, S.; Manideep, B.; Basha, S.M.; Caytiles, R.D.; Iyengar, N. Classification of diabetic retinopathy images by using deep learning models. Int. J. Grid Distrib. Comput. 2018, 11, 89–106.
  • Zeng et al. (2019) Zeng, X.; Chen, H.; Luo, Y.; Ye, W. Automated diabetic retinopathy detection based on binocular Siamese-like convolutional neural network. IEEE Access 2019, 7, 30744–30753.
  • Mateen et al. (2019) Mateen, M.; Wen, J.; Song, S.; Huang, Z. Fundus image classification using VGG-19 architecture with PCA and SVD. Symmetry 2019, 11, 1.
  • Reddy et al. (2020) Reddy, G.T.; Reddy, M.P.K.; Lakshmanna, K.; Kaluri, R.; Rajput, D.S.; Srivastava, G.; Baker, T. Analysis of Dimensionality Reduction Techniques on Big Data. IEEE Access 2020, 8, 54776–54788.
  • Bhattacharya et al. (2020) Bhattacharya, S.; Kaluri, R.; Singh, S.; Alazab, M.; Tariq, U. A Novel PCA-Firefly based XGBoost classification model for Intrusion Detection in Networks using GPU. Electronics 2020, 9, 219.
  • Gadekallu et al. (2020) Gadekallu, T.R.; Khare, N.; Bhattacharya, S.; Singh, S.; Reddy Maddikunta, P.K.; Ra, I.H.; Alazab, M. Early Detection of Diabetic Retinopathy Using PCA-Firefly Based Deep Learning Model. Electronics 2020, 9, 274.
  • Jindal et al. (2018) Jindal, A.; Aujla, G.S.; Kumar, N.; Prodan, R.; Obaidat, M.S. DRUMS: Demand response management in a smart city using deep learning and SVR. In Proceedings of the 2018 IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, UAE, 9–13 December 2018; pp. 1–6.
  • Vinayakumar et al. (2020) Vinayakumar, R.; Alazab, M.; Srinivasan, S.; Pham, Q.V.; Padannayil, S.K.; Simran, K. A Visualized Botnet Detection System based Deep Learning for the Internet of Things Networks of Smart Cities. IEEE Trans. Ind. Appl. 2020. doi:10.1109/TIA.2020.2971952.
  • Alazab et al. (2020) Alazab, M.; Khan, S.; Krishnan, S.S.R.; Pham, Q.V.; Reddy, M.P.K.; Gadekallu, T.R. A Multidirectional LSTM Model for Predicting the Stability of a Smart Grid. IEEE Access 2020, 8, 85454–85463.
  • Simonyan and Zisserman (2014) Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  • Chollet (2017) Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
  • Zoph et al. (2018) Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710.
  • (42) APTOS 2019. Available online: (accessed on 30 December 2019).
  • Gargeya and Leng (2017) Gargeya, R.; Leng, T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology 2017, 124, 962–969.
  • Kassani et al. (2019) Kassani, S.H.; Kassani, P.H.; Khazaeinezhad, R.; Wesolowski, M.J.; Schneider, K.A.; Deters, R. Diabetic Retinopathy Classification Using a Modified Xception Architecture. In Proceedings of the 2019 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), Ajman, UAE, 10–12 December 2019; pp. 1–6.