Diabetic Retinopathy (DR) is a leading disabling chronic disease and one of the main causes of blindness and visual impairment in developed countries for diabetic patients. Studies reported that 90% of the cases can be prevented through early detection and treatment. Eye screening through retinal images is used by physicians to detect the lesions related with this disease. Due to the increasing number of diabetic patients, the amount of images to be manually analyzed is becoming unaffordable. Moreover, training new personnel for this type of image-based diagnosis is long, because it requires to acquire expertise by daily practice (Romero-Aroca et al., 2010), (Romero-Aroca, 2011).
, is a subfield of Machine Learning that allow the automatic model construction of very effective image classifiers using a parametric model. These models are able to identify and extract the statistical regularities present in data that are important for optimizing a defined loss function, with the final objective of mapping a high-multidimensional input into a smaller multi-dimensional output (f:
). This mapping allows the classification of multi-dimensional objects into a small number of categories. The model is composed by many neurons that are organized in layers in a hierarchical way. Every neuron receives the input from a predefined set of neurons. Every connection has a parameter that corresponds to the weight of the connection. The function of every neuron is to make a transformation of the received inputs into a calculated output value. For every incoming connection, the weight is multiplied by the input value received by the neuron. The aggregation of all inputs passes through an activation function that calculates the neuron output. Parameters are usually optimized using a stochastic gradient descent algorithm that minimizes a predefined loss function. These hierarchical models are able to learn multiple levels of representation that correspond to different levels of abstraction, which enables the representation of complex concepts in a compressed way(Bengio et al., 2013), (Bengio, 2009).
Convolutional neural networks (CNN) are a deep learning subfield that proved to be very effective for image classification, detection and segmentation. The first successful CNN was presented in (LeCun et al., 1998), and it was designed for hand-written digit recognition. This early CNN implementation used a combination of convolution, pooling and non-linearity that has been the key feature of DL since now. The DL breakthrough took place with the publication of (Krizhevsky et al., 2012)
, where for the first time a CNN won the Imagenet(Deng et al., 2009)
classification competition by a large margin. In that paper a set of innovative techniques where introduced including data augmentation, the use of rectified linear units (ReLUs) as activation function, the use of dropout for avoiding overfitting, overlapping max-pooling for avoiding the averaging effects of avg-pooling and the use of GPUs for speeding up the training time. Later on, different CNN improvements where also published. In(Simonyan and Zisserman, 2015), it was the first time that small 3x3 convolution filters where used and combined as a sequence of convolutions. In (He et al., 2016) the authors introduced residual networks. This networks used a combination of 3x3 convolutional layers with a by-pass of the input every two layers that was summed up to the output. Such bypasses improved the gradient propagation through the network allowing the design of deeper networks and improving the classification capabilities.
DL models have been also successfully applied in many medical classification tasks. In (Esteva et al., 2017) a DL classifier was designed achieving dermatologist-level accuracy for skin cancer detection. In (Zhu et al., 2018) a 3D CNN for automated pulmonary nodule detection and classification was designed. In (Wang et al., 2018) a CNN Alzheimer’s disease classifier with high performance was also described. In (Gulshan et al., 2016) the authors presented a diabetic retinopathy classifier with better performance than human experts in the detection of the most severe cases of the disease.
Ordinal regression is a term used for multi-class classification for the cases where a underlying property can be used for prestablishing an ordering of the classes. Several quality measures exist in the literature of machine learning and statistics for ordinal regression (Mehdiyev et al., 2016). Kappa is a well-known statistic coefficient defined by Cohen (Cohen, 1960) to measure inter-rater agreement in such cases. Weighted Kappa (Cohen, 1968) is another index used for measuring the goodness of a ordinal regression classification. In this last coefficient, disagreements are penalized proportionally to a power of the distance between classes. The penalization most commonly used is the quadratic. In such cases, the index is commonly referred as Quadratic Weighted Kappa (referred as QWK or indistinctly in this paper).
In medical diagnosis tasks is important not only the accuracy of predictions but also the reasons behind decisions. Self-explainable models enable the physicians to contrast the information reported with their own knowledge, increasing the probability of a good diagnostic, which may have a significant influence in patient’s treatment.
Different attempts have been done for interpreting the results reported by neural networks. In (Zeiler and Fergus, 2014) a network propagation technique is used for input-space feature visualization. After this work, (Bach et al., 2015) used a pixel-wise decomposition for classification decision. This decomposition could be done in two ways: considering the network as a global function, disregarding its topology (functional approach) or using the natural properties of the inherent function topology for applying a message passing technique, propagating back into the pixel space the last layer output values. After this work, in (Montavon et al., 2017) they used a so named Deep Taylor decomposition technique to replace the inherently intractable standard Taylor decomposition using a multitude of simpler analytically tractable Taylor decompositions.
Independent Component Analysis (ICA)(Hyvärinen and Oja, 2000)
is a statistical method for the separation of a multi-dimensional random signal into a linear combination of components that are statistically as independent from each other as possible. The theoretical foundation of ICA is based on the Central Limit Theorem, which establishes that the distribution of the sum (average or linear combination) of N independent random variables approaches a gaussian as
. When ICA method is applied, it is assumed that such separation exist, ie. that is possible to express the signal as a linear combination of independent components (IC). Perfect independence between random variables is achieved when mutual information between them is zero. Mutual information can be expressed as the Kullback-Leibler divergence between the joint distribution and the product of the distributions of each variable. Mutual information can be decomposed, under linear transforms, as the sum of two terms: one term expressing the decorrelation of the components and one expressing their non-Gaussianity(Cardoso, 2003). ICA uses optimization to calculate its components. Two types of optimization objectives can be used: minimize the mutual information or maximize the non-gaussianess of each component.
In this paper we study a technique that allows the identification, separation and visualization in the input and hidden space of the IC responsible of a particular DR classification decision taken by a DL classifier given a certain eye fundus image. This is done by calculating the minimum number of IC that are able to encode the maximum information about the particular classification. Identifying such components, we reduce the redundancy of feature space and identify the components that share the minimum mutual information between each other. In that way, under the supposition that the feature extraction phase of the deep learning model has been able to disentangle completely the feature information, we are able to separate the independent elements causing the disease.
We use the pixel-wise visualization method to identify in pixel space such independent causes. We use the assessment of experts clinicians for interpreting such IC.
The paper is structured as follows: in Section 2 the current work on DL applied to DR is briefly introduced, then, the main works on interpretation of DL are presented. Section 3 we present the methods, Section 4 presents the results showing samples of the kind of visual interpretations given by the model and finally Section 5 present the final conclusions of our work.
2 Related Work
In (Gulshan et al., 2016) a binary DL classifier was published for the detection of the most severe cases of DR (grouping classes 0 and 1 of DR on one side, and classes 2, 3 and 4 on another). This model was trained using an extended version of the EyePACS dataset mentioned before with a total of 128,175 images and improving the proper tagging of the images using a set of 3 to 7 experts chosen from a panel of 54 US expert Ophtalmologists. This model surpassed the human expert capabilities, reaching at the final operating point approximately 97% sensitivity and 93.5% specificity in the test sets of about 10,000 images for detecting the worse cases of DR. The strength of this model was its ability to predict the more severe cases with a sensitivity and specificity greater than human experts.
In our previous work (de la Torre et al., 2018)
a DL classifier was published for the prediction of five different disease grades. This model was trained using the public available EyePACS dataset. The training set had 35,126 images and the test set 53,576. The quadratic weighted kappa (QWK) evaluation metric(Cohen, 1968) over the test set using a unique DL model without ensembling was close to the reported by human experts.
The drawback of all these implementations, as of many other DL based models, is its lack of interpretability. In last years different approximations have been proposed for converting the initial DL black box classifiers into interpretable ones. Between the more successful interpretation models existing today we have the sensitivity maps (Simonyan et al., 2013), layer-wise relevance propagation (Bach et al., 2015) and Taylor type decomposition models (Montavon et al., 2017).
All these methods use different strategies to backpropagate the classification scores to the input space and distribute the value of the final classification between input pixels. With this score distribution is possible to identify the most relevant pixels for a particular classification.
Layer-wise relevance propagation models allow not only to report the predicted class but also to evaluate the importance of every input pixel of the image in the final classification decision. In such a way, it is possible to determine which pixels are more important in the final decision and facilitate the human experts the construction of rationale explanations based on the interpretation of such maps.
DL models are organized in layers, being the inputs of each one a combination of the outputs of previous ones. We design the output layer to be a linear combination of last layer feature space components. In this way we are forcing the model to disentangle the important features that, combined in a linear way, allow the achievement of a maximum possible classification score. These components (or other obtained as a linear combination of them, like with ICA analysis), are easy to analyze due to the linear nature of its relationship with the classification scores.
3.1 ICA based interpretation procedure
In this paper we go a step forward in the interpretation of score maps. Instead of generating directly the pixel maps associated with a particular class, we try to identify and separate independent elements associated with the disease. Our new contribution comes from the identification, separation and visualization of the IC that explain a particular classification decision. We hypothesize that in order to achieve high performance classification scores, the network has to encode the information required to make the classification. Human experts base its decisions in the number and types of lesions present in the image. That’s why in some way, such information has to be present in a disentangled form in last layer feature space, previous to the output layer. Instead of directly visualizing the more important pixels under a classification decision, we split the information of such last feature layer into independent features using a Independent Component Analysis (ICA). A posteriori we use a pixel-wise relevance propagation derived method to visualize such independent component in input space. In this way, we can, not only generate importance pixel maps, but also differentiate between the underlying independent causes of the disease.
We study the last layer feature space, previous to the output layer linear combination in order to identify its properties and try to isolate the independent elements that are causing a particular classification. For this purpose we use a principal component analysis (PCA)(Pearson, 1901) to appraise the redundancy of this space and a ICA (Hyvärinen and Oja, 2000) using different number of components to identify the minimum of them required to achieve a classification score close enough to the achieved without such a dimensional reduction. ICA allows to find a linear representation of non-Gaussian data so that the components are statistically independent, or as independent as possible. Such a representation seems to capture the essential structure of the data in many applications, including feature extraction (Hyvärinen and Oja, 2000)
. When the data is not Gaussian, there are higher order statistics beyond variance that are not being taken into account by PCA. While PCA captures only uncorrelated components, these uncorrelated components are not necessarily independent for general distributions. ICA minimizes the mutual information (or relative Kullback-Leibler divergence) of non-Gaussian data because two distributions with zero mutual information are statistically independent(Comon, 1994).
For finding the optimal number of IC, we apply ICA to the last layer feature space training set vectors. Using different number of IC and comparing the classification performance of the original model with the obtained using a linear combination of the reduced number of calculated components, it is possible to find the optimal number of components () that does not significantly reduce the classification performance of the original model. We modify the original model adding a new layer after the last layer feature space to calculate online the components of every analyzed image. This layer is a linear transformation and acts as a dimensionality reduction layer (see fig. 1). The final classification is achieved linearly combining the low dimensional IC layer.
After identifying the optimal , we use the receptive field and pixel-wise explanation model (de la Torre et al., 2017) to visualize the independent scores in the input space. In this way we are visualizing not only a score map explaining a classification but also differentiating and visualizing the mathematically IC responsible of a particular classification.
3.2 Mathematical formalization
Let be the set of all feature vectors of training set:
being the number of elements of the training set, and the dimension of feature space vector.
Let the set of IC calculated from :
being the number of IC, .
Every can be expressed as a linear combination of :
Where is calculated using a optimization method, minimizing the mutual information between IC () (Hyvarinen, 1999).
The ordinal regression problem solved using can be expressed as:
being the predicted class vector and the evaluation function used in the training process of the neural network calculated for the validation set.
We want to solve the same problem using the reduced ICA components (feature space compressed version):
being the predicted class vector using IC.
The optimal number of IC to select is the one that minimizes the difference in performance between both models:
In this study we use the EyePACS dataset of the Diabetic Retinopathy Detection competition hosted on the internet Kaggle Platform. For every patient right and left eye images are reported. All images are classified by ophthalmologists according to the standard severity scale presented before in (Wilkinson et al., 2003). The images are taken in variable conditions: by different cameras, illumination conditions and resolutions.
The training set contains a total of images; of class 0, of class 1, of class 2, of class 3 and of class 4. The validation set used for hyper-parameter optimization has images; of class 0, of class 1, of class 2, of class 3 and of class 4. The test set, contains a total of images; of class 0, of class 1, of class 2, of class 3 and of class 4.
4.2 Baseline Model
Our baseline model uses a 3x640x640 input image obtained from a minimal preprocessing step where only the external background borders are trimmed and later resized to the required input size. It is a CNN of 391,325 parameters, divided in 17 layers. Layers are divided in two groups: the feature extractor and the classifier. The feature extraction has 7 blocks of 2 layers. Every layer is a stack of a 3x3 convolution with stride 1x1 and padding 1x1 followed by a batch normalization and a ReLU activation function. Between every block a 2x2 max-pooling operation of stride 2x2 is applied. Afterwards, the classification phase takes place using a 2x2 convolution. A 4x4 average-pooling reduces the dimensionality to get a final 64 feature vector that are linearly combined to obtain the output scores of every class. A soft-max function allows the conversion of scores to probabilities to feed the values to the proper cost function during the optimization process. The feature extractor has 16 filters in the first block, 32 in the second and 64 in all the other.
4.3 Model Modifications
For all the training set we calculate the last layer feature space, obtaining a 64-dimensional vector as a representation of each image. We observe that this vector is highly redundant. After applying a PCA, with only 10 components it is possible to explain 99% of the variance.
Using the 64-dimensional feature-space vector () of all the training set, we calculate a set of ICAs using different values. With each one, we train a linear classifier to calculate the evaluation metric obtained over a validation set (fig. 0(b)). We choose the minimal that allows achieving maximum performance. The optimal for this problem is , achieving a not far from the achieved by the original model without dimensionality reduction ().
Fig. 2 shows the contribution of each component to the score of each class. We can see that ICA values are differentiating between extreme classes 0, 2 and 4. As we are training the network using as a loss function , the optimization takes place as an ordinal regression. We expect the network to find the underlying causes present in the image that produce the predefined sorting of the classes, ie. the types of lesions and the number of them. Class 0 score contributions come from , and ; being the class markers of the presence of disease , and . Analyzing the pixels with higher negative signals in the three components will give us the points that are contributing the most to the signaling of a possible presence of the disease. Backpropagating the scores of each one of this negative components will give a richer explanation with distinction between three possible independent causes of the final diagnostic given by the model.
We use a two-dimensional t-SNE visualization (van der Maaten and Hinton, 2008) for qualitative evaluation of the difference between the quality of the separation using the 64 feature vector and of the reduced version with only 3 ICA components. In fig. 3 we show the 2D t-SNE visualization for the original feature space and for the ICA reduced space. We can see how for both spaces class 0, 2 and 4 classes are clearly separated. Class 0 and 1 are not properly separated and in the case of 3 and 4, although the separation is not perfect, it is possible to distinguish a different location of both classes for both spaces. As the quantitative index QWK also show, there is not a qualitative separation capability difference between the t-SNE map of the original features and the map of the ICA 3 dimensional compressed feature map. For such difference in the classification performance index ( vs ) we can conclude that the independent component analysis has been able to find a adequate compressed expression of the information encoded in the network.
4.4 Score components contribution for a test sample
We use a pixel-wise relevance propagation derived method for visualizing each ICA component independently. In this way, it is possible to visualize the mathematically independent contributions, enhancing the localization of different types of primary elements causing the disease. Figure 4 shows the intermediate score maps generated using a receptive field of 61x61. Figure 5 shows the same intermediate scores for a class 4 image. These hidden layer maps are useful when a general map of the lesion locations is enough.
Input-space pixel scores are the most suitable when pixel detail is required for detecting the individual lesions causing the disease.
Figure 6 show the input space final scores for
. We show the points having contributions higher than a prestablished limit, in this case with values higher than two standard deviations.
score maps have an almost perfect match with real lesions having a very low rate of false positives and false negatives. and include statistical regularities not directly related with relevant lesions. From such interpretation we conclude that ICA acts as a filter separating lesion information present in images from blink artifacts, noise and other statistical regularities present in images.
In this paper we studied some of the feature space properties of a already trained diabetic retinopathy deep learning classifier of 17 layers. We saw how, even being a small network, the redundancy of the feature space is very high. We hypothesized that if the network is able to achieve human-level classifications, in some way, it has been able to identify the important statistical regularities present in the image that are important for the classification. As last layer is a linear classifier, such properties has to be disentagled, ie. expressed in a linear way in such last layer. We applied a independent component analysis over such last layer with the method explained in the paper in order to find the optimal number of components that maximize the classification capabilities of such compressed version of the features. We experimentally proved that reducing to only 3 components is possible to achieve almost of the evaluation metric. Such value experimentally prove that the ICA compression has been able to extract all the important elements required for the classification. With such elements, we applied a visualization technique for identifying the elements in the image of each one of the components, using a pixel-wise derived visualization technique. We are able to generate three maps for every image each one identifying independent statistical regularities important for the classification.
Our method allows not only the classification of retinographies but also the identification and localization in the image of the mathematically independent signs of the disease. The presented ICA score model is of general applicability and can be easily adapted for the usage in other image classification tasks thus, as future work we plan to test it in other types medical images (eg. cancer detection in mammograms).
This work is supported by the URV grants 2017PFR-URV-B2-60, and the Spanish research project PI15/01150 (Instituto de Salud Carlos III) and Fondos FEDER. The authors would like to thank to Kaggle and EyePACS for providing the data used in this paper.
- Romero-Aroca et al. (2010) Pedro Romero-Aroca, Ramon Sagarra-Alamo, Marc Baget-Bernaldiz, Juan Fernández-Ballart, and Isabel Méndez-Marin. Prevalence and relationship between diabetic retinopathy and nephropathy, and its risk factors in the north-east of spain, a population-based study. Ophthalmic epidemiology, 17(4):251–265, 2010.
- Romero-Aroca (2011) Pedro Romero-Aroca. Managing diabetic macular edema: The leading cause of diabetes blindness. World journal of diabetes, 2(6):98, 2011.
- Lecun et al. (2015) Yann Lecun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 5 2015. ISSN 0028-0836.
- Schmidhuber (2015) J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
- Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, August 2013. ISSN 0162-8828.
- Bengio (2009) Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
- LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
- Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, ICLR 2015, 2015.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
- Esteva et al. (2017) Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.
- Zhu et al. (2018) Wentao Zhu, Chaochun Liu, Wei Fan, and Xiaohui Xie. Deeplung: Deep 3d dual path nets for automated pulmonary nodule detection and classification. In IEEE Winter Conference on Applications of Computer Vision, 2018.
- Wang et al. (2018) Shui-Hua Wang, Preetha Phillips, Yuxiu Sui, Bin Liu, Ming Yang, and Hong Cheng. Classification of alzheimer’s disease based on eight-layer convolutional neural network with leaky rectified linear unit and max pooling. Journal of medical systems, 42(5):85, 2018.
- Gulshan et al. (2016) V. Gulshan, L. Peng, and M. Coram. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Journal of the American Medical Association, 316(22):2402–2410, 2016.
- Mehdiyev et al. (2016) Nijat Mehdiyev, David Enke, Peter Fettke, and Peter Loos. Evaluating forecasting methods by considering different accuracy measures. Procedia Computer Science, 95:264–271, 2016.
- Cohen (1960) Jacob Cohen. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46, 1960.
- Cohen (1968) Jacob Cohen. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin, 70(4):213, 1968.
- Zeiler and Fergus (2014) Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
- Bach et al. (2015) Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
- Montavon et al. (2017) Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition, 65:211–222, 2017.
- Hyvärinen and Oja (2000) Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.
- Cardoso (2003) Jean-François Cardoso. Dependence, correlation and gaussianity in independent component analysis. Journal of Machine Learning Research, 4:1177–1203, 2003.
- Pratt et al. (2016) Harry Pratt, Frans Coenen, Deborah M Broadbent, Simon P Harding, and Yalin Zheng. Convolutional neural networks for diabetic retinopathy. Procedia Computer Science, 90:200–205, 2016.
de la Torre et al. (2018)
Jordi de la Torre, Aïda Valls, and Domenec Puig.
Weighted kappa loss function for multi-class classification of
ordinal data in deep learning.
Pattern Recognition Letters, 105:144 – 154, 2018.
Machine Learning and Applications in Artificial Intelligence.
- Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
- Pearson (1901) Karl Pearson. Principal components analysis. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 6(2):559, 1901.
- Comon (1994) Pierre Comon. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.
- de la Torre et al. (2017) Jordi de la Torre, Aida Valls, and Domenec Puig. A deep learning interpretable classifier for diabetic retinopathy disease grading. arXiv preprint arXiv:1712.08107, 2017.
- Hyvarinen (1999) Aapo Hyvarinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE transactions on Neural Networks, 10(3):626–634, 1999.
- Wilkinson et al. (2003) CP Wilkinson, FL Ferris 3rd, Ronald E Klein, Paul P Lee, Carl David Agardh, Matthew Davis, Diana Dills, Anselm Kampik, R Pararajasegaram, and Juan T Verdaguer. Global diabetic retinopathy project group. proposed international clinical diabetic retinopathy and diabetic macular edema disease severity scales. Ophthalmology, 110(9):1677–1682, 2003.
- van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.