
Fully Convolutional Neural Network for Semantic Segmentation of Anatomical Structure and Pathologies in Colour Fundus Images Associated with Diabetic Retinopathy

by   Oindrila Saha, et al.

Diabetic retinopathy (DR) is the most common form of diabetic eye disease. Retinopathy can affect all diabetic patients and becomes particularly dangerous, increasing the risk of blindness, if it is left untreated. The success of treatment depends solely on diagnosis at an early stage. The development of automated computer-aided diagnosis tools could enable faster detection of symptoms with a wider reach and at reasonable cost. This paper proposes a method for the automated segmentation of retinal lesions and the optic disk in fundus images using a deep fully convolutional neural network for semantic segmentation. This trainable segmentation pipeline consists of an encoder network and a corresponding decoder network, followed by pixel-wise classification to segment microaneurysms, hemorrhages, hard exudates, soft exudates and the optic disk from the background. The network was trained using the binary cross-entropy criterion with a sigmoid as the last layer, while during inference an additional softmax layer was used to boost the response of a single class. The performance of the proposed method is evaluated using sensitivity, positive predictive value (PPV) and accuracy as the metrics. Further, the position of the optic disk is localised using the segmented output map.





1 Introduction

Figure 1: Overview of proposed method

Diabetic retinopathy (DR) is the leading cause of blindness in the working-age population. Screening for DR and monitoring disease progression, especially in the early asymptomatic stages, is effective for preventing visual loss and reducing costs for health systems [1]. Most screening programs use non-mydriatic digital color fundus cameras to acquire color photographs of the retina. These photographs are then examined for the presence of lesions indicative of DR. The most common signs of DR are red lesions (microaneurysms, hemorrhages) and bright lesions (exudates). The presence of red lesions and/or hard exudates (bright lesions) is indicative of early-stage DR. Microaneurysms (MAs) are focal dilatations of retinal capillaries and appear as red dots in retinal fundus images. Bright lesions, or intraretinal lipid exudates, result from the breakdown of the blood-retinal barrier. Exuded fluid rich in lipids and proteins leaves the parenchyma, leading to retinal edema and exudation. Lastly, wherever capillary walls are weak inside the retina, dot hemorrhage lesions are found, which are slightly larger than MAs. Rupture causes intra-retinal hemorrhages. Progression of DR also causes macular edema, neo-vascularization and, in later stages, retinal detachment.

Early methods of semantic segmentation that relied on low-level vision cues have quickly been superseded by machine learning algorithms. In particular, deep learning has recently seen great success in handwritten digit recognition, speech, categorising whole images and detecting objects in images [2]. However, some of the recent approaches to semantic pixel-wise labelling using deep CNNs give coarse results [3]. This is primarily because max-pooling and sub-sampling reduce feature map resolution. SegNet [4] solves this by mapping low-resolution features to input resolution for pixel-wise classification. This mapping produces features that are useful for accurate boundary localization. The proposed method uses an end-to-end trainable segmentation tool consisting of an encoder network followed by a corresponding decoder network and finally a pixel-wise classification layer, as shown in Fig 1. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [5]. The role of the decoder network is to map the low-resolution encoder feature maps to full-input-resolution feature maps for pixel-wise classification. The idea is to capture the global context of the image, since the optic disk and hard exudates have similar brightness levels, which makes them hard to differentiate when only the local context is considered.

2 Methodology

In the proposed approach, the challenge of segmenting the regions corresponding to microaneurysms, hemorrhages, soft exudates, hard exudates and the optic disk is formulated as a task of pixel-wise classification of the retinal images using a fully convolutional neural network. The optic disk has been added as a class in the same segmentation problem as the lesions, so that the model is better able to differentiate exudates from the optic disk. Given an image of size M × N × 3, the problem of segmentation can be formulated as a pixel-wise classification task where each pixel is assigned a label from a set of C classes, such that an output image of size M × N × C is generated. Each channel of the output corresponds to one of the classes. In the proposed method, a fully convolutional network with an encoder-decoder architecture is trained for the task of segmentation using the binary cross-entropy loss as the objective function.
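For reference, the standard form of the binary cross-entropy objective, averaged over all pixels and channels, can be written as below. The notation is an assumption made here for illustration: y_{i,c} denotes the binary target for pixel i in channel c, and p_{i,c} the corresponding sigmoid output; M, N and C are the image height, width and number of classes as defined above.

```latex
\mathcal{L}_{\mathrm{BCE}}
  = -\frac{1}{MNC} \sum_{i=1}^{MN} \sum_{c=1}^{C}
    \left[ y_{i,c} \log p_{i,c} + \left(1 - y_{i,c}\right) \log\left(1 - p_{i,c}\right) \right]
```

Because each channel is treated as an independent binary problem, a pixel may in principle receive a high probability in more than one channel.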


2.1 Network

SegNet has an encoder network and a corresponding decoder network, followed by a final pixel-wise classification layer. The encoder network consists of 13 convolutional layers which correspond to the first 13 convolutional layers in the VGG16 network [5]. Each encoder layer has a corresponding decoder layer and hence the decoder network has 13 layers. In contrast to the original SegNet, the final decoder output is fed to a sigmoid layer to produce class probabilities for each pixel independently in 7 channels. In the 7-channel target, each channel is the same size as the input image (536 × 356) and consists of activations in the range [0,1], where 0 corresponds to background and 1 to the presence of the corresponding class.

Each encoder in the encoder network performs convolution with a filter bank to produce a set of feature maps. These are then batch normalized. Then an element-wise rectified linear non-linearity (ReLU) is applied. Following that, max-pooling with a 2 × 2 window and stride 2 (non-overlapping window) is performed and the resulting output is sub-sampled by a factor of 2. Max-pooling is used to achieve translation invariance over small spatial shifts in the input image. Sub-sampling results in a large input image context (spatial window) for each pixel in the feature map. While several layers of max-pooling and sub-sampling can achieve more translation invariance for robust classification, there is a corresponding loss of spatial resolution in the feature maps. The increasingly lossy (in boundary detail) image representation is not beneficial for segmentation, where boundary delineation is vital. Therefore, the boundary information in the encoder feature maps is captured and stored before sub-sampling is performed. Due to memory constraints, this is done by storing only the max-pooling indices, i.e., the locations of the maximum feature value in each pooling window are memorized for each encoder feature map.

The appropriate decoder in the decoder network upsamples its input feature map(s) using the memorized max-pooling indices from the corresponding encoder feature map(s). This step produces sparse feature map(s). These feature maps are then convolved with a trainable decoder filter bank to produce dense feature maps. A batch normalization step is then applied to each of these maps. The high dimensional feature representation at the output of the final decoder is fed to a sigmoid layer. This squashes the value to [0,1] denoting the probabilities for presence of a class for each pixel independently. The output of the sigmoid layer is a 7 channel image of probabilities, each channel denoting one of the classes.
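The pooling-index mechanism described above can be sketched in plain Python on a single 2D feature map. This is a minimal illustration under stated assumptions, not the network's implementation: the function names `max_pool_2x2_with_indices` and `max_unpool_2x2` are hypothetical, and a real network would apply the same idea per channel inside a deep learning framework.

```python
def max_pool_2x2_with_indices(fmap):
    """2x2 max-pooling with stride 2 on a 2D list.

    Returns the pooled map and, for each window, the (row, col)
    position of the maximum in the original map.
    """
    height, width = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for r in range(0, height - 1, 2):
        pooled_row, index_row = [], []
        for c in range(0, width - 1, 2):
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, out_shape):
    """Place each pooled value back at its memorized position.

    All other positions stay zero, which gives the sparse map that the
    decoder convolutions later densify.
    """
    height, width = out_shape
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


feature_map = [[1, 2, 0, 3],
               [5, 4, 1, 0],
               [0, 2, 5, 6],
               [1, 0, 7, 8]]
pooled_map, pool_indices = max_pool_2x2_with_indices(feature_map)
sparse_map = max_unpool_2x2(pooled_map, pool_indices, (4, 4))
```

Only one position per window is stored, which is far cheaper than keeping the full encoder feature map, while still letting the decoder restore each maximum at its original location.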

2.2 Training

For training, the images were downsampled to 536 × 356, which is exactly one-eighth of the original image size in each dimension, keeping the aspect ratio the same. Patch-wise training of the network was not used, as patches containing exudates and the optic disk had similar intensity, which renders the task of differentiating them rather difficult. In addition to the dataset released for the challenge, the Drishti-GS dataset was used for data augmentation. The images of the Drishti-GS dataset [7] were resized keeping the aspect ratio intact and zero-padded to bring them to 536 × 356 pixels. For the optic disk mask of the Drishti-GS dataset, values of 0.75 and above were taken from the segmentation soft map, which signifies agreement by 3 of 4 experts on the presence of the optic disk. Data was further augmented by also taking horizontally flipped, vertically flipped and 180-degree rotated versions of the original images. Two additional classes were introduced: the retinal disk excluding the lesions and optic disk, and the black background. The retinal disk was found by thresholding a grayscale version of the fundus image.
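The flip-based augmentation can be sketched as follows on images stored as nested lists. This is an illustrative sketch only: the helper names are hypothetical, and it assumes that the 180-degree version is the combination of both flips and that each target mask is transformed identically to its image.

```python
def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Reverse the order of the rows (top to bottom)."""
    return [row[:] for row in image[::-1]]


def rotate_180(image):
    """Rotate by 180 degrees, i.e. apply both flips."""
    return [row[::-1] for row in image[::-1]]


def augment_pair(image, mask):
    """Return the original plus three transformed (image, mask) pairs.

    The same transform is applied to the image and its mask so that
    every pixel keeps its label.
    """
    transforms = (lambda x: x, horizontal_flip, vertical_flip, rotate_180)
    return [(t(image), t(mask)) for t in transforms]


pairs = augment_pair([[1, 2], [3, 4]], [[0, 1], [0, 0]])
```

Each source image therefore contributes four training samples.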

Figure 2: The first column shows the retina fundus images, the second the predicted segmented masks while the third column shows the ground truth segmented masks. Green : Optic Disk, Red: Soft Exudates, Blue: Microaneurysms, Cyan: Haemorrhages, Yellow: Hard Exudates

We use the binary cross-entropy loss as the objective function. The losses are averaged for each minibatch over observations as well as over dimensions. It is observed that overfitting sets in at different epochs for different classes. By monitoring the validation loss, the stages of the trained network that give the best performance for each of the classes are saved, so as to create an ensemble of networks for inference.
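One way to select a separate checkpoint per class is sketched below. It assumes, hypothetically, that a per-class validation loss is logged after every epoch; the function name, the class keys and the loss values are illustrative and do not come from the paper.

```python
def best_epoch_per_class(validation_history):
    """Pick, for each class, the epoch with the lowest validation loss.

    validation_history is a list with one dict per epoch, mapping a
    class name to its validation loss at that epoch.
    """
    best = {}
    for epoch, losses in enumerate(validation_history):
        for class_name, loss in losses.items():
            if class_name not in best or loss < best[class_name][1]:
                best[class_name] = (epoch, loss)
    return {class_name: epoch for class_name, (epoch, _) in best.items()}


# Illustrative losses only: the two classes reach their minimum at different epochs.
history = [
    {"optic_disk": 0.5, "microaneurysms": 0.9},
    {"optic_disk": 0.3, "microaneurysms": 0.8},
    {"optic_disk": 0.4, "microaneurysms": 0.7},
]
checkpoints = best_epoch_per_class(history)
```

At inference time, the channel for each class would then be taken from the network state saved at that class's selected epoch.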

2.3 Inference

During inference we introduce an additional softmax layer after the sigmoid layer, which normalizes the value of a pixel for each class across channels. The softmax layer has no trainable parameters, hence its inclusion at inference does not depend on training. With the softmax layer at inference, the masks come closer to the ground truth than without it. Finally, the segmented output is upsampled for each class to 4288 × 2848 and compared with the ground truth. Localization of the optic disk is done by finding the centroid of the region segmented as the optic disk, which is obtained by thresholding the output of the trained network.
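The effect of stacking a softmax on top of the sigmoid outputs can be illustrated for a single pixel. This is a minimal sketch with hypothetical function names and made-up input values; the real layer operates on the full 7-channel output map.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Numerically stable softmax over a list of values."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores_for_pixel(decoder_outputs):
    """Apply sigmoid per channel, then softmax across channels.

    The softmax has no parameters, so it can be added at inference
    without retraining.
    """
    return softmax([sigmoid(z) for z in decoder_outputs])


# Hypothetical decoder outputs for one pixel across three channels.
scores = class_scores_for_pixel([2.0, -1.0, 0.0])
```

After this step the scores of a pixel sum to 1 across channels, and the ordering of the channels is unchanged, so the strongest class remains the strongest.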

3 Experiments

The dataset used for this problem is from the IDRiD Diabetic Retinopathy Segmentation and Grading challenge. The challenge dataset, however, provides optic disk segmentation masks only for images with apparent retinopathy. Hence, to identify the presence of the optic disk as well as the absence of lesions in images with no apparent retinopathy, we used the Drishti-GS dataset. The retinal fundus images of the Drishti-GS dataset, like those of IDRiD, were all collected from Indian subjects. Importantly, they contain no lesions and come with a segmentation map of the optic disk.

We use the Adam optimizer [8] with learning rate and 0.9. Early stopping of the training based on the validation loss is adopted to prevent overfitting. It was observed that the validation loss started to increase after 200 epochs. Before settling on the proposed methodology, we conducted two preliminary experiments. In the first experiment, the network was trained using patches of size 256 × 256 extracted from the original image of size 4288 × 2848 pixels. The second experiment considered only a 5-class problem, without including the retinal disk and black background as separate classes.

4 Results and Discussions

Sensitivity and positive predictive value (PPV) are plotted against the threshold for each class for the obtained grayscale segmented masks. It is observed that as the threshold is increased, sensitivity decreases whereas PPV increases.

Figure 3: Sensitivity PPV tradeoff

The best threshold was then chosen from the point of intersection of these two plots as shown in Figure 3. As shown, for HE - Haemorrhages we choose 0.4 as the best threshold. Similarly, for each class the best threshold was chosen from the intersection points.
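The threshold selection can be sketched as picking, from a set of candidate thresholds, the one where sensitivity and PPV are closest to each other. The function names and the toy data are hypothetical; the sketch also assumes the convention that PPV is 1 when nothing is detected, so that an empty prediction is never mistaken for an intersection point.

```python
def sensitivity_ppv(scores, truth, threshold):
    """Compute (sensitivity, PPV) for flat lists of scores and binary labels."""
    tp = fp = fn = 0
    for score, label in zip(scores, truth):
        detected = score >= threshold
        if detected and label:
            tp += 1
        elif detected:
            fp += 1
        elif label:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    # Assumed convention: PPV is 1 when there are no detections at all.
    ppv = tp / (tp + fp) if tp + fp else 1.0
    return sensitivity, ppv


def pick_threshold(scores, truth, candidates):
    """Return the candidate where the two curves are closest."""
    def gap(threshold):
        sensitivity, ppv = sensitivity_ppv(scores, truth, threshold)
        return abs(sensitivity - ppv)
    return min(candidates, key=gap)


# Toy pixel scores and labels, for illustration only.
toy_scores = [0.9, 0.8, 0.6, 0.45, 0.3, 0.2]
toy_truth = [1, 1, 0, 1, 0, 0]
chosen = pick_threshold(toy_scores, toy_truth, [0.1, 0.25, 0.4, 0.5, 0.7, 0.85])
```

On a discrete grid the curves rarely cross exactly at a candidate, so the closest point is used as a stand-in for the intersection.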

The qualitative results are shown in Fig 2. Table 1 lists the quantitative results.

Class            Metric                 Score
Optic Disk       Jaccard Index (IoU)    0.8572
Microaneurysms   Area under PPV vs SE   0.0059
Hard Exudates    Area under PPV vs SE   0.5498
Haemorrhages     Area under PPV vs SE   0.0829
Soft Exudates    Area under PPV vs SE   0.1823

Table 1: Quantitative Results
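For readers unfamiliar with the optic disk metric, the Jaccard index (intersection over union) of two binary masks can be computed as in the sketch below. The function name and the toy masks are illustrative, and returning 1.0 for two empty masks is an assumed convention.

```python
def jaccard_index(predicted_mask, truth_mask):
    """Intersection over union of two binary masks given as nested lists."""
    intersection = union = 0
    for predicted_row, truth_row in zip(predicted_mask, truth_mask):
        for predicted, truth in zip(predicted_row, truth_row):
            if predicted and truth:
                intersection += 1
            if predicted or truth:
                union += 1
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


# Toy masks: 2 overlapping pixels out of 4 marked in either mask.
iou = jaccard_index([[1, 1, 0], [0, 1, 0]],
                    [[1, 0, 0], [0, 1, 1]])
```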

In the preliminary experiments, training on patches left the model unable to distinguish between the optic disk and exudates due to the lack of a global view. Also, treating the task as a 7-class problem instead of a 5-class one improves results by a considerable margin.

The localisation of the optic disk was evaluated by segmenting the optic disk using the above method, finding its centroid, and computing the Euclidean distance between the predicted and the ground-truth locations. The given ground-truth locations of the optic disk were not used for training, but the evaluation was done using the provided locations. The mean Euclidean distance for the given 413 images was found to be .
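The localisation step and its evaluation can be sketched as follows. The function names, the 0.5 threshold and the toy probability map are assumptions for illustration; the paper chooses its thresholds per class and works on full-resolution maps.

```python
import math


def centroid_of_mask(probability_map, threshold):
    """Threshold a probability map and return the (row, col) centroid.

    Returns None when no pixel reaches the threshold.
    """
    row_sum = col_sum = count = 0
    for r, row in enumerate(probability_map):
        for c, value in enumerate(row):
            if value >= threshold:
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        return None
    return (row_sum / count, col_sum / count)


def euclidean_distance(point_a, point_b):
    """Straight-line distance between two (row, col) points."""
    return math.hypot(point_a[0] - point_b[0], point_a[1] - point_b[1])


# Toy optic disk probability map with a 2x2 high-probability region.
toy_map = [[0.1, 0.1, 0.1, 0.1],
           [0.1, 0.9, 0.8, 0.1],
           [0.1, 0.7, 0.9, 0.1],
           [0.1, 0.1, 0.1, 0.1]]
predicted_centre = centroid_of_mask(toy_map, 0.5)
error = euclidean_distance(predicted_centre, (1.5, 4.5))
```

Averaging this distance over all evaluated images gives the mean localisation error.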

5 Conclusion

Previous work [9] on automatic detection of diabetic retinopathy deals with grading or identifying the stage of the disease. These methods require extensive pre-processing and many steps to finally reach the result [10]. The proposed method solves the problem of lesion segmentation in an end-to-end manner. The segmented output can then also be leveraged to determine the severity level directly. Also, no manual feature extraction is needed in our process. This method provides a single unified solution for six of the subtasks. Further work, however, is needed to better identify microaneurysm lesions.