MTCD: Cataract Detection via Near Infrared Eye Images

by   Pavani Tripathi, et al.

Globally, cataract is a common eye disease and one of the leading causes of blindness and vision impairment. The traditional process of detecting cataracts involves an eye examination by an ophthalmologist using a slit-lamp microscope or ophthalmoscope to check for clouding of the normally clear lens of the eye. The lack of resources and the shortage of experts burden healthcare systems throughout the world, and researchers are exploring AI solutions to assist the experts. Inspired by the progress in iris recognition, in this research, we present a novel algorithm for cataract detection using near-infrared (NIR) eye images. NIR cameras, which are popularly used in iris recognition, are relatively low cost and easy to operate compared to an ophthalmoscope setup for data capture. However, such NIR images have not been explored for cataract detection. We present deep learning-based eye segmentation and multitask classification networks for cataract detection using NIR images as input. The proposed segmentation algorithm efficiently and effectively detects non-ideal eye boundaries and is cost-effective, and the classification network yields very high classification performance on the cataract dataset.




1 Introduction

Cataract is an age-related ocular disorder in which the eye lens develops a cloudy layer due to the breakdown of proteins in the eye, which makes the lens opaque and leads to blurry vision. The two eyes of a person can have different levels of cataract, which can develop at the same or at different times. It is one of the most common eye diseases and one of the primary causes of blindness (pascolini2012global). According to the National Blindness and Visual Impairment Survey of India 2015-19, people above the age of 50 years may develop blindness due to cataract. The condition contributes to the blindness cases, of severe visual impairment cases, and moderate visual impairment cases in this age group. According to (india), 50-80% of bilateral blindness cases in India can be attributed to cataract. These numbers demonstrate the need to detect and correct cataract in time.

Figure 1: Pre- and post-cataract samples from the NIR spectrum (top row) and the visible spectrum (bottom row).

The current process for cataract detection involves capturing eye images with a slit-lamp or an ophthalmoscope, after which an ophthalmologist examines and tests the eyes of the patient to diagnose the presence of a cataract. While this is the gold standard, the prevalence of blindness, particularly in remote rural areas, outstrips the number of trained ophthalmologists and the resources available (murthy2008current). On the other hand, low-cost near-infrared (NIR) cameras are widely used for biometric authentication through iris recognition. These cameras are available in different form factors, i.e. single eye and dual eye, and they are easy to use. We postulate that eye images obtained from these cameras can help design low-cost, accessible, and easy-to-use solutions for cataract detection.

As shown in Fig. 1, NIR eye images capture the iris and pupil regions, which can be used to explore whether these images are useful for cataract detection. However, these samples also highlight the challenges involved in designing an automated algorithm. As shown in Fig. 2, the captured images may not be ideal because of: (i) drooping eyelids due to old age, (ii) inadequate camera-to-eye distance and angle, and (iii) excessive contraction or dilation of the pupil due to other medical conditions (or ongoing medications such as blood thinners). Therefore, as the first step, it is important to design an efficient segmentation algorithm that segments the iris and pupil regions from the acquired non-ideal images. Once the iris and pupil are segmented, the proposed approach involves designing the feature extraction and classification algorithm to differentiate between healthy eye images and images with cataract. In the feature extraction and classification stage as well, the primary challenges are the irregular shape and size of the iris and pupil. Depending on the kind of occlusion present in the eye image, the angle of capture, and the shape of the iris/pupil, the segmented iris and pupil regions can be of different shapes. The classification algorithm should account for this variability and still perform accurate classification.

To address the above-mentioned challenges, this research presents an automated algorithm, termed MTCD, for cataract detection from NIR images. As shown in Fig. 3, the input NIR eye image is first processed using the proposed hierarchical pyramid network, termed PyramidNet, to segment iris and pupil patterns from images of eyes acquired in unconstrained environments, in the presence of five different covariates, viz. at-a-distance, clouding of the pupil due to cataract, punctured iris due to cataract surgery, and excessive contraction or dilation. After post-processing, the segmented eye image (with iris and pupil boundaries) is used by a multitask deep learning approach that performs two tasks: the first task classifies the image as healthy or unhealthy, and the second task classifies the image into one of three classes: pre-cataract, post-cataract, and others. The class 'others' consists of samples that are neither suffering from cataract nor have undergone surgery. The results of the proposed cataract detection algorithm are demonstrated on the publicly available IIITD Cataract Surgery dataset (nigamphacoemulsification). We further evaluate the performance of the proposed approach on four challenging eye (iris) datasets that comprise the covariates mentioned above. Since cataract is assessed in the presence of eye drops used for dilating the pupil, we have also prepared a Pupil Dilation dataset comprising images before and after the use of eye drops. (To the best of our knowledge, this is the only such dataset in the research community, and it will be released to the research community.) The results on different datasets show that the proposed algorithm yields the best performance in terms of both computation efficiency and memory requirements.

Figure 2: Showcasing the visual differences in pupil and iris in the pre and post cataract samples. Top row shows the images with cataract and before surgery. Bottom row shows images with cataract removed after surgery.
Figure 3: Proposed Pipeline of the MTCD Approach (Best viewed in colour). Architecture of Segmentation Network and Classification Network are shown in Fig. 4 and Fig. 7, respectively.

2 Related Work

Literature review is divided into two subsections: (i) cataract related and (ii) iris and pupil segmentation related.

2.1 Cataract Related

There are limited efforts in automating cataract detection. srivastava2014automatic have proposed a method to grade nuclear cataract in slit-lamp images using gray-level image gradients. YANG201645 used an ensemble approach that exploits three independent feature sets (wavelet, sketch, and texture-based features) for grading cataract fundus images. ran2018cataract have extracted features from fundus images using a three-layer deep convolutional neural network (CNN) and random forests (RF) to grade the cataract. They have demonstrated that RF improves grading accuracy. Other researchers have used a pre-trained AlexNet for feature extraction from fundus images and support vector machines (SVM) for classifying images into different categories of cataract.

ZHANG2019104978 have implemented a framework to grade the cataract into six levels using feature fusion approach obtained via ResNet18 model and handcrafted GLCM features. xu2019fully aimed at grading cataracts from slit-lamp photos using Faster-RCNN to locate the nuclear region and finally used ResNet101 to grade the samples. xu2019hybrid have used the deep model to learn useful features directly from input fundus images for grading the cataract and employed the deconvolution network method to investigate how CNN characterizes cataract layer-by-layer. grammatikopoulou2019cadis have proposed an approach for semantic segmentation in cataract surgery videos. 9283218 proposed GraNet, a CNN-based model, by introducing a point-wise convolution method to learn high-level features for the classification of nuclear cataract from anterior segment optical coherence tomography (AS-OCT) images. To the best of our knowledge, no work has been reported which utilizes images acquired in the NIR spectrum. The proposed work aims to use NIR eye images as the input to cataract classification.

2.2 Iris and Pupil Segmentation Related

The literature on iris and pupil segmentation in iris biometrics is very rich. From pre-deep learning approaches such as daugman1993high; vatsa2008improving; zhang2010texture to learning-based approaches (ICCV_2015; radman2017automated), most of the algorithms focus on near-ideal eye imaging. In the recent literature, approaches based on Convolutional Neural Networks (CNNs) are more prevalent. These approaches provide an end-to-end mechanism to search for optimal iris and pupil boundaries. Since segmentation requires the model to correctly segment very fine regions, such as iris pixels occluded by eyelashes and specular reflections present in the pupil or iris, the targeted models are designed for segmentation in non-cooperative scenarios (liu_MFCN; arsalan2017deep; lakra2018segdensenet; hofbauer2019exploiting; hu2019icb; wang2020towards). To the best of our knowledge, there is no segmentation algorithm designed for eye images affected by cataract or by cataract surgery. Existing algorithms work well on normal eyes but may not work properly when the pupil is cloudy (or affected by another medical condition), when the iris has small punctures or irregularities resulting from cataract surgery, or when the pupil is medically dilated. In this research, one of the contributions is a segmentation algorithm that helps to segment iris and pupil boundaries in medically affected eyes.

Figure 4: The proposed architecture for iris and pupil segmentation in an unconstrained environment. The dotted boxes represent the pyramid structure. The upsampling level increases in the x-direction, and the hierarchical level increases in the y-direction. The intermediate feature maps in levels L1-L5 show the different information stored in each map, which preserves both the fine and the global structure of the iris and pupil in the final output. (Best viewed in color)

3 Proposed MTCD Approach

The broad pipeline of the proposed MTCD algorithm is shown in Fig. 3. The eye image acquired from the NIR camera is given as input to the segmentation network. The segmented image is then used by the multitask network for classification. In this pipeline, the segmentation algorithm has to be robust to address real-world challenges such as specular reflections, eyelashes, de-pigmentation, irregularities due to cataract and cataract-removal surgery. In this section, we present the proposed PyramidNet for iris and pupil boundaries segmentation followed by the classification network.

Figure 5: Illustrating the difference in the local information present in the intermediate outputs when the feature maps are directly upsampled compared to when the upsampling is done in the proposed pyramid-like fashion.

3.1 Proposed PyramidNet for Iris and Pupil Segmentation

Fig. 4 presents a diagrammatic representation of the proposed algorithm. The input image is processed by the proposed algorithm which produces its binary mask. This mask is multiplied with the original input image to extract the region of interest with iris and pupil boundaries.

The proposed algorithm uses DenseNet as the backbone network. To improve the flow of information and gradients between the layers, (huang2017densely) proposed dense connectivity between the layers (direct connections from any layer to all subsequent layers). Let the input image be $x_0$ and $H_l(\cdot)$ be the non-linear transformation of the $l^{th}$ layer. The input to the $l^{th}$ layer is a concatenation of all the feature maps from the preceding layers, $x_0, \dots, x_{l-1}$, i.e.,

$$x_l = H_l([x_0, x_1, \dots, x_{l-1}]) \qquad (1)$$

where $[x_0, x_1, \dots, x_{l-1}]$ denotes the concatenation of the feature maps produced in layers $0, \dots, l-1$. To allow down-sampling of the feature maps, the DenseNet architecture has been divided into multiple densely connected blocks known as dense blocks. We represent the set of these dense blocks as $\{B_i\}$, where $i$ ranges from $1$ to the total number of dense blocks, $nBlocks$. For the task of image classification, DenseNet is trained using the categorical cross-entropy loss function. In the proposed method, DenseNet has been used in the pyramid structure for iris and pupil segmentation.
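The dense connectivity rule can be illustrated with a toy example. The following is a minimal sketch in plain Python, not the DenseNet implementation used in this work: feature maps are stood in for by flat lists of channel values, and `toy_layer` is a hypothetical transformation chosen only so the output is easy to follow.

```python
from typing import Callable, List

FeatureMap = List[float]  # a flat list of channel values stands in for a feature map


def dense_block(x0: FeatureMap, layers: List[Callable[[FeatureMap], FeatureMap]]) -> FeatureMap:
    """Apply x_l = H_l([x_0, ..., x_{l-1}]) per layer and return all maps concatenated."""
    features = [x0]
    for h in layers:
        # Every layer sees the concatenation of all earlier feature maps.
        concatenated = [v for fmap in features for v in fmap]
        features.append(h(concatenated))
    return [v for fmap in features for v in fmap]


def toy_layer(growth: int) -> Callable[[FeatureMap], FeatureMap]:
    """Hypothetical H_l: emits `growth` channels, each the ReLU of the input mean."""
    def h(x: FeatureMap) -> FeatureMap:
        mean = sum(x) / len(x)
        return [max(0.0, mean)] * growth
    return h


out = dense_block([1.0, 2.0, 3.0], [toy_layer(2), toy_layer(2)])
print(len(out))  # 3 input channels plus 2 channels from each of the 2 layers
```

The channel count grows linearly with depth, which is why DenseNet inserts down-sampling between dense blocks rather than inside them.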

Upsampling using Pyramid Structure: Existing deep learning architectures (arsalan2017deep; arsalan2018irisdensenet; lakra2018segdensenet; liu_MFCN; FCN) directly upsample the intermediate outputs to the size of the final predicted mask, resulting in a coarse mask. Upsampling by one resolution step, fusing with the previous intermediate output, and continuing the upsampling process in this manner preserves the finest details. For instance, if a low-resolution feature map is directly upsampled to the output size, then the local structure is not fully preserved. However, if the feature map is first upsampled to an intermediate size, followed by an upsampling to the output size, then the maximum local structure is preserved. We refer to this kind of upsampling procedure as upsampling in a pyramidic manner. Fig. 5 presents the difference in the feature maps fused to create the final output.

As shown in Fig. 4, we first reduce the number of channels of each dense block to two and treat segmentation as a two-class semantic segmentation problem, viz. iris class and background class. This creates the first deep pyramid structure. Each deep pyramid is represented as $P_i^j$, where $i$ denotes the hierarchy of the feature maps in the y-direction and $j$ represents the upsampling level in the x-direction. Stacking multiple such deep pyramids creates a structural pyramid where each level is represented as Lr, where $r$ is equal to the number of deep pyramids.

Deep Pyramid: As represented in Equation 2, the output of each of the dense blocks is convolved with a kernel to reduce the number of channels to two. This convolution operation sets the starting point of our upsampling path and creates the first deep pyramid, symbolized as $P_i^1$, where $i$ ranges from $1$ to the number of outputs in a hierarchical level; in this case, the maximum value of $i$ is equal to the number of blocks present in the base architecture. Mathematically, $P^1$ represents the set of feature maps present in this level,

$$P^1 = \{\, P_i^1 = \mathrm{conv}(B_i) : i = 1, \dots, nBlocks \,\} \qquad (2)$$

where $B_i$ is the output of the $i^{th}$ dense block and nBlocks is equivalent to the number of dense blocks in the base architecture.

The next deep pyramid, whose set of feature maps is represented as $P^2$, utilizes $P^1$. It is mathematically defined as:

$$P_i^2 = P_i^1 \oplus P_{i+1}^1, \quad i = 1, \dots, nBlocks - 1 \qquad (3)$$

where the symbol $\oplus$ denotes a set of fusion operations to combine the feature maps. After deconvolution, the upsampled feature maps are concatenated with the feature maps of the previous hierarchical level. After this, a convolution filter is applied to this two-channel output. This convolution operation is done for two reasons. Firstly, it reduces the aliasing effect that may have occurred due to upsampling of lower-resolution feature maps. Secondly, it helps in removing the noise present in the higher-resolution feature maps. Due to the concatenation operation, the number of channels in the fused output increases from two to four. To reduce the number of channels back to two, we apply convolution on the fused output. We continue fusing the outputs of each deep pyramid until the hierarchy level becomes the same as the number of blocks. Mathematically, every deep pyramid can be defined as:

$$P_i^j = P_i^{j-1} \oplus P_{i+1}^{j-1}, \quad i = 1, \dots, nBlocks - j + 1 \qquad (4)$$

where $i$ denotes the hierarchy of the feature maps in the y-direction, and $j$ represents the upsampling level in the horizontal upsampling path. Due to the fusion of feature maps, the number of hierarchy levels keeps decreasing as we move forward in the horizontal upsampling path.
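The way the deep pyramids shrink by one hierarchy level at each upsampling step can be sketched in a few lines. The code below is only an illustration under simplifying assumptions: nearest-neighbour upsampling stands in for the learned deconvolution, a plain average stands in for the concatenate-then-convolve fusion, each level is assumed to be exactly twice the resolution of the next, and the helper names (`upsample2x`, `fuse`, `build_pyramids`, `constant`) are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def upsample2x(fmap: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling (a stand-in for the learned deconvolution)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(high: Grid, low: Grid) -> Grid:
    """Stand-in for concatenate + convolve: average the upsampled map with the finer one."""
    up = upsample2x(low)
    return [[(a + b) / 2.0 for a, b in zip(r1, r2)] for r1, r2 in zip(high, up)]


def build_pyramids(first: List[Grid]) -> List[List[Grid]]:
    """Starting from the first deep pyramid (finest map first), fuse neighbours until one map is left."""
    pyramids = [first]
    while len(pyramids[-1]) > 1:
        prev = pyramids[-1]
        pyramids.append([fuse(prev[i], prev[i + 1]) for i in range(len(prev) - 1)])
    return pyramids


def constant(size: int, value: float) -> Grid:
    """Square feature map filled with a single value, used as toy input."""
    return [[value] * size for _ in range(size)]


# Three hypothetical block outputs at 8x8, 4x4 and 2x2 resolution.
pyramids = build_pyramids([constant(8, 1.0), constant(4, 2.0), constant(2, 4.0)])
print([len(p) for p in pyramids])  # number of feature maps in each deep pyramid
final = pyramids[-1][0]
print(len(final), len(final[0]))   # resolution of the final fused map
```

Each map is only ever upsampled by one resolution step before it is fused, which is the property the pyramidic scheme relies on to retain local structure.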

Figure 6: Illustrating the process of calculating the loss that is back-propagated through the segmentation network.

As can be observed in Fig. 4, each deep pyramid contains varied information. The feature map set of the highest hierarchical level has the maximum resolution. It contains the most noise along with very fine details of the iris. The feature map set of the last hierarchical level has the least resolution; it contains the least noise and preserves the maximum global iris and pupil structure. Hence, when these feature maps are fused to create the next deep pyramid, most of the noise is removed while the fine and global iris and pupil structures are kept intact. Further, the total computation cost of adding the feature maps is minimal. Consequently, accurate masks can be obtained without introducing too many overhead parameters to the base network.

Structural Pyramid: Fusing all the deep pyramids in the proposed manner creates a structural pyramid. Each level of structural pyramid contains feature maps of the same resolution and is represented as Lr, where r is equivalent to the number of deep pyramids. It can be visually seen from Fig 4 that each set of feature maps in a particular structural pyramid level presents different information towards the final prediction. For instance, in level (represented in Magenta), some feature maps preserve the edge information, whereas others preserve the global structure of the iris and pupil, resulting in an accurate mask.

Iris and Pupil Mask Prediction: Once only one set of feature maps remains in the deep pyramid, it is flattened and softmax is applied to obtain per-pixel classification, i.e.

$$p(y = c \mid x; \theta) = \frac{\exp(\theta_c^{\top} x)}{\sum_{k=1}^{C} \exp(\theta_k^{\top} x)}$$

where $C$ represents the number of classes, viz. two in our case, $\theta$ symbolizes the softmax parameters, and $p$ is the resulting probability map. To get the eye mask, each pixel is allocated the channel with the highest probability. Fig. 6 illustrates the process of predicting the mask. Finally, a binary morphological post-processing is performed where the mask is first dilated, followed by an erosion operation. The eroded output is then multiplied with the original image to generate a region of interest, i.e. the eye region only.
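The mask prediction and post-processing steps can be sketched as follows. This is a minimal illustration in plain Python, assuming a 3x3 structuring element (the size used in this work is not stated) and two hypothetical logit maps in place of the network output; it is not the authors' implementation.

```python
import math
from typing import Callable, List

Grid = List[List[float]]
Mask = List[List[int]]


def softmax_argmax(logits_bg: Grid, logits_iris: Grid) -> Mask:
    """Per-pixel two-class softmax; each pixel takes the class with the higher probability."""
    mask = []
    for row_bg, row_ir in zip(logits_bg, logits_iris):
        row = []
        for b, i in zip(row_bg, row_ir):
            m = max(b, i)  # subtract the maximum for numerical stability
            e_b, e_i = math.exp(b - m), math.exp(i - m)
            row.append(1 if e_i / (e_b + e_i) > 0.5 else 0)
        mask.append(row)
    return mask


def _morph(mask: Mask, op: Callable[[List[int]], int]) -> Mask:
    """Apply max (dilation) or min (erosion) over a 3x3 window clipped at the image border."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = op(window)
    return out


def close_mask(mask: Mask) -> Mask:
    """Dilation followed by erosion, which fills small holes in the mask."""
    return _morph(_morph(mask, max), min)


def apply_mask(image: Grid, mask: Mask) -> Grid:
    """Keep only the pixels inside the mask (region of interest)."""
    return [[p * m for p, m in zip(r_i, r_m)] for r_i, r_m in zip(image, mask)]


# Toy 7x7 example: a 3x3 eye region in the centre with a one-pixel hole at (3, 3).
inside = [[1 if 2 <= y <= 4 and 2 <= x <= 4 and (y, x) != (3, 3) else 0
           for x in range(7)] for y in range(7)]
logits_iris = [[2.0 if v else -2.0 for v in row] for row in inside]
logits_bg = [[0.0] * 7 for _ in range(7)]

mask = softmax_argmax(logits_bg, logits_iris)
closed = close_mask(mask)
roi = apply_mask([[5.0] * 7 for _ in range(7)], closed)
print(mask[3][3], closed[3][3])  # hole before and after closing
```

In this toy case the closing fills the single-pixel hole without growing the outer boundary of the region, which is the behaviour wanted when specular reflections punch small gaps into the predicted mask.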

3.2 Cataract Classification using Multitask Learning

Figure 7: Multitask Classification Network for Cataract Classification.

For the given problem of cataract detection, a segmented eye can be healthy or unhealthy and if unhealthy, it can be a cataract or any other disorder. The cataract affected eye may further be categorized into pre-cataract surgery or post-cataract surgery. In this research, we present this problem as a multitask learning problem with the following two tasks:

  • Task 1 (T1): the first task is to classify the input image into one of the classes: {healthy, unhealthy}

  • Task 2 (T2): the second task is to classify the input image into one of three classes, i.e. {pre-cataract, post-cataract, others}. The 'others' class consists of samples which are neither affected by cataract nor by surgery.

Multitask learning can be accomplished in various ways, such as joint learning of multiple related tasks (liu2019joint) and learning auxiliary tasks to support the main task (liebel2018auxiliary). Due to the limited number of images in the dataset, it is not easy to train a deep neural network from scratch. Therefore, we have used a transfer learning approach, which reduces training (computation) time and helps in achieving better performance on smaller datasets. We have used the pre-trained ResNet50 (he2016deep) as the base model and trained it to learn feature representations for cataract detection.

Fig. 7 illustrates the block diagram of the proposed multitask network. To train this network, joint optimization of the losses pertaining to these two tasks is performed. The final loss function is computed as the weighted sum of two classification losses. We have used binary cross-entropy (BCE) loss and categorical cross-entropy (CCE) loss for Task 1 and Task 2, respectively. The two individual losses and the final loss are defined as follows:

$$L_{T1} = -\big[\, y \log p + (1 - y) \log (1 - p) \,\big]$$

$$L_{T2} = -\sum_{i} y_i \log p_i$$

$$L = \lambda_1 L_{T1} + \lambda_2 L_{T2}$$

where $i$ is the class index, $p$ is the class probability, $y$ is the ground-truth label, and $\lambda_1$ and $\lambda_2$ are the weights of the two losses.
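The weighted sum of the two classification losses can be written out for a single sample. This is a minimal sketch: the weights `w1` and `w2` default to 1.0 purely as an assumption (the weights used in this work are not stated), and the function names are hypothetical.

```python
import math
from typing import Sequence


def bce_loss(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for Task 1; p is the predicted probability of the positive class."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def cce_loss(probs: Sequence[float], one_hot: Sequence[int], eps: float = 1e-12) -> float:
    """Categorical cross-entropy for Task 2 over the three classes."""
    return -sum(y * math.log(max(p, eps)) for p, y in zip(probs, one_hot))


def multitask_loss(p_t1: float, y_t1: int,
                   probs_t2: Sequence[float], y_t2: Sequence[int],
                   w1: float = 1.0, w2: float = 1.0) -> float:
    """Weighted sum of the two task losses; w1 and w2 are assumed values."""
    return w1 * bce_loss(p_t1, y_t1) + w2 * cce_loss(probs_t2, y_t2)


# One unhealthy, pre-cataract sample with an uncertain prediction on both heads.
loss = multitask_loss(0.5, 1, [0.5, 0.25, 0.25], [1, 0, 0])
print(round(loss, 4))  # each term equals ln 2, so the sum is 2 ln 2
```

Because both heads share the same backbone, the gradient of this single scalar updates the shared features for both tasks at once.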

3.3 Implementation Details

This section provides the implementation details of the proposed approach.

Segmentation Network: The proposed segmentation architecture utilizes the DenseNet model with convolution layers and is trained from scratch using the CASIAv4-distance dataset, UBIRISv2 (ubiris), and the IIITD Cataract Surgery dataset (nigamphacoemulsification). The model is trained for 60 epochs using the adaptive moment estimation (Adam) optimiser (kingma2014adam) with initial learning rate of .

During training, contrast normalization and flip operations are used to augment the dataset size by times. For contrast normalization, different contrast factors have been used. Size of the input images for all the datasets in the NIR spectrum is . The ROI is extracted using SegDenseNet (lakra2018segdensenet). After extraction of ROI the size of the image reduces to which is then fed into the proposed PyramidNet.

Classification Network:

For training the classification network, we have used the IIITD Cataract Surgery, IIITD Alcohol, and (the proposed) Pupil Dilation datasets. The cataract samples (pre-cataract and post-cataract surgery) are considered unhealthy for Task 1 and form two separate classes in Task 2. The other two datasets are used as the healthy class (more details about the datasets are in the next section). For Task 1 and Task 2, we have used sigmoid and softmax activation functions, respectively. For feature extraction, transfer learning is used: ResNet50 pre-trained on the ImageNet dataset serves as the base model, and fine-tuning is performed on the train sets of the above-mentioned datasets. As shown in Fig. 7, a global average pooling (GAP) layer and two fully connected (FC) layers are added on top of the pre-trained ResNet50 model. These two fully connected layers are added for the two classification tasks, Task 1 and Task 2. The best results are obtained with a model trained for 100 epochs with a learning rate of 0.00001, Adam as the optimizer, and a batch size of 4 on an NVIDIA V100 32GB GPU. To achieve better generalization, we have also performed data augmentation with contrast normalization by various factors and flip operations, which increased the dataset size by five times.
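The label assignment described above, where one image receives a label for each task, can be sketched as a small lookup. The source tags (`cataract_pre`, `alcohol`, and so on) and the function name are hypothetical and introduced only for illustration; the datasets themselves do not define these strings.

```python
from typing import Tuple

T2_CLASSES = ("pre-cataract", "post-cataract", "others")

# Hypothetical source tags mapped to the Task 2 class of each sample.
SOURCE_TO_T2 = {
    "cataract_pre": "pre-cataract",
    "cataract_post": "post-cataract",
    "alcohol": "others",
    "pupil_dilation": "others",
}


def make_labels(source: str) -> Tuple[int, Tuple[int, ...]]:
    """Return (Task 1 label, Task 2 one-hot vector); Task 1 uses 1 for unhealthy, 0 for healthy."""
    t2_name = SOURCE_TO_T2[source]
    t1 = 0 if t2_name == "others" else 1
    t2 = tuple(1 if name == t2_name else 0 for name in T2_CLASSES)
    return t1, t2


print(make_labels("cataract_post"))
print(make_labels("alcohol"))
```

Keeping the two labels consistent by construction avoids contradictory targets, such as a sample marked healthy in Task 1 but pre-cataract in Task 2.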

4 Datasets

The proposed deep learning based segmentation and classification method is evaluated on three datasets, viz., IIITD Cataract Surgery (nigamphacoemulsification), IIITD Alcohol (arora2012iris), and on the proposed Pupil Dilation dataset. These datasets are chosen since they comprise various covariates of eye image, making them suitable choices for evaluating the efficiency of the proposed models.

Pupil Dilation dataset: The proposed dataset contains images showcasing variations due to pupil dilation. Tropicacyl Plus, a prescription drug used to treat paralysis of the ciliary muscle and to dilate pupils before and after ophthalmic surgery, is used to create the dataset. The dataset consists of images acquired from human subjects before and after the medicine is administered by the ophthalmologist. The pupil dilation dataset contains 528 images, pre-eyedrop treatment and post-eyedrop treatment images of subjects. Fig. 8 shows sample images acquired pre and post eyedrop treatment. Table 1 presents various characteristics of the images. To the best of our knowledge, this is the first dataset of its kind, and we will release it to the research community. For experiments, 50 samples are used for testing while the remaining form the training set. After augmentation, the number of training samples is 1815.

IIITD Cataract Surgery Dataset contains 880 samples from 132 individuals, 440 each representing cataract and post cataract surgery samples (represented as pre and post cataract surgery). 100 samples from both the classes are kept in test set and the remaining comprise the train set. After augmentation, the number of training samples is 4080.

IIITD Alcohol dataset: arora2012iris studied the effect of alcohol on pupil dilation and constriction. The pupil dilates or constricts due to the intake of alcohol, which affects iris recognition performance. Fig. 8 (rows c and d) shows how alcohol can affect the size of the pupil, which in turn can affect iris recognition. More details about this dataset can be found in arora2012iris. This dataset contains 440 images pertaining to 110 subjects. Of these, 50 randomly selected samples are used for testing, while the remaining comprise the training set. After applying the augmentation, the number of training samples is 1170.

Data Preparation: For learning the segmentation model, we have pre-trained the model on the CASIAv4-distance and UBIRISv2 (ubiris) datasets, and then the IIITD Alcohol and IIITD Cataract Surgery datasets are used for fine-tuning. For learning the cataract classification model, the ImageNet pre-trained base model is used, and then the IIITD Alcohol, IIITD Cataract Surgery, and the proposed Pupil Dilation datasets are used. Further, data augmentation is applied so as to minimize the data imbalance problem. The IIITD Alcohol and Pupil Dilation datasets belong to one class, and the IIITD Cataract Surgery dataset belongs to the other class, thus making the overall data balanced. Ground truth segmentation masks for iris and pupil have been manually annotated using Adobe Photoshop. We will release the proposed database, annotations, and train-test partition details via

Figure 8: Sample images: (a) and (b) are pre and post cataract surgery; (c) and (d) are pre and post alcohol; (e) and (f) are pre and post pupil dilation from the Pupil Dilation dataset.
| Characteristics | Pupil Dilation |
|---|---|
| Sensor | Vista Sensor |
| Environment | Indoor |
| Sessions | Two |
| No. of individuals | 88 |
| No. of images | 276 (pre) and 276 (post) |
| Resolution | 640x480 |
| | Excessive dilation due to the administered eyedrops. |

Table 1: Characteristics of the proposed Pupil Dilation dataset.

5 Experimental Results

The performance of the proposed MTCD approach is presented in two parts, (i) segmentation and (ii) classification. The effectiveness of the algorithm is assessed by varying the base model and by comparing the results with existing algorithms. We have also performed an ablation study to demonstrate the effectiveness of various components of the algorithm.

5.1 Segmentation Performance

The performance of the segmentation algorithm is measured using the average classification error rate proposed in the NICE-I competition (NICE).

$$E = \frac{1}{N \times h \times w} \sum_{n=1}^{N} \sum_{i=1}^{h} \sum_{j=1}^{w} G_n(i, j) \otimes M_n(i, j)$$

where $G$, $M$, $N$, $h$, and $w$ denote the ground truth mask, the predicted mask, the total number of test samples, the height, and the width of the mask, respectively. The logical exclusive-OR operator $\otimes$ calculates the proportion of disagreeing pixels between the ground truth and the predicted segmentation mask. We compare results with a non-deep learning method (ICCV_2015) and two deep learning methods: IrisParseNet (wang2020towards) and SegDenseNet (lakra2018segdensenet).
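The error measure amounts to counting the pixels where the two binary masks disagree and dividing by the total number of pixels. A minimal sketch in plain Python follows; it assumes that all masks share the same size and hold values 0 or 1, and the function name is hypothetical.

```python
from typing import List, Sequence

Mask = List[List[int]]


def segmentation_error(ground_truths: Sequence[Mask], predictions: Sequence[Mask]) -> float:
    """Fraction of pixels, over all test masks, where prediction and ground truth differ."""
    disagreements = 0
    pixels = 0
    for gt, pred in zip(ground_truths, predictions):
        for row_gt, row_pred in zip(gt, pred):
            for g, m in zip(row_gt, row_pred):
                disagreements += g ^ m  # exclusive-OR is 1 only when the labels differ
                pixels += 1
    return disagreements / pixels


gts = [[[1, 1], [0, 0]], [[1, 0], [1, 0]]]
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
error = segmentation_error(gts, preds)
print(f"{100 * error:.2f}%")  # one wrong pixel out of eight
```

Multiplying by 100 gives the percentage form in which segmentation errors are reported in Table 2.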

Fig. 9 shows the sample results on the IIITD Cataract Surgery dataset where the masks are overlaid on the iris and pupil region. These examples show that the proposed algorithm is able to detect the fine boundaries of iris and pupil region. Table 2 presents segmentation errors obtained from the proposed algorithm and the existing algorithms. The percentage error has reduced by %, %, and % (from the next best performing model on these datasets) on IIITD Cataract Surgery, IIITD Alcohol, and Pupil Dilation datasets, respectively compared to existing techniques. It shows that our method yields state-of-the-art accuracies on all these datasets. Further, Fig. 10 compares the performance across the methods based on the classification accuracy. It can be observed that the proposed method is able to classify each iris pixel more accurately compared to existing deep learning methods for iris segmentation.

| Method | Average segmentation error (%), one value per dataset |
|---|---|
| IrisParseNet, 2020 | 9.87, 3.06, 8.16 |
| Zhao and Kumar, 2015 | 6.28, 8.51, 7.67 |
| SegDenseNet, 2018 | 0.98, 1.42, 3.46 |
| Proposed Method | 0.77, 1.25, 2.36 |

Table 2: Comparisons of the proposed and existing iris segmentation techniques using average segmentation error (%). For fair comparison no post-processing is done for wang2020towards.
Figure 9: Illustrating the segmentation output by FCN-8s (first row) and the proposed (second row) algorithms on the IIITD Cataract Surgery dataset. The masks are overlaid on the images to visually demonstrate segmentation with respect to the iris and pupil boundaries. The results demonstrate that the proposed algorithm yields finer boundaries compared to FCN-8s approach.

Fig. 11 shows sample masks generated by the proposed method and compares them with existing algorithms on the three datasets. As can be visually observed, the proposed method predicts very accurate masks, implying that it preserves both the global and the fine structures of the iris and pupil. The first row shows how the model can segment the iris region even when it is severely occluded by reflection. Further, all the masks predicted by the proposed method have fine details, such as the removal of areas occluded by fine eyelashes. Also, unlike SegDenseNet (lakra2018segdensenet), the proposed method can predict the mask for sample images of the IIITD Cataract dataset that contain bubbles. It is our assertion that the proposed method can overcome this because upsampling in a pyramid fashion preserves both the local and global structure. Further, as shown in Fig. 11, when the contrast difference between the iris and the sclera region is extremely low, the proposed algorithm is still able to detect the boundaries. It can be directly observed that both SegDenseNet (lakra2018segdensenet) and (ICCV_2015) fail to segment the boundaries correctly. However, the proposed method can handle these cases with great precision because it restores the information in a pyramid manner. The fine edge information and global structure present in the structural pyramid, when fused, allow the boundaries to be predicted accurately even when the contrast difference is extremely low.

Figure 10: Showcasing the classification accuracy of existing and proposed segmentation methods on the datasets used in the paper.
Figure 11: Showcasing the results of iris segmentation on multiple datasets. (a) The input image; masks obtained by (b) ICCV_2015 method, (c) SegDenseNet (lakra2018segdensenet) (the next best performing deep learning approach), (d) proposed PyramidNet, and (e) ground truth.

We also compare the performance of the proposed algorithm with the FCN architecture (FCN). In that approach, a deconvolution operation is used to upscale the image and combine it with the output feature maps of the previous layer. In the proposed architecture, however, the second and third blocks of the DenseNet (as shown in Figure 4) are utilized in two ways. This results in multiple feature maps of the same resolution. Combining these incorporates both the coarse and fine structures of the iris and pupil in the segmentation process. The difference between the proposed and FCN-8s outputs is shown in Figure 9. The results for FCN-8s are computed using our implementation of FCN-8s. On the cataract dataset, FCN yields % segmentation error and achieves % segmentation error. This comparison shows that for iris and pupil boundary segmentation, it is imperative to combine feature maps at each upsampling level.
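The idea of combining feature maps at each upsampling level can be illustrated with a minimal sketch. It is not the authors' network: it assumes nearest-neighbour upsampling by a factor of two and element-wise addition as the fusion step, and the names `upsample2x`, `fuse`, and `fuse_pyramid` are hypothetical. Plain Python lists stand in for single-channel feature maps.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling: repeat every value twice along both axes."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse(coarse, fine):
    """Upsample the coarse map and add it element-wise to the finer map."""
    up = upsample2x(coarse)
    if len(up) != len(fine) or len(up[0]) != len(fine[0]):
        raise ValueError("fine map must be exactly twice the size of the coarse map")
    return [[a + b for a, b in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


def fuse_pyramid(levels):
    """Fuse maps ordered from finest to coarsest, starting at the coarsest level."""
    fused = levels[-1]
    for finer in reversed(levels[:-1]):
        fused = fuse(fused, finer)
    return fused


# Toy three-level pyramid (values are illustrative, not taken from the paper).
fine = [[1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
middle = [[1, 2],
          [3, 4]]
coarse = [[10]]

fused = fuse_pyramid([fine, middle, coarse])
```

Every output position receives a contribution from each level, which is the property attributed above to fusing coarse and fine structures.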

To further show the efficacy of each component of the architecture, an ablation study is performed. In the proposed method, the number of structural levels is equal to the number of dense blocks present in the base architecture. To understand the effect of each structural pyramid level on the final output, we have computed the segmentation error. In our ablation study, the lowest segmentation error is achieved when all the dense blocks are used for building the structural pyramid. This is because both global and fine structures must be preserved for iris and pupil segmentation. It is our assertion that the maximum fine structure is preserved by the output of the first level of the structural pyramid and the maximum global information is stored in the last, i.e., fifth, level of the structural pyramid. We have observed that there is a small decrease in segmentation error if the feature maps obtained at structural pyramid level are directly upsampled to the size of the output image compared to upsampling of feature maps of level. However, on upsampling the feature map obtained at level, there is a significant decrease in the segmentation error because the maximum amount of local information, i.e., the finest details of the iris and pupil, is preserved in it.

Segmentation                   Task   Accuracy (%)   Precision   Recall   F1 score
Baseline SegDenseNet           T1     97.34          0.96        0.97     0.97
Baseline SegDenseNet           T2     92.34          0.92        0.92     0.92
PyramidNet                     T1     100            1.0         1.0      1.0
PyramidNet                     T2     95.67          0.96        0.96     0.96
PyramidNet + Post-Processing   T1     100            1.0         1.0      1.0
PyramidNet + Post-Processing   T2     96.67          0.97        0.97     0.97
Table 3: Summarizes the performance of the proposed approach of multitask eye image classification by changing the segmentation algorithm. We used SegDenseNet (lakra2018segdensenet) as a baseline and replaced it with our proposed PyramidNet, which outperformed the baseline results. It is also evident from the table that the post-processing step on the segmentation masks of PyramidNet improves both the tasks’ overall performance.
Model       Model size (MB)   No. of parameters (M)   Test time
            119.0             31.28                   0.15
            57.30             8.00                    0.024
Proposed    11.9              0.92                    0.017
Table 4: Characteristics of the models proposed for iris segmentation. Cost for wang2020towards has been directly taken from the paper.

The proposed method has a significantly lower number of parameters than IrisParseNet (wang2020towards) and SegDenseNet (lakra2018segdensenet) (see footnote 4). As shown in Table 4, the number of parameters is reduced by more than 30 times (31.28M vs. 0.92M) and the size of the model by 10 times (119.0 MB vs. 11.9 MB). Further, the testing time of the proposed method is the lowest. The proposed model yields state-of-the-art results on three datasets and is the most economical of the compared models in terms of both computation cost and memory consumption. For uniformity, all the algorithms are implemented and run on the same machine with identical configurations.

Footnote 4: Parameters are calculated using our own implementation of (liu_MFCN) methods.
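The ratios quoted above follow directly from the Table 4 entries. The short check below recomputes them; the keys `row_1` and `row_2` are placeholders that refer to the first two rows of the table by position, and the test-time values are used in whatever unit the table reports.

```python
# Values copied from Table 4: (model size in MB, parameters in millions, test time).
table4 = {
    "row_1": (119.0, 31.28, 0.15),
    "row_2": (57.30, 8.00, 0.024),
    "proposed": (11.9, 0.92, 0.017),
}


def reduction_factors(reference, proposed):
    """Return how many times smaller or faster the proposed model is, per column."""
    return tuple(ref / prop for ref, prop in zip(reference, proposed))


for name in ("row_1", "row_2"):
    size_x, params_x, time_x = reduction_factors(table4[name], table4["proposed"])
    print(f"{name}: size x{size_x:.1f}, parameters x{params_x:.1f}, test time x{time_x:.1f}")
```

Against the first row, the proposed model is about 10 times smaller and has about 34 times fewer parameters.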

5.2 Cataract Classification

The cataract classification performance is reported in terms of classification accuracy, precision, recall, and F1 score. The output of the segmentation algorithm, i.e., the segmented iris and pupil region, is used as input to the classification algorithm. For comparison, we have used the SegDenseNet (lakra2018segdensenet) approach (the second-best segmentation approach, from Table 2). Further, to showcase the effect of binary morphological operations (post-processing) after the proposed PyramidNet, we have shown the results with and without post-processing. Table 3 summarizes the results of the proposed multitask classification algorithm with three segmentation approaches. It can be clearly observed that PyramidNet yields improved performance compared to the baseline results of SegDenseNet. PyramidNet differentiates between the healthy and unhealthy classes with 100% accuracy. Further, PyramidNet with post-processing does not deteriorate the performance in Task T1 but improves the classification performance in Task T2. For differentiating between the diseased classes, i.e., Task T2, PyramidNet with post-processing yields an error of only 3.3%. Analyzing the precision and recall, we have observed that both the precision and recall of the healthy class are 1. This result is due to the fact that there is no overlap between the samples of the healthy and unhealthy classes. It is further supported by Table 5 (confusion matrix), which shows that very few pre-cataract and post-cataract samples are misclassified into each other. Interestingly, among the remaining two classes, the precision of the post-cataract class is lower than that of the other class, while the recall of the post-cataract class is higher. After post-processing, the overall performance and the precision of the post-cataract class improve; however, the recall reduces marginally by 0.03.
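The post-processing step is described only as binary morphological operations. As a hedged illustration, the sketch below assumes a closing followed by an opening with a 3x3 square structuring element and treats pixels outside the image as background; the actual operations and element size used with PyramidNet are not stated here, and the function names are hypothetical.

```python
def _window(mask, r, c):
    """3x3 neighbourhood of (r, c); positions outside the image count as 0."""
    h, w = len(mask), len(mask[0])
    return [
        mask[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
        for rr in (r - 1, r, r + 1)
        for cc in (c - 1, c, c + 1)
    ]


def erode(mask):
    """Keep a pixel only if its whole 3x3 neighbourhood is foreground."""
    return [[1 if all(_window(mask, r, c)) else 0 for c in range(len(mask[0]))]
            for r in range(len(mask))]


def dilate(mask):
    """Set a pixel if any pixel in its 3x3 neighbourhood is foreground."""
    return [[1 if any(_window(mask, r, c)) else 0 for c in range(len(mask[0]))]
            for r in range(len(mask))]


def clean_mask(mask):
    """Closing (fills small holes) followed by opening (removes small specks)."""
    closed = erode(dilate(mask))
    return dilate(erode(closed))


# Toy 12x12 mask: a 5x5 region with a one-pixel hole, plus an isolated speck.
size = 12
mask = [[0] * size for _ in range(size)]
for r in range(2, 7):
    for c in range(2, 7):
        mask[r][c] = 1
mask[4][4] = 0  # hole inside the region
mask[9][9] = 1  # isolated speck
cleaned = clean_mask(mask)
```

In this toy mask, the closing fills the hole at (4, 4) and the opening removes the speck at (9, 9), leaving a solid 5x5 region.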

Fig. 12 shows the tSNE plots of the healthy and unhealthy classes (Task 1): the first is for the image space and the second is for the feature space. It is observed that the affected class (pre- and post-cataract) is well distinguishable from the healthy class. Fig. 13 shows sample results of the proposed method. In the experiments for Task 2, we have observed that some of the pre-cataract and post-cataract samples are misclassified as each other (as shown in Table 5).
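Precision, recall, and F1 score can be read off a confusion matrix of raw counts. The sketch below assumes rows are true classes and columns are predicted classes; the counts are made up for illustration and are not the paper's data, and `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(cm):
    """Return overall accuracy and a list of (precision, recall, f1) per class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    metrics = []
    for k in range(n):
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = cm[k][k] / predicted_k if predicted_k else 0.0
        recall = cm[k][k] / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return correct / total, metrics


# Hypothetical three-class counts (illustrative only).
counts = [
    [18, 2, 0],
    [1, 19, 0],
    [0, 0, 20],
]
accuracy, metrics = per_class_metrics(counts)
```

A class whose row and column contain no off-diagonal counts, like the third class here, has precision and recall of 1, which mirrors the behaviour described for the healthy class.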

Task 1
             Healthy   Unhealthy
Healthy      1.0       0.0
Unhealthy    0.0       1.0

Task 2
             0.96      0.4       0.0
             0.6       0.94      0.0
Others       0.0       0.0       1.0
Table 5: Illustrates the confusion matrix for two tasks
VGG16 IV3 DN121 RN50
Table 6: Shows the F1 scores obtained by varying the pre-trained models on the two tasks T1 and T2.
Figure 12: Illustrating the tSNE plot for Task 1: the left plot shows the samples in the original image space and the right plot shows the samples in the feature space.
Figure 13: Shows some correctly classified and misclassified samples from the dataset (best viewed in color).

We next analyze the effect of base model, learning rate, and number of epochs:

Effect of changing the base model: For cataract classification, the performance of different deep learning models, viz. InceptionV3 (IV3) (szegedy2016rethinking), VGG16 (simonyan2014very), ResNet50 (RN50), and DenseNet121 (DN121), is compared. As reported in Table 6, ResNet50 outperforms all other architectures on both tasks and is an effective choice as a base model.

Changing the learning rate: In this experiment, the learning rate is varied from 0.01 to 0.000001. We observe that the learning rate of 0.00001 outperforms the others yielding test accuracy for Task 1 and accuracy for Task 2.

Changing the number of epochs: We have also evaluated the performance by varying the number of epochs. We find that 100 epochs with a learning rate of 0.00001 yield the best results for this classification problem. If we increase the number of epochs by 20, the results remain the same; beyond that, the model starts overfitting.
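The learning-rate and epoch study above amounts to a small grid search. A minimal sketch follows, assuming the learning rate is varied in powers of ten between 0.01 and 0.000001; the epoch options and the `train_and_score` callback are hypothetical placeholders, since the exact training routine and the full list of epoch settings are not given here.

```python
def grid_search(train_and_score, learning_rates, epoch_options):
    """Evaluate every (learning rate, epochs) pair and return the best one.

    train_and_score(lr, epochs) is supplied by the caller and must return a
    validation score where higher is better.
    """
    best_setting, best_score = None, float("-inf")
    for lr in learning_rates:
        for epochs in epoch_options:
            score = train_and_score(lr, epochs)
            if score > best_score:
                best_setting, best_score = (lr, epochs), score
    return best_setting, best_score


# Powers of ten from 1e-2 down to 1e-6 (assumed spacing of the sweep).
learning_rates = [10.0 ** -k for k in range(2, 7)]
# Hypothetical epoch settings; only 100 and 120 are implied by the text.
epoch_options = [50, 100, 120]
```

Selecting on a held-out validation score rather than the test score keeps the reported test accuracy unbiased; that choice is a suggestion of this sketch, not a detail confirmed by the text.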

6 Conclusion

Cataract is a primary cause of visual impairment worldwide, and cataract surgery is the most common elective surgical intervention. Typically, the prognosis, regular monitoring, and the decision of whether a patient should be taken up for surgery depend mostly on the discretion of the ophthalmologist. In resource-constrained settings with limited experts, it is very important to have a clinical decision-support technique to improve the sensitivity and specificity of cataract detection and monitoring. This paper presents a deep learning pipeline for cataract detection. To the best of our knowledge, this is the first work that proposes to use near-infrared eye images, popularly used in iris biometrics, for cataract detection. A deep learning-based architecture, PyramidNet, is proposed for segmenting iris and pupil boundaries, where the model fuses the coarse and fine information extracted from convolution blocks at different levels in a pyramid-like fashion. The segmented iris and pupil regions are then used for cataract classification via a multi-task network. Experiments performed on the cataract dataset show that (i) effective cataract detection is possible in the NIR domain, (ii) the proposed segmentation algorithm is effective in detecting iris and pupil boundaries even in challenging scenarios, and (iii) the overall cataract detection performance encourages the use of such an approach in automated decision support systems. It is our assertion that the findings of this research and the availability of our datasets will spur further research in this domain.