Leveraging Undiagnosed Data for Glaucoma Classification with Teacher-Student Learning

by Junde Wu, et al.

Recently, deep learning has been applied to the glaucoma classification task with performance comparable to that of human experts. However, a well-trained deep learning model demands a large quantity of properly labeled data, which is expensive to obtain because accurate labeling of glaucoma requires years of specialist training. To alleviate this problem, we propose a glaucoma classification framework that takes advantage of not only properly labeled images but also undiagnosed images without glaucoma labels. Specifically, the proposed framework is adapted from the teacher-student learning paradigm. The teacher model encodes the wrapped information of undiagnosed images into a latent feature space, while the student model learns from the teacher through knowledge transfer to improve glaucoma classification. For model training, we propose a novel strategy that simulates real-world teaching practice, named 'Learning To Teach with Knowledge Transfer (L2T-KT)', and establish a 'Quiz Pool' as the teacher's optimization target. Experiments show that the proposed framework is able to utilize the undiagnosed data effectively to improve the glaucoma prediction performance.


1 Introduction

Glaucoma is the leading cause of irreversible vision loss and primarily damages the optic cup/disc and surrounding optic nerve [17]. Recently, deep learning methods have advanced rapidly and have been widely adopted for automatic glaucoma classification from fundus images [5, 11, 13, 12]. However, training deep learning models generally requires a large amount of properly labeled data, which is not easy to obtain because accurate grading of glaucoma, especially at the early stage, requires years of specialist expertise.

Beyond the publicly available glaucoma classification datasets, there are several high-quality public datasets for cup/disc segmentation that lack image-level glaucoma labels [2, 14, 4]. The Cup-to-Disc Ratio (CDR), which can be easily computed from the cup/disc masks, is one of the most important clinical parameters for the diagnosis of glaucoma. Generally, patients with a CDR value higher than 0.6 are considered glaucoma suspects, and a higher CDR value indicates a higher probability of having glaucoma [9]. This inspires us to take advantage of the images with only cup/disc segmentation masks to improve the glaucoma classification performance.
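As a rough illustration of how a CDR value might be derived from cup/disc masks, the sketch below measures the vertical extent of each binary mask and takes their ratio. Using the vertical diameter is an assumption made here for simplicity, and the function names and the toy 12x12 masks are hypothetical.

```python
def vertical_extent(mask):
    """Number of rows spanned by the foreground of a binary mask (list of rows)."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    return 0 if not rows else rows[-1] - rows[0] + 1


def cup_to_disc_ratio(cup_mask, disc_mask):
    """Vertical cup-to-disc ratio (assumed definition: cup height / disc height)."""
    disc_height = vertical_extent(disc_mask)
    if disc_height == 0:
        raise ValueError("disc mask is empty")
    return vertical_extent(cup_mask) / disc_height


# Toy masks (hypothetical): the disc spans rows 1..10, the cup spans rows 3..9.
disc = [[1 if 1 <= r <= 10 and 1 <= c <= 10 else 0 for c in range(12)] for r in range(12)]
cup = [[1 if 3 <= r <= 9 and 3 <= c <= 8 else 0 for c in range(12)] for r in range(12)]

cdr = cup_to_disc_ratio(cup, disc)   # 7 rows / 10 rows
is_suspect = cdr > 0.6               # threshold quoted in the text above
```

In this toy case the cup covers 7 of the disc's 10 rows, so the ratio exceeds the 0.6 threshold mentioned above.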

In order to properly utilize the undiagnosed images, we propose to transfer the knowledge of pixel-wise cup/disc labels to the learner model via a teacher-student learning paradigm, which has been a popular and effective way to incorporate prior information. However, most existing teacher-student learning methods learn a compact student from a stronger but more complex teacher for the purpose of knowledge distillation [3, 18, 15]. Some other methods [7, 20] learn a teacher to improve the student’s training speed or performance, but they assume that the ground-truth labels of all training samples are available, which does not hold in our scenario. To the best of our knowledge, there is still a research gap in how to use a teacher model that learns from undiagnosed data to improve the student model’s performance on glaucoma classification.

In this paper, we aim to address this research gap by proposing a novel training strategy, named “Learning To Teach with Knowledge Transfer (L2T-KT)”, with a reserved quiz pool, imitating real-world teaching practice. In L2T-KT, the teacher learns to encode the undiagnosed fundus images and the corresponding cup/disc masks into a latent feature space with the ultimate goal of improving the student’s performance on the quiz pool. Meanwhile, the student is updated under the supervision of the teacher through knowledge transfer. This paper makes three major contributions. Firstly, we adapt the teacher-student learning paradigm to the glaucoma screening task and verify the feasibility of utilizing undiagnosed images to improve the glaucoma classification performance. Secondly, we propose a novel training strategy, L2T-KT with a quiz pool, to update the teacher model with undiagnosed images, which enables the teacher to extract potentially important features from the undiagnosed images and further improve the performance of the student model via knowledge transfer. Finally, the proposed method can be easily extended to learn from totally unlabeled images or to transductive learning to improve the model performance.

2 Methodology

Consider all the collected fundus images as a dataset that can be divided into a primary training set with glaucoma classification labels and an auxiliary training set with cup/disc masks but without glaucoma labels. Each sample in the primary training set pairs a fundus image with its glaucoma label, while each sample in the auxiliary training set pairs a fundus image with its optic cup/disc mask. The primary training set is further divided into a textbook pool (for training the student model) and a quiz pool (for updating the teacher model). Given these datasets, the goal of this research is to construct a framework that learns a mapping function from fundus images to glaucoma predictions using both the primary and the auxiliary dataset, and that is thus expected to outperform a mapping function learned from the primary training set alone. Since the teacher-student learning paradigm has been widely used for knowledge distillation and has proved effective for extracting latent information, we construct a deep learning model based on teacher-student learning for the glaucoma screening task.

As shown in Fig. 1, the overall framework contains two networks: the teacher model and the student model. The teacher model is a convolutional neural network that encodes the fundus images together with the corresponding cup/disc masks into a latent feature space. The student model shares the same feature extraction backbone as the teacher. In this paper, the state-of-the-art classification network EfficientNet (B4) is adopted as the feature extraction backbone [16]. Unlike the teacher model, the student model contains a fully connected (fc) layer to make glaucoma predictions.

Figure 1: Framework and data flow of the proposed L2T-KT framework. Stage (1), train the student model with knowledge transfer loss; Stage (2), update the student model with textbook pool data using binary cross entropy loss; Stage (3), train the teacher model with quiz pool data using binary cross entropy loss.

The proposed framework is optimized via an iterative three-stage training strategy. In the first stage, the student model is supervised by the teacher model using data from the auxiliary dataset with the knowledge transfer loss, as marked by the cyan data flow in Fig. 1. In the second stage, the student model is further optimized with the ground-truth glaucoma labels from the textbook pool using the binary cross entropy (BCE) loss, as marked by the green data flow. In the last stage, as marked by the red data flow, the teacher model is updated with the proposed ‘learning to teach’ strategy using the quiz pool. The detailed updating strategy of the framework is given in Algorithm 1. Note that the input to the student model is only the fundus image, whereas the input to the teacher model contains both the image and its corresponding cup/disc mask from the auxiliary dataset.

Given networks: teacher model, student model;
Datasets: primary training dataset, auxiliary training dataset;
Initialize textbook pool and quiz pool by randomly splitting the primary training dataset;
Initialize the teacher model randomly and initialize the student model with pretrained baseline parameters;
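while Training do
       Sample images and cup/disc masks from the auxiliary dataset;
       Send the images with masks to the teacher model to get the teacher feature maps;
       Send the images to the student model to get the student feature maps;
       Update the student model with knowledge transfer loss by Eqn. 3;
       Sample images and glaucoma labels from the textbook pool;
       Update the student model with BCE loss by Eqn. 4;
       Sample images and glaucoma labels from the quiz pool;
       Update the teacher model through L2T-KT by Eqn. 5 and Eqn. 6;
       Update the quiz pool by Eqn. 1 and Eqn. 2;
end while
Algorithm 1 Overall learning process of the proposed model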

2.1 Quiz Pool

In real-world teaching practice, students have access to the textbook content but not to the answers of the quiz problems, which the teacher uses to evaluate the students’ performance and to update the teaching strategy based on the evaluation scores. Inspired by this scenario, we propose to split the primary dataset into two subsets, the textbook pool and the quiz pool. The student model learns with the ground-truth glaucoma labels from the textbook pool; meanwhile, the student’s performance is evaluated on the quiz pool. The evaluation score is used to update the teacher model.

In this paper, the textbook pool and the quiz pool are split with two different approaches. The first is the static quiz pool, where the quiz pool is randomly selected from the primary training set and kept fixed throughout training. In this paper, 20% of the primary training samples are randomly selected as the quiz pool.
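A static split of this kind could be sketched as follows. The function name, the fixed seed, and the integer stand-ins for samples are illustrative assumptions; only the 20% quiz fraction comes from the text.

```python
import random


def split_textbook_quiz(samples, quiz_fraction=0.2, seed=0):
    """Randomly split labeled samples into a textbook pool and a fixed quiz pool."""
    rng = random.Random(seed)            # fixed seed so the quiz pool stays the same
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_quiz = round(len(samples) * quiz_fraction)
    quiz = [samples[i] for i in indices[:n_quiz]]
    textbook = [samples[i] for i in indices[n_quiz:]]
    return textbook, quiz


# Hypothetical labeled set: integers stand in for (image, glaucoma label) pairs.
labeled = list(range(100))
textbook_pool, quiz_pool = split_textbook_quiz(labeled)
```

The two pools are disjoint by construction, so the student never trains on quiz labels.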

In the second method, we propose to update the quiz pool dynamically during training, i.e., the dynamic quiz pool. In practice, teachers often reserve important or difficult content for quiz problems. Following this idea, a dynamic quiz pool is established that focuses on difficult cases and positive cases, since missing a positive case carries a higher risk in glaucoma screening. The pool is updated according to each sample’s difficulty, which is obtained by comparing the student’s prediction with the ground-truth glaucoma label (Eqn. 1).

The probability of a sample being selected into the dynamic quiz pool (Eqn. 2) is then computed from this difficulty. It involves a weight that denotes the relative importance of negative samples compared to that of positive samples, and a factor that encourages the pool to focus on the difficult samples; in this paper, these are empirically set to 0.7 and 2, respectively. The last term is a shifted sigmoid function that controls the average difficulty of the quiz pool within a reasonable range. It encourages the quiz pool to retain the test content if it is challenging to the student, while changing the pool if it is too easy. In this work, the two parameters of this term are set to 16 and 0.5, respectively. In addition, when a new sample is added to the quiz pool, the easiest sample is dropped so that the size of the quiz pool stays constant. In the implementation, the quiz pool is updated every epoch.
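The constant-size replacement rule can be illustrated with a small sketch. The difficulty measure used here, the absolute gap between label and prediction, is an assumption standing in for Eqn. 1, the selection probability of Eqn. 2 is not modeled, and the tuple layout and names are hypothetical.

```python
def difficulty(label, prediction):
    """Assumed difficulty: absolute gap between the label and the student's prediction."""
    return abs(label - prediction)


def add_to_quiz_pool(pool, candidate):
    """Add a candidate and drop the easiest existing member, keeping the pool size constant.

    Each entry is a (sample_id, label, student_prediction) tuple.
    """
    easiest = min(range(len(pool)), key=lambda i: difficulty(pool[i][1], pool[i][2]))
    return pool[:easiest] + pool[easiest + 1:] + [candidate]


# Toy pool: sample "a" is predicted almost correctly, so it is the easiest.
pool = [("a", 1, 0.9), ("b", 0, 0.4), ("c", 1, 0.3)]
new_pool = add_to_quiz_pool(pool, ("d", 1, 0.2))
pool_ids = [entry[0] for entry in new_pool]
```

Here the easy sample "a" leaves the pool and the hard positive "d" enters it, which matches the stated intent of keeping challenging content in the quiz.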

2.2 Student Update Through Knowledge Transfer

The student model is trained on both the textbook pool of the primary dataset, which has glaucoma labels, and the auxiliary dataset, which does not.

In the first stage, the student model is trained on the auxiliary set under the supervision of the teacher model through knowledge transfer. More specifically, the color fundus images are first concatenated with the corresponding cup/disc masks and fed to the teacher model, which encodes the input into the teacher’s latent feature maps. Meanwhile, the same color fundus images (without masks) are sent to the student model to obtain the student’s latent feature maps. The knowledge transfer (KT) loss between the teacher’s and the student’s feature maps (Eqn. 3) is then computed by learning domain-invariant latent representations with Centered Kernel Alignment (CKA) [10], which is expressed in terms of Frobenius norms.
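For reference, the linear form of CKA from [10] can be sketched in plain Python as below. Treating each feature map as one flattened row per sample, and using one minus CKA as the loss, are assumptions made for illustration; the exact loss form used in this work is not recoverable from the text here.

```python
import math


def center_columns(m):
    """Subtract the column mean from every entry of a row-major matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(u * v for u, v in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    x, y = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_frobenius_sq(x, x)) * math.sqrt(cross_frobenius_sq(y, y))
    return cross_frobenius_sq(y, x) / denom


def kt_loss(teacher_feats, student_feats):
    """Assumed loss form: small when the two representations are well aligned."""
    return 1.0 - linear_cka(teacher_feats, student_feats)


# Toy features: four samples with two feature dimensions each.
teacher = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
student_scaled = [[3.0 * v for v in row] for row in teacher]
student_other = [[1.0], [0.0], [0.0], [1.0]]
```

Linear CKA is invariant to isotropic scaling and orthogonal transforms of the features, so `student_scaled` is perfectly aligned with `teacher`, while `student_other` scores strictly lower. Constant features would make the denominator zero and are not handled in this sketch.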

In the second stage, the student model is further trained with data from the textbook pool, which contains fundus images and the ground-truth glaucoma labels. At this stage, the student model is directly supervised by the ground-truth glaucoma labels with the binary cross entropy (BCE) loss:

$$\mathcal{L}_{BCE} = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right] \qquad (4)$$

where $y$ is the ground-truth glaucoma label and $\hat{y}$ is the student’s prediction.
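A minimal per-sample version of the BCE loss might look as follows. The clipping constant `eps` is an added numerical safeguard and is not part of the original formulation.

```python
import math


def bce_loss(label, prediction, eps=1e-7):
    """Binary cross entropy for one sample; prediction is a probability in [0, 1]."""
    p = min(max(prediction, eps), 1.0 - eps)   # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


loss_uncertain = bce_loss(1, 0.5)    # equals log(2)
loss_confident = bce_loss(1, 0.9)    # smaller: confident and correct
```

A confident correct prediction is penalized less than an uncertain one, which is what drives the student toward the textbook labels.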

2.3 Teacher Update Through L2T-KT

Following the real-world teaching practice where teachers often update their teaching strategies based on students’ feedback, we propose to update the teacher model parameters based on the student’s performance on the constructed quiz pool. Formally, consider a teacher network $T$ with parameters $\theta_T$ and a student network $S$ with parameters $\theta_S$. The response of $T$ to a concatenated fundus image and mask is denoted $f_T$, and the response of $S$ to the raw fundus image is denoted $f_S$. The first step of the L2T-KT training strategy is to compute the update of the student parameters with the knowledge transfer loss $\mathcal{L}_{KT}$ between $f_T$ and $f_S$, which can be expressed in gradient descent form as:

$$\theta_S' = \theta_S - \eta_S \frac{\partial \mathcal{L}_{KT}(f_T, f_S)}{\partial \theta_S} \qquad (5)$$

where $\eta_S$ denotes the learning rate of the student model. After the knowledge transfer and student parameter update, we denote the refreshed student as $S'$.

The teacher’s goal is to learn to teach the student to achieve better performance on the quiz. In other words, the optimization target of the teacher is for the refreshed student $S'$ to perform better than $S$ on the same quiz pool. Let $\hat{y}_q$ denote the prediction of $S'$ on a random quiz sample $x_q$ and $y_q$ denote the ground-truth label of $x_q$. The teacher can then be optimized by minimizing the BCE loss between $\hat{y}_q$ and $y_q$ with gradient descent. This is feasible because, as shown in Eqn. 5, the updated student parameters $\theta_S'$ depend on the teacher parameters $\theta_T$ through $f_T$. Therefore, the partial derivative of $\mathcal{L}_{BCE}$ w.r.t. $\theta_T$ can be computed, and $\theta_T$ can be updated via:

$$\theta_T \leftarrow \theta_T - \eta_T \frac{\partial \mathcal{L}_{BCE}(\hat{y}_q, y_q)}{\partial \theta_T} \qquad (6)$$

where $\eta_T$ is the learning rate of the teacher model. We compute the partial derivative of $\mathcal{L}_{BCE}$ w.r.t. the teacher parameters $\theta_T$, rather than w.r.t. the student’s own parameters as is commonly done. This is because we aim to make the teacher learn how to teach a better student, not to make the student learn by itself. Note that the refreshed student $S'$ is used only temporarily in L2T-KT and does not change the parameters of the original student.
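The dependence of the quiz loss on the teacher parameters can be made concrete with a scalar toy model. Everything below is an illustrative assumption: the teacher and student are single scalars, a squared-error feature mismatch stands in for the CKA-based knowledge transfer loss, and all numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def student_step(theta_s, theta_t, x, m, lr_s):
    """One KT step: the student feature theta_s*x is pulled toward the teacher feature theta_t*m."""
    grad_s = (theta_s * x - theta_t * m) * x      # d/dtheta_s of 0.5*(theta_s*x - theta_t*m)^2
    return theta_s - lr_s * grad_s


def quiz_loss(theta_s, x_q, y_q):
    """BCE of a student with parameter theta_s on one quiz sample."""
    p = sigmoid(theta_s * x_q)
    return -(y_q * math.log(p) + (1 - y_q) * math.log(1.0 - p))


def teacher_grad(theta_t, theta_s, x, m, x_q, y_q, lr_s):
    """Gradient of the refreshed student's quiz loss with respect to the teacher parameter."""
    refreshed = student_step(theta_s, theta_t, x, m, lr_s)   # temporary student only
    p = sigmoid(refreshed * x_q)
    # Chain rule: dL/dz = p - y, dz/dtheta_s' = x_q, dtheta_s'/dtheta_t = lr_s * m * x
    return (p - y_q) * x_q * (lr_s * m * x)


# Toy values (all hypothetical).
theta_s, theta_t = 0.1, 0.5
x, m = 1.0, 2.0          # auxiliary image feature and mask-derived feature
x_q, y_q = 1.5, 1        # one quiz sample with a positive label
lr_s, lr_t = 0.1, 0.5

loss_before = quiz_loss(student_step(theta_s, theta_t, x, m, lr_s), x_q, y_q)
new_theta_t = theta_t - lr_t * teacher_grad(theta_t, theta_s, x, m, x_q, y_q, lr_s)
loss_after = quiz_loss(student_step(theta_s, new_theta_t, x, m, lr_s), x_q, y_q)
```

After the teacher update, a student taught by the new teacher reaches a lower quiz loss than one taught by the old teacher, while `theta_s` itself is never modified. With real networks the same chain rule would be handled by automatic differentiation through the student update.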

3 Experiments

3.0.1 Datasets

The data used in this work originates from two sources: the primary dataset with glaucoma labels, collected from Beijing Tongren Hospital with approval from the institutional review board, and the auxiliary dataset with cup/disc segmentation masks, taken from the publicly available RIGA dataset [2]. The primary dataset contains a total of 3,830 fundus images graded by certified glaucoma specialists, including 1,586 glaucoma and 2,244 non-glaucoma images. We randomly selected 60% of the images as the training set, 15% as the validation set, and the remaining 25% as the test set to evaluate the model performance. The RIGA dataset contains 650 fundus images with pixel-wise cup/disc masks labeled by experts, but image-level glaucoma labels are not provided [2].

3.1 Ablation Studies

Ablation studies have been conducted to evaluate the effectiveness of the proposed framework under different setups of the quiz pool, including the static quiz pool and the dynamic quiz pool. The comparison baseline uses the same backbone as the teacher/student model, i.e., EfficientNet-B4, and is trained with the glaucoma/non-glaucoma labels of the primary training set. Four metrics are adopted to evaluate the model performance: accuracy (Acc), sensitivity (Sen), specificity (Spec), and area under the receiver operating characteristic curve (AUC).
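The four metrics can be computed from labels and predicted scores roughly as sketched below. The 0.5 decision threshold and the toy labels and scores are assumptions; AUC is computed as the fraction of positive/negative pairs ranked correctly, with ties counted as half.

```python
def screening_metrics(labels, scores, threshold=0.5):
    """Accuracy, sensitivity, specificity and AUC for binary labels and probability scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # AUC: probability that a random positive is scored above a random negative.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return {
        "acc": (tp + tn) / len(labels),
        "sen": tp / (tp + fn),
        "spec": tn / (tn + fp),
        "auc": wins / (len(pos) * len(neg)),
    }


# Toy data (hypothetical): 4 glaucoma and 4 non-glaucoma cases.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
metrics = screening_metrics(labels, scores)
```

Sensitivity counts only the glaucoma cases, which is why a pool that favors positives can raise it even when specificity drops.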

Table 1 shows the quantitative comparison of the baseline method and the proposed framework trained on the undiagnosed auxiliary dataset under different settings of the quiz pool. Compared with the baseline method using purely labeled data, training on both labeled and undiagnosed data with the proposed L2T-KT framework and a static quiz pool increases the AUC by 2.39 and the accuracy by 2.33 percentage points. In addition, changing the static quiz pool to a dynamic quiz pool further improves the model performance, with a marked gain in sensitivity, since the dynamic quiz pool favors the positive cases. Clinically, for the glaucoma screening task, a higher sensitivity is much more important than specificity, so as not to miss potential glaucoma patients.

Method     Quiz pool   Acc     Sen     Spec    AUC
Baseline   none        90.69   90.10   93.78   95.77
Proposed   static      93.02   90.70   94.53   98.16
Proposed   dynamic     93.29   96.03   91.42   98.29

Baseline: baseline method using purely labeled data.
Proposed (static): labeled data + undiagnosed data + static quiz pool.
Proposed (dynamic): labeled data + undiagnosed data + dynamic quiz pool.
Table 1: Performance comparison (%) of leveraging undiagnosed data under different settings of the quiz pool.

3.2 Auxiliary Data Setting

We have also evaluated the model performance under different settings of the auxiliary dataset. The evaluation is conducted on both the private dataset and a publicly available glaucoma classification dataset, LAG [12], which contains 1,711 glaucoma images and 3,143 non-glaucoma images. Apart from the undiagnosed auxiliary set with ground-truth cup/disc masks (RIGA), we have also trained on a totally unlabeled auxiliary set by producing pseudo masks of the RIGA images with a state-of-the-art cup/disc segmentation algorithm [19]. As Table 2 shows, with pseudo cup/disc masks the model performance degrades slightly compared with the standard undiagnosed auxiliary set using ground-truth cup/disc masks, with an AUC drop of 0.17 and 0.45 for the LAG and private sets, respectively. However, compared with the baseline model in Table 1, the proposed framework using pseudo labels still surpasses the baseline by a clear margin, with an AUC improvement of 2.07 on the private set.

The performance of the proposed method can be further improved when it is allowed access to the raw images of the test dataset, i.e., in the transductive learning scenario. When the raw fundus images of the test set and their pseudo cup/disc masks are taken as the auxiliary data (denoted as ‘Transductive’ in Table 2), the proposed method achieves the best performance, with AUC scores of 99.51 and 98.41 for the LAG and private sets, respectively. This indicates the extensibility and effectiveness of the proposed L2T-KT framework.

3.3 Comparing with State-of-the-Art

                               LAG                            Private
Method                         Acc    Sen    Spec   AUC      Acc    Sen    Spec   AUC
Li et al. [12] (supervised)    95.3   95.4   95.2   97.5     -      -      -      -
Fu et al. [8] (supervised)     93.88  96.79  92.29  98.27    91.94  91.38  92.30  96.70
Pinto et al. [6] (semi)        92.75  92.30  93.16  97.11    91.85  92.51  89.79  96.12
Pinto et al. [6] (trans)       93.77  93.26  94.68  97.95    92.03  92.75  92.60  97.45
Ghamdi et al. [1] (semi)       94.11  97.43  92.29  98.16    91.76  93.19  90.83  96.88
Ghamdi et al. [1] (trans)      95.01  97.75  93.52  98.73    92.57  94.10  91.57  97.39
Auxiliary-GT (proposed)        95.81  98.40  94.22  99.49    93.29  96.03  91.42  98.29
Auxiliary-Psd (proposed)       95.47  98.72  93.70  99.32    92.84  95.01  91.42  97.84
Transductive (proposed)        96.04  98.72  94.75  99.51    93.64  96.37  91.82  98.41

Auxiliary-GT: RIGA dataset with ground-truth cup/disc masks is used as the undiagnosed auxiliary set.
Auxiliary-Psd: RIGA dataset with pseudo cup/disc masks is used as the totally unlabeled auxiliary set.
Transductive: test set with pseudo cup/disc masks is used as the totally unlabeled auxiliary set.
Table 2: Performance comparison (%) with other methods.

The proposed framework has been compared with other state-of-the-art methods on the glaucoma classification task, including two fully supervised methods [8, 12] and two semi-supervised methods [1, 6]. The semi-supervised methods take RIGA or the test set as the unlabeled dataset, denoted as ‘semi’ and ‘trans’, respectively. As listed in Table 2, the proposed method achieves the best AUC, accuracy, and sensitivity on both the private and LAG datasets, indicating its effectiveness in exploiting extra undiagnosed data. In addition, unlike existing methods that design complex architectures or use model ensembling, the proposed method adopts a plug-and-play training strategy without any change to the backbone network. At the prediction stage, only the student network is used, ensuring fast inference.

4 Conclusion

Many publicly available datasets contain images with cup/disc masks but without image-level glaucoma labels. In order to fully exploit those undiagnosed images for the glaucoma screening task, we proposed a novel training strategy, Learning to Teach with Knowledge Transfer (L2T-KT), which enables the model to learn from those undiagnosed images through the teacher-student paradigm. Detailed experiments showed that the proposed method not only improves the glaucoma screening performance by learning from data with cup/disc masks, but can also be easily extended to learn from completely unlabeled data with pseudo labels and to improve the test set performance via transductive learning. Future work will continue to explore the potential of the proposed method and optimize the time efficiency of the training stage.

4.0.1 Acknowledgment

This work was funded by the Key Area Research and Development Program of Guangdong Province, China (No. 2018B010111001), National Key Research and Development Project (No. 2018YFC2000702) and Science and Technology Program of Shenzhen, China (No. ZDSYS201802021814180).


References

  • [1] M. Al Ghamdi, M. Li, M. Abdel-Mottaleb, and M. A. Shousha (2019) Semi-supervised transfer learning for convolutional neural networks for glaucoma detection. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3812–3816.
  • [2] A. Almazroa, S. Alodhayb, E. Osman, E. Ramadan, M. Hummadi, M. Dlaim, M. Alkatee, K. Raahemifar, and V. Lakshminarayanan (2018) Retinal fundus images for glaucoma analysis: the RIGA dataset. In SPIE Conference on Medical Imaging.
  • [3] J. Ba and R. Caruana (2014) Do deep nets really need to be deep?. In Advances in Neural Information Processing Systems, pp. 2654–2662.
  • [4] E. J. Carmona, M. Rincón, J. García-Feijoó, and J. M. Martínez-de-la-Casa (2008) Identification of the optic nerve head with genetic algorithms. Artificial Intelligence in Medicine 43 (3), pp. 243–259.
  • [5] X. Chen, Y. Xu, D. W. K. Wong, T. Y. Wong, and J. Liu (2015) Glaucoma detection based on deep convolutional neural network. In 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 715–718.
  • [6] A. Diaz-Pinto, A. Colomer, V. Naranjo, S. Morales, Y. Xu, and A. F. Frangi (2019) Retinal image synthesis and semi-supervised learning for glaucoma assessment. IEEE Transactions on Medical Imaging 38 (9), pp. 2211–2218.
  • [7] Y. Fan, F. Tian, T. Qin, X. Li, and T. Liu (2018) Learning to teach. In 6th International Conference on Learning Representations.
  • [8] H. Fu, J. Cheng, Y. Xu, C. Zhang, D. W. K. Wong, J. Liu, and X. Cao (2018) Disc-aware ensemble network for glaucoma screening from fundus image. IEEE Transactions on Medical Imaging 37 (11), pp. 2493–2501.
  • [9] D. F. Garway-Heath, S. T. Ruben, A. Viswanathan, and R. A. Hitchings (1998) Vertical cup/disc ratio in relation to optic disc size: its value in the assessment of the glaucoma suspect. British Journal of Ophthalmology 82 (10), pp. 1118–1124.
  • [10] S. Kornblith, M. Norouzi, H. Lee, and G. E. Hinton (2019) Similarity of neural network representations revisited. In 36th International Conference on Machine Learning.
  • [11] A. Li, J. Cheng, D. W. K. Wong, and J. Liu (2016) Integrating holistic and local deep features for glaucoma classification. In 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 1328–1331.
  • [12] L. Li, M. Xu, X. Wang, L. Jiang, and H. Liu (2019) Attention based glaucoma detection: a large-scale database and CNN model. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 10571–10580.
  • [13] Z. Li, Y. He, S. Keel, W. Meng, R. T. Chang, and M. He (2018) Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology 125 (8), pp. 1199–1206.
  • [14] J. Lowell, A. Hunter, D. Steel, A. Basu, R. Ryder, E. Fletcher, and L. Kennedy (2004) Optic nerve head segmentation. IEEE Transactions on Medical Imaging 23 (2), pp. 256–264.
  • [15] A. A. Rusu, S. G. Colmenarejo, Ç. Gülçehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell (2016) Policy distillation. In 4th International Conference on Learning Representations.
  • [16] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In 36th International Conference on Machine Learning.
  • [17] Y. Tham, X. Li, T. Y. Wong, H. A. Quigley, T. Aung, and C. Cheng (2014) Global prevalence of glaucoma and projections of glaucoma burden through 2040: a systematic review and meta-analysis. Ophthalmology 121 (11), pp. 2081–2090.
  • [18] G. Urban, K. J. Geras, S. E. Kahou, Ö. Aslan, S. Wang, A. Mohamed, M. Philipose, M. Richardson, and R. Caruana (2017) Do deep convolutional nets really need to be deep and convolutional?. In 5th International Conference on Learning Representations.
  • [19] S. Wang, L. Yu, X. Yang, C. Fu, and P. Heng (2019) Patch-based output space adversarial learning for joint optic disc and cup segmentation. IEEE Transactions on Medical Imaging 38 (11), pp. 2485–2495.
  • [20] L. Wu, F. Tian, Y. Xia, Y. Fan, T. Qin, J. Lai, and T. Liu (2018) Learning to teach with dynamic loss functions. In Advances in Neural Information Processing Systems, pp. 6466–6477.