Deep convolutional neural networks (CNNs) are considered the state of the art for image segmentation problems LeCun et al. (1998); Guo et al. (2018); LeCun et al. (2015). One limitation of deep CNNs is that they require large volumes of training data in order to generalize well Chen and Lin (2014). Unfortunately, acquiring manual segmentation labels is a time-consuming process that requires domain expertise to produce an acceptable ground truth. This is most notable in the domain of medical imaging, where accurate delineation of structures requires many years of experience on the part of the annotator. In contrast, spatial localization of the same structure of interest is often relatively simple, exhibits lower variance, and takes significantly less time to perform.
In this work, we explore the potential of CNNs to transfer features learned from easily-labeled data to tasks that would normally require larger quantities of expertly-labeled data to achieve the same performance. To demonstrate our approach, we consider the task of segmenting the optic disc from retinal fundus photographs. The optic disc is readily discernible from fundus photographs and can exhibit differing coloration and morphology as a consequence of both normal variation and pathology such as glaucoma Joshi et al. (2011). Segmentation of the optic disc allows such characteristics to be readily quantified from conventional fundus images.
2 Materials and Methods
Our method consists of a two-phase training scheme for the widely used U-net architecture Ronneberger et al. (2015), as shown in Figure 1. In the first phase of training, the encoding arm of the U-net is trained independently as an optic disc localization network. After convergence, the network's weights are frozen and the decoding arm is trained as an optic disc segmentation network, re-using the features learned during localization training. We posit that the semantic information learned during training of the encoding arm is transferable to the segmentation task, capturing characteristics such as the shape, texture, and boundary of the disc.
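The two-phase scheme above can be sketched in PyTorch (a framework choice of ours; the paper does not specify one, and the module bodies are stand-ins, not the actual U-net layers):

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Stand-in for the U-net encoding arm (layers are illustrative only)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)


class Decoder(nn.Module):
    """Stand-in for the U-net decoding arm."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(16, 1, kernel_size=1)

    def forward(self, x):
        return self.conv(x)


encoder, decoder = Encoder(), Decoder()

# Phase 1: train `encoder` (plus a regression head) on the localization
# task until convergence. [training loop omitted]

# Phase 2: freeze the encoder weights, then train only the decoder on
# the segmentation task, re-using the localization features.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(decoder.parameters())
```

Only the decoder's parameters are passed to the optimizer, so the frozen encoder acts purely as a fixed feature extractor during phase two.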
2.1 Data and preprocessing
Training, validation, and test data sets were created from a database of more than 10,000 de-identified retinal images obtained using a commercially available camera (RetCam; Natus Medical Incorporated) as part of the [details of institution redacted].
We used 9047 images with optic disc center location annotations for the localization task, and a further 92 images labeled with optic disc binary masks for the segmentation task. Images were pre-processed with grayscale conversion, normalization, contrast limited adaptive histogram equalization and gamma correction Brown et al. (2018).
2.2 Optic disc localization
We formulate the localization task as a regression problem using the encoding arm of the U-net shown in Figure 1. Pre-processed fundus photographs are provided as input to the network, which applies a series of convolutional and pooling operations to produce a volume of image features. These features are passed to a fully connected network without activation (linear) to produce a tuple of (x, y) coordinates. We use convolutions, max-pooling, batch-normalization, and ReLU activations throughout the network. We also utilize dropout to mitigate overfitting. This network is trained on pre-processed images and their annotated centroid coordinates, minimizing a mean squared error (MSE) cost function using the RMSprop optimizer (Tieleman and Hinton, 2012). We evaluate the performance of the localizer network by calculating the Euclidean distance between the ground truth and predicted centroids.
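The localization metric is straightforward; a NumPy sketch of the evaluation:

```python
import numpy as np


def euclidean_error(pred_xy, true_xy):
    """Mean Euclidean distance between predicted and ground-truth centroids.

    pred_xy, true_xy: arrays of shape (N, 2) holding (x, y) coordinates.
    """
    pred_xy = np.asarray(pred_xy, dtype=float)
    true_xy = np.asarray(true_xy, dtype=float)
    return np.linalg.norm(pred_xy - true_xy, axis=1).mean()
```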
2.3 Optic disc segmentation
Following training as a localizer, the fully connected layers are removed and the remaining network weights are frozen. The decoding arm is then added to the network, with skip connections, to produce a conventional U-net architecture. Image features learned by the localizer (encoding) network are concatenated with those learned by the segmentation (decoding) network. The final two layers of the network are a 1x1 convolution followed by a sigmoid activation. The network is trained on the pre-processed images and binary masks representing the optic disc, minimizing a negative log soft Dice loss function using the Adam optimizer Kingma and Ba (2014). We evaluate the performance of the segmentation network by measuring the Dice overlap between the ground truth and predicted optic disc masks at a probability threshold of 0.5.
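The negative log soft Dice loss and the thresholded Dice evaluation can be sketched in NumPy (the smoothing constant `eps` is an assumption of ours, a common choice to avoid division by zero):

```python
import numpy as np


def soft_dice_loss(probs, target, eps=1e-6):
    """Negative log of the soft Dice coefficient.

    probs:  sigmoid outputs in [0, 1], any shape.
    target: binary ground-truth mask of the same shape.
    """
    intersection = (probs * target).sum()
    dice = (2.0 * intersection + eps) / (probs.sum() + target.sum() + eps)
    return -np.log(dice)


def dice_at_threshold(probs, target, thresh=0.5):
    """Dice overlap of the thresholded prediction against the ground truth."""
    pred = (probs >= thresh).astype(float)
    intersection = (pred * target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum())
```

A perfect prediction gives a Dice coefficient of 1 and a loss of 0, so minimizing the negative log Dice drives the soft overlap toward 1.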
3 Results
At the first stage of training (optic disc localization), the MSE was 132.16 for the validation set and 206.15 for the test set. For optic disc segmentation, the results were evaluated using five-fold cross-validation; the mean Dice coefficient was 0.88 with a standard deviation of 0.1.
3.1 Comparison with the conventional U-net
We compared the performance of our proposed method against the conventional U-net approach. We trained the full network with identical layers, hyper-parameters, and training data, but without the pre-trained encoder weights (random initialization). The average Dice coefficient was 0.84, with a standard deviation of 0.2. These results show that learning features on the localization task first improves the performance of optic disc segmentation.
3.2 Performance with fewer training samples
To see whether the features learned by the localization network can alleviate the demand for manual segmentations, we trained the two models described above (with and without the pre-trained encoder) again, but with different numbers of training samples. Figure 2 shows the Dice coefficient of the models with increasing numbers of training samples. As expected, the Dice coefficient increased in both approaches when using a larger number of images. However, our proposed method affords greater robustness, and the differences are most pronounced when the number of samples is very low. Figure 1(b) shows that the results were significantly different at all sample sizes using a paired t-test. The results are consistent with our hypothesis that with the pre-trained encoder, we need fewer segmentation labels to achieve comparable performance.
Significance was assessed with a paired t-test under the null hypothesis that the two models have the same Dice coefficients.
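Such a paired comparison can be carried out with SciPy; the per-fold Dice values below are illustrative, not the paper's:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold Dice scores for the two models (same folds,
# so the samples are paired).
dice_pretrained = np.array([0.89, 0.88, 0.90, 0.87, 0.88])
dice_scratch = np.array([0.84, 0.82, 0.86, 0.83, 0.85])

# Paired t-test: H0 is that both models have the same mean Dice.
t_stat, p_value = stats.ttest_rel(dice_pretrained, dice_scratch)
```

A small p-value rejects the null hypothesis that the two models perform the same on matched folds.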
4 Conclusion
In this paper, we introduced a training scheme for optic disc segmentation based on pre-training the encoding arm of a U-net architecture. The approach reduces the demand for expertly-labeled data to achieve good segmentation performance on held-out test sets. Though only tested on a limited dataset with a relatively simple problem, we propose that this concept is extensible to other imaging modalities and segmentation tasks. We intend to assess the scalability of such an approach on more complex tasks such as brain tumor segmentation, and to compare our approach with other pre-training schemes.
- LeCun et al.  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Guo et al.  Yanming Guo, Yu Liu, Theodoros Georgiou, and Michael S Lew. A review of semantic segmentation using deep neural networks. International Journal of Multimedia Information Retrieval, 7(2):87–93, 2018.
- LeCun et al.  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
- Chen and Lin  Xue-Wen Chen and Xiaotong Lin. Big data deep learning: challenges and perspectives. IEEE Access, 2:514–525, 2014.
- Joshi et al.  Gopal Datt Joshi, Jayanthi Sivaswamy, and SR Krishnadas. Optic disk and cup segmentation from monocular color retinal images for glaucoma assessment. IEEE Transactions on Medical Imaging, 30(6):1192–1205, 2011.
- Ronneberger et al.  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- Brown et al.  James M Brown, J Peter Campbell, Andrew Beers, Ken Chang, Susan Ostmo, RV Paul Chan, Jennifer Dy, Deniz Erdogmus, Stratis Ioannidis, Jayashree Kalpathy-Cramer, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmology, 2018.
- Tieleman and Hinton  T. Tieleman and G. Hinton. Lecture 6.5 - RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.