Deep feature transfer between localization and segmentation tasks

by Szu-Yeu Hu, et al.

In this paper, we propose a new pre-training scheme for U-net based image segmentation. We first train the encoding arm as a localization network to predict the center of the target, before extending it into a U-net architecture for segmentation. We apply our proposed method to the problem of segmenting the optic disc from fundus photographs. Our work shows that the features learned by the encoding arm can be transferred to the segmentation network to reduce the annotation burden. We propose that such an approach could have broad utility for medical image segmentation, and could alleviate the burden of delineating complex structures by pre-training on annotations that are much easier to acquire.




1 Introduction

Deep convolutional neural networks (CNNs) are considered the state-of-the-art for image segmentation problems LeCun et al. (1998); Guo et al. (2018); LeCun et al. (2015). One limitation of deep CNNs is that they require large volumes of training data in order to generalize well Chen and Lin (2014). Unfortunately, acquisition of manual segmentation labels is a time-consuming process that requires domain expertise to produce an acceptable ground truth. This is most notable in the domain of medical images, for which accurate delineation of structures requires many years of experience on the part of the annotator. In contrast, spatial localization of the same structure of interest is often relatively simple, exhibits lower variance, and takes significantly less time to perform.

In this work, we explore the potential of CNNs to learn transferable features from easily-labeled data for tasks that would normally require larger quantities of expertly-labeled data to achieve the same performance. To demonstrate our approach, we consider the task of segmenting the optic disc from retinal fundus photographs. The optic disc is readily discernible from fundus photographs and can have differing coloration and morphology as a consequence of both normal variation and pathology such as glaucoma Joshi et al. (2011). Segmentation of the optic disc allows such characteristics to be readily quantified from conventional fundus images.

2 Materials and Methods

Our method consists of a two-phase training scheme for the widely used U-net Ronneberger et al. (2015) architecture, as shown in Figure 1. In the first phase of training, the encoding arm of the U-net is trained independently as an optic disc localization network. After convergence, the network's weights are frozen and the decoding arm is trained as an optic disc segmentation network, re-using the features learned during localization training. We posit that the semantic information learned during training of the encoding arm, such as the shape, texture, and boundary characteristics of the disc, is transferable to the segmentation task.

Figure 1: Diagram of the U-net architecture and two-phase training scheme for optic disc localization and segmentation.
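As an illustration, the two-phase scheme can be sketched in PyTorch as follows. This is a minimal sketch, not the paper's exact architecture: the layer widths, the 64x64 input size, and the single-block decoder are placeholder assumptions for exposition.

```python
import torch
import torch.nn as nn

# A toy convolutional encoder standing in for the U-net encoding arm
# (illustrative layer sizes; the paper does not specify them here).
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.features(x)

encoder = Encoder()

# Phase 1: attach a linear (no activation) head regressing the disc centre,
# assuming 64x64 inputs, so the feature map is 16 channels of 16x16.
loc_head = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16 * 16, 2))
# ... train encoder + loc_head with an MSE loss on centroid annotations ...

# Phase 2: drop the head, freeze the encoder, and attach a decoder.
for p in encoder.parameters():
    p.requires_grad = False

decoder = nn.Sequential(
    nn.Upsample(scale_factor=4),      # placeholder for the decoding arm
    nn.Conv2d(16, 1, 1),              # final 1x1 convolution
    nn.Sigmoid(),                     # sigmoid for the binary disc mask
)
# ... train decoder with a soft Dice loss on the mask-labelled images ...
```

In the full architecture the decoder also receives the encoder's intermediate feature maps via skip connections; the `nn.Sequential` decoder above elides that for brevity.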

2.1 Data and preprocessing

Training, validation, and test data sets were created from a database of more than 10,000 de-identified retinal images obtained using a commercially available camera (RetCam; Natus Medical Incorporated) as part of the [details of institution redacted].

We used 9047 images with optic disc center location annotations for the localization task, and a further 92 images labeled with optic disc binary masks for the segmentation task. Images were pre-processed with grayscale conversion, normalization, contrast-limited adaptive histogram equalization, and gamma correction Brown et al. (2018).

2.2 Optic disc localization

We formulate the localization task as a regression problem using the encoding arm of the U-net shown in Figure 1. Pre-processed fundus photographs are provided as input to the network, which applies a series of convolutional and pooling operations to produce a volume of image features. These features are passed to a fully connected layer without activation (linear) to produce a tuple of (x, y) coordinates. We use convolutions, max-pooling, batch normalization, and ReLU activations throughout the network, and utilize dropout to mitigate overfitting. The network is trained on pre-processed images and their annotated centroid coordinates, minimizing a mean squared error (MSE) cost function with the RMSprop optimizer (Tieleman and Hinton, 2012). We evaluate the performance of the localizer network by calculating the Euclidean distance between the ground truth and predicted centroids.
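The evaluation metric for the localizer, the mean Euclidean distance in pixels between predicted and annotated centres, can be written as a small NumPy helper (the function name is ours, for illustration):

```python
import numpy as np

def localization_error(pred_centroids, true_centroids):
    """Mean Euclidean distance (in pixels) between predicted and
    ground-truth optic disc centres over a set of images."""
    pred = np.asarray(pred_centroids, dtype=float)
    true = np.asarray(true_centroids, dtype=float)
    return np.linalg.norm(pred - true, axis=1).mean()
```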

2.3 Optic disc segmentation

Following training as a localizer, the fully connected layers are removed and the remaining network weights are frozen. The decoding arm is then added to the network, with skip connections, to produce a conventional U-net architecture: image features learned by the localizer (encoding) network are concatenated with those learned by the segmentation (decoding) network. The final two layers of the network are a 1x1 convolution followed by a sigmoid activation. The network is trained on the pre-processed images and binary masks representing the optic disc, minimizing a negative log soft Dice loss function using the Adam optimizer Kingma and Ba (2014). We evaluate the performance of the segmentation network by measuring the Dice overlap between the ground truth and predicted optic disc masks at a probability threshold of 0.5.
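The training loss and evaluation metric can be sketched in NumPy as follows; the smoothing epsilon is an assumption we add for numerical stability, and the function names are ours:

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-7):
    """Negative log soft Dice loss on predicted probabilities
    (eps is an assumed smoothing term, not from the paper)."""
    intersection = (probs * target).sum()
    dice = (2.0 * intersection + eps) / (probs.sum() + target.sum() + eps)
    return -np.log(dice)

def dice_coefficient(probs, target, threshold=0.5):
    """Dice overlap between the thresholded prediction and ground truth."""
    pred = (probs >= threshold).astype(float)
    intersection = (pred * target).sum()
    return 2.0 * intersection / (pred.sum() + target.sum())
```

Note the loss is computed on the soft (un-thresholded) probabilities, which keeps it differentiable, while evaluation thresholds at 0.5 as described above.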

3 Results

In the first stage of training (optic disc localization), the MSE values were 132.16 and 206.15 for the validation and test sets, respectively. For optic disc segmentation, results were evaluated using five-fold cross-validation; the mean Dice coefficient was 0.88 with a standard deviation of 0.1.

3.1 Comparison with the conventional U-net

We compared the performance of our proposed method against the conventional U-net approach: we trained the full network with identical layers, hyper-parameters, and training data, but without the pre-trained encoder weights (random initialization). The average Dice coefficient was 0.84, with a standard deviation of 0.2. These results show that by first learning features on the localization task, we improved the performance of optic disc segmentation.

3.2 Performance with fewer training samples

To see whether the features learned by the localization network can alleviate the demand for manual segmentations, we trained the two models described above (with and without the pre-trained encoder) again, but with different numbers of training samples. Figure 2 shows the Dice coefficient of the models with increasing numbers of training samples. As expected, the Dice coefficient increased in both approaches when using a larger number of images. However, our proposed method affords greater robustness, and the differences are most pronounced when the number of samples is very low. Figure 2(b) shows that the results were significantly different at all sample sizes using a paired t-test. These results are consistent with our hypothesis that, with the pre-trained encoder, fewer segmentation labels are needed to achieve comparable performance.

Figure 2: Dice coefficients of the two training schemes with different proportions of ground truth segmentation data (over all validation splits). (a) Dice coefficients for both training schemes: the blue line represents the model without the pre-trained encoder, and the orange line the model with pre-training as a localizer. (b) P-values from a paired t-test under the null hypothesis that the two models have the same Dice coefficients.

4 Conclusion

In this paper, we introduced a training scheme for optic disc segmentation based on pre-training the encoding arm of a U-net architecture. The approach reduces the demand for expertly-labeled data to achieve good segmentation performance on held-out test sets. Though only tested on a limited dataset with a relatively simple problem, we propose that this concept is extensible to other imaging modalities and segmentation tasks. We intend to assess the scalability of such an approach on more complex tasks, such as brain tumor segmentation, and to compare our approach with other pre-training schemes.


  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Guo et al. [2018] Yanming Guo, Yu Liu, Theodoros Georgiou, and Michael S Lew. A review of semantic segmentation using deep neural networks. International Journal of Multimedia Information Retrieval, 7(2):87–93, 2018.
  • LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • Chen and Lin [2014] Xue-Wen Chen and Xiaotong Lin. Big data deep learning: challenges and perspectives. IEEE Access, 2:514–525, 2014.
  • Joshi et al. [2011] Gopal Datt Joshi, Jayanthi Sivaswamy, and SR Krishnadas. Optic disk and cup segmentation from monocular color retinal images for glaucoma assessment. IEEE Transactions on Medical Imaging, 30(6):1192–1205, 2011.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
  • Brown et al. [2018] James M Brown, J Peter Campbell, Andrew Beers, Ken Chang, Susan Ostmo, RV Paul Chan, Jennifer Dy, Deniz Erdogmus, Stratis Ioannidis, Jayashree Kalpathy-Cramer, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmology, 2018.
  • Tieleman and Hinton [2012] T. Tieleman and G. Hinton. Lecture 6.5: RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.