In medical image analysis, transferring pre-trained encoders as initial models is an effective practice. While supervised representation learning is widely applied, it usually depends on a large amount of annotated data, and the learnt features might be less effective for new tasks that differ from the original training task. Thus, some researchers have turned to unsupervised representation learning [9, 17], and in particular unsupervised discriminative representation learning, which was proposed to measure the similarity of different images [16, 18, 7]. However, these methods mainly learn instance-wise discrimination based on global semantics and cannot characterize the similarities or differences of local regions within an image. Hence, they are less effective for many medical image analysis tasks, such as lesion detection, structure segmentation, and identifying distinctions between different structures, in which local discriminative features need to be captured. To make unsupervised representation learning suitable for these tasks, we introduce local discrimination into unsupervised representation learning in this work.
It is known that medical images of humans contain similar anatomical structures, and thus pixels can be classified into several clusters based on their context. Based on this observation, a local discriminative embedding space can be learnt, in which pixels with similar context are distributed closely and dissimilar pixels are dispersed. In this work, a model containing two branches is constructed following a backbone network: an embedding branch generates pixel-wise embedding features and a clustering branch generates pseudo segmentations. This model classifies pixels into several clusters, where pixels belonging to the same cluster have similar embedding features and pixels of different clusters have dissimilar ones. In this way, local discriminative features can be learnt in an unsupervised way and can be used for evaluating the similarity of local image regions.
The proposed method is further applied to some typical medical image analysis tasks. It is first used to train feature extractors for fundus images and chest X-ray images. Then, the learnt features are applied in two different ways. (1) The learnt features are utilized in 9 different downstream tasks via transfer learning, including the segmentation of retinal vessels and the optic disc (OD) and the recognition of haemorrhages and hard exudates, to enhance the performance of these tasks. (2) Inspired by specialists' ability to recognize anatomical structures based on prior knowledge, we utilize the learnt features to cluster local regions of the same anatomical structure under the guidance of topological priors generated by simulation or taken from other structures with similar topology.
2 Related work
Instance discrimination is an unsupervised representation learning framework that provides a good initialization for downstream tasks, and it can be considered an extension of the exemplar convolutional neural network (CNN). The main idea of instance discrimination is to build an encoder that embeds training samples dispersedly over a hypersphere. Specifically, a CNN is trained to project each image into a low-dimensional hypersphere space, in which the similarity between images can be evaluated by cosine similarity. In this embedding space, dissimilar images are forced to be separately distributed and similar images are forced to be closely distributed. Thus, the encoder can make instance-level discrimination. Wu et al. propose a memory bank of embedding vectors, so that the probability of an image being recognized as the $i$-th example can be expressed by the inner product of its embedding vector and the vectors stored in the memory bank. The discrimination ability of the encoder is obtained by learning to correctly classify each image instance into the corresponding record in the memory bank. However, the vectors stored in the memory bank are usually outdated because they are updated discontinuously. To address this, Ye et al. propose a framework with a siamese network, which introduces augmentation invariance into the embedding space to cluster similar images and realize real-time comparison.
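The memory-bank idea described above can be illustrated with a minimal NumPy sketch. It assumes a softmax over inner products scaled by a temperature `tau`; the bank size, embedding dimension, `tau` value and function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def instance_probabilities(v, memory_bank, tau=0.1):
    """Softmax over inner products between one embedding and all bank entries.

    tau is an assumed temperature; rows of memory_bank are unit vectors.
    """
    logits = memory_bank @ v / tau
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
memory_bank = l2_normalize(rng.normal(size=(5, 8)))   # 5 stored instances, 8-D
# A query that is a slightly perturbed copy of stored instance 2.
query = l2_normalize(memory_bank[2] + 0.05 * rng.normal(size=8))
probs = instance_probabilities(query, memory_bank)
```

Because all vectors lie on the hypersphere, the inner product equals the cosine similarity, so the stored instance closest to the query receives the highest probability.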
This ingenious design enables instance discrimination to effectively utilize unlabeled images to train a generalized feature extractor for downstream tasks and to shrink the gap between unsupervised and supervised representation learning. However, summarizing a global feature to represent an image instance misses local details, which are crucial for medical image tasks, and the high similarity of global semantics between images of the same body part makes instance-wise discrimination less practical. Therefore, it is more reasonable to focus on local discrimination for unsupervised representation learning of medical images. Meanwhile, medical images of the same body part can be divided into several clusters due to their similar anatomical structures, which inspires us to propose a framework that clusters similar pixels to learn local discrimination.
Our unsupervised framework is illustrated in Figure 1. The model has two main components. The first learns a local discriminative representation, which aims to project pixels into a low-dimensional space such that pixels with similar context are closely distributed and dissimilar pixels are far away from each other in the embedding space. The learnt local discriminative representation can be taken as a good feature extractor for downstream tasks. The second introduces prior knowledge of topological structure and relative location into local discrimination, where the prior knowledge is fused into the model to make the distribution of pseudo segmentations closer to the distribution of priors. By combining priors of structures with local discrimination, regions of the expected anatomical structure can be clustered.
3.1 Local discrimination learning
As medical images of the same body region contain the same anatomical structures, image pixels can be classified into several clusters, each of which corresponds to a specific kind of structure. Therefore, local discrimination learning is proposed to train representations that embed each pixel into a hypersphere space, in which pixels with similar context are encoded closely. To achieve this, two branches, an embedding branch that encodes each pixel and a clustering branch that generates pseudo segmentations to cluster pixels, are built following a backbone network and trained in a mutually beneficial manner.
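We denote $f_\theta$ as the deep neural network, where $\theta$ is the parameters of the network. The unlabeled image examples are denoted as $X = \{x_1, x_2, \dots, x_M\}$, where each $x_i$ has spatial size $H \times W$. After feeding $x_i$ into the network, we can get embedding features $v_i$ and probability map $r_i$, i.e., $v_i, r_i = f_\theta(x_i)$, where $v_i(h, w)$ is the $d$-dimensional encoded vector for position $(h, w)$ of image $x_i$ and $r_i(h, w)$ is a vector representing the probability of classifying pixel $x_i(h, w)$ into $N$ clusters. $r_{in}(h, w)$ denotes the probability of classifying pixel $x_i(h, w)$ into the $n$-th cluster. We force $\|r_i(h, w)\|_1 = 1$ and $\|v_i(h, w)\|_2 = 1$ by respectively setting $l_1$ and $l_2$ normalization in the clustering branch and the embedding branch.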
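The two normalization constraints on the branch outputs can be sketched as follows. This is a minimal NumPy illustration, assuming channels-first arrays and non-negative raw activations in the clustering branch; the array names and the tiny spatial size are hypothetical, while 8 clusters and 32 embedding channels follow the architecture described in Section 4.1.

```python
import numpy as np

def l1_normalize(x, axis=0, eps=1e-12):
    """Scale each pixel's vector so that its absolute values sum to 1."""
    return x / np.maximum(np.abs(x).sum(axis=axis, keepdims=True), eps)

def l2_normalize(x, axis=0, eps=1e-12):
    """Scale each pixel's vector to unit Euclidean length."""
    return x / np.maximum(np.sqrt((x ** 2).sum(axis=axis, keepdims=True)), eps)

rng = np.random.default_rng(0)
H, W, N_CLUSTERS, EMB_DIM = 4, 4, 8, 32

# Raw branch outputs for one image (hypothetical values).
raw_cluster = rng.random(size=(N_CLUSTERS, H, W))   # assumed non-negative
raw_embed = rng.normal(size=(EMB_DIM, H, W))

r = l1_normalize(raw_cluster)   # r[:, h, w] is a probability vector over clusters
v = l2_normalize(raw_embed)     # v[:, h, w] is a point on the unit hypersphere
```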
3.1.2 Jointly train clustering branch and embedding branch:
After getting the embedding features $v_i(h, w)$ and the pseudo segmentations $r_{in}(h, w)$, the center embedding feature $c_n$ of the $n$-th cluster can be formulated as follows:
$$c_n = \frac{\sum_{i,h,w} r_{in}(h, w)\, v_i(h, w)}{\left\| \sum_{i,h,w} r_{in}(h, w)\, v_i(h, w) \right\|_2}$$
$l_2$ normalization is also used to keep $c_n$ on the hypersphere; thus, the similarity between $c_n$ and $v_i(h, w)$ can be evaluated by cosine similarity as follows:
$$s(c_n, v_i(h, w)) = c_n^{T} v_i(h, w)$$
To make pixels of the same cluster closely distributed and pixels of different clusters dispersedly distributed, there should be high similarity between $v_i(h, w)$ and the corresponding center embedding feature $c_n$, and low similarity between $c_n$ and $c_m$ ($m \neq n$) as well. Thus, the loss function can be formulated as follows:
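The joint training objective can be sketched in NumPy under stated assumptions. Cluster centers are taken as probability-weighted sums of unit-length pixel embeddings, renormalized to the hypersphere. The loss form used here (mean weighted pixel-to-center similarity as an attraction term, mean pairwise center similarity as a repulsion term, equally weighted) is an assumption consistent with the description, not necessarily the exact formulation; all function and variable names are hypothetical.

```python
import numpy as np

def cluster_centers(v, r, eps=1e-12):
    """v: (M, d, H, W) unit embeddings; r: (M, N, H, W) cluster probabilities.

    Returns (N, d) unit-length centers.
    """
    c = np.einsum("mnhw,mdhw->nd", r, v)    # weighted sum over images and pixels
    return c / np.maximum(np.linalg.norm(c, axis=1, keepdims=True), eps)

def local_discrimination_loss(v, r):
    """Assumed loss: pull pixels toward their centers, push centers apart."""
    c = cluster_centers(v, r)
    sim = np.einsum("nd,mdhw->mnhw", c, v)   # cosine similarity per pixel and cluster
    attract = -(r * sim).sum(axis=1).mean()  # reward pixel-to-center similarity
    n = c.shape[0]
    cc = c @ c.T
    repel = (cc.sum() - np.trace(cc)) / (n * (n - 1))   # mean off-diagonal similarity
    return float(attract + repel)

rng = np.random.default_rng(0)
v = rng.normal(size=(2, 32, 4, 4))
v /= np.linalg.norm(v, axis=1, keepdims=True)
r = rng.random(size=(2, 8, 4, 4))
r /= r.sum(axis=1, keepdims=True)

centers = cluster_centers(v, r)
loss = local_discrimination_loss(v, r)
```

Both terms are bounded in [-1, 1] because every vector involved has unit length and each pixel's cluster probabilities sum to 1.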
3.1.3 More constraints:
3.2 Prior-guided anatomical structure clustering
Commonly, specialists can easily identify anatomical structures based on corresponding prior knowledge, including relative location, topological structure, and even knowledge of similar structures. Therefore, the ability of a DNN to recognize structures based on local discrimination and topological priors is studied in this part. Reference images, which are binary masks taken from similar structures, real data or simulation and which convey knowledge of location and topological structure, are introduced to the network to force the clustering branch to obtain the corresponding structures, as shown in Figure 3.
We denote the distribution of the $n$-th cluster as $P_n$ and the distribution of the corresponding references as $Q_n$. The goal of optimization is to minimize the Kullback-Leibler (KL) divergence between them, and it can be formulated as follows:
$$\min_{\theta} \ \mathrm{KL}(P_n \,\|\, Q_n)$$
To minimize the KL divergence between the cluster distribution and the reference distribution, adversarial learning is utilized to encourage the produced pseudo segmentation to be similar to the reference mask. During training, a discriminator $D$ is set to discriminate between the pseudo segmentation and the reference mask, while the network $f_\theta$ aims to cheat $D$. The loss function for $D$ and the adversarial loss for $f_\theta$ are defined as follows:
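The adversarial objective can be sketched with the standard binary cross-entropy losses from generative adversarial networks. The exact loss definitions are not reproduced here, so the non-saturating form below is an assumption, and the function names and example scores are hypothetical.

```python
import numpy as np

def bce(p, target, eps=1e-7):
    """Mean binary cross-entropy between scores p in (0, 1) and 0/1 targets."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean())

def discriminator_loss(d_ref, d_seg):
    """d_ref, d_seg: discriminator scores for reference masks and pseudo segmentations."""
    return bce(d_ref, np.ones_like(d_ref)) + bce(d_seg, np.zeros_like(d_seg))

def adversarial_loss(d_seg):
    """Assumed non-saturating form: pseudo segmentations should be scored as references."""
    return bce(d_seg, np.ones_like(d_seg))

d_ref = np.array([0.9, 0.8])   # scores on reference masks
d_seg = np.array([0.2, 0.1])   # scores on pseudo segmentations
loss_d = discriminator_loss(d_ref, d_seg)
loss_adv = adversarial_loss(d_seg)
```

In this sketch the discriminator loss is small when references score high and pseudo segmentations score low, while the adversarial loss shrinks as the pseudo segmentations become harder to tell apart from references.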
3.2.1 Reference masks:
(1) From similar structures: Similar structures share similar geometry and topology. Therefore, we can utilize segmentation annotations from similar structures to guide the segmentation of the target, e.g., annotations of vessels in OCTA can be utilized for the clustering of retinal vessels in fundus images. (2) From real data: Corresponding annotations of the target structure can be directly set as the prior knowledge. (3) Simulation: Based on their understanding of the anatomy, experts can draw pseudo masks to show the information of relative location, topology, etc. For example, based on a retinal vessel mask, the approximate locations of the OD and fovea can be identified. Then, ellipses can be placed at these positions to represent the OD and fovea based on their geometry priors.
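The simulated reference masks can be illustrated with a short NumPy sketch that draws filled ellipses at chosen positions. The image size, centers and radii below are hypothetical values for illustration; in practice they would be chosen from the approximate OD and fovea locations.

```python
import numpy as np

def ellipse_mask(shape, center, radii):
    """Binary mask with a filled axis-aligned ellipse.

    center = (row, col) and radii = (row_radius, col_radius), in pixels.
    """
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    inside = ((rows - center[0]) / radii[0]) ** 2 + ((cols - center[1]) / radii[1]) ** 2 <= 1.0
    return inside.astype(np.uint8)

# Hypothetical positions for an optic disc and a fovea on a 64 x 64 grid.
od_mask = ellipse_mask((64, 64), center=(32, 16), radii=(8, 6))
fovea_mask = ellipse_mask((64, 64), center=(32, 44), radii=(4, 4))
```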
4 Experiments and Discussion
The experiments can be divided into two parts to show the effectiveness of our proposed unsupervised local discrimination learning and prove the feasibility of combining local discrimination and topological priors to cluster target structures.
4.1 Network architectures and initialization
The backbone is a U-Net consisting of a VGG-like encoder and a decoder. The encoder is a tiny version of VGG-16 without fully connected layers (FCs), whose channel numbers are a quarter of those of VGG-16. The decoder is composed of 4 convolution blocks, each of which is made up of two convolution layers. The final features of the decoder are concatenated with the features generated by the first convolution block of the encoder for further processing by the clustering branch and the embedding branch. The embedding branch is formed of 2 convolution layers with 32 channels and an $l_2$ normalization layer to project each pixel onto a 32-D hypersphere. The clustering branch consists of 2 convolution layers with 8 channels followed by an $l_1$ normalization layer.
To minimize the KL divergence between the pseudo segmentation distribution and the references distribution, a discriminator is created. The discriminator is a simple classifier with 7 convolution layers and 2 FCs to make classification. The channel numbers of convolution layers are 16, 32, 32, 32, 32, 64, 64 and the first 5 layers are followed by a max-pooling layer to halve the image size. FCs’ channels are 32 and 1, and the final FC is followed by a Sigmoid layer.
Patch discrimination to initialize the network: It is hard to simultaneously train the clustering branch and the embedding branch from scratch. Thus, we first jointly pre-train the backbone and the embedding branch by patch discrimination, which is an improvement of instance discrimination. The main idea is that the embedding branch should project similar patches (patches under various augmentations) into close positions on the hypersphere. The embedding features are first processed by an adaptive average pooling layer (APP) to generate spatial features, each of which represents the feature of a corresponding patch of image $x_i$. We denote $s_{ip}$ as the embedding vector for $x_{ip}$ (the $p$-th patch of $x_i$), where $\|s_{ip}\|_2 = 1$ by setting an $l_2$ normalization. $\hat{s}_{ip}$ is denoted as the embedding vector of the corresponding augmented patch $\hat{x}_{ip}$. The probability of region $\hat{x}_{ip}$ being recognized as region $x_{ip}$ can be defined as follows:
Assuming that all patches being recognized as $x_{ip}$ are independent, the joint probability of $\hat{x}_{ip}$ being recognized as $x_{ip}$ and of every other patch not being recognized as $x_{ip}$ is as follows:
The negative log likelihood and loss function are formulated as follows:
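Patch discrimination can be sketched in NumPy as follows. The sketch assumes that patch embeddings come from average pooling a feature map into a grid of unit-length vectors, and that recognition probabilities follow the temperature-scaled softmax used in instance discrimination; the temperature `tau`, the divisibility simplification of adaptive pooling, and all names are hypothetical.

```python
import numpy as np

def patch_embeddings(features, grid=4, eps=1e-12):
    """Average-pool a (d, H, W) feature map into grid*grid unit-length patch vectors.

    Assumes H and W are divisible by grid, a simplification of adaptive pooling.
    """
    d, h, w = features.shape
    pooled = features.reshape(d, grid, h // grid, grid, w // grid).mean(axis=(2, 4))
    s = pooled.reshape(d, grid * grid).T                  # (patches, d)
    return s / np.maximum(np.linalg.norm(s, axis=1, keepdims=True), eps)

def softmax_rows(logits):
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def patch_discrimination_loss(s, s_aug, tau=0.1):
    """s, s_aug: (P, d) unit embeddings of patches and of their augmented versions."""
    n_patches = s.shape[0]
    # p_pos[k]: probability that augmented patch k is recognized as patch k.
    p_pos = np.diag(softmax_rows(s_aug @ s.T / tau))
    # p_neg[j, k]: probability that patch j is recognized as patch k.
    p_neg = softmax_rows(s @ s.T / tau)
    off_diag = ~np.eye(n_patches, dtype=bool)
    nll = -np.log(p_pos).sum() - np.log(1.0 - p_neg[off_diag]).sum()
    return float(nll / n_patches)

rng = np.random.default_rng(0)
features = rng.normal(size=(32, 16, 16))
s = patch_embeddings(features)
# A lightly perturbed copy stands in for an augmented view.
s_aug = patch_embeddings(features + 0.01 * rng.normal(size=features.shape))
loss_matched = patch_discrimination_loss(s, s_aug)
loss_shuffled = patch_discrimination_loss(s, np.roll(s_aug, 1, axis=0))
```

When augmented patches are paired with the wrong originals (the shuffled case), the positive term is penalized and the loss increases, which is the behavior the pre-training relies on.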
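We also introduce mixup to make the representations more robust. Based on mixup, a virtual sample $\tilde{x}_{ij} = \lambda x_i + (1 - \lambda) x_j$ is first generated by linear interpolation of $x_i$ and $x_j$, where $\lambda \in (0, 1)$. The embedded representation for patch $\tilde{x}_{ijp}$ is $\tilde{s}_{ijp}$, and we expect it to be similar to the mixup feature obtained by interpolating $s_{ip}$ and $s_{jp}$ with the same $\lambda$. The loss is defined as follows: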
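The mixup step can be sketched as follows. The interpolation of inputs is the standard mixup operation; renormalizing the interpolated feature to the hypersphere and measuring agreement by one minus cosine similarity are assumptions made for this illustration, and all names are hypothetical.

```python
import numpy as np

def mixup(x_a, x_b, lam):
    """Linear interpolation of two inputs with weight lam."""
    return lam * x_a + (1.0 - lam) * x_b

def mixup_target(s_a, s_b, lam, eps=1e-12):
    """Interpolated embedding, renormalized to unit length (an assumption)."""
    t = lam * s_a + (1.0 - lam) * s_b
    return t / np.maximum(np.linalg.norm(t, axis=-1, keepdims=True), eps)

def mixup_similarity_loss(s_mix, s_a, s_b, lam):
    """Assumed loss: one minus cosine similarity between the mixed-image
    embedding and the mixed-feature target."""
    target = mixup_target(s_a, s_b, lam)
    return float(1.0 - (s_mix * target).sum(axis=-1).mean())

lam = 0.25
x_mix = mixup(np.zeros((2, 2)), np.ones((2, 2)), lam)

rng = np.random.default_rng(0)
s_a = rng.normal(size=(16, 32))
s_a /= np.linalg.norm(s_a, axis=1, keepdims=True)
s_b = rng.normal(size=(16, 32))
s_b /= np.linalg.norm(s_b, axis=1, keepdims=True)
perfect = mixup_similarity_loss(mixup_target(s_a, s_b, lam), s_a, s_b, lam)
```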
When pre-training this model, we set the training loss as . The output size of APP is set as $4 \times 4$ to split each image into 16 patches. Each batch contains 16 groups of images and 8 corresponding mixup images, and each group contains 2 augmentations of one image. The augmentation methods contain , , , , in PyTorch. The optimizer is Adam with an initial learning rate () of , which is halved if the validation loss does not decrease over 3 epochs. The maximum training epoch is 20.
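The learning-rate schedule can be sketched in plain Python. The class below is a hypothetical helper, not a library API: it halves the rate after 3 consecutive epochs without a new best validation loss, which is one reading of the rule described above. The initial rate of 1e-3 is a placeholder, since the actual value is not given here.

```python
class HalveOnPlateau:
    """Halve the learning rate when the validation loss stops improving."""

    def __init__(self, lr, patience=3, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

scheduler = HalveOnPlateau(lr=1e-3)   # placeholder initial rate
val_losses = [1.0, 0.9, 0.95, 0.92, 0.91, 0.8]
lrs = [scheduler.step(loss) for loss in val_losses]
```

In this run the rate is halved at the fifth epoch, after three epochs in a row fail to improve on the best loss of 0.9.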
4.2 Experiments for learning local discrimination
4.2.1 Datasets and preprocessing:
Our method is evaluated in two medical scenes. Fundus images: The model is first trained on the diabetic retinopathy (DR) detection dataset of Kaggle ( for training, for validation). Then, the pre-trained encoder is transferred to 8 segmentation tasks: (1) Retinal vessel: DRIVE (20 for training, 20 for testing), STARE (10 for training, 10 for testing) and CHASEDB1 (20 for training, 8 for testing). (2) OD and cup: Drishti-GS (50 for training, 50 for testing) and IDRID (OD) (54 for training, 27 for testing). (3) Lesions: Haemorrhages dataset (Hae) and hard exudates dataset (HE) from IDRID. Chest X-ray: The encoder is pre-trained on ChestX-ray8 ( for training and for validation) and transferred to lung segmentation (69 for training, 69 for testing). All images of the above datasets are resized to .
4.2.2 Implementation details:
(1) Local discriminative representation learning: The model is firstly initialized by pre-trained model of patch discrimination. Then the training loss is set as . Each batch has 6 groups of images, each of which contains 2 augmentations of one image, and 3 mixup images. The maximum training epoch is 80 and the optimizer is Adam with .
(2) Transferring: The encoder of downstream tasks is initialized by the learnt feature extractor of local discrimination. The decoder is composed of 5 convolution blocks, each of which contains 2 convolution layers and is followed by an up-pooling layer. The loss is set as . This model is first trained for 100 epochs in frozen pattern with Adam with , and is then trained in fine-tune pattern with in the following 100 epochs.
(3) Comparative methods: Random: The network is trained from scratch. Supervised: Supervised by the manual score of DR, the encoder will be firstly trained by making classification. Wu et al.  and Ye et al. : Instance discrimination methods proposed in  and . LD: The proposed local discrimination.
|Retinal vessel||Optic disc and cup||Lesions||X-ray|
|Wu et al.|
|Ye et al.||97.40|
The evaluation metric is the mean Dice-Sørensen coefficient (DSC): $DSC = \frac{2|P \cap G|}{|P| + |G|}$, where $P$ is the binary prediction and $G$ is the ground truth. Quantitative evaluations for the downstream tasks of fundus and lung segmentation are shown in Table 1, and we have the following observations:
1) The generalization ability of the trained local discriminative representation is demonstrated by the leading performance in the 6 fundus tasks and lung segmentation. Compared with models trained from scratch, models initialized by our pre-trained encoder can respectively gain improvements of 1.39%, 7.16%, 2.05%, 11.54%, 1.12%, 6.48%, 8.96%, 8.33% and 1.17% in DSC for all 9 tasks.
2) Compared with instance discrimination methods by Wu et al.  and Ye et al. , the proposed local discrimination is able to learn finer features and is more suitable for unsupervised representation learning of medical images.
3) The proposed unsupervised method is free from labeled images and the learnt representation is more generalized, while supervised representation learning relies on expensive manual annotations and learns specialized representations. As shown in Table 1, our method shows better performance than supervised representation learning, whose target is to classify DR; the only exception is segmenting haemorrhages, which are the key evidence for DR.
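The Dice-Sørensen coefficient used as the evaluation metric in this section can be computed as in the following NumPy sketch. The function name and the small `eps` term, added to avoid division by zero on empty masks, are choices made for this illustration.

```python
import numpy as np

def dice(pred, truth, eps=1e-8):
    """Dice-Sørensen coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(pred, truth).sum()
    return float(2.0 * intersection / (pred.sum() + truth.sum() + eps))

pred = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
score = dice(pred, truth)   # 2 overlapping pixels, 3 predicted, 3 true: 2*2 / (3+3)
```

The mean DSC reported for a task would then be the average of this value over all test images.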
4.3 Experiments for clustering structures based on prior knowledge
4.3.1 Implementation details:
In this part, we respectively fuse reference images from real data, similar structures and simulations into local discrimination to investigate the ability of clustering anatomical structures. A dataset with 3110 high-quality fundus images from  and 1482 frontal X-rays from  are utilized as the training data. The reference images are constructed in 3 ways: (1) From real references: All 40 retinal vessel masks of DRIVE are utilized as the references for clustering pixels of vessels. (2) From similar structures: Similar structures share similar priors; thus, 10 OCTA vessel masks are utilized as the references for retinal vessels in fundus images. (3) Simulation: We directly draw 20 simulated lung masks to guide lung segmentation. Meanwhile, based on the vessel masks of DRIVE, we place ellipses at the approximate center locations of the OD and fovea to generate pseudo masks. Some reference masks are shown in Figure 4.
The network $f_\theta$ needs to jointly learn local discrimination and cheat the discriminator $D$; thus, it is updated by minimizing the following loss:
The optimizer for $f_\theta$ is Adam with . The discriminator $D$ is optimized by minimizing its loss function, and its optimizer is Adam with . It is worth noting that during the clustering training of OD and fovea, all masks of real vessel, fovea and OD are concatenated and fed into $D$ to provide enough information, and $f_\theta$ is first pre-trained to cluster retinal vessels. The maximum training epoch is 80.
Visualization examples are shown in Figure 4. Quantitative evaluations are as follows: (1) Retinal vessel segmentation is evaluated on the test data of STARE, and the DSC values are respectively 66.25% and 57.35% for the model based on real references and the model based on OCTA annotations. (2) The segmentation of OD is evaluated on the test data of Drishti-GS and achieves a DSC of 83.60%. (3) The segmentation of fovea is evaluated on the test data of STARE. Because the region of fovea is fuzzy, we measure the mean distance between the real center of fovea and the predicted center. The mean distance is . (4) The segmentation of lung is evaluated on NLM  and the DSC is 81.20%.
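The center-distance measurement used for the fovea can be sketched as follows. How the predicted center is obtained is not specified here, so taking the centroid of the predicted mask is an assumption; the function names and toy data are hypothetical, and distances are in pixels.

```python
import numpy as np

def mask_center(mask):
    """Centroid (row, col) of the foreground pixels of a binary mask."""
    rows, cols = np.nonzero(mask)
    return np.array([rows.mean(), cols.mean()])

def mean_center_distance(pred_masks, true_centers):
    """Mean Euclidean distance between predicted centroids and annotated centers."""
    distances = [
        np.linalg.norm(mask_center(m) - np.asarray(c, dtype=float))
        for m, c in zip(pred_masks, true_centers)
    ]
    return float(np.mean(distances))

pred = np.zeros((10, 10), dtype=np.uint8)
pred[2:5, 2:5] = 1                      # predicted region with centroid (3, 3)
distance = mean_center_distance([pred], [(3, 7)])
```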
Based on above results, we can have following observations:
1) In general, topological priors generated from simulation or from similar structures in a different modality are effective in guiding the clustering of target regions.
2) However, real masks contain more detailed information and are able to provide more precise guidance. For example, segmentations based on OCTA annotations miss thin blood vessels because of the great vessel thickness in OCTA masks, whereas segmentations based on real masks can recognize thin vessels thanks to the details provided and the constraint of clustering pixels with similar context.
3) For anatomical structures with fuzzy intensity pattern, such as fovea, combining local similarity and structure priors is able to guide precise recognition.
In this paper, we propose an unsupervised framework to learn local discriminative representations for medical images. By transferring the learnt feature extractor, downstream tasks can be improved, which decreases the demand for expensive annotations. Furthermore, similar structures can be clustered by fusing prior knowledge into the learning framework. The experimental results show that our method achieves the best performance on 7 out of 9 tasks on fundus and chest X-ray images, demonstrating the strong generalization of the learnt representation. Meanwhile, the feasibility of clustering structures based on prior knowledge and unlabeled images is demonstrated by combining local discrimination with topological priors from real data, similar structures or even simulations to segment anatomical structures including retinal vessels, OD, fovea and lung.
-  Note: https://www.kaggle.com/c/diabetic-retinopathy-detection/data Cited by: §4.2.1.
-  (2019) Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15535–15545. Cited by: §2.
-  (2013) Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE transactions on medical imaging 33 (2), pp. 577–590. Cited by: §4.2.1, §4.3.2.
-  (2015) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence 38 (9), pp. 1734–1747. Cited by: §1, §2.
-  (2011) Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 315–323. Cited by: §3.1.3.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §3.2.
-  (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1, §2, §2.
-  (2000) Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Transactions on Medical imaging 19 (3), pp. 203–210. Cited by: §4.2.1.
-  (2020) Whole milc: generalizing learned dynamics across tasks, datasets, and populations. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 407–417. Cited by: §1.
-  (2009) Measuring retinal vessel tortuosity in 10-year-old children: validation of the computer-assisted image analysis of the retina (caiar) program. Investigative ophthalmology & visual science 50 (5), pp. 2004–2010. Cited by: §4.2.1.
-  (2018) Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research. Data 3 (3), pp. 25. Cited by: §4.2.1.
-  (2014) Drishti-gs: retinal image dataset for optic nerve head (onh) segmentation. In 2014 IEEE 11th international symposium on biomedical imaging (ISBI), pp. 53–56. Cited by: §4.2.1.
-  (2004) Ridge-based vessel segmentation in color images of the retina. IEEE transactions on medical imaging 23 (4), pp. 501–509. Cited by: §4.2.1.
-  (2018) Weakly-supervised lesion detection from fundus images. IEEE transactions on medical imaging, pp. 1501–1512. Cited by: §4.3.1.
-  (2017) Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106. Cited by: §4.2.1, §4.3.1.
-  (2018) Unsupervised feature learning via non-parametric instance discrimination. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §1, §2, §4.2.2, §4.2.3.
-  (2020) Instance-aware self-supervised learning for nuclei segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 341–350. Cited by: §1.
-  (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on computer vision and pattern recognition, pp. 6210–6219. Cited by: §1, §2, §4.1, §4.2.2, §4.2.3.
-  (2018) Mixup: beyond empirical risk minimization. In International Conference on Learning Representations, Cited by: §4.1.