Deep Semi-supervised Metric Learning with Dual Alignment for Cervical Cancer Cell Detection

by Zhizhong Chai, et al.

With the availability of huge amounts of labeled data, deep learning has achieved unprecedented success in various object detection tasks. However, large-scale annotations for medical images are extremely challenging to acquire due to the high demand for labour and expertise. To address this issue, we propose a novel semi-supervised deep metric learning method that effectively leverages both labeled and unlabeled data, with application to cervical cancer cell detection. Different from previous methods, our model learns an embedding metric space and conducts dual alignment of semantic features on both the proposal and prototype levels. First, on the proposal level, we generate pseudo labels for the unlabeled data to align the proposal features with learnable class proxies derived from the labeled data. Furthermore, we align the prototypes generated from each mini-batch of labeled and unlabeled data to alleviate the influence of possibly noisy pseudo labels. Moreover, we adopt a memory bank to store the labeled prototypes and hence significantly enrich the metric learning information from larger batches. To comprehensively validate the method, we construct, for the first time, a large-scale dataset for semi-supervised cervical cancer cell detection, consisting of 240,860 cervical cell images in total. Extensive experiments show that our proposed method consistently outperforms other state-of-the-art semi-supervised approaches, demonstrating the efficacy of deep semi-supervised metric learning with dual alignment in improving cervical cancer cell detection performance.




1 Introduction

Cervical cancer screening is an effective way to reduce the incidence of cervical cancer. According to the Bethesda system [13], the most common lesions can be categorized as atypical squamous cells of undetermined significance (ASC-US), low-grade squamous intraepithelial lesion (LSIL), high-grade squamous intraepithelial lesion (HSIL), and atypical glandular cells (AGC), whose identification requires careful observation under a microscope by experienced cytologists. Digitalization and artificial intelligence (AI) technology empower computational pathology with the potential to relieve cytologists of an overburdened workload.

Recently, deep learning has demonstrated promising achievements in a wide range of pathological tasks, including cervical cancer [19, 8], lung cancer [17, 25], colorectal cancer [29], and breast cancer [21, 7], all of which rely heavily on a large number of human annotations. Nevertheless, the professional expertise required in pathology severely restricts the acquisition of fine annotations on a large scale. Hence, a series of semi-supervised methods have been proposed. Shi et al. [16] proposed a self-ensembling based semi-supervised deep architecture to train the network with noisy labels. Xie et al. [30] proposed a pairwise relation network on both labeled and unlabeled data for gland segmentation. Wang et al. [25] employed a fully convolutional framework to explore the possibility of weakly supervised learning on both labeled and unlabeled whole-slide lung images for classification. To the best of our knowledge, most cervical lesion detection studies are fully supervised and do not consider unlabeled data, e.g., [8]. How to fulfill the potential of unlabeled data containing massive hard mimics is very challenging but of great value for cervical cancer detection.

Many semi-supervised methods have been proposed for object detection under different application scenarios. A broad branch of work is based on knowledge distillation. For example, Wang et al. [22] proposed a soft target focal loss to handle the foreground-background imbalance when learning from ensembled predictions. Zhou et al. [32] proposed consistency constraints on both the logits and feature maps from a teacher model. Wang et al. [26] proposed adaptive asymmetric label sharpening for knowledge distillation to learn small lesions from unlabeled data. These methods rely heavily on the quality of the predictions generated by self-ensembling models or prediction ensembles. Another branch of work generates pseudo labels for the unlabeled data. Liu et al. [10] argued that the imbalance of pseudo foregrounds and backgrounds strongly affects semi-supervised detection performance, and therefore proposed to utilize focal loss to better learn the pseudo labels. Wang et al. [24] presented the co-mining algorithm, which combines the predictions from a Siamese network to refine pseudo labels. Nevertheless, these works can be prone to overfitting on the pseudo labels.

To enable large-scale semi-supervised cervical cancer cell detection, we present a novel deep semi-supervised metric learning network for cervical cancer detection from pathology images. Our model performs metric learning with both proposal-level and prototype-level alignment in a metric space to learn more discriminative features. On the proposal level, we generate pseudo labels for the unlabeled data to align the proposal features with learnable categorical proxies derived from the labeled images. As the pseudo labels are possibly noisy, we further propose to align the labeled and unlabeled prototypes generated from each mini-batch of data. Moreover, we adopt a memory bank to store the labeled prototypes and hence enrich the metric learning information from larger batches. To the best of our knowledge, metric learning studies usually operate on fully labeled data [27, 23, 2] or purely unlabeled data [1, 3]; we instead aim to further unleash the potential of metric learning models by unifying both labeled and unlabeled data. To develop and validate the algorithm, we construct, for the first time, a large-scale pathology dataset with 240,860 cervical cell images for semi-supervised cervical cancer detection. Extensive experiments show that our proposed method improves on the fully supervised baseline under various scenarios and outperforms other state-of-the-art semi-supervised detection approaches.

2 Method

Figure 1: Overview of our proposed framework. For supervised learning, the standard supervised losses are calculated to optimize the model. For unsupervised learning, we conduct the proposal-level metric learning and prototype alignment in the same embedding metric space generated by the projection head.

Let $\mathcal{D}_L$ denote the labeled dataset and $\mathcal{D}_U$ denote the unlabeled dataset. Usually, the size of $\mathcal{D}_L$ is small, whereas the data in $\mathcal{D}_U$ can be numerous. The goal of semi-supervised cervical cancer detection is to improve performance by effectively leveraging $\mathcal{D}_U$ without additional human annotation. Here, we build our semi-supervised metric learning framework on Faster R-CNN [15] with a Feature Pyramid Network (FPN) [9]. Fig. 1 illustrates an overview of the proposed method, which is elaborated in the following sections.

2.1 Proxy-based Proposal Alignment

Instead of directly training the classifiers of the model as in existing semi-supervised methods, we conduct distance metric learning in a learned embedding space to achieve semantic alignment of the proposal features between the labeled and unlabeled data. Specifically, we first incorporate a projection head to map the proposal features into an embedding metric space. Formally, let $f$ denote the proposal feature generated by the detection head; the projection feature $p$ is obtained by

$$p = W_2\,\sigma(W_1 f),$$

where $W_1$ and $W_2$ denote two fully connected layers for dimension reduction, and $\sigma$ is the ReLU nonlinearity [12]. Next, we adopt a proxy-based loss that uses semantic proxies as class representatives to compute the distances between proposals [11]. Note that the proxies are defined as part of the network parameters and are optimized during training. To alleviate the negative influence of pseudo labels from unlabeled data, the proxies are only updated from the labeled data. For labeled data, the proxy-based metric learning loss is defined as:

$$\mathcal{L}^{l}_{proxy} = -\log \frac{\exp\left(-d(p^{l}_{y}, c_{y})\right)}{\sum_{k \neq y} \exp\left(-d(p^{l}_{y}, c_{k})\right)},$$

where $p^{l}_{y}$ denotes a projected feature belonging to category $y$ out of $K$ classes in total, $c_{y}$ is the proxy of category $y$, and the sum runs over the proxies $c_k$ of the other categories, $k \in \{1, \dots, K\} \setminus \{y\}$. $d(\cdot,\cdot)$ is the distance function, which is the Euclidean distance in our case. For unlabeled data, we take the probabilities generated by the model as the pseudo label for each proposal embedding. The proxy-based loss for unlabeled data is then:

$$\mathcal{L}^{u}_{proxy} = -\log \frac{\exp\left(-d(p^{u}_{\hat{y}}, c_{\hat{y}})\right)}{\sum_{k \neq \hat{y}} \exp\left(-d(p^{u}_{\hat{y}}, c_{k})\right)},$$

where $p^{u}_{\hat{y}}$ is the projected feature for unlabeled data, whose label $\hat{y}$ is determined by thresholding the network's prediction.
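As an illustration of the proposal-level alignment described in this section, the following is a minimal, framework-free sketch. It assumes a bias-free two-layer projection head, the Proxy-NCA form of [11] (positive proxy in the numerator, the other classes' proxies in the denominator), and that proposals whose top score falls below the threshold are simply skipped. The function names and toy values are hypothetical and are not taken from the authors' implementation.

```python
import math

def project(feature, w1, w2):
    """Two fully connected layers with a ReLU in between (biases omitted: an assumption)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, feature))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def proxy_nca_loss(embedding, label, proxies):
    """-log( exp(-d(p, c_y)) / sum_{k != y} exp(-d(p, c_k)) ), assumed Proxy-NCA form."""
    pos = euclidean(embedding, proxies[label])
    neg_logits = [-euclidean(embedding, c) for k, c in enumerate(proxies) if k != label]
    m = max(neg_logits)  # subtract the max before exponentiating for stability
    log_sum_neg = m + math.log(sum(math.exp(z - m) for z in neg_logits))
    return pos + log_sum_neg

def pseudo_label(probs, threshold=0.5):
    """Arg-max class if its probability reaches the threshold, otherwise None."""
    k = max(range(len(probs)), key=lambda i: probs[i])
    return k if probs[k] >= threshold else None

# Toy setting: 3 classes, 2-D embedding space, made-up proxy positions.
proxies = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]

# Labeled proposal: the ground-truth class indexes the positive proxy.
labeled_loss = proxy_nca_loss([0.1, 0.2], 0, proxies)

# Unlabeled proposal: the thresholded prediction supplies the label.
label = pseudo_label([0.1, 0.8, 0.1])
unlabeled_loss = proxy_nca_loss([3.8, 0.1], label, proxies) if label is not None else 0.0
```

In a real detector the proxies would be trainable parameters, and, as stated in the text, gradients from the unlabeled loss would not be allowed to update them.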

2.2 Memory Bank-based Prototype Alignment

Pseudo label assignment is a hard assumption that can easily result in overfitting to model predictions. To enable smoother regularization, we adopt a prototype alignment loss to further maintain the semantic consistency between the labeled and unlabeled data. Specifically, the mean feature of the labeled proposals of each class $k$ is computed as the ground-truth-guided prototype $g_k$. We then aggregate proposals with a confidence-based fusion strategy to obtain the confidence-guided prototype $q_k$, which can alleviate the influence of low-confidence samples as suggested in [31, 18]. Formally, the two types of prototypes are obtained as follows:

$$g_k = \frac{\sum_{i=1}^{N} \mathbb{1}[y_i = k]\, p_i}{\sum_{i=1}^{N} \mathbb{1}[y_i = k]}, \qquad q_k = \frac{\sum_{i=1}^{N} s_i^{k}\, p_i}{\sum_{i=1}^{N} s_i^{k}},$$

where $N$ is the number of proposals in a mini-batch, $y_i$ is the ground-truth class of proposal $i$, and $s_i^{k}$ is the prediction score of class $k$ for the proposal projection $p_i$. We then align the two types of prototypes in the embedding metric space:

$$\mathcal{L}_{proto} = \sum_{k=1}^{K} d\left(g_k, q^{u}_k\right),$$

where $q^{u}_k$ is the confidence-guided prototype computed from the unlabeled proposals and $d(\cdot,\cdot)$ is the Euclidean distance used in Sec. 2.1.

However, the prototypes of labeled and unlabeled images generated in a mini-batch may suffer from category mismatch: some classes present in the labeled data of a mini-batch may be absent from the unlabeled data, so prototype alignment cannot be conducted for these classes. Therefore, we further adopt a memory bank [28] to store the ground-truth-guided prototypes ($g_k$) from previous steps, which effectively enriches the metric learning information from larger batches. Moreover, to enhance the semantic consistency of the prototypes from the training data, we apply the alignment loss between the confidence-guided prototypes (of labeled data, $q^{l}_k$, and of unlabeled data, $q^{u}_k$) and the mean prototypes of the memory bank. Denoting the mean prototypes of the memory bank as $\bar{g}_k$, we construct the prototype-level alignment as follows:

$$\mathcal{L}^{m}_{proto} = \sum_{k=1}^{K} \left[ d\left(q^{l}_k, \bar{g}_k\right) + d\left(q^{u}_k, \bar{g}_k\right) \right].$$
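The prototype computation and memory bank described in this section can be sketched as follows. This is only an illustration under stated assumptions: the confidence-guided prototype is normalized by the sum of scores, the bank is a first-in-first-out queue kept per class, and the bank size of 1024 is applied per class. The class `PrototypeBank` and the toy values are hypothetical.

```python
import math
from collections import deque

def mean_prototype(embeddings):
    """Ground-truth-guided prototype: plain mean of one class's labeled proposal embeddings."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]

def confidence_prototype(embeddings, scores):
    """Confidence-guided prototype: score-weighted mean (normalization is an assumption)."""
    total = sum(scores)
    return [sum(s * x for s, x in zip(scores, col)) / total for col in zip(*embeddings)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class PrototypeBank:
    """FIFO memory of ground-truth-guided prototypes, one queue per class (assumed layout)."""

    def __init__(self, num_classes, size=1024):
        self.queues = [deque(maxlen=size) for _ in range(num_classes)]

    def push(self, cls, prototype):
        # The oldest entry is dropped automatically once the queue is full.
        self.queues[cls].append(prototype)

    def mean(self, cls):
        """Mean prototype of a class, or None if that class has not been seen yet."""
        queue = self.queues[cls]
        return mean_prototype(list(queue)) if queue else None

# Toy step: store a labeled prototype, then align an unlabeled one against the bank mean.
bank = PrototypeBank(num_classes=4)
bank.push(0, mean_prototype([[0.0, 0.0], [2.0, 4.0]]))
q_unlabeled = confidence_prototype([[1.0, 1.0], [3.0, 5.0]], [0.9, 0.6])
alignment = euclidean(q_unlabeled, bank.mean(0))
```

Because the bank keeps prototypes from earlier steps, a class that is missing from the labeled half of the current mini-batch can still be aligned, which is the category-mismatch issue the text raises.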


2.3 Optimization and Implementation Details

The final training loss is defined as follows:

$$\mathcal{L} = \mathcal{L}_{sup} + \lambda(t)\left(\mathcal{L}^{l}_{proxy} + \mathcal{L}^{u}_{proxy} + \mathcal{L}_{proto} + \mathcal{L}^{m}_{proto}\right),$$

where $\mathcal{L}_{sup}$ is the conventional supervised loss of Faster R-CNN, $\mathcal{L}^{l}_{proxy}$ and $\mathcal{L}^{u}_{proxy}$ are the proxy-based losses on labeled and unlabeled data (Sec. 2.1), and $\mathcal{L}_{proto}$ and $\mathcal{L}^{m}_{proto}$ are the mini-batch and memory-bank prototype alignment losses (Sec. 2.2). $\lambda(t)$ is a piece-wise weight function following typical semi-supervised learning methods [5, 6, 20] for stable training. We use Faster R-CNN [15] with an ImageNet-pre-trained ResNet-101 [4] backbone and FPN [9] as our base model. All experiments are implemented in PyTorch [14]. During training, we use 2 TITAN Xp GPUs with a batch size of 8, in which labeled and unlabeled data are equally sampled. For proposal alignment, the confidence threshold for pseudo labels is 0.5, and the proxies are randomly initialized and updated by stochastic gradient descent (SGD) with an initial learning rate of 0.01. For prototype alignment, the size of the memory bank is set to 1024. Random flipping and color jittering are applied for data augmentation. The model weights are updated by SGD with a momentum of 0.9; the initial learning rate is 0.005 and is multiplied by 0.1 every 40,000 iterations.
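The step learning-rate schedule stated above, together with one possible shape for the piece-wise loss weight, can be written as a short sketch. The learning-rate values come from the text; the weight function is purely illustrative, since the text does not give its breakpoints, so `warmup`, `ramp`, and `max_weight` are hypothetical parameters.

```python
def learning_rate(iteration, base_lr=0.005, gamma=0.1, step=40000):
    """Step decay: start at 0.005 and multiply by 0.1 every 40,000 iterations."""
    return base_lr * gamma ** (iteration // step)

def unsup_weight(iteration, warmup=5000, ramp=20000, max_weight=1.0):
    """Hypothetical piece-wise weight: zero during warm-up, linear ramp, then constant."""
    if iteration < warmup:
        return 0.0
    if iteration < warmup + ramp:
        return max_weight * (iteration - warmup) / ramp
    return max_weight

# Print the schedule at a few iterations to see both pieces change over time.
for it in (0, 15000, 40000, 85000):
    print(it, learning_rate(it), unsup_weight(it))
```

Keeping the weight at zero early on lets the detector fit the labeled data before noisy pseudo labels start to contribute, which is the usual motivation for such ramp-up schedules.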

3 Experiments and Results

3.1 Dataset and Evaluation Metrics

A large-scale cervical pathology dataset is collected with 240,860 images in total. Among those, 42,073 images of size 1200×1200 were obtained from the liquid-based Pap test specimens of 997 patients and are used as the labeled data, for which 116,919 annotations were given by pathologists. These images were divided into a labeled training set, a validation set, and a testing set with a ratio of 7:1:2, with no patient overlap. Then, a fully supervised detection network (Faster R-CNN) was trained to screen regions of interest from the whole-slide images of another 1,427 patients, which generates our unlabeled dataset (a total of 198,787 patches of size 1200×1200). The mean Average Precision (mAP) from AP10 to AP70 with an interval of 5 is adopted as the evaluation metric.
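One reading of this metric is that AP10 through AP70 denote average precision at IoU thresholds from 0.10 to 0.70 in steps of 0.05, averaged over the 13 thresholds. That interpretation is an assumption, and the sketch below only shows the threshold grid, a box IoU helper, and the final averaging; computing the per-threshold AP itself is left out.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Assumed threshold grid: 0.10, 0.15, ..., 0.70.
IOU_THRESHOLDS = [t / 100 for t in range(10, 75, 5)]

def mean_ap(ap_at_threshold):
    """Average the per-threshold AP values; the argument maps IoU threshold -> AP."""
    return sum(ap_at_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)
```

Compared with the COCO-style range of 0.50 to 0.95, this grid is more tolerant of loose boxes, which may suit small, densely packed cells, although the text does not state the motivation.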


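Labeled | Method | ASC-US | LSIL | HSIL | AGC | Avg.
--- | --- | --- | --- | --- | --- | ---
25% | Faster R-CNN [15] | 7.2 | 23.0 | 11.8 | 20.7 | 15.7
25% | Ours | 7.7 | 26.4 | 11.3 | 22.6 | 17.0
50% | Faster R-CNN [15] | 10.1 | 31.3 | 16.8 | 16.1 | 18.6
50% | Ours | 9.9 | 31.9 | 18.0 | 18.2 | 19.5
75% | Faster R-CNN [15] | 12.2 | 38.1 | 24.8 | 22.5 | 24.4
75% | Ours | 11.1 | 40.0 | 23.0 | 27.3 | 25.4
100% | Faster R-CNN [15] | 12.8 | 40.9 | 24.0 | 24.0 | 25.4
100% | Ours | 13.1 | 42.0 | 25.1 | 27.7 | 27.0

Table 1: Evaluation of the proposed method with different numbers of labeled data. All values are mAP [%].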

3.2 Evaluation on Different Dataset Settings

In this experiment, we compare our proposed method with Faster R-CNN, a widely adopted fully supervised object detection model. We vary the amount of labeled data used to train our method while keeping the amount of unlabeled data the same across settings; Faster R-CNN is trained on the same labeled data only. The results are reported in Table 1. As a reference, Faster R-CNN [15] achieves an mAP of 15.7%, 18.6%, 24.4%, and 25.4% when provided with 25%, 50%, 75%, and 100% of the total labeled data, respectively. Our proposed method consistently improves on the baseline under all settings, with average mAP gains of 1.3, 0.9, 1.0, and 1.6 percentage points, respectively. Moreover, with 25% less labeled data, our method matches the average mAP of Faster R-CNN trained on all labeled data (row 6 vs. row 7, both 25.4%), which demonstrates that our method can effectively mine knowledge from unlabeled data and reduce the annotation effort.
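The gains quoted above follow directly from Table 1, and the arithmetic can be checked with a few lines. The sketch assumes the column order ASC-US, LSIL, HSIL, AGC, average, inferred from the per-class figures quoted in Sec. 3.3, and the labeled-data percentages given in the paragraph above; the dictionary layout is only for illustration.

```python
# (ASC-US, LSIL, HSIL, AGC, reported average mAP) keyed by percentage of labeled data.
TABLE1 = {
    25: {"baseline": (7.2, 23.0, 11.8, 20.7, 15.7), "ours": (7.7, 26.4, 11.3, 22.6, 17.0)},
    50: {"baseline": (10.1, 31.3, 16.8, 16.1, 18.6), "ours": (9.9, 31.9, 18.0, 18.2, 19.5)},
    75: {"baseline": (12.2, 38.1, 24.8, 22.5, 24.4), "ours": (11.1, 40.0, 23.0, 27.3, 25.4)},
    100: {"baseline": (12.8, 40.9, 24.0, 24.0, 25.4), "ours": (13.1, 42.0, 25.1, 27.7, 27.0)},
}

def class_mean(row):
    """Mean of the four per-class values in a table row."""
    return sum(row[:4]) / 4

# Gain in average mAP of the semi-supervised model over the baseline, per setting.
gains = {
    pct: round(rows["ours"][4] - rows["baseline"][4], 1)
    for pct, rows in TABLE1.items()
}

# The last column should match the mean of the four class columns up to rounding.
consistent = all(
    abs(class_mean(row) - row[4]) < 0.06
    for rows in TABLE1.values()
    for row in rows.values()
)
```

The consistency check also supports reading the final column as the unweighted mean over the four lesion classes.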

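Method | ASC-US | LSIL | HSIL | AGC | Avg.
--- | --- | --- | --- | --- | ---
Faster R-CNN [15] | 12.8 | 40.9 | 24.0 | 24.0 | 25.4
CSD [5] | 11.6 | 38.3 | 22.7 | 30.9 | 25.9
MT [20] | 11.5 | 42.7 | 23.4 | 27.3 | 26.2
Ours | 13.1 | 42.0 | 25.1 | 27.7 | 27.0

Table 2: Quantitative comparisons with state-of-the-art methods on the test set. All values are mAP [%].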

3.3 Comparison with Other Semi-supervised Methods

We compare our proposed method with the widely used Faster R-CNN [15] as well as two state-of-the-art semi-supervised object detection methods: 1) the consistency-based semi-supervised detection (CSD) [5] model, which enforces consistent predictions on both the classification and regression outputs for an image and its flipped version; and 2) Mean Teacher (MT) [20], which performs knowledge distillation on the classification outputs from a self-ensembled model. For a fair comparison, all the semi-supervised methods are implemented with the same backbone and trained with all labeled and unlabeled data.
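For readers unfamiliar with the two baselines, their core mechanics can be sketched generically. This is not the configuration used in the experiments: the smoothing factor `alpha` is an assumed value, parameters are represented as flat lists of floats, and `flip_box` only shows how a box is mapped under a horizontal flip so that predictions on an image and its flipped copy can be compared.

```python
def ema_update(teacher, student, alpha=0.99):
    """Mean Teacher style update: teacher weights track an exponential moving average."""
    return [alpha * t + (1 - alpha) * s for t, s in zip(teacher, student)]

def flip_box(box, image_width):
    """Map an (x1, y1, x2, y2) box to its position in the horizontally flipped image."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)

# A consistency loss in the CSD sense would compare a prediction on the original
# image with the un-flipped prediction made on the flipped image.
original_prediction = (100, 50, 300, 200)
prediction_on_flipped = flip_box(original_prediction, 1200)
recovered = flip_box(prediction_on_flipped, 1200)
```

Flipping twice returns the original box, which is what makes the two predictions directly comparable.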

Quantitative results are reported in Table 2. All the semi-supervised methods improve on the fully supervised baseline in average mAP, demonstrating their effectiveness in utilizing the unlabeled data. Compared with Faster R-CNN, the consistency-based method CSD [5] shows a significant improvement (+6.9%) on AGC, while its performance on the other categories (ASC-US, LSIL, HSIL) decreases. The knowledge distillation-based method MT [20] achieves improvements of 1.8% and 3.3% on LSIL and AGC, respectively, while its performance on ASC-US and HSIL decreases compared to the baseline model. Notably, our proposed method improves on the baseline in all four classes, achieving mAP of 13.1% on ASC-US, 42.0% on LSIL, 25.1% on HSIL, and 27.7% on AGC. Our method also achieves the best average mAP over all classes, demonstrating a more effective use of unlabeled data.
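The per-class changes discussed above can be recomputed from Table 2. The sketch assumes the column order ASC-US, LSIL, HSIL, AGC, as implied by the per-class figures quoted for our method; the variable names are illustrative.

```python
CLASSES = ("ASC-US", "LSIL", "HSIL", "AGC")

# Per-class mAP [%] from Table 2.
TABLE2 = {
    "Faster R-CNN": (12.8, 40.9, 24.0, 24.0),
    "CSD": (11.6, 38.3, 22.7, 30.9),
    "MT": (11.5, 42.7, 23.4, 27.3),
    "Ours": (13.1, 42.0, 25.1, 27.7),
}

baseline = TABLE2["Faster R-CNN"]

# Difference of each semi-supervised method from the baseline, per class.
deltas = {
    method: {c: round(v - b, 1) for c, v, b in zip(CLASSES, values, baseline)}
    for method, values in TABLE2.items()
    if method != "Faster R-CNN"
}

for method, per_class in deltas.items():
    print(method, per_class)
```

Only the row for our method has positive differences in every class; CSD and MT each trade losses on some classes for gains on others.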

Figure 2: Qualitative comparisons of semi-supervised cervical cancer cell detection on the testing set. Red rectangles stand for ground truths, green rectangles stand for true positives, and blue rectangles stand for false positives.

Qualitative results are illustrated in Fig. 2, which shows the bounding box predictions from all the compared methods. As can be observed from the visualization, our proposed method yields more accurate predictions of cervical cancer cells in the liquid-based Pap images than the other approaches.

Method | ASC-US | LSIL | HSIL | AGC | Avg.
--- | --- | --- | --- | --- | ---
Faster R-CNN [15] | 12.8 | 40.9 | 24.0 | 24.0 | 25.4
Ours (w/o prototype alignment) | 11.4 | 40.0 | 23.6 | 28.5 | 25.9
Ours (w/o proposal alignment) | 13.4 | 41.1 | 24.0 | 27.7 | 26.6
Ours | 13.1 | 42.0 | 25.1 | 27.7 | 27.0

Table 3: Quantitative analysis of different components on the test set. All values are mAP [%].

3.4 Ablation Study of the Proposed Method

We also conduct ablation studies to investigate the efficacy of the proposed components. As shown in Table 3, removing either level of alignment decreases the average mAP. On the other hand, each alignment loss on its own still improves the average mAP over the baseline Faster R-CNN. By combining both components, our complete model achieves the best average mAP. These findings demonstrate that the proposed proposal-level alignment and prototype-level alignment make complementary contributions to semi-supervised cervical cancer cell detection.

4 Conclusion

In this paper, we propose a novel semi-supervised deep metric learning framework with dual alignment for cervical cancer cell detection. The proposed method learns a metric space to conduct complementary alignment on both the proposal level and prototype level. Extensive experiments demonstrate the effectiveness and robustness of our method on the task of semi-supervised cervical cancer cell detection. Moreover, the proposed method is general and can be easily extended to other tasks of semi-supervised object detection.

Acknowledgement

This work was supported by Key-Area Research and Development Program of Guangdong Province, China (2020B010165004), Hong Kong Innovation and Technology Fund (Project No. ITS/311/18FP and Project No. ITS/426/17FP), and National Natural Science Foundation of China with Project No. U1813204.


References

  • [1] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §1.
  • [2] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: §1.
  • [3] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1.
  • [4] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.3.
  • [5] J. Jeong, S. Lee, J. Kim, and N. Kwak (2019) Consistency-based semi-supervised learning for object detection. Cited by: §2.3, §3.3, §3.3, Table 2.
  • [6] S. M. Laine and T. O. Aila (2018-April 12) Temporal ensembling for semi-supervised learning. Google Patents. Note: US Patent App. 15/721,433 Cited by: §2.3.
  • [7] H. Lin, H. Chen, S. Graham, Q. Dou, N. Rajpoot, and P. Heng (2019) Fast scannet: fast and dense analysis of multi-gigapixel whole-slide images for cancer metastasis detection. IEEE transactions on medical imaging 38 (8), pp. 1948–1958. Cited by: §1.
  • [8] H. Lin, H. Chen, X. Wang, Q. Wang, L. Wang, and P. Heng (2021) Dual-path network with synergistic grouping loss and evidence driven risk stratification for whole slide cervical image analysis. Medical Image Analysis, pp. 101955. Cited by: §1.
  • [9] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §2.3, §2.
  • [10] Y. Liu, C. Ma, Z. He, C. Kuo, K. Chen, P. Zhang, B. Wu, Z. Kira, and P. Vajda (2021) Unbiased teacher for semi-supervised object detection. arXiv preprint arXiv:2102.09480. Cited by: §1.
  • [11] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: §2.1.
  • [12] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Icml, Cited by: §2.1.
  • [13] R. Nayar and D. C. Wilbur (2015) The bethesda system for reporting cervical cytology: definitions, criteria, and explanatory notes. Springer. Cited by: §1.
  • [14] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §2.3.
  • [15] S. Ren, K. He, R. Girshick, and J. Sun (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39 (6), pp. 1137–1149. Cited by: §2.3, §2, §3.2, §3.3, Table 1, Table 2, Table 3.
  • [16] X. Shi, H. Su, F. Xing, Y. Liang, G. Qu, and L. Yang (2020) Graph temporal ensembling based semi-supervised convolutional neural network with noisy labels for histopathology image analysis. Medical image analysis 60, pp. 101624. Cited by: §1.
  • [17] X. Shi, F. Xing, K. Xu, Y. Xie, H. Su, and L. Yang (2017) Supervised graph hashing for histopathology image retrieval and classification. Medical image analysis 42, pp. 117–128. Cited by: §1.
  • [18] Y. Shi, X. Yu, K. Sohn, M. Chandraker, and A. K. Jain (2020) Towards universal representation learning for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6817–6826. Cited by: §2.2.
  • [19] Y. Song, L. Zhu, J. Qin, B. Lei, B. Sheng, and K. Choi (2019) Segmentation of overlapping cytoplasm in cervical smear images via adaptive shape priors extracted from contour fragments. IEEE transactions on medical imaging 38 (12), pp. 2849–2862. Cited by: §1.
  • [20] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. NeurIPS 30, pp. 1195–1204. Cited by: §2.3, §3.3, §3.3, Table 2.
  • [21] D. Tellez, M. Balkenhol, I. Otte-Höller, R. van de Loo, R. Vogels, P. Bult, C. Wauters, W. Vreuls, S. Mol, N. Karssemeijer, et al. (2018) Whole-slide mitosis detection in h&e breast histology using phh3 as a reference to train distilled stain-invariant convolutional networks. IEEE transactions on medical imaging 37 (9), pp. 2126–2136. Cited by: §1.
  • [22] D. Wang, Y. Zhang, K. Zhang, and L. Wang (2020) FocalMix: semi-supervised learning for 3d medical image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3951–3960. Cited by: §1.
  • [23] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5265–5274. Cited by: §1.
  • [24] T. Wang, T. Yang, J. Cao, and X. Zhang (2021) Co-mining: self-supervised learning for sparsely annotated object detection. In AAAI, Cited by: §1.
  • [25] X. Wang, H. Chen, C. Gan, H. Lin, Q. Dou, E. Tsougenis, Q. Huang, M. Cai, and P. Heng (2019) Weakly supervised deep learning for whole slide lung cancer image analysis. IEEE transactions on cybernetics 50 (9), pp. 3950–3962. Cited by: §1.
  • [26] Y. Wang, K. Zheng, C. Chang, X. Zhou, Z. Zheng, L. Huang, J. Xiao, L. Lu, C. Liao, and S. Miao (2021) Knowledge distillation with adaptive asymmetric label sharpening for semi-supervised fracture detection in chest x-rays. In IPMI, Cited by: §1.
  • [27] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pp. 499–515. Cited by: §1.
  • [28] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §2.2.
  • [29] Y. Xie, H. Lu, J. Zhang, C. Shen, and Y. Xia (2019) Deep segmentation-emendation model for gland instance segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 469–477. Cited by: §1.
  • [30] Y. Xie, J. Zhang, Z. Liao, J. Verjans, C. Shen, and Y. Xia (2020) Pairwise relation learning for semi-supervised gland segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 417–427. Cited by: §1.
  • [31] M. Xu, H. Wang, B. Ni, Q. Tian, and W. Zhang (2020) Cross-domain detection via graph-induced prototype alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12355–12364. Cited by: §2.2.
  • [32] Y. Zhou, H. Chen, H. Lin, and P. Heng (2020) Deep semi-supervised knowledge distillation for overlapping cervical cell instance segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 521–531. Cited by: §1.