[MICCAI2020] Code for paper : Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation
Deep learning methods show promising results for overlapping cervical cell instance segmentation. However, training a model with good generalization ability demands voluminous pixel-level annotations, which are expensive and time-consuming to acquire. In this paper, we propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation. We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining (MMT-PSM), which consists of a teacher and a student network during training. The two networks are encouraged to be consistent at both the feature and semantic levels under small perturbations. The teacher's self-ensemble predictions from K-time augmented samples are used to construct reliable pseudo-labels for optimizing the student. We design a novel strategy to estimate the sensitivity to perturbations for each proposal and select informative samples from massive cases to facilitate fast and effective semantic distillation. In addition, to eliminate the unavoidable noise from the background region, we propose to use the predicted segmentation mask as guidance to enforce the feature distillation in the foreground region. Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only, and outperforms state-of-the-art semi-supervised methods.
The Pap smear test is the recommended procedure for early cervical cancer screening worldwide. By estimating the cell type and cytological features, e.g., nuclei size, nuclear-cytoplasmic ratio and multi-nucleation, it provides clear guidance for clinical management and further treatment. Automatic cervical cell segmentation can free doctors from time-consuming work and reduce the intra-/inter-observer variability [10, 15, 23, 32]. Specifically, Deep Learning (DL) methods show promising results for cell nuclei segmentation [1, 19, 32]. However, optimizing DL methods relies heavily on large amounts of data with expensive dense annotations by experts, which limits the accuracy and generalization ability the model can reach. Since unlabeled data is easily accessible, how to leverage both limited labeled data and large amounts of unlabeled data to further improve performance has attracted researchers' attention in medical image analysis.
Several works have addressed Semi-Supervised Learning (SSL) for classification and segmentation in the medical imaging community [25, 16, 2, 3, 17, 30, 8, 24]. Bai et al. proposed a self-training strategy that alternately assigns labels to unlabeled data and optimizes the model parameters. Nie et al. introduced an adversarial training strategy that selects informative regions in unlabeled data to train the segmentation network. Shi et al. created more reliable ensemble targets for feature and label predictions via a graph, encouraging features mapped to the same cluster to be more compact. Knowledge distillation, which was first used in model compression by encouraging a small model to mimic the behavior of a deeper model, has demonstrated excellent improvements mostly in classification setups [20, 6, 26] and has shown potential benefits for semi-supervised learning and domain adaptation. Chen et al. extended it to the detection scenario with a proposal-based method and learned a compact detector by distilling from both features and predictions. However, directly using entire feature maps inevitably introduces noise from the background. To eliminate this background noise, Wang et al. conducted feature distillation within the region close to objects based on prior knowledge. Other approaches [13, 5] added region-based or relation-based consistency regularization. Although these methods achieve promising progress, they do not consider how informative each sample is, which is one of the bottlenecks for further improving the performance. In medical imaging, researchers have attempted to apply knowledge distillation to segmentation problems. Wang et al. employed a teacher-student network for 3D optical microscope images via knowledge distillation. Another approach introduced uncertainty estimation into knowledge distillation for 3D left atrium segmentation.
Instance segmentation, however, is a more challenging task that requires an additional detection step to distinguish the individual instances. The potential of knowledge distillation has not been well explored for it.
In this paper, we propose a novel deep semi-supervised knowledge distillation framework called Mask-guided Mean Teacher with Perturbation-sensitive Sample Mining (MMT-PSM) for overlapping cervical cell instance segmentation, which conducts both semantic and feature distillation. The proposed end-to-end trainable framework consists of a teacher model and a student model with the same backbone. Given a sample under different small perturbations, the proposed method encourages the predictions of the two networks to be consistent. The mean prediction of the $K$-time augmented samples from the teacher network is used as the pseudo-label to supervise the student network. A perturbation-sensitive sample mining strategy is used to avoid the meaningless guidance from easy cases in unbalanced and massive data. Furthermore, we propose mask-guided feature distillation, which encourages feature consistency only in the foreground region to alleviate the side effects of the noisy background. We perform a comprehensive evaluation on the cervical cell segmentation task. Results indicate that the proposed algorithm significantly improves the instance segmentation accuracy, consistently across different amounts of labeled data, and also outperforms other state-of-the-art semi-supervised methods.
Formally, let $D_L$ denote the labeled set and $D_U$ denote the unlabeled set. The goal of semi-supervised learning is to improve the performance by leveraging the hidden information in $D_U$. In this work, we adopt Mask R-CNN as the instance segmentation model for both the student and the teacher, which consists of four modules: 1) a shared Feature Pyramid Network (FPN) that extracts features as inputs for the other modules, 2) a Region Proposal Network (RPN) equipped with an RoI Align layer to generate the object proposals, 3) a detection branch (Det) and 4) a segmentation branch (Seg), which take features and proposals as inputs and predict the detection scores, the spatial revision vectors and the segmentation results, respectively. We use the Mean Teacher (MT) algorithm as our basic framework, which consists of a teacher and a student model sharing the same architecture and encourages their predictions to be consistent under small perturbations. Instead of optimizing the teacher by SGD, an exponential moving average (EMA) of the student's weights is used to form a better teacher model: $\theta'_t = \alpha \theta'_{t-1} + (1 - \alpha)\theta_t$, where $\theta'_t$ and $\theta_t$ are the teacher's and student's weights at step $t$, and $\alpha$ controls the updating speed.
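The EMA teacher update can be illustrated with a minimal, dependency-free sketch. The dictionary-of-floats weight format, the function name `ema_update` and the value `alpha=0.9` are assumptions made for illustration only; a real model would apply the same rule to every parameter tensor.

```python
def ema_update(teacher, student, alpha):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


# Toy weights stored as plain floats (hypothetical values).
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 3.0, "b": 2.0}

# One EMA step moves the teacher a small distance toward the student.
teacher = ema_update(teacher, student, alpha=0.9)
```

A larger `alpha` makes the teacher change more slowly, which is why the coefficient controls the updating speed.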
One difficulty in applying MT to instance segmentation is the sample imbalance among proposals. Directly computing the loss on all predictions is not effective, because most samples lie in background regions and can be easily distinguished, which overwhelms the useful information. We propose to use the mean predictions of $K$-time augmented samples as more reliable targets from the teacher and to select samples based on their sensitivity to perturbations.
Self-ensembling pseudo-label. Specifically, for each image in the unlabeled set $D_U$, a stochastic Augmentor (A) is used to generate $K$ augmented samples for the teacher and augmented samples for the student. To acquire the same candidates for the loss calculation between networks, the proposals generated by the teacher's RPN are shared by both the teacher and the student. Self-ensemble predictions from a collection of augmented data have been considered a more reliable target in classification. Here, we calculate the average prediction across the $K$ augmented samples to generate the soft pseudo-label in the teacher network:
$$\bar{p} = \frac{1}{K}\sum_{k=1}^{K} f_T^{cls}(\hat{x}_k, r),$$
where $\hat{x}_k$ is the $k$-th augmented sample, $r$ is a proposal, and $f_T^{cls}$ denotes the classification sub-branch in the teacher Det. A sharpening function, $\mathrm{Sharpen}(\bar{p}, T)_i = \bar{p}_i^{1/T} / \sum_{j=1}^{C} \bar{p}_j^{1/T}$, is further used to implicitly achieve entropy minimization, in which $C$ denotes the number of categories and $T$ is the temperature. The temperature is held fixed in our study. See the supplementary material for augmentation details and an ablation study of the temperature.
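As a rough illustration of the self-ensembling step, the sketch below averages several teacher predictions for one proposal and then sharpens the result with a temperature. The temperature-power form of the sharpening function, the temperature value `0.5`, the toy probabilities and the function names are all assumptions for illustration, not values taken from the paper.

```python
def mean_prediction(predictions):
    """Average K class-probability vectors element-wise."""
    k = len(predictions)
    return [sum(column) / k for column in zip(*predictions)]


def sharpen(probs, temperature):
    """Raise each probability to 1/temperature and renormalize to sum to 1."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# K = 3 teacher predictions for one proposal over 3 classes (toy numbers).
teacher_preds = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]

averaged = mean_prediction(teacher_preds)
# A temperature below 1 (assumed 0.5 here) pushes mass toward the top class.
soft_label = sharpen(averaged, temperature=0.5)
```

With a temperature below 1, the largest class probability grows and the distribution's entropy drops, which is the sense in which sharpening performs implicit entropy minimization.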
Perturbation-sensitive sample mining.
We hypothesize that perturbation-sensitive samples, which have larger prediction accuracy gaps between teacher and student, are more informative and beneficial for training. First, the class with the maximum categorical probability in the self-ensembling prediction is assigned as the hard pseudo-label of each sample. Then we calculate the variance among the $K$ augmented samples as its degree of perturbation sensitivity.
All samples whose hard pseudo-labels are foreground classes are retained. Meanwhile, background samples are sorted in descending order of variance and the Top-$N$ are kept, where $N$ is the number of foreground samples. The perturbation-sensitive sample mining loss is calculated on the selected samples as the class-weighted cross-entropy between the predictions of the classification sub-branch in the student Det, evaluated on the proposals of the retained perturbation-sensitive samples, and the corresponding soft pseudo-labels. The class-balanced weight is set empirically for the background and to 1 for the other classes.
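The selection rule and the weighted loss can be sketched as follows. This is only an illustration under stated assumptions: sensitivity is taken to be the variance of the hard-label class probability across the augmented teacher predictions, class 0 is taken to be background, the background weight `BACKGROUND_WEIGHT = 0.5` is a hypothetical placeholder, the soft labels are left unsharpened for brevity, and the loss is averaged over the selected samples.

```python
import math


def select_sensitive(preds_per_proposal, background=0):
    """Keep all foreground proposals plus the most sensitive background ones.

    preds_per_proposal[i] holds K teacher probability vectors for proposal i.
    """
    scored = []
    for idx, preds in enumerate(preds_per_proposal):
        k = len(preds)
        mean = [sum(column) / k for column in zip(*preds)]
        hard = max(range(len(mean)), key=mean.__getitem__)
        # Assumed sensitivity: variance of the hard-label class probability.
        variance = sum((p[hard] - mean[hard]) ** 2 for p in preds) / k
        scored.append({"index": idx, "hard": hard, "variance": variance, "soft": mean})
    foreground = [s for s in scored if s["hard"] != background]
    background_sorted = sorted(
        (s for s in scored if s["hard"] == background),
        key=lambda s: s["variance"],
        reverse=True,
    )
    # Keep as many background samples as there are foreground samples.
    return foreground + background_sorted[: len(foreground)]


def weighted_soft_cross_entropy(student_probs, soft_label, weight):
    """Cross-entropy against a soft target, scaled by a class weight."""
    return -weight * sum(
        t * math.log(max(p, 1e-12)) for p, t in zip(student_probs, soft_label)
    )


# Four proposals, K = 2 teacher predictions each, classes (background, 1, 2).
teacher_preds = [
    [[0.90, 0.05, 0.05], [0.90, 0.05, 0.05]],  # stable background
    [[0.20, 0.70, 0.10], [0.10, 0.80, 0.10]],  # foreground
    [[0.80, 0.10, 0.10], [0.50, 0.40, 0.10]],  # unstable background
    [[0.70, 0.20, 0.10], [0.60, 0.30, 0.10]],  # mildly unstable background
]
student_preds = [
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
    [0.6, 0.3, 0.1],
    [0.7, 0.2, 0.1],
]
BACKGROUND_WEIGHT = 0.5  # hypothetical value, not from the paper

selected = select_sensitive(teacher_preds)
losses = [
    weighted_soft_cross_entropy(
        student_preds[s["index"]],
        s["soft"],
        BACKGROUND_WEIGHT if s["hard"] == 0 else 1.0,
    )
    for s in selected
]
psm_loss = sum(losses) / len(losses)
```

In this toy case the single foreground proposal is kept together with the one background proposal whose predictions vary most, while the easy, stable background proposals are discarded.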
Studies show that intermediate representations from the teacher can also improve the training process and the final performance of the student in the classification task. However, directly minimizing the difference over entire feature maps could harm the performance, since it would introduce noise from the background region. Therefore, we force the student to mimic the teacher only under the guidance of the semantic segmentation results.
First, an adaptation layer is added after each output stage of the FPN, which has been shown to be advantageous for feature distillation. Here we use a convolution as the adaptation layer and reduce the input feature dimension by half. Then the instance masks and bounding-box locations from the teacher are used to generate binary semantic masks. Let $F^{S}_{l,c,(i,j)}$ and $F^{T}_{l,c,(i,j)}$ denote the student's and the teacher's feature values in the $c$-th channel at location $(i,j)$ from the adaptation layer after the FPN's $l$-th stage, and let $M_{l,(i,j)}$ denote the corresponding semantic mask. We encourage consistency by minimizing the feature distance between $F^{S}$ and $F^{T}$ through the mask-guided distillation loss, which is accumulated only at locations where $M_{l,(i,j)}$ marks foreground.
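A mask-guided feature distance for a single FPN stage might look like the sketch below. The squared-difference distance, the normalization by the number of foreground pixels, the nested-list feature layout and the function name are assumptions made for illustration; the exact form used in the paper is not recoverable from this text.

```python
def masked_feature_loss(student, teacher, mask):
    """Squared feature distance accumulated only where mask == 1.

    student, teacher: nested lists shaped [channels][height][width].
    mask: nested list shaped [height][width] holding 0 or 1.
    """
    total = 0.0
    for s_channel, t_channel in zip(student, teacher):
        for s_row, t_row, m_row in zip(s_channel, t_channel, mask):
            for s, t, m in zip(s_row, t_row, m_row):
                total += m * (s - t) ** 2
    # Assumed normalization: divide by the number of foreground pixels.
    foreground_pixels = sum(sum(row) for row in mask)
    return total / max(foreground_pixels, 1)


# One channel, 2 x 2 feature maps (toy numbers).
student_feat = [[[1.0, 2.0], [3.0, 4.0]]]
teacher_feat = [[[1.0, 0.0], [0.0, 4.0]]]

foreground_only = [[0, 1], [0, 0]]
everywhere = [[1, 1], [1, 1]]

loss_masked = masked_feature_loss(student_feat, teacher_feat, foreground_only)
loss_full = masked_feature_loss(student_feat, teacher_feat, everywhere)
```

Comparing `loss_masked` with `loss_full` shows how the mask removes the contribution of background locations, such as the large mismatch at the bottom-left pixel in this example.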
Total loss for optimization. The total loss is a weighted sum of the supervised loss and the two distillation losses described above. It uses a balancing weight, which we set to 5, and a piecewise weight function that guarantees the loss is dominated by the supervised loss at the beginning, gradually increases during training, and declines slowly at the end.
We use IR-Net as our base model, which utilizes instance relations on Mask R-CNN. The Augmentor (A) consists of both color and location transformations. Specifically, the brightness, contrast and hue of each sample are first randomly adjusted, and random erasing is then applied. After that, half of the samples are flipped. For the first 1000 iterations, only the supervised loss is used. The teacher model is initialized by copying the parameters of the student at the 990th iteration, which prevents the framework from degenerating because of a poor teacher. We schedule the EMA coefficient so that the teacher has a larger update rate at the beginning, when the student improves quickly. During training, each mini-batch includes both labeled and unlabeled images at a fixed ratio. Sigmoid-shaped functions of the current iteration relative to the total number of iterations are used for the piecewise weight function. PyTorch is adopted to implement our framework. The learning rate is initialized to 1e-2 and decayed to 1e-3 and 1e-4 after 5000 and 7000 iterations, respectively. We adopt the SGD algorithm to optimize the network, and one Titan XP GPU is used for training. The pseudo code of the proposed MMT-PSM can be found in the supplementary material.
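The step-wise learning-rate schedule described above can be written as a small helper. The function name and the treatment of the boundary iterations (whether iteration 5000 itself already uses the decayed rate) are assumptions.

```python
def learning_rate(iteration):
    """Step schedule: 1e-2, then 1e-3 after 5000 and 1e-4 after 7000 iterations."""
    if iteration < 5000:
        return 1e-2
    if iteration < 7000:
        return 1e-3
    return 1e-4


# Sample the schedule at a few iterations.
schedule = {it: learning_rate(it) for it in (0, 4999, 5000, 6999, 7000)}
```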
Dataset and evaluation metrics. The liquid-based Pap test specimens were collected from 82 patients. They are used as the labeled dataset, with 4439 cytoplasm and 4789 nuclei annotations in total. The dataset is divided at the patient level into training, validation and test sets. Images in the training set are cropped with an overlapping ratio of 0.75, while images in the validation and test sets are cropped without overlap. In total, the numbers of images for training, validation and testing are 961, 50 and 98, respectively. Apart from that, 4371 images from other patients are randomly cropped from whole slide images as the unlabeled dataset.
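To make the cropping scheme concrete, the sketch below computes crop origins along one image axis for a given overlapping ratio. The image length `1024` and crop size `512` are hypothetical placeholders, since the actual sizes are not given in this text, and appending a final crop to cover the border is also an assumption.

```python
def crop_origins(length, crop, overlap):
    """Start positions of crops along one axis for a given overlap ratio."""
    stride = max(1, int(round(crop * (1.0 - overlap))))
    last = max(length - crop, 0)
    origins = list(range(0, last + 1, stride))
    if origins[-1] != last:
        origins.append(last)  # assumed: add one more crop to reach the border
    return origins


# Hypothetical sizes: a 1024-pixel axis cut into 512-pixel crops.
train_origins = crop_origins(1024, 512, overlap=0.75)  # overlapping, training
test_origins = crop_origins(1024, 512, overlap=0.0)    # non-overlapping
```

An overlapping ratio of 0.75 gives a stride of one quarter of the crop size, so the training set contains many more crops per image than the non-overlapping validation and test sets.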
We use Average Jaccard Index (AJI) and mean Average Precision (mAP) for quantitative evaluation. Results are calculated on cytoplasm (Cyto.), nuclei (Nuc.) and the average (Avg.). AJI is commonly used in cell nuclei segmentation tasks and measures the ratio of the aggregated intersection to the aggregated union over all the predictions and ground truths in the image. mAP is the mean of the average precision under different IoU thresholds and is widely used in general detection and instance segmentation tasks.
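The aggregated intersection-over-union idea behind AJI can be sketched as follows. This is an illustrative implementation that assumes the commonly used matching rule (each ground-truth instance is paired with the prediction of highest IoU, and unmatched predictions are added to the union); instances are represented as sets of pixel coordinates, so overlapping cells can share pixels. It is not the evaluation code used in the paper.

```python
def aggregated_jaccard_index(gt_instances, pred_instances):
    """AJI for instances given as sets of (row, col) pixel coordinates."""
    used = set()
    intersection_sum = 0
    union_sum = 0
    for gt in gt_instances:
        best_iou, best_idx = 0.0, None
        for idx, pred in enumerate(pred_instances):
            inter = len(gt & pred)
            if inter == 0:
                continue
            iou = inter / len(gt | pred)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is None:
            # A missed ground-truth instance only enlarges the union.
            union_sum += len(gt)
        else:
            matched = pred_instances[best_idx]
            intersection_sum += len(gt & matched)
            union_sum += len(gt | matched)
            used.add(best_idx)
    # Predictions never matched to any ground truth count as false positives.
    for idx, pred in enumerate(pred_instances):
        if idx not in used:
            union_sum += len(pred)
    return intersection_sum / union_sum if union_sum else 0.0


ground_truth = [{(0, 0), (0, 1), (1, 0), (1, 1)}]
predictions = [
    {(0, 0), (0, 1), (1, 0)},  # covers 3 of the 4 ground-truth pixels
    {(5, 5)},                  # false positive
]
aji = aggregated_jaccard_index(ground_truth, predictions)
```

Here the matched pair contributes an intersection of 3 and a union of 4, and the false positive adds 1 more pixel to the union, so both under-segmentation and spurious detections lower the score.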
Evaluation on different dataset settings. First, we evaluate the impact of leveraging unlabeled images with our proposed model under different amounts of labeled samples. Our proposed method (MMT-PSM) is compared with the state-of-the-art fully supervised method, IR-Net, which utilizes instance relations for mask refinement and duplicate removal based on the Mask R-CNN structure. We evaluate the performance of the proposed MMT-PSM with the number of labeled images varying from 96 to 961, together with 4371 unlabeled images. IR-Net is trained on the same labeled data only. As shown in Table 1, the proposed MMT-PSM achieves relatively consistent improvements on both metrics. It improves both average AJI and average mAP over the models trained only on the same number of labeled data, which demonstrates the effectiveness of the proposed SSL method.
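| Method | AJI Cyto. | AJI Nuc. | AJI Avg. | mAP Cyto. | mAP Nuc. | mAP Avg. |
|---|---|---|---|---|---|---|
| MMT-PSM (w/o ) | 75.01 | 59.23 | 67.12 | 45.58 | 33.16 | 39.37 |
| MMT-PSM (w/o ) | 74.38 | 59.46 | 66.92 | 44.39 | 33.75 | 39.07 |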
Comparison with other semi-supervised methods. We implement and adapt several state-of-the-art methods for comparison: 1) Chen et al. improved object detection by knowledge distillation (ODKD) with a weighted cross-entropy loss for the imbalanced data problem; feature imitation is conducted in all regions. 2) Wang et al. proposed fine-grained feature imitation (FFI), which first estimates the object anchor locations and then makes the student's features close to the teacher's in the selected regions. Note that we use the same network backbone for these methods, with the same labeled and unlabeled data, for a fair comparison. As can be seen in Table 2, all the SSL methods outperform the supervised method on most of the evaluation indicators. Compared with the fully supervised method, ODKD improves mAP but decreases AJI. The reason is that it penalizes the classification and feature discrepancy in all regions, which inevitably introduces noise. FFI selects the proposals close to objects for feature distillation and hence achieves better results. Furthermore, the proposed MMT-PSM achieves the best performance among the state-of-the-art SSL methods, illustrating that our method is able to distill information in both the feature space and the semantic predictions.
Ablation study of the proposed method. We also conduct an ablation study on the impact of the proposed components: 1) MMT-PSM without the mask-guided feature distillation, and 2) MMT-PSM without the perturbation-sensitive sample mining for knowledge distillation. Results are shown in Table 3. Utilizing perturbation-sensitive samples measured in the teacher network as the pseudo-labels for optimizing the student improves both average AJI and mAP. Meanwhile, forcing the features of the teacher and the student to be consistent in the foreground region also increases average AJI and mAP. Lastly, combining the two components in our mean teacher framework achieves competitive performance in both AJI and mAP.
Qualitative evaluation. We also visualize the results of different methods on challenging cases, including heavy occlusion of cytoplasm and blurred regions. In Fig. 2, each closed curve denotes an individual instance. Compared with other methods, our proposed MMT-PSM is better able to recognize translucent cervical cells in low-contrast areas.
In this paper, we propose a novel mask-guided mean teacher framework with perturbation-sensitive sample mining, which conducts knowledge distillation for semi-supervised cervical cell instance segmentation. The proposed method encourages the network to output consistent feature maps and predictions under small perturbations. Only samples with a high degree of perturbation sensitivity are selected for semantic distillation, which prevents meaningless guidance from easy background cases. In addition, the segmentation mask is used as guidance for better feature distillation. Experiments demonstrate that our proposed method effectively leverages the unlabeled data and outperforms other SSL methods. Our proposed MMT-PSM framework is general and can be easily adapted to other semi-supervised medical image instance segmentation tasks.
The work described in the paper was supported in part by the following grants: Key-Area Research and Development Program of Guangdong Province, China (2020B010165004), Hong Kong Innovation and Technology Fund (Project No. ITS/041/16), National Natural Science Foundation of China (Project No. U1813204) and Shenzhen Science and Technology Program (JCYJ20170413162
NeurIPS Workshop on Machine Learning for Healthcare.
Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Graph temporal ensembling based semi-supervised convolutional neural network with noisy labels for histopathology image analysis. Medical Image Anal. 60, pp. 101624.
Segmenting neuronal structure in 3d optical microscope images via knowledge distillation with teacher-student network. In ISBI, pp. 228–231.
Towards a new generation of artificial intelligence in china. Nature Machine Intelligence 2, pp. 312–316.