Generalizing Nucleus Recognition Model in Multi-source Images via Pruning

by Jiatong Cai, et al.

Ki67 is a significant biomarker in the diagnosis and prognosis of cancer, whose index can be evaluated by quantifying its expression in Ki67 immunohistochemistry (IHC) stained images. However, quantitative analysis of multi-source Ki67 images remains a challenging task in practice due to cross-domain distribution differences, which result from imaging variation, staining styles, and lesion types. Many recent studies have addressed domain generalization (DG), yet noteworthy limitations remain. Specifically, in the case of Ki67 images, learning invariant representations is hampered by the insufficient number of domains and by cell categories that do not match across domains. In this paper, we propose a novel method to improve DG by searching for the domain-agnostic subnetwork in a domain merging scenario. Model parameters are iteratively pruned according to the domain gap that arises when the training data change from a single domain to merged domains. In addition, the model is optimized by fine-tuning on merged domains to eliminate the interference of class mismatch among domains. Furthermore, an appropriate implementation is found by applying the pruning method to different parts of the framework. Compared with known DG methods, our method yields excellent performance in multiclass nucleus recognition of Ki67 IHC images, especially in cases with missing categories. Moreover, our method is also evaluated on a public dataset, where it achieves competitive results against state-of-the-art DG methods.


1 Introduction

In the diagnosis and prognosis of cancer, Ki67 is a significant biomarker [reis2020ki67, yang2018ki67, yerushalmi2010ki67] that can evaluate tumor cell proliferation and growth by quantifying its expression in Ki67 immunohistochemistry stained images [kloppel2018ki67, volynskaya2019ki67]. With the increase of data volume, deep learning methods have provided strong advantages in nucleus recognition [xie2018efficient, xing2019pixel]. However, the lack of standardized multicenter data could bring heterogeneous Ki67 interpretations from different stainings, scanners and environments [focke2017interlaboratory, rimm2019international]. Applying methods trained on a single domain can result in biased predictions. In practice, a robust model trained on multi-source data is commonly required to make reliable predictions on unseen domains. Empirical Risk Minimization (ERM) [vapnik1999overview] is a convenient solution; however, a gap still exists due to the complexity of the merged domains.

Methods of cross-domain training can be broadly divided into domain adaptation (DA) and domain generalization (DG). Specifically, DA aims to solve the covariate shift between the source domain and the target domain. A popular representation learning approach, domain-adversarial neural networks (DANN), sets up a domain discriminator so that the learned features become indistinguishable across domains [ganin2016domain]. In histopathological image analysis, DANN was applied to organ-level, slide-level [lafarge2019learning] and patient-level domains [hashimoto2020multi]. Similarly, adversarial discriminative domain adaptation (ADDA) [tzeng2017adversarial] replaces the weight-shared backbone with exclusive domain-specific encoders, and was introduced into the quality assessment of fundus images [shen2020domain]. Besides, some studies implemented unsupervised DA by utilizing generative adversarial networks in liver segmentation [yang2019unsupervised], microscopy image quantification [xing2019adversarial] and diabetic retinopathy detection [yang2020residual]. Yet, since the model is trained only on the source and target domains, its performance on unseen domains is flawed.

To achieve desirable generalization performance, data augmentation is an intuitive strategy. For instance, adversarial samples are generated by adaptive data augmentation and deep-stacked transformation in [volpi2018generalizing] and [zhang2020generalizing], respectively. Nevertheless, model convergence could be affected by unrealistic generated data. Instead, some researchers have studied DG by learning domain-agnostic representations. Albuquerque et al. improve generalization by minimizing pairwise discrepancies between domains [albuquerque2019generalizing]. Sikaroudi et al. utilize meta-learning to train a magnification-agnostic histopathology image classification model [sikaroudi2021magnification]. Mahajan et al. give a causal interpretation of DG and generalize based on invariant representations [mahajan2020domain]. The works above achieved promising results on data from unseen domains, yet some non-negligible limitations remain. First, DANN-based methods do not function well with limited data in the source domain. Moreover, some methods are not applicable when certain categories do not exist across all training domains.

Based on this, we present a novel DG approach for nucleus recognition in Ki67 immunohistochemistry stained images. The proposed method transforms DG into a domain-agnostic subnetwork searching problem. Invariant representations are learned by iteratively pruning domain-specific model parameters under a domain merging scenario. Moreover, the pruned model is fine-tuned on merged domains, which solves the class mismatch problem to some extent. Furthermore, an appropriate implementation is found by applying our pruning method to different modules of the nucleus recognition framework. Compared with state-of-the-art (SOTA) methods on the Ki67 IHC images and a public dataset, our method achieves competitive performance on unseen domains.

2 Methods

2.1 Overview

2.1.1 Motivation.

The lottery ticket hypothesis [frankle2018lottery] claims that only a subset of parameters accounts for model performance in an over-parameterized neural network. The network can therefore be compressed into a subnetwork by iteratively pruning trivial parameters without degrading performance. Under some circumstances, the model may not generalize well to unseen domains because it overfits domain-specific representations. Consequently, DG can be considered a domain-agnostic subnetwork searching problem, which keeps domain-agnostic parameters while pruning domain-specific parameters.

2.1.2 Workflow.

The proposed prune-based DG framework is illustrated in Fig. 1. Initially, parameters are obtained from the converged model (Model-S) trained on a single domain. When training on merged domains, model gradients are collected and sorted globally by magnitude after loss backpropagation. Subsequently, a subset of these parameters is obtained by pruning the weights with the largest gradient magnitudes and resetting the remaining parameters rather than updating them, repeated over n iterations. Finally, the pruned model (Model-M) is fine-tuned on merged domains.

Figure 1: A workflow of prune-based generalization. We use model-S and model-M to represent the model converged on a single domain and model pruned on merged domains.

2.2 Pruning-based Algorithm

Our method aims to filter out a fraction of parameters such that the remaining parameters preserve the accuracy on the training domains while performing well on the unseen domain(s). To achieve this goal, we construct a domain merging scenario. That is, the model is first trained on a single source domain. After it has converged, we replace the training data with merged domains, which contain the source domain and domain(s) never seen before. We denote the latter as invasion domain(s). Under this scenario, the target problem corresponds to the optimization as follows.


where and are samples from the source domain and invasion domains respectively, is the expected , and denotes the parameters trained on . The cost was computed on samples of under parameters . Generally, of pruned model is larger than . The Eq. 1 can be simplified as follows:


represents a batch of merged samples that consist of and . Subsequently, the second term of Eq. 2 has no contribution to the selection of thus can be removed, and the optimization of can be defined as Eq. 3,


The above subnetwork searching problem can be solved by exploring the contribution of each parameter to . More concretely, while trained on , model parameters are updated through loss backpropagation according to the gradient descent. In this procedure, gradients are calculated as derivatives of the cost with respect to as . Correspondingly, the gradient magnitude of the individual parameter can be regarded as its contribution to . Intuitively, model convergence depends on the updating of the parameters. Parameters requiring a large variation to fit on merged domains can be considered highly domain-related. The expected remaining parameters can be approximated by preserving parameters with a relatively small as Eq. 4,


where is a scalar related to remaining percentage of parameters.
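The optimization outlined in this subsection can be written out as below. This is only a sketch under assumed notation, since none of these symbols is taken from the original equations: $x_S$ and $x_I$ stand for samples from the source and invasion domains, $x_M$ for a merged batch, $\theta_S$ for the parameters trained on the source domain, $\theta_M$ for the expected subnetwork, $\mathcal{L}$ for the cost, $g_i$ for the gradient of an individual parameter $\theta_i$, and $k$ for the threshold scalar.

```latex
\begin{align}
\theta_M &= \arg\min_{\theta \subseteq \theta_S}
  \bigl|\mathcal{L}(x_S;\theta) - \mathcal{L}(x_S;\theta_S)\bigr|
  + \bigl|\mathcal{L}(x_I;\theta) - \mathcal{L}(x_I;\theta_S)\bigr| \tag{1} \\
\theta_M &= \arg\min_{\theta \subseteq \theta_S}
  \mathcal{L}(x_M;\theta) - \mathcal{L}(x_M;\theta_S) \tag{2} \\
\theta_M &= \arg\min_{\theta \subseteq \theta_S} \mathcal{L}(x_M;\theta) \tag{3} \\
\theta_M &\approx \bigl\{\theta_i \in \theta_S : |g_i| \le k \bigr\},
  \qquad g_i = \frac{\partial \mathcal{L}(x_M;\theta_S)}{\partial \theta_i} \tag{4}
\end{align}
```

In this reading, the step from (1) to (2) relies on the cost of a pruned model generally being larger than that of the unpruned one, so the absolute values can be dropped, and it treats the merged-batch cost as the sum of the two domain costs. The step from (2) to (3) drops the second term because it does not depend on the selection of the subnetwork.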

Generally, a rather small pruning rate is used, since a large pruning rate would degrade performance. Moreover, the same setting as [frankle2018lottery] is adopted: training and pruning are iterated, followed by resetting the remaining parameters, over n steps, with each step pruning a fraction of the surviving parameters. A few iterations of model pruning are usually enough, so only a small amount of merged data is needed in the pruning procedure. Although the pruned model suffers degraded performance, it gains the potential to learn domain-invariant representations. Finally, the surviving parameters are further fine-tuned on merged domains to recover accuracy.
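The iterative procedure described above can be sketched on a toy problem. The following is a minimal illustration rather than the authors' implementation: it assumes a linear model with a mean-squared-error cost in place of the nucleus recognition network, and the names `prune_by_gradient`, `fine_tune`, `rate` and `steps`, as well as the synthetic data, are hypothetical.

```python
import numpy as np


def mse_grad(theta, x, y):
    """Gradient of the mean squared error of the linear model x @ theta."""
    return 2.0 * x.T @ (x @ theta - y) / len(y)


def prune_by_gradient(theta_s, x_m, y_m, rate=0.2, steps=4):
    """Return a 0/1 mask keeping parameters with small gradients on merged data."""
    mask = np.ones_like(theta_s)
    for _ in range(steps):
        # Reset: survivors start again from their single-domain values.
        theta = theta_s * mask
        grad = np.abs(mse_grad(theta, x_m, y_m))
        alive = np.flatnonzero(mask)
        n_prune = int(round(rate * alive.size))
        if n_prune == 0:
            break
        # Prune the survivors with the largest gradient magnitude.
        order = np.argsort(grad[alive])
        mask[alive[order[-n_prune:]]] = 0.0
    return mask


def fine_tune(theta_s, mask, x_m, y_m, lr=0.05, epochs=200):
    """Gradient descent on merged data that updates surviving parameters only."""
    theta = theta_s * mask
    for _ in range(epochs):
        theta = theta - lr * mse_grad(theta, x_m, y_m) * mask
    return theta


# Toy data (assumed): the source domain uses four extra weights that the
# invasion domain does not share, and the invasion inputs are shifted.
rng = np.random.default_rng(0)
w_shared = np.concatenate([rng.normal(size=6), np.zeros(4)])
x_s = rng.normal(size=(200, 10))
y_s = x_s @ (w_shared + np.r_[np.zeros(6), np.ones(4)])
x_i = rng.normal(loc=0.5, size=(200, 10))
y_i = x_i @ w_shared

theta_s = np.linalg.lstsq(x_s, y_s, rcond=None)[0]   # stands in for Model-S
x_m, y_m = np.vstack([x_s, x_i]), np.concatenate([y_s, y_i])

mask = prune_by_gradient(theta_s, x_m, y_m)
theta_m = fine_tune(theta_s, mask, x_m, y_m)         # stands in for Model-M
```

With ten parameters, a rate of 0.2 and four steps, the mask keeps four parameters; in a real network the gradients would come from backpropagation on a merged batch instead of the closed-form expression used here.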

3 Experiments

3.1 Nucleus Recognition

3.1.1 Ki67 Dataset.

All Ki67 IHC images were collected under magnification and resolution from 16 domains through data anonymization. Nuclei of each slice were annotated by certified pathologists in seven categories: negative fibroblasts, negative lymphocytes, negative tumor cells, positive fibroblasts, positive lymphocytes, positive tumor cells and other cells. The dataset was built with mismatched categories across domains, as shown in Table 1.

| Properties | Merged training domain D1 | Merged training domain D2 | Testing unseen domains |
|---|---|---|---|
| No. of images | 95 | 51 | 41 |
| No. of categories | 7 | 5 | 4, 5, 7 |
| Nuclei count | 35,948 | 35,948 | 35,948 |
| Radius of cells | 4-6 pixels | 7-12 pixels | 5-14 pixels |
| Type of tumor | breast | breast | breast, cervix, pancreas, urethra |
| Hue | partial blue | partial orange | partial violet, brown red, yellow green |
Table 1: Properties of different domains in Ki67 data set

3.1.2 Settings and Implementations.

Nucleus recognition was formulated as a structured regression problem by implementing weakly supervised segmentation [xing2019pixel]. The ground truth mask is a set of continuous-valued proximity maps ranging from 0 to 1, generated by applying a 2D Gaussian filter to the annotations. Image patches are randomly cropped to a fixed resolution and fed into a variant of Albunet (U-Net [ronneberger2015u] with ResNet34 [he2016deep] pre-trained on ImageNet as the encoder). The model was trained with the Adam optimizer under the sum of cross-entropy loss and IoU loss, with adjusted hyper-parameters: batch size 8, 500 iterations, and an initial learning rate that was gradually decreased after every 100 epochs. Our base model was pre-trained on D1 and pruned on the merged domain (D1 and D2) with a small pruning ratio and 4 prune-reset iterations. Besides, an ablation experiment was arranged to observe the effect of pruning, in which pruning was applied to the encoder only, the decoder only, and all modules under the same pruning rate. In addition, data augmentation was adopted, including random rotation, crop, flip, mirroring and color jitter. Training was stopped and the best model selected once the accuracy had not improved for 30 epochs. The final prediction was obtained by non-maximum suppression on the probability map.
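The proximity-map target and the non-maximum suppression step mentioned above can be illustrated as follows. This is a hedged sketch assuming NumPy and SciPy; the function names and the values of `sigma`, `window` and `threshold` are illustrative choices, not settings reported in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter


def proximity_map(shape, centers, sigma=3.0):
    """Build a [0, 1] proximity map by Gaussian-filtering point annotations."""
    target = np.zeros(shape, dtype=np.float64)
    for row, col in centers:
        target[row, col] = 1.0
    target = gaussian_filter(target, sigma=sigma)
    peak = target.max()
    return target / peak if peak > 0 else target


def detect_peaks(prob, window=7, threshold=0.5):
    """Non-maximum suppression: keep local maxima that exceed a threshold."""
    local_max = maximum_filter(prob, size=window) == prob
    rows, cols = np.nonzero(local_max & (prob >= threshold))
    return sorted(zip(rows.tolist(), cols.tolist()))


# Two annotated nucleus centers (hypothetical coordinates).
centers = [(10, 12), (30, 40)]
prob = proximity_map((64, 64), centers)

# Applying NMS to the target itself recovers the annotated centers.
peaks = detect_peaks(prob)
```

In practice `detect_peaks` would be applied to the network's predicted probability map rather than to the ground-truth target used here for demonstration.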

Algorithm Test (domains) Detection Classification
P R F1 P R F1
ERM merge 82.78 83.97
unseen 89.56 82.92 86.11 90.45 87.65 88.35
ERM-F merge 85.22 83.53 84.37 84.43 84.10 84.19
unseen 88.66 81.75 85.06 90.99 88.57 89.38
DANN[ganin2016domain] merge 75.43 80.90 78.07 77.63 72.16 74.09
unseen 81.73 78.94 80.31 86.84 68.31 74.77
Adv-Aug[volpi2018generalizing] merge 80.53 60.65 69.19 69.45 76.40 72.45
unseen 61.92 73.66 88.17 89.14
Ours-Encoder merge 85.30 85.76 85.53 84.93 84.24 84.39
unseen 89.26 83.11 86.08 89.03 89.72
Ours-Decoder merge 84.86 84.96 84.91 84.85 82.81 83.52
unseen 87.99 83.01 85.43 91.37 87.82 88.89
Ours-All merge 85.95 85.19 85.57 86.43
unseen 88.83 91.04 89.42
Table 2: Nucleus Recognition DG results on unseen & merge domains of Ki67

3.1.3 Evaluation.

The results were evaluated on precision, recall and F1-score under the two tasks as follows. The final score is averaged over 41 cases.

  • Detection. The Hungarian matching algorithm [kuhn1955hungarian] was applied to the predictions and annotations under the valid area with a radius of 16 pixels to calculate true positive (TP), false positive (FP) and false negative (FN). The average distance and standard deviation of paired results were computed as an additional indicator.

  • Classification. The pair-wise results in detection were further assessed according to whether their categories are matched using metrics in [sirinukunwattana2016locality].
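The detection metric above can be sketched with SciPy's `linear_sum_assignment`, which solves the same assignment problem as the Hungarian algorithm. This is an assumption-laden illustration, not the evaluation code used in the paper: the function name `detection_scores`, the large penalty cost for out-of-radius pairs, and the example coordinates are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def detection_scores(pred, truth, radius=16.0):
    """Match predictions to annotations one-to-one and score the detection."""
    pred = np.asarray(pred, dtype=float).reshape(-1, 2)
    truth = np.asarray(truth, dtype=float).reshape(-1, 2)
    matched = np.array([])
    if len(pred) and len(truth):
        dist = np.linalg.norm(pred[:, None, :] - truth[None, :, :], axis=2)
        # Pairs outside the valid radius get a prohibitive cost (assumed value).
        cost = np.where(dist <= radius, dist, 1e6)
        rows, cols = linear_sum_assignment(cost)
        matched = dist[rows, cols]
        matched = matched[matched <= radius]
    tp = int(matched.size)
    fp, fn = len(pred) - tp, len(truth) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall, "f1": f1,
        "mean_dist": float(matched.mean()) if tp else float("nan"),
        "std_dist": float(matched.std()) if tp else float("nan"),
    }


# Hypothetical nucleus centers: two close pairs and one unmatched point each.
pred = [(10, 10), (50, 52), (90, 90)]
truth = [(12, 11), (50, 50), (200, 200)]
scores = detection_scores(pred, truth)
```

For the classification task, the matched pairs returned by the assignment would additionally be checked for agreement between predicted and annotated categories.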

The compared methods and their results are shown in Table 2. We take ERM-F (trained and fine-tuned in the same way as our method but without pruning) as a reference to ensure that the improvement is not gained from fine-tuning alone. Moreover, DANN [ganin2016domain] was implemented by embedding a domain discriminator following the model encoder. Adv-Aug [volpi2018generalizing] was adjusted with the adversarial learning rate ranging from 0 to , step 10.

3.1.4 Results and Discussions.

Nucleus recognition results are shown in Table 2. ‘Merge’ and ‘unseen’ denote domains that participated in training and domains never involved in training, respectively. ‘Ours-Encoder’ and ‘Ours-Decoder’ denote applying the pruning method only to the encoder and decoder modules, respectively. Our method exhibits the best overall performance on merged and unseen domains. On unseen domains, recall and F1-score show competitive results for our method in both nucleus detection and classification. In particular, the proposed method improves the detection F1-score by 0.35 over ERM and 1.40 over ERM-F, and for nucleus classification it outperforms ERM and ERM-F in F1-score by 1.56 and 0.53, respectively. On merged domains, the proposed method slightly under-performs ERM in nucleus detection. Such degradation is reasonable, because ERM, with limited generalization, tends to over-fit domain-specific features, which may help accuracy on the merged domains. Even so, our method surpasses ERM and ERM-F by non-trivial margins of 1.73 and 1.51 (F1-score), respectively, on merged domains in classification.
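The margins quoted above can be cross-checked against the baseline F1-scores in Table 2 with a few lines of arithmetic. The sketch below assumes that 83.97 is the ERM classification F1-score on merged domains (the row is incomplete in the table); under that assumption, each pair of margins points to the same F1-score for the proposed method.

```python
# Baseline F1-scores read from Table 2 ("cls_merge" for ERM is an assumption).
erm = {"det_unseen": 86.11, "cls_unseen": 88.35, "cls_merge": 83.97}
erm_f = {"det_unseen": 85.06, "cls_unseen": 89.38, "cls_merge": 84.19}

# Margins of the proposed method as quoted in the text.
margin_over_erm = {"det_unseen": 0.35, "cls_unseen": 1.56, "cls_merge": 1.73}
margin_over_erm_f = {"det_unseen": 1.40, "cls_unseen": 0.53, "cls_merge": 1.51}

implied = {}
for key in erm:
    via_erm = round(erm[key] + margin_over_erm[key], 2)
    via_erm_f = round(erm_f[key] + margin_over_erm_f[key], 2)
    implied[key] = (via_erm, via_erm_f)

for key, (a, b) in implied.items():
    print(f"{key}: {a:.2f} via ERM, {b:.2f} via ERM-F")
```

Both routes agree for each task, which suggests the quoted margins are internally consistent with the surviving table entries.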

It is worth noting that DANN shows considerably worse metrics than ERM in our experiment, which indicates its vulnerability to the class mismatch problem, both on merged domains and when generalizing to unseen domains. The effect of Adv-Aug is marginal on unseen domains because of the complexity of histopathological images. It is observed that model training is disturbed by the generated unrealistic images. The above two methods utilize adversarial mechanisms to expand the available domains, and their limitation appears when only a small number of domains is available for training. In contrast, our pruning method attempts to learn domain-agnostic representations by preventing the model from over-fitting on the training domains.

Figure 2: Comparison of nuclei detection and classification with different methods

Fig. 2 shows representative Ki67 recognition results of the above methods in three distinct domains for comparison. Blue, yellow, green, red and violet represent negative fibroblasts, negative lymphocytes, negative tumor cells, positive tumor cells and other cells, respectively. The remaining two categories do not appear in these cases. Areas with a significant difference among methods are highlighted with black dashed ellipses. Compared with other approaches, the proposed method produces relatively better results in classification (cases 1 and 3 in rows 1 and 3) and detection (case 2 in row 2). Specifically, in case 1, our method shows competitive classification results given the mixture of negative tumor cells, negative fibroblasts and other cells. The improvement in detection is also noticeable in case 2, even though the color of the negative tumor cells is too faint to be easily discerned. Case 3 further shows the robustness of our method on hollow cells.

The ablation experiment explores the applicability of pruning to different modules. Generally, pruning all modules achieves the best performance. This may be explained by the remaining percentage of parameters within each layer. It was observed that the proportion of remaining parameters varies widely across network depths, even when pruning with a tiny rate. The Decoder-only approach prunes the final layer to a large sparsity (70%), and Encoder-only retains 30% of the parameters in the shallow layers. By contrast, pruning all modules preserves 40% of the parameters in the final layer and 40% of the parameters in the shallow layers. All evaluations illustrate the importance of allocating the parameters to prune in a balanced manner.
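The uneven per-layer sparsity discussed above follows naturally from ranking gradients globally: layers with larger gradients lose more parameters. The sketch below illustrates this with two hypothetical layers whose gradient scales differ by a factor of ten; the layer names, scales and the helper `global_prune_masks` are assumptions for illustration only.

```python
import numpy as np


def global_prune_masks(grads, rate):
    """Prune the given fraction of all parameters with the largest |gradient|."""
    flat = np.concatenate([np.abs(g).ravel() for g in grads.values()])
    n_prune = int(rate * flat.size)
    if n_prune == 0:
        return {name: np.ones(g.shape, dtype=bool) for name, g in grads.items()}
    # Smallest magnitude among the parameters that will be pruned.
    threshold = np.sort(flat)[-n_prune]
    return {name: np.abs(g) < threshold for name, g in grads.items()}


def remaining_fraction(masks):
    """Fraction of surviving parameters in each layer."""
    return {name: float(m.mean()) for name, m in masks.items()}


# Hypothetical gradients: the final layer has much larger gradients.
rng = np.random.default_rng(0)
grads = {
    "shallow": 0.1 * rng.standard_normal(1000),
    "final": 1.0 * rng.standard_normal(1000),
}

masks = global_prune_masks(grads, rate=0.2)
kept = remaining_fraction(masks)
total_pruned = sum(int((~m).sum()) for m in masks.values())
```

Although the global rate is 20%, almost all pruned parameters come from the layer with larger gradients, which mirrors the imbalance that the ablation attributes to pruning only one module.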

3.2 PACS Classification

| Algorithm | Art | Cartoon | Photo | Sketch | Average |
|---|---|---|---|---|---|
| ERM [vapnik1999overview] | ± 0.4 | 82.56 ± 0.6 | 96.73 ± 0.4 | 80.02 ± 0.6 | 86.68 |
| IRM [arjovsky2019invariant] | 86.37 ± 0.8 | 80.12 ± 1.2 | 97.24 ± 0.4 | 76.38 ± 1.1 | 85.02 |
| DRO [sagawa2019distributionally] | 87.53 ± 0.7 | 81.74 ± 0.7 | ± 0.2 | 79.80 ± 0.8 | 86.67 |
| MLDG [li2018learning] | 87.17 ± 0.8 | 81.35 ± 1.4 | 97.36 ± 0.4 | 80.94 ± 1.0 | 86.71 |
| DANN [ganin2016domain] | 86.52 ± 1.2 | 80.25 ± 1.1 | 97.58 ± 0.2 | 77.26 ± 1.2 | 85.40 |
| CORAL [sun2016deep] | 87.21 ± 0.6 | 82.16 ± 0.4 | 97.52 ± 0.1 | 80.01 ± 0.3 | 86.77 |
| Ours | 85.33 ± 0.5 | ± 0.6 | 96.11 ± 0.4 | ± 0.2 | |
Table 3: Domain generalization on PACS, Model selection: oracle

The PACS (photo, art painting, cartoon, sketches) classification benchmark was also introduced to evaluate our method with a ResNet50 backbone, using the test-domain validation set (oracle) for model selection [gulrajani2020search]. The results are shown in Table 3. is the maximum value over 3 experiments, and within each experiment, the source domain and the invasion domain are arranged alternately in a ratio of 1:2. It is observed that our method boosts the performance on the sketches domain, outperforming the SOTA methods by 3.01. The proposed method also surpasses other approaches on the cartoon domain by a margin of 2.27. Note that skeletons in sketches and contours in cartoons are cross-domain structural information. Therefore, the above results support the effectiveness of our approach in learning domain-agnostic representations. A potential reason for our mediocre results on photos is that the pruning method is applied to a model pre-trained on the photo domain (ImageNet). Consequently, the corresponding results are biased and have limited reference value. In addition, the factor behind our imperfect performance on art paintings is probably the large variance between the art painting domain and the other domains. Even so, our method still achieves the best average accuracy.

4 Conclusion

In this paper, we propose a novel DG method based on subnetwork searching, which generalizes the model by discarding domain-specific representations and retaining invariant representations in a domain merging scenario. Our method attempts to solve practical limitations of DG, including the insufficient number of domains and the class mismatch problem across domains. Competitive results demonstrate our method’s superiority and robustness in multiclass nucleus recognition of Ki67 IHC images. In addition, the proposed method achieves good performance on the PACS classification benchmark by effectively learning generic representations.