In the diagnosis and prognosis of cancer, Ki67 is a significant biomarker [reis2020ki67, yang2018ki67, yerushalmi2010ki67] that can evaluate tumor cell proliferation and growth by quantifying its expression in Ki67 immunohistochemistry (IHC) stained images [kloppel2018ki67, volynskaya2019ki67]. With the increase of data volume, deep learning methods have shown strong advantages in nucleus recognition [xie2018efficient, xing2019pixel]. However, the lack of standardized multicenter data can lead to heterogeneous Ki67 interpretations arising from different stainings, scanners and environments [focke2017interlaboratory, rimm2019international]. Applying a method trained on a single domain can therefore result in biased predictions. In practice, a robust model trained on multi-source data is commonly required to predict reliably on unseen domains. Empirical Risk Minimization (ERM) [vapnik1999overview] is a convenient solution; however, a generalization gap still exists due to the complexity of the merged domains.
Methods of cross-domain training can be broadly divided into domain adaptation (DA) and domain generalization (DG). Specifically, DA aims to resolve the covariate shift between the source and target domains. A popular representation learning approach, domain-adversarial neural networks (DANN), uses a domain discriminator to confuse domain features in an indiscriminate mechanism [ganin2016domain]. In histopathological image analysis, DANN has been applied to organ-level, slide-level [lafarge2019learning] and patient-level domains [hashimoto2020multi]. Similarly, adversarial discriminative domain adaptation (ADDA) [tzeng2017adversarial] replaces the weight-shared backbone with exclusive domain-specific encoders, and was introduced into the quality assessment of fundus images [shen2020domain]. Besides, some studies implemented unsupervised DA by utilizing generative adversarial networks in liver segmentation [yang2019unsupervised], microscopy image quantification [xing2019adversarial] and diabetic retinopathy detection [yang2020residual]. Yet, since such models are trained only on the source and target domains, their performance remains flawed on unseen domains.
To achieve desirable generalization performance, data augmentation is an intuitive strategy. For instance, adversarial samples are generated by adaptive data augmentation and deep-stacked transformation in [volpi2018generalizing] and [zhang2020generalizing], respectively. Nevertheless, model convergence can be hindered by the generated unrealistic data. Instead, some researchers studied DG by learning domain-agnostic representations. Albuquerque et al. improve generalization by minimizing pairwise domain discrepancies [albuquerque2019generalizing]. Sikaroudi et al. utilize meta-learning to train a magnification-agnostic histopathology image classification model [sikaroudi2021magnification]. Mahajan et al. give a causal interpretation to DG and generalize based on invariant representations [mahajan2020domain]. The works above achieved promising results on data from unseen domains, but some non-negligible limitations remain. First of all, DANN-based methods do not function well with limited data in the source domain. Moreover, some studies are not applicable when certain categories do not exist across all training domains.
Motivated by this, we present a novel DG approach for nucleus recognition in Ki67 immunohistochemistry stained images. The proposed method transforms DG into a domain-agnostic subnetwork searching problem. Invariant representations are learned by iteratively pruning domain-specific model parameters under a domain merging scenario. Moreover, the pruned model is fine-tuned on the merged domains, which alleviates the class mismatch problem to some extent. Furthermore, an appropriate implementation is obtained by applying our pruning method to different modules of the nucleus recognition framework. Compared with state-of-the-art (SOTA) methods on the Ki67 IHC images and a public data set, our method achieves competitive performance on unseen domains.
The lottery ticket hypothesis [frankle2018lottery] claims that in an over-parameterized neural network, only a subset of parameters accounts for model performance. Subnetwork compression can thus be achieved by iteratively pruning trivial parameters without any degradation of performance. Under some circumstances, a model may not generalize well to unseen domains because it overfits domain-specific representations. Consequently, DG can be considered a domain-agnostic subnetwork searching problem: keep domain-agnostic parameters while pruning domain-specific ones.
The proposed pruning-based DG framework is illustrated in Fig. 1. Initially, the parameters $\theta_s$ are obtained from the convergent model (Model-S) trained on a single domain. When training on merged domains, model gradients are collected after loss backpropagation and sorted globally by magnitude. Subsequently, the subset $\hat{\theta} \subset \theta_s$ is obtained by pruning the weights with the top gradient magnitudes and resetting the remaining parameters to their values in $\theta_s$ without updating, iterating this procedure $n$ times. Finally, the pruned model (Model-M) is fine-tuned on the merged domains.
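As a concrete illustration, the prune–reset step can be sketched in NumPy on a toy parameter vector. This is a minimal sketch only: `prune_step`, the toy vector and the synthetic gradients are illustrative stand-ins, not the paper's implementation, which operates on the weights of a CNN.

```python
import numpy as np

def prune_step(theta_s, grads, mask, prune_rate):
    """One prune-reset iteration: drop the surviving weights whose
    merged-domain gradient magnitude is largest (most domain-specific),
    then reset the survivors to their source-trained values theta_s."""
    alive = np.flatnonzero(mask)
    order = alive[np.argsort(np.abs(grads[alive]))]  # ascending |gradient|
    n_drop = int(len(alive) * prune_rate)
    if n_drop > 0:
        mask[order[-n_drop:]] = False  # prune the top-magnitude fraction
    return theta_s * mask, mask        # reset survivors, zero out pruned

# toy run: 10 weights, synthetic merged-domain gradients, 4 iterations
rng = np.random.default_rng(0)
theta_s = rng.normal(size=10)          # stand-in for Model-S parameters
mask = np.ones(10, dtype=bool)
for _ in range(4):
    grads = rng.normal(size=10)        # stand-in for backpropagated gradients
    theta, mask = prune_step(theta_s, grads, mask, prune_rate=0.1)
```

Note that the surviving weights are reset to their Model-S values after every iteration rather than being updated, matching the reset step described above.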
2.2 Pruning-based Algorithm
Our method aims to filter out a fraction of parameters such that the remaining parameters preserve accuracy on the training domains while performing well on the unseen domain(s). To achieve this goal, we construct a domain merging scenario. That is, the model is first trained on a single source domain. After it converges, we replace the training data with merged domains, which contain the source domain and domain(s) never seen before; we refer to the latter as invasion domain(s). Under this scenario, the target problem corresponds to the following optimization:

$$\hat{\theta} = \arg\min_{\hat{\theta} \subset \theta_s} \; \mathbb{E}_{x_s \sim D_s} \mathcal{L}(x_s; \hat{\theta}) + \mathbb{E}_{x_v \sim D_v} \mathcal{L}(x_v; \hat{\theta}), \tag{1}$$

where $x_s$ and $x_v$ are samples from the source domain $D_s$ and the invasion domain(s) $D_v$ respectively, $\mathbb{E}$ is the expectation, and $\theta_s$ denotes the parameters trained on $D_s$. The cost $\mathcal{L}(x; \hat{\theta})$ is computed on samples $x$ under the parameters $\hat{\theta}$. Generally, the cost $\mathcal{L}(x; \hat{\theta})$ of the pruned model is larger than $\mathcal{L}(x; \theta_s)$. Eq. 1 can be simplified as follows:

$$\hat{\theta} = \arg\min_{\hat{\theta} \subset \theta_s} \; \mathbb{E}_{x_m \sim D_m} \mathcal{L}(x_m; \hat{\theta}) - \mathbb{E}_{x_m \sim D_m} \mathcal{L}(x_m; \theta_s), \tag{2}$$

where $x_m$ represents a batch of merged samples from $D_m$ that consist of $x_s$ and $x_v$. Since the second term of Eq. 2 does not depend on $\hat{\theta}$, it has no contribution to the selection of $\hat{\theta}$ and can be removed, so the optimization of $\hat{\theta}$ can be defined as Eq. 3:

$$\hat{\theta} = \arg\min_{\hat{\theta} \subset \theta_s} \; \mathbb{E}_{x_m \sim D_m} \mathcal{L}(x_m; \hat{\theta}). \tag{3}$$
The above subnetwork searching problem can be solved by exploring the contribution of each parameter to $\mathcal{L}$. More concretely, while training on $D_m$, model parameters are updated through loss backpropagation according to gradient descent. In this procedure, gradients are calculated as derivatives of the cost with respect to the parameters, $g = \partial \mathcal{L} / \partial \theta$. Correspondingly, the gradient magnitude $|g^j|$ of an individual parameter $\theta^j$ can be regarded as its contribution to $\mathcal{L}$. Intuitively, model convergence depends on the updating of the parameters: parameters requiring a large variation to fit the merged domains can be considered highly domain-related. The expected remaining parameters can therefore be approximated by preserving parameters with a relatively small $|g^j|$, as in Eq. 4:

$$\hat{\theta} = \{\, \theta^j \in \theta_s \,:\, |g^j| \le \tau \,\}, \tag{4}$$

where $\tau$ is a scalar related to the remaining percentage of parameters.
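This selection rule amounts to thresholding gradient magnitudes at the quantile implied by the keep ratio. A minimal NumPy sketch (the function name and toy gradient values are hypothetical):

```python
import numpy as np

def select_survivors(grads, keep_fraction):
    """Keep the parameters whose gradient magnitude on the merged batch
    falls below the scalar threshold tau, where tau is chosen so that
    `keep_fraction` of the parameters survive."""
    k = int(np.ceil(keep_fraction * grads.size))
    tau = np.sort(np.abs(grads).ravel())[k - 1]  # quantile-based threshold
    return np.abs(grads) <= tau

# toy gradients for six parameters; keep the half with the smallest |g|
g = np.array([0.05, -0.9, 0.1, 0.4, -0.02, 0.7])
mask = select_survivors(g, keep_fraction=0.5)  # keeps 0.05, 0.1, -0.02
```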
Generally, a rather small pruning rate is set, since a large pruning rate causes performance degradation. Moreover, the same setting as [frankle2018lottery] is adopted: iterative training and pruning, followed by resetting the remaining parameters, over $n$ steps, with each step pruning a fixed fraction of the surviving parameters. A few pruning iterations are usually enough, so only a small amount of merged data is needed in the pruning procedure. Although the pruned model suffers degraded performance, it gains the potential to learn domain-invariant representations. Finally, the surviving parameters are fine-tuned on the merged domains to recover accuracy.
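Since each step prunes a fixed share of the currently surviving weights, the overall remaining fraction decays geometrically with the number of steps; the rates below are illustrative, not the paper's settings:

```python
def remaining_fraction(prune_rate, steps):
    """Each step prunes `prune_rate` of the currently surviving weights,
    so the overall remaining fraction decays geometrically."""
    return (1.0 - prune_rate) ** steps

# e.g. four prune-reset steps at a small 10% per-step rate keep ~65.6%
frac = remaining_fraction(0.10, 4)
```

This is why a small per-step rate over a few iterations is preferable to one aggressive pruning pass: the total sparsity accumulates gradually while each step removes only the most domain-specific weights.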
3.1 Nucleus Recognition
3.1.1 Ki67 Dataset.
All Ki67 IHC images were collected under consistent magnification and resolution from 16 domains through data anonymization. Nuclei of each slice were annotated by certified pathologists in seven categories: negative fibroblasts, negative lymphocytes, negative tumor cells, positive fibroblasts, positive lymphocytes, positive tumor cells and other cells. The data set was built with mismatched categories across domains, as shown in Table 1.
| Properties | Merged training domain D1 | Merged training domain D2 | Testing unseen domains |
|---|---|---|---|
| No. of images | 95 | 51 | 41 |
| No. of categories | 7 | 5 | 4, 5, 7 |
| Radius of cells | 4–6 pixels | 7–12 pixels | 5–14 pixels |
| Type of tumor | breast | breast | breast, cervix, pancreas, urethra |
| Hue | partial blue | partial orange | partial violet, brown red, yellow green |
3.1.2 Settings and Implementations.
Nucleus recognition was formulated as a structured regression problem via weakly supervised segmentation [xing2019pixel]. The ground truth mask is a set of continuous-valued proximity maps ranging from 0 to 1, generated by applying a 2D Gaussian filter to the annotations. Image patches are randomly cropped at a fixed resolution and fed into a variant of Albunet (U-Net [ronneberger2015u] with an ImageNet-pre-trained ResNet34 [he2016deep] as the encoder). The model was trained with the Adam optimizer under the sum of cross-entropy loss and IoU loss with the following hyper-parameters: batch size 8, iterations 500, and an initial learning rate gradually decreased after every 100 epochs. Our base model was pre-trained on D1 and pruned on the merged domain (D1 and D2) with a small pruning ratio and 4 prune–reset iterations. Besides, an ablation experiment was arranged to observe the effect of pruning, applying it to the encoder only, the decoder only, and all modules under the same pruning rate. In addition, data augmentation was adopted, including random rotation, crop, flip, mirroring and color jitter. The optimal model was selected once the accuracy no longer improved within 30 epochs. The final prediction was obtained by non-maximum suppression on the probability map.
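The final non-maximum suppression step can be sketched as a greedy peak search on the probability map. The radius and threshold below are illustrative values, not the settings used in our experiments:

```python
import numpy as np

def nms_peaks(prob, radius, thresh):
    """Greedy non-maximum suppression on a probability map: repeatedly take
    the highest-scoring pixel above `thresh` and suppress a disc of
    `radius` pixels around it, yielding one detection per nucleus."""
    prob = prob.copy()
    ys, xs = np.mgrid[0:prob.shape[0], 0:prob.shape[1]]
    peaks = []
    while True:
        y, x = np.unravel_index(np.argmax(prob), prob.shape)
        if prob[y, x] < thresh:
            break
        peaks.append((int(y), int(x)))
        prob[(ys - y) ** 2 + (xs - x) ** 2 <= radius ** 2] = 0.0
    return peaks

# toy probability map with two well-separated nuclei
prob = np.zeros((20, 20))
prob[5, 5], prob[15, 15] = 0.9, 0.8
peaks = nms_peaks(prob, radius=3, thresh=0.5)  # -> [(5, 5), (15, 15)]
```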
The results were evaluated with precision, recall and F1-score on the two tasks below. The final score is averaged over the 41 test cases.
Detection. The Hungarian matching algorithm [kuhn1955hungarian] was applied to the predictions and annotations within a valid area of 16-pixel radius to calculate true positives (TP), false positives (FP) and false negatives (FN). The average distance of paired results was computed as an additional indicator.
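The matching step can be sketched as follows. For clarity, an exhaustive search over assignments stands in for the Hungarian algorithm here (it is only feasible for tiny sets; in practice one would use an O(n³) implementation such as `scipy.optimize.linear_sum_assignment`), and the toy coordinates are illustrative:

```python
import numpy as np
from itertools import permutations

def match_detections(preds, gts, max_dist=16.0):
    """Optimal one-to-one matching of predicted and annotated nucleus
    centres; pairs farther apart than `max_dist` pixels are discarded.
    Unmatched predictions count as FP, unmatched annotations as FN."""
    preds, gts = np.asarray(preds, float), np.asarray(gts, float)
    dist = np.linalg.norm(preds[:, None, :] - gts[None, :, :], axis=2)
    swap = len(preds) > len(gts)
    if swap:
        dist = dist.T
    r, c = dist.shape  # r <= c
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(c), r):  # exhaustive optimal assignment
        cost = dist[np.arange(r), perm].sum()
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    pairs = [(j, i) if swap else (i, j)
             for i, j in enumerate(best_perm) if dist[i, j] <= max_dist]
    tp = len(pairs)
    return pairs, len(preds) - tp, len(gts) - tp  # (matches, FP, FN)

# toy case: one prediction within 16 px of an annotation, one far away
pairs, fp, fn = match_detections([(0, 0), (50, 50)], [(3, 4), (100, 100)])
```

Precision, recall and F1 then follow directly from the TP, FP and FN counts.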
Classification. The paired results from detection were further assessed according to whether their categories matched, using the metrics in [sirinukunwattana2016locality].
The compared methods and results are shown in Table 2. We take ERM-F (trained and fine-tuned the same as our method but without pruning) as a reference to guarantee that the improvement is not gained merely from fine-tuning. Moreover, DANN [ganin2016domain] was implemented by embedding a domain discriminator after the model encoder. Adv-Aug [volpi2018generalizing] was adjusted with the adversarial learning rate searched from 0 upward over 10 steps.
3.1.4 Results and Discussions.
Nucleus recognition results are shown in Table 2. 'Merged' and 'unseen' denote domains that participated in training and those never involved in training, respectively. 'Ours-Encoder' and 'Ours-Decoder' denote applying the pruning method only to the encoder and decoder modules. Our method exhibits the best overall performance on both merged and unseen domains. On unseen domains, the recall and F1-score metrics show competitive results of our method on both nucleus detection and classification. In particular, the proposed method improves the detection F1-score by 0.35 over ERM and 1.40 over ERM-F, and for nucleus classification it outperforms ERM and ERM-F in F1-score by 1.56 and 0.53, respectively. On merged domains, the proposed method slightly under-performs ERM in nucleus detection. Such degradation is reasonable: ERM, with limited generalization, tends to overfit domain-specific features, which can help accuracy on the merged domains. Even so, our method surpasses ERM and ERM-F in classification on merged domains by non-trivial F1-score margins of 1.73 and 1.51, respectively.
It is worth noting that DANN shows considerably worse metrics than ERM in our experiment, which indicates its vulnerability to the class mismatch problem, both on the merged domains and in generalization to unseen domains. The expected effect of Adv-Aug is marginal on unseen domains because of the complexity of histopathological images; we observed that the generated unrealistic images interfere with model training. Both methods rely on adversarial mechanisms to expand the available domains, a limitation that appears when only a small number of domains is available for training. On the contrary, our pruning method attempts to learn domain-agnostic representations by preventing the model from over-fitting on the training domains.
Fig. 2 shows representative Ki67 recognition results of the above methods in three distinct domains. Blue, yellow, green, red and violet represent negative fibroblasts, negative lymphocytes, negative tumor cells, positive tumor cells and other cells; the remaining two categories do not appear in these cases. Areas with significant differences among methods are highlighted with black dashed ellipses. Compared with the other approaches, the proposed method produces relatively better results in classification (cases 1 and 3, rows 1 and 3) and detection (case 2, row 2). Specifically, in case 1 our method shows competitive classification results given the mixture of negative tumor cells, negative fibroblasts and other cells. The detection improvements are also noticeable in case 2, even though the color of the negative tumor cells is too faint to discern. Case 3 further shows the robustness of our method on hollow cells.
The ablation experiment explores the effect of pruning different modules. Overall, pruning all modules achieves the best performance. This can be attributed to the percentage of parameters remaining within each layer: the proportion of remaining parameters varies widely across network depths even under a tiny pruning rate. The decoder-only approach prunes the final layer to a large sparsity (70%), and the encoder-only approach reserves only 30% of the parameters in the shallow layers. By contrast, pruning all modules preserves 40% of the parameters in the final layer and 40% in the shallow layers. All evaluations illustrate the importance of allocating the parameters to prune in a balanced manner.
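The imbalance described above can be reproduced in a toy setting: under a single global magnitude threshold, a layer whose gradients are systematically smaller retains far more of its parameters. The two synthetic "layers" below are illustrative, not the network's actual gradients:

```python
import numpy as np

def per_layer_survival(grads_by_layer, keep_fraction):
    """With one global magnitude threshold, the surviving percentage can
    differ sharply between layers whose gradient scales differ."""
    flat = np.concatenate([np.abs(g).ravel() for g in grads_by_layer])
    k = int(np.ceil(keep_fraction * flat.size))
    tau = np.sort(flat)[k - 1]  # single global threshold
    return [float(np.mean(np.abs(g) <= tau)) for g in grads_by_layer]

rng = np.random.default_rng(1)
shallow = rng.normal(scale=0.1, size=1000)  # small-gradient "shallow" layer
deep = rng.normal(scale=1.0, size=1000)     # large-gradient "deep" layer
fracs = per_layer_survival([shallow, deep], keep_fraction=0.6)
# the shallow layer keeps far more of its weights than the deep layer
```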
3.2 PACS Classification
| Method | A | C | P | S | Avg. |
|---|---|---|---|---|---|
| ERM [vapnik1999overview] | 87.41 ± 0.4 | 82.56 ± 0.6 | 96.73 ± 0.4 | 80.02 ± 0.6 | 86.68 |
| IRM [arjovsky2019invariant] | 86.37 ± 0.8 | 80.12 ± 1.2 | 97.24 ± 0.4 | 76.38 ± 1.1 | 85.02 |
| DRO [sagawa2019distributionally] | 87.53 ± 0.7 | 81.74 ± 0.7 | 97.61 ± 0.2 | 79.80 ± 0.8 | 86.67 |
| MLDG [li2018learning] | 87.17 ± 0.8 | 81.35 ± 1.4 | 97.36 ± 0.4 | 80.94 ± 1.0 | 86.71 |
| DANN [ganin2016domain] | 86.52 ± 1.2 | 80.25 ± 1.1 | 97.58 ± 0.2 | 77.26 ± 1.2 | 85.40 |
| CORAL [sun2016deep] | 87.21 ± 0.6 | 82.16 ± 0.4 | 97.52 ± 0.1 | 80.01 ± 0.3 | 86.77 |
| Ours | 85.33 ± 0.5 | 84.83 ± 0.6 | 96.11 ± 0.4 | 83.95 ± 0.2 | 87.56 |
The PACS (photo, art painting, cartoon, sketch) classification benchmark was also introduced to evaluate our method, using a ResNet50 backbone and model selection on the test-domain validation set (oracle) [gulrajani2020search]. The results are shown in Table 3. Each result is the maximum value over 3 experiments, and within each experiment the source domain and the invasion domain are arranged alternately in a ratio of 1:2. It is observed that our method boosts the performance on the sketch domain, outperforming the SOTA methods by 3.01. The proposed method also surpasses the other approaches on the cartoon domain by a margin of 2.27. Note that skeletons in sketches and contours in cartoons are cross-domain structural information, so these results demonstrate the effectiveness of our approach in learning domain-agnostic representations. The potential reason for our mediocre results on photos might be that the pruning method is applied to a model pre-trained on the photo domain (ImageNet); consequently, the corresponding results are biased and have limited reference value. In addition, another factor behind our imperfect performance is probably the large variance between the art painting domain and the other domains. Even so, our method still achieves the best average accuracy.
In this paper, we propose a novel DG method based on subnetwork searching, which generalizes the model by discarding domain-specific representations and preserving invariant representations in a domain merging scenario. Our method addresses practical limitations of DG, including an insufficient number of domains and the class mismatch problem across domains. Competitive results demonstrate our method's superiority and robustness on multiclass nucleus recognition in Ki67 IHC images. In addition, the proposed method achieves good performance on the PACS classification benchmark by effectively learning generic representations.