Deep learning methods have achieved unprecedentedly high performance in segmenting histopathology images [1, 2]. Still, the generalizability of such methods is hindered by the limited availability of annotated training data, because medical experts are commonly required for annotation and image characteristics vary widely due to scanner effects, different staining protocols, patients, and disease states. With limited annotation, deep learning model training often covers only a small fraction of the histopathology data space, and there can be a considerable discrepancy (in appearance) between the labeled and unlabeled sets. Thus, the trained segmentation models are at risk of over-fitting and do not generalize well to unseen data.
An array of methods based on semi-supervised learning (SSL) has been proposed to make the most of limited training data and improve model generalizability. The assumption is that unlabeled images are commonly drawn from the original data distribution and contain useful information. In practice, a large amount of unlabeled data is often freely available. Some powerful SSL methods use the feature distribution of unlabeled images to reduce the need for labeling. For example, images are projected to a low-dimensional feature space and pseudo-labels are assigned to unlabeled images based on feature clustering [3, 4]. In [5, 6], images were intentionally perturbed to explore the decision boundary for adversarial training. While such SSL methods are commonly used for classification tasks, applying them to segmentation tasks is not straightforward, because it is hard to define and utilize the distribution/clusters of unlabeled images in the high-dimensional feature space of images. Dominant SSL methods for segmentation include auxiliary losses, consistency learning, and pseudo-labeling. For example, an auxiliary loss was applied to encourage the model to produce outputs of plausible shapes; an ensemble method was proposed to train a meta-learner with generated pseudo-labels; some methods [8, 11, 12] aimed to make the models give consistent predictions under random perturbations (e.g., color jittering, rotation). For histopathology images, the aforementioned methods share a main drawback: image characteristics that differ from those of the labeled samples cannot be effectively and efficiently utilized in the training process.
To deal with diverse characteristic variations of unlabeled data, an intuitive and effective way is to adapt style-transfer based data generation methods for SSL [13, 14, 15]. Specifically, these methods generate new training pairs by transferring styles extracted from unlabeled images to labeled images. However, two issues remain unexplored in the previous work. First, neither image-based [16, 17] nor domain-based [18, 19] style transfer methods can concurrently provide a lightweight style representation and transfer styles between images within one (the same) domain, which is important for exploiting the intrinsic data diversity and distributions of one dataset. Second, the associated augmentation policies of the known data generation methods were not carefully designed. For example, the methods [20, 21] filtered the generated images with higher feature similarity to the training images; in [22, 23], complex training procedures were leveraged to search for an optimal combination of very basic image augmentation operations (e.g., rotation, flip, color jittering). These policies did not consider characteristics of unlabeled data and the discrepancy between labeled and unlabeled data, which can provide helpful information for data generation.
In this paper, we propose a unified framework to exploit image characteristics and effectively employ them to guide data generation. We consider two main challenges: (1) the image distribution is hard to observe and explicitly define; (2) an appropriate data generation policy needs to be designed for sampling from the augmented dataset to assist segmentation model training efficiently.
Suppose our dataset consists of labeled and unlabeled sets. To address the first challenge, we exploit the relation between segmentation and image characteristics. We develop an Image Generation Module to learn and disentangle image representations and define characteristics distributions. Specifically, we consider two complementary key image characteristics in histopathology images: style and content. Style variation is caused by technical issues, including variations of staining protocols and scanner effects. Although it may not affect human judgment much, it can reduce the performance of deep learning segmentation models (e.g., see Fig. 1). Content variation is directly related to specific segmentation tasks and is attributable to object shapes and distribution patterns related to disease types. Our module disentangles each histopathology image into style and content representations, whose distributions can be easily exploited by clustering. To make the features clustering-friendly, we embed style and content into low-dimensional spaces, for which an interpolation property is enforced to explore image similarity distances. We then generate new images by combining judiciously selected pairs of style and content.
For the second challenge, we develop a new data generation policy to handle underrepresented and “hard-case” images (which usually impair model generalizability the most). To remedy the unbalanced data distributions in the labeled set, we propose a distribution matching strategy to match the statistics of the labeled set with those of the unlabeled set across clusters of images by adding generated images to the underrepresented clusters. Moreover, “hard-cases” are highly valuable for training networks but are often only a small fraction of the whole dataset and not sufficiently represented in the labeled data. For this, we propose a hard-case covering strategy to identify hard cases via output variations in terms of different style changes and oversample them by referring to the unlabeled data. To make the generated images meaningful to segmentation, we only transfer styles between images that are sampled from the same content cluster.
We conduct extensive experiments on two public histopathology datasets of nuclei and glands, which show that our new SSL method achieves large segmentation improvement using common segmentation models through data generation. For both inductive and transductive settings, we attain state-of-the-art performance.
Fig. 2 gives an overview of our proposed SSL framework, which consists of two parts. First, we introduce our image generation module with an interpolation property enforced in latent space to exploit image characteristics distributions. Second, using the image generation module, we propose a new strategy to sample from the generated data to bridge the distribution discrepancy between the labeled and unlabeled data, and a strategy to identify and handle the hard cases for segmentation based on a devised uncertainty metric.
II-A Image Generation Module
MUNIT is a state-of-the-art approach for cross-domain transformation, which explicitly disentangles image representation into content and style and provides a certain degree of explainability of the extracted latent image representation. It assumes that the source domain and target domain have distinguishable style spaces. However, this is not the case in our problem (i.e., our source and target domains significantly overlap), and MUNIT cannot yield diversified images by transferring style from one image to another since both images are from the same dataset. Image-based methods can transfer styles between two arbitrary images, but their representations of style and content are both in high-dimensional latent space (e.g., feature maps or matrices [27, 16]), and thus it is hard to explore the image characteristics distributions. We will show how to utilize domain-based and image-based methods to (1) generate style-diversified images for model training and (2) make the image characteristics concisely represented so that their distributions can be effectively explored for segmentation.
We extract features with an Image Generation Module. To capture the local contents, we consider image characteristics distributions on uniformly cropped image patches. Given a set of image patches, $X = \{x_i\}$, from our dataset, each patch $x_i$ can be encoded into a style vector $s_i$ by a style encoder $E_s$ and a content vector $c_i$ by a content encoder $E_c$. All the $s_i$’s and $c_i$’s form a style space $S$ and a content space $C$, respectively. A new image patch can be synthesized by a generator $G$ combining any content vector $c_i$ and style vector $s_j$. The total loss consists of our style matching loss, GAN loss, and reconstruction loss, defined as:
$$\mathcal{L} = \lambda_{sm} \mathcal{L}_{sm} + \lambda_{GAN} \mathcal{L}_{GAN} + \lambda_{recon} \mathcal{L}_{recon}, \qquad (1)$$
where the $\lambda$’s are hyper-parameters for controlling the weight of each term. We design the style matching loss to achieve two goals: (1) preventing training collapse (which would cause all information to be encoded into content and the generator to ignore the style); (2) making the point distribution in the low-dimensional latent space reflect the image style distribution. The core of the style matching loss is an interpolation relation, which associates the point-wise distance in the embedded space with the similarity between image patches.
To attain the first goal, when transferring a patch to the style of a patch , we encourage the generated patch to have the target style , by minimizing a style similarity metric  between and :
where is a Gram matrix calculated on the vectorized VGG features of layer , is the weight of layer , and is the number of filters in layer .
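As an illustration of the Gram-matrix style metric described above, the following is a minimal sketch in plain Python. It assumes that the features of each layer are given as one flattened row per filter; the function names, the toy inputs, and the normalization by the squared number of filters are assumptions made for this sketch, and real use would take the features from a pretrained VGG network.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def gram_matrix(features: Matrix) -> Matrix:
    """Gram matrix of one layer; `features` has one flattened row per filter."""
    n = len(features)
    return [
        [sum(a * b for a, b in zip(features[i], features[j])) for j in range(n)]
        for i in range(n)
    ]


def style_distance(
    feats_a: Sequence[Matrix],
    feats_b: Sequence[Matrix],
    layer_weights: Sequence[float],
) -> float:
    """Weighted sum over layers of squared Gram-matrix differences.

    Dividing by the squared number of filters is an assumed normalization.
    """
    total = 0.0
    for fa, fb, weight in zip(feats_a, feats_b, layer_weights):
        ga, gb = gram_matrix(fa), gram_matrix(fb)
        n_filters = len(fa)
        squared_diff = sum(
            (ga[i][j] - gb[i][j]) ** 2
            for i in range(n_filters)
            for j in range(n_filters)
        )
        total += weight * squared_diff / (n_filters ** 2)
    return total


# Toy single-layer example with two filters (hypothetical values).
layer_a = [[1.0, 0.0], [0.0, 1.0]]
layer_b = [[2.0, 0.0], [0.0, 2.0]]
distance = style_distance([layer_a], [layer_b], [1.0])
```

A distance of zero means that the two patches have identical Gram statistics on every layer considered.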
For the second goal, we encourage that two patches with a higher similarity attain a smaller distance in . We train the model by enforcing the interpolation property: For a linear interpolation in the latent space, , where , the style of the generated patch smoothly transfers to as the distance between and gets closer. This property can be learned by optimizing the following style matching loss:
where is uniformly selected from in training. By the interpolation property, those patches with similar styles tend to have closer style vectors, yet dissimilar patches tend to have style vectors far apart from one another. Hence, the distribution of style vectors in the style space is encouraged to reflect the patch style distribution in the given dataset. Only Eq. (3) is practically used in our training, as Eq. (2) is a special case of Eq. (3) with .
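The interpolation property can be illustrated with a small sketch. The code below linearly interpolates two style vectors and blends the style distances of the generated patch to the two reference patches with the same coefficient. The blended form is only one plausible reading of the interpolation relation and is not claimed to be the exact loss used here; all names are hypothetical.

```python
import random
from typing import List


def sample_alpha(rng: random.Random) -> float:
    """Draw the interpolation coefficient uniformly from [0, 1]."""
    return rng.uniform(0.0, 1.0)


def interpolate_style(s_i: List[float], s_j: List[float], alpha: float) -> List[float]:
    """Linear interpolation alpha * s_i + (1 - alpha) * s_j in the style space."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(s_i, s_j)]


def blended_style_loss(dist_to_i: float, dist_to_j: float, alpha: float) -> float:
    """Assumed blend: the closer the interpolated vector is to s_i, the more
    the generated patch is asked to match the style of patch i."""
    return alpha * dist_to_i + (1.0 - alpha) * dist_to_j


rng = random.Random(0)
alpha = sample_alpha(rng)
mixed = interpolate_style([0.0, 0.0], [2.0, 4.0], 0.25)
```

In this sketch, setting the coefficient to an endpoint of the interval leaves a single-reference style term, which matches the remark that the plain style transfer case is a special case of the interpolated one.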
After obtaining the extracted content set and style set from all the patches, we conduct agglomerative clustering to attain content clusters of and style clusters of , where and are the numbers of such clusters, respectively. A patch space is defined by the Cartesian product of and : . Each corresponds to possible patches whose content vectors belong to content cluster and style vectors belong to style cluster . We explore the statistics of the patch space based on the information of each , whose numbers of labeled patches and unlabeled patches are denoted by and , respectively.
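The patch-space statistics can be tabulated as in the following sketch, which assumes that the content and style cluster assignments have already been computed (for example, by agglomerative clustering) and counts labeled and unlabeled patches for each pair of content cluster and style cluster. The function and variable names are illustrative.

```python
from collections import Counter
from typing import Sequence, Tuple


def patch_space_counts(
    content_ids: Sequence[int],
    style_ids: Sequence[int],
    is_labeled: Sequence[bool],
) -> Tuple[Counter, Counter]:
    """Count labeled and unlabeled patches per (content cluster, style cluster)."""
    labeled: Counter = Counter()
    unlabeled: Counter = Counter()
    for content, style, has_label in zip(content_ids, style_ids, is_labeled):
        target = labeled if has_label else unlabeled
        target[(content, style)] += 1
    return labeled, unlabeled


# Three hypothetical patches: two share a cell, one of them is labeled.
labeled_counts, unlabeled_counts = patch_space_counts(
    content_ids=[0, 0, 1],
    style_ids=[2, 2, 0],
    is_labeled=[True, False, False],
)
```

Cells that are frequent in `unlabeled_counts` but rare in `labeled_counts` correspond to image characteristics that the labeled set underrepresents.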
The same reconstruction loss for images and the latent space as in MUNIT  is applied to the generated images. The content and style should be consistent after decoding and encoding. The latent reconstruction loss is computed as:
where , , and are the image, content, and style reconstruction losses, respectively.
We seek to attain appearance realism of the generated images by training the discriminator using samples generated with interpolated styles. The generated image distribution is imposed to be the same as the original image distribution. The realism loss function is:
where follows the original image distribution (), with , and is encoded from with .
II-B Generation Policy
Using extracted and , we can generate a set of patches that may potentially help segmentation models in a basic random generation setting (by uniform sampling). However, by doing so, (1) biomedically invalid patches may be generated without considering content similarity between the source images and target images, and (2) underrepresented and rare hard cases are overwhelmed by other “common” image characteristics. In this section, we first discuss a basic random generation strategy with content matching, and then further improve the effectiveness by proposing our distribution matching policy and hard case covering policy.
Random Generation with Content Matching. In common practice, one may generate data by transferring a labeled image to any other styles with the original segmentation label preserved. This generates an augmented set (denoted as ) from the original labeled set (). However, invalid images can be generated when a target style is not biomedically valid with respect to the original content (e.g., an image of irregular-shaped cancerous glands with benign tubular texture). To handle this, we propose a content matching strategy that exchanges styles only between patches from the same content cluster, as follows:
Still, we need to strike a balance between the effectiveness of the augmented set and avoiding excessive perturbation of the original labeled (training) data. Hence, the augmented set is used with a fixed probability, and the original labeled set is used otherwise.
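Random generation with content matching can be sketched as follows. The code assumes that each patch carries a content cluster id and a labeled flag; it pairs each labeled patch (the content source, which keeps its segmentation label) with style sources from the same content cluster, and draws a generated pair with a given probability and an original labeled patch otherwise. All names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def content_matched_pairs(
    content_ids: Sequence[int], is_labeled: Sequence[bool]
) -> List[Tuple[int, int]]:
    """(content source, style source) index pairs within one content cluster."""
    pairs = []
    for i, (cluster_i, labeled_i) in enumerate(zip(content_ids, is_labeled)):
        if not labeled_i:
            continue  # only labeled patches can supply content and a label
        for j, cluster_j in enumerate(content_ids):
            if i != j and cluster_i == cluster_j:
                pairs.append((i, j))
    return pairs


def draw_training_item(
    labeled_indices: Sequence[int],
    pairs: Sequence[Tuple[int, int]],
    p_generated: float,
    rng: random.Random,
):
    """Use a generated pair with probability p_generated, else an original patch."""
    if pairs and rng.random() < p_generated:
        return ("generated", rng.choice(list(pairs)))
    return ("original", rng.choice(list(labeled_indices)))


content_ids = [0, 0, 1, 1, 0]
is_labeled = [True, False, True, False, False]
pairs = content_matched_pairs(content_ids, is_labeled)
item = draw_training_item([0, 2], pairs, p_generated=0.15, rng=random.Random(0))
```

Restricting the style source to the same content cluster is what keeps, for instance, a benign texture from being applied to a malignant gland shape.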
Policy 1: Generation with Distribution Matching. To improve segmentation performance, we seek to remedy the observed statistical discrepancy between the labeled and unlabeled sets. In a preliminary study, we found that the numbers of patches in different content-style cluster pairs are highly unbalanced and show significant statistical differences between the labeled and unlabeled sets. Randomly sampling patches could bury rare image characteristics among common ones, causing the segmentation models to be trained inadequately.
Thus, we propose to sample patches based on the statistics of the unlabeled data. In particular, we focus on the underrepresented clusters with image characteristics that appear frequently in the unlabeled set but rarely in the labeled set. Each is selected with a probability . Then we uniformly select samples from (or ) with probability (or ). In this way, the underrepresented clusters benefit the most as they get higher chances to be included in the augmented training data.
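One simple way to sample clusters according to the statistics of the unlabeled data is sketched below. It assumes that the selection probability of each content-style cluster pair is proportional to its number of unlabeled patches; this proportional weighting is an assumption made for illustration and may differ from the exact probability used in this policy.

```python
import random
from typing import Dict, Hashable


def distribution_matching_probs(unlabeled_counts: Dict[Hashable, int]) -> Dict[Hashable, float]:
    """Assumed weighting: probability proportional to the unlabeled patch count."""
    total = sum(unlabeled_counts.values())
    if total <= 0:
        raise ValueError("at least one unlabeled patch is required")
    return {cell: count / total for cell, count in unlabeled_counts.items()}


def sample_cluster(probs: Dict[Hashable, float], rng: random.Random) -> Hashable:
    """Draw one cluster pair according to the given probabilities."""
    cells = list(probs)
    weights = [probs[cell] for cell in cells]
    return rng.choices(cells, weights=weights, k=1)[0]


# Hypothetical unlabeled counts for two (content, style) cells.
probs = distribution_matching_probs({(0, 0): 3, (0, 1): 1})
chosen = sample_cluster(probs, random.Random(0))
```

Under this weighting, a cell that is common in the unlabeled set is sampled often regardless of how few labeled patches it contains, which is how underrepresented clusters gain training weight.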
Policy 2: Generation with Hard Case Covering. To improve segmentation, we further identify and handle the “hard cases”. In our preliminary experiments, we found that the segmentation performances across different clusters can be different. For example, the performance is not good enough for irregular-shaped or highly stained abnormal tissues for all the segmentation models that we considered.
We quantify the uncertainty of a segmentation model by its prediction variance when the input is transferred to different target styles. For style clusters , we select one representative target style from each with the minimum sum of distances to all the other styles . For each , its uncertainty for a segmentation model is calculated as:
Selecting uncertain clusters more frequently will help the segmentation model reduce potential prediction errors. Thus, we upweight the probability of sampling uncertain clusters. Each is selected with a probability . This policy has two advantages compared to : (1) our method does not require training separate versions of the segmentation model; (2) our uncertainty value can better assess the segmentation quality, as the experiments show that our coefficient is 15% higher.
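The hard-case covering steps can be sketched as follows. The code picks a representative (medoid) style per cluster, scores a patch by the per-pixel variance of the predictions obtained under different target styles, and turns cluster-level uncertainties into sampling probabilities. The population variance, the averaging over pixels, and the proportional weighting are all assumptions of this sketch; the names are hypothetical.

```python
import math
from statistics import mean, pvariance
from typing import Dict, Hashable, Sequence


def medoid_index(styles: Sequence[Sequence[float]]) -> int:
    """Index of the style vector with the minimum summed distance to the others."""
    sums = [sum(math.dist(s, t) for t in styles) for s in styles]
    return sums.index(min(sums))


def patch_uncertainty(predictions: Sequence[Sequence[float]]) -> float:
    """predictions[k][p] is the predicted probability of pixel p under style k.

    Returns the per-pixel variance across styles, averaged over pixels.
    """
    per_pixel = [pvariance(values) for values in zip(*predictions)]
    return mean(per_pixel)


def hard_case_probs(cluster_uncertainty: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Assumed weighting: probability proportional to the cluster uncertainty."""
    total = sum(cluster_uncertainty.values())
    if total <= 0:
        n = len(cluster_uncertainty)
        return {cell: 1.0 / n for cell in cluster_uncertainty}
    return {cell: u / total for cell, u in cluster_uncertainty.items()}


representative = medoid_index([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
uncertainty = patch_uncertainty([[0.0, 1.0], [1.0, 1.0]])
probs = hard_case_probs({"a": 0.3, "b": 0.1})
```

Because the score only needs forward passes of one trained model on style-transferred inputs, no additional copies of the segmentation model have to be trained.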
Overall, these two policies are complementary to each other and can be combined with Mixed policy: sample patches with an average probability of distribution matching and hard case covering (i.e., the probability is ), which comprehensively covers the unlabeled data especially for the hard cases.
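The mixed policy averages the two selection probabilities for each cluster, which can be sketched as below. The dictionaries of per-cluster probabilities are hypothetical inputs, for example produced by a distribution matching step and a hard case covering step.

```python
from typing import Dict, Hashable


def mix_policies(
    p_distribution: Dict[Hashable, float], p_hard_case: Dict[Hashable, float]
) -> Dict[Hashable, float]:
    """Average the two per-cluster probabilities; missing clusters count as 0."""
    cells = set(p_distribution) | set(p_hard_case)
    return {
        cell: 0.5 * (p_distribution.get(cell, 0.0) + p_hard_case.get(cell, 0.0))
        for cell in cells
    }


# Hypothetical probabilities over two clusters.
mixed = mix_policies({"a": 0.75, "b": 0.25}, {"a": 0.25, "b": 0.75})
```

If both inputs sum to one, their average also sums to one, so the result can be used directly as a sampling distribution.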
III-A Datasets and Implementation Details
We use two main histopathology image datasets in our experiments: GlaS of glands and MoNuSeg of nuclei. We uniformly crop patches of size , with a fixed step size of 64 along the width and height, from all the images. The GlaS dataset has 85 images (1697 patches) for training and 80 images (1559 patches) for testing. We use 8 style clusters and 5 content clusters to study the data distributions. The MoNuSeg dataset has 30 high-resolution images, with 16 images (1296 patches) of 4 organs for training and 14 images (1134 patches) of 7 organs for testing (note that 3 test organs are not seen in training). We use 7 style clusters and 3 content clusters. In both datasets, we treat either the training set or a subset of the training set as the labeled set, and the test set or a subset of the training set (by ignoring the labels of this subset) as the unlabeled set in our experiments.
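Uniform patch cropping with a fixed step can be sketched as follows. The step size of 64 follows the text; the patch size of 128 and the image size used in the example are placeholders chosen for illustration, and partial patches at the image border are simply skipped in this sketch.

```python
from typing import List, Tuple


def patch_origins(height: int, width: int, patch: int, step: int = 64) -> List[Tuple[int, int]]:
    """Top-left (row, column) corners of all full patches on a regular grid."""
    rows = range(0, height - patch + 1, step)
    cols = range(0, width - patch + 1, step)
    return [(r, c) for r in rows for c in cols]


# Hypothetical 256 x 320 image with a placeholder patch size of 128.
origins = patch_origins(height=256, width=320, patch=128)
```

With a step smaller than the patch size, neighboring patches overlap, which increases the number of patches obtained from each image.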
We use four segmentation models: DCN, MildNet, FullNet, and CIA-Net. Here, DCN and MildNet are common segmentation models. FullNet and CIA-Net attain state-of-the-art performance on the GlaS and MoNuSeg datasets, respectively. The default value of the probability is set as 0.15. The weights , , and in the loss function of Eq. (1) are set as 0.002, 1, and 10, respectively. Basic augmentation operations (e.g., flipping, rotation) are applied in all the experiments (denoted as ).
Our experiments consist of three parts. In qualitative results, we show that our image generation module can effectively generate realistic images and cluster patches based on the style and content similarity. In quantitative results, we evaluate the effectiveness of our new method on improving segmentation performance under both the inductive learning and transductive learning settings of SSL scenarios. In ablation study, we evaluate the model sensitivity to different generation algorithms and policies.
Qualitative Results. Fig. 3 gives examples of generated images for the GlaS dataset. The patches in the same row are from the same content reference image patch (i.e., using the same content vector given by the leftmost patch). The patches in the same column are generated with the same style. The results show that our method can generate diversified, realistic-looking images.
Quantitative Results. In the inductive learning setting, we randomly choose 50% of the patches from the original training set as labeled data, and treat the rest of the training set, with annotations ignored, as the unlabeled data, from which style information is extracted to help improve segmentation performance. Tables I and II show the results. First, one can observe that with only 50% labeled data, our SSL method attains better performance than training with full annotation for both networks used. Second, compared to the other known SSL methods (RA and CCT), our method yields the best segmentation performance with 50% labeled data. Third, our method can be applied to any segmentation network. In comparison, the SSL algorithm in CCT was designed for a specific auto-encoder segmentation model structure.
Transductive learning means that, in the training process, the test images are shown to the model as unlabeled images (i.e., is the original test set). In biomedical imaging, transductive learning has found wide applications. For example, in high-throughput experiments, a large number of images (with potentially different styles) need to be accurately segmented for further analysis. Transductive learning can be very useful in such scenarios. Table II shows the segmentation results. We evaluate the segmentation models (DCN, MildNet, and CIA-Net) with on our proposed random generation and mixed policy. First, our method can effectively boost the performance of all the models, yielding better results than the state-of-the-art performance. Second, our method works especially well for the rare hard cases (e.g., unseen organs in MoNuSeg).
|  | (Seen Organ) | (Unseen Organ) | (Seen Organ) | (Unseen Organ) |
| DCN Mix. Policy | 0.594 | 0.579 | 0.829 | 0.806 |
| MildNet Mix. Policy | 0.601 | 0.594 | 0.841 | 0.833 |
| CIA-Net Mix. Policy | 0.632 | 0.650 | 0.857 | 0.851 |
| CIA-Net Mix. Policy (50%) | 0.625 | 0.615 | 0.850 | 0.831 |
| CIA-Net Mix. Policy (30%) | 0.507 | 0.441 | 0.767 | 0.762 |
Ablation Study. Table III shows the DCN model’s performance on the GlaS dataset. Under the transductive setting, the benefit of our Image Generation Module to segmentation performance is compared with that of other known style transfer methods, including image-based (e.g., Gatys et al.) and domain-based (e.g., MUNIT) methods. For fair comparison, we use our random generation policy for all of them. One can see that our framework achieves the best performance.
The contribution of each component in our generation policy is evaluated, including content matching (CM), distribution matching (DM), and hard case covering (HC). The performance is affected when only one policy is used or CM is not applied. The three components of our policy can help improve segmentation performance in a complementary manner.
| Gatys et al. | 0.917 | 0.829 | 0.904 | 0.827 |
| Ours (Random Policy) | 0.924 | 0.824 | 0.918 | 0.839 |
| Ours (CM + DM) | 0.925 | 0.841 | 0.915 | 0.833 |
| Ours (CM + HC) | 0.922 | 0.832 | 0.918 | 0.836 |
| Ours (DM + HC) | 0.925 | 0.843 | 0.913 | 0.841 |
| Ours (Mix. Policy) | 0.926 | 0.850 | 0.915 | 0.848 |
In this paper, we proposed a new unlabeled data guided semi-supervised learning framework for histopathology image segmentation. We designed (1) a style matching loss in our image generation module for image-based style transfer and exploring data distributions with concise style representations, and (2) new policies for guiding our image generation procedure. The effectiveness of our method was demonstrated by comprehensive experiments on two datasets.
Acknowledgement. This research was supported in part by NSF Grant CCF-1617735.
-  L. Yang, Y. Zhang, J. Chen, S. Zhang, and D. Z. Chen, “Suggestive annotation: A deep active learning framework for biomedical image segmentation,” in MICCAI. Springer, 2017, pp. 399–407.
-  S. Graham, H. Chen, J. Gamper, Q. Dou, P.-A. Heng, D. Snead, Y. W. Tsang, and N. Rajpoot, “MILD-Net: Minimal information loss dilated network for gland instance segmentation in colon histology images,” Medical Image Analysis, vol. 52, pp. 199–211, 2019.
-  A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, “Label propagation for deep semi-supervised learning,” in
-  W. Shi, Y. Gong, C. Ding, Z. Ma, X. Tao, and N. Zheng, “Transductive semi-supervised deep learning using min-max features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 299–315.
-  W. Li, Z. Wang, Y. Yue, J. Li, W. Speier, M. Zhou, and C. Arnold, “Semi-supervised learning using adversarial training with good and bad samples,” Machine Vision and Applications, vol. 31, no. 6, pp. 1–11, 2020.
-  W. He, B. Li, and D. Song, “Decision boundary analysis of adversarial examples,” in ICLR, 2018.
-  Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang, “Weakly-supervised semantic segmentation network with deep seeded region growing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7014–7023.
-  G. French, T. Aila, S. Laine, M. Mackiewicz, and G. Finlayson, “Semi-supervised semantic segmentation needs strong, high-dimensional perturbations,” arXiv preprint arXiv:1906.01916, 2019.
-  Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang, “Revisiting dilated convolution: A simple approach for weakly-and semi-supervised semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7268–7277.
-  H. Zheng, Y. Zhang, L. Yang, P. Liang, Z. Zhao, C. Wang, and D. Z. Chen, “A new ensemble learning framework for 3D biomedical image segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5909–5916.
-  Y. Ouali, C. Hudelot, and M. Tami, “Semi-supervised semantic segmentation with cross-consistency training,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
-  D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel, “MixMatch: A holistic approach to semi-supervised learning,” in NeurIPS, 2019, pp. 5049–5059.
-  A. BenTaieb and G. Hamarneh, “Adversarial stain transfer for histopathology image analysis,” IEEE Transactions on Medical Imaging, vol. 37, no. 3, pp. 792–802, 2017.
-  M. T. Shaban, C. Baur, N. Navab, and S. Albarqouni, “StainGan: Stain style transfer for digital histological images,” arXiv preprint arXiv:1804.01601, 2018.
-  C. Ma, Z. Ji, and M. Gao, “Neural style transfer improves 3D cardiovascular MR image segmentation on inconsistent data,” in MICCAI. Springer, 2019, pp. 128–136.
-  Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz, “A closed-form solution to photorealistic image stylization,” in ECCV, 2018, pp. 453–468.
-  L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in CVPR, 2016, pp. 2414–2423.
-  X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in ECCV, 2018, pp. 172–189.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017.
-  M. Tamura and T. Murakami, “Augmented hard example mining for generalizable person re-identification,” arXiv preprint arXiv:1910.05280, 2019.
-  Y. Xue, J. Ye, R. Long, S. Antani, Z. Xue, and X. Huang, “Selective synthetic augmentation with quality assurance,” arXiv preprint arXiv:1912.03837, 2019.
-  E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “AutoAugment: Learning augmentation policies from data,” arXiv preprint arXiv:1805.09501, 2018.
-  D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen, “Population based augmentation: Efficient learning of augmentation policy schedules,” arXiv preprint arXiv:1905.05393, 2019.
-  H. Chen, X. J. Qi, J. Z. Cheng, and P. A. Heng, “Deep contextual networks for neuronal structure segmentation,” in AAAI, 2016.
-  N. Kumar, R. Verma, S. Sharma, S. Bhargava, A. Vahadane, and A. Sethi, “A dataset and a technique for generalized nuclear segmentation for computational pathology,” IEEE Transactions on Medical Imaging, vol. 36, no. 7, pp. 1550–1560, 2017.
-  K. Sirinukunwattana, J. P. Pluim, H. Chen, X. Qi, P.-A. Heng, Y. B. Guo, L. Y. Wang, B. J. Matuszewski, E. Bruni, U. Sanchez et al., “Gland segmentation in colon histology images: The GlaS challenge contest,” Medical Image Analysis, vol. 35, pp. 489–502, 2017.
-  Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal style transfer via feature transforms,” in NIPS, 2017, pp. 386–396.
-  R. W. Johnson, “An introduction to the bootstrap,” Teaching Statistics, vol. 23, no. 2, pp. 49–54, 2001.
-  H. Qu, Z. Yan, G. M. Riedlinger, S. De, and D. N. Metaxas, “Improving nuclei/gland instance segmentation in histopathology images by full resolution neural network and spatial constrained loss,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, 2019, pp. 378–386.
-  Y. Zhou, O. F. Onder, Q. Dou, E. Tsougenis, H. Chen, and P.-A. Heng, “CIA-Net: Robust nuclei instance segmentation with contour-aware information aggregation,” in International Conference on Information Processing in Medical Imaging. Springer, 2019, pp. 682–693.
-  H. Zheng, L. Yang, J. Chen, J. Han, Y. Zhang, P. Liang, Z. Zhao, C. Wang, and D. Z. Chen, “Biomedical image segmentation via representative annotation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5901–5908.