Labeled (i.e., annotated) data is critical to the performance of most machine learning approaches. Automated image analysis tasks require vast amounts of annotated data, and the annotation process is laborious and expensive. In tasks involving natural scenes and simple categorization of common objects (e.g., distinguishing cats from dogs in the ILSVRC challenge (Deng et al., 2009)), annotation efforts can be crowdsourced (Crowston, 2012). However, due to the complexity of medical imaging data as well as regulations limiting data sharing, crowdsourcing medical imaging tasks is more challenging (Ørting et al., 2019), and tasks such as breast cancer classification require expert annotations.
In medical image analysis, image annotation can refer to either patch level annotations, where an image patch of a certain size is given a single discrete class label (e.g., cancer or no cancer), or a segmentation mask, where each pixel in an image of an arbitrary size is given a label. While both need to be performed by a trained expert, generating fine boundaries to delineate regions of interest is much more time-consuming. Furthermore, in such tasks it is not uncommon for inter-expert agreement to be quite low, whereas it is less likely for two experts to disagree on the dominant class observed in a patch. In segmentation tasks, training data is assumed to be exhaustively annotated, i.e., each pixel on an image is assigned its correct class. In histopathology, however, errors at the pixel level are inevitable. For example, in annotating slides for the task of ductal carcinoma in situ (DCIS) annotation, pathologists may include background pixels between adjacent ducts in the DCIS region as this makes annotation much faster, or they may misclassify regions of DCIS elsewhere on the WSI. This can lead to training with incorrectly labeled data and, since the training algorithm receives conflicting information, generalization performance will be degraded.
Therefore, it is desirable to combine these two sets of data to expedite the data annotation, and to have more reliable ground truth. With the currently available neural network architectures, however, utilizing data from both patch and pixel level annotations to improve performance is not straightforward. Image level patches cannot be used to train a fully connected segmentation network and, although it is possible to convert segmentation masks to labelled image patches in order to use a classification network, valuable pixel level boundary information will be ignored. At test time, classifying patches on a WSI and tiling them to form a segmentation mask will lead to overestimating the structure. As the training is done using rectangular patches, each region will tend to have rectangular shape, even when a sliding window is used to obtain finer boundaries.
We propose a method to train both a segmentation network (primary task) using a limited number of segmented WSIs, and a classification network (auxiliary task) using labelled image patches. Our method simultaneously trains for the primary and auxiliary tasks in order to force the network to learn features useful for each task. The aim is to utilize the rich information content in (easier to obtain) patch level images to generate a feature representation that is then “decoded” into a segmentation mask. In order to prevent overfitting to either task, both tasks share a pathway (network parameters or weights) which encodes the feature representation. With the help of the additional data used to train the auxiliary task, the pathway may encode additional useful information into the representation, which may not have been possible with just the limited primary task dataset.
1.1 Related work
Unsupervised, self-supervised, and weakly supervised learning methods that attempt to reduce the burden of annotation by incorporating unlabeled data have been proposed previously.
Unsupervised learning refers to categorizing unlabeled data without supervision to form clusters that correlate with the desired task objective (e.g., clustering chest X-rays into tuberculosis vs. healthy). Ahn et al. (2019) proposed a hierarchical unsupervised feature learning method using a sparse convolutional kernel network to identify invariant characteristics on X-ray images by leveraging the sparsity inherent to medical image data. In histopathological image analysis, sparse autoencoders have been utilized for unsupervised nuclei detection (Xu et al., 2015; Hou et al., 2019), and for more complex tasks such as cell-level classification, nuclei segmentation, and cell counting, Generative Adversarial Networks (GANs) have also been employed (Hu et al., 2018).
In self-supervised learning, the aim is to use the raw input data to generate artificial and cost-free ground truth labels in order to train a network, which can then be used as initialization for training on a separate task with limited data. Noroozi and Favaro (2016) and Gidaris et al. (2018) exploited the spatial ordering of a natural scene image to generate labels. This form of training was also adopted in medical image analysis, where Taleb et al. (2019) spatially reordered multi-organ MR images to train an auxiliary reordering self-supervision network, and used it to train a network for tasks such as brain tumor segmentation with limited data. Tajbakhsh et al. (2019) utilized image reconstruction, rotations, and grayscale to RGB colorization schemes as supervision signal for chest computed tomography images. Chen et al. (2019) used context restoration of images from different modalities including 2D ultrasound images, abdominal organs in CT images, and brain tumours in multi-modal MR images.
We consider techniques where labeled data is insufficient in amount, or where labels are noisy or inaccurate, or where annotations do not directly correspond to the task at hand (e.g., using coarse, image-level annotations to train a semantic segmentation network) under weakly supervised learning. Li et al. (2019) and Qu et al. (2019) trained models with sparse sets of annotations, consisting only of single-pixel annotations, to perform mitosis segmentation, whereas Yang et al. (2018) utilized a coarse bounding box to perform gland segmentation in histology images. Similarly, Bokhorst et al. (2019) utilized a U-Net (Ronneberger et al., 2015) and sparsely annotated various structures of interest in colorectal cancer patient WSIs to achieve semantic segmentation. In an effort to combine classification with segmentation, Mehta et al. (2018) trained a network where they tiled patch level classification outputs with the segmentation outputs to refine the segmentation output. Wong et al. (2018) trained a U-Net segmentation network, and used its encoder’s features to fine-tune a convolutional network on a nine-class cardiac classification task that showed high performance in a low data setting, compared to a network trained from scratch.
Current unsupervised methods are not capable of processing complex images, and generally cannot be applied to images larger than pixels, whereas in histopathological images dimensions are much larger. Similarly, most of the existing self-supervised techniques are not applicable to histopathological images, since structures in these images are elastic and may form infinitely many valid groupings (e.g., a fat cell can be next to, above, below, or be surrounded by stroma). In contrast, weakly supervised learning methods we have reviewed have been previously successful in various digital histopathology tasks including classification and segmentation, and are capable of working with larger images sizes. Therefore in this work, we focus on a method that is based on weakly supervised learning, as opposed to tackling segmentation with limited data with unsupervised or self-supervised techniques.
In this work, we propose a simple architecture to alleviate the burden on the annotator by combining data acquired from two different processes: classification labels on patches, and segmentation masks from either whole slide images or image patches. We consider our method a form of weakly supervised learning where the training data includes different levels of information, namely patch-level and pixel-level. We believe our work is clinically relevant and can help expedite the process of data acquisition, bridging the gap between advancements in deep learning, which rely heavily on data, and clinical applicability. Our contributions are listed as follows.
A simple modification to the existing Resnet architecture to perform segmentation and classification simultaneously, while leveraging easier-to-label classification patches to improve segmentation performance with small amounts of labeled segmentation data,
Two data preprocessing techniques that aim to alleviate the class imbalance problem (mainly due to the dominant background) observed in whole slide images in digital pathology: a simple background thresholding procedure in HSV color space, and a method to extract ground truth segmentation mask more efficiently for faster and more reliable training performance.
2.1 Data processing
We propose two data preprocessing techniques that can be used in segmentation tasks on WSIs. Our aim is to remove healthy tissue that is not relevant to tasks such as cancer segmentation by thresholding, and to alleviate class imbalance observed in WSIs, using a novel ground truth extraction technique. For further details, see Appendix B.
Our architecture is summarized in Fig. 1. We use the third layer output (out of a total of four layers) of the Resnet-18 architecture (He et al., 2016) for encoding input images, and use the architectures given in Tables 1 and 2 for decoding (i.e., generating the segmentation mask) and classification, respectively. Note that the classification path goes through the segmentation network. This design was chosen to first obtain a segmentation output for the classification images, and use this prediction to infer the classification label. We reason that the segmentation network should be partially capable of segmenting structures even in low data settings. Then, a transformation network (the classification module) is used to aggregate the information in this prediction to infer a class. Using the image level training data, the network is trained to discard or modify incorrect class assignments in segmentation outputs while keeping and reinforcing correct assignments, by modifying the feature representation of input images.
An overview is given in Fig. 2. For pixel level images, the network is trained with input images and segmentation masks using the standard backpropagation algorithm on the segmentation network (encoder+decoder) without passing through the classification layer. The data with only image level labels are also passed through the segmentation network to obtain a segmentation mask output, which is then transformed by the classification layer to obtain the classification output vector, whose length is the number of classes. This vector is used for backpropagation with cross entropy loss as an error signal to update segmentation network weights to correct the segmentation mask for the given image. The middle part in Fig. 2 depicts the case where the network incorrectly assigns DCIS (green) to the large region. Given the ground truth label is benign, this assignment is invalid, and the errors are corrected during training with backpropagation.
2.3 Implementation details
We use the Adam optimizer with a learning rate of 0.0001, a batch size of 20, and a weighted cross entropy loss function with weights proportional to the pixelwise class distributions among training images. Cross entropy loss is applied pixelwise for segmentation, and per item for classification. We use stain normalization (Macenko et al., 2009) based on a single reference image selected from the training set, and simultaneously optimize the classification loss and the segmentation loss.
For our experiments, we use the BreAst Cancer Histology images (BACH) dataset from a grand challenge in breast cancer classification and segmentation (Aresta et al., 2019), where the dataset is composed of both microscopy and whole slide images. The challenge is split into two parts, A and B. For part A, the aim is to classify each microscopy image into four classes, normal tissue, benign tissue, ductal carcinoma in situ (DCIS), and invasive carcinoma, whereas in part B, the task is to predict the pixelwise labeling of WSI into same four classes, i.e., the segmentation of the WSI. The classes are mapped to labels 0, 1, 2, 3 for the classification task, and white, red, green, blue in the following figures, representing normal, benign, DCIS, and invasive classes in both tasks, respectively.
The dataset consists of 400 training and 100 test microscopy images, where each image has a single label, and 20 labeled WSIs with segmentation masks (split into 10 training and 10 testing images), in addition to 20 unlabeled WSIs with possible pathological sites. Microscopy images are provided in RGB .tiff format with dimensions of 2048 × 1536 pixels and a pixel scale of 0.42 µm × 0.42 µm. The WSIs are provided in .svs format, with a pixel scale of 0.467 µm/pixel, and with variable width (in the range [39980, 62952] pixels) and height (in the range [27972, 44889] pixels) per WSI. Both the microscopy images and the WSIs are annotated by two experts. The test data for the microscopy images were collected from a completely different set of patients, in an effort to prevent reliance on case or institution specific features that may correlate with the classes of the patches.
We report the metrics defined by Equations 1-5, in addition to foreground accuracy (obtained from pixels with only non-normal ground truth labels), class-wise F1 scores for foreground classes, and the score function given in Eqn. 6, as suggested by Aresta et al. (2019) to account for the dominant background observed in WSIs. Class imbalance can obscure the differences between methods for popular metrics such as accuracy or intersection over union (IoU). However, as these metrics are commonly used for medical imaging tasks involving WSIs, we include them in our results. We report two variants of the F1 score, called the macro and micro F1. Both metrics are calculated class-wise: the macro variant weighs each class-wise score equally, whereas the micro variant accounts for class imbalance by weighing the scores according to the ground truth ratios on the WSI. These metrics are equivalent when the F1 measure is calculated on a two-class problem, as in figures 2(g)-2(i).
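The two F1 variants described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the evaluation code used in the experiments: the function names are hypothetical, labels are assumed to be flat lists of integers, and the ratio-weighted variant simply follows the verbal description (class-wise scores weighted by ground truth frequency).

```python
from collections import Counter


def classwise_f1(y_true, y_pred, classes):
    """Return {class: F1}, computed one-vs-rest for each class."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the class-wise F1 scores."""
    scores = classwise_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)


def ratio_weighted_f1(y_true, y_pred, classes):
    """Class-wise F1 scores weighted by each class's ground truth share."""
    scores = classwise_f1(y_true, y_pred, classes)
    counts = Counter(y_true)
    total = sum(counts[c] for c in classes)
    if total == 0:
        return 0.0
    return sum(scores[c] * counts[c] / total for c in classes)


# Toy example: labels 0..3, evaluated on the foreground classes only.
y_true = [0, 0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2]
foreground = [1, 2, 3]
print(macro_f1(y_true, y_pred, foreground))
print(ratio_weighted_f1(y_true, y_pred, foreground))
```

In the toy example the rare class 3 is never predicted, which pulls the macro score down more than the ratio-weighted one, since the latter gives that class only a quarter of the total weight.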
Eqn. 6 ignores correctly predicted background (or normal) classes and penalizes misclassified background and foreground classes in a weighted manner. For instance, classifying normal regions (class label 0) as benign (1) incurs less penalty compared to classifying them as invasive (3), and vice versa. The abbreviations TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative. The ground truth and prediction classes are defined per (linearly indexed) pixel, and their binarized counterparts indicate whether the value on that pixel is a foreground class (i.e., 1 if the ground truth/prediction class is not background, and 0 otherwise). The remaining symbols denote the number of classes (4 in our experiments), the number of pixels in the WSI, the cardinality of a set, and a small constant to avoid division by zero.
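To make the behavior of such a score concrete, the following sketch implements one function with the properties described above. It is an assumption-laden illustration: the exact form of Eqn. 6 is not reproduced in this text, so the normalization used here (distance to the farthest possible label, counted only where ground truth or prediction is foreground) and the name `bach_style_score` are hypothetical choices that merely match the verbal description.

```python
def bach_style_score(gt, pred, max_label=3, eps=1e-6):
    """Score in which larger label distances are penalized more heavily.

    gt, pred: flat sequences of integer labels (0 = normal/background).
    Pixels where both ground truth and prediction are background are ignored.
    ASSUMPTION: the normalization below is illustrative, not taken from Eqn. 6.
    """
    numerator = 0.0
    denominator = 0.0
    for g, p in zip(gt, pred):
        g_bin = 1 if g > 0 else 0  # ground truth is a foreground class
        p_bin = 1 if p > 0 else 0  # prediction is a foreground class
        either_foreground = 1 - (1 - g_bin) * (1 - p_bin)
        numerator += abs(p - g)
        # Worst possible label distance for this pixel, counted only when
        # at least one of ground truth / prediction is foreground.
        denominator += max(abs(g - 0), abs(g - max_label)) * either_foreground
    return 1.0 - numerator / (denominator + eps)


# Normal tissue mistaken for benign is penalized less than for invasive.
normal = [0, 0]
print(bach_style_score(normal, [1, 0]))  # mild error
print(bach_style_score(normal, [3, 0]))  # severe error
```

A perfect prediction, including one that is entirely background, yields a score of 1, and the small constant keeps the all-background case from dividing by zero.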
3.3 Experimental setting
We use 3 WSIs from the part B (converted into segmentation patches) and the complete dataset from the part A (classification patches) of the BACH challenge for training, 1 WSI that includes all four classes as the validation set and evaluate our method on the remaining 6 WSIs. We conduct a total of 195 experiments, each run for 100 epochs. We useof segmentation patches for S (only segmentation), and of classification patches in the case of S+C (segmentation + classification) experiments, where . In addition, we use 100% classification patches and of segmentation patches in S+C* setting, to examine if addition of classification patches improve or degrade the performance. For , we only use classification patches, hence for the S setting, the network predicts random outputs. We run 5 iterations per percentage value, randomly picking the same of training data for S, S+C, and S+C*, each time to accurately reflect the performance for each setting. In each iteration, we use 1 WSI for validation interchangeably to be able to test on all non-training WSIs. For instance for iteration 1, we pick WSI#2 as the validation image to pick the best epoch (out of 100), and evaluate the performance on the remaining 6 WSIs. The training WSIs are fixed in all iterations.
4 Results and Discussion
are mean values obtained from 5 experiments, and the shaded regions represent the standard deviations. The horizontal axis indicates the, whereas the vertical axis is the relative performance with respect to the performance at , the case where we only use segmentation patches. We chose this presentation as the absolute values are not as significant as difference in performance as we use more segmentation patches.
We achieve a large improvement in performance when the percentage of segmentation level patches are low compared to the patch level images, and performances of the two methods converge as we increase the number of segmentation patches. This is desirable since we are able to achieve similar performance with data acquired much more cheaply. More specifically, in almost all metrics, our method (S+C or S+C*) achieves peak performance (w.r.t. the in S setting) around , whereas the S setting achieves around these percentages. In addition, if we keep the classification patches while increasing the amount of segmentation patches (S+C*), we still observe gains, indicating that the method can work either in low or high data settings. We also observe less variation in performance with our method, since the use of more data acts as a regularizer. This can also be visually observed in Fig. 7, where the predictions are more stable as we force a more general feature representation with image level data that prevents radical changes in decisions when the network is trained with more data. For instance, predictions of the S setting exhibit large variability in close spatial proximity, such as the blue to green class change between neighboring regions that visually appear similar on the whole slide image. In contrast, in S+C setting such variability is less apparent, as the trained network is able to extract semantic information more robustly to prevent inconsistent decisions for visually and contextually similar regions.
Finally, the proposed method is not capable of correcting errors if the incorrect annotations are of the same class as the image level label for the given patch, or if there are multiple incorrect annotations on a predicted patch. In our experiments, we observed that the probability of making this type of error decreases as the training set size increases. In addition, we observed that the data collection process is crucial to the success of the pipeline. The model is able to distinguish between the background and the foreground given only a few segmentation patches, by learning low-level features (e.g., edge or blob extractors) to outline the boundaries of the foreground structure, however, the correct class assignments of the segmented regions are more challenging compared to the background and foreground separation. Since the error signal from the classification layer does not specify which pixels should be corrected, image level data should be collected where the majority of the pixels on the patch are either background or the foreground class assigned to that patch.
In this paper, we presented a method that can expedite medical image annotation: only a few segmentation level patches need to be labeled, while the majority of the training data is composed of easier-to-label image level patches. With the currently available architectures, training with two types of patches is not possible, hence we hope that our work can help researchers in expediting medical imaging tasks, and can be used as a baseline for improving techniques that amalgamate different types of training data. We validated the efficacy of our method in settings where we have a large imbalance between segmentation and image level patches. Our method can be used to expedite tasks at the data acquisition stage, or it can be used for utilizing previously acquired data that only includes image level patches for segmentation tasks by drawing boundaries for a few samples from each class in the dataset such as BreakHis cancer classification task (Spanhol et al., 2015). Finally, we acknowledge the shortcomings of our method, namely failure to separate the feature representation of background and the foreground when training patches include very small regions of interest, and the inability to train for large patches (e.g., pixels) that is critical in breast histopathology, and aim to address these issues in future work.
Appendix A Architectures
We use the third layer output (out of a total of four layers) of the Resnet-18 architecture (pretrained on ILSVRC dataset) for encoding input images due to the small input size ( pixels) we use in our experiments, as opposed to the default input size for Resnet ( pixels). The classification and segmentation modules are given below, where we replace the fully connected linear layers with convolutions.
Table 1: Segmentation (decoding) module.
|Conv2d, no bias|
|ReLU activation layer|
|2d upsampling|
|Conv2d, stride 1, padding 0, with bias|
|Spatial (nearest) interpolation|
Table 2: Classification module.
|Adaptive 2d average pooling|
|Conv2d, stride 1, padding 0, no bias|
|ReLU activation layer|
|Conv2d, stride 1, padding 0, no bias|
Appendix B Data preprocessing
B.1 Background removal
Given the segmentation mask of the WSI, one can build training data by extracting square patches of images and their corresponding ground truth masks as inputs to the training algorithm. A common method is to slide a window over the WSI to get the patches. This method, however, is very prone to causing imbalance issues. Insufficient training data and class imbalance are common problems in medical imaging. The problem is exacerbated in histopathology, and specifically WSIs, where image dimensions are extremely large, yet the labeled regions of interest can occupy less than 1% of the image. Sliding a window over the whole image will likely generate many patches that only contain white background or stroma, and very few patches that contain non-background regions (e.g., invasive cancer). To alleviate this, we first generate a “foreground mask” by applying a binary threshold to the saturation channel of the WSI in HSV color space. Specifically, pixels below 10% of the maximum possible saturation value are considered background. Then, the holes in the resulting mask are filled, as some of the ducts that are surrounded by stroma can be regions of interest (e.g., may contain ductal carcinoma in situ). Finally, a morphological opening operation is applied to remove the remaining regions that do not contain a consistently large foreground region (i.e., salt and pepper noise). The whole process is visualized in Fig. 4. This foreground mask is used to discard patches that contain less than 75% foreground (the white pixels shown in Fig. 4).
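The thresholding and patch filtering steps can be illustrated with a small, dependency-free sketch. It assumes the image is a nested list of 8-bit RGB tuples, and the function names are hypothetical. The hole filling and opening steps are deliberately omitted; in practice they could be done with routines such as `scipy.ndimage.binary_fill_holes` and `scipy.ndimage.binary_opening`, which is a suggestion and not necessarily what was used here.

```python
import colorsys


def foreground_mask(rgb_image, sat_threshold=0.10):
    """Mark pixels whose HSV saturation is at least 10% of the maximum.

    rgb_image: list of rows, each row a list of (r, g, b) tuples in 0..255.
    Returns a list of rows of 0 (background) / 1 (foreground).
    """
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            _, saturation, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            mask_row.append(1 if saturation >= sat_threshold else 0)
        mask.append(mask_row)
    return mask


def foreground_fraction(mask, top, left, size):
    """Fraction of foreground pixels in the size x size patch at (top, left)."""
    total = sum(
        mask[y][x]
        for y in range(top, top + size)
        for x in range(left, left + size)
    )
    return total / (size * size)


def keep_patch(mask, top, left, size, min_foreground=0.75):
    """Discard patches that contain less than 75% foreground."""
    return foreground_fraction(mask, top, left, size) >= min_foreground


WHITE = (255, 255, 255)  # slide background, saturation 0
PINK = (200, 100, 150)   # stained tissue, saturation 0.5
image = [[WHITE, PINK], [PINK, PINK]]
mask = foreground_mask(image)
print(mask, keep_patch(mask, 0, 0, 2))
```

Thresholding on saturation rather than brightness is what lets lightly stained tissue survive while the nearly colorless slide background is removed.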
B.2 Ground truth extraction
In order to prevent a large class imbalance between foreground and background classes, we propose centered ground truth extraction. Unlike sliding a window with an arbitrary step size over the image, we use the center coordinates of each labeled connected component. Then each patch is extracted with endpoints , where are and coordinates of the center of the patch, respectively, and is the square side length of the patch. In our experiments, we use a patch size 128, and we perform experiments in magnification of the WSI. The differences between sliding window approach and ours can be viewed in Fig. 5, where we center on the green region (DCIS), whereas the sliding window is moved with a constant step size. Former is prone to extracting regions with vast background, and may only include small parts of regions of interest. As neural networks rely heavily on boundaries and context of the desired structure, it is hard for a network to learn from an incomplete picture. One may argue that this method will undersample the background regions, however even with this method, our dataset (number of total pixels in all patches) contains more than 90% background.
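A minimal sketch of centered extraction is given below, assuming a binary mask stored as nested lists. Several details are assumptions made for illustration: the component center is taken to be the centroid of its pixels, connectivity is 4-neighbor, and boxes that would cross the image border are shifted back inside. The function names are hypothetical.

```python
from collections import deque


def component_centers(mask):
    """Return the (row, col) centroid of each 4-connected foreground region."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    centers = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search collects all pixels of this component.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < height and 0 <= nx < width:
                        if mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            cy = sum(p[0] for p in pixels) / len(pixels)
            cx = sum(p[1] for p in pixels) / len(pixels)
            centers.append((cy, cx))
    return centers


def centered_box(center, size, height, width):
    """Box (top, left, bottom, right) of side `size` centered on `center`.

    Bottom/right are exclusive. ASSUMPTION: the box is shifted, not cropped,
    when it would otherwise extend past the image border.
    """
    cy, cx = center
    top = int(cy + 0.5) - size // 2
    left = int(cx + 0.5) - size // 2
    top = max(0, min(top, height - size))
    left = max(0, min(left, width - size))
    return top, left, top + size, left + size


# A 6 x 6 mask with one 2 x 2 labeled region in the middle.
label_mask = [[0] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (2, 3):
        label_mask[y][x] = 1
centers = component_centers(label_mask)
print(centers, centered_box(centers[0], 4, 6, 6))
```

Because each box is anchored on a labeled region, every extracted patch contains the structure of interest together with its surrounding context, which is the property a sliding window with a fixed step cannot guarantee.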
If the region of interest is larger than our predefined patch size, we split the region into equal areas with the K-Means algorithm, using the foreground coordinate pairs as inputs, and iterate the above procedure on each center. The number of “clusters”, or centers, is determined by a rule based on the ceiling function. The output of this process is visualized in Fig. 6.
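The splitting step can be sketched with a plain implementation of Lloyd's K-Means algorithm on pixel coordinates. Note the assumptions: the rule for the number of clusters is not recoverable from this text, so `num_clusters` uses a hypothetical stand-in (region area divided by patch area, rounded up), and the deterministic initialization is chosen only to make the example reproducible.

```python
import math


def num_clusters(region_area, patch_size):
    """ASSUMPTION: hypothetical rule, ceil(region area / patch area)."""
    return max(1, math.ceil(region_area / (patch_size * patch_size)))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on 2D points; returns the k cluster centers."""
    pts = sorted(points)
    step = len(pts) / k
    # Deterministic initialization: evenly spaced picks from the sorted points.
    centers = [pts[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for a, b in pts:
            nearest = min(
                range(k),
                key=lambda i: (a - centers[i][0]) ** 2 + (b - centers[i][1]) ** 2,
            )
            groups[nearest].append((a, b))
        new_centers = []
        for j, group in enumerate(groups):
            if group:
                mean_a = sum(p[0] for p in group) / len(group)
                mean_b = sum(p[1] for p in group) / len(group)
                new_centers.append((mean_a, mean_b))
            else:
                new_centers.append(centers[j])  # keep an empty cluster in place
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# A 6 x 4 foreground region (24 pixels) with a patch side of 4.
region = [(x, y) for x in range(6) for y in range(4)]
k = num_clusters(len(region), 4)
print(k, kmeans(region, k))
```

Each returned center can then be passed to the same centered patch extraction used for smaller regions, so large structures are covered by several patches instead of one truncated view.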
Appendix C Example segmentation results
We present output segmentation masks for settings with , and compare S, S+C and S+C* in Fig. 7. We omit sample outputs for settings with even though our method remains comparable or better to S, since beyond a threshold, performance improvements remain marginal when the additional time spent for annotating a segmentation vs. classification patch is considered. Whereas the S setting only is able to achieve reasonable performance at , S+C consistently performs well under . S+C is also less prone to make mistakes, such as labeling two very close regions that look visually similar as different classes, which is an indication of noisy feature learning that suffers from overfitting.
Conflict of interest
We have no conflict of interest to declare.
This work was funded by Canadian Cancer Society (grant #705772) and NSERC RGPIN-2016-06283.
- Convolutional sparse kernel network for unsupervised medical image analysis. Medical Image Analysis 56, pp. 140–151.
- BACH: grand challenge on breast cancer histology images. Medical Image Analysis.
- Learning from sparsely annotated data for semantic segmentation in histopathology images. In Proceedings of the 2nd International Conference on Medical Imaging with Deep Learning, Vol. 102, pp. 84–91.
- Self-supervised learning for medical image analysis using image context restoration. Medical Image Analysis 58, pp. 101539.
- Amazon mechanical turk: a research tool for organizations and information systems scholars. In Shaping the Future of ICT Research. Methods and Approaches, A. Bhattacherjee and B. Fitzgerald (Eds.), Berlin, Heidelberg, pp. 210–221.
- ImageNet: a large-scale hierarchical image database. In CVPR09.
- Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
- Deep residual learning for image recognition. pp. 770–778.
- Sparse autoencoder for unsupervised nucleus detection and representation in histopathology images. Pattern Recognition 86, pp. 188–200.
- Unsupervised learning for cell-level visual representation in histopathology images with generative adversarial networks. IEEE Journal of Biomedical and Health Informatics 23 (3), pp. 1316–1328.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. pp. 448–456.
- Weakly supervised mitosis detection in breast histopathology images using concentric loss. Medical Image Analysis 53, pp. 165–178.
- A method for normalizing histology slides for quantitative analysis. In 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 1107–1110.
- Y-net: joint segmentation and classification for diagnosis of breast biopsy images: 21st international conference, granada, spain, september 16–20, 2018, proceedings, part ii. pp. 893–901.
- Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
- A survey of crowdsourcing in medical image analysis. ArXiv abs/1902.09159.
- Weakly supervised deep nuclei segmentation using points annotation in histopathology images. In International Conference on Medical Imaging with Deep Learning, pp. 390–400.
- U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241.
- A dataset for breast cancer histopathological image classification. IEEE Transactions on Biomedical Engineering 63 (7), pp. 1455–1462.
- Surrogate supervision for medical image analysis: effective deep learning from limited quantities of labeled data. In ISBI 2019 - 2019 IEEE International Symposium on Biomedical Imaging, Proceedings - International Symposium on Biomedical Imaging, pp. 1251–1255.
- Multimodal self-supervised learning for medical image analysis.
- Building medical image classifiers with very limited data using segmentation networks. Medical Image Analysis 49, pp. 105–116.
- Stacked sparse autoencoder (ssae) for nuclei detection on breast cancer histopathology images. IEEE Transactions on Medical Imaging 35 (1), pp. 119–130.
- Boxnet: deep learning based biomedical image segmentation using boxes only annotation. arXiv preprint arXiv:1806.00593.