Coupling weak and strong supervision for classification of prostate cancer histopathology images

11/16/2018 ∙ by Eirini Arvaniti, et al. ∙ ETH Zurich

Automated grading of prostate cancer histopathology images is a challenging task, with one key challenge being the scarcity of annotations down to the level of regions of interest (strong labels), as typically the prostate cancer Gleason score is known only for entire tissue slides (weak labels). In this study, we focus on automated Gleason score assignment of prostate cancer whole-slide images on the basis of a large weakly-labeled dataset and a smaller strongly-labeled one. We efficiently leverage information from both label sources by jointly training a classifier on the two datasets and by introducing a gradient update scheme that assigns different relative importances to each training example, as a means of self-controlling the weak supervision signal. Our approach achieves superior performance when compared with standard Gleason scoring methods.




1 Introduction

Therapeutic decisions for prostate cancer, the second most common cancer in men Cancer_Genome_Atlas_Research_Network2015-aw, are to a large extent determined by the Gleason score Gleason1992-ji; Epstein2016-cj. The Gleason score, assigned by pathologists on the basis of microscopic examination of patient tissue slides, is computed as the sum of the primary and secondary Gleason patterns observed in the tissue. Gleason patterns are local prostate gland formations characteristic of different cancer grades. They are assigned a numeric value in the range 1-5 indicating the grade, with 5 being the highest grade. The Gleason score is the sum of the two most prevalent Gleason patterns, and thus takes a numeric value in the range 2-10. Gleason score assignment is a time-consuming and challenging task even for domain experts, with high inter-pathologist variability rates Gleason1992-ji; Salmo2015-jj; Arvaniti2018-fw due to the need to examine large and heterogeneous areas of tissue. Thus, an automated decision-support solution would be highly desirable.

One of the main obstacles in designing such a solution is the scarcity of high-quality pixel- or ROI-level Gleason pattern annotations, as typically only the overall Gleason score is reported for each patient, e.g. Gleason 6=3+3, Gleason 7=4+3 etc. (a.k.a. weak labels). Recent work on whole-slide image (WSI) Gleason score classification has focused on standard supervised learning Kallen2016-nv; Del_Toro2017-gk and unsupervised domain adaptation Ren2018-rg, always on the basis of Gleason score weak labels available from either The Cancer Genome Atlas (TCGA) Cancer_Genome_Atlas_Research_Network2015-aw or in-house private datasets. Recently, Arvaniti2018-fw published a Tissue Microarray (TMA) dataset with ROI-level Gleason pattern annotations, a complementary resource providing a smaller but strongly-labeled dataset. In this study, we focus on classification of TCGA whole-slide images into low (Gleason score ≤ 7) and high (Gleason score ≥ 8) classes on the basis of the TCGA and TMA datasets, and propose a new approach that efficiently leverages information from the two sources, combining weak and strong labels. Our approach is similar in nature to Dehghani2017-pq, where a confidence network was used to provide weights for controlling the weak supervision signal. Here, we do not have access to a confidence score-generating mechanism and, as an alternative, propose to obtain per-example confidence weights from the target network itself.

2 Methods

Problem formulation. Let D_s denote a strongly-labeled dataset with N_s examples and D_w a weakly-labeled dataset with N_w examples. In our case, the examples in D_s correspond to TMA image patches with local Gleason annotations, and the examples in D_w correspond to WSI image patches that adopt the Gleason label of the entire WSI. The task is to build a classifier on the categories in D_s, using the data in both D_s and D_w. The challenges associated with this task are that (a) the images in D_s and D_w do not follow the same distribution and (b) the labels in D_s and D_w describe the corresponding images at a different level (locally in D_s vs. globally in D_w).

Addressing the covariate shift. A classifier is likely to benefit from additional training data when this new data follows the distribution of the domain of interest. However, the TMA and TCGA datasets were generated by different institutions, which implies possible differences in tissue preparation, staining and digitization. As a first step, we investigated which data transformations/training strategies are necessary for obtaining good predictions on the TCGA dataset (target domain), if we exclusively use the TMA labels (source domain) for training. We considered the following approaches and applied them incrementally (e.g. stain transfer also performs color augmentation etc.):


  • Source only w/o color augmentation. Trains a classifier using exclusively data from the source domain (TMA).

  • Source only w/ color augmentation. Performs random image color perturbations prior to training.

  • Source only w/ stain transfer. For each source domain example, randomly selects an example from the target domain and transfers its stain colors Vahadane2016-fs to the source domain example.

  • Symmetric domain adaptation. Obtains a domain-invariant classifier by jointly training on the source and target domain. We considered methods that encourage domain-invariant features via MMD minimization  Tzeng2014-bm, feature covariance alignment (CORAL) Sun2016-jy and domain adversarial training Ganin2016-gn.
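The paper's stain transfer follows Vahadane et al.; as a rough illustration of the idea only, the sketch below matches per-channel color statistics of a source patch to a randomly chosen target patch (a simplified Reinhard-style transfer, not the paper's actual method), alongside a simple color-jitter augmentation. Function names and the jitter strength are hypothetical.

```python
import numpy as np

def color_transfer(src, tgt):
    """Match per-channel mean/std of src to tgt (simplified Reinhard-style
    transfer). src, tgt: float arrays of shape (H, W, 3) in [0, 1]. The paper
    uses the stain-specific method of Vahadane et al.; this channel-statistics
    matching is only a rough stand-in to illustrate the idea."""
    src, tgt = src.astype(np.float64), tgt.astype(np.float64)
    out = np.empty_like(src)
    for c in range(3):
        mu_s, sd_s = src[..., c].mean(), src[..., c].std() + 1e-8
        mu_t, sd_t = tgt[..., c].mean(), tgt[..., c].std()
        # recenter/rescale source channel to the target channel statistics
        out[..., c] = (src[..., c] - mu_s) / sd_s * sd_t + mu_t
    return np.clip(out, 0.0, 1.0)

def color_jitter(img, rng, strength=0.1):
    """Random per-channel scale/shift, as in the color augmentation baseline."""
    scale = 1.0 + rng.uniform(-strength, strength, size=3)
    shift = rng.uniform(-strength, strength, size=3)
    return np.clip(img * scale + shift, 0.0, 1.0)
```

In a training loop, each source (TMA) patch would be paired with a randomly drawn target (TCGA) patch before being fed to the network.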

Combining weak and strong supervision. As a second step, and assuming we have already selected a strategy for reducing the covariate shift, we investigated how to best integrate the two available data sources. The following approaches were considered:


  • TCGA only, TCGA+TMA. Supervised learning using exclusively the weakly-labeled data (TCGA only), or both the weakly- and strongly-labeled data (TCGA+TMA).

  • MIL-based Weak Supervision (MIL-WS). In the spirit of recent multiple instance learning (MIL)-based approaches in computational pathology Hou2016-ws; Campanella2018-dt, back-propagates only the top-k most confident predictions within each weakly-labeled mini-batch. We set k to a fixed percentage of the mini-batch size.

  • Self-Weighted Weak Supervision (SW-WS). Our proposed approach is described in Algorithm 1. The classifier can be trained using exclusively weak labels (TCGA only) or both weak and strong labels (TCGA+TMA). In either case, the weak supervision signal is weighted by self-computed confidence scores, corresponding to the predicted probability for the correct (weak) label. Therefore, image patches that are not characterized by the overall Gleason score label assigned to their respective WSI may contribute less to the gradient updates.
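A minimal sketch of the MIL-WS selection step, assuming per-example class probabilities are available as a NumPy array; the helper name and the fraction frac are hypothetical, and in practice only the selected examples' losses would be back-propagated.

```python
import numpy as np

def topk_confident(probs, labels, frac=0.5):
    """Select the top-k most confident examples in a weakly-labeled mini-batch
    (MIL-WS sketch). probs: (B, K) predicted class probabilities;
    labels: (B,) weak labels. frac is a hypothetical fraction; the paper sets
    k to a fixed percentage of the mini-batch size. Returns the indices of the
    examples to back-propagate."""
    conf = probs[np.arange(len(labels)), labels]  # confidence in the weak label
    k = max(1, int(round(frac * len(labels))))
    return np.argsort(conf)[::-1][:k]             # most confident first
```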

1: Input: model f with parameters θ, strongly-labeled dataset D_s, weakly-labeled dataset D_w, learning rate η, number of classes K.
2: for each training iteration do
3:     Sample data batches (x_s, y_s) ~ D_s and (x_w, y_w) ~ D_w
4:     Compute model predictions p_s = f(x_s; θ)
5:     Back-propagate strong labels: θ ← θ − η ∇_θ CE(p_s, y_s)
6:     Compute model predictions p_w = f(x_w; θ)
7:     Compute per-example confidence scores c_i = p_w,i[y_w,i]
8:     Back-propagate weak labels: θ ← θ − η ∇_θ Σ_i c_i · CE(p_w,i, y_w,i)
9: end for
Algorithm 1 Training algorithm combining weak and strong labels
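The self-weighting scheme of Algorithm 1 can be sketched in NumPy for a linear softmax classifier; this is an illustrative toy version, not the paper's ResNet18 setup, and it treats the confidence scores as constants (no gradient flows through them), mirroring the confidence-weighted weak update.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sw_ws_step(W, xs, ys, xw, yw, lr=0.1):
    """One SW-WS update for a linear softmax model (toy sketch of Algorithm 1).

    W: (D, K) weights; (xs, ys): strongly-labeled batch; (xw, yw): weakly-
    labeled batch. Weak-label gradients are scaled by c_i = p_i[y_i], the
    model's own confidence in the weak label (held fixed, no gradient)."""
    K = W.shape[1]
    # strong step: standard cross-entropy gradient (grad w.r.t. logits is p - onehot)
    ps = softmax(xs @ W)
    gs = xs.T @ (ps - np.eye(K)[ys]) / len(ys)
    W = W - lr * gs
    # weak step: per-example confidence-weighted cross-entropy gradient
    pw = softmax(xw @ W)
    c = pw[np.arange(len(yw)), yw]          # self-computed confidence scores
    gw = xw.T @ (c[:, None] * (pw - np.eye(K)[yw])) / len(yw)
    return W - lr * gw
```

Low-confidence weak examples (patches whose appearance disagrees with their slide-level label) thus receive down-weighted gradient contributions.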

3 Results

Data preprocessing. Both datasets (Table 1) comprised formalin-fixed paraffin-embedded (FFPE) tissue samples, stained with H&E and digitized at comparable resolutions. We extracted fixed-size image patches and downsized them for model input. For the TMA data, we used all patches from the annotated ROIs. For the WSIs, we used the Blue Ratio transform Del_Toro2017-gk to prioritize patches with a high concentration of cell nuclei and extracted the top 2000 patches per image.
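The Blue Ratio transform is commonly defined as BR = 100·B/(1+R+G) × 256/(1+R+G+B) for 8-bit RGB input; hematoxylin-stained nuclei appear blue, so patches with a high mean BR tend to be nuclei-rich. The sketch below uses this common definition (the paper's exact variant may differ); the helper top_patches is hypothetical.

```python
import numpy as np

def blue_ratio(rgb):
    """Blue Ratio image for an 8-bit RGB array of shape (H, W, 3), using the
    common definition BR = 100*B/(1+R+G) * 256/(1+R+G+B)."""
    r, g, b = (rgb[..., i].astype(np.float64) for i in range(3))
    return (100.0 * b / (1.0 + r + g)) * (256.0 / (1.0 + r + g + b))

def top_patches(patches, n):
    """Rank candidate patches by mean Blue Ratio and keep the top n
    (hypothetical helper illustrating the patch-prioritization step)."""
    scores = np.array([blue_ratio(p).mean() for p in patches])
    return np.argsort(scores)[::-1][:n]
```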

dataset | patches | cases | Gleason low/high | 6 | 7=3+4 | 7=4+3 | 8 | 9-10
TCGA | 300'000 | 447 | 261/186 | 44 | 125 | 92 | 65 | 121
TMA | 25'000 | 886 | 524/362 | 403 | 121 | 226 | 136 | –
Table 1: Summary statistics of the TCGA and TMA datasets.

Model training & evaluation. In all experiments, we used CNNs with the ResNet18 He2016-zi architecture, a categorical cross-entropy loss, Adam Kingma2014-ff optimization, and batch sizes of 32/128 for the TMA/TCGA data, respectively. Model performance was evaluated via 5-fold cross-validation. Within each CV fold, part of the training TCGA data was held out and used for early stopping. For testing, a final score was derived for each WSI as the ratio of predicted high-grade patches over all patches. We report ROC AUC and accuracy for the binary classification task, as well as Kendall's τ correlation coefficient between the WSI rankings produced by (a) the predicted scores and (b) the Gleason score groups (6, 7=3+4, 7=4+3, 8, 9-10) used in clinical practice Epstein2016-cj.
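The slide-level scoring and ranking evaluation can be sketched as follows; slide_score implements the described ratio of predicted high-grade patches, while kendall_tau is a simple tau-a computed by pair counting (the paper may use a tie-corrected variant such as scipy.stats.kendalltau). The 0.5 decision threshold is an assumption.

```python
import numpy as np

def slide_score(patch_probs, threshold=0.5):
    """Slide-level score: fraction of patches predicted high-grade.
    patch_probs: predicted high-grade-class probability per patch.
    The 0.5 threshold is an assumption, not stated in the paper."""
    return float(np.mean(np.asarray(patch_probs) >= threshold))

def kendall_tau(a, b):
    """Kendall tau-a between two rankings, by counting concordant minus
    discordant pairs over all pairs (no tie correction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    num = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            num += np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return num / (n * (n - 1) / 2)
```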

In the covariate shift reduction benchmark (Table 2), we observed that the color-jittering and stain transfer data augmentations improved model predictions on the target domain, whereas additional domain adaptation constraints did not have a big impact. Thus, we adopted the stain transfer and color-jittering augmentations in our subsequent data integration experiments.

 | AUC (stdev) | accuracy (stdev) | Kendall's τ (stdev)
w/o color augm. | 0.738 (± 0.062) | 0.682 (± 0.049) | 0.370 (± 0.078)
w/ color augm. | 0.771 (± 0.042) | 0.721 (± 0.032) | 0.408 (± 0.041)
stain transfer | 0.799 (± 0.034) | 0.734 (± 0.045) | 0.445 (± 0.049)
MMD | 0.786 (± 0.037) | 0.723 (± 0.043) | 0.434 (± 0.046)
CORAL | 0.797 (± 0.030) | 0.741 (± 0.033) | 0.438 (± 0.030)
adversarial | 0.802 (± 0.032) | 0.738 (± 0.023) | 0.447 (± 0.044)
Table 2: Results of the covariate shift reduction benchmark on TCGA WSIs (5-fold CV).

In the data integration experiments (Table 3), we observed that the classifiers utilizing TMA labels performed better than the ones trained exclusively on TCGA weak labels. Furthermore, adding self-weighted weak supervision to the jointly-trained classifier resulted in the best overall performance.

 | AUC (stdev) | accuracy (stdev) | Kendall's τ (stdev)
TCGA only | 0.814 (± 0.057) | 0.754 (± 0.042) | 0.445 (± 0.071)
TCGA only (MIL-WS) | 0.815 (± 0.030) | 0.756 (± 0.032) | 0.446 (± 0.038)
TCGA only (SW-WS) | 0.778 (± 0.055) | 0.734 (± 0.047) | 0.385 (± 0.097)
TCGA+TMA | 0.845 (± 0.065) | 0.799 (± 0.051) | 0.510 (± 0.072)
TCGA+TMA (MIL-WS) | 0.839 (± 0.020) | 0.792 (± 0.011) | 0.504 (± 0.027)
TCGA+TMA (SW-WS) | 0.882 (± 0.024) | 0.848 (± 0.010) | 0.540 (± 0.032)
Table 3: Results of the data integration benchmark on TCGA WSIs (5-fold CV).

4 Discussion

We have presented an approach that efficiently leverages weak and strong supervision signal for histopathology image classification. We believe that other medical image analysis tasks are likely to benefit from similar approaches, as medical expert annotations at the ROI-level are challenging to obtain, whereas patient-level information is often readily available.