UniToPatho, a labeled histopathological dataset for colorectal polyps classification and adenoma dysplasia grading

01/25/2021 ∙ by Carlo Alberto Barbano, et al. ∙ 0

Histopathological characterization of colorectal polyps allows clinicians to tailor patients' management and follow-up, with the ultimate aim of avoiding or promptly detecting an invasive carcinoma. Colorectal polyp characterization relies on the histological analysis of tissue samples to determine the polyp's malignancy and dysplasia grade. Deep neural networks achieve outstanding accuracy in medical pattern recognition; however, they require large sets of annotated training images. We introduce UniToPatho, an annotated dataset of 9536 hematoxylin and eosin (H&E) stained patches extracted from 292 whole-slide images, meant for training deep neural networks for colorectal polyp classification and adenoma grading. We present our dataset and provide insights on how to tackle the problem of automatic colorectal polyp characterization.




1 Introduction

The demand for gastrointestinal histopathology is on the rise [gonzalez2020updates], fostered by the spread of cancer screening programs. Gastrointestinal histopathologists inspect tissue samples collected during colonoscopies, looking for hints that can predict the onset of invasive carcinoma [bevan2018colorectal]. Colorectal polyps are pre-malignant lesions found in the intestinal mucosa that pathologists analyze to i) ascertain the polyp type (hyperplastic, adenoma) and ii) assess the dysplasia grade in the case of adenomas. Examination of colorectal polyps represents a large share of the histopathologists' workload, thus methods for automating these tasks are highly sought after. Despite such clinical relevance, the concordance rate in the diagnostic assessment of colorectal polyps, even among expert pathologists, is far from optimal [denis2009diagnostic, mollasharifi2020interobserver]. Although the distinction between non-adenomatous and adenomatous tissue is usually reliable, the inter-observer agreement on histological types and dysplasia grades is sub-optimal. For instance, the concordance in assessing a tubulo-villous polyp or low-grade dysplasia was around 70% [denis2009diagnostic].

Deep learning-based methods have shown promising results towards automating the pathologists' work [Janowczyk16]. Korbar et al. [korbar2017deep] present a patch-based framework, developed using a ResNet architecture [he2016deep], to classify different types of colorectal polyps from whole-slide images. Their work provides empirical evidence that residual architectures are better suited to this task. Wei et al. [wei2020evaluation] propose an analysis model for annotated tissue samples and perform a study on the generalization of neural models to external medical institutions. Their work describes a hierarchical evaluation mechanism to extend the classification of tissue fragments to the entire slide. Song et al. [song2020automatic] propose a patch-based fully-convolutional approach for the classification and grading of adenomas, with a strong focus on model interpretability. They also highlight how different patch sizes should be used for adenoma classification and grading.
However, the scarcity of sufficiently large and suitably labeled datasets represents a major hurdle for training deep learning-based algorithms to predict polyp type and adenoma dysplasia grade.

This work provides the following contributions towards automatic colorectal polyps characterization, in the framework of the DeepHealth [DeepHealth] project.
First, we make available UniToPatho, a high-resolution annotated dataset of Hematoxylin and Eosin (H&E)-stained colorectal images. UniToPatho enables training deep neural networks to classify different colorectal polyp types and to grade adenomas. We make available our annotated dataset as a collection of high-resolution patches extracted at different scales.
Second, we show that the direct application of a deep neural network fails to classify both the tissue type and adenoma dysplasia grade.
Lastly, we propose a multi-resolution deep learning approach that addresses these issues and reaches a balanced accuracy of 0.67 in the characterization of colorectal polyps.

2 The UniToPatho dataset

                    HP   NORM  TA.HG  TA.LG  TVA.HG  TVA.LG  Total
Slides              41     21     26    146      20      38    292
Patches, 7000 µm    59     74     98    411      93     132    867
Patches, 800 µm    545    950    454   3618     916    2186   8669
Patches, total     604   1024    552   4029    1009    2318   9536
Table 1: UniToPatho class distribution for whole-slide images (top) and the two patch scales made available (bottom).
Figure 1: Example of 800×800 µm patches for the six UniToPatho colorectal polyp classes: (a) NORM, (b) TA.LG, (c) TA.HG, (d) HP, (e) TVA.LG, (f) TVA.HG.

UniToPatho is a dataset of annotated high-resolution H&E-stained images, comprising different histological samples of colorectal polyps, collected from patients undergoing cancer screening. The dataset is a collection of the most relevant patch images extracted from whole-slide images (simply slides in the following), in accordance with UniTo pathologists' evaluation. The slides are acquired through a Hamamatsu Nanozoomer S210 scanner at 20× magnification (0.4415 µm/px), as exemplified in Fig. 1. Each slide belongs to a different patient and is annotated by expert UniTo pathologists, according to six classes as follows:

NORM - Normal tissue
HP - Hyperplastic Polyp
TA.HG - Tubular Adenoma, High-Grade dysplasia
TA.LG - Tubular Adenoma, Low-Grade dysplasia
TVA.HG - Tubulo-Villous Adenoma, High-Grade dysplasia
TVA.LG - Tubulo-Villous Adenoma, Low-Grade dysplasia

Hyperplastic polyps usually exhibit no malignant potential [Tseung2005RobbinsAC], while adenomas are more likely to progress into invasive carcinomas. Tubular and tubulo-villous are common colorectal adenomas, with villous adenomas generally presenting higher malignant potential given the larger surface [Tseung2005RobbinsAC]. Adenomas are associated with a grade of dysplasia, which measures the abnormality in cellular growth and differentiation [NCI]. Higher grade dysplasia indicates higher malignant potential.

We split the slides into a train set and a test set with a 70% to 30% ratio, resulting in 204 slides used for training and 88 for testing. Therefore, each slide is represented either in the train set or in the test set, but not in both. Following the approach in [song2020automatic, kather2018], we release square patches cropped from the slides. From each slide, we crop multiple non-overlapping square patches at different scales. We refer to the side of the underlying physical area, measured in µm, as the patch scale, which we vary from 100 to 8000. The number of patches obtained for each slide hence depends on multiple factors, including the patch scale, the slide size and the polyp type. As is common in similar datasets, the classes are highly unbalanced. We make publicly available a total of 9536 patches, of which 867 are extracted at 7000 µm (15855×15855-pixel patches) and 8669 at 800 µm (1812×1812-pixel patches). These are the sizes that will be used in Sec. 4. Tab. 1 provides a summary of the class distribution for the whole slides and the released patches.
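The relation between physical patch scale and pixel size, and the slide-level split, can be sketched as follows. This is a minimal illustration rather than the procedure actually used to build the dataset: it assumes a scanner resolution of 0.4415 µm/px and a plain (non-stratified) random split, and the names patch_side_px and split_slides are hypothetical.

```python
import random

# Assumed scanner resolution in micrometres per pixel (an assumption of this sketch).
UM_PER_PX = 0.4415


def patch_side_px(scale_um, um_per_px=UM_PER_PX):
    """Side in pixels of a square patch whose physical side is scale_um micrometres."""
    return round(scale_um / um_per_px)


def split_slides(slide_ids, test_fraction=0.3, seed=0):
    """Split at slide level, so that no slide contributes patches to both sets.

    A plain random split is assumed here; the original split may have been
    built differently (for example, stratified by class).
    """
    ids = sorted(slide_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]  # (train, test)


train_ids, test_ids = split_slides(range(292))
print(patch_side_px(800), patch_side_px(7000), len(train_ids), len(test_ids))
```

Splitting by slide rather than by patch matters because patches from the same slide share patient and staining characteristics; mixing them across the two sets would leak information into the test set.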

3 Preliminary Analysis

We perform our preliminary experiments on the UniToPatho dataset with a baseline strategy derived from state-of-the-art methods [korbar2017deep, wei2020evaluation]. The method described in the next section builds upon the lessons learned during these experiments.

First, we randomly color-jitter the patches described in the previous section to augment the dataset, as proposed in [wei2020evaluation, korbar2017deep]. Next, as in [korbar2017deep, wei2020evaluation], we train a deep convolutional neural network for image classification on the patches belonging to the train slides. Namely, we train an ImageNet-pretrained ResNet-18 with SGD for 50 epochs, with an initial learning rate of 0.01 decayed by a factor of 0.1 every 20 epochs. The output layer is resized to match the UniToPatho 6-class classification problem. Each patch is downsampled to the standard 224×224-pixel ImageNet size prior to being fed into the ResNet. We repeat the whole training process for different patch scales, chosen in the [100, 8000] µm range. For each scale, the trained network is eventually tested on the patches extracted at the same scale from the test slides.
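The learning-rate schedule just described can be written down explicitly. The sketch below is framework-independent and only reproduces the stated numbers (initial rate 0.01, factor 0.1, step of 20 epochs, 50 epochs in total); the function name step_lr is hypothetical.

```python
def step_lr(epoch, base_lr=0.01, gamma=0.1, step=20):
    """Learning rate at a given (0-based) epoch under step decay."""
    return base_lr * gamma ** (epoch // step)


# Learning rate for each of the 50 training epochs.
schedule = [step_lr(epoch) for epoch in range(50)]

# Epochs 0-19 use 0.01, epochs 20-39 use 0.001, epochs 40-49 use 0.0001.
for epoch in (0, 19, 20, 39, 40, 49):
    print(epoch, f"{schedule[epoch]:.4f}")
```

In PyTorch, for example, the same behaviour is obtained with torch.optim.lr_scheduler.StepLR using step_size=20 and gamma=0.1; whether the original experiments used that scheduler is not stated.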

                        Patch scale [µm]
Type           100    800    1500   4000   7000   8000
BA (6-class)   0.40   0.45   0.46   0.41   0.37   0.38
NORM           0.70   0.66   0.72   0.76   0.78   0.71
HP             0.81   0.92   0.85   0.70   0.60   0.69
TA (HG+LG)     0.65   0.66   0.65   0.71   0.76   0.70
TVA (HG+LG)    0.64   0.67   0.68   0.74   0.84   0.76
Table 2: Preliminary experiments: overall BA for all of the six classes (first row) and BA for each polyp type, plus normal tissue.

We analyze the results of our preliminary experiments first in terms of polyp type classification accuracy, then in terms of adenoma grade prediction for adenoma samples only. In the following, the classification accuracy is defined in terms of Balanced Accuracy (BA) to cope with the unbalanced samples in the dataset. In Tab. 2 (first row) we show the BA achieved when attempting to discriminate all 6 classes with the baseline approach as a function of the patch scale: even the best accuracy, achieved at 1500 µm, is quite low (0.46). Nonetheless, we conjecture that different polyp types can be better recognized at different scales; as a consequence, in Tab. 2 we also show the BA for the classification of each single polyp type (HP, TA, TVA) plus NORM. The TA and TVA classes encompass the respective low-grade (LG) and high-grade (HG) subclasses. Indeed, breaking down the accuracy on a per-class basis reveals that different types of polyps achieve top classification accuracy at different scales.
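To see why BA is preferred over plain accuracy on unbalanced data, consider the sketch below. It assumes the common definition of BA as the mean of the per-class recalls (the text does not spell out the formula), and the labels are toy data, not UniToPatho results.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class weighs the same regardless of size."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy example: a classifier that always predicts the majority class.
y_true = ["TA.LG"] * 8 + ["HP"] * 2
y_pred = ["TA.LG"] * 10

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The degenerate classifier looks good under plain accuracy only because one class dominates; BA exposes that the minority class is never recognized.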

Hyperplastic Polyps (HP) are best classified at a finer 800 µm scale: we hypothesize that these benign polyps are best discriminated by looking at smaller-scale details such as gland edges [taherian2020tubular]. Conversely, Tubular Adenomas (TA) and Tubulo-Villous Adenomas (TVA) are best classified at a coarser 7000 µm scale: we hypothesize that these polyps are best discriminated by looking at large-scale macro structures such as the shapes of entire glands [mehtat2020moleculary].

Turning to the problem of predicting the grade (LG or HG) of TA and TVA adenomas, we investigate whether it can be best predicted at some particular patch scale. Our experiments, however, proved inconclusive, i.e. the adenoma grade classification accuracy does not appear to be a function of the patch scale. In fact, visual inspection of the 224×224 px downsampled patches reveals that the downsampling destroys discriminative details in the cell nuclei upon which pathologists are known to rely.

In conclusion, our preliminary experiments show that i) different polyp types are best classified at different scales (in our case 800 µm for HP and 7000 µm for TA and TVA) and ii) adenoma grade prediction may be jeopardized if the details in the cell nuclei are lost. These findings are exploited to devise our proposed method, detailed in the next section.

4 Proposed Method

Figure 2: The proposed multi-resolution ensemble of cascaded classifiers. The Hyperplastic and Adenoma classifiers (yellow) are trained with inputs downsampled to 224×224 pixels, while the Dysplasia classifier (green) uses full-resolution images.

This section details our proposed approach towards classification of UniToPatho images: a multi-resolution ensemble of cascaded classifiers. The method relies on three cascaded ResNet-18 classifiers, each with its output layer adapted to its classification task. The classifiers are trained on patches extracted at either 7000 µm or 800 µm, following the procedure described in the previous section.

The overall inference process is depicted in Fig. 2, with the input being a single 7000×7000 µm patch which is used by the three ResNet-18 classifiers mentioned above. For classifiers working at the 800 µm scale, we crop the input image into smaller 800×800 µm sub-patches which are used to generate a prediction for the entire image. Also, all of the patches are downsampled to 224×224 pixels, unless stated otherwise.
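The cropping step can be sketched as follows. This is only an illustration: it assumes non-overlapping sub-patches with the leftover border discarded, and the pixel sizes (15855 and 1812) assume a scanner resolution of 0.4415 µm/px. The helper sub_patch_boxes is hypothetical and not part of any released code.

```python
def sub_patch_boxes(side_px, sub_side_px):
    """Boxes (left, top, right, bottom) of non-overlapping square sub-patches.

    Sub-patches are laid out on a regular grid starting at the top-left
    corner; the border that does not fit a whole sub-patch is discarded
    (an assumption of this sketch).
    """
    n = side_px // sub_side_px  # sub-patches per row and per column
    return [
        (col * sub_side_px, row * sub_side_px,
         (col + 1) * sub_side_px, (row + 1) * sub_side_px)
        for row in range(n)
        for col in range(n)
    ]


# A 7000 um patch (15855 px) tiled with 800 um sub-patches (1812 px).
boxes = sub_patch_boxes(15855, 1812)
print(len(boxes))  # 64, i.e. an 8 x 8 grid
```

Since 7000 is not a multiple of 800, an 8×8 grid covers 6400 µm per side; how the remaining border is handled in the original pipeline is not specified.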

4.1 HP polyps detection

First, HP polyps are discriminated from adenomas and normal images via a binary classifier that takes as input sub-patches of size 800×800 µm extracted from the larger input image. In fact, Tab. 2 shows that HP polyps are identified with top accuracy (0.92 BA) at the 800 µm scale. To infer the probability that the entire image is HP, we average the HP probabilities predicted for its sub-patches. If the patch content is not classified as an HP polyp, the second classifier in the cascade is invoked.
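A minimal sketch of this aggregation is given below. The averaging follows the text; the 0.5 decision cutoff is an assumption, since the text does not state how the averaged probability is turned into a decision, and both function names are hypothetical.

```python
def image_hp_probability(sub_patch_probs):
    """Image-level HP probability: mean of the sub-patch HP probabilities."""
    return sum(sub_patch_probs) / len(sub_patch_probs)


def is_hp(sub_patch_probs, cutoff=0.5):
    """Decide HP vs. not-HP for the whole image (the cutoff is an assumed value)."""
    return image_hp_probability(sub_patch_probs) > cutoff


# Toy probabilities, as if produced by the binary HP classifier on four sub-patches.
mostly_hp = [1.0, 0.5, 0.75, 0.75]
mostly_not_hp = [0.25, 0.0, 0.5, 0.25]

print(image_hp_probability(mostly_hp), is_hp(mostly_hp))
print(image_hp_probability(mostly_not_hp), is_hp(mostly_not_hp))
```

Averaging treats every sub-patch equally, including those containing mostly background or normal tissue, which may be one reason why such a simple inference rule can lose some accuracy at image level.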

4.2 Adenoma detection

Second, TA adenomas are discriminated from TVA adenomas and from normal images via a ternary classifier taking as input the entire 7000×7000 µm patch. In fact, Tab. 2 shows that TA and TVA adenomas are identified with top accuracy at this scale. If the patch content is classified as a TA or TVA polyp, the third and last classifier is invoked to infer its grade; otherwise the tissue is labeled as NORM.

4.3 Dysplasia grading

Finally, a binary classifier is used to determine the dysplasia grade for TA and TVA adenomas. Sec. 3 suggested that downsampling the patches to 224×224 px might be detrimental to inferring the dysplasia grade due to the loss of important features such as cell nuclei, as also observed by [song2020automatic]. Thus, only for this specific classification task, we skip downsampling to prevent the loss of fine-grained details. To account for the increased size of the feature space, we add an adaptive average pooling layer [he2015spatial, van2019evolutionary] before the fully connected layer of the ResNet-18 network. We repeat the experiment of Sec. 3, sweeping the patch scale, this time without downsampling, and we find that adenoma grade classification peaks (0.81 BA) at the 800 µm scale. As a consequence, we apply the grade classifier on sub-patches extracted from the input image at this scale. Finally, we infer the dysplasia grade for the entire input image according to a threshold: if the ratio of high-grade predictions on the sub-patches is higher than the threshold, the image is labeled as HG, otherwise as LG. This choice is motivated by pathologists' grading guidelines, where HG can be decided on the basis of small portions of tissue; setting smaller threshold values mimics this behaviour [Tseung2005RobbinsAC].
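The threshold rule, and the way the three classifiers are chained, can be sketched as follows. This is a simplified illustration under stated assumptions: the classifier outputs are passed in as plain values, the 0.5 HP cutoff and the threshold values in the example are arbitrary, and the function names are hypothetical.

```python
def grade_from_sub_patches(sub_patch_is_hg, threshold):
    """Label the image HG if the share of HG sub-patch predictions exceeds the threshold."""
    ratio = sum(sub_patch_is_hg) / len(sub_patch_is_hg)
    return "HG" if ratio > threshold else "LG"


def cascade_label(hp_prob, adenoma_type, sub_patch_is_hg, threshold, hp_cutoff=0.5):
    """Combine the three classifier outputs into one of the six class labels.

    hp_prob          image-level HP probability from the first classifier
    adenoma_type     "TA", "TVA" or "NORM" from the second classifier
    sub_patch_is_hg  per-sub-patch HG/LG decisions from the third classifier
    """
    if hp_prob > hp_cutoff:           # stage 1: hyperplastic polyp
        return "HP"
    if adenoma_type == "NORM":        # stage 2: normal tissue, no grading needed
        return "NORM"
    grade = grade_from_sub_patches(sub_patch_is_hg, threshold)  # stage 3
    return f"{adenoma_type}.{grade}"


# 3 of 10 sub-patches predicted as HG: the outcome depends on the threshold.
votes = [True] * 3 + [False] * 7
print(grade_from_sub_patches(votes, threshold=0.25))  # HG
print(grade_from_sub_patches(votes, threshold=0.5))   # LG
print(cascade_label(0.1, "TVA", votes, threshold=0.25))  # TVA.HG
```

A lower threshold makes the rule more sensitive to small high-grade regions, at the cost of more low-grade images being labeled HG.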

(a) Baseline
(b) Multi-resolution Ensemble
Figure 3: Confusion matrices for (a) the baseline at 1500 µm and (b) the multi-resolution ensemble, reaching a BA of 0.46 and 0.67 respectively.

5 Experimental Results

We evaluate our proposed method on the 7000×7000 µm test patches of UniToPatho. As a preliminary step, we tune the classifier responsible for the dysplasia grade prediction to find a suitable threshold. In the following experiments we fix the threshold at a value that strikes a favorable balance between false positives and sensitivity. Lower thresholds may be preferred, for example, in clinical applications where minimizing the false negative rate is more important.

Fig. 3 shows the 6-class confusion matrix of the proposed method compared to the baseline approach described in Sec. 3; in the latter case we employ the 1500 µm scale which, as already discussed, yields the best overall BA. Looking at the diagonal, it is evident that the proposed method significantly improves the average accuracy, which rises from 0.46 to 0.67 (a relative increase of about 46%). We notice that the baseline model is biased towards the lower-grade classes: this is further evidence that subsampling the images removes features that are needed to distinguish high-grade from lower-grade and normal tissue. On the other hand, the multi-resolution approach shows remarkable improvements in assessing the correct grade.

              HP     NORM   TA.HG  TA.LG  TVA.HG  TVA.LG
Sensitivity   0.86   0.79   0.60   0.50   0.78    0.52
Specificity   0.93   0.87   0.92   0.94   0.96    0.92
BA            0.89   0.83   0.76   0.72   0.87    0.72
Table 3: Sensitivity, Specificity and BA per class.
Method            Scale [µm]   HP     NORM   TA     TVA
Baseline          800          0.92   0.66   0.66   0.67
Baseline          1500         0.85   0.72   0.65   0.68
Baseline          7000         0.60   0.78   0.76   0.84
Multi-resolution  -            0.89   0.83   0.81   0.87
Table 4: Comparison of the class BA between the baseline and the proposed multi-resolution approach.

Tab. 3 reports other metrics that are common in the related literature. Our proposed method achieves quite high specificity for all classes, and we also observe promising sensitivity values especially for the higher-risk TA.HG and TVA.HG classes.
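For reference, per-class sensitivity, specificity and BA can be derived from a multi-class confusion matrix in a one-vs-rest fashion, as sketched below. The matrix is toy data and the function name is hypothetical; the per-class BA is taken as the mean of sensitivity and specificity, which is consistent with the values in Tab. 3, for instance (0.79 + 0.87) / 2 = 0.83.

```python
def one_vs_rest_metrics(confusion, k):
    """Sensitivity, specificity and BA of class k against all other classes.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                  # class k samples predicted as something else
    fp = sum(row[k] for row in confusion) - tp   # other classes predicted as class k
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
toy = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 5, 5],
]

for k in range(3):
    sens, spec, ba = one_vs_rest_metrics(toy, k)
    print(k, round(sens, 3), round(spec, 3), round(ba, 3))
```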

Finally, we also analyze the per-type performance, as done in Sec. 3. The results are shown in Tab. 4. Despite the small drop in HP accuracy (from 0.92 to 0.89), which could be due to the simple inference method, we observe an increase for all of the other tissue classes. Notably, we obtain a large reduction in false positive adenoma predictions and, most importantly, a more precise distinction between TA and TVA adenomas.

6 Conclusions

This work presents UniToPatho, a histopathological dataset of colorectal polyps obtained from 292 high-resolution annotated images. UniToPatho provides annotation for hyperplastic polyps, adenomas (tubular or tubulo-villous) and their dysplasia grade (low or high). We show that a single deep neural network fails at correctly classifying the tissue type. We highlight that the discriminant features of each class are extracted at different resolutions, making a direct classification of the tissue at a fixed scale a sub-optimal approach. This observation allowed us to design a multi-resolution deep learning strategy that, employing an ensemble of classifiers, achieves 67% balanced accuracy. From a clinical perspective, the most relevant results are the differentiation capability between tubular and tubulo-villous adenomas and the dysplasia grading, which are the most difficult tasks for pathologists [gupta2020recommendations, hassan2020post, matsuda2016surveillance, rutter2020british]. The challenge of improving the automatic diagnostic accuracy is however still open, as UniToPatho is publicly available and still growing. Our future work will focus on collecting samples from other institutions to assess the cross-laboratory generalization capability.