Redesigning Fully Convolutional DenseUNets for Large Histopathology Images

08/05/2021
by   Juan P. Vigueras-Guillén, et al.
AstraZeneca

The automated segmentation of cancer tissue in histopathology images can help clinicians detect, diagnose, and analyze such disease. Unlike the natural images used to benchmark many convolutional networks, histopathology images can be extremely large, and the cancerous patterns can extend beyond 1000 pixels. Therefore, the well-known networks in the literature were never conceived to handle these peculiarities. In this work, we propose a Fully Convolutional DenseUNet that is particularly designed to solve histopathology problems. We evaluated our network on two public pathology datasets published as challenges in the recent MICCAI 2019: binary segmentation in colon cancer images (DigestPath2019), and multi-class segmentation in prostate cancer images (Gleason2019), achieving results similar to and better than the winners of the challenges, respectively. Furthermore, we discuss good practices in the training setup to yield the best performance and the main challenges in these histopathology datasets.



1 Introduction

Fully Convolutional Networks (FCNs) [11, 16] were introduced as a way to expand Convolutional Neural Networks (CNNs) to semantic image segmentation, where the neural network (NN) layers at the end of the CNN are substituted with an upsampling path that recovers the spatial resolution of the input image. One major contribution in these networks was the introduction of skip-connections between the downsampling and upsampling paths [16], which have proven effective in recovering fine-grained details of the images [4].

Many CNN architectures have been extended to FCNs. One example is the so-called Tiramisu network [10], which uses the design of DenseNets [7]. DenseNets exploit the idea of dense connections: within a resolution/dense block, all previous feature maps are concatenated before being sent to the next convolutional layer. This concept can be interpreted as the use of skip-connections within a dense block, and it shows many similarities with another well-known architecture, Residual Networks (ResNets) [6], although ResNets sum the feature maps instead of concatenating them. Overall, DenseNets have shown many benefits: (1) features are reused along a resolution block, where all convolutional layers can see the preceding feature maps; (2) there is an implicit deep supervision; (3) the network is robust against overfitting [17]; and (4) since the gradients and the input information flow through the short connections, the vanishing-gradient problem is alleviated [6].

When aiming for semantic segmentation in histopathology problems, images are usually much larger than those in the public datasets used for benchmarking, such as ImageNet, whose images are 256×256 pixels. Since current GPUs do not have enough memory to handle large images, it is common to subdivide the image into patches and process them independently. However, this raises the questions of how large the patches should be and what the correct balance is between network size (in parameters) and patch size. We hypothesize that, for best results, a patch should cover the whole area where abnormal, growing cancer tissue occurs, so that the network can understand the limits of the cancerous tissue.

In this work, we propose a Fully Convolutional DenseNet designed for histopathology images. While many details of the Tiramisu network [10] may be optimal for small images, they can require substantial computational resources while contributing little in our setting. Thus, we redesigned FC-DenseNets for better performance on large pathology images. To demonstrate the potential of our network, we experimented with two public histopathology datasets presented as challenges at the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2019): (1) a dataset of colon cancer, named DigestPath2019, to perform binary segmentation; and (2) a dataset of prostate cancer, named Gleason2019, to perform grading (multi-class segmentation).

The contributions of this work are as follows: (1) we propose a new Fully Convolutional DenseNet designed for large histopathology images; (2) we evaluate the method on two different datasets and compare it with the well-known Tiramisu network [10]; (3) we discuss the training details that are key to good performance, such as class balance and image sampling.

2 Methods

2.1 Datasets

2.1.1 DigestPath2019.

The ‘colonoscopy tissue segment dataset’ was part of the DigestPath2019 Challenge [12]. According to the authors, 660 color tissue slices of an average size of 5000×5000 pixels from 324 patients were provided as training data, of which 250 images had a lesion annotation (positive images) and the remaining 410 contained healthy tissue (negative images). This was a very unbalanced two-class problem: the pixel ratio between the malignant tissue (positive class) and healthy tissue or non-tissue (negative class) within the positive images was 1:8, which increased to 1:14 if the negative images were considered. There was a single annotation per image, although it was not specified whether the same pathologist annotated all images. The data showed large variations in appearance because it was collected from several medical centers in developing countries (Fig. 1). All whole slide images were stained with hematoxylin and eosin (H&E) and scanned at ×20 magnification. The malignant lesions in the dataset were high-grade intraepithelial neoplasia and adenocarcinoma, including papillary adenocarcinoma, mucinous adenocarcinoma, poorly cohesive carcinoma, and signet ring cell carcinoma.

The testing dataset was not released. Thus, we performed a 10-fold cross-validation in the training set, where fold 1 was considered the validation set used to compare the different networks. Once we established the best network, the remaining 9 folds were cross-validated and the final test values were computed as the average over the 10 folds. To evaluate the segmentation results, the metrics chosen were accuracy and the Dice Similarity Coefficient (DICE); the latter measures the area overlap between segmentation results and annotations and is defined as

DICE = \frac{2\,|A \cap B|}{|A| + |B|},    (1)

where A is the set of foreground pixels in the annotation and B is the set of foreground pixels in the segmentation result. More than 500 participants registered for the challenge; the winning DICE score is used as the reference against which we compare our results (Section 3.1).
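
For concreteness, the following is a minimal NumPy sketch of the DICE computation on binary masks (the function and variable names are ours for illustration, not from the challenge code):

```python
import numpy as np

def dice_coefficient(annotation: np.ndarray, segmentation: np.ndarray) -> float:
    """DICE = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    a = annotation.astype(bool)
    b = segmentation.astype(bool)
    intersection = np.logical_and(a, b).sum()
    denominator = a.sum() + b.sum()
    if denominator == 0:          # both masks empty: define DICE as 1
        return 1.0
    return 2.0 * intersection / denominator

# Example: two 4x4 masks with partial overlap
y_true = np.array([[0, 1, 1, 0]] * 4)
y_pred = np.array([[0, 0, 1, 1]] * 4)
print(dice_coefficient(y_true, y_pred))  # 0.5
```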

Overall, the quality of the annotations was good, with a precise delineation of the tumors. However, we noted that some non-tissue areas within tumors were annotated as cancer pixels (Fig. 1). While this is an understandable decision if the annotations were used for discussion between pathologists, it seems rather counterproductive for an algorithm and it raises the question of whether a CNN could understand those human-made patterns in the labels. In summary, the two main challenges in this dataset were the high class imbalance and the diverse appearance of the images (color, contrast, etc.).

Figure 1: Four representative examples of the DigestPath dataset; from left to right, two positive images (the perimeters of the lesions are highlighted in green) and two negative images. The images were scaled for illustrative purposes (the leftmost image had a size of 5953×7294 pixels, whereas the rightmost image had a size of 3312×4096 pixels).

2.1.2 Gleason2019.

This challenge dealt with the automatic Gleason grading of prostate cancer from H&E-stained histopathology images [13]. The Gleason grading is a 5-grade system named after the pathologist who developed it in the 1960s, Dr. Donald Gleason, and it indicates the distinct patterns as the cancerous cells change from normal to tumor cells [15], where grade 1 refers to cancer cells that resemble normal prostate tissue and grade 5 indicates highly mutated cancer cells. This dataset contained 244 training images with an average size of 5120×5120 pixels, and six pathologists performed the grading annotations, although not on all images. Specifically, the six pathologists performed 242, 136, 238, 240, 244, and 65 annotations, respectively (a total of six annotations were discarded due to clear mistakes). In this challenge, the goal was to segment grades 3, 4, and 5; thus, labels 1 and 2 were merged with label 0 (healthy tissue and non-tissue). To obtain the gold-standard labels, the authors defined their computation as the pixel-based majority voting among the available annotations, although they did not disclose these majority-vote labels and did not explain how to proceed when a tie occurs, or whether labels 1 and 2 were considered for the majority vote and later discarded or vice versa. Therefore, we decided to first remove labels 1–2 and then compute the majority-vote labels, setting the higher grade in case of a tie (Fig. 2).

Figure 2: One representative example of the pathologists’ annotations in the Gleason Challenge. The majority-vote label is the one used to evaluate the models.

When observing the annotations, we noted that most of the annotators did rough, imprecise work (Fig. 2). Indeed, each pathologist showed a different way of annotating: pathologist #2 was the exception, being very detail-oriented; pathologists #5 and #6 tended to label the whole tissue area (#5 barely left any area unlabeled, whereas the others tended to leave unlabeled stripes between labeled areas); and pathologists #1, #3, and #4 were somewhere in between. Furthermore, the labeled areas sometimes reached beyond the tissue, which suggests that the pathologists did not pay attention to the tissue borders and preferred to do a quick job. To mitigate this problem, we automatically created binary masks differentiating the tissue area from the outer area and applied them to the label images, thus removing any label wrongly placed outside the tissue area. This also helped, later in the training, to extract patches from the healthy tissue (white pixels in Fig. 2) and not from the outer area (black pixels in Fig. 2).
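
The paper does not specify how these binary tissue masks were generated; the sketch below illustrates one plausible approach using simple intensity thresholding and morphological clean-up with scikit-image. The threshold values and helper names are our assumptions, not the authors' pipeline.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.morphology import binary_closing, remove_small_objects, disk

def tissue_mask(rgb_image: np.ndarray, background_threshold: float = 0.85) -> np.ndarray:
    """Return a boolean mask that is True on tissue, False on the outer area.

    Assumption: the background is near-white (H&E slide background) or near-black
    (scanner border); both thresholds here are illustrative values.
    """
    gray = rgb2gray(rgb_image)                       # intensities in [0, 1]
    mask = (gray < background_threshold) & (gray > 0.05)
    mask = binary_closing(mask, disk(5))             # fill small holes within tissue
    mask = remove_small_objects(mask, min_size=500)  # drop isolated specks
    return mask

# Labels outside the tissue can then be removed by masking the annotation image:
# cleaned_label = label_image * tissue_mask(tissue_rgb)
```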

The labels of the test set were not released. Thus, we also created a 10-fold cross-validation in the training set and proceeded exactly as described for the DigestPath dataset. To evaluate the segmentation results, the metric chosen by the authors was a combination of Cohen’s kappa and the F1-scores. Specifically, the metric was defined as

(2)

where Cohen’s kappa expresses the level of agreement between two annotators, computed as \kappa = (p_o - p_e)/(1 - p_e), with p_o the observed agreement ratio and p_e the expected agreement when both annotators assign labels randomly. The F1-score is computed as \mathrm{F1} = 2\,(\mathrm{precision}\cdot\mathrm{recall})/(\mathrm{precision}+\mathrm{recall}), with \mathrm{precision} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FP}) and \mathrm{recall} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FN}), using the global counts of true positives (TP), false negatives (FN), and false positives (FP) to yield the ‘micro F1’, and the per-label computation to obtain the ‘macro F1’. We used the implementations of these formulas from the Scikit-learn Python API [14]. Finally, more than 100 participants registered for this challenge; the winning score serves as the reference for our results (Section 3.2).
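
As an illustration, the individual components of this metric can be computed with Scikit-learn as follows (a minimal sketch on flattened per-pixel labels; the final combination follows Eq. 2):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score

# Flattened per-pixel labels (grades 0, 3, 4, 5) for the majority vote and a prediction
y_true = np.array([0, 0, 3, 3, 4, 4, 5, 5])
y_pred = np.array([0, 0, 3, 4, 4, 4, 4, 5])

kappa = cohen_kappa_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")  # global TP/FP/FN counts
f1_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean over labels
print(kappa, f1_micro, f1_macro)
```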

Overall, this was a more complex problem than the DigestPath challenge. For instance, the cases where non-tissue inner areas were labeled as cancer occurred more often in this dataset, with extreme cases such as the one displayed in the upper part of the core image in Fig. 2. The grades were also highly unbalanced in the majority-vote labels between healthy tissue (including grades 1 and 2) and grades 3, 4, and 5; for instance, for every pixel with grade 5 there were 36 pixels with grade 4. In summary, the main challenges in this dataset were the high class imbalance, the poorly annotated labels, and the discrepancies of opinion between pathologists.

2.2 Network

Our proposed network, named 1BN-DenseUnet, is depicted in Fig. 3; it can be seen as an adaptation of DenseNets [7] to semantic segmentation that differs from the Tiramisu network [10]. Many of the changes are particularly suitable for our histopathology problems and, thus, might not extrapolate to other types of natural images. These were:

Figure 3: Schematic overview of the 1BN-DenseUnet network.
  1. The main hyperparameters were set so that the network would fit within our limited resources (a GPU with 16 GB of RAM). This involved a growth rate of GR=6 (feature maps added at each convolutional layer, Conv2D(3×3)) and the number of convolutional blocks per dense block shown in Fig. 3.

  2. Different from both [7, 10], we employed the ELU activation function [1] instead of ReLUs [5], and the activations were set within the convolutional layers simply to save resources.

  3. Batch Normalization (BN) [8] layers were only used before the feature reduction layer, Conv2D(1×1). As discussed later, BN layers occupied much memory and could be reduced in number without a decrease in performance.

  4. Similar to [7], we used feature reduction layers, BN+Conv2D(1×1)+ELU, with the same number of output features, 4·GR, as in [7]. We used these layers in both convolutional and upsampling blocks. In contrast, the Tiramisu network does not use them (a minimal code sketch of the resulting convolutional block follows this list).

  5. Instead of an initial convolutional layer with larger filters as in [7, 10], we set a first dense block with simply three convolutional layers (no BN or feature reduction layer).

  6. There is no reduction block in the first dense block. We observed that having the input image in the last dense block helped with the fine-grained details.

  7. Different from [10], we up-sampled all the concatenated features from the output of the previous dense block. Since we applied feature reduction before upsampling, the memory demand was similar to that of Tiramisu, which only up-samples the output of the last convolutional block.

  8. Different from both [7, 10], we used a compression rate of C=0.5 in the reduction and upsampling blocks, but applied only to the number of feature maps created in the previous dense block. In practice, this is a higher compression than in [7, 10], reducing the feature maps to approximately 1/3 in the deepest layer.

  9. New to both [7, 10], our network adds the input image at the beginning of each dense block in the downsampling path (properly reduced in size). This idea has been suggested previously [18] and here it boosted the performance with only a very small increase in computational cost.

  10. New to both [7, 10], our network receives the images in two color spaces, RGB+HSV. The specific choice of color space is mainly data- and problem-dependent. In our case, we observed that providing both spaces helped the network yield a slightly better accuracy than either space alone and, most interestingly, the training showed smoother loss and accuracy curves, suggesting better convergence.
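
To make the block design concrete, the following minimal Keras sketch illustrates a convolutional (dense) block reflecting items 2–4 above: ELU activations inside the convolutions, a single BN layer placed only before the 1×1 feature-reduction layer, and dense concatenation with growth rate GR=6. This is our illustrative reconstruction from the description, not the authors' implementation; the pooling type is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

GR = 6  # growth rate: feature maps added by each 3x3 convolution

def dense_block(x, num_layers, growth_rate=GR):
    """Dense block: each 3x3 conv (with ELU) sees all previous feature maps."""
    features = [x]
    for _ in range(num_layers):
        inp = layers.Concatenate()(features) if len(features) > 1 else features[0]
        features.append(layers.Conv2D(growth_rate, 3, padding="same", activation="elu")(inp))
    return layers.Concatenate()(features)

def feature_reduction(x, growth_rate=GR):
    """BN placed only before the 1x1 feature-reduction layer (item 3), 4*GR output maps (item 4)."""
    x = layers.BatchNormalization()(x)
    return layers.Conv2D(4 * growth_rate, 1, padding="same", activation="elu")(x)

# Example: one downsampling step of the encoder (illustrative only)
inputs = tf.keras.Input(shape=(None, None, 6))  # RGB+HSV input (item 10)
x = dense_block(inputs, num_layers=3)           # first dense block (item 5)
x = feature_reduction(x)
x = layers.AveragePooling2D(2)(x)               # transition down (pooling type assumed)
model = tf.keras.Model(inputs, x)
model.summary()
```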

For comparison purposes, we implemented the Tiramisu network [10], setting the same depth and width as in Fig. 3, yielding 329K parameters. In contrast, our network had 294K parameters.

These networks were employed in both datasets: in the DigestPath2019 we used a 2-class output (binary class), whereas in the Gleason2019 we employed a one-hot encoding and a probability encoding.

2.3 Training Details

2.3.1 DigestPath2019.

This dataset had two types of images (positive and negative) and, within the positive images, two classes. To balance the classes, we extracted four patches of 756×756 pixels to build each batch in the following way: two patches from positive images whose center pixel was class 1 (cancer), one patch from a positive image whose center pixel was class 0 (non-cancer), and one patch from a negative image. Importantly, each patch was obtained from a different image, the non-cancer patches were extracted from tissue areas, and the patches were shuffled within the batch to avoid any learning bias (a minimal sketch of this sampling follows). While this setting did not ensure a perfect 50/50 class balance in the batches, we observed that any attempt to further weight the classes did not improve the performance. Furthermore, the fact that we only sampled one patch from the negative images did not diminish the performance on that subset.
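
A minimal sketch of this class-balanced batch construction (the data structures and helper names are our assumptions for illustration):

```python
import random
import numpy as np

PATCH = 756  # patch side in pixels, as described above

def sample_patch(image, label, center_class=None, tissue_mask=None):
    """Crop one PATCH x PATCH patch; optionally require a given class (or tissue) at the center."""
    h, w = label.shape
    while True:  # sketch: assumes the requested class exists in the image
        y = random.randint(PATCH // 2, h - PATCH // 2 - 1)
        x = random.randint(PATCH // 2, w - PATCH // 2 - 1)
        if center_class is not None and label[y, x] != center_class:
            continue
        if tissue_mask is not None and not tissue_mask[y, x]:
            continue
        sl = np.s_[y - PATCH // 2: y + PATCH // 2, x - PATCH // 2: x + PATCH // 2]
        return image[sl], label[sl]

def build_batch(positive_images, negative_images):
    """Four patches, each from a different image: 2 cancer-centered, 1 non-cancer, 1 negative."""
    pos = random.sample(positive_images, 3)           # (image, label, tissue_mask) tuples
    neg = random.choice(negative_images)
    batch = [
        sample_patch(*pos[0][:2], center_class=1),
        sample_patch(*pos[1][:2], center_class=1),
        sample_patch(*pos[2][:2], center_class=0, tissue_mask=pos[2][2]),
        sample_patch(*neg[:2], tissue_mask=neg[2]),
    ]
    random.shuffle(batch)                              # avoid a fixed class order in the batch
    images, labels = zip(*batch)
    return np.stack(images), np.stack(labels)
```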

For data augmentation, we performed only flipping and four rotations over angles that are multiples of 90° within the batch. Thus, eight orientations were possible without introducing interpolation. The remaining hyperparameters were: binary cross-entropy as the loss function, the Nadam optimizer [3], 400 iterations per epoch (approximately the number of negative images), 250 epochs, and an initial learning rate that decayed at each new epoch. Since overfitting was not observed, we employed the last model (no early stopping).
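
A minimal NumPy sketch of this interpolation-free augmentation (one random choice among the eight orientations per patch):

```python
import numpy as np

def augment(patch: np.ndarray, label: np.ndarray):
    """Random flip + rotation by a multiple of 90 degrees: 8 orientations, no interpolation."""
    k = np.random.randint(4)             # number of 90-degree rotations
    patch, label = np.rot90(patch, k), np.rot90(label, k)
    if np.random.rand() < 0.5:           # random horizontal flip
        patch, label = np.fliplr(patch), np.fliplr(label)
    return patch.copy(), label.copy()    # copies to obtain contiguous arrays
```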

2.3.2 Gleason2019.

In this problem, we had four classes, one of which (grade 5) was heavily underrepresented. Thus, we balanced the classes by constructing each batch with one example of each class. Similar to DigestPath2019, each patch was obtained from a different image, we extracted the patches for the non-cancer class from the healthy tissue (white pixels in Fig. 2), and we also shuffled the patches within the batch. The key aspect in this problem was deciding how to combine the different pathologists’ annotations, knowing that the majority vote would be used for evaluation. Thus, we performed several experiments:

  • We trained using only the annotations by one pathologist (all of them were tested), using one-hot encoding.

  • We trained with the majority-vote (one-hot encoding).

  • We trained with all available annotations; thus, we randomly selected a pathologist’s annotation for each patch (one-hot encoding).

  • We trained with a probabilistic encoding. This means that, for each patch, we considered the different opinions of the pathologists to build a probability vector for each pixel. For example, if a pixel was given grade 0 by one pathologist, grade 3 by two pathologists, grade 4 by three pathologists, and grade 5 by none, the encoding over grades (0, 3, 4, 5) would be (1/6, 2/6, 3/6, 0); a minimal code sketch follows this list.
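
A minimal sketch of how such per-pixel probability maps can be built from multiple annotation maps (our illustrative reconstruction; the array layout is an assumption):

```python
import numpy as np

GRADES = (0, 3, 4, 5)  # grades 1 and 2 were merged into 0 beforehand

def probability_encoding(annotations) -> np.ndarray:
    """Turn a list of per-pathologist label maps (H, W) into an (H, W, 4) probability map."""
    h, w = annotations[0].shape
    probs = np.zeros((h, w, len(GRADES)), dtype=np.float32)
    for label_map in annotations:
        for idx, grade in enumerate(GRADES):
            probs[..., idx] += (label_map == grade)
    return probs / len(annotations)       # normalize by the number of available annotators

# Example for a single pixel annotated by six pathologists with grades 0, 3, 3, 4, 4, 4
maps = [np.full((1, 1), g) for g in (0, 3, 3, 4, 4, 4)]
print(probability_encoding(maps)[0, 0])   # ≈ [0.167, 0.333, 0.5, 0.0]
```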

Similar to DigestPath2019, we employed the same type of data augmentation and the following hyperparameters: categorical cross-entropy as the loss function, the Nadam optimizer [3], 250 iterations per epoch (approximately the number of images), 400 epochs, an initial learning rate that decayed at each new epoch, and no early stopping.

3 Results

3.1 DigestPath2019

For the validation set, we obtained a DICE of 82.49%, whereas the Tiramisu network yielded a DICE of 80.84%. To evaluate the importance of using a large patch, we tested the network with patches of 512×512 pixels (doubling the growth rate to GR=12, with 1.1M parameters) and 256×256 pixels (setting GR=24, with 4.3M parameters). The smaller patch resulted in an unstable, deficient training, with a DICE of 75.65%. The 512×512 patch yielded reasonable results, with a DICE of 80.21%. Therefore, it was beneficial to increase the patch size to 768×768 even though that entailed reducing the number of parameters to 294K. We also attempted to reduce the resolution of the images so that larger patches (1024×1024 downscaled to 512×512) could be used with bigger networks, but we observed that this affected the performance and quality of the results. Alternatively, we also tested reducing only the output image and later scaling it up, but that did not show any difference with respect to our proposed setup. Eventually, the proposed balance between patch size, batch size, and network depth was optimal considering the 16 GB limit of GPU RAM.

Finally, we performed a 10-fold cross-validation with the 1BN-DenseUnet (Fig. 3) and obtained the following average metrics: an accuracy of 99.93% for the negative images, 94.76% for the positive images, a total accuracy of 97.98%, an F1-score of 79.17%, and a DICE of 79.82%, which was slightly lower than that of the challenge winner.

Qualitatively, we observed a very good segmentation (Fig. 5):

  • Overall, the cancer areas were well detected in the positive images.

  • Negative images were almost perfectly identified as non-cancer (99.93% pixel accuracy).

  • Some areas appeared blurred (mainly in positive images), which was an indication of complicated tissue morphology.

  • The non-tissue areas were perfectly classified even though we built our batches without directly sampling from it.

  • Our method tended to be more precise in non-tissue pixels that the gold standard wrongly indicated as cancer because of the way pathologists made the annotations (blue arrows and the whole B3 annotation in Fig. 5).

  • Some large non-cancer areas were classified as cancer with high certainty (green arrows in Fig. 5), which made us wonder whether the annotations were correct in those cases.

3.2 Gleason2019

This dataset posed a more complex problem due to the inconsistencies between pathologists. If the network was trained with the annotations of only one pathologist, the selection of that pathologist was key to better performance (Table 1), with pathologists #3 and #5 yielding the greatest accuracy. Since the evaluation was done with the majority-vote labels, this was not an indication of the quality of the pathologists’ skills but a sign of which pathologists had annotations most similar to the resulting majority-vote labels (Fig. 4).

We also tested an alternative setup where the network was trained and tested on the annotations of the same pathologist, and the results indicated that most pathologists scored higher on the majority-vote labels than on their own labels, with the exception of pathologists #2 and #6 (Table 1). This could suggest the existence of inconsistent labels within the annotations of the same pathologist and the benefit of using the majority vote as the best gold standard. More interestingly, the experiment where pathologist #2 was used for training and testing was the only case where grade 5 had a non-zero DICE score (32.5%), suggesting that this pathologist was precise and consistent in their annotations, such that the network could differentiate all the Gleason grades even in the presence of a high label imbalance.

As expected, training with the annotations of all pathologists was slightly detrimental, as random label sampling introduces contradictory annotations. In contrast, training with the majority-vote labels provided better performance, although this was not surprising (Table 1). Our proposed probability approach yielded the best categorical accuracy but not the best score (our score was considerably larger than the winner’s, which made us wonder whether the metric could be wrongly defined; we attempted to contact the challenge organizers to clarify this, but without success). We observed that the probability approach was more sensitive to the ‘patch effect’ in the output image, where the network was not capable of providing a smooth transition of grades between patches in the reconstructed output image (Fig. 4). Finally, the Tiramisu network yielded slightly worse results.

            P1      P2      P3      P4      P5      P6      All     MV      Prob    Tir
No. images  242     136     238     240     244     36      244     244     244     244
Accuracy    78.58   74.84   85.21   84.09   85.22   76.42   78.99   85.56   85.66   85.14
Score       1.192   1.104   1.335   1.299   1.356   1.132   1.224   1.359   1.343   1.311
Own acc.    73.70   83.05   82.40   82.36   84.87   78.75   -       -       -       -
Own score   1.088   1.256   1.232   1.243   1.358   1.088   -       -       -       -
Table 1: Categorical accuracy and score on the validation set for the different setups, where the majority vote was employed as the test labels and the training used: only the annotations of one pathologist (P1–P6), all annotations (All), the majority vote (MV), the probabilistic approach (Prob), or the Tiramisu network with the probabilistic approach (Tir). The best metric is in bold. For comparative purposes, each pathologist's model was also trained and tested on that pathologist's own labels (bottom rows).
Figure 4: The output images for different setups in the Gleason challenge. (A) The prostate tissue image (input). (B) The majority vote annotation (label). (C) The annotated image from the pathologists; if an image was not annotated by one pathologist, their number is indicated instead (#2 & #6). (P1-P6) The output for the setup where only the annotations of one pathologist were used for training. (All) The output using all annotations as training. (MV) The output using the majority-vote annotations as training. (Prob) The output using the probability approach.

Qualitatively, our model could identify the areas with abnormal cells reasonably well, but the grading was sometimes inconsistent with the majority vote (Fig. 6).

Figure 5: (A) Four representative examples of positive images with cancer tissue. (B) Pathologist’s annotations (binary images). (C) Output of our network (probability images). Blue arrows indicate annotations poorly made (non-tissue areas within the cancer annotation). Green arrows indicate areas classified with high certainty as cancer that were annotated as non-cancer, which suggest possible mistakes in the pathologist’s annotations.
Figure 6: (A) Four representative examples of prostate tissue biopsies. (B) The annotated images from the pathologists; if an image was not annotated by one pathologist, their number is indicated instead (#1,…). (C) The majority vote annotations; benign tissue (white pixels in Fig. 2) was merged with non-tissue (black pixels). (D) Output of our network (categorical classification) training with the majority-vote labels. (Bottom) Color code.

4 Discussion

In this work, we have presented a new DenseUNet that provides better performance than a Tiramisu network of similar size in two different histopathology problems, one a two-class problem and the other a multi-class problem. These two datasets were presented as challenges at the MICCAI conference, which allows us to compare our performance with that of other teams. Indeed, our method, along with the training details that boost the network’s learning, provided a performance similar to or better than the winners of the respective challenges, even though we did not add any fine-tuning specifically designed for one of the datasets, since our goal was to present a generic network able to perform proficiently in different pathology problems.

In our experiments with the design of the network, we observed that including feature reduction layers at the beginning of the convolutional blocks (as originally designed in DenseNets [7]) was beneficial, whereas the Tiramisu network does not make use of them. BN layers were not crucial in our problem and they required much memory, so we reduced them to only the beginning of the convolutional blocks instead of placing them before every convolutional layer. BN layers are widely used due to their benefits in accelerating training while providing a small amount of regularization [8]. We actually did not observe any difference in convergence speed or performance when BN layers were completely removed from the network, but they had a small regularization effect and, thus, we kept them in the network.

Our sampling method to build the batch, which consisted of selecting each patch from a different image, balancing the classes through that selection, and shuffling the order of the patches (preventing the network from learning a specific distribution of classes within the batch), had a large impact on the performance of the network. Indeed, if the patch sampling was reduced to a pair of positive and negative images (in the DigestPath2019 dataset), the batch was not representative of the distribution of the whole dataset and the performance was highly affected. In this respect, batch renormalization (BRN) layers [9] were effective in battling this problem, as the parameters used for normalization were computed over the different batches. Nevertheless, once we sampled from four different images, no differences were observed between a network with BN and one with BRN. Since BRN needs to store more parameters and thus uses more memory, BN layers were preferred. Furthermore, we tested whether increasing the batch size would further improve the performance, although that entailed reducing the network size (to fit in the GPU memory); our experiments suggested that more than four images were not particularly necessary, as the reduction in the number of network parameters diminished the performance.

We firmly believe that our network could learn the different cancer patterns and satisfactorily detect them (Fig. 5). This was particularly evident in the DigestPath2019 dataset, where in many cases our segmentation seemed to correct clear mistakes or imprecise delineations in the pathologists’ annotations (blue arrows in Fig. 5). Let us take Fig. 5-A3 as an example: the annotation was done in a rough, imprecise fashion, marking many internal non-tissue areas as cancer; it could be argued that the presence of such a non-tissue structure is an indication of cancer, but for a computer algorithm only the border of that area (which is the actual abnormal tissue structure) should be annotated. In other words, pathologists sometimes annotate images under the unconscious assumption that another human being will interpret their annotations, or without a basic understanding of how computer algorithms work. Given these inconsistencies, it is complicated to train a model that avoids being biased by the human mistakes in the annotations. In this respect, the visual analysis of our results made us believe that our model was indeed robust against those inconsistencies.

Regarding the Gleason2019 dataset, our performance was rather suboptimal. By visual inspection (Fig. 6), we could corroborate that our method was overall successful in detecting the areas with presence of cancer, but it did not perform as well in grading the tissue. However, it was unclear whether this was a network problem or a dataset/label problem. Indeed, the majority vote can be interpreted as the best gold standard if no other information is given. If the expertise of the pathologists were provided, it would probably be optimal to weight the labels based on their years of experience; this has already been shown to correlate with annotation accuracy in the field of ophthalmology, where ophthalmologists with 25 years of experience were substantially better than clinicians with 5-10 years of experience [2]. Nevertheless, we observed that the probabilistic approach was rather inferior qualitatively, although not quantitatively. Therefore, we preferred the network trained with the majority-vote labels and one-hot encoding, although this model did not classify grade 5 correctly. Indeed, only pathologist #2, who made very detailed annotations, performed very well with properly balanced grades when the network was trained and tested on their own labels, which clearly suggests that precise, detail-oriented labels are important to yield unbiased results. It is worth noting that the majority-vote labels sometimes showed abnormal, illogical patterns that no pathologist would ever annotate in that way, but that was an unavoidable flaw of the evaluation methodology.

5 Conclusions

We have proposed a Fully Convolutional DenseNet particularly designed for large histopathology images. We have shown that our network performs better than the well-known Tiramisu network [10] in two different histopathology problems, published as two pathology challenges at the MICCAI conference. Unlike other natural images, histopathology images are considerably large, and the cancer patterns can span large areas; therefore, it is vital to process these images with patches that cover such patterns. Given the currently limited GPU resources, it is important to adapt the default networks in the literature to this goal. Our proposed network, which was not fine-tuned to perform particularly better in one dataset, proved to be robust across different histopathology images, yielding results similar to or better than those of the challenge winners.

References

  • [1] D. Clevert, T. Unterthiner, and S. Hochreiter (2016) Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR).
  • [2] J. De Fauw, J. R. Ledsam, B. Romera-Paredes, et al. (2018) Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24, pp. 1342–1350.
  • [3] T. Dozat (2016) Incorporating Nesterov momentum into Adam. In International Conference on Learning Representations (ICLR) Workshop, Vol. 1, pp. 2013–2016.
  • [4] M. Drozdzal, E. Vorontsov, G. Chartrand, S. Kadoury, and C. Pal (2016) The importance of skip connections in biomedical image segmentation. In Deep Learning and Data Labeling for Medical Applications, LNCS, Vol. 10008.
  • [5] X. Glorot, A. Bordes, and Y. Bengio (2011) Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 15, pp. 315–323.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 770–778.
  • [7] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [8] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Vol. 37, pp. 448–456.
  • [9] S. Ioffe (2017) Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA.
  • [10] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio (2017) The one hundred layers tiramisu: fully convolutional DenseNets for semantic segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, pp. 1175–1183.
  • [11] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, pp. 3431–3440.
  • [12] MICCAI 2019 DigestPath Challenge. https://digestpath2019.grand-challenge.org [Online; accessed 15-December-2019].
  • [13] MICCAI 2019 Gleason Challenge. https://gleason2019.grand-challenge.org [Online; accessed 15-December-2019].
  • [14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
  • [15] Prostate Cancer Foundation: Gleason score. https://www.pcf.org/about-prostate-cancer/diagnosis-staging-prostate-cancer/ [Online; accessed 10-February-2020].
  • [16] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), Vol. 9351, pp. 234–241.
  • [17] J. P. Vigueras-Guillén, H. G. Lemij, J. van Rooij, K. A. Vermeer, and L. J. van Vliet (2019) Automatic detection of the region of interest in corneal endothelium images using dense convolutional neural networks. In Medical Imaging 2019: Image Processing, Vol. 10949, pp. 779–789.
  • [18] G. Zeng and G. Zheng (2018) Multi-scale fully convolutional DenseNets for automated skin lesion segmentation in dermoscopy images. In 15th International Conference on Image Analysis and Recognition (ICIAR 2018), LNCS, Vol. 10882, pp. 513–521.