Deep neural networks for automated classification of colorectal polyps on histopathology slides: A multi-institutional evaluation

09/27/2019 ∙ by Jason W. Wei, et al. ∙ Dartmouth College

Histological classification of colorectal polyps plays a critical role in both screening for colorectal cancer and care of affected patients. In this study, we developed a deep neural network for classification of four major colorectal polyp types on digitized histopathology slides and compared its performance to local pathologists' diagnoses made at the point-of-care and retrieved from the corresponding pathology labs. We evaluated the deep neural network on an internal dataset of 157 histopathology slides from the Dartmouth-Hitchcock Medical Center (DHMC) in New Hampshire, as well as an external dataset of 513 histopathology slides from 24 different institutions spanning 13 states in the United States. On the internal evaluation, the deep neural network had a mean accuracy of 93.5%, compared with local pathologists' accuracy of 91.4%. On the external test set, the deep neural network achieved an accuracy of 85.7%, significantly outperforming local pathologists' accuracy of 80.9%. If confirmed in clinical settings, our model could assist pathologists by improving the diagnostic efficiency, reproducibility, and accuracy of colorectal cancer screenings.




1 Introduction

In the United States, colorectal cancer is expected to cause 51,020 deaths in 2019, making it the second-most common cause of cancer death [1]. This death rate, however, has been dropping over the past several decades, likely due to successful cancer screening programs [2-5], of which colonoscopy is the most common screening test in the U.S. [6]. During colonoscopies, clinicians excise colorectal polyps from patients and visually examine them on histopathology slides for neoplasia, ultimately reducing the mortality rate by detecting cancer at an early, curable stage and by removing preinvasive adenomas or serrated lesions [7-9]. Furthermore, the numbers and types of polyps detected can indicate future risk for malignancies and are therefore used as the basis for subsequent screening recommendations [6]. An algorithm for automated classification of colorectal polyps could potentially benefit cancer screening programs by improving efficiency, reproducibility, and accuracy, as well as by reducing the access barrier to pathological services [10].

In the past five years, a class of computational models known as deep neural networks has driven substantial advances in the field of artificial intelligence. Comprising many processing layers, deep neural networks take a data-driven approach to automatically learn the most relevant features of input data for any given task, dramatically improving the state of the art in computer vision [11], natural language processing [12], speech recognition [13], strategy games such as Go [14], and biomedical data science [15]. For medical image analysis in particular, deep learning has achieved strong performance in classifying images including chest radiographs [16], retinal fundus photographs [17], head CT scans [18], lung histopathology slides [19], and skin cancer images [20].

In this study, we used 326 slides from our institution to train a deep neural network for colorectal polyp classification. In addition to the standard internal evaluation on 157 slides, we tested our algorithm on an external test set of 513 slides from 24 institutions in the United States and compared our algorithm’s performance to that of local pathologists in terms of accuracy, sensitivity, and specificity. To the best of our knowledge, our work is the first to comprehensively evaluate a deep learning algorithm for colorectal polyp classification and show the generalizability of such a model across multiple institutions.

2 Results

Deep neural networks for colorectal polyp classification.

We developed a deep neural network model to classify the four most common polyp types (tubular adenoma, tubulovillous/villous adenoma, hyperplastic polyp, and sessile serrated adenoma) on digitized histopathology slides. We collected an internal set of 326 slides used for model training and two test sets for evaluation: an internal test set of 157 slides and an external test set of 513 slides from 24 institutions. For all slides in both test sets, five gastrointestinal (GI) pathologists independently provided diagnoses that were used to determine gold-standard labels for each slide. We also comprehensively compared the performance of our model with local pathologists’ diagnoses made at the point-of-care and retrieved from the corresponding pathology labs.

Internal evaluation. We evaluated our model in terms of accuracy, sensitivity, and specificity on both an internal test set of 157 images and a multi-institutional external test set of 513 images. Table 1 shows the per-class performance metrics of local pathologists and our model for both test sets. For the internal dataset from DHMC, inter-observer agreement, measured by Cohen’s κ, was in the substantial range (0.61–0.80), with the five study GI pathologists scoring an average κ of 0.72 (95% CI 0.64–0.80) and an average agreement of 0.79 (95% CI 0.73–0.85). Our model outperformed local DHMC pathologists on all metrics, with a mean accuracy (the unweighted average of individual polyp type accuracies) of 93.5% (95% CI 89.6%–97.4%) compared with local pathologists’ accuracy of 91.4% (95% CI 87.1%–95.8%). A two-tailed z-test for proportions revealed, however, that the differences in performance were not significant, with p-values of 0.48, 0.14, and 0.80 for accuracy, sensitivity, and specificity, respectively.
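The significance test above can be sketched in a few lines. This is a minimal illustration assuming the test is the standard pooled two-proportion z-test and treating each of the 157 slide-level decisions as an independent Bernoulli trial (a simplification; the paper's exact procedure may differ):

```python
import math

def two_proportion_z_test(p1, p2, n1, n2):
    """Two-tailed pooled z-test for the difference of two proportions.

    Assumes independent Bernoulli trials, which is a simplification of
    the paper's per-slide comparisons. Returns (z, p_value).
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p-value from the standard normal CDF via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Internal test set: model mean accuracy 93.5% vs. local pathologists'
# 91.4%, with n = 157 slides in each group.
z, p = two_proportion_z_test(0.935, 0.914, 157, 157)
print(round(p, 2))  # ~0.48, matching the reported non-significant difference
```

With these inputs the test reproduces the reported p-value of 0.48 for accuracy, which is why we assume this is the flavor of proportion test used.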

            Internal Test Set (n = 157)                  External Test Set (n = 513)
            Local Pathologists    Deep Neural Network    Local Pathologists    Deep Neural Network
Class       Acc   Sens  Spec      Acc   Sens  Spec       Acc   Sens  Spec      Acc   Sens  Spec
TA          89.8  76.1  95.5      93.0  89.1  94.6       72.7  33.0  97.5      82.8  65.5  93.7
TVA         94.3  88.2  95.8      95.5  97.1  95.1       79.9  85.9  78.7      85.3  94.1  83.6
HP          89.8  76.9  94.1      92.4  82.1  95.8       82.7  50.6  97.2      86.9  67.1  96.0
SSA         91.7  81.6  95.0      93.0  78.9  97.5       88.3  87.1  88.5      87.5  70.0  90.3
Overall     91.4  80.4  95.1      93.5  86.8  95.7       80.9  64.2  90.5      85.7  74.2  90.9
Table 1: Per-class comparison between local pathologists and a deep neural network in classifying colorectal polyps on an internal test set of 157 slides and an external test set of 513 slides. TA: tubular adenoma; TVA: tubulovillous/villous adenoma; HP: hyperplastic polyp; SSA: sessile serrated adenoma. Acc: accuracy; Sens: sensitivity; Spec: specificity.

Multi-institutional external evaluation. Our external dataset, on the other hand, showed slightly lower agreement for both pathologists and our model. Here, the five study GI pathologists scored an average multi-class Cohen’s κ of 0.64 (95% CI 0.58–0.70) and an agreement of 0.74 (95% CI 0.69–0.78). In terms of accuracy and sensitivity combined over the four classes, we found that the model significantly outperformed local pathologists at the 0.05 significance level, with p-values of 0.039 and 0.00052, respectively. Our model also performed at a similar specificity (90.9%) compared with local pathologists (90.5%). In our external dataset of 513 slides from 24 different pathology labs, nine institutions had more than 20 slides in our test set. In Figure 1, the box-and-whisker plot shows the accuracies, sensitivities, and specificities for colorectal polyp classification of both local pathologists and our deep learning model at these nine institutions. The deep neural network outperformed pathologists on all three evaluation metrics for five of the nine institutions, although not at a statistically significant level for any institution individually.

Figure 1: Performance of local pathologists and a deep learning model for colorectal polyp classification in terms of accuracy, sensitivity, specificity, and F1-score for the nine external institutions in the test set with more than 20 slides. In each plot, the box shows the median and upper/lower quartiles, and the whiskers show the minimum and maximum.

Performance at various operating points. To evaluate our model at different operating points of sensitivity and specificity on the external test set, we calculated Receiver Operating Characteristic (ROC) curves, which show how our model’s predicted patch areas translate to slide-level diagnoses. In Figure 2, we plot all possible operating points of sensitivity and specificity for our model for each polyp type, as well as areas under the curve (AUC).
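The ROC construction described above, sweeping a threshold on the predicted patch area for a polyp type and recording the resulting sensitivity/specificity pairs, can be sketched as follows. The scores and labels here are made-up toy values, not the paper's data, and the sweep ignores tied scores for simplicity:

```python
def roc_curve(scores, labels):
    """Sweep a decision threshold over per-slide scores (e.g., the
    fraction of a slide's patches predicted as a given polyp type)
    and return (fpr, tpr) points ordered for integration."""
    # Sort descending by score so the threshold loosens monotonically.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
               for i in range(len(fpr) - 1))

# Hypothetical slide-level scores for one polyp type.
scores = [0.9, 0.8, 0.35, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
fpr, tpr = roc_curve(scores, labels)
print(round(auc(fpr, tpr), 3))  # 0.833
```

A production implementation would also handle tied scores and use many slides per class; the sketch only conveys how patch-area thresholds translate into operating points.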

Figure 2: Performance of a deep neural network compared with local pathologists (whose diagnoses were given at the point-of-care) and our study’s GI pathologists, evaluated against the test set gold-standard diagnoses. We plotted sensitivity (the true positive rate) and specificity (the true negative rate) for each pathologist, as well as all possible operating points for our model.

Confusion matrices and error analysis. In Figure 3, we show confusion matrices for both local pathologists at external institutions and our model, revealing which polyp types were the most challenging to diagnose. Local pathologists often misclassified tubular adenomas as tubulovillous/villous adenomas (58%) and hyperplastic polyps as sessile serrated adenomas (37%). The deep neural network similarly misclassified many tubular adenomas as tubulovillous/villous adenomas (33%) and hyperplastic polyps as sessile serrated adenomas (27%). For further analysis of our model’s errors, Supplementary Figure 1 shows violin plots for predicted areas of each polyp type on slides.

Figure 3: Confusion matrices for (A) local pathologists’ diagnoses given at the point-of-care and (B) the model’s predicted diagnoses, in comparison with multi-pathologist gold-standard diagnoses for the external test set. Each cell in the confusion matrix shows the agreement between the multi-pathologist gold-standard labels and local pathologists’ or our model’s diagnoses. TA: tubular adenoma; TVA: tubulovillous/villous adenoma; HP: hyperplastic polyp; SSA: sessile serrated adenoma.

Visualization. Finally, we visualized the classifications of our model in comparison with the regions of interest annotated by pathologists. Figure 4 shows examples of slides with pathologists’ annotations, the heatmap produced by our model, and the model output that could potentially be used to aid pathologists in clinical practice. Our study’s lead GI pathologist (AAS) subjectively examined the heatmaps produced by our model and confirmed that they were largely on target.

Figure 4: Visualization of the classifications of the deep neural network model. The first column shows the original image, and the second column shows pathologists’ annotations of precancerous lesions. The third column depicts the model’s detected heatmap, where higher confidence predictions are shown in darker color. In the fourth column, we show the model’s final output, which highlights precancerous lesions that can potentially be used to aid pathologists in clinical practice.

3 Discussion

To our knowledge, our study is the first to evaluate a deep neural network for colorectal polyp classification on a large multi-institutional dataset with comparison to local diagnoses made at the point-of-care. On a test set comprising 513 images from 24 external institutions, our model achieved an accuracy of 85.7%, significantly outperforming the local pathologists’ accuracy of 80.9% (p = 0.039). For detection of tubulovillous/villous adenoma, the model achieved a high AUC of 0.96, likely because computing areas of tubulovillous or villous growths can be done more precisely than pathologists’ visual estimation. Regarding annotation agreement, the five study GI pathologist annotators had an average Cohen’s κ of 0.72 (95% CI 0.64–0.80) on our internal test set and 0.64 (95% CI 0.58–0.70) on the external test set, much higher than previously reported Cohen’s κ scores of 0.46 [23], 0.31 [24], 0.55 [25], and 0.54 [26]. This difference is likely due to differences in polyp type distributions across datasets, inter-laboratory variation in tissue processing and staining, and institutional biases in polyp classification criteria.

In terms of error analysis, the deep neural network made the same types of misclassifications as local pathologists, as shown by the similarities in their confusion matrices. Both the model and local pathologists distinguished between adenomatous (tubular/tubulovillous/villous) and serrated (hyperplastic/sessile serrated) polyps with high accuracy but found further sub-classification more challenging. We hypothesize that many of the mistakes in sub-classification occurred because thresholds for detection of tubulovillous/villous growths and of sessile serrated crypts vary among pathologists, as our lead GI pathologist’s manual inspection of discordances found that many of the errors made by the deep neural network were similar to mistakes made by pathologists in practice.

Our study not only demonstrates a deep learning model for classification of colorectal polyps but also advances the previous literature in terms of model evaluation and study design. The previously foremost study on deep learning for colorectal polyp classification, done in 2016 by Korbar et al. [27, 28], demonstrated good performance on an internal dataset but used a simpler model and did not include pathologist-level performance or local diagnoses. Our study, on the other hand, evaluates a deep neural network on a multi-institutional external dataset and demonstrates a significant diagnostic advantage of deep neural networks compared with local pathologists at the point-of-care. Many previous papers demonstrated clinician-level performance of deep neural networks on various medical classification tasks [16-20, 29-30]. All of these studies, however, measured clinician-level performance on a predetermined number of clinicians from a few medical institutions in a controlled setting (i.e., their data were collected solely for the study). Although it is valuable to measure retrospective clinician performance on classification tasks, we instead used diagnoses previously made by local pathologists in clinical practice at the point-of-care at 24 external institutions for comparison against the deep neural network.

A deep learning model for colorectal polyp classification, if validated through clinical trials, has potential for widespread application in clinical settings. Our model could be implemented in laboratory information systems to guide pathologists by identifying areas of interest on digitized slides, which could improve work efficiency, reproducibility, and accuracy for colorectal polyp classification. Although expert clinician confirmation of diagnoses will still be appropriate in the near future, our model could triage slides with diagnoses that are more likely to be preinvasive for subsequent review by pathologists. Since the U.S. Preventive Services Task Force (USPSTF) recommends that all adults aged 50 to 75 undergo screening for colorectal cancer, an automated model for classification could be useful in relieving pathologists’ burden in slide review and ultimately reduce the barrier of access for colorectal cancer screening.

The clinical implementation of our proposed deep neural network must carefully consider the model’s task-specific weaknesses and their clinical implications. As a first step, our model’s application will primarily guide pathologists’ reading of digital slides. Classifying colorectal polyp slides is one of the highest-volume services in GI pathology, and so our proposed model can be used as a pre-screening step to facilitate the process. If our model is validated in forthcoming prospective clinical studies, it can help pathologists in clinical practice by identifying regions of interest in digitized slides, as depicted in the last column of Figure 4. This pre-screening step can potentially improve the pathologists’ efficiency in this task by prioritizing the review of the highlighted areas.

Our study has several notable limitations. Although our model outperformed local pathologists on the external test set, it did not perform as well as the study GI pathologists in terms of sensitivity and specificity, as shown in Figure 2. These results might suggest that pathologists at the same institution have higher inter-observer agreement than those at external institutions, but they also imply that our model could be further improved by training on a larger dataset or through advances in neural network architectures. Furthermore, although our model identifies the most common polyp types, our study was done on well-sectioned, clearly stained slides and did not include less common classes such as traditional serrated adenoma or sessile serrated adenoma with cytological dysplasia. Our team plans to collect additional data and extend our model to these rare cases as future work. Finally, local pathologists might have had access to additional slides and patient information, such as patient colonoscopy history and polyp biopsy location, that influenced their diagnoses of polyp type. Access to this additional information might explain some of the discrepancies between local diagnoses and the ground-truth labels, which were based only on digitized slides.

Moving forward, further work can be done in deep learning for analysis of colorectal polyp images. Foremost, we plan to implement our model prospectively in a clinical setting to measure its ability to enhance pathologists’ classification of colorectal polyps and improve outcomes in a clinical trial. Although our results currently show strong potential for clinical application, deep learning must gain the confidence of patients, clinicians, and the medical community before widespread implementation is possible. In terms of technical improvements to our model, more data can be collected and used for training to increase the model’s performance, especially for sessile serrated adenomas. Moreover, related work has shown that deep learning has potential to identify hidden features in histopathology images that can be used to detect gene mutations [19] and predict patient survival [31-33], tasks that pathologists do not perform. To this end, we plan to collect more patient outcome data to train our model to predict polyp recurrence and patient survival in colorectal cancer.

Overall, we have demonstrated a deep neural network for classification of colorectal polyps that outperformed local pathologists on an independent test set comprising 513 images from 24 institutions. If confirmed in clinical trials, our model could potentially improve the efficiency, reproducibility, and accuracy of one of the most common cancer screening procedures.

4 Methods

Data Collection. We utilized two datasets of hematoxylin and eosin (H&E) stained, formalin-fixed, paraffin-embedded colorectal polyp whole-slide images, each of which could contain one or more lesions, scanned by Leica Aperio scanners at 40x resolution. For our internal dataset, we collected all 508 slides scanned from January 2016 to June 2016 at the Dartmouth-Hitchcock Medical Center (DHMC), a tertiary academic care center in New Hampshire, USA. We randomly partitioned these slides into a training set of 326 slides, a development set of 25 slides, and an internal test set of 157 slides. In our internal dataset, each whole-slide image belonged to a different patient and colonoscopy procedure.

For our external test set, we used slides from patients who participated in a randomized clinical trial studying the effect of supplementation with calcium and/or vitamin D for the prevention of colorectal adenomas [21]. This external test set includes 1,182 previously collected whole slides along with the diagnoses given by local pathologists at the point-of-care. Of these slides, we randomly sampled up to 150 slides for each of the four most common polyp types (tubular adenoma, tubulovillous/villous adenoma, hyperplastic polyp, and sessile serrated adenoma, which together comprised 95.5% of the diagnoses) as diagnosed by the local pathologist. For the less frequent polyp types (tubulovillous/villous adenoma and sessile serrated adenoma), digitized slides with more than one lesion were split by lesion into smaller images. After random sampling, we had 528 slides from the four polyp types, of which 15 were removed due to poor slide quality as determined by our study’s lead gastrointestinal pathologist (AAS). In total, our final external validation set comprised 513 slides from 24 different institutions spanning 13 states in the United States. In this external test set, some slides corresponded to the same patients, as our 513 slides came from 179 distinct patients.
All slides from the internal and external test sets were held out from model development until final evaluation of the model.

Human Subject Regulations. This study and the use of human subject data in this project were approved by the Dartmouth-Hitchcock Health institutional review board (D-HH IRB) with a waiver of informed consent. The conducted research reported in this paper is in accordance with this approved D-HH IRB protocol and the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects.

Data Annotation.

Our study’s annotation process involved five GI pathologists from the Department of Pathology and Laboratory Medicine at DHMC: three with GI pathology fellowship training and two who gained GI pathology expertise through years on GI pathology service. For the training set slides, two of the GI pathologists manually annotated bounding boxes around regions of interest to be used in model training. In total, 3,848 regions of interest were identified for model training and labeled as one of the following classes: tubular adenoma, tubulovillous/villous adenoma, hyperplastic polyp, sessile serrated adenoma, or normal. In our development set, the two pathologists annotated fixed-size patches (square areas of 224 by 224 pixels) of classic examples for each polyp type. Since this dataset was used to tune hyperparameters in our model, all patches were confirmed with high confidence by both pathologists, and patches with disagreements were discarded. For the internal test set, local diagnoses were parsed from DHMC’s laboratory information system, and the five GI pathologists also independently and retrospectively diagnosed each slide as one of four polyp types: tubular adenoma, tubulovillous/villous adenoma, hyperplastic polyp, or sessile serrated adenoma. For this internal set, the local diagnoses given at the point-of-care at DHMC may have been from one of the five study GI pathologists, but the original diagnosis and identity of the pathologist at the point-of-care were hidden during the retrospective annotation phase.

For the external test set, we retrieved the diagnoses given at the point-of-care by local pathologists from their corresponding institutions. In addition, the five GI pathologists from DHMC also retrospectively diagnosed all slides in the test set in the same fashion as for the internal test set. In total, we had five complete sets of diagnoses from GI pathologists, as well as the diagnoses given by local pathologists at the point-of-care. For both internal and external test sets, gold-standard diagnoses were assigned by taking the majority vote of three randomly selected GI pathologists, and the other individual diagnoses were used to calculate pathologists’ performance on the dataset. Figure 5 depicts the data flow for our study design. Supplementary Figure 2 shows statistics describing slide sizes and polyp types for the internal and external test sets. In Supplementary Table 1, we show the polyp type distribution in our test set for each external institution, grouped by institution type and state.
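The gold-standard assignment described above, a majority vote over three GI pathologists sampled from the five annotators, can be sketched as below. How the paper resolves three-way ties is not stated, so flagging them with `None` is our assumption, and the annotator names are hypothetical:

```python
import random
from collections import Counter

def gold_standard(diagnoses, rng=None):
    """Assign a slide's gold-standard label by majority vote of three
    GI pathologists randomly sampled from the five annotators.

    `diagnoses` maps annotator name -> diagnosis. Tie handling is not
    specified in the paper; here unresolved 1-1-1 splits return None
    (an assumption, e.g. to flag the slide for adjudication).
    """
    rng = rng or random
    voters = rng.sample(sorted(diagnoses), 3)
    counts = Counter(diagnoses[v] for v in voters)
    label, top = counts.most_common(1)[0]
    return label if top >= 2 else None

# Hypothetical slide with diagnoses from five annotators A-E.
slide = {"A": "TA", "B": "TA", "C": "TVA", "D": "TA", "E": "HP"}
print(gold_standard(slide))
```

The remaining two annotators' diagnoses, excluded from the vote, would then be scored against this label to estimate pathologist performance.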

Figure 5: Data flow diagram for our study. We trained our model on an internal training and development set and then evaluated it on internal and external test sets with multi-pathologist gold-standard diagnoses. WSI denotes whole-slide image. Crops in the training set varied in length and width, whereas patches in the development set were of fixed size and represented classic examples of each polyp type.

Deep Learning Model.

In recent years, deep convolutional neural networks have achieved state-of-the-art performance in image classification [11]. In this study, we implemented the deep residual network (ResNet), a neural network architecture proposed by Microsoft that significantly outperformed all other models on the ImageNet and COCO image recognition benchmarks [22]. For model training, we applied a sliding-window method to the 3,848 variable-size regions of interest labeled by pathologists in the training set, extracting approximately 7,000 fixed-size patches per polyp type. We then initialized the network with the He weight initialization [11] and trained it for 200 epochs with an initial learning rate of 0.001, decayed by a factor of 0.9 every epoch. Throughout training, we applied standard image augmentation techniques, including rotations and flips, as well as color jittering on the brightness, contrast, saturation, and hue of each image. For our final model, we used an ensemble of five ResNets with 18, 34, 50, 101, and 152 layers.
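The sliding-window patch extraction over a labeled region of interest can be illustrated with a small coordinate generator. The 224-pixel patch size follows the development-set patch size reported above; the stride is a free parameter the paper does not report, so the non-overlapping default here is an assumption:

```python
def sliding_window(width, height, patch=224, stride=224):
    """Yield (x, y) top-left corners of fixed-size patches tiling a
    region of interest. Patch size of 224 px follows the paper; the
    stride is an illustrative assumption (not reported)."""
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            yield x, y

# A hypothetical 1000 x 600 px region of interest yields a 4 x 2 grid
# of non-overlapping 224 x 224 patches.
coords = list(sliding_window(1000, 600))
print(len(coords))  # 8
```

In practice each generated coordinate would be cropped from the whole-slide image, augmented, and fed to the ResNet ensemble; an overlapping stride would produce denser coverage at higher compute cost.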

Slide-Level Inference.

For our deep learning model to infer the overall diagnosis of a whole-slide image, we designed a hierarchical classification algorithm to match the nature of our classification task. First, each slide was broken down into many patches using a sliding window algorithm, and each patch was classified by the neural network. Using these predicted diagnoses for all patches in a given slide, our model first determined whether a polyp was adenomatous (tubular/tubulovillous/villous) or serrated (hyperplastic/sessile serrated) by comparing the number of predicted patches for the adenomatous and serrated types. Adenomatous polyps with more than a certain threshold amount of tubulovillous/villous tissue were classified as overall tubulovillous/villous adenoma, whereas the remaining polyps were classified as tubular adenoma. For serrated polyps, our algorithm classified polyps with above a certain threshold of sessile serrated patches as overall sessile serrated adenomas and the remaining polyps as hyperplastic. Finally, we refined our model’s prediction of high-risk diagnoses by discarding patch predictions of low confidence for tubulovillous/villous and sessile serrated adenomas. All thresholds were determined using a grid search over our internal training set. The hierarchical nature of our inference heuristic allowed us to imitate the schema used by pathologists for this classification task without training a separate machine learning classifier, which would likely overfit our training set of 326 whole-slide images.
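The hierarchical inference heuristic above can be sketched as follows. The threshold values are illustrative placeholders (the paper tunes them by grid search on the training set), and the final step of discarding low-confidence patch predictions for high-risk types is omitted for brevity:

```python
from collections import Counter

ADENOMATOUS = {"TA", "TVA"}
SERRATED = {"HP", "SSA"}

def slide_diagnosis(patch_labels, tva_thresh=0.3, ssa_thresh=0.3):
    """Hierarchical slide-level inference over per-patch predictions.

    First decide adenomatous vs. serrated by comparing patch counts,
    then apply a within-branch threshold on the higher-risk subtype.
    Threshold values here are placeholders, not the tuned ones.
    """
    counts = Counter(label for label in patch_labels if label != "normal")
    n_adeno = sum(counts[c] for c in ADENOMATOUS)
    n_serr = sum(counts[c] for c in SERRATED)
    if n_adeno >= n_serr:
        frac = counts["TVA"] / max(n_adeno, 1)
        return "TVA" if frac > tva_thresh else "TA"
    frac = counts["SSA"] / max(n_serr, 1)
    return "SSA" if frac > ssa_thresh else "HP"

print(slide_diagnosis(["TA"] * 8 + ["TVA"] * 2 + ["normal"] * 5))  # TA
print(slide_diagnosis(["HP"] * 5 + ["SSA"] * 4))                   # SSA
```

Because each branch reduces to a single threshold comparison, the whole heuristic can be tuned with a simple grid search rather than a learned slide-level classifier, matching the overfitting argument above.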

Evaluation. For final evaluation, we compared our model’s performance with the local pathologists’ diagnoses originally made at the point-of-care on both the internal test set and the multi-institutional external test set. Local pathologist performance measures were averaged over all samples, since information about individual pathologists’ performances at external institutions was anonymized. First, we measured the agreement of our study’s GI pathologists in terms of multi-class Cohen’s κ and percent agreement. For our model’s classifications, we calculated accuracy (the unweighted average of individual class accuracies), sensitivity, and specificity in relation to gold-standard diagnoses and compared these metrics with those of local pathologists. Furthermore, to evaluate our model’s performance at various sensitivities and specificities, we plotted Receiver Operating Characteristic (ROC) curves for each polyp type as a function of the number of patches predicted by the deep neural network. Finally, we calculated confusion matrices for local pathologists and our model and conducted the corresponding error analysis.
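The per-class metrics and the unweighted mean accuracy used in Table 1 can be computed as sketched below, treating each class one-vs-rest. The slide labels here are toy values for illustration:

```python
def per_class_metrics(truths, preds, classes):
    """One-vs-rest accuracy, sensitivity, and specificity per class,
    plus the unweighted mean of per-class accuracies (the paper's
    'mean accuracy')."""
    metrics = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(truths, preds))
        tn = sum(t != c and p != c for t, p in zip(truths, preds))
        fp = sum(t != c and p == c for t, p in zip(truths, preds))
        fn = sum(t == c and p != c for t, p in zip(truths, preds))
        metrics[c] = {
            "acc": (tp + tn) / len(truths),
            "sens": tp / (tp + fn) if tp + fn else 0.0,
            "spec": tn / (tn + fp) if tn + fp else 0.0,
        }
    mean_acc = sum(m["acc"] for m in metrics.values()) / len(classes)
    return metrics, mean_acc

# Toy gold-standard labels vs. predictions for six slides.
truths = ["TA", "TA", "TVA", "HP", "SSA", "HP"]
preds  = ["TA", "TVA", "TVA", "HP", "HP",  "HP"]
m, mean_acc = per_class_metrics(truths, preds, ["TA", "TVA", "HP", "SSA"])
print(round(mean_acc, 3))  # 0.833
```

Note that the unweighted mean over classes differs from plain example-level accuracy whenever the class distribution is imbalanced, as it is here between common tubular adenomas and rarer sessile serrated adenomas.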

Data Availability. The dataset used in this study is not publicly available due to patient privacy constraints, in accordance with institutional requirements governing human subject privacy protections. The data that support the findings of this study are offered to editors and peer reviewers at the time of submission, upon request, for the purposes of evaluating the manuscript.

Code Availability. The source code for this study is publicly available at


  1. [leftmargin=*]

  2. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2019. CA Cancer J Clin 2019;69(1):7-34.

  3. Siegel RL, Ward EM, Jemal A. Trends in colorectal cancer incidence rates in the United States by Tumor Location and Stage, 1992-2008. Cancer Epidemiol Biomarkers 2012;21(3):411-416.

  4. Cress RD, Morris C, Ellison GL, Goodman MT. Secular changes in colorectal cancer incidence by subsite, stage at diagnosis, and race/ethnicity, 1992-2001. Cancer 2006;107(S5):1142-1152.

  5. Edwards BK, Ward E, Kohler BA, Eheman C, Zauber AG, Anderson RN, et al. Annual report to the nation on the status of cancer, 1975-2006, featuring colorectal cancer trends and impact of interventions (risk factors, screening, and treatment) to reduce future rates. Cancer 2009;116(3):544-573.

  6. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2017. CA Cancer J Clin 2017;67(1)7-30.

  7. Rex DK, Boland CR, Dominitz JA, Giardiello FM, Johnson DA, Kaltenbach T, et al. Colorectal cancer screening: recommendations for physicians and patients from the U.S. multi-society task force on colorectal cancer. Gastroenterology 2017;153:307-323.

  8. Kronborg O, Fenger C, Olsen J, Jorgensen OD, Sondergaard O. Randomised study of screening for colorectal cancer with faecal-occult-blood test. Lancet 1996;348:1467-1471.

  9. Zauber AG, Winawer SJ, O’Brien MJ, Lansdorp-Vogelaar I, van Ballegooijen M, Hankey BF, et al. Colonoscopic Polypectomy and Long-Term Prevention of Colorectal-Cancer Deaths. NEJM 2012;366:687-696.

  10. Citarda F, Tomaselli G, Capocaccia R, Barcherini S, Crespi M. Efficacy in standard clinical practice of colonoscopic polypectomy in reducing colorectal cancer incidence Gut 2001;48:812-815.

  11. Wilson ML, Fleming KA, Kuti M, Looi LM, Lago N, Ru K. Access to pathology and laboratory medicine services: a crucial gap. Lancet 2018;391(10133):1927-1938.

  12. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Proc. IEEE International Conference on Computer Vision 2015; published online Feb 6. Accessed May 12, 2019.

  13. Jean S, Cho K, Memisevic R, Bengio Y. On using very large target vocabulary for neural machine translation. Proc. ACL-IJCNLP 2014; published online Dec 5. Accessed May 12, 2019.

  14. Mikolov T, Deoras A, Povey D, Burget L, Cernocky J. Strategies for training large scale neural network language models. Proc. Automatic Speech Recognition and Understanding 2011; published online March 5, 2012. Accessed May 12, 2019.

  15. Silver D, Huang A, Maddison C, Guez A, Sifra L, van den Driessche G, et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016;529:484-489.

  16. Ker J, Wang L, Rao J, Lim T. Deep Learning Applications in Medical Image Analysis. IEEE Access 2017;6:9375-9389.

  17. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc. AAAI 2019, published online Jan 21.

  18. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016;316(22):2402-2410.

  19. Chilamkurthy S, Gosh R, Tanamala S, Biviji M, Campeau N, Venugopal V, et al. Deep learning algorithms for detection of critical findings in head CT scans: a retrospective study. Lancet 2018;392:2388-96.

  20. Coudray N, Ocampo PS, Sakellaropoulos T, Narula N, Snuderl M, Fenyö D, et al. Classification and mutation prediction from non–small cell lung cancer histopathology images with deep learning. Nat Med 2018;24:1559-1567.

  21. Esteva A, Kuprel B, Novoa R, Ko J, Swetter S, Blau H, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542:115-118.

  22. Baron JA, Barry EL, Mott LA, Rees JR, Sandler RS, Snover DC, et al. A trial of calcium and vitamin D for the prevention of colorectal adenomas. NEJM 2015;373(16):1519-1530.

  23. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Proc. CVPR 2016; published online Dec 10, 2015. Accessed May 19, 2019.

  24. Yoon H, Martin A, Benamouzig R, Longchampt E, Deyra J, Chaussade S. Inter-observer agreement on histological diagnosis of colorectal polyps: the APACC study. Gastroenterol Clin Biol 2002;26(3):220-224.

  25. Terry MB, Neugut AI, Bostick RM, Potter JD, Haile RW, Fenoglio-Preiser CM. Reliability in classification of advanced colorectal adenomas. Cancer Epidemiol Biomarkers Prev 2002;11(7):660-663.

  26. Van Putten PG, Hol L, Van Dekken H, van Krieken JH, Van Ballegooijen M, Kuipers EJ, et al. Inter-observer variation in the histological diagnosis of polyps in colorectal cancer screening. Histopathology 2011;58(6):971-981.

  27. Mahajan D, Downs-Kelly E, Liu X, Pai RK, Patil DT, Rybicki L, et al. Interobserver variability in assessing dysplasia and architecture in colorectal adenomas: a multicenter Canadian study. J Clin Pathol 2014;67(9):781-786.

  28. Korbar B, Olofson AM, Miraflor AP, Nicka CM, Suriawinata MA, Torresani L, et al. Deep learning for classification of colorectal polyps on whole-slide images. J Pathol Inform 2017;8:30.

  29. Korbar B, Olofson AM, Miraflor AP, Nicka CM, Suriawinata MA, Torresani L, et al. Looking under the hood: deep neural network visualization to interpret whole-slide image analysis outcomes for colorectal polyps. Proceedings of the CVPR Workshops 2017. Accessed June 20, 2019.

  30. Hannun AY, Rajpurkar P, Haghpanahi M, Tison GH, Bourn C, Turakhia MP, et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med. 2019;25:65-69.

  31. Li X, Zhang S, Zhang Q, Wei X, Pan Y, Zhao J, et al. Diagnosis of thyroid cancer using deep convolutional neural network models applied to sonographic images: a retrospective, multicohort, diagnostic study. Lancet Oncol 2019;20:193-201.

  32. Bychkov D, Linder N, Turkki R, Nordling S, Kovanen P, Verrill C, et al. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci Rep 2018;8:3395.

  33. Kim DW, Lee S, Kwon S, Nam W, Cha IH, Kim H. Deep learning-based survival prediction of oral cancer patients. Sci Rep 2019;9:6994.

  34. Diamant A, Chatterjee A, Vallières M, Shenouda G, Seuntjens J. Deep learning in head and neck cancer outcome prediction. Sci Rep 2019;9:2764.

5 Supplementary Materials

| Pathology Lab Affiliation Type | State | TA | TVA | HP | SSA | Total Slides |
|---|---|---|---|---|---|---|
| University hospital | Georgia | 42 | 0 | 27 | 6 | 75 |
| | Colorado | 1 | 7 | 12 | 0 | 20 |
| | California | 6 | 0 | 0 | 0 | 6 |
| | Texas | 0 | 0 | 1 | 0 | 1 |
| Veteran’s hospital | Georgia | 20 | 1 | 10 | 0 | 31 |
| | Minnesota | 9 | 12 | 3 | 0 | 24 |
| | Colorado | 1 | 0 | 6 | 0 | 7 |
| Metropolitan/regional hospital | Iowa | 7 | 9 | 9 | 2 | 27 |
| | Ohio | 4 | 2 | 8 | 8 | 22 |
| | Minnesota | 0 | 0 | 11 | 10 | 21 |
| | South Carolina | 7 | 3 | 3 | 1 | 14 |
| | New Hampshire | 10 | 2 | 1 | 0 | 13 |
| | South Carolina | 3 | 0 | 1 | 0 | 4 |
| | Minnesota | 1 | 0 | 3 | 0 | 4 |
| | California | 2 | 0 | 0 | 0 | 2 |
| | Iowa | 1 | 0 | 1 | 0 | 2 |
| Freestanding | Iowa | 25 | 19 | 10 | 0 | 54 |
| | Iowa | 18 | 3 | 8 | 1 | 30 |
| | Iowa | 1 | 1 | 5 | 0 | 7 |
| | Colorado | 0 | 1 | 3 | 1 | 5 |
| | New Hampshire | 0 | 0 | 2 | 2 | 4 |
| | Colorado | 0 | 0 | 3 | 0 | 3 |
| | Colorado | 0 | 0 | 1 | 0 | 1 |
| Specialty clinic/practice | Minnesota | 39 | 25 | 33 | 39 | 136 |
| Combined | | 197 | 85 | 161 | 70 | 513 |

Table S1: Colorectal polyp slide class distribution for our multi-institutional external test set, grouped by pathology laboratory institutional affiliation type and state. Per-class counts reflect the gold standard diagnosis. TA: tubular adenoma; TVA: tubulovillous/villous adenoma; HP: hyperplastic polyp; SSA: sessile serrated adenoma.
Figure S1: Violin plots showing predicted areas (calculated by number of patches) for each polyp type on whole-slide images. TA: tubular adenoma; TVA: tubulovillous/villous adenoma; HP: hyperplastic polyp; SSA: sessile serrated adenoma. Subjective inspection by our study pathologists confirms that each plot reflects the expected histology distribution. For whole slides diagnosed as TA and TVA, our model detected significant areas of both TA and TVA, reflecting the morphological similarity of the two polyp types. Our model also detected large areas of HP in whole slides diagnosed as TA, which is expected since all polypoid lesions in TAs are exposed to elevated mechanical forces and therefore show hyperplastic features at their peripheries. While the model detected mostly HP areas in whole slides diagnosed as hyperplastic polyps, it also found some small areas of SSA, possibly because larger HPs with deep, dilated crypts and serrated epithelium can appear similar to SSAs. Finally, for whole slides diagnosed as SSA, it is unsurprising that HP comprised the largest predicted area, since all SSAs have significant morphological overlap with hyperplastic polyps and may contain only a few classic, broad-based, dilated crypts with heavy serration.
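Since the predicted area of each polyp type is simply its patch count on a slide, the aggregation behind Figure S1 can be sketched as below. This is a hypothetical illustration, not the study's actual code: the class list and the `predicted_areas` helper are assumptions, and patch labels would in practice come from the network's per-patch predictions.

```python
from collections import Counter

# Assumed class set; the paper classifies the four polyp types,
# and a "normal" label is a plausible catch-all for benign tissue.
CLASSES = ["TA", "TVA", "HP", "SSA", "normal"]

def predicted_areas(patch_labels):
    """Aggregate patch-level class predictions into per-class areas.

    Each patch covers a fixed tissue area, so the predicted area of a
    polyp type on a whole-slide image is its patch count.
    """
    counts = Counter(patch_labels)
    return {c: counts.get(c, 0) for c in CLASSES}

# Example: a slide diagnosed as TA with hyperplastic features at the periphery
areas = predicted_areas(["TA", "TA", "TA", "HP", "HP", "normal"])
# areas == {"TA": 3, "TVA": 0, "HP": 2, "SSA": 0, "normal": 1}
```

Collecting these per-class counts over all slides sharing a gold-standard diagnosis yields one violin plot per diagnosis.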
Figure S2: Distribution of polyp class, number of patches per digitized slide, and slide size (in pixels) for (A) the internal test set and (B) the multi-institutional external test set. Patches are fixed-size areas of tissue obtained by sliding a window over the entire image. The size of a digitized slide reflects the area of the tissue after removing the background.
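The sliding-window patching described above can be sketched as follows. This is a minimal illustration under assumed parameters: the paper does not specify the patch size or stride here, so the 224-pixel values (a common input size for ResNet-style networks) and the function name are hypothetical.

```python
def patch_coordinates(width, height, patch_size=224, stride=224):
    """Yield top-left (x, y) coordinates of fixed-size patches obtained
    by sliding a window over a slide of the given pixel dimensions.

    With stride == patch_size the patches tile the image without
    overlap; a smaller stride would produce overlapping patches.
    """
    for y in range(0, height - patch_size + 1, stride):
        for x in range(0, width - patch_size + 1, stride):
            yield (x, y)

# A 448 x 448 region yields a 2 x 2 grid of non-overlapping 224-px patches.
coords = list(patch_coordinates(448, 448))
# coords == [(0, 0), (224, 0), (0, 224), (224, 224)]
```

The number of coordinates produced per slide corresponds to the "number of patches per digitized slide" distribution shown in the figure; in practice, patches containing only background would additionally be filtered out before counting.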