Recommendations on test datasets for evaluating AI solutions in pathology

by André Homeyer, et al.

Artificial intelligence (AI) solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. This assessment requires appropriate test datasets. However, compiling such datasets is challenging and specific recommendations are missing. A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology. Here, we summarize the results and derive general recommendations for the collection of test datasets. We address several questions: Which and how many images are needed? How to deal with low-prevalence subsets? How can potential bias be detected? How should datasets be reported? What are the regulatory requirements in different countries? The recommendations are intended to help AI developers demonstrate the utility of their products and to help regulatory agencies and end users verify reported performance measures. Further research is needed to formulate criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.





1 Introduction

The application of artificial intelligence techniques to digital tissue images has shown great promise for improving pathological diagnosis (serag2019, ; abels2019, ; moxley-wiles2020, ). Such techniques can not only automate time-consuming diagnostic tasks and make analyses more sensitive and reproducible, but also extract new digital biomarkers from tissue morphology for precision medicine (echle2021, ).

Pathology involves a large number of diagnostic tasks, each being a potential application for AI. Many of these involve the characterization of tissue morphology. Such tissue classification approaches have been developed for identifying tumors in a variety of tissues, including lung (coudray2018, ; wang2020, ), colon (iizuka2020, ), breast (cruz-roa2017, ; campanella2019, ), and prostate (campanella2019, ), but also in non-tumor pathology, e.g., kidney transplants (kers2022, ). Further applications include predicting outcomes (skrede2020, ; saillard2020, ) or gene mutations (kather2019, ; coudray2018, ; couture2018, ) directly from tissue images. Similar approaches are also employed to detect and classify cell nuclei, e.g., to quantify the positivity of immunohistochemistry markers like Ki67, ER/PR, Her2, and PD-L1 (hoefener2018, ; balkenhol2021, ).

Testing AI solutions is an important step to ensure that they work reliably and robustly on routine laboratory cases. AI algorithms run the risk of exploiting feature associations that are specific to their training data (lever2016, ). Such “overfitted” models tend to perform poorly on previously unseen data. To obtain a realistic estimate of the prediction performance on real-world data, it is common practice to apply AI solutions to a test dataset. The results are then compared with reference results in terms of task-specific performance metrics, e.g., sensitivity, specificity, or ROC-AUC.

Test datasets may only be used once to evaluate the performance of a finalized AI solution (lever2016, ). They may not be considered during development. This can be considered a consequence of Goodhart’s law stating that measures cease to be meaningful when used as targets (strathern1997, ): If AI solutions are optimized for test datasets, they cannot provide realistic performance estimates for real-world data. Test datasets are also referred to as “hold-out datasets” or “(external) validation datasets.” The term “validation,” however, is not used consistently in the machine learning community and can also refer to model selection during development (lever2016, ).

Besides overfitting, AI methods are prone to “shortcut learning” (geirhos2020, ). Many datasets used in the development of AI methods contain confounding variables (e.g., slide origin, scanner type, patient age) that are spuriously correlated with the target variable (e.g., tumor type) (schmitt2021, ). AI methods often exploit features that are discriminative for such confounding variables and not for the target variable (wallis2022, ). Despite working well for smaller datasets containing similar correlations, such methods fail in more challenging real-world scenarios in ways humans never would (oakden-rayner2020, ). To minimize the likelihood of spurious correlations between confounding variables and the target variable, test datasets must be large and diversified (schmitt2021, ). At the same time, test datasets must be small enough to be acquired with realistic effort and cost. Finding a good balance between these requirements is a major challenge for AI developers.

Comparatively little attention has been paid to compiling test datasets for AI solutions in pathology. Datasets for training, on the other hand, were considered frequently (campanella2019, ; nagpal2019, ; tang2021, ; vali-betts2021, ; tellez2019, ; anghel2019, ; maree2017, ). Training datasets are collected with a different goal than test datasets: While training datasets should produce the best possible AI models, test datasets should provide the most realistic performance assessment for routine use, which presents unique challenges.

Some publications address individual problems in compiling test datasets in pathology, e.g., how to avoid bias in the performance evaluation caused by site-specific image features in test datasets (howard2021, ). Other publications provide general recommendations for evaluating AI methods for medical applications without considering the specific challenges of pathology (oala2020, ; maleki2020, ; cabitza2021, ; park2021, ; dehond2022, ).

Appropriate test datasets are critical to demonstrate the utility of AI solutions as well as to obtain regulatory approval. However, the lack of guidance on how to compile test datasets is a major barrier to the adoption of AI solutions in laboratory practice.

This article gives recommendations for test datasets in pathology. It summarizes the results of extensive literature reviews and discussions by a committee of various stakeholders, including commercial AI developers, pathologists, and researchers. This committee was established as part of the EMPAIA project (Ecosystem for Pathology Diagnostics with AI Assistance), aiming to facilitate the adoption of AI in pathology (hufnagl2021, ).

2 Results

The next sections discuss and provide recommendations on various aspects that must be considered when creating test datasets. For meaningful performance estimates, test datasets must be both diverse enough to cover the variability of data in routine diagnostics and large enough to allow statistically meaningful analyses. Relevant subgroups must be covered, and test datasets should be unbiased. Moreover, test datasets must be sufficiently independent of datasets used in the development of AI solutions. Comprehensive information about test datasets must be reported and regulatory requirements must be met when evaluating the clinical applicability of AI solutions.

2.1 Target population of images

All images an AI solution may encounter in its intended use constitute its “target population of images.” A test dataset must be an adequate sample of this target population to provide a reasonable estimate of the prediction performance of the AI solution. For all applications in pathology, the target population is distributed across multiple dimensions of variability, see Table 1.

Origin: Variabilities

Patient:
  Patient ethnicity
  Patient demographics
  Disease stage/severity
  Rare cases of disease
  Comorbidities
  Biological differences (genetic, transcriptional, epigenetic, proteomic, and metabolomic)

Specimen sampling:
  Tissue heterogeneity
  Size of tissue section
  Coverage of diseased/healthy/boundary regions
  Tissue damage, e.g., torn, cauterized
  Surgical ink present

Slide processing:
  Inter-material and device differences
  Preparation differences (fixation, dehydration; freezing; mechanical handling)
  Cutting artifacts (torn, folded, deformed, thick or inhomogeneously thick tissue)
  Foreign matter/floaters in specimen
  Over-/under-staining, inhomogeneous staining
  Foreign objects on slide/cover slip (dirt, stain residue, pen markings, fingerprint)
  Cracks, air bubbles, scratches
  Slide age

Imaging/image processing:
  Inter- and intra-scanner differences
  Out-of-focus images, heterogeneous focus
  Amount of background in analyzed image region
  Magnification/image resolution
  Heterogeneous illumination
  Grid noise, stitching artifacts
  Lossy image compression

Ground truth annotation:
  Inter- and intra-observer differences
  Ambiguous cases

Table 1: Examples of data variabilities within the intended use (chen2021, ; avanaki2016, ; schoemig-markiefka2021, ; focke2017, ; tellez2019, ; taqi2018, ; chatterjee2014, ; schmitt2021, ; cajal2020, ; pursnani2016, ).

Biological variability. The visual appearance of tissue varies between normal and diseased states. This is what AI solutions are designed to detect and characterize. But even tissue of the same category can look very different (see Figure 1). The appearance is influenced by many factors (e.g., genetic, transcriptional, epigenetic, proteomic, and metabolomic) that differ between patients as well as between demographic and ethnic groups (cajal2020, ). These factors often vary spatially (e.g., different parts of organs are differently affected) and temporally (e.g., the pathological alterations differ based on disease stage) within a single patient (dagogo-jack2017, ).

Figure 1: Examples of tissue variability within and between biopsies (H&E-stained breast tissue of female patients with invasive carcinomas of no special type, 40× objective magnification). First and second column from the left: 41yo patients, grade 2; third and fourth column: 42yo patients, grade 3.

Technical variability. Processing and digitization of tissue sections consist of several steps (e.g., tissue fixation, processing, cutting, staining, and digitization), all of which can contribute to image variability (chen2021, ). Differences in section thickness and staining solutions can lead to variable staining appearances (focke2017, ). Artifacts frequently occur during tissue processing, including elastic deformations, inclusion of foreign objects, and cover glass scratches (schoemig-markiefka2021, ). Differences in illumination, resolution, and encoding algorithms of slide scanner models also affect the appearance of tissue images (chen2021, ).

Observer variability. Images in test datasets are commonly associated with a reference label like a disease category or score determined by a human observer. It is well known that the assessment of tissue images is subject to intra- and inter-observer variability (allison2014, ; el-badry2009, ; martinez2007, ; kujan2007, ; boiesen2000, ; oni2017, ; furness2003, ). This variability results from subjective biases (e.g., caused by training, specialization, and experience) but also from inherent ambiguities in the images (tizhoosh2021, ; homeyer2017, ).

Routine laboratory work occasionally produces images that are unsuitable for the intended use of an AI solution, e.g., because they are ambiguous or of insufficient quality. Most AI solutions require prior quality assurance steps to ensure that solutions are only applied to suitable images (perincheri2021, ; dasilva2021, ). The boundary between suitable and unsuitable images is usually fuzzy (see Figure 2) and there are difficult images that cannot be clearly assigned to either category (see Figure 3).

Figure 2: Qualitative overview of sampling regimes for performance assessment in the entire target population of images or in specific subgroups. The boundary between the target population of images and unsuitable images that do not fall under the intended use is fuzzy.
Figure 3: Examples of different severity levels of artifacts on a prostate section. The top row shows simulated foreign objects, the bottom row shows simulated focal blur. The original image on the left is clearly within the intended use of algorithms for Gleason grading in prostate cancer diagnostics, while the rightmost images are clearly unsuitable. The tissue image is adapted from another source (arvaniti2018dataset, ) (CC0-licensed (CC0license, )).

Defining the target population is challenging and presumes a clear definition of the intended use by the AI developer. The target population of images must be defined before test datasets are collected. It must be clearly stated which subsets of images fall under the intended use. Such subsets may consist of specific disease variants, demographic characteristics, ethnicities, staining characteristics, artifacts, or scanner types. These subsets typically overlap, e.g., the subset of images of one scanner type contains images from different patient age groups. A particular challenge is to define where the target population ends. Examples of images within and outside the intended use can help human observers sort out unsuitable images as objectively as possible.

2.2 Data collection

Test datasets must be representative of the entire target population of images, i.e., sufficiently diverse and unbiased. To minimize spurious correlations between confounding variables and the target variable and to uncover shortcut learning in AI methods, all dimensions of biological and technical variability must be adequately covered for the classes considered (schmitt2021, ; maree2017, ), also reflecting the variability of negative cases without visible pathology (maree2017, ; ianni2020, ).

All images encountered in the normal laboratory workflow must be considered. One way to achieve this is to collect all cases that occurred over a given time period (ianni2020, ) long enough for a sufficient number of cases to be collected (e.g., one year (campanella2019, )). Data should be collected from multiple international laboratories, since they differ in their spectra of patients and diseases, technical equipment and operating procedures. To avoid selection bias, artifacts or atypical morphologies must not be excluded if they are part of the intended use of the product (campanella2019, ; ianni2020, ; freeman2021, ). Data should be collected at the point in the workflow where the AI solution would be applied, taking into account possible prior quality assurance steps in the workflow.

All data in a test dataset must be collected according to a consistent acquisition protocol (see “Reporting”). The best way to ensure this is to prospectively collect test datasets according to this protocol. Retrospective datasets were typically collected for a different purpose and are thus likely to be subject to selection bias that is difficult to adjust for (talari2020, ). If retrospective data are used in a test dataset, a comprehensive description of the acquisition protocol must be available so that potential issues can be identified (gianfrancesco2018, ).

2.2.1 Annotation

Test datasets for AI solutions contain not only images, but also annotations representing the expected analysis result, e.g., slide-level labels or delineations of tissue regions. In most cases, such reference annotations must be prepared by human observers with sufficient experience in the diagnostic use case. Since humans are prone to intra- and inter-observer variability, annotations in test datasets should be created by multiple observers from different hospitals or laboratories. For unequivocal results, it can be helpful to organize consensus conferences and to use standardized electronic reporting formats (allison2014, ). Any remaining disagreement should be documented with justification (e.g., suboptimal sample quality) and considered when evaluating AI solutions. Semi-automatic annotation methods can help reduce the effort required for manual annotation (gamper2020, ; graham2021, ). However, they can introduce biases themselves and should therefore be monitored by human observers.

2.2.2 Curation

Unsuitable data that does not fit the intended use of an AI solution should not be included in a test dataset. Such data usually must be detected by human observers, e.g., in a dedicated data curation step or during the generation of reference annotations. However, there are automated tools to support this process (janowczyk2019, ). Some approaches identify unsuitable data based on basic image features such as brightness, predominant colors, and sharpness (ameisen2014, ; senaras2018, ) or by detecting typical artifacts like tissue folds and air bubbles (avanaki2016, ; smit2021, ). Other methods analyze domain shifts (stacke2021, ; bozorgtabar2021, ; linmans2020, ) or use dedicated neural networks trained for outlier detection (guha-roy2022, ). There are also approaches for detecting outliers depending on the tested AI solution (calli2019, ; cao2020, ; berger2021, ; stacke2021, ; zhang2021, ). Although these approaches can help exclude unsuitable images from test datasets, they do not yet appear to be mature enough to be used entirely without human supervision.

2.2.3 Synthetic data

There are a variety of techniques for extending datasets with synthetic data. Some techniques alter existing images in a generic (e.g., rotation, mirroring) or histology-specific way (e.g., stain transformations (tellez2019, ) or emulation of image artifacts (wang2021, ; sinha2021, ; schoemig-markiefka2021, ; lehmussola2007, ; ulman2016, ; gadermayr2019, ; moghadam2022, )). Other techniques create fully synthetic images from scratch (niazi2018, ; levine2020, ; quiros2019, ; jose2021, ; deshpande2022, ). These techniques are useful for data augmentation (janowczyk2016, ; serag2019, ; abels2019, ), i.e., enriching development data in order to avoid overfitting and increase robustness. However, they cannot replace original real-world data for test datasets. Because all of these techniques are based on simplified models of real-world variability, they are likely to introduce biases into a test dataset and make meaningful performance measurement impossible.

2.3 Sample size

Any test dataset is a sample from the target population of images, thus any performance metric computed on a test dataset is subject to sampling error. In order to draw reliable conclusions from evaluation results, the sampling error must be sufficiently small. Larger samples generally result in lower sampling error, but are also more expensive to produce. Therefore, the minimum sample size required to achieve a maximum allowable sampling error should be determined prior to data collection.

Many different methods have been proposed for sample size determination. Most of them refer to statistical significance tests which are used to test a prespecified hypothesis about a population parameter (e.g., sensitivity, specificity, ROC-AUC) on the basis of an observed data sample (adcock1997, ; pepe2004, ; flahault2005, ). Such sample size determination methods are commonly used in clinical trial planning and available in many statistical software packages (zhang2021, ).

When evaluating AI solutions in pathology, the goal is more often to estimate a performance metric with a sufficient degree of precision than to test a previously defined hypothesis. Confidence intervals (CIs) are a natural way to express the precision of an estimated metric and should be reported instead of or in addition to test results (bland2009, ). A CI is an interval around the sample statistic that is likely to cover the true population value at some confidence level, usually 95% (hazra2017, ). The sample statistic can either be the performance metric itself or a difference between the performance metrics of two methods, e.g., when comparing performance to an established solution.
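
As an illustration of how a CI for a proportion-type metric such as sensitivity can be computed, the following minimal sketch uses the Wilson score interval. The choice of the Wilson interval and the function name `wilson_interval` are assumptions made for this example; the text above does not prescribe a particular interval method.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a proportion (e.g., sensitivity).

    successes: number of correctly classified cases (e.g., detected positives)
    n:         number of cases the proportion is based on (e.g., all positives)
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 90 of 100 positive test cases were detected.
low, high = wilson_interval(90, 100)
print(f"sensitivity 0.90, 95% CI [{low:.3f}, {high:.3f}]")
```

Note that the interval is based only on the cases that enter the metric (here, the positive cases), not on the size of the whole test dataset.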

When using CIs, the sample size calculation can be based on the targeted width of the CI which is inversely proportional to the precision of the performance estimation (bland2009, ). Several approaches have been proposed for that matter (hanley1982, ; simel1991, ; kelley2003, ; riley2021, ; pavlou2021, ). To determine a minimum sample size, assumptions regarding the sample statistic, its variability, and usually also its distributional form must be made. The open-source software “presize” implements several of these methods and provides a simple web-based user interface to perform CI-based sample size calculations for common performance metrics (haynes2021, ).
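
The idea of a CI-based sample size calculation can be sketched as follows. This is a simplified illustration that assumes the normal (Wald) approximation for a proportion; it is not the method of any of the cited approaches and does not reproduce “presize.” The function names and the anticipated values (expected sensitivity, half-width, prevalence) are hypothetical inputs that would have to be justified for a concrete use case.

```python
from math import ceil
from statistics import NormalDist


def cases_for_proportion(expected: float, half_width: float, confidence: float = 0.95) -> int:
    """Minimum number of cases so that the Wald CI of a proportion
    has at most the given half-width: n >= z^2 * p * (1 - p) / d^2."""
    if not 0 < expected < 1 or half_width <= 0:
        raise ValueError("need 0 < expected < 1 and half_width > 0")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z**2 * expected * (1 - expected) / half_width**2)


def total_cases_for_sensitivity(expected: float, half_width: float,
                                prevalence: float, confidence: float = 0.95) -> int:
    """Sensitivity is estimated on positive cases only, so the total number
    of cases to collect scales with the inverse of the assumed prevalence."""
    if not 0 < prevalence <= 1:
        raise ValueError("need 0 < prevalence <= 1")
    positives = cases_for_proportion(expected, half_width, confidence)
    return ceil(positives / prevalence)


# Hypothetical planning values: expected sensitivity 0.9, CI half-width 0.05.
print(cases_for_proportion(0.9, 0.05))
print(total_cases_for_sensitivity(0.9, 0.05, prevalence=0.25))
```

Under these assumptions, an expected sensitivity of 0.9 and a half-width of 0.05 require 139 positive cases. The second function shows why low prevalence drives up the total number of cases that must be collected.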

2.4 Subsets

AI solutions that are very accurate on average often perform much worse on certain subsets of their target population of images (echle2022, ), a phenomenon known as “hidden stratification.” Such differences in performance can exceed 20% (oakden-rayner2020, ). Hidden stratification occurs particularly in low-prevalence subgroups, but may also occur in subgroups with poor label quality or subtle distinguishing characteristics (oakden-rayner2020, ). There are substantial differences in cancer incidence, e.g., by gender, socioeconomic status, and geographic region (sung2021, ). Hence, hidden stratification may result in disproportionate harm to patients in less common demographic groups and jeopardize the clinical applicability of AI solutions (oakden-rayner2020, ). Common performance measures computed on the entire test dataset can be dominated by larger subsets and do not indicate whether there are subsets for which an AI solution underperforms (saito2015, ).

To detect hidden stratification, AI solutions must be evaluated independently on relevant subsets of the target population of images (e.g., certain medical characteristics, patient demographics, ethnicities, scanning equipment) (oakden-rayner2020, ; echle2022, ). This means in particular that the metadata for identifying the subsets must be available (oala2020, ). Performance evaluation on subsets is an important requirement to obtain clinical approval by the FDA (see “Regulatory requirements”). Accordingly, such subsets should be specifically delineated within test datasets. Each subset needs to be sufficiently large to allow statistically meaningful results (see “Sample size”). It is important to provide information on why and how subsets were collected so that any issues AI solutions may have with specific subsets can be specifically tracked (see “Reporting”). Identifying subsets at risk of hidden stratification is a major challenge and requires extensive knowledge of the use case and the distribution of possible input images (oakden-rayner2020, ). As an aid, potentially relevant subsets can also be detected automatically using unsupervised clustering approaches such as k-means (oakden-rayner2020, ). If a detected cluster underperforms compared to the entire dataset, this may indicate the presence of hidden stratification that needs further examination.
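
A subset-wise evaluation based on metadata could look like the following minimal sketch. The record layout (dictionaries with the hypothetical fields `label`, `prediction`, and a metadata field such as `scanner`), the use of sensitivity as the metric, and the margin for flagging subsets are all assumptions made for illustration.

```python
from collections import defaultdict


def sensitivity_by_subset(records, key):
    """Return {subset: (sensitivity, number_of_positives)} for a metadata field.

    records: iterable of dicts with binary 'label' and 'prediction' (0 or 1)
             plus the metadata field named by `key`.
    """
    counts = defaultdict(lambda: [0, 0])  # subset -> [true positives, positives]
    for record in records:
        if record["label"] == 1:
            entry = counts[record[key]]
            entry[1] += 1
            entry[0] += int(record["prediction"] == 1)
    return {subset: (tp / pos, pos) for subset, (tp, pos) in counts.items()}


def flag_underperforming(per_subset, overall, margin=0.1):
    """Subsets whose sensitivity is more than `margin` below the overall value."""
    return sorted(s for s, (sens, _) in per_subset.items() if overall - sens > margin)


# Hypothetical toy data: two scanner types with different detection rates.
records = (
    [{"label": 1, "prediction": 1, "scanner": "A"}] * 4
    + [{"label": 1, "prediction": 1, "scanner": "B"}]
    + [{"label": 1, "prediction": 0, "scanner": "B"}]
    + [{"label": 0, "prediction": 0, "scanner": "A"}] * 3
)
per_scanner = sensitivity_by_subset(records, "scanner")
positives = [r for r in records if r["label"] == 1]
overall = sum(r["prediction"] for r in positives) / len(positives)
print(per_scanner, flag_underperforming(per_scanner, overall))
```

The number of positives is returned alongside each estimate because a low value in a very small subset may reflect sampling error rather than hidden stratification.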

2.5 Bias detection

Biases can make test datasets unsuitable for evaluating the performance of AI algorithms. Therefore, it is important to identify potential biases and to mitigate them early during data acquisition (maree2017, ). Bias, in this context, refers to sampling bias, i.e., the test dataset is not a randomly drawn sample from the target population of images. Subsets to be evaluated independently may be biased by construction with respect to particular features (e.g., patient age). Here, it is important to ensure that the subgroups do not contain unexpected biases with respect to other features. For example, the prevalence of slide scanners should be independent of patient age, whereas the prevalence of diagnoses may vary by age group.

For features represented as metadata (e.g., patient age, slide scanner, or diagnosis), bias can be detected by comparing the feature distributions in the test dataset and the target population using summary statistics (e.g., via mean and standard deviation) or dedicated fairness metrics (qi2021, ; cabitza2020, ). Detection of bias in an entire test dataset requires a good estimate of the feature distribution of the target population of images. Bias in subgroups can be detected by comparing the subset distribution to the entire dataset. Several toolkits for measuring bias based on metadata have been proposed (saleiro2018, ; bellamy2018, ) and evaluated (lee2021, ).
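
For a categorical metadata feature, comparing a subset distribution to the entire dataset might be sketched as below. The total variation distance used here is one simple choice made for this example; it is not one of the fairness metrics or toolkits cited above, and the scanner names are hypothetical.

```python
from collections import Counter


def proportions(values):
    """Relative frequency of each category in a list of metadata values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no values given")
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two categorical
    distributions: 0 means identical, 1 means completely disjoint."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical example: scanner types in the whole test dataset
# versus in the subset of patients above a certain age.
whole_dataset = ["scanner_A", "scanner_A", "scanner_B", "scanner_B"]
older_patients = ["scanner_A", "scanner_A", "scanner_A", "scanner_B"]
distance = total_variation_distance(proportions(whole_dataset), proportions(older_patients))
print(distance)
```

A clearly non-zero distance for a feature that should be independent of the subset definition, such as the slide scanner in an age-based subset, would be a reason to inspect the data collection more closely.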

Detecting bias in the image data itself is more challenging. Numerous features can be extracted from image data and it is difficult to determine the distribution of these features in the target population of images. Similar to automatic detection of unsuitable data, there are automatic methods to reveal bias in image data. Domain shifts (stacke2021, ) can be detected either by comparing the distributions of basic image features (e.g., contrast) or by more complex image representations learned through specific neural network models (stacke2021, ; guha-roy2022, ; roohi2020, ). Another approach is to train trivial machine learning models with modified images from which obvious predictive information has been removed (e.g., tumor regions): If such models perform better than chance, this indicates bias in the dataset (model2015, ; shamir2008, ).

2.6 Independence

In the development of AI solutions, it is common practice to split a given dataset into two sets, one for development (e.g., a training and a validation set for model selection) and one for testing (lever2016, ). AI methods are prone to exploit spurious correlations in datasets as shortcut opportunities (geirhos2020, ). In this case, the methods perform well on data with similar correlations, but not on the target population. If both development and test datasets are drawn from the same original dataset, they are likely to share spurious correlations, and the performance on the test dataset may overestimate the performance on the target population. Therefore, datasets used for development and testing need to be sufficiently independent. As explained below, it is not sufficient for test datasets to merely contain different images than development datasets (lever2016, ; geirhos2020, ).

To account for memory constraints, histologic whole-slide images (WSIs) are usually divided into small sub-images called “tiles.” AI methods are then applied to each tile individually, and the result for the entire WSI is obtained by aggregating the results of the individual tiles. If tiles are randomly assigned, tiles from the same WSI can end up in both the development and the test datasets, possibly inflating performance results. A substantial number of published research studies are affected by this problem (bussola2021, ). Therefore, to avoid any risk of bias, none of the tiles in a test dataset may originate from the same WSI as the tiles in the development set (bussola2021, ).
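
The following minimal sketch shows how tiles can be assigned at the level of whole slides rather than individually, so that no WSI contributes tiles to both sets. The tile records and the field name `wsi_id` are hypothetical; the same function could be keyed on a patient or site identifier instead.

```python
import random


def split_by_group(tiles, group_key, test_fraction=0.2, seed=0):
    """Split tiles into development and test sets so that all tiles sharing
    the same value of `group_key` (e.g., the WSI ID) end up in the same set."""
    groups = sorted({tile[group_key] for tile in tiles})
    if len(groups) < 2:
        raise ValueError("need at least two groups to split")
    rng = random.Random(seed)  # fixed seed for a reproducible assignment
    rng.shuffle(groups)
    n_test = min(len(groups) - 1, max(1, round(len(groups) * test_fraction)))
    test_groups = set(groups[:n_test])
    development = [t for t in tiles if t[group_key] not in test_groups]
    test = [t for t in tiles if t[group_key] in test_groups]
    return development, test


# Hypothetical example: 5 slides with 4 tiles each.
tiles = [{"wsi_id": f"slide_{s}", "tile": i} for s in range(5) for i in range(4)]
development, test = split_by_group(tiles, "wsi_id")
shared = {t["wsi_id"] for t in development} & {t["wsi_id"] for t in test}
print(len(development), len(test), shared)
```

Shuffling the group identifiers instead of the tiles is the essential design choice: a random assignment at tile level would almost certainly place tiles of the same slide in both sets.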

Datasets can contain site-specific feature distributions (howard2021, ). If these site-specific features are correlated with the outcome of interest, AI methods might use these features for classification rather than the relevant biological features (e.g., tissue morphology) and be unable to generalize to new datasets. A comprehensive evaluation based on multi-site datasets from TCGA showed that including data from one site in development and test datasets often leads to overoptimistic estimates of model accuracy (howard2021, ). This study also found that commonly used color normalization and augmentation methods did not prevent models from learning site-specific features, although stain differences between laboratories appeared to be a primary source of site-specific features. Therefore, the images in development and test datasets must not only originate from different subjects, but should also come from different clinical sites (maleki2020, ; wu2021, ; koenig2007, ).

As described in the Introduction section, a given AI solution should only be evaluated once against a given test dataset (lever2016, ). Datasets published in the context of challenges or studies (many of which are based on TCGA (echle2021, ) and have regional biases (celi2022, )) should generally not be used as test datasets: it cannot be ruled out that they were taken into account in some form during development, e.g., inadvertently or as part of pretraining. Ideally, test datasets should not be published at all and the evaluation should be conducted by an independent body with no conflicts of interest (oala2020, ).

2.7 Reporting

Adequate reporting of test datasets is essential to determine whether a particular dataset is appropriate for a particular AI solution. Detailed metadata on the coverage of various dimensions of variability is required for detecting bias and identifying relevant subsets. Data provenance must be tracked to ensure that test data are sufficiently disjoint from development data (maree2017, ; howard2021, ). Requirements for the test data (ai4h2020del5-4, ) and acquisition protocols (ai4h2020del5-1, ) should also be reported so that further data can be collected later. Accurate reporting of test datasets is important in order to submit evaluation results traceable to the test data for regulatory approval (MDCG2022-2, ).

Various guidelines for reporting clinical research and trials, including diagnostic models, have been published (moons2015, ). Some of these have been adapted specifically for machine learning approaches (liu2020, ; norgeot2020, ) or such adaptation is under development (wiegand2019, ; wenzel2020, ; sounderajah2021, ; collins2021, ). However, only very few guidelines elaborate on data reporting (stevens2020, ), and there is not yet consensus on structured reporting of test datasets, particularly for computational pathology.

Data acquisition protocols should comprehensively describe how and where the test dataset was acquired, handled, processed, and stored (ai4h2020del5-1, ; ai4h2020del5-4, ). This documentation should include precise details of the hardware and software versions used and also cover the creation of reference annotations. Moreover, quality criteria for rejecting data and procedures for handling missing data (stevens2020, ) should be reported, i.e., aspects of what is not in the dataset. Protocols should be defined prior to data acquisition when prospectively collecting test data. Completeness and clarity of the protocols should be verified during data acquisition.

Reported information should characterize the acquired dataset in a useful way. For example, summary statistics allow an initial assessment whether a given dataset is an adequate sample of the target population. Relevant subsets and biases identified in the dataset should be reported as well. Generally, one should collect and report as much information as feasible with the available resources, since retrospectively obtaining missing metadata is hard or impossible. If there will be multiple versions of a dataset, e.g., due to iterative data acquisition or review of reference annotations, versioning is needed. Suitable hashing can guarantee integrity of the entire dataset as well as its individual samples, and identify datasets without disclosing contents.
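
One possible way to realize such hashing is sketched below with SHA-256 from the Python standard library. The choice of hash function, the way per-sample digests are combined into a dataset digest, and all function names are assumptions made for this example, not an established convention.

```python
import hashlib


def read_chunks(path, chunk_size=1 << 20):
    """Yield a file in chunks so that large WSIs need not fit into memory."""
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk


def sample_digest(chunks):
    """SHA-256 digest of one sample, given as an iterable of byte chunks."""
    hasher = hashlib.sha256()
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()


def dataset_digest(sample_digests):
    """Digest of a whole dataset version, computed from {sample_id: digest}.

    Sorting by sample ID makes the result independent of insertion order;
    any added, removed, or modified sample changes the dataset digest.
    """
    hasher = hashlib.sha256()
    for sample_id in sorted(sample_digests):
        hasher.update(sample_id.encode("utf-8") + b"\0")
        hasher.update(sample_digests[sample_id].encode("ascii") + b"\n")
    return hasher.hexdigest()


# Hypothetical example with in-memory stand-ins for image files;
# for real files, pass read_chunks(path) to sample_digest.
digests = {
    "case_001": sample_digest([b"image bytes of case 1"]),
    "case_002": sample_digest([b"image bytes of case 2"]),
}
print(dataset_digest(digests))
```

Because only digests are exchanged, a dataset version can be referenced in a report or compared between parties without disclosing the images themselves.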

2.8 Regulatory requirements

AI solutions in pathology are in vitro diagnostic medical devices (IVDMDs) because they evaluate tissue images for diagnostic purposes outside the human body. Therefore, regulatory approval is required for sale and use in a clinical setting (homeyer2021, ). The U.S. Food and Drug Administration (FDA) and European Union (EU) impose similar requirements to obtain regulatory approval. This includes compliance with certain quality management and documentation standards, a risk analysis, and a comprehensive performance evaluation. The performance evaluation must demonstrate that the method provides accurate and reliable results compared to a gold standard (analytical performance) and that the method provides real benefit in a clinical context (clinical performance). Good test datasets are an essential prerequisite for a meaningful evaluation of analytical performance.

2.8.1 EU + UK

In the EU and UK, IVDMDs are regulated by the In vitro Diagnostic Device Regulation (IVDR, formally “Regulation 2017/746”) (eu2017, ). After a transition period, compliance with the IVDR will be mandatory for novel routine pathology diagnostics as of May 26, 2022. The IVDR does not impose specific requirements on test datasets used in the analytical performance evaluation. However, the EU has put forward a proposal for an EU-wide regulation on harmonized rules for the assessment of AI (eu2021proposal, ).

The EU proposal (eu2021proposal, ) considers AI-based IVDMDs as “high-risk AI systems” (preamble (30)). For test datasets used in the evaluation of such systems, the proposal imposes certain quality criteria: test datasets must be “relevant, representative, free of errors and complete” and “have the appropriate statistical properties” (Article 10.3). Likewise, it requires test datasets to be subject to “appropriate data governance and management practices” (preamble (44)) with regard to design choices, suitability assessment, data collection, and identification of shortcomings.

2.8.2 USA

In the US, IVDMDs are regulated in Title 21 of the Code of Federal Regulations (CFR), Part 809 (ecfr21ivdr, ). Just like the IVDR, the CFR does not impose specific requirements on test datasets used in the analytical performance evaluation. However, the CFR states that products should be accompanied by labeling stating specific performance characteristics (e.g., accuracy, precision, specificity, and sensitivity) related to normal and abnormal populations of biological specimens.

In 2021, the FDA approved the first AI software for pathology (fda2021prostate, ). In this context, the FDA has established a definition and requirements for approval of generic AI software for pathology, formally referred to as “software algorithm devices to assist users in digital pathology” (fda2021paigeprostateapproval, ).

Test datasets used in analytical performance studies are expected to contain an “appropriate” number of images. To be “representative of the entire spectrum of challenging cases” (3.ii.A. and B. of source (fda2021paigeprostateapproval, )) that can occur when the product is used as intended, test datasets should cover multiple operators, slide scanners, and clinical sites and contain “clinical specimens with defined, clinically relevant, and challenging characteristics” (3.ii.B. of source (fda2021paigeprostateapproval, )). In particular, test datasets should be stratified into relevant subsets (e.g., by medical characteristics, patient demographics, scanning equipment) to allow separate determination of performance for each subset. Case cohorts considered in clinical performance studies (e.g., evaluating unassisted and software-assisted evaluation of pathology slides with intended users) are expected to adhere to similar specifications.
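
To illustrate what a separate determination of performance per subset can look like in practice, the following sketch computes sensitivity and specificity for each stratum of a binary classification task. The function name, the tuple layout, and the example cases are hypothetical and not taken from the FDA document.

```python
from collections import defaultdict


def stratified_sensitivity_specificity(cases):
    """Compute sensitivity and specificity separately for each subset.

    `cases` is an iterable of (subset, reference, prediction) tuples, where
    reference and prediction are booleans (True = positive finding).
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for subset, reference, prediction in cases:
        if reference:
            key = "tp" if prediction else "fn"
        else:
            key = "fp" if prediction else "tn"
        counts[subset][key] += 1

    results = {}
    for subset, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        results[subset] = {
            "n": positives + negatives,
            # None marks a metric that is undefined for this subset.
            "sensitivity": c["tp"] / positives if positives else None,
            "specificity": c["tn"] / negatives if negatives else None,
        }
    return results


# Hypothetical evaluation results, stratified by scanner type.
cases = [
    ("scanner_A", True, True), ("scanner_A", True, True),
    ("scanner_A", False, False), ("scanner_A", False, True),
    ("scanner_B", True, False), ("scanner_B", True, True),
    ("scanner_B", False, False), ("scanner_B", False, False),
]
for subset, metrics in stratified_sensitivity_specificity(cases).items():
    print(subset, metrics)
```

Reporting the subset size `n` alongside each metric makes it visible when a stratum is too small for its estimate to be meaningful.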

Product labeling according to CFR 809 was also defined in more detail. In addition to the general characteristics of the dataset (e.g., origin of images, annotation procedures, subsets, …), limitations of the dataset (e.g., poor image quality or insufficient sampling of certain subsets) that may cause the software to fail or operate unexpectedly should be specified.

In summary, the requirements for test datasets are considerably more specific in the US than in the EU. However, none of the regulations clearly specifies how the respective requirements can be achieved or verified.

3 Discussion

Our recommendations for compiling test datasets are summarized in Figure 4. They are intended to help AI developers demonstrate the robustness and practicality of their solutions to regulatory agencies and end users. Likewise, the advice can be used to check whether test datasets used in the evaluation of AI solutions were appropriate and reported performance measures are meaningful. Much of the advice can also be transferred both to image analysis solutions without AI and to similar domains where solutions are applied to medical images, such as radiology or ophthalmology.

Figure 4: Overview of recommendations to be considered during different phases of collecting test datasets

A key finding of the work is that it remains challenging to collect test datasets and that there are still many unanswered questions. The current regulatory requirements remain vague and do not specify in detail important aspects such as the required diversity of test datasets or the required confidence in measured performance metrics. The main challenge is that the target population of images is elusive, i.e., it cannot be formally specified but only roughly described. This makes it difficult to determine whether a dataset is representative, i.e., whether the many dimensions of variability are covered sufficiently, and whether the sample distribution corresponds to real-world data. Without a clear measure of representativity, it is also impossible to determine whether a test dataset is large enough to enable assessment of performance metrics with a maximum sampling error.
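
For orientation, the relationship between sample size and sampling error can be sketched for the idealized case in which the test dataset is a simple random sample of the target population, which is exactly the condition that is hard to establish here. The sketch uses the textbook normal approximation for a proportion such as sensitivity, n = z² p (1 − p) / E²; the function name and the example values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def required_sample_size(expected_proportion, margin_of_error, confidence=0.95):
    """Smallest n for which the normal-approximation confidence interval of a
    proportion has a half-width of at most `margin_of_error`.

    Assumes independent cases drawn at random from the target population.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = expected_proportion
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)


# Expected sensitivity of 0.90, to be estimated within +/- 0.05 at 95% confidence.
print(required_sample_size(0.90, 0.05))  # 139
# Worst case when nothing is known about the proportion (p = 0.5).
print(required_sample_size(0.50, 0.05))  # 385
```

Note that for sensitivity the result refers to the number of positive cases, not to the total number of images, and that the required number applies to each subset for which performance is to be reported separately.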

For regulatory approval, a plausible justification is needed for why the test dataset used was adequate. Besides following the advice in this paper, it can also be helpful to refer to published studies in which AI solutions have been comprehensively evaluated. Additional guidance can be found in the summary documents of approved AI solutions published by the FDA, which include information on their evaluation (wu2021, ). It turns out that many of the AI devices approved by the FDA were evaluated only at a small number of sites (wu2021, ) with limited geographic diversity (kaushal2020, ). Test sets used in current studies typically involved thousands of slides, hundreds of patients, fewer than five sites, and fewer than five scanner types (perincheri2021, ; ianni2020, ; bulten2022, ; dudgeon2021, ).

Today, AI solutions in pathology may not be used for primary diagnosis, but only in conjunction with a standard evaluation by the pathologist (fda2021paigeprostateapproval, ). Therefore, compared to a fully automated usage scenario, requirements for robustness are considerably lower. This also applies to the expected confidence in the performance measurement and the scope of the test dataset used. In a supervised usage scenario, the accuracy of an AI solution determines how often the user needs to intervene to correct results, and thus its practical usefulness. End users are interested in the most meaningful evaluation of the accuracy of AI solutions to assess their practical utility. Therefore, a comprehensive evaluation of the real-world performance of a product, taking into account the advice given in this paper, can be an important marketing tool.

3.1 Limitations and outlook

Some aspects of compiling test datasets were not considered in this article. One aspect is how to collaborate with data donors, i.e., how to incentivize or compensate them for donating data. Other aspects include the choice of software tools and data formats for the collection and storage of datasets and the question of how the use of test datasets should be regulated. These aspects must be clarified individually for each use case and the AI solution to be tested. Furthermore, we do not elaborate on legal aspects of collecting test datasets, e.g., obtaining consent from patients, privacy regulations, licensing, and liability. For more details on these topics, we refer to other works (rodrigues2020, ). This paper focuses exclusively on the compilation of test datasets. For advice on other issues related to validating AI solutions in pathology, such as how to select an appropriate performance metric, how to make algorithmic results interpretable, or how to conduct a clinical performance evaluation with end users, we also refer to other works (maleki2020, ; kelly2019, ; oala2020, ; park2021, ; dehond2022, ; evans2022, ).

For AI solutions to operate with less user intervention and to better support diagnostic workflows, real-world performance must be assessed more accurately than is currently possible. The key to accurate performance measures is the representativeness of the test dataset. Therefore, future work should focus on better characterizing the target population of images and how to collect more representative samples. Empirical studies should be conducted on how different levels of coverage of the variability dimensions (e.g., laboratories, scanner types) affect the quality of performance evaluation for common use cases in computational pathology.

In addition, clear criteria should be developed to delineate the target population from unsuitable data. Currently, the assessment of the suitability of data is typically done by humans, which might introduce subjective bias. Automated methods can help to make the assessment of suitability more objective (see “Curation”) and should therefore be further explored. However, such automated methods must be validated on dedicated test datasets themselves.

Another open challenge is how to deal with changes in the target population of images. Since the intended use for a particular product is fixed, in theory the requirements for the test datasets should also be fixed. However, the target distribution of images is influenced by several factors that change over time. These include technological advances in specimen and image acquisition, distribution of scanner systems used, and shifting patient populations (finlayson2021, ; kelly2019, ). As part of post-market surveillance, AI solutions must be continuously monitored during their entire lifecycle (MDCG2022-2, ). Clear processes are required for identifying changes in the target population of images and adapting performance estimates accordingly.
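
One simple way to make such changes visible is to compare the distribution of a recorded metadata attribute, e.g., scanner type, between the test dataset and recently processed routine cases. The sketch below uses the total variation distance for this purpose; the choice of measure, the alert threshold, and the example data are assumptions for illustration and not an established monitoring procedure.

```python
from collections import Counter


def total_variation_distance(reference, current):
    """Total variation distance between two non-empty categorical samples
    (0 = identical category frequencies, 1 = no category in common)."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    categories = set(ref_counts) | set(cur_counts)
    return 0.5 * sum(
        abs(ref_counts[c] / len(reference) - cur_counts[c] / len(current))
        for c in categories
    )


# Hypothetical scanner types recorded for the test dataset and for recent routine cases.
test_dataset = ["scanner_A"] * 60 + ["scanner_B"] * 40
recent_cases = ["scanner_A"] * 30 + ["scanner_B"] * 40 + ["scanner_C"] * 30

ALERT_THRESHOLD = 0.2  # assumed value; would need to be justified per use case
distance = total_variation_distance(test_dataset, recent_cases)
print(f"{distance:.2f}", distance > ALERT_THRESHOLD)  # 0.30 True
```

A check of this kind only covers attributes that were recorded as metadata, which is one more reason to document the dimensions of variability as completely as possible when the test dataset is compiled.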

4 Conclusions

Appropriate test datasets are essential for meaningful evaluation of the performance of AI solutions. The recommendations provided in this article are intended to help demonstrate the utility of AI solutions in pathology and to assess the validity of performance studies. The key remaining challenge is the vast variability of images in computational pathology. Further research is needed on how to formalize criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.


A.H., C.G., L.O.S., F.Z., T.E., K.S., A.K., C.O.R., T.S., R.C., P.B., P.H., and N.Z. were supported by the German Federal Ministry for Economic Affairs and Climate Action via the EMPAIA project (grant numbers 01MK20002A, 01MK20002B, 01MK20002C, 01MK20002E). M.Ka., M.P., and H.M. received funding from the Austrian Science Fund (FWF), Project P-32554 (Explainable Artificial Intelligence), the Austrian Research Promotion Agency (FFG) under grant agreement No. 879881 (EMPAIA), and the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 857122 (CY-Biobank). P.S. was funded by the Helmholtz Association’s Initiative and Networking Fund through Helmholtz AI. R.B. was supported by the START Program of the Faculty of Medicine of the RWTH Aachen University (Grant No. 148/21). P.B. was also supported by the German Research Foundation (DFG, Project IDs 322900939, 454024652, 432698239, and 445703531), the European Research Council (ERC, CoG AIM.imaging.CKD No. 101001791), and the German Federal Ministries of Health (Deep Liver, No. ZMVI1-2520DAT111) and of Education and Research (STOP-FSGS-01GM1901A). The funders had no role in the committee work, discussions, literature research, decision to publish, or preparation of the manuscript.

Author Contributions

A.H. and C.G. organized the committee work. A.H., C.G., and L.O.S. conceived the manuscript. A.H., C.G., L.O.S., F.Z., T.E., K.S., M.W., and R.D.B. wrote the manuscript. All authors participated in the committee work and contributed to the literature review. The final version of the paper was reviewed and approved by all authors.

Competing interests

F.Z. is a shareholder of asgen GmbH. P.S. is a member of the supervisory board of asgen GmbH. All other authors declare that they have no conflict of interest.


  • (1) A. Serag, A. Ion-Margineanu, H. Qureshi, R. McMillan, M.-J. S. Martin, J. Diamond, P. O’Reilly, P. Hamilton, Translational AI and deep learning in diagnostic pathology, Frontiers in Medicine 6 (2019).
  • (2) E. Abels, L. Pantanowitz, F. Aeffner, M. D. Zarella, J. Laak, M. M. Bui, V. N. P. Vemuri, A. V. Parwani, J. Gibbs, E. Agosto-Arroyo, A. H. Beck, C. Kozlowski, Computational pathology definitions, best practices, and recommendations for regulatory guidance: a white paper from the digital pathology association, The Journal of Pathology 249 (3) (2019) 286–294. doi:10.1002/path.5331.
  • (3) B. Moxley-Wyles, R. Colling, C. Verrill, Artificial intelligence in pathology: an overview, Diagnostic Histopathology 26 (11) (2020) 513–520. doi:10.1016/j.mpdhp.2020.08.004.
  • (4) A. Echle, N. T. Rindtorff, T. J. Brinker, T. Luedde, A. T. Pearson, J. N. Kather, Deep learning in cancer pathology: a new generation of clinical biomarkers, British Journal of Cancer 124 (4) (2021) 686–696. doi:10.1038/s41416-020-01122-x.
  • (5) N. Coudray, P. S. Ocampo, T. Sakellaropoulos, N. Narula, M. Snuderl, D. Fenyö, A. L. Moreira, N. Razavian, A. Tsirigos, Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning, Nature Medicine 24 (10) (2018) 1559–1567. doi:10.1038/s41591-018-0177-5.
  • (6) X. Wang, H. Chen, C. Gan, H. Lin, Q. Dou, E. Tsougenis, Q. Huang, M. Cai, P.-A. Heng, Weakly supervised deep learning for whole slide lung cancer image analysis, IEEE Transactions on Cybernetics 50 (9) (2020) 3950–3962. doi:10.1109/tcyb.2019.2935141.
  • (7) O. Iizuka, F. Kanavati, K. Kato, M. Rambeau, K. Arihiro, M. Tsuneki, Deep learning models for histopathological classification of gastric and colonic epithelial tumours, Scientific Reports 10 (1) (2020). doi:10.1038/s41598-020-58467-9.
  • (8) A. Cruz-Roa, H. Gilmore, A. Basavanhally, M. Feldman, S. Ganesan, N. N. Shih, J. Tomaszewski, F. A. González, A. Madabhushi, Accurate and reproducible invasive breast cancer detection in whole-slide images: A deep learning approach for quantifying tumor extent, Scientific Reports 7 (1) (2017). doi:10.1038/srep46450.
  • (9) G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. W. Krauss Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, T. J. Fuchs, Clinical-grade computational pathology using weakly supervised deep learning on whole slide images, Nature Medicine 25 (8) (2019) 1301–1309. doi:10.1038/s41591-019-0508-1.
  • (10) J. Kers, R. D. Bülow, B. M. Klinkhammer, G. E. Breimer, F. Fontana, A. A. Abiola, R. Hofstraat, G. L. Corthals, H. Peters-Sengers, S. Djudjaj, S. von Stillfried, D. L. Hölscher, P. T. T., A. D. van Zuilen, F. J. Bemelman, A. S. Nurmohamed, M. Naesens, J. J. T. H. Roelofs, S. Florquin, J. Floege, T. Q. Nguyen, J. N. Kather, P. Boor, Deep learning-based classification of kidney transplant pathology: a retrospective, multicentre, proof-of-concept study, The Lancet Digital Health 4 (1) (2022) e18–e26. doi:10.1016/s2589-7500(21)00211-9.
  • (11) O.-J. Skrede, S. De Raedt, A. Kleppe, T. S. Hveem, K. Liestøl, J. Maddison, H. A. Askautrud, M. Pradhan, J. A. Nesheim, F. Albregtsen, I. N. Farstad, E. Domingo, D. N. Church, A. Nesbakken, N. A. Shepherd, I. Tomlinson, R. Kerr, M. Novelli, D. J. Kerr, H. E. Danielsen, Deep learning for prediction of colorectal cancer outcome: a discovery and validation study, The Lancet 395 (10221) (2020) 350–360. doi:10.1016/s0140-6736(19)32998-8.
  • (12) C. Saillard, B. Schmauch, O. Laifa, M. Moarii, S. Toldo, M. Zaslavskiy, E. Pronier, A. Laurent, G. Amaddeo, H. Regnault, D. Sommacale, M. Ziol, J.-M. Pawlotsky, S. Mulé, A. Luciani, G. Wainrib, T. Clozel, P. Courtiol, J. Calderaro, Predicting survival after hepatocellular carcinoma resection using deep learning on histological slides, Hepatology 72 (6) (2020) 2000–2013. doi:10.1002/hep.31207.
  • (13) J. N. Kather, A. T. Pearson, N. Halama, D. Jäger, J. Krause, S. H. Loosen, A. Marx, P. Boor, F. Tacke, U. P. Neumann, H. I. Grabsch, T. Yoshikawa, H. Brenner, J. Chang-Claude, M. Hoffmeister, C. Trautwein, T. Luedde, Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer, Nature Medicine 25 (7) (2019) 1054–1056. doi:10.1038/s41591-019-0462-y.
  • (14) H. D. Couture, L. A. Williams, J. Geradts, S. J. Nyante, E. N. Butler, J. S. Marron, C. M. Perou, M. A. Troester, M. Niethammer, Image analysis with deep learning to predict breast cancer grade, ER status, histologic subtype, and intrinsic subtype, npj Breast Cancer 4 (1) (2018). doi:10.1038/s41523-018-0079-1.
  • (15) H. Höfener, A. Homeyer, N. Weiss, J. Molin, C. F. Lundström, H. K. Hahn, Deep learning nuclei detection: A simple approach can deliver state-of-the-art results, Computerized Medical Imaging and Graphics 70 (2018) 43–52. doi:10.1016/j.compmedimag.2018.08.010.
  • (16) M. C. Balkenhol, F. Ciompi, Ż. Świderska-Chadaj, R. van de Loo, M. Intezar, I. Otte-Höller, D. Geijs, J. Lotz, N. Weiss, T. de Bel, G. Litjens, P. Bult, J. A. W. M. van der Laak, Optimized tumour infiltrating lymphocyte assessment for triple negative breast cancer prognostics, The Breast 56 (2021) 78–87. doi:10.1016/j.breast.2021.02.007.
  • (17) J. Lever, M. Krzywinski, N. Altman, Model selection and overfitting, Nature Methods 13 (9) (2016) 703–704. doi:10.1038/nmeth.3968.
  • (18) M. Strathern, ‘improving ratings’: Audit in the British university system, European Review 5 (3) (1997) 305–321. doi:10.1002/(sici)1234-981x(199707)5:3<305::aid-euro184>;2-4.
  • (19) R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, F. A. Wichmann, Shortcut learning in deep neural networks, Nature Machine Intelligence 2 (11) (2020) 665–673. doi:10.1038/s42256-020-00257-z.
  • (20) M. Schmitt, R. C. Maron, A. Hekler, A. Stenzinger, A. Hauschild, M. Weichenthal, M. Tiemann, D. Krahl, H. Kutzner, J. S. Utikal, S. Haferkamp, J. N. Kather, F. Klauschen, E. Krieghoff-Henning, S. Fröhling, C. von Kalle, T. J. Brinker, Hidden variables in deep learning digital pathology and their potential to cause batch effects: Prediction model study, Journal of Medical Internet Research 23 (2) (2021) e23436. doi:10.2196/23436.
  • (21) D. Wallis, I. Buvat, Clever Hans effect found in a widely used brain tumour MRI dataset, Medical Image Analysis 77 (2022) 102368. doi:10.1016/
  • (22) L. Oakden-Rayner, J. Dunnmon, G. Carneiro, C. Re, Hidden stratification causes clinically meaningful failures in machine learning for medical imaging, in: Proceedings of the ACM Conference on Health, Inference, and Learning, 2020, pp. 151–159. doi:10.1145/3368555.3384468.
  • (23) K. Nagpal, D. Foote, Y. Liu, P.-H. C. Chen, E. Wulczyn, F. Tan, N. Olson, J. L. Smith, A. Mohtashamian, J. H. Wren, G. S. Corrado, R. MacDonald, L. H. Peng, M. B. Amin, A. J. Evans, A. R. Sangoi, C. H. Mermel, J. D. Hipp, M. C. Stumpe, Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer, npj Digital Medicine 2 (1) (2019). doi:10.1038/s41746-019-0112-2.
  • (24) H. Tang, N. Sun, S. Shen, Improving generalization of deep learning models for diagnostic pathology by increasing variability in training data: Experiments on osteosarcoma subtypes, Journal of Pathology Informatics 12 (1) (2021) 30. doi:10.4103/jpi.jpi_78_20.
  • (25) E. Vali-Betts, K. J. Krause, A. Dubrovsky, K. Olson, J. P. Graff, A. Mitra, A. Datta-Mitra, K. Beck, A. Tsirigos, C. Loomis, A. G. Neto, E. Adler, Effects of image quantity and image source variation on machine learning histology differential diagnosis models, Journal of Pathology Informatics 12 (1) (2021) 5. doi:10.4103/jpi.jpi_69_20.
  • (26) D. Tellez, G. Litjens, P. Bándi, W. Bulten, J.-M. Bokhorst, F. Ciompi, J. van der Laak, Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology, Medical Image Analysis 58 (2019) 101544.
  • (27) A. Anghel, M. Stanisavljevic, S. Andani, N. Papandreou, J. H. Rüschoff, P. Wild, M. Gabrani, H. Pozidis, A high-performance system for robust stain normalization of whole-slide images in histopathology, Frontiers in Medicine 6 (2019). doi:10.3389/fmed.2019.00193.
  • (28) R. Marée, The need for careful data collection for pattern recognition in digital pathology, Journal of Pathology Informatics 8 (1) (2017) 19.
  • (29) F. M. Howard, J. Dolezal, S. Kochanny, J. Schulte, H. Chen, L. Heij, D. Huo, R. Nanda, O. I. Olopade, J. N. Kather, N. Cipriani, R. L. Grossman, A. T. Pearson, The impact of site-specific digital histology signatures on deep learning model accuracy and bias, Nature Communications 12 (1) (2021). doi:10.1038/s41467-021-24698-1.
  • (30) L. Oala, J. Fehr, L. Gilli, P. Balachandran, A. W. Leite, S. Calderon-Ramirez, D. X. Li, G. Nobis, E. A. M. Alvarado, G. Jaramillo-Gutierrez, C. Matek, A. Shroff, F. Kherif, B. Sanguinetti, T. Wiegand, ML4H auditing: From paper to practice, in: Proceedings of the Machine Learning for Health NeurIPS Workshop, Vol. 136 of Proceedings in Machine Learning Research, 2020, pp. 280–317.
  • (31) F. Maleki, N. Muthukrishnan, K. Ovens, C. Reinhold, R. Forghani, Machine learning algorithm validation, Neuroimaging Clinics of North America 30 (4) (2020) 433–445. doi:10.1016/j.nic.2020.08.004.
  • (32) F. Cabitza, A. Campagner, F. Soares, L. García de Guadiana-Romualdo, F. Challa, A. Sulejmani, M. Seghezzi, A. Carobene, The importance of being external. Methodological insights for the external validation of machine learning models in medicine, Computer Methods and Programs in Biomedicine 208 (2021) 106288. doi:10.1016/j.cmpb.2021.106288.
  • (33) S. H. Park, J. Choi, J.-S. Byeon, Key principles of clinical validation, device approval, and insurance coverage decisions of artificial intelligence, Korean Journal of Radiology 22 (3) (2021) 442. doi:10.3348/kjr.2021.0048.
  • (34) A. A. H. de Hond, A. M. Leeuwenberg, L. Hooft, I. M. J. Kant, S. W. J. Nijman, H. J. A. van Os, J. J. Aardoom, T. P. A. Debray, E. Schuit, M. van Smeden, J. B. Reitsma, E. W. Steyerberg, N. H. Chavannes, K. G. M. Moons, Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review, npj Digital Medicine 5 (1) (2022). doi:10.1038/s41746-021-00549-7.
  • (35) P. Hufnagl, EMPAIA – Ökosystem zur Nutzung von KI in der Pathologie, Der Pathologe 42 (S2) (2021) 135–141. doi:10.1007/s00292-021-01029-1.
  • (36) Y. Chen, J. Zee, A. Smith, C. Jayapandian, J. Hodgin, D. Howell, M. Palmer, D. Thomas, C. Cassol, A. B. Farris, K. Perkinson, A. Madabhushi, L. Barisoni, A. Janowczyk, Assessment of a computerized quantitative quality control tool for whole slide images of kidney biopsies, The Journal of Pathology 253 (3) (2021) 268–278. doi:10.1002/path.5590.
  • (37) A. R. N. Avanaki, K. S. Espig, A. Xthona, C. Lanciault, T. R. L. Kimpe, Automatic image quality assessment for digital pathology, in: Breast Imaging, Springer International Publishing, 2016, pp. 431–438. doi:10.1007/978-3-319-41546-8_54.
  • (38) B. Schömig-Markiefka, A. Pryalukhin, W. Hulla, A. Bychkov, J. Fukuoka, A. Madabhushi, V. Achter, L. Nieroda, R. Büttner, A. Quaas, Y. Tolkach, Quality control stress test for deep learning-based diagnostic model in digital pathology, Modern Pathology 34 (12) (2021) 2098–2108. doi:10.1038/s41379-021-00859-x.
  • (39) C. M. Focke, H. Bürger, P. J. van Diest, K. Finsterbusch, D. Gläser, E. Korsching, T. Decker, M. Anders, R. Bollmann, F. Eiting, K. Friedrich, J.-O. Habeck, G. Haroske, B. Hinrichs, A. Behrens, U. Krause, U. Lang, J. Lorenzen, N. Minew, M. Mlynek-Kersjes, H. Nenning, J. Packeisen, F. P.-d. Vos, S. Reyher-Klein, D. Rothacker, M. Schultz, U. Sturm, M. Tawfik, K.-H. Berghäuser, W. Böcker, G. Cserni, S. Habedank, S. Lax, F. Moinfar, P. Regitnig, A. Reiner-Concin, J. Rüschoff, Z. Varga, J. Woziwodski, Interlaboratory variability of Ki67 staining in breast cancer, European Journal of Cancer 84 (2017) 219–227. doi:10.1016/j.ejca.2017.07.041.
  • (40) S. Taqi, S. Sami, L. Sami, S. Zaki, A review of artifacts in histopathology, Journal of Oral and Maxillofacial Pathology 22 (2) (2018) 279. doi:10.4103/jomfp.jomfp_125_15.
  • (41) S. Chatterjee, Artefacts in histopathology, Journal of Oral and Maxillofacial Pathology 18 (4) (2014) 111. doi:10.4103/0973-029x.141346.
  • (42) S. Ramón y Cajal, M. Sesé, C. Capdevila, T. Aasen, L. De Mattos-Arruda, S. J. Diaz-Cano, J. Hernández-Losa, J. Castellví, Clinical implications of intratumor heterogeneity: challenges and opportunities, Journal of Molecular Medicine 98 (2) (2020) 161–177. doi:10.1007/s00109-020-01874-2.
  • (43) D. Pursnani, S. Arora, P. Katyayani, A. C, B. R. Yelikar, Inking in surgical pathology: does the method matter? A procedural analysis of a spectrum of colours, Turkish Journal of Pathology (2016). doi:10.5146/tjpath.2015.01351.
  • (44) I. Dagogo-Jack, A. T. Shaw, Tumour heterogeneity and resistance to cancer therapies, Nature Reviews Clinical Oncology 15 (2) (2017) 81–94. doi:10.1038/nrclinonc.2017.166.
  • (45) K. H. Allison, L. M. Reisch, P. A. Carney, D. L. Weaver, S. J. Schnitt, F. P. O’Malley, B. M. Geller, J. G. Elmore, Understanding diagnostic variability in breast pathology: lessons learned from an expert consensus review panel, Histopathology 65 (2) (2014) 240–251. doi:10.1111/his.12387.
  • (46) A. M. El-Badry, S. Breitenstein, W. Jochum, K. Washington, V. Paradis, L. Rubbia-Brandt, M. A. Puhan, K. Slankamenac, R. Graf, P.-A. Clavien, Assessment of hepatic steatosis by expert pathologists, Annals of Surgery 250 (5) (2009) 691–697. doi:10.1097/sla.0b013e3181bcd6dd.
  • (47) A. E. Martinez, L. Lin, C. H. Dunphy, Grading of follicular lymphoma: Comparison of routine histology with immunohistochemistry, Archives of Pathology & Laboratory Medicine 131 (7) (2007) 1084–1088. doi:10.5858/2007-131-1084-goflco.
  • (48) O. Kujan, A. Khattab, R. J. Oliver, S. A. Roberts, N. Thakker, P. Sloan, Why oral histopathology suffers inter-observer variability on grading oral epithelial dysplasia: An attempt to understand the sources of variation, Oral Oncology 43 (3) (2007) 224–231. doi:10.1016/j.oraloncology.2006.03.009.
  • (49) P. Boiesen, P.-O. Bendahl, L. Anagnostaki, H. Domanski, E. Holm, I. Idvall, S. Johansson, O. Ljungberg, A. Ringberg, G. Östberg, M. Fernö, Histologic grading in breast cancer: Reproducibility between seven pathologic departments, Acta Oncologica 39 (1) (2000) 41–45. doi:10.1080/028418600430950.
  • (50) L. Oni, M. W. Beresford, D. Witte, A. Chatzitolios, N. Sebire, K. Abulaban, R. Shukla, J. Ying, H. I. Brunner, Inter-observer variability of the histological classification of lupus glomerulonephritis in children, Lupus 26 (11) (2017) 1205–1211. doi:10.1177/0961203317706558.
  • (51) P. N. Furness, N. Taub, K. J. M. Assmann, G. Banfi, J.-P. Cosyns, A. M. Dorman, C. M. Hill, S. K. Kapper, R. Waldherr, A. Laurinavicius, N. Marcussen, A. P. Martins, M. Nogueira, H. Regele, D. Seron, M. Carrera, S. Sund, E. I. Taskinen, T. Paavonen, T. Tihomirova, R. Rosenthal, International variation in histologic grading is large, and persistent feedback does not improve reproducibility, The American Journal of Surgical Pathology 27 (6) (2003) 805–810. doi:10.1097/00000478-200306000-00012.
  • (52) H. R. Tizhoosh, P. Diamandis, C. J. Campbell, A. Safarpoor, S. Kalra, D. Maleki, A. Riasatian, M. Babaie, Searching images for consensus, The American Journal of Pathology 191 (10) (2021) 1702–1708. doi:10.1016/j.ajpath.2021.01.015.
  • (53) A. Homeyer, P. Nasr, C. Engel, S. Kechagias, P. Lundberg, M. Ekstedt, H. Kost, N. Weiss, T. Palmer, H. K. Hahn, D. Treanor, C. Lundström, Automated quantification of steatosis: agreement with stereological point counting, Diagnostic Pathology 12 (1) (2017). doi:10.1186/s13000-017-0671-y.
  • (54) S. Perincheri, A. W. Levi, R. Celli, P. Gershkovich, D. Rimm, J. S. Morrow, B. Rothrock, P. Raciti, D. Klimstra, J. Sinard, An independent assessment of an artificial intelligence system for prostate cancer detection shows strong diagnostic accuracy, Modern Pathology 34 (8) (2021) 1588–1595. doi:10.1038/s41379-021-00794-x.
  • (55) L. M. da Silva, E. M. Pereira, P. G. O. Salles, R. Godrich, R. Ceballos, J. D. Kunz, A. Casson, J. Viret, S. Chandarlapaty, C. G. Ferreira, B. Ferrari, B. Rothrock, P. Raciti, V. Reuter, B. Dogdas, G. DeMuth, J. Sue, C. Kanan, L. Grady, T. J. Fuchs, J. S. Reis-Filho, Independent real-world application of a clinical-grade automated prostate cancer detection system, The Journal of Pathology 254 (2) (2021) 147–158. doi:10.1002/path.5662.
  • (56) E. Arvaniti, K. Fricker, M. Moret, N. Rupp, T. Hermanns, C. Fankhauser, N. Wey, P. Wild, J. H. Rüschoff, M. Claassen, Replication data for: Automated Gleason grading of prostate cancer tissue microarrays via deep learning, Harvard Dataverse (2018). doi:10.7910/DVN/OCYCMP.
  • (57) Creative Commons, CC0 1.0 Universal (CC0 1.0) public domain dedication (2022).
  • (58) J. D. Ianni, R. E. Soans, S. Sankarapandian, R. V. Chamarthi, D. Ayyagari, T. G. Olsen, M. J. Bonham, C. C. Stavish, K. Motaparthi, C. J. Cockerell, T. A. Feeser, J. B. Lee, Tailored for real-world: A whole slide image classification system validated on uncurated multi-site data emulating the prospective pathology workload, Scientific Reports 10 (1) (2020). doi:10.1038/s41598-020-59985-2.
  • (59) K. Freeman, J. Geppert, C. Stinton, D. Todkill, S. Johnson, A. Clarke, S. Taylor-Phillips, Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy, BMJ (2021) n1872. doi:10.1136/bmj.n1872.
  • (60) K. Talari, M. Goyal, Retrospective studies – utility and caveats, Journal of the Royal College of Physicians of Edinburgh 50 (4) (2020) 398–402. doi:10.4997/jrcpe.2020.409.
  • (61) M. A. Gianfrancesco, S. Tamang, J. Yazdany, G. Schmajuk, Potential biases in machine learning algorithms using electronic health record data, JAMA Internal Medicine 178 (11) (2018) 1544. doi:10.1001/jamainternmed.2018.3763.
  • (62) J. Gamper, N. A. Koohbanani, K. Benes, S. Graham, M. Jahanifar, S. A. Khurram, A. Azam, K. Hewitt, N. Rajpoot, PanNuke dataset extension, insights and baselines, arXiv:2003.10778 [q-bio.QM] (2020). doi:10.48550/ARXIV.2003.10778.
  • (63) S. Graham, M. Jahanifar, A. Azam, M. Nimir, Y.-W. Tsang, K. Dodd, E. Hero, H. Sahota, A. Tank, K. Benes, N. Wahab, F. Minhas, S. E. A. Raza, H. E. Daly, K. Gopalakrishnan, D. Snead, N. Rajpoot, Lizard: A large-scale dataset for colonic nuclear instance segmentation and classification, arXiv:2108.11195 [cs.LG] (2021). doi:10.48550/ARXIV.2108.11195.
  • (64) A. Janowczyk, R. Zuo, H. Gilmore, M. Feldman, A. Madabhushi, HistoQC: An open-source quality control tool for digital pathology slides, JCO Clinical Cancer Informatics 3 (2019) 1–7. doi:10.1200/cci.18.00157.
  • (65) D. Ameisen, C. Deroulers, V. Perrier, F. Bouhidel, M. Battistella, L. Legrès, A. Janin, P. Bertheau, J.-B. Yunès, Towards better digital pathology workflows: programming libraries for high-speed sharpness assessment of whole slide images, Diagnostic Pathology 9 (S1) (2014). doi:10.1186/1746-1596-9-s1-s3.
  • (66) C. Senaras, M. K. K. Niazi, G. Lozanski, M. N. Gurcan, DeepFocus: Detection of out-of-focus regions in whole slide digital images using deep learning, PLOS ONE 13 (10) (2018) e0205387. doi:10.1371/journal.pone.0205387.
  • (67) G. Smit, F. Ciompi, M. Cigéhn, A. Bodén, J. van der Laak, C. Mercan, Quality control of whole-slide images through multi-class semantic segmentation of artifacts, MIDL 2021 Short Paper (2021).
  • (68) K. Stacke, G. Eilertsen, J. Unger, C. Lundstrom, Measuring domain shift for deep learning in histopathology, IEEE Journal of Biomedical and Health Informatics 25 (2) (2021) 325–336. doi:10.1109/jbhi.2020.3032060.
  • (69) B. Bozorgtabar, G. Vray, D. Mahapatra, J.-P. Thiran, SOoD: Self-supervised out-of-distribution detection under domain shift for multi-class colorectal cancer tissue types, in: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), IEEE, 2021, pp. 3317–3326.
  • (70) J. Linmans, J. van der Laak, G. Litjens, Efficient out-of-distribution detection in digital pathology using multi-head convolutional neural networks, in: Proceedings of the Third Conference on Medical Imaging with Deep Learning MIDL 2020, Vol. 121 of Proceedings in Machine Learning Research, PMLR, 2020, pp. 465–478.
  • (71) A. Guha Roy, J. Ren, S. Azizi, A. Loh, V. Natarajan, B. Mustafa, N. Pawlowski, J. Freyberg, Y. Liu, Z. Beaver, N. Vo, P. Bui, S. Winter, P. MacWilliams, G. S. Corrado, U. Telang, Y. Liu, T. Cemgil, A. Karthikesalingam, B. Lakshminarayanan, J. Winkens, Does your dermatology classifier know what it doesn’t know? Detecting the long-tail of unseen conditions, Medical Image Analysis 75 (2022) 102274. doi:10.1016/
  • (72) E. Çallı, K. Murphy, E. Sogancioglu, B. van Ginneken, FRODO: Free rejection of out-of-distribution samples: application to chest X-ray analysis (2019). doi:10.48550/ARXIV.1907.01253.
  • (73) T. Cao, C.-W. Huang, D. Y.-T. Hui, J. P. Cohen, A benchmark of medical out of distribution detection, arXiv:2007.04250 [stat.ML] (2020). doi:10.48550/ARXIV.2007.04250.
  • (74) C. Berger, M. Paschali, B. Glocker, K. Kamnitsas, Confidence-based out-of-distribution detection: A comparative study and analysis, arXiv:2107.02568 [cs.CV] (2021). doi:10.48550/ARXIV.2107.02568.
  • (75) O. Zhang, J.-B. Delbrouck, D. L. Rubin, Out of distribution detection for medical images, in: Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Perinatal Imaging, Placental and Preterm Image Analysis, Springer International Publishing, 2021, pp. 102–111. doi:10.1007/978-3-030-87735-4_10.
  • (76) N. C. Wang, J. Kaplan, J. Lee, J. Hodgin, A. Udager, A. Rao, Stress testing pathology models with generated artifacts, Journal of Pathology Informatics 12 (1) (2021) 54. doi:10.4103/jpi.jpi_6_21.
  • (77) A. Sinha, K. Ayush, J. Song, B. Uzkent, H. Jin, S. Ermon, Negative data augmentation, arXiv:2102.05113 [cs.AI] (2021). doi:10.48550/ARXIV.2102.05113.
  • (78) A. Lehmussola, P. Ruusuvuori, J. Selinummi, H. Huttunen, O. Yli-Harja, Computational framework for simulating fluorescence microscope images with cell populations, IEEE Transactions on Medical Imaging 26 (7) (2007) 1010–1016. doi:10.1109/tmi.2007.896925.
  • (79) V. Ulman, D. Svoboda, M. Nykter, M. Kozubek, P. Ruusuvuori, Virtual cell imaging: A review on simulation methods employed in image cytometry, Cytometry Part A 89 (12) (2016) 1057–1072. doi:10.1002/cyto.a.23031.
  • (80) M. Gadermayr, L. Gupta, V. Appel, P. Boor, B. M. Klinkhammer, D. Merhof, Generative adversarial networks for facilitating stain-independent supervised and unsupervised segmentation: A study on kidney histology, IEEE Transactions on Medical Imaging 38 (10) (2019) 2293–2302.
  • (81) A. Z. Moghadam, H. Azarnoush, S. A. Seyyedsalehi, M. Havaei, Stain transfer using generative adversarial networks and disentangled features, Computers in Biology and Medicine 142 (2022) 105219. doi:10.1016/j.compbiomed.2022.105219.
  • (82) M. K. K. Niazi, F. S. Abas, C. Senaras, M. Pennell, B. Sahiner, W. Chen, J. Opfer, R. Hasserjian, A. Louissaint, A. Shana’ah, G. Lozanski, M. N. Gurcan, Nuclear IHC enumeration: A digital phantom to evaluate the performance of automated algorithms in digital pathology, PLOS ONE 13 (5) (2018) e0196547. doi:10.1371/journal.pone.0196547.
  • (83) A. B. Levine, J. Peng, D. Farnell, M. Nursey, Y. Wang, J. R. Naso, H. Ren, H. Farahani, C. Chen, D. Chiu, A. Talhouk, B. Sheffield, M. Riazy, P. P. Ip, C. Parra-Herran, A. Mills, N. Singh, B. Tessier-Cloutier, T. Salisbury, J. Lee, T. Salcudean, S. J. M. Jones, D. G. Huntsman, C. B. Gilks, S. Yip, A. Bashashati, Synthesis of diagnostic quality cancer pathology images by generative adversarial networks, The Journal of Pathology 252 (2) (2020) 178–188. doi:10.1002/path.5509.
  • (84) A. C. Quiros, R. Murray-Smith, K. Yuan, PathologyGAN: Learning deep representations of cancer tissue, arXiv:1907.02644 [stat.ML] (2019). doi:10.48550/ARXIV.1907.02644.
  • (85) L. Jose, S. Liu, C. Russo, A. Nadort, A. D. Ieva, Generative adversarial networks in digital pathology and histopathological image processing: A review, Journal of Pathology Informatics 12 (1) (2021) 43. doi:10.4103/jpi.jpi_103_20.
  • (86) S. Deshpande, F. Minhas, S. Graham, N. Rajpoot, SAFRON: Stitching across the frontier network for generating colorectal cancer histology images, Medical Image Analysis 77 (2022) 102337. doi:10.1016/
  • (87) A. Janowczyk, A. Madabhushi, Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases, Journal of Pathology Informatics 7 (1) (2016) 29. doi:10.4103/2153-3539.186902.
  • (88) C. J. Adcock, Sample size determination: a review, Journal of the Royal Statistical Society: Series D (The Statistician) 46 (2) (1997) 261–283. doi:10.1111/1467-9884.00082.
  • (89) M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford statistical science series, Oxford University Press, 2004.
  • (90) A. Flahault, M. Cadilhac, G. Thomas, Sample size calculation should be performed for design accuracy in diagnostic test studies, Journal of Clinical Epidemiology 58 (8) (2005) 859–862. doi:10.1016/j.jclinepi.2004.12.009.
  • (91) J. M. Bland, The tyranny of power: is there a better way to calculate sample size?, BMJ 339 (2009) b3985. doi:10.1136/bmj.b3985.
  • (92) A. Hazra, Using the confidence interval confidently, Journal of Thoracic Disease 9 (10) (2017) 4124–4129. doi:10.21037/jtd.2017.09.14.
  • (93) J. A. Hanley, B. J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143 (1) (1982) 29–36. doi:10.1148/radiology.143.1.7063747.
  • (94) D. L. Simel, G. P. Samsa, D. B. Matchar, Likelihood ratios with confidence: Sample size estimation for diagnostic test studies, Journal of Clinical Epidemiology 44 (8) (1991) 763–770. doi:10.1016/0895-4356(91)90128-v.
  • (95) K. Kelley, S. E. Maxwell, J. R. Rausch, Obtaining power or obtaining precision, Evaluation & the Health Professions 26 (3) (2003) 258–287. doi:10.1177/0163278703255242.
  • (96) R. D. Riley, T. P. A. Debray, G. S. Collins, L. Archer, J. Ensor, M. Smeden, K. I. E. Snell, Minimum sample size for external validation of a clinical prediction model with a binary outcome, Statistics in Medicine 40 (19) (2021) 4230–4251. doi:10.1002/sim.9025.
  • (97) M. Pavlou, C. Qu, R. Z. Omar, S. R. Seaman, E. W. Steyerberg, I. R. White, G. Ambler, Estimation of required sample size for external validation of risk models for binary outcomes, Statistical Methods in Medical Research 30 (10) (2021) 2187–2206. doi:10.1177/09622802211007522.
  • (98) A. Haynes, A. Lenz, O. Stalder, A. Limacher, presize: An R-package for precision-based sample size calculation in clinical research, Journal of Open Source Software 6 (60) (2021) 3118. doi:10.21105/joss.03118.
  • (99) A. Echle, N. G. Laleh, P. Quirke, H. Grabsch, H. Muti, O. Saldanha, S. Brockmoeller, P. van den Brandt, G. Hutchins, S. Richman, K. Horisberger, C. Galata, M. Ebert, M. Eckardt, M. Boutros, D. Horst, C. Reissfelder, E. Alwers, T. Brinker, R. Langer, J. Jenniskens, K. Offermans, W. Mueller, R. Gray, S. Gruber, J. Greenson, G. Rennert, J. Bonner, D. Schmolze, J. Chang-Claude, H. Brenner, C. Trautwein, P. Boor, D. Jaeger, N. Gaisa, M. Hoffmeister, N. West, J. Kather, Artificial intelligence for detection of microsatellite instability in colorectal cancer—a multicentric analysis of a pre-screening tool for clinical application, ESMO Open 7 (2) (2022) 100400. doi:10.1016/j.esmoop.2022.100400.
  • (100) H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, F. Bray, Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA: A Cancer Journal for Clinicians 71 (3) (2021) 209–249. doi:10.3322/caac.21660.
  • (101) T. Saito, M. Rehmsmeier, The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets, PLOS ONE 10 (3) (2015) e0118432. doi:10.1371/journal.pone.0118432.
  • (102) M. Qi, O. Cahan, M. A. Foreman, D. M. Gruen, A. K. Das, K. P. Bennett, Quantifying representativeness in randomized clinical trials using machine learning fairness metrics, JAMIA Open 4 (3) (2021). doi:10.1093/jamiaopen/ooab077.
  • (103) F. Cabitza, A. Campagner, L. M. Sconfienza, As if sand were stone. New concepts and metrics to probe the ground on which to build trustable AI, BMC Medical Informatics and Decision Making 20 (1) (2020). doi:10.1186/s12911-020-01224-9.
  • (104) P. Saleiro, B. Kuester, L. Hinkson, J. London, A. Stevens, A. Anisfeld, K. T. Rodolfa, R. Ghani, Aequitas: A bias and fairness audit toolkit, arXiv:1811.05577 [cs.LG] (2018). doi:10.48550/ARXIV.1811.05577.
  • (105) R. K. E. Bellamy, K. Dey, M. Hind, S. C. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilovic, S. Nagar, K. N. Ramamurthy, J. Richards, D. Saha, P. Sattigeri, M. Singh, K. R. Varshney, Y. Zhang, AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias, arXiv:1810.01943 [cs.AI] (2018). doi:10.48550/ARXIV.1810.01943.
  • (106) M. S. A. Lee, J. Singh, The landscape and gaps in open source fairness toolkits, in: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–13. doi:10.1145/3411764.3445261.
  • (107) A. Roohi, K. Faust, U. Djuric, P. Diamandis, Unsupervised machine learning in pathology, Surgical Pathology Clinics 13 (2) (2020) 349–358. doi:10.1016/j.path.2020.01.002.
  • (108) I. Model, L. Shamir, Comparison of data set bias in object recognition benchmarks, IEEE Access 3 (2015) 1953–1962. doi:10.1109/access.2015.2491921.
  • (109) L. Shamir, Evaluation of face datasets as tools for assessing the performance of face recognition methods, International Journal of Computer Vision 79 (3) (2008) 225–230.
  • (110) N. Bussola, A. Marcolini, V. Maggio, G. Jurman, C. Furlanello, AI slipping on tiles: Data leakage in digital pathology, in: Pattern Recognition. ICPR International Workshops and Challenges, Springer International Publishing, 2021, pp. 167–182. doi:10.1007/978-3-030-68763-2_13.
  • (111) E. Wu, K. Wu, R. Daneshjou, D. Ouyang, D. E. Ho, J. Zou, How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals, Nature Medicine 27 (4) (2021) 582–584. doi:10.1038/s41591-021-01312-x.
  • (112) I. R. König, J. D. Malley, C. Weimar, H.-C. Diener, A. Ziegler, Practical experiences on the necessity of external validation, Statistics in Medicine 26 (30) (2007) 5499–5511. doi:10.1002/sim.3069.
  • (113) L. A. Celi, J. Cellini, M.-L. Charpignon, E. C. Dee, F. Dernoncourt, R. Eber, W. G. Mitchell, L. Moukheiber, J. Schirmer, J. Situ, J. Paguio, J. Park, J. G. Wawira, S. Yao, Sources of bias in artificial intelligence that perpetuate healthcare disparities—a global review, PLOS Digital Health 1 (3) (2022) e0000022. doi:10.1371/journal.pdig.0000022.
  • (114) ITU-T Focus Group on AI for Health, DEL05.4: Training and test data specification, FG-AI4H-DEL05.4 (2020).
  • (115) ITU-T Focus Group on AI for Health, DEL05.1: Data requirements, FG-AI4H-DEL05.1 (2020).
  • (116) Medical Device Coordination Group, Report MDCG 2022-2: Guidance on general principles of clinical evidence for in vitro diagnostic medical devices (IVDs) (2022).
  • (117) K. G. Moons, D. G. Altman, J. B. Reitsma, J. P. Ioannidis, P. Macaskill, E. W. Steyerberg, A. J. Vickers, D. F. Ransohoff, G. S. Collins, Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): Explanation and elaboration, Annals of Internal Medicine 162 (1) (2015) W1–W73. doi:10.7326/m14-0698.
  • (118) X. Liu, S. C. Rivera, D. Moher, M. J. Calvert, A. K. Denniston, Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension, BMJ (2020) m3164. doi:10.1136/bmj.m3164.
  • (119) B. Norgeot, G. Quer, B. K. Beaulieu-Jones, A. Torkamani, R. Dias, M. Gianfrancesco, R. Arnaout, I. S. Kohane, S. Saria, E. Topol, Z. Obermeyer, B. Yu, A. J. Butte, Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist, Nature Medicine 26 (9) (2020) 1320–1324. doi:10.1038/s41591-020-1041-y.
  • (120) T. Wiegand, R. Krishnamurthy, M. Kuglitsch, N. Lee, S. Pujari, M. Salathé, M. Wenzel, S. Xu, WHO and ITU establish benchmarking process for artificial intelligence in health, The Lancet 394 (10192) (2019) 9–11. doi:10.1016/s0140-6736(19)30762-7.
  • (121) M. Wenzel, T. Wiegand, Toward global validation standards for health AI, IEEE Communications Standards Magazine 4 (3) (2020) 64–69. doi:10.1109/mcomstd.001.2000006.
  • (122) V. Sounderajah, H. Ashrafian, R. M. Golub, S. Shetty, J. De Fauw, L. Hooft, K. Moons, G. Collins, D. Moher, P. M. Bossuyt, A. Darzi, A. Karthikesalingam, A. K. Denniston, B. A. Mateen, D. Ting, D. Treanor, D. King, F. Greaves, J. Godwin, J. Pearson-Stuttard, L. Harling, M. McInnes, N. Rifai, N. Tomasev, P. Normahani, P. Whiting, R. Aggarwal, S. Vollmer, S. R. Markar, T. Panch, X. Liu, Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: the STARD-AI protocol, BMJ Open 11 (6) (2021) e047709. doi:10.1136/bmjopen-2020-047709.
  • (123) G. S. Collins, P. Dhiman, C. L. A. Navarro, J. Ma, L. Hooft, J. B. Reitsma, P. Logullo, A. L. Beam, L. Peng, B. Van Calster, M. van Smeden, R. D. Riley, K. G. M. Moons, Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence, BMJ Open 11 (7) (2021) e048008. doi:10.1136/bmjopen-2020-048008.
  • (124) L. M. Stevens, B. J. Mortazavi, R. C. Deo, L. Curtis, D. P. Kao, Recommendations for reporting machine learning analyses in clinical research, Circulation: Cardiovascular Quality and Outcomes 13 (10) (2020). doi:10.1161/circoutcomes.120.006556.
  • (125) A. Homeyer, J. Lotz, L. Schwen, N. Weiss, D. Romberg, H. Höfener, N. Zerbe, P. Hufnagl, Artificial intelligence in pathology: From prototype to product, Journal of Pathology Informatics 12 (1) (2021) 13. doi:10.4103/jpi.jpi_84_20.
  • (126) European Commission, Regulation (EU) 2017/746 of the European Parliament and of the Council of 5 April 2017 on in vitro diagnostic medical devices and repealing Directive 98/79/EC and Commission Decision 2010/227/EU (2017).
  • (127) European Commission, Proposal for a regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts, document 52021PC0206 COM/2021/206 final (2021).
  • (128) Code of Federal Regulations, Title 21, Chapter I, Subchapter H, Part 809 – in vitro diagnostic products for human use (2021).
  • (129) U.S. Food & Drug Administration, FDA authorizes software that can help identify prostate cancer, Press Release (2021).
  • (130) U.S. Food & Drug Administration, DEN200080.Letter.DENG.pdf (2021).
  • (131) A. Kaushal, R. Altman, C. Langlotz, Geographic distribution of US cohorts used to train deep learning algorithms, JAMA 324 (12) (2020) 1212. doi:10.1001/jama.2020.12067.
  • (132) W. Bulten, K. Kartasalo, P.-H. C. Chen, P. Ström, H. Pinckaers, K. Nagpal, Y. Cai, D. F. Steiner, H. van Boven, R. Vink, C. Hulsbergen-van de Kaa, J. van der Laak, M. B. Amin, A. J. Evans, T. van der Kwast, R. Allan, P. A. Humphrey, H. Grönberg, H. Samaratunga, B. Delahunt, T. Tsuzuki, T. Häkkinen, L. Egevad, M. Demkin, S. Dane, F. Tan, M. Valkonen, G. S. Corrado, L. Peng, C. H. Mermel, P. Ruusuvuori, G. Litjens, M. Eklund, A. Brilhante, A. Çakır, X. Farré, K. Geronatsiou, V. Molinié, G. Pereira, P. Roy, G. Saile, P. G. O. Salles, E. Schaafsma, J. Tschui, J. Billoch-Lima, E. M. Pereira, M. Zhou, S. He, S. Song, Q. Sun, H. Yoshihara, T. Yamaguchi, K. Ono, T. Shen, J. Ji, A. Roussel, K. Zhou, T. Chai, N. Weng, D. Grechka, M. V. Shugaev, R. Kiminya, V. Kovalev, D. Voynov, V. Malyshev, E. Lapo, M. Campos, N. Ota, S. Yamaoka, Y. Fujimoto, K. Yoshioka, J. Juvonen, M. Tukiainen, A. Karlsson, R. Guo, C.-L. Hsieh, I. Zubarev, H. S. T. Bukhar, W. Li, J. Li, W. Speier, C. Arnold, K. Kim, B. Bae, Y. W. Kim, H.-S. Lee, J. P. and, Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge, Nature Medicine 28 (1) (2022) 154–163. doi:10.1038/s41591-021-01620-2.
  • (133) S. N. Dudgeon, S. Wen, M. G. Hanna, R. Gupta, M. Amgad, M. Sheth, H. Marble, R. Huang, M. D. Herrmann, C. H. Szu, D. Tong, B. Werness, E. Szu, D. Larsimont, A. Madabhushi, E. Hytopoulos, W. Chen, R. Singh, S. N. Hart, A. Sharma, J. Saltz, R. Salgado, B. D. Gallas, A pathologist-annotated dataset for validating artificial intelligence: A project description and pilot study, Journal of Pathology Informatics 12 (1) (2021) 45. doi:10.4103/jpi.jpi_83_20.
  • (134) R. Rodrigues, Legal and human rights issues of AI: Gaps, challenges and vulnerabilities, Journal of Responsible Technology 4 (2020) 100005. doi:10.1016/j.jrt.2020.100005.
  • (135) C. J. Kelly, A. Karthikesalingam, M. Suleyman, G. Corrado, D. King, Key challenges for delivering clinical impact with artificial intelligence, BMC Medicine 17 (1) (2019). doi:10.1186/s12916-019-1426-2.
  • (136) T. Evans, C. O. Retzlaff, C. Geißler, M. Kargl, M. Plass, H. Müller, T.-R. Kiehl, N. Zerbe, A. Holzinger, The explainability paradox: Challenges for xAI in digital pathology, Future Generation Computer Systems 133 (2022) 281–296. doi:10.1016/j.future.2022.03.009.
  • (137) S. G. Finlayson, A. Subbaswamy, K. Singh, J. Bowers, A. Kupke, J. Zittrain, I. S. Kohane, S. Saria, The clinician and dataset shift in artificial intelligence, New England Journal of Medicine 385 (3) (2021) 283–286. doi:10.1056/nejmc2104626.