Fast and Scalable Image Search For Histology

07/28/2021 ∙ by Chengkuan Chen, et al. ∙ Harvard University

The expanding adoption of digital pathology has enabled the curation of large repositories of histology whole slide images (WSIs), which contain a wealth of information. Similar pathology image search offers the opportunity to comb through large historical repositories of gigapixel WSIs to identify cases with similar morphological features, and can be particularly useful for diagnosing rare diseases and identifying similar cases for predicting prognosis, treatment outcomes, and potential clinical trial success. A critical challenge in developing a WSI search and retrieval system is scalability, which is uniquely challenging given the need to search a growing number of slides that can each consist of billions of pixels and be several gigabytes in size. Such systems are typically slow, and retrieval speed often scales with the size of the repository they search through, making their clinical adoption tedious and infeasible for repositories that are constantly growing. Here we present Fast Image Search for Histopathology (FISH), a histology image search pipeline that is infinitely scalable and achieves constant search speed independent of the image database size, while remaining interpretable and requiring no detailed annotations. FISH uses self-supervised deep learning to encode meaningful representations from WSIs and a van Emde Boas tree for fast search, followed by an uncertainty-based ranking algorithm to retrieve similar WSIs. We evaluated FISH on multiple tasks and datasets with over 22,000 patient cases spanning 56 disease subtypes. We additionally demonstrate that FISH can be used to assist with the diagnosis of rare cancer types where sufficient cases may not be available to train traditional supervised deep models. FISH is available as an easy-to-use, open-source software package (https://github.com/mahmoodlab/FISH).


FISH: A scalable search engine for histology images

FISH is a deep learning-based histology image retrieval method that combines a VQ-VAE[30] and a van Emde Boas (vEB) tree to achieve constant-time search with low storage cost, while also supporting patch-level retrieval and human interpretability. To achieve this performance, we represent each slide as a set of integers and binary codes for efficient storage and encode the integers into a vEB tree for fast search. An overview of the FISH pipeline is shown in Figure 1.

FISH begins by distilling a mosaic representation of a given slide[27]. To select the patches used to represent the slide, we use two-stage K-means clustering: we first apply K-means clustering to the RGB features extracted from patches at 5× magnification, followed by K-means clustering on the coordinates of patches at 20× within each initial cluster. We extract the image patches corresponding to the coordinates of the final cluster centers and use them as a mosaic representation of the slide. To convert the mosaics into a set of integers and binary codes (Figure 1b), we pre-train a VQ-VAE, a variant of the Variational Autoencoder[31] that assigns the input a discrete latent code from a codebook, learned on TCGA slides at 20×. We use the encoder of the pretrained VQ-VAE along with the learned codebook to encode the patches at 20×, and extract mosaic texture features using a DenseNet[32] model and a binarization algorithm. The last step is to convert the discrete latent codes into integers so that the mosaics can be stored in the vEB tree. We feed the latent codes of the mosaics into a pipeline composed of a series of average pooling (AvgPool), summation, and shift operations. The intuition behind this pipeline is to summarize the information at each scale via summation and then store it in a different range of digits of an integer.
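A minimal sketch of the two-stage K-means selection just described, assuming precomputed mean-RGB features and patch coordinates; the cluster counts and ratio are illustrative rather than the paper's exact settings:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_mosaic(rgb_features, coords, n_color_clusters=9, coord_ratio=0.05):
    """Two-stage K-means: cluster patches by color, then by location within
    each color cluster; the patches nearest the final centers form the mosaic."""
    color_labels = KMeans(n_clusters=n_color_clusters, random_state=0).fit_predict(rgb_features)
    mosaic_coords = []
    for c in range(n_color_clusters):
        members = coords[color_labels == c]
        if len(members) == 0:
            continue
        k = max(1, int(coord_ratio * len(members)))  # spatial clusters scale with cluster size
        centers = KMeans(n_clusters=k, random_state=0).fit(members).cluster_centers_
        for center in centers:  # snap each spatial center to its nearest real patch
            mosaic_coords.append(members[np.argmin(np.linalg.norm(members - center, axis=1))])
    return np.asarray(mosaic_coords)
```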

During search (Figure 1c), we extract the features of the preprocessed mosaics of the query whole slide image and then apply the proposed Guided Search Algorithm (GSA) to find the most similar results for each mosaic. The design principle of GSA is to find a fixed number of nearest neighbors using the vEB tree and keep only the neighbors whose Hamming distances from the query mosaic fall below a certain threshold. Since we only look for a fixed number of neighbors, and the time for the vEB tree to find each neighbor is O(log log M) over a fixed universe of size M, the time complexity of a FISH search is O(1) with respect to the database size. The search result of each mosaic is a list of patches; each patch carries metadata documenting the name of the slide it comes from, the diagnosis of that slide, and the Hamming distance between the patch and the query mosaic. Once each mosaic has its search results, our ranking algorithm ranks the candidate mosaics used to retrieve the final top-K similar slides. We collect all slides that appear in the search results of the candidate mosaics and sort them by Hamming distance in ascending order to return the top-K similar slides.

In the next sections, we demonstrate performance in four areas: (1) disease subtype retrieval within a fixed anatomic site in public cohorts (TCGA, CPTAC); (2) disease subtype retrieval within a fixed anatomic site in an independent cohort (BWH in-house data) to test generalizability; (3) anatomic site retrieval; and (4) speed and interpretability. In addition, we report that our system can handle patch-level retrieval with O(1) query speed on Kather100k[33] and in-house prostate data, even though it is not specifically designed for this task.

Results

Disease subtypes retrieval in public cohorts
We report the majority top-5 accuracy (mMV@5) for FISH and Yottixel as our main comparison metric and provide mAP@5 of FISH for reference. The mMV@5 evaluates how often the majority slide diagnosis among the top 5 results matches the query, while mAP@5 measures how well the model ranks slides with the same diagnosis as the query higher in the retrieval results. We used mMV@5 as the primary metric because it is stricter than the widely used top-5 accuracy: a result is considered correct only when the majority diagnosis among the retrievals agrees with the query. More details can be found in Online Methods. We built the FISH pipeline on slides from each anatomic site and tested whether FISH can retrieve slides with the correct diagnosis. Overall, FISH had better results than Yottixel (Figure 2a) in terms of macro-averaged mMV@5 within each site and across all sites. We believe macro-averaging is the appropriate measure here, as uncommon cases in an unbalanced real-world histology database are as crucial as common ones. For sites such as Pulmonary, Gynecological, Urinary, and Hematopoietic, where the data distributions are skewed, FISH outperforms Yottixel on the uncommon diagnoses by a large margin (47.68% improvement on Pulmonary-MESO; 29%, 16.3%, and 16.2% improvement on Gynecological-UCS, Gynecological-CESC, and Gynecological-OV, respectively; 14.2% improvement on Urinary-KICH; and 30.2% improvement on Hematopoietic-DLBC). A detailed comparison is shown in Table 1, and individual retrieval results are available in Supplementary Table 1. In addition, the speed advantage of FISH became especially pronounced once the number of slides in the database exceeded 1,000 (Figure 2b). The median query speed of FISH remains almost constant despite the growing number of slides, in line with our theoretical results. We perform more experiments demonstrating that FISH scales to thousands of slides in a later section (Speed and Interpretability). Additionally, the ranking algorithm plays a crucial role in the success of FISH and applies three post-processing steps to the predictions. We therefore conducted an ablation study to validate these steps and found that FISH achieves the best performance when all steps are included (red line in Figure 3b). The details of these steps are explained in the Ablation Study in our Methods.

To further test the generalization ability of FISH, we combined several diseases (KIRC, UCEC, SKCM, LUAD, and LUSC) in CPTAC with the TCGA data to test performance on a mixed public cohort, with the results reported in Figure 3a. After combination, the dataset distributions at all sites became more skewed, but the performance of FISH did not vary substantially in most cases. This result further shows that FISH can handle the dataset imbalance commonly present in the real world. The only exception was Pulmonary-MESO, whose site was highly unbalanced. Individual retrieval results are available in Supplementary Table 2. Note that our VQ-VAE was trained only on TCGA data without observing any slides in CPTAC, which further demonstrates the generalizability of our encoder.

Figure 2: Disease subtype retrieval in public cohorts. a The macro-average mMV@5 of FISH and Yottixel on the TCGA anatomic sites. FISH has better performance at most sites. b Top: the query speed comparison between FISH and Yottixel for each site. Bottom: the mean (± std) query speed of FISH and Yottixel. It is crucial to note that FISH is more than 2× faster when the number of slides exceeds 1,000. A detailed study of speed is reported in Speed and Interpretability. c-d The confusion and Hamming distance matrices provide insights into FISH's retrieval results. The x-axis and y-axis correspond to the ground truth and the model prediction, respectively. The sharp, dark diagonal lines in the confusion matrix and Hamming distance matrix suggest that FISH retrieves correct results in most cases. PB stands for pancreaticobiliary, and the numbers in parentheses in a-b denote the number of slides at each site.
Figure 3: Disease subtype retrieval in public cohorts and ablation study. a Comparison of FISH on the TCGA and TCGA + CPTAC cohorts. Performance does not vary before and after mixing in the CPTAC cohort in most cases. The number of slides for each diagnosis on TCGA, from left to right: KIRP (297), KIRC (519), KICH (121), BLCA (457), UVM (80), SKCM (473), LUSC (512), LUAD (530), MESO (86), OV (107), UCS (87), CESC (278), and UCEC (566). For the TCGA + CPTAC cohort, the only differences are KIRC (1,022), LUSC (1,191), LUAD (1,199), SKCM (756), and UCEC (1,110). b Ablation study on the ranking module of FISH. FISH achieves its best performance in the setting where all functions are applied (Filter). The details of each setting are described in the Ablation Study in our Methods. PB denotes pancreaticobiliary in Liver/PB, and the numbers in parentheses denote the number of slides at each site.

We also created confusion matrices and Hamming distance matrices (Figure 2c-d) to gain more insight. The Hamming distance matrix records each diagnosis's average pairwise Hamming distance within a given site, which helps explain the trends behind the confusion matrix. More details on how we calculate these matrices are described in Methods. Examining the confusion matrices, we see a dark diagonal line, suggesting that the majority of results FISH retrieves match the queried diagnosis. The Hamming distance matrices further explain the trends in the confusion matrices: the dark diagonal line carries the smallest Hamming distance values at all sites, demonstrating that slides with different diagnoses are pushed far apart in Hamming space. We can also use the matrices to explain why FISH performed worse on certain diseases. For example, a query slide with diagnosis Liver-CHOL was more often confused with Liver-LIHC than with Liver-PAAD, which is explained by the distance between Liver-CHOL and Liver-LIHC being smaller than that to Liver-PAAD. Another example is the gastrointestinal site, where the diagonal values of the distance matrix are generally higher than at other sites, explaining why FISH performed worse there. Similar logic applies to other sites and diseases.

Site Diagnosis #slide mMV@5 (FISH) mMV@5 (Yottixel) mAP@5 (FISH)
Brain GBM 816 87.75 91.88 88.31
LGG 838 97.02 89.77 97.18
Endocrine ACC 227 96.04 93.83 95.96
PCPG 196 91.84 88.77 91.44
THCA 518 98.07 97.66 98.23
Gastrointestinal COAD 441 48.30 76.14 58.07
ESCA 158 79.75 59.87 79.83
READ 158 44.94 10.19 48.75
STAD 357 74.23 74.23 77.09
Gynecologic UCEC 566 84.28 92.22 86.74
CESC 278 78.78 62.45 81.23
UCS 87 71.26 42.22 72.51
OV 107 83.18 66.98 83.88
Hematopoietic DLBC 43 88.37 58.13 90.58
THYM 180 93.89 98.87 95.99
Melanocytic UVM 80 70.00 83.75 77.53
SKCM 473 99.58 99.57 99.69
Liver/PB CHOL 39 46.15 43.58 56.41
LIHC 371 90.30 93.65 91.16
PAAD 209 89.47 91.04 89.41
Pulmonary LUAD 530 79.81 70.96 82.26
LUSC 512 71.68 81.70 73.52
MESO 86 55.81 8.13 63.37
Urinary BLCA 457 93.22 95.81 93.38
KIRC 519 92.29 91.66 92.91
KICH 121 90.10 75.92 88.13
KIRP 297 66.33 67.22 68.62
Prostate/Testis TGCT 254 97.64 99.21 97.56
PRAD 449 98.44 98.43 98.39
Table 1: Disease subtype retrieval on TCGA. We compare FISH to Yottixel in terms of mMV@5 on the TCGA diagnostic whole slide images across 10 anatomic sites, as this is the best result quoted from Yottixel's paper. FISH consistently performs better than Yottixel, especially at sites where the dataset is unbalanced (e.g., Gynecological and Pulmonary). We also report mAP@5 for FISH to evaluate whether it gives the desired slides a higher rank. PB stands for pancreaticobiliary.
Figure 4: Adapting to the BWH independent test cohort. a The average mMV@5 of FISH at each site in the BWH general cohort. b FISH's performance on rare cancer types in terms of average mMV@5 at each site. FISH shows comparable performance on the general and rare disease cohorts. The numbers in parentheses denote the number of slides at each site.
Figure 5: Adapting FISH to independent BWH in-house whole slide images. The confusion matrix (blue) and Hamming distance matrix (grey) for 37 diseases from 9 sites in the BWH in-house data. The x-axis and y-axis correspond to the ground truth and the model prediction, respectively. The sharp, dark diagonal lines in the confusion matrix and Hamming distance matrix suggest that FISH retrieves correct results in most cases. The alphabetical legends below each site name denote the diagnoses studied at the corresponding site.

Adapting FISH to independent BWH in-house whole slide images

There are many variations in whole slide images (WSIs) across institutions due to differences in slide preparation and digitization protocols. It is therefore essential to validate that FISH, trained on TCGA, remains robust on in-house data. We collected 8,035 diagnostic slides spanning 9 anatomic sites and 37 primary cancer subtypes from the WSI database at Brigham and Women's Hospital. For each anatomic site, we built our pipeline separately and used mMV@5 as the main evaluation metric, with mAP@5 provided for reference. FISH achieved high average mMV@5 across all sites (Figure 4a). It was especially successful in Urinary, Thyroid, Cutaneous, Liver/Biliary, and Gynecology, where, as shown in Figure 5, the diagonal lines in both the confusion and Hamming distance matrices are relatively clear. We report the detailed results in Table 2, and individual retrieval results are available in Supplementary Table 3. Note that we did not fine-tune our encoder on this cohort, which again shows the generalizability of an encoder trained only on TCGA.

Rare disease retrieval

Rare diseases usually have far fewer slides than common ones, which makes it challenging to train an effective classifier with modern machine learning methods; the situation is worse in low-resource settings. To further investigate the clinical value of FISH, we conducted another experiment specifically on rare cancer types by combining the BWH cohort with TCGA, resulting in 1,785 slides of 23 rare cancer types from 7 sites. FISH achieved mMV@5 performance comparable to that on the general cohort in the previous experiment (Figure 4b). Detailed results are given in Table 3, and individual retrieval results are available in Supplementary Table 4. This is an encouraging result, as it suggests that a whole slide database dedicated to rare diseases would let FISH attain even better performance. To the best of our knowledge, this is the first study to evaluate a whole slide search engine on rare diseases.

Site Diagnosis #slide mMV@5 mAP@5
Brain Glioblastoma Multiforme 380 93.68 92.34
Low-Grade Glioma, NOS 62 27.42 46.61
Astrocytoma 46 19.57 24.95
Anaplastic Astrocytoma 30 33.33 50.12
Oligodendroglioma 28 35.71 50.99
Thyroid Papillary Thyroid Cancer 316 97.47 97.42
Medullary Thyroid Cancer 202 76.73 80.26
Follicular Thyroid Cancer 150 65.33 69.57
Anaplastic Thyroid Cancer 114 74.56 78.59
Hurthle Cell Thyroid Cancer 56 66.07 68.20
Gastrointestinal Colorectal Adenocarcinoma 1,024 96.68 95.67
Esophageal Adenocarcinoma 178 28.65 47.62
Esophageal Squamous Cell Carcinoma 41 24.39 26.54
Anal Squamous Cell Carcinoma 39 17.95 36.06
Gynecological Uterine Endometrioid Carcinoma 480 78.96 82.35
High-Grade Serous Ovarian Cancer 242 64.88 64.71
Uterine Papillary Serous Carcinoma 157 29.94 44.16
Endometrioid Ovarian Cancer 64 21.88 35.70
Clear Cell Ovarian Cancer 48 60.42 58.86
Liver&Biliary Cholangiocarcinoma 55 49.09 57.23
Hepatocellular carcinoma 47 76.60 79.70
Gallbladder Cancer 37 78.38 73.77
Cutaneous Melanoma 197 79.70 83.14
Merkel Cell Carcinoma 75 77.33 82.04
Cutaneous Squamous Cell Carcinoma 38 47.37 59.22
Pulmonary Lung adenocarcinoma 1,377 84.46 85.91
Lung squamous cell carcinoma 392 54.59 60.96
Lung Carcinoid 53 56.60 74.55
Small Cell Lung Cancer 28 28.57 42.75
Urinary Bladder Urothelial Carcinoma 406 88.42 90.51
Kidney renal clear cell carcinoma 271 87.08 89.04
Kidney renal papillary cell carcinoma 96 59.38 62.27
Kidney Chromophobe 67 83.58 85.74
Upper tract Urothelial Carcinoma 47 55.32 61.73
Wilms Tumor 43 86.05 89.41
Breast Breast Invasive Ductal Carcinoma 859 97.44 98.29
Breast Invasive Lobular Carcinoma 290 49.66 50.55
Table 2: Adapting FISH to independent BWH in-house whole slide images. Detailed performance of FISH on 37 cancer types from the BWH test cohort. This is a general cohort containing both common and rare cancer subtypes at each site.
Site Diagnosis #slide mMV@5 mAP@5
Brain Astrocytoma 46 32.61 46.12
Anaplastic Astrocytoma 30 40.00 50.79
Oligodendroglioma 28 42.86 54.85
Pilocytic Astrocytoma 20 55.00 64.48
Anaplastic Oligodendroglioma 14 7.14 20.48
Thyroid Medullary Thyroid Cancer 202 80.20 83.21
Follicular Thyroid Cancer 150 72.67 78.86
Anaplastic Thyroid Cancer 114 77.19 82.66
Hurthle Cell Thyroid Cancer 56 75.00 79.91
Gastrointestinal Esophageal Squamous Cell Carcinoma 41 53.66 61.59
Anal Squamous Cell Carcinoma 39 66.67 69.47
Gynecological Uterine Papillary Serous Carcinoma 157 93.63 95.47
Endometrioid Ovarian Cancer 64 43.75 48.24
Clear Cell Ovarian Cancer 48 58.33 63.93
Liver/Pancreaticobiliary Pancreatic Adenocarcinoma 209 94.26 95.06
Cholangiocarcinoma 94 69.15 70.26
Pancreatic Neuroendocrine Tumor 77 76.62 82.47
Gallbladder Cancer 37 70.27 74.52
Pulmonary Lung Carcinoid 53 96.23 97.30
Small Cell Lung Cancer 28 57.14 65.60
Urinary Kidney Chromophobe 188 97.34 97.61
Upper Tract Urothelial Carcinoma 47 85.11 85.53
Wilms Tumor 43 95.35 95.12
Table 3: FISH performance on rare cancer types. Detailed performance of FISH on 23 rare cancer types from the BWH and TCGA cohorts. We only use the TCGA cohort for Cholangiocarcinoma, Pancreatic Adenocarcinoma, and Kidney Chromophobe.

Anatomic sites retrieval
Although the anatomic site of resected tissue is usually known nowadays, site information may still be missing from some older whole slide image databases. A search engine that can return slides from the same site is therefore beneficial for archiving such databases. We used the diagnostic slides from TCGA and followed the paper[27] in grouping slides into 13 categories, resulting in 11,561 whole slide images. We built the FISH pipeline on this database with the goal of retrieving slides from the same anatomic site as the query. On average, FISH achieved slightly better mMV@10 than Yottixel (Figure 6a). We compared mMV@10 in this experiment because it is the best performance reported in Yottixel's paper[27]. It is important to note that, although the accuracy gap between the two methods is small, FISH is over 15× faster than Yottixel, as shown in the rightmost box plot of Figure 6b. A detailed study of the speed of FISH and Yottixel can be found in Speed and Interpretability, and individual retrieval results are available in Supplementary Table 5.

Figure 6: Performance on anatomic site retrieval and speed. a The mMV@10 comparison between FISH and Yottixel. b The speed comparison between FISH and Yottixel on TCGA anatomic site retrieval. FISH is faster than Yottixel by 15× when the number of slides exceeds 10,000. Please refer to Speed and Interpretability for more details. c, d The confusion matrix and Hamming distance matrix of FISH on anatomic site retrieval. The x-axis and y-axis correspond to the ground truth and the model prediction, respectively. The sharp diagonal line in both matrices shows that FISH retrieves the correct results and pushes dissimilar ones apart in most cases. The numbers in parentheses denote the number of slides at each site. PB stands for pancreaticobiliary.
Figure 7: Interpretability and speed. a FISH outputs the regions of interest that define the similarity of a cancer type in both the query slide and each slide in the results. The number in parentheses is the Hamming distance between the query slide and each result, determined by the identified ROIs in each WSI. More examples can be found in Extended Data Figures 1-4. b We studied KIRC, OV, and STAD in the TCGA cohort, which span a range of mMV@5 scores. Each study contains 30 randomly selected queries, and each query contains 1-5 ROIs. The pathologist rated agree, partially agree, or disagree based on whether the regions contain tumors. Agree and partially agree ratings together form the majority in all studies.

Analysis of speed and interpretability
Speed and interpretability are, alongside accuracy, essential considerations for whole slide image search engines. Fast search makes a search engine usable on the large databases of the digital pathology era, and interpretability makes the system easier to debug and more robust to unexpected errors. In this section we demonstrate that FISH has these desired properties.

We show how FISH interprets the results of a query slide in Figure 7a. For a query slide, FISH returns the regions of the slide that are useful for defining the similarity of the cancer type. This allows us to examine these regions and ensure the search system returns results based on evidence a pathologist would agree with, rather than on meaningless regions such as debris. More examples are shown in Extended Data Figures 1-4. We conducted three interpretability studies, using TCGA-KIRC, TCGA-OV, and TCGA-STAD respectively, to understand FISH's interpretability across different levels of performance (in terms of differences in mMV@5 scores). For each study, we randomly selected 30 queries that contained at least 1 correct retrieval in the results and then extracted the ROIs found in the query slide. We asked a pathologist to rate whether the ROIs agree with their judgement as "agree", "partially agree" (i.e., the pathologist agrees with at least one of the ROIs), or "disagree". For example, in the TCGA-KIRC study, the prompt was whether the ROI contains features of KIRC. The results are shown in Figure 7b. The key finding was that agree and partially agree ratings together formed the majority in all studies.

We used the same TCGA data as in the anatomic site retrieval experiment to evaluate query speed. We applied weighted sampling to select slides from each site and created databases of size 500, 1,000, 2,000, 3,000, 4,000, 5,000, 7,000, and 9,000, together with the original dataset of 11,561 slides. We implemented both methods in Python and evaluated them on the same machine for a fair comparison. The average query speeds of both methods are reported in Figure 6b. Since we observed that Yottixel becomes inefficient beyond 3,000 slides, once the database exceeds that size we calculate the average query speed of FISH and Yottixel with the same 100 queries sampled from the databases instead of using all of the data. In contrast, the average query speed of FISH remained almost constant with low variance throughout the experiments, which agrees with our theoretical results. This result is highly encouraging, as it demonstrates that FISH can scale with the growing number of slides in the digital pathology era while maintaining an essentially constant query speed.

Figure 8: Patch level retrieval. a FISH's mMV@5 score on each tissue type in the Kather100k dataset. b FISH's mMV@5 score on prostate tissue with different Gleason scores in the in-house prostate data. c, d FISH's query speed and its mean confidence on the real Kather100k data and two augmented versions (1M and 10M). Note that the low variance on Kather1M and Kather10M is due to the way we perform data augmentation: an outlier with low query speed in the original Kather100k gains 10 and 100 neighbors after augmentation in Kather1M and Kather10M, respectively. Retrieval examples can be found in Extended Data Figures 5-6.

Patch level retrieval
We show that FISH can also perform patch-level retrieval at O(1) query speed, although it is not designed for this task. Here we treat each query patch as a single mosaic fed into the FISH search pipeline. Since there is only one mosaic, the ranking module is unnecessary; we obtain the top-K results by directly sorting the predictions by their Hamming distance. We evaluated FISH on Kather100k without color normalization (NCT-CRC-HE-100K-NONORM)[33] and on BWH in-house prostate data.

Kather100k contains 9 tissue types of 224×224-pixel patches, and the in-house prostate data contain 4 annotation types (normal tissue, Gleason score 3, Gleason score 4, and Gleason score 5) cropped from slides at 20×. We resized all patches before feeding them into our pipeline. We report FISH's macro-average mMV@5 on Kather100k (Figure 8a) and on the in-house prostate data (Figure 8b). Individual retrieval results are available in Supplementary Table 6 and Supplementary Table 7 for Kather100k and in-house prostate data, respectively; more example results can be found in Extended Data Figures 5-6. We also conducted a speed test on the Kather100k data. To efficiently curate larger datasets for testing speed, we applied data augmentation by adding noise directly to the latent code of each patch from the VQ-VAE encoder rather than to the raw image data: to each latent code we added, as noise, a binary array whose elements equal 1 or 0 with fixed probabilities. All augmented data share the same texture feature as the original. We curated the datasets Kather1M and Kather10M by augmenting each patch 10 and 100 times, respectively, and used the 100k patches in the original data as queries to test query speed. We observed that the median query speed of FISH ranges from 0.15 to 0.25 s and remains unaffected all the way to 10M patches (Figure 8c-d). Note that the work most closely related to our study reports 25 s per query on 10M patches[25].
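A sketch of this latent-space augmentation; the flip probability is a placeholder, since the exact value is not recoverable here:

```python
import numpy as np

def augment_latent(latent_code, n_copies, flip_prob=0.1, seed=0):
    """Create augmented copies of an integer latent code by adding binary noise;
    each element of the noise array is 1 with probability flip_prob, else 0."""
    rng = np.random.default_rng(seed)
    copies = []
    for _ in range(n_copies):
        noise = (rng.random(latent_code.shape) < flip_prob).astype(latent_code.dtype)
        copies.append(latent_code + noise)  # the texture feature is reused unchanged
    return copies

# Kather1M: 10 copies per patch; Kather10M: 100 copies per patch.
```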

Discussion

In summary, we show that FISH addresses several key challenges in whole slide image search: speed, accuracy, and scalability. Our experiments demonstrate that FISH is an interpretable histology image search pipeline that achieves constant-speed search after training with only slide-level labels. This constant search speed and freedom from pixel-level annotations will only become more important as institutions' WSI repositories grow to hundreds of thousands or millions of slides. We also showed that FISH performs strongly on the unbalanced datasets commonly seen in real-world histopathology, generalizes to independent test cohorts and rare diseases, and can even be used as a search engine for patch retrieval.

To the best of our knowledge, our study presents the first search pipeline evaluated on the largest and most diverse dataset of diagnostic slides to date, while also reporting speed, an essential metric for a histology search engine[12]. We are also the first to evaluate a whole slide image search engine on rare cancer types. Additionally, FISH is the first search pipeline that provides interpretable results for interrogation by pathologists.

Although the combination of a VQ-VAE and the vEB tree is key to the success of our method, this approach is limited by the expressiveness of the integer indices created in this way. The accuracy of the method could be increased by lengthening these indices, but at the cost of increasing the size of the metadata needed for searching and decreasing the speed of the search itself, as the vEB tree would need to visit more neighbors before finding the optimal candidates to return. One line of future work is to design a better indexing system whereby the distances would be more semantically relevant, expediting the searching process of the vEB tree; one can also imagine a system in which institutions tune the index for more speed or greater accuracy, depending on their needs. In addition, due to limited access to large annotated patch datasets, the performance of FISH on large-scale patch-level retrieval has not yet been fully investigated. Evaluating FISH on millions or even billions of annotated patches is thus another promising future direction.

Human-in-the-loop computing has been identified as a potential way to bring deep learning-based applications for medical images closer to the clinic[34]. Allowing end-users to give feedback, and using that feedback to iteratively refine the system, can help algorithms generalize better to unseen data[34]. Many deep learning-based medical image segmentation models have utilized this concept[35, 36, 37], but it is not commonly used in histology image search systems. In our study, we have shown that FISH returns interpretable semantic descriptors for both query and result slides, making it feasible to build a feedback loop into FISH whereby pathologists could agree or disagree with semantic descriptors to refine or expand the search without any additional training or fine-tuning. This may be especially useful in complex settings such as rare disease retrieval, where finding additional data to improve search results may be impossible. By providing researchers and pathologists with a novel and efficient way of searching, sharing, and accessing knowledge, and by leveraging human-in-the-loop computing, FISH shows promise for seamless integration into the digital pathology workflow and a potential role in medical education, research, and even the clinical setting.

Online Methods

FISH
FISH is a histology image search pipeline that addresses scalability issues of speed, storage, and pixel-wise label scarcity. It builds upon a set of mosaics preprocessed from whole slide images without pixel-wise labels, saving storage and labelling cost, and achieves O(1) search speed by combining the discrete latent code of a VQ-VAE, the Guided Search Algorithm, and the Ranking Algorithm. We present these essential components of FISH in this section.

Discrete Latent Code of VQ-VAE. VQ-VAE[30] is a variant of the VAE that introduces a training objective allowing a discrete latent code. Let $\mathcal{E} = \{e_j\}_{j=1}^{K} \subset \mathbb{R}^{D}$ be the latent space (i.e., codebook), where $K$ is the number of discrete codewords and $D$ is the dimension of each codeword; we use $K = 128$ in our experiments, so codeword indices range from 0 to 127. To decide the codeword of a given input $x$, an encoder $E$ encodes the input as $z_e(x) = E(x)$. The final codeword $z_q(x)$ and the training objective are given by

$z_q(x) = e_k, \quad k = \operatorname{argmin}_{j} \lVert z_e(x) - e_j \rVert_2$   (1)

$L = \lVert x - D(z_q(x)) \rVert_2^2 + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2 + \beta \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2$   (2)

where $\beta$ is a hyperparameter, $D$ is the decoder, and sg denotes the stop-gradient operation, which acts as the identity function in the forward pass while having zero gradient during the backward pass. The first term of the objective optimizes the encoder and decoder for good reconstruction, the second term updates the codebook, and the third term prevents the encoder's output from drifting too far from the latent space. The detailed architecture of our VQ-VAE is shown in Extended Data Figure 7. We reordered the codebook by the value of the first principal component and changed the latent code accordingly, as we found the reordered codebook provides more semantic correspondence to the original input image (Extended Data Figure 8).
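A condensed PyTorch sketch of the quantization step and the three-term objective above; the encoder, decoder, and tensor dimensions are placeholders, and K = 128 follows the codebook size used for index construction below:

```python
import torch
import torch.nn.functional as F

def vq_vae_loss(encoder, decoder, codebook, x, beta=1.0):
    """codebook: (K, D) learnable tensor; x: batch of input patches."""
    z_e = encoder(x)                                   # (B, D, H, W)
    flat = z_e.permute(0, 2, 3, 1).reshape(-1, codebook.shape[1])
    dists = torch.cdist(flat, codebook)                # distance to every codeword
    idx = dists.argmin(dim=1)                          # nearest-codeword indices
    z_q = (codebook[idx]
           .view(z_e.shape[0], z_e.shape[2], z_e.shape[3], -1)
           .permute(0, 3, 1, 2))
    z_q_st = z_e + (z_q - z_e).detach()                # straight-through estimator
    x_hat = decoder(z_q_st)
    recon = F.mse_loss(x_hat, x)                       # term 1: reconstruction
    codebook_loss = F.mse_loss(z_q, z_e.detach())      # term 2: update codewords
    commit = F.mse_loss(z_e, z_q.detach())             # term 3: commitment
    return recon + codebook_loss + beta * commit, idx  # idx: flat latent code
```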

Feature Extraction, Index Generation and Index Encoding. We show how each mosaic can be represented by a tuple composed of a mosaic index and a mosaic texture feature. To get the index, we encode and re-map the latent code with the encoder and reordered codebook of the VQ-VAE. The index is then produced by a fixed pipeline of average pooling, summation, and shift operations (equations (3)-(7)): the latent code is average-pooled into successively coarser maps, the entries of each pooled map are summed to give one integer per scale, and each per-scale sum is shifted into its own range of decimal digits before all sums are added into a single integer index.

We insert each index into a vEB tree for fast search. Note that all operations on a vEB tree over a universe of size $M$ take $O(\log \log M)$ time. Based on this property, the universe size can be determined by

$2^{2^{k}} \geq p_{\max}$   (8)

where $k$ is the minimum integer that makes the inequality hold and $p_{\max}$ is the largest possible index. Since our codeword indices range from 0 to 127, we can bound the maximum summation at each level and hence $p_{\max}$; solving the inequality gives the minimum $k$. Because this $k$ is a constant that depends only on the index generation pipeline, our search performance is $O(1)$ with respect to the number of slides. To get the texture feature, we use DenseNet121 to extract a feature from the patch at 20× and then follow the algorithm proposed in the paper[27] to binarize it.
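A schematic sketch of the index pipeline described above, assuming a 64×64 latent code, 2×2 average pooling over three levels, and illustrative digit offsets chosen so the per-level sums cannot overlap; the released code may use different constants:

```python
import torch
import torch.nn.functional as F

def latent_to_index(z):
    """z: (64, 64) integer latent code with values in [0, 127].
    Summarize each pooled scale by summation and pack the sums into
    disjoint ranges of decimal digits of a single integer."""
    z = z.float().unsqueeze(0).unsqueeze(0)    # (1, 1, 64, 64)
    z1 = F.avg_pool2d(z, 2)                    # (1, 1, 32, 32)
    z2 = F.avg_pool2d(z1, 2)                   # (1, 1, 16, 16)
    z3 = F.avg_pool2d(z2, 2)                   # (1, 1, 8, 8)
    s1, s2, s3 = z1.sum(), z2.sum(), z3.sum()  # one summary per scale
    # shift: each (truncated) sum occupies its own range of decimal digits;
    # with values in [0, 127], s1 <= 130048, s2 <= 32512, s3 <= 8128,
    # so the assumed offsets 10**11 and 10**6 keep the ranges disjoint
    return int(s1) * 10**11 + int(s2) * 10**6 + int(s3)
```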

In addition to creating the tuple representing the mosaic, we also build a hash table with the index as key and the mosaic's metadata as value. The metadata includes the texture feature, the name of the slide the mosaic belongs to, the coordinates where the mosaic is cropped from the slide, the slide file format, and the diagnosis of the slide. Note that different mosaics can share the same key; in that case, the value is a list that stores all of their metadata.

Guided Search Algorithm. Given a query slide represented as a set of tuples, one per mosaic, each composed of a mosaic index and its texture feature, we apply Guided-Search to each tuple and return the corresponding results. Each result set consists of the indices of similar mosaics in the database together with their associated information: the Hamming distance between each retrieved mosaic and the query mosaic, along with the retrieved mosaic's metadata, including the diagnosis and site of the slide it comes from, the position where it was cropped, and the slide file format.

The drawback of querying with the index alone is that the mosaic index is sensitive to small changes in the latent code: a mosaic that differs only slightly from another can incur a large difference in the integer index, placing the two mosaics far apart in the vEB tree. To address this issue, we create sets of candidate indices above and below the original index by repeatedly adding and subtracting a fixed integer from it. We then call the helper functions Forward-Search and Backward-Search to search the neighboring indices in the upward and downward directions, respectively. Both functions include only the neighbors whose Hamming distance from the query is smaller than a threshold. The details are shown in Algorithms 1-3.

Inputs: the hash table keyed by mosaic index whose values store mosaic metadata; the integer offset and the number of times it is added and subtracted; the threshold on the Hamming distance between the query mosaic and a neighbor; and the number of times to call vEB.Successor() and vEB.Predecessor().
function Guided-Search(query mosaic index, query texture feature)
     build candidate index sets above and below the query index by repeatedly adding and subtracting the offset
     results_forward ← Forward-Search(candidates above, query texture feature)
     results_backward ← Backward-Search(candidates below, query texture feature)
     results ← merge(results_forward, results_backward)
     results ← Sort-Ascending(results)      // by Hamming distance to the query
     return results
Algorithm 1 Guided Search Algorithm
function Forward-Search(candidate indices, query texture feature)
     results ← empty list
     for each candidate index do
          current ← candidate index
          while the number of calls is below the limit do
               current ← vEB.Successor(current)
               if current is outside the search range or is empty then
                    break
               else if the stored patient is identical to the query slide's patient then
                    // The case when the patient is identical to the query slide
                    skip this neighbor and continue
               else
                    // Find the mosaic with the smallest Hamming distance under the same key
                    take the closest stored mosaic for this index
               if its Hamming distance to the query is below the threshold then
                    append it to results
     return results
Algorithm 2 Forward Search Algorithm
function Backward-Search(candidate indices, query texture feature)
     results ← empty list
     for each candidate index do
          current ← candidate index
          while the number of calls is below the limit do
               current ← vEB.Predecessor(current)
               if current is outside the search range or is empty then
                    break
               else if the stored patient is identical to the query slide's patient then
                    // The case when the patient is identical to the query slide
                    skip this neighbor and continue
               else
                    // Find the mosaic with the smallest Hamming distance under the same key
                    take the closest stored mosaic for this index
               if its Hamming distance to the query is below the threshold then
                    append it to results
     return results
Algorithm 3 Backward Search Algorithm
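To make the recoverable logic concrete, here is a hedged Python sketch of Guided-Search; the vEB interface (successor/predecessor), the metadata table, and all default parameter values are assumptions rather than the paper's exact settings:

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two equal-length binary feature arrays."""
    return int((np.asarray(a) != np.asarray(b)).sum())

def guided_search(query_index, query_feature, query_patient, veb, table,
                  offset=10**6, n_offsets=2, n_steps=3, dist_thresh=128):
    """Visit a fixed number of vEB neighbors around candidate indices derived
    from the query index; keep only sufficiently similar mosaics.
    `veb` is assumed to expose successor()/predecessor(); `table` maps an
    index to the list of mosaic metadata stored under that index."""
    ups = [query_index + i * offset for i in range(n_offsets + 1)]
    downs = [query_index - i * offset for i in range(1, n_offsets + 1)]
    results = []
    for starts, step in ((ups, veb.successor), (downs, veb.predecessor)):
        for cand in starts:
            cur = cand
            for _ in range(n_steps):              # fixed number of neighbor visits
                cur = step(cur)
                if cur is None:                   # ran off the end of the tree
                    break
                metas = [m for m in table.get(cur, [])
                         if m["patient"] != query_patient]  # leave-one-patient-out
                if not metas:
                    continue
                # among mosaics sharing this key, keep the closest one
                best = min(metas, key=lambda m: hamming(m["feature"], query_feature))
                d = hamming(best["feature"], query_feature)
                if d < dist_thresh:
                    results.append((d, cur, best))
    results.sort(key=lambda r: r[0])              # ascending Hamming distance
    return results
```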

Results Ranking Algorithm. Our ranking function Ranking (Algorithm 4) takes the results from Guided-Search as input and outputs the top 5 similar slides for the query slide. The intuition of Ranking is to find the most promising mosaics in the results based on their uncertainty. It relies on three helper functions: Uncertainty-Cal (Algorithm 5), Clean (Algorithm 6), and Filtered-By-Prediction (Algorithm 7).

function Ranking(search results of all query mosaics)
     if the results are empty then return
     // Normalize the reciprocal of each diagnosis count so that the weights sum to a
     // constant N; N differs between the fixed-site and anatomic-site experiments.
     for each mosaic's results do
          if the results are not empty then
               compute the entropy and summaries via Uncertainty-Cal
          else
               continue
     apply Clean to remove outlier mosaics
     apply Filtered-By-Prediction to remove mosaics that disagree with the pseudo label
     for each remaining mosaic, in ascending order of uncertainty, do
          append the slides in its results to the output, skipping slides already
          included and, when the mosaic's uncertainty is nonzero, skipping results
          whose Hamming distance exceeds the threshold
     return the top-K similar slides
Algorithm 4 Results Ranking Algorithm

Uncertainty-Cal (Algorithm 5) takes the search results as input and calculates the uncertainty of each mosaic's results via entropy: the lower the entropy, the less uncertain the mosaic, and vice versa. The output is the entropy of each result set along with records that summarize the diagnosis occurrences and Hamming distances of its elements. The disadvantage of counting occurrences naively in the entropy calculation is that the most frequent diagnosis at an anatomic site dominates the result, downplaying the importance of the others. We introduce a weighted-occurrence approach to address this issue. The approach counts diagnosis occurrences by considering the percentage of each diagnosis at the given site and the position of the diagnosis in the retrieval results. It computes the weight of each diagnosis at the anatomic site as the reciprocal of that diagnosis's count, and we normalize the weights so that their sum equals a constant. A diagnosis's final occurrence is the product of its normalized weight and the inverse of the position at which it appears in the results; the same diagnosis can therefore have different weighted occurrences depending on its position. In this way, less frequent diagnoses, and those with lower Hamming distance (i.e., closer to the front of the retrieval results), gain more importance in the ranking process. After this stage, we also summarize the results with three pieces of metadata to facilitate the subsequent steps:

  • A nested hash table that stores the index of each mosaic as the key and its weighted diagnosis-occurrence table as the value.

  • An array of tuples storing, for each mosaic's results, the mosaic index, the entropy, the Hamming distances of all retrieved mosaics, and the total number of retrieved mosaics.

  • An array that stores the total number of retrieved mosaics for each mosaic's results.

function Weighted-Uncertainty-Cal(results of one mosaic, diagnosis weights)
     for each retrieved mosaic, in order, do
          add the diagnosis weight divided by the retrieval position to that
          diagnosis's weighted occurrence
     normalize the weighted occurrences and compute their entropy
     return the entropy and the occurrence summaries
Algorithm 5 Uncertainty Calculation
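A minimal sketch of the weighted-occurrence entropy described above; the per-diagnosis weights are assumed to be precomputed (reciprocal counts normalized to a constant):

```python
import numpy as np

def weighted_uncertainty(retrievals, diag_weight):
    """retrievals: ordered list of (diagnosis, hamming_distance) for one mosaic.
    diag_weight: per-diagnosis weight (normalized reciprocal of diagnosis counts).
    Returns the entropy of the weighted diagnosis occurrences, plus the table."""
    occ = {}
    for pos, (diag, _) in enumerate(retrievals, start=1):
        # weight shrinks with retrieval position, so early hits count more
        occ[diag] = occ.get(diag, 0.0) + diag_weight[diag] / pos
    p = np.array(list(occ.values()))
    p = p / p.sum()
    entropy = float(-(p * np.log(p)).sum())
    return entropy, occ
```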

Clean (Algorithm 6) aims to remove outliers and the mosaics that are less similar to the query. It takes the mosaic summaries from the previous stage as input and removes results whose length falls below a lower quantile or above an upper quantile. In addition, we take the average of the mean Hamming distance of the top-5 mosaics of each result set as a threshold, using it to filter out results whose mean top-5 Hamming distance is greater than the threshold. After cleaning the results, we sort them by the uncertainty calculated by Uncertainty-Cal in ascending order.

function Clean(mosaic summaries)
     // Inputs: the array of tuples (mosaic index, entropy, Hamming distances,
     // result count) and the array of result counts for each mosaic.
     // When the number of unique result lengths is less than 3, keep the original results.
     if the number of unique result lengths ≥ 3 then
          for each mosaic do
               if its result length falls below the lower quantile or above the upper quantile then
                    delete its results
               else
                    keep them and record the mean Hamming distance of its top-5 retrievals
     compute the threshold as the average of the recorded top-5 means
     for each remaining mosaic do
          if its top-5 mean Hamming distance exceeds the threshold then
               delete its results
     sort the remaining mosaics by entropy in ascending order
     return the cleaned, sorted summaries
Algorithm 6 Results Cleaning
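A sketch of Clean under assumed quantile cutoffs (shown here as the 5th/95th percentiles, since the exact values are not recoverable):

```python
import numpy as np

def clean(summaries, low_q=5, high_q=95):
    """summaries: list of dicts with keys 'len' (number of results for a mosaic),
    'entropy', and 'top5_mean' (mean Hamming distance of its top-5 retrievals).
    Drop outlier mosaics, then sort the survivors by uncertainty."""
    lengths = [s["len"] for s in summaries]
    lo, hi = np.percentile(lengths, [low_q, high_q])
    kept = [s for s in summaries if lo <= s["len"] <= hi]
    thresh = np.mean([s["top5_mean"] for s in kept])   # average of top-5 means
    kept = [s for s in kept if s["top5_mean"] <= thresh]
    return sorted(kept, key=lambda s: s["entropy"])    # most certain first
```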

We could simply return slides from the front of the results sorted by uncertainty. However, the low uncertainty of the first several mosaics may be caused by the domination of the most frequent diagnosis at the given anatomic site. For example, the most frequent occurrences of the top 5 entries could be KIRC, BLCA, KIRP, KIRP, and KIRP at the urinary site. In this case, the query slide would be better diagnosed as KIRP based on the majority vote; the first and second entries, which are dominated by the most frequent urinary-site cases, should therefore not be considered during retrieval. We leverage Filtered-By-Prediction (Algorithm 7) to mitigate this issue. The function sums the diagnosis occurrences over the top 5 most certain mosaics and takes the diagnosis with the maximum score as a pseudo ground-truth diagnosis. Afterwards, it removes the mosaics whose maximum-occurrence diagnosis disagrees with the pseudo ground truth.

To return the final results for a query slide, we take the slide names and diagnoses pointed to by the remaining mosaics one by one. If the uncertainty of a mosaic is zero, we take all of its results; otherwise, we use the Hamming distance threshold again to ignore results whose distance exceeds it. We sort the final results first by uncertainty in ascending order and then by Hamming distance in descending order when uncertainties are tied.

function Filtered-By-Prediction(mosaic summaries, weighted occurrence tables)
     // Inputs: the array of tuples (mosaic index, entropy, Hamming distances,
     // result count) and the nested hash table of weighted diagnosis occurrences.
     for each of the top 5 most certain mosaics do
          // Calculate the score of each diagnosis
          add its weighted occurrences to the per-diagnosis scores
     pseudo label ← the diagnosis with the maximum score
     // A while loop is used here to avoid the case where the pseudo label removes all mosaics.
     while true do
          kept ← the mosaics whose maximum-occurrence diagnosis equals the pseudo label
          if kept is not empty then
               break
          else
               relax the filter so that not all mosaics are removed
     return kept
Algorithm 7 Results Filtering by Prediction
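A sketch of Filtered-By-Prediction; the summary fields and the non-emptiness guard are our reading of the description above:

```python
def filter_by_prediction(sorted_summaries, occurrences):
    """sorted_summaries: mosaics sorted by ascending uncertainty.
    occurrences: per-mosaic weighted diagnosis-occurrence tables, keyed by 'id'."""
    top = sorted_summaries[:5]
    score = {}
    for s in top:                        # sum occurrences over the top-5 certain mosaics
        for diag, w in occurrences[s["id"]].items():
            score[diag] = score.get(diag, 0.0) + w
    pseudo = max(score, key=score.get)   # pseudo ground-truth diagnosis
    kept = [s for s in sorted_summaries
            if max(occurrences[s["id"]], key=occurrences[s["id"]].get) == pseudo]
    return kept if kept else sorted_summaries  # never remove every mosaic
```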

Training details of VQ-VAE. We used a sampled version of the TCGA slide data from the first experiment (i.e., disease subtype retrieval on TCGA) to train our VQ-VAE. For each slide, we sampled 10 patches of size 1024×1024 at 20×. All patches were converted from RGB to PyTorch tensors and then normalized. The model was trained with the Adam optimizer, without weight decay and with AMSGrad; we used the default settings for the other Adam hyperparameters. We trained the model with a batch size of 4 for 10 epochs and applied gradient clipping with the gradient threshold set to 1.0. The hyperparameter β in the VQ-VAE was also set to 1.

Ablation Study. We conducted an ablation study on our ranking module to test the benefit of each function. Specifically, we compared the performance of the following four settings: (1) Naive: removing Clean and Filtered-By-Prediction and treating each diagnosis occurrence in the mosaic retrieval results equally (i.e., replacing the weighted-occurrence assignment in Algorithm 5 with 1); (2) Weighted count: applying only Uncertainty-Cal in the ranking module; (3) Clean: applying Uncertainty-Cal and Clean; (4) Filter: applying all functions.

Visualization. We build a confusion matrix for each site, using each slide's ground-truth diagnosis along the x-axis and the predicted diagnosis along the y-axis. For the Hamming distance matrix, we inspect the Hamming distance between the query slide and each of its results one by one, adding the Hamming distance to the associated diagnosis label and infinity to all others. Infinity here is defined as the Hamming distance threshold plus 1, since the threshold is the maximum distance attainable in our pipeline. The final Hamming distance matrix is obtained by dividing by the total number of slides at the given anatomic site.
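A sketch of how one query's contribution to a row of the Hamming distance matrix could be accumulated under this scheme, with "infinity" set to the threshold plus one:

```python
import numpy as np

def hamming_row(results, labels, dist_thresh):
    """results: list of (diagnosis, hamming_distance) for one query slide.
    labels: ordered list of diagnoses at the site. Returns one accumulated row."""
    inf = dist_thresh + 1                        # 'infinity' in this pipeline
    row = np.zeros(len(labels))
    for diag, dist in results:
        contrib = np.full(len(labels), inf, dtype=float)
        contrib[labels.index(diag)] = dist       # real distance for the hit diagnosis
        row += contrib
    return row  # later divided by the number of slides at the site
```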

Evaluation Metrics
For all experiments, we remove slides with the same patient ID as the query slide from the database (i.e., leave-one-patient-out evaluation). We use the mean majority vote accuracy over the top k retrievals (mMV@k) instead of top-k accuracy over all instances, as this metric is more suitable for the medical domain[26]. We also use mean average precision at k (mAP@k) to further evaluate retrieval performance. Specifically,

$\text{mMV@}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\hat{y}_i = y_i\right]$   (9)

$\text{mAP@}k = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{k} \sum_{j=1}^{k} \frac{m_{i,j}}{j}\, \mathbb{1}\!\left[y_{i,j} = y_i\right]$   (10)

where $N$ is the number of slides, $y_i$ is the ground-truth diagnosis of slide $i$, and $\hat{y}_i$ is the predicted diagnosis of slide $i$, taken as the majority vote over its top-$k$ retrievals. $\mathbb{1}[\cdot]$ is an indicator function that outputs 1 if its two inputs are the same and 0 otherwise, $y_{i,j}$ is the diagnosis of the $j$-th retrieval for slide $i$, and $m_{i,j}$ denotes the number of times the predicted diagnosis matches the ground truth among the top $j$ retrievals of slide $i$. Note that mAP@k is a more lenient metric than mMV@k: a model can obtain a nonzero mAP@k by placing a single relevant slide first among the top k retrievals, while its mMV@k score is still zero in that case. Higher mMV@k is therefore more important in our application, but we still report mAP@k to quantify the model's ability to rank relevant slides higher. To compare fairly with the best results in the paper[26], we set k = 5 for all experiments except anatomic site retrieval, where k is set to 10. In a few cases, the number of retrieved results can be less than k; we then consider a query correct if the correct retrievals form a majority of the returned results.
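A sketch of both metrics under these definitions (the mAP@k normalization by k is our assumption where the original normalization is not recoverable):

```python
from collections import Counter

def mmv_at_k(gts, retrievals, k=5):
    """Fraction of queries whose top-k majority diagnosis matches the ground truth."""
    correct = 0
    for gt, ret in zip(gts, retrievals):
        majority = Counter(ret[:k]).most_common(1)[0][0]
        correct += int(majority == gt)
    return correct / len(gts)

def map_at_k(gts, retrievals, k=5):
    """Mean of position-wise precision over the top-k retrievals."""
    total = 0.0
    for gt, ret in zip(gts, retrievals):
        hits, ap = 0, 0.0
        for j, diag in enumerate(ret[:k], start=1):
            if diag == gt:
                hits += 1
                ap += hits / j      # precision at position j, counted at matches
        total += ap / k
    return total / len(gts)
```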

Computational Hardware and Software
We stored all whole slide images (WSIs), patches, segmentations, and mosaics across multiple disks with a total size of around 27 TB. Segmentation, patching, mosaic extraction, and WSI search were performed on a CPU (AMD Ryzen Threadripper 3970X, 32 cores). VQ-VAE pretraining and feature extraction were performed on 4 NVIDIA 2080 Ti GPUs. The whole FISH pipeline was written in Python (version 3.7.0) with the following external packages: h5py (2.10.0), matplotlib (3.3.0), numpy (1.19.1), opencv-python (4.3.0.38), pillow (7.2.0), pandas (1.1.0), scikit-learn (0.23.1), seaborn (0.10), scikit-image (0.17.2), torchvision (0.6.0), tensorboard (2.3.0), and tqdm (4.48.0). We used PyTorch (1.5.0) for deep learning. All plots were created with matplotlib (version 3.2.2) and seaborn (version 0.10.1). The internal function in Google Slides was used to plot the pie chart.

WSI dataset
Our slide-level retrieval experiments use three datasets: the diagnostic slides in The Cancer Genome Atlas (TCGA), the Clinical Proteomic Tumor Analysis Consortium (CPTAC), and BWH in-house data.

TCGA diagnostic slides. We downloaded all diagnostic slides from the TCGA website. To compare fairly with Yottixel, we used slides from the same 13 anatomic sites for anatomic site retrieval and the same 29 diagnoses for disease subtype retrieval. The detailed slide and patient numbers are reported in Extended Table 1.

CPTAC diagnostic slides. We downloaded the tumor tissue slides from the official website. There are 503 CPTAC-CCRCC slides from 216 patients, 544 CPTAC-UCEC slides from 240 patients, 679 CPTAC-LUSC slides from 210 patients, 669 CPTAC-LUAD slides from 224 patients, and 283 CPTAC-SKCM slides from 93 patients. All slides are at 20×.

BWH in-house dataset. In this cohort, each whole slide image is from a different patient. For the prostate data used in patch-level retrieval, we collected 23 slides at 20× and annotated regions in each slide as GP3, GP4, GP5, or normal. The detailed slide and patient numbers are reported in Extended Table 2.

WSI processing
Segmentation. We used the automatic segmentation tool in CLAM[9] to generate a segmentation mask for each slide. The tool first applies a binary threshold to a downsampled whole slide image in HSV color space to generate a binary mask, then refines the mask with median blurring and morphological closing to remove artifacts. After obtaining the approximate contours of the tissue, the tool filters tissue contours and cavities by an area threshold.
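A schematic OpenCV version of this masking procedure; the threshold and kernel values are illustrative rather than CLAM's actual defaults:

```python
import cv2
import numpy as np

def tissue_mask(thumb_rgb, sat_thresh=8, area_thresh=1e4):
    """Binary tissue mask from a downsampled slide: HSV saturation threshold,
    median blur, morphological closing, then area-based contour filtering."""
    hsv = cv2.cvtColor(thumb_rgb, cv2.COLOR_RGB2HSV)
    blurred = cv2.medianBlur(hsv[:, :, 1], 7)             # saturation channel
    _, mask = cv2.threshold(blurred, sat_thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
    keep = [c for c in contours if cv2.contourArea(c) > area_thresh]  # drop artifacts
    out = np.zeros_like(mask)
    cv2.drawContours(out, keep, -1, 255, thickness=cv2.FILLED)
    return out
```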
Patching. After segmentation, we cropped the contours into non-overlapping patches at 20×. For 40× whole slides, we first cropped patches and then downsampled them to obtain the equivalent patches at 20×.
Mosaic generation. We followed the mosaic generation process proposed in the paper[27]. The algorithm first applies K-means clustering to the RGB features extracted from each patch. Within each cluster, we run K-means clustering again on the coordinates of each patch, setting the number of clusters to a fixed fraction of the cluster size; if this number falls below 1 in the second stage, we take all coordinates within that cluster. Aside from the number of clusters, we used the default Scikit-learn settings for K-means. To improve mosaic quality, we collected 101 patches each of debris/pen smudges and of tissue, and trained a logistic regression on local binary pattern (LBP) histogram features to remove meaningless regions. We used the default Scikit-learn settings for logistic regression and the rotation-invariant binary pattern from Scikit-image; the number of histogram bins was set to 128.
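A sketch of this debris filter; the LBP parameters P and R are assumptions, since their values are not recoverable here:

```python
import numpy as np
from skimage.feature import local_binary_pattern
from skimage.color import rgb2gray
from sklearn.linear_model import LogisticRegression

def lbp_histogram(patch_rgb, P=8, R=1, bins=128):
    """Rotation-invariant LBP histogram of a patch (P and R assumed)."""
    gray = (rgb2gray(patch_rgb) * 255).astype(np.uint8)
    lbp = local_binary_pattern(gray, P, R, method="ror")  # rotation invariant
    hist, _ = np.histogram(lbp, bins=bins, range=(0, lbp.max() + 1), density=True)
    return hist

# Train on the labeled tissue vs. debris/pen-smudge patches, then keep only
# patches the classifier predicts as tissue:
clf = LogisticRegression()
# clf.fit(np.stack([lbp_histogram(p) for p in patches]), labels)
# keep = clf.predict(np.stack([lbp_histogram(p) for p in candidates])) == 1
```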
Artifact removal. We found that pure-white mosaics can remain in some rare cases after mosaic generation. We removed such a mosaic from the slide if the white region accounted for a large share of its area, applying the binary threshold method in OpenCV with a threshold value of 235 to determine the area of white regions.
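A short sketch of the white-mosaic check with the stated threshold of 235; the area fraction is a placeholder, since the exact percentage is not recoverable:

```python
import cv2

def is_mostly_white(mosaic_rgb, value_thresh=235, area_frac=0.9):
    """Flag a mosaic whose bright (near-white) area exceeds area_frac."""
    gray = cv2.cvtColor(mosaic_rgb, cv2.COLOR_RGB2GRAY)
    _, white = cv2.threshold(gray, value_thresh, 255, cv2.THRESH_BINARY)
    return (white > 0).mean() > area_frac
```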

Data Availability
The TCGA diagnostic whole slide images are available from the TCGA website, and the CPTAC data are available from the NIH Cancer Imaging Archive. The Kather100k data are available from the link provided in the paper[33]. Reasonable requests for in-house BWH whole slide and prostate data may be addressed to the corresponding author.
Code Availability. We implemented all our methods in Python, using PyTorch as the primary package for training the VQ-VAE. All scripts, checkpoints, preprocessed mosaics, and pre-built databases needed to reproduce the experiments in the paper are available at https://github.com/mahmoodlab/FISH. All source code is licensed under GNU GPLv3.
Author Contributions
C.C. and F.M. conceived the study and designed the experiments. C.C. performed the experiments. C.C., M.Y.L., D.F.K.W., T.C., A.J.S., and F.M. analyzed the results. D.W. conducted the reader study. All authors wrote and approved the final paper.
Acknowledgement
This work was supported in part by internal funds from BWH Pathology, the Google Cloud Research Grant, the Nvidia GPU Grant Program, and NIGMS R35GM138216 (F.M.). The content is solely the responsibility of the authors and does not reflect the official views of the National Institutes of Health or the National Institute of General Medical Sciences.

Competing Interests

The authors declare that they have no competing financial interests.

Ethics Oversight

The study was approved by the Mass General Brigham (MGB) IRB office under protocol 2020P000233.

References

  • [1] Snead, D. R. et al. Validation of digital pathology imaging for primary histopathological diagnosis. Histopathology 68, 1063–1072 (2016).
  • [2] Mukhopadhyay, S. et al. Whole slide imaging versus microscopy for primary diagnosis in surgical pathology: a multicenter blinded randomized noninferiority study of 1992 cases (pivotal study). The American journal of surgical pathology 42, 39 (2018).
  • [3] Azam, A. S. et al. Diagnostic concordance and discordance in digital pathology: a systematic review and meta-analysis. Journal of Clinical Pathology (2020).
  • [4] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. nature 521, 436–444 (2015).
  • [5] Esteva, A. et al. A guide to deep learning in healthcare. Nature medicine 25, 24–29 (2019).
  • [6] Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
  • [7] Bera, K., Schalper & Madabhushi, A. Artificial intelligence in digital pathology—new tools for diagnosis and precision oncology. Nature Reviews Clinical Oncology 16, 703–715 (2019).
  • [8] Niazi, M. K. K., Parwani, A. V. & Gurcan, M. N. Digital pathology and artificial intelligence. The Lancet Oncology 20, e253–e261 (2019).
  • [9] Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature Biomedical Engineering 1–16 (2021).
  • [10] Chen, R. J. et al. Pathomic fusion: An integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis. IEEE Transactions on Medical Imaging (2020).
  • [11] Lu, M. Y. et al. Deep learning-based computational pathology predicts origins for cancers of unknown primary. Nature (2021).
  • [12] Komura, D. & Ishikawa, S. Machine learning methods for histopathological image analysis. Computational and structural biotechnology journal 16, 34–42 (2018).
  • [13] Qi, X. et al. Content-based histopathology image retrieval using cometcloud. BMC bioinformatics 15, 1–17 (2014).
  • [14] Zhang, X., Liu, W., Dundar, M., Badve, S. & Zhang, S. Towards large-scale histopathological image analysis: Hashing-based image retrieval. IEEE Transactions on Medical Imaging 34, 496–506 (2014).
  • [15] Sridhar, A., Doyle, S. & Madabhushi, A. Content-based image retrieval of digitized histopathology in boosted spectrally embedded spaces. Journal of pathology informatics 6 (2015).
  • [16] Kwak, J. T., Hewitt, S. M., Kajdacsy-Balla, A. A., Sinha, S. & Bhargava, R. Automated prostate tissue referencing for cancer detection and diagnosis. BMC bioinformatics 17, 1–12 (2016).
  • [17] Sparks, R. & Madabhushi, A. Out-of-sample extrapolation utilizing semi-supervised manifold learning (ose-ssl): content based image retrieval for histopathology images. Scientific reports 6, 1–15 (2016).
  • [18] Jiang, M., Zhang, S., Huang, J., Yang, L. & Metaxas, D. N. Scalable histopathological image analysis via supervised hashing with multiple features. Medical image analysis 34, 3–12 (2016).
  • [19] Shi, X. et al. Supervised graph hashing for histopathology image retrieval and classification. Medical image analysis 42, 117–128 (2017).
  • [20] Komura, D. et al. Luigi: Large-scale histopathological image retrieval system using deep texture representations. biorxiv 345785 (2018).
  • [21] Schaer, R., Otálora, S., Jimenez-del Toro, O., Atzori, M. & Müller, H. Deep learning-based retrieval system for gigapixel histopathology cases and the open access literature. Journal of pathology informatics 10 (2019).
  • [22] Ma, Y. et al. Breast histopathological image retrieval based on latent dirichlet allocation. IEEE journal of biomedical and health informatics 21, 1114–1123 (2016).
  • [23] Zheng, Y. et al. Histopathological whole slide image analysis using context-based cbir. IEEE transactions on medical imaging 37, 1641–1652 (2018).
  • [24] Akakin, H. C. & Gurcan, M. N. Content-based microscopic image retrieval system for multi-image queries. IEEE transactions on information technology in biomedicine 16, 758–769 (2012).
  • [25] Hegde, N. G. et al. Similar image search for histopathology: Smily. Nature Partner Journal (npj) Digital Medicine (2019). URL https://www.nature.com/articles/s41746-019-0131-z.
  • [26] Kalra, S. et al. Pan-cancer diagnostic consensus through searching archival histopathology images using artificial intelligence. NPJ digital medicine 3, 1–15 (2020).
  • [27] Kalra, S. et al. Yottixel–an image search engine for large archives of histopathology whole slide images. Medical Image Analysis 65, 101757 (2020).
  • [28] Hemati, S. et al. Cnn and deep sets for end-to-end whole slide image representation learning (2021).
  • [29] Riasatian, A. et al. Fine-tuning and training of densenet for histopathology image representation using tcga diagnostic slides. Medical Image Analysis 70, 102032 (2021).
  • [30] Oord, A. v. d., Vinyals, O. & Kavukcuoglu, K. Neural discrete representation learning. arXiv preprint arXiv:1711.00937 (2017).
  • [31] Kingma, D. P. & Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
  • [32] Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4700–4708 (2017).
  • [33] Kather, J. N. et al. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS medicine 16, e1002730 (2019).
  • [34] Budd, S., Robinson, E. C. & Kainz, B. A survey on active learning and human-in-the-loop deep learning for medical image analysis. Medical Image Analysis 102062 (2021).
  • [35] Wang, G. et al. Interactive medical image segmentation using deep learning with image-specific fine tuning. IEEE transactions on medical imaging 37, 1562–1573 (2018).
  • [36] Wang, G. et al. Deepigeos: a deep interactive geodesic framework for medical image segmentation. IEEE transactions on pattern analysis and machine intelligence 41, 1559–1572 (2018).
  • [37] Amrehn, M. et al. Ui-net: Interactive artificial neural networks for iterative image segmentation based on a user model. arXiv preprint arXiv:1709.03450 (2017).