Fast and Scalable Image Search for Histology
The expanding adoption of digital pathology has enabled the curation of large repositories of histology whole slide images (WSIs), which contain a wealth of information. Similar pathology image search offers the opportunity to comb through large historical repositories of gigapixel WSIs to identify cases with similar morphological features, and can be particularly useful for diagnosing rare diseases and for identifying similar cases when predicting prognosis, treatment outcomes, and potential clinical trial success. A critical challenge in developing a WSI search and retrieval system is scalability, which is uniquely difficult given the need to search a growing number of slides, each of which can consist of billions of pixels and occupy several gigabytes. Such systems are typically slow, and their retrieval speed often scales with the size of the repository they search through, making their clinical adoption tedious and infeasible for repositories that are constantly growing. Here we present Fast Image Search for Histopathology (FISH), a histology image search pipeline that is infinitely scalable and achieves constant search speed independent of the image database size, while being interpretable and not requiring detailed annotations. FISH uses self-supervised deep learning to encode meaningful representations from WSIs and a van Emde Boas tree for fast search, followed by an uncertainty-based ranking algorithm to retrieve similar WSIs. We evaluated FISH on multiple tasks and datasets with over 22,000 patient cases spanning 56 disease subtypes. We additionally demonstrate that FISH can be used to assist with the diagnosis of rare cancer types where sufficient cases may not be available to train traditional supervised deep models. FISH is available as an easy-to-use, open-source software package (https://github.com/mahmoodlab/FISH).
FISH is a deep learning-based histology image retrieval method that combines a VQ-VAE and a van Emde Boas (vEB) tree to achieve constant-time search with low storage cost, while also supporting patch-level retrieval and human interpretability. To achieve this performance, we represent each slide as a set of integers and binary codes for efficient storage and encode the integers into a vEB tree for fast search. An overview of the FISH pipeline is shown in Figure 1.
FISH begins by distilling a mosaic representation of a given slide. To select the patches used to represent the slide, we use two-stage K-means clustering: we first apply K-means clustering to the RGB features extracted from patches at 5× magnification, followed by K-means clustering on the coordinates of patches at 20× magnification within each initial cluster. We extract the image patches corresponding to the coordinates of the final cluster centers and use them as a mosaic representation of the slide. To convert the mosaics into a set of integers and binary codes (Figure 1b), we pre-train a VQ-VAE, a variant of the Variational Autoencoder that assigns the input a discrete latent code from a codebook, learned on TCGA slides at 20× magnification. We use the encoder of the pretrained VQ-VAE along with the learned codebook to encode the patches at 20× magnification, and extract mosaic texture features using a DenseNet model and a binarization algorithm. The last step converts the discrete latent codes into integers so the mosaics can be stored in the vEB tree: we feed the latent codes of the mosaics into a pipeline composed of a series of average pooling (AvgPool), summation, and shift operations. The intuition behind this pipeline is to summarize the information at each scale via summation and then store it in a different range of digits of an integer.
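The two-stage clustering can be sketched as follows, using a minimal hand-rolled k-means on toy patch data; the feature dimensions, cluster counts, and data are illustrative assumptions, not the paper's settings:

```python
import random

def kmeans(points, k, iters=15, seed=0):
    # Minimal k-means: returns (centers, per-point cluster assignment).
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign

# Toy patches: mean-RGB colour features plus (x, y) slide coordinates.
rng = random.Random(1)
patches = [{"rgb": (rng.random(), rng.random(), rng.random()),
            "xy": (rng.random(), rng.random())} for _ in range(200)]

# Stage 1: cluster patches on colour features.
_, colour_assign = kmeans([p["rgb"] for p in patches], k=4)

# Stage 2: within each colour cluster, cluster on coordinates and keep the
# patch nearest each spatial centre as one mosaic tile.
mosaic = []
for c in range(4):
    idxs = [i for i, a in enumerate(colour_assign) if a == c]
    if len(idxs) < 2:
        mosaic.extend(idxs)
        continue
    centers, _ = kmeans([patches[i]["xy"] for i in idxs], k=2)
    for ctr in centers:
        nearest = min(idxs, key=lambda i: sum((a - b) ** 2
                                              for a, b in zip(patches[i]["xy"], ctr)))
        mosaic.append(nearest)

print(len(mosaic))  # number of mosaic patches selected
```

The selected indices would then be mapped back to patch coordinates and cropped from the slide at the higher magnification.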
During search (Figure 1c), we extract the features of the preprocessed mosaics of the query whole slide image and then apply the proposed Guided Search Algorithm (GSA) to find the most similar results for each mosaic. The design principle of GSA is to find a fixed number of nearest neighbors using the vEB tree and to keep only the neighbors whose hamming distances from the query mosaic are below a certain threshold. Since we only look for a fixed number of neighbors and the vEB tree finds each neighbor in time that depends only on its fixed universe size, the time complexity of FISH search is O(1). The search result for each mosaic is a list of patches, where each patch carries metadata documenting the name of the slide it comes from, the diagnosis of that slide, and the hamming distance between the patch and the query mosaic. Once each mosaic has its search results, our ranking algorithm ranks the candidate mosaics used to retrieve the final top-K similar slides. We collect all slides that appear in the search results of the candidate mosaics and sort them by hamming distance in ascending order to return the top-K similar slides.
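The guided search over integer indices can be sketched with a sorted list and `bisect` standing in for the vEB tree's successor and predecessor operations; the slide names, binary features, and thresholds below are toy assumptions:

```python
import bisect

def hamming(a, b):
    # Hamming distance between two equal-length binary strings.
    return sum(x != y for x, y in zip(a, b))

# Hypothetical database: sorted integer indices (the vEB keys) plus a table
# mapping each index to (slide name, binary texture feature).
keys = [5, 17, 42, 99, 120, 250]
meta = {5: ("slide_A", "0101"), 17: ("slide_B", "0111"), 42: ("slide_C", "1111"),
        99: ("slide_D", "0100"), 120: ("slide_E", "1000"), 250: ("slide_F", "0101")}

def guided_search(query_key, query_feat, n_neighbors=3, threshold=2):
    # Walk outwards from the query key in both directions, keeping only
    # neighbours whose hamming distance stays within the threshold.
    pos = bisect.bisect_left(keys, query_key)
    lo, hi, out = pos - 1, pos, []
    while len(out) < n_neighbors and (lo >= 0 or hi < len(keys)):
        for idx in (hi, lo):
            if 0 <= idx < len(keys):
                name, feat = meta[keys[idx]]
                d = hamming(query_feat, feat)
                if d <= threshold:
                    out.append((d, name))
        lo, hi = lo - 1, hi + 1
    return sorted(out)[:n_neighbors]

print(guided_search(40, "0101"))  # → [(0, 'slide_A'), (1, 'slide_B'), (1, 'slide_D')]
```

A real vEB tree provides the same successor/predecessor walk over a fixed integer universe, which is what makes the per-query cost independent of database size.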
In the next sections, we demonstrate performance in four areas: (1) disease subtype retrieval within a fixed anatomic site in public cohorts (TCGA, CPTAC); (2) disease subtype retrieval within a fixed anatomic site in an independent cohort (BWH in-house data) to test generalizability; (3) anatomic site retrieval; and (4) speed and interpretability. In addition, we report that our system can handle patch-level retrieval, with results on Kather100k and in-house prostate data, even though it is not specifically designed for this task.
Disease subtype retrieval in public cohorts
We report the majority top-5 accuracy (mMV@5) for FISH and Yottixel as our main comparison metric and provide mAP@5 for FISH as a reference. The mMV@5 evaluates how often the majority slide diagnosis in the top 5 results matches the query's diagnosis, while mAP@5 measures how well the model ranks slides with the same diagnosis as the query higher in the retrieval results. We used mMV@5 as the primary metric because it is stricter than the widely used top-5 accuracy: a result is considered correct only when the majority diagnosis among the retrievals agrees with the query. More details can be found in the Online Methods. We built the FISH pipeline on slides from each anatomic site and tested whether FISH can retrieve slides with the correct diagnosis. Overall, FISH outperformed Yottixel (Figure 2a) in terms of macro-averaged mMV@5 within each site and across all sites. We believe macro-averaging is the appropriate measure here, as the uncommon cases in an unbalanced real-world histology database are as crucial as the common ones. For some sites, such as Pulmonary, Gynecological, Urinary, and Hematopoietic, where the data distributions are skewed, FISH outperforms Yottixel on the uncommon diagnoses by a large margin (47.68% improvement on Pulmonary-MESO; 29%, 16.3%, and 16.2% improvement on Gynecological-UCS, Gynecological-CESC, and Gynecological-OV, respectively; 14.2% improvement on Urinary-KICH; and 30.2% improvement on Hematopoietic-DLBC). A detailed comparison is shown in Table 1, and individual retrieval results are available in Supplementary Table 1. In addition, the speed advantage of FISH became especially pronounced once the number of slides in the database exceeded 1,000 (Figure 2b). The median query speed of FISH remains almost constant despite the growing number of slides, which is consistent with our theoretical results.
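The mMV@5 criterion can be sketched as follows; the tie-breaking convention (requiring a unique majority) is an assumption about the exact implementation:

```python
from collections import Counter

def mmv_at_k(query_label, retrieved_labels, k=5):
    # Correct only if the majority diagnosis among the top-k results matches
    # the query diagnosis (stricter than plain top-k accuracy, which would
    # count any single matching result as a hit).
    top = Counter(retrieved_labels[:k]).most_common()
    best_count = top[0][1]
    majority = [lbl for lbl, c in top if c == best_count]
    return query_label in majority and len(majority) == 1

print(mmv_at_k("LUAD", ["LUAD", "LUAD", "LUSC", "LUAD", "MESO"]))  # True
print(mmv_at_k("LUSC", ["LUAD", "LUAD", "LUSC", "LUAD", "MESO"]))  # False
```

Averaging this boolean over all queries of a diagnosis, then macro-averaging over diagnoses, yields the per-site figures reported above.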
We performed further experiments to demonstrate that FISH is scalable to thousands of slides in a later section (Speed and Interpretability). The ranking algorithm plays a crucial role in the success of FISH and applies three post-processing steps to the predictions. We therefore conducted an ablation study to validate these steps and showed that FISH achieves the best performance when all steps are included (red line in Figure 3b). The details of these steps are explained in the Ablation study in our Methods.
To further test the generalization ability of FISH, we combined several diseases (KIRC, UCEC, SKCM, LUAD, and LUSC) from CPTAC with the TCGA data to test performance on a mixed public cohort, with the results reported in Figure 3a. After combination, the distribution of the dataset at all sites became more skewed, but the performance of FISH did not vary substantially in most cases. This result further shows that FISH can handle the dataset imbalance commonly present in the real world. The only exception was Pulmonary-MESO, whose site distribution was highly unbalanced. Individual retrieval results are available in Supplementary Table 2. Note that our VQ-VAE was trained only on TCGA data without seeing any CPTAC slides, which also demonstrates the generalizability of our encoder.
We also created confusion matrices and hamming distance matrices (Figure 2c-d) to gain more insight. Each hamming distance matrix reports the average pairwise hamming distance between diagnoses within a given site, which helps explain the trends behind the confusion matrix. More details about how we calculate these matrices are described in the Methods. Examining the confusion matrices, we can see a dark diagonal line, which indicates that the majority of results FISH retrieves match the queried diagnosis. The hamming distance matrices further explain the trends in the confusion matrices: the dark diagonal line shows the smallest hamming distance values at all sites, demonstrating that slides with different diagnoses are pushed far apart in hamming distance space. We can also use the matrices to explain why FISH performed worse on certain diseases. For example, a query slide with diagnosis Liver-CHOL was more often confused with Liver-LIHC than with Liver-PAAD, which can be explained by the fact that the distance between Liver-CHOL and Liver-LIHC is smaller than that to Liver-PAAD. Another example is the gastrointestinal site, where the diagonal values of the distance matrix are generally higher than at other sites, which explains why FISH performed worse there. Similar logic applies to other sites and diseases.
Adapting FISH to independent BWH in-house whole slide images
There are many variations in whole slide images across institutions due to differences in slide preparation and digitization protocols. It is therefore essential to validate that FISH trained on TCGA is robust on in-house data. We collected 8,035 diagnostic slides spanning 9 anatomic sites and 37 primary cancer subtypes from the WSI database at Brigham and Women's Hospital. For each anatomic site, we built our pipeline separately and used mMV@5 as the main evaluation metric, providing mAP@5 for reference. FISH performed consistently in terms of average mMV@5 across all sites (Figure 4a). It was especially successful in Urinary, Thyroid, Cutaneous, Liver/Biliary, and Gynecology, as shown in Figure 5, where the diagonal lines in both the confusion and hamming distance matrices are relatively clear. We report the detailed results in Table 2, and individual retrieval results are available in Supplementary Table 3. Note that we did not fine-tune our encoder on this cohort, which shows the generalizability of an encoder trained only on TCGA.
Rare disease retrieval
The number of slides for rare diseases is usually much smaller than for common ones, which makes it challenging for modern machine learning methods to train an effective classifier, and the situation is worse in low-resource settings. To further investigate the clinical value of FISH, we conducted an experiment specifically on rare cancer types by combining the BWH cohorts with TCGA, resulting in 1,785 slides covering 23 rare cancer types from 7 sites. FISH achieved mMV@5 performance (Figure 4b) comparable to that on the general cohort in the previous experiment. The detailed results are reported in Table 3, and individual retrieval results are available in Supplementary Table 4. This is an encouraging result, as it suggests that if we create a whole slide database dedicated to rare diseases, FISH can attain even better performance. To the best of our knowledge, this is the first study that evaluates a whole slide search engine on rare diseases.
| Site | Diagnosis | Slides | mMV@5 (%) | mAP@5 (%) |
| --- | --- | --- | --- | --- |
| | Low-Grade Glioma, NOS | 62 | 27.42 | 46.61 |
| Thyroid | Papillary Thyroid Cancer | 316 | 97.47 | 97.42 |
| | Medullary Thyroid Cancer | 202 | 76.73 | 80.26 |
| | Follicular Thyroid Cancer | 150 | 65.33 | 69.57 |
| | Anaplastic Thyroid Cancer | 114 | 74.56 | 78.59 |
| | Hurthle Cell Thyroid Cancer | 56 | 66.07 | 68.20 |
| | Esophageal Squamous Cell Carcinoma | 41 | 24.39 | 26.54 |
| | Anal Squamous Cell Carcinoma | 39 | 17.95 | 36.06 |
| Gynecological | Uterine Endometrioid Carcinoma | 480 | 78.96 | 82.35 |
| | High-Grade Serous Ovarian Cancer | 242 | 64.88 | 64.71 |
| | Uterine Papillary Serous Carcinoma | 157 | 29.94 | 44.16 |
| | Endometrioid Ovarian Cancer | 64 | 21.88 | 35.70 |
| | Clear Cell Ovarian Cancer | 48 | 60.42 | 58.86 |
| | Merkel Cell Carcinoma | 75 | 77.33 | 82.04 |
| | Cutaneous Squamous Cell Carcinoma | 38 | 47.37 | 59.22 |
| | Lung Squamous Cell Carcinoma | 392 | 54.59 | 60.96 |
| | Small Cell Lung Cancer | 28 | 28.57 | 42.75 |
| Urinary | Bladder Urothelial Carcinoma | 406 | 88.42 | 90.51 |
| | Kidney Renal Clear Cell Carcinoma | 271 | 87.08 | 89.04 |
| | Kidney Renal Papillary Cell Carcinoma | 96 | 59.38 | 62.27 |
| | Upper Tract Urothelial Carcinoma | 47 | 55.32 | 61.73 |
| Breast | Breast Invasive Ductal Carcinoma | 859 | 97.44 | 98.29 |
| | Breast Invasive Lobular Carcinoma | 290 | 49.66 | 50.55 |
| Site | Diagnosis | Slides | mMV@5 (%) | mAP@5 (%) |
| --- | --- | --- | --- | --- |
| Thyroid | Medullary Thyroid Cancer | 202 | 80.20 | 83.21 |
| | Follicular Thyroid Cancer | 150 | 72.67 | 78.86 |
| | Anaplastic Thyroid Cancer | 114 | 77.19 | 82.66 |
| | Hurthle Cell Thyroid Cancer | 56 | 75.00 | 79.91 |
| Gastrointestinal | Esophageal Squamous Cell Carcinoma | 41 | 53.66 | 61.59 |
| | Anal Squamous Cell Carcinoma | 39 | 66.67 | 69.47 |
| Gynecological | Uterine Papillary Serous Carcinoma | 157 | 93.63 | 95.47 |
| | Endometrioid Ovarian Cancer | 64 | 43.75 | 48.24 |
| | Clear Cell Ovarian Cancer | 48 | 58.33 | 63.93 |
| Liver Pancreaticobiliary | Pancreatic Adenocarcinoma | 209 | 94.26 | 95.06 |
| | Pancreatic Neuroendocrine Tumor | 77 | 76.62 | 82.47 |
| | Small Cell Lung Cancer | 28 | 57.14 | 65.60 |
| | Upper Tract Urothelial Carcinoma | 47 | 85.11 | 85.53 |
Anatomic site retrieval
Although the anatomic site of resected tissue is usually known nowadays, site information can still be missing in older whole slide image databases, and a search engine that returns slides from the same site is useful for archiving such databases. We used the diagnostic slides from TCGA and followed prior work to group slides into 13 categories, resulting in 11,561 whole slide images. We built the FISH pipeline on this database with the goal of retrieving slides with the same anatomic site as the query. FISH achieved a slightly better average mMV@10 than Yottixel (Figure 6a). We compared mMV@10 in this experiment because it was the best-performing setting reported in the Yottixel paper. It is important to note that although the accuracy gap between the two methods is small, FISH is over 15× faster than Yottixel, as shown in the rightmost box plot in Figure 6b. The detailed speed comparison between FISH and Yottixel can be found in Speed and Interpretability, and individual retrieval results are available in Supplementary Table 5.
Analysis of speed and interpretability
Speed and interpretability, in addition to accuracy, are essential considerations for whole slide image search engines. Fast search makes a search engine usable on the large databases of the digital pathology era, and interpretability makes the system easier to debug and more robust to unexpected errors. In this section, we demonstrate that FISH has these desired properties.
We show how FISH interprets the results of a query slide in Figure 7a. For a query slide, FISH returns the regions in the slide that are useful for defining the similarity of cancer type. This allows us to examine these regions and ensure that the search system bases its results on evidence a pathologist would agree with, rather than on meaningless regions such as debris. More examples are shown in Extended Data Figures 1-4. We conducted three interpretation studies using TCGA-KIRC, TCGA-OV, and TCGA-STAD, respectively, to understand FISH's interpretability across different levels of performance (in terms of differences in mMV@5 scores). For each study, we randomly selected 30 queries that contained at least 1 correct retrieval in the results and then extracted the ROIs found in the query slide. We asked a pathologist to rate whether the ROIs agree with their judgement as "agree", "partially agree" (i.e., the pathologist agrees with at least one of the ROIs), or "disagree". For example, in the TCGA-KIRC study, the prompt was whether the ROIs contain features of KIRC. The results are shown in Figure 7b. The key finding was that the combined rate of agree and partially agree responses was high in all studies.
We used the same TCGA data as in the anatomic site retrieval experiment to evaluate query speed. We applied weighted sampling to select slides from each site to create databases of size 500, 1,000, 2,000, 3,000, 4,000, 5,000, 7,000, and 9,000, together with the original dataset of 11,561 slides. We implemented both methods in Python and evaluated them on the same machine for a fair comparison. The average query speeds of both methods are reported in Figure 6b. Since we observed that Yottixel became inefficient beyond 3,000 slides, we used the same 100 queries sampled from the databases, instead of all data, to calculate the average query speed of FISH and Yottixel once the database size exceeded 3,000. In contrast, the average query speed of FISH remained almost constant with low variance throughout the experiments, which agrees with our theoretical results. This result is highly encouraging, as it demonstrates that FISH can scale with the growing number of slides in the digital pathology era while maintaining a relatively constant query speed.
Patch level retrieval
We show that FISH can also perform patch-level retrieval with O(1) query speed, although it was not designed for this task. Here, each query patch is treated as a single mosaic fed into the FISH search pipeline. Since there is only one mosaic, the ranking module is unnecessary; we obtain the top-K results by directly sorting the predictions by their hamming distance. We evaluated FISH on Kather100k without color normalization (NCT-CRC-HE-100K-NONORM) and BWH in-house prostate data.
Kather100k contains nine tissue types, and the in-house prostate data contain four annotation classes (normal tissue, Gleason score 3, Gleason score 4, and Gleason score 5) cropped from slides at 20× magnification. We resized all patches to a common input size before feeding them into our pipeline. FISH achieved strong macro-average mMV@5 on both Kather100k (Figure 8a) and the in-house prostate data (Figure 8b). The individual retrieval results are available in Supplementary Table 6 and Supplementary Table 7 for Kather100k and in-house prostate data, respectively, and more example results can be found in Extended Data Figures 5-6. We also conducted a speed test on Kather100k. To efficiently curate a larger dataset for testing speed, we applied data augmentation by adding noise directly to the latent code of each patch from the VQ-VAE encoder instead of to the raw image data: for each latent code, we added a binary noise array whose elements are set with fixed complementary probabilities, so that all augmented data share the same texture feature as the original. We curated the datasets kather1M and kather10M by augmenting each patch 10 and 100 times, respectively, and used the 100k patches in the original data as queries to test query speed. We observed that the median query speed of FISH ranges from 0.15 to 0.25 s and remains unaffected as the database grows to 10M patches (Figure 8c-d). Note that the work most closely related to our study reports 25 s per query on 10M-patch data.
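The latent-code augmentation can be sketched as follows; the flip probability, the bump-by-one scheme, and the clipping range are illustrative assumptions rather than the paper's exact settings:

```python
import random

def augment_latent(code, p_flip=0.05, codebook_size=128, seed=None):
    # Augment a discrete latent code by adding binary noise: each position is
    # bumped by 1 with probability p_flip and left unchanged otherwise,
    # clipped to the valid codeword range. Because the associated binary
    # texture feature is stored separately, it is unaffected by this noise.
    rng = random.Random(seed)
    return [min(codebook_size - 1, c + 1) if rng.random() < p_flip else c
            for c in code]

original = [3, 77, 12, 127, 45] * 10  # toy 50-symbol latent code
augmented = [augment_latent(original, seed=s) for s in range(10)]
print(len(augmented), len(augmented[0]))
```

Generating 10 or 100 noisy copies per patch in this way scales a 100k-patch set to 1M or 10M entries without touching any image data.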
In summary, we show that FISH addresses several key challenges in whole slide image search: speed, accuracy, and scalability. Our experiments demonstrate that FISH is an interpretable histology image search pipeline that achieves constant-speed search after training with only slide-level labels. This constant search speed and lack of reliance on pixel-level annotations will only become more important as institutions' WSI repositories grow to hundreds of thousands or millions of slides. We also showed that FISH performs strongly on the unbalanced datasets commonly seen in real-world histopathology, generalizes to independent test cohorts and rare diseases, and can even be used as a search engine for patch retrieval.
To the best of our knowledge, our study presents the first search pipeline that is evaluated on the most diverse and largest dataset of diagnostic slides to date, while also reporting speed performance, an essential metric for a histology search engine. We are also the first to evaluate the whole slide image search engine on rare type cancers. Additionally, FISH is the first search pipeline that can provide interpretable results for interrogation by pathologists.
Although the combination of a VQ-VAE and a vEB tree is key to the success of our method, this approach is limited by the expressiveness of the integer indices created in this way. The accuracy of the method could be increased by lengthening these indices, but at the cost of increasing the size of the metadata needed for searching and decreasing the search speed itself, as the vEB tree would need to visit more neighbors before finding the optimal candidates to return. One line of future work is to design a better indexing system whose distances are more semantically relevant, to expedite the vEB tree search, though one can also imagine a system in which institutions tweak the index for more speed or greater accuracy depending on their needs. In addition, due to limited access to large annotated patch datasets, the performance of FISH on large-scale patch-level retrieval has not yet been fully investigated. Evaluating FISH on millions or even billions of annotated patches is therefore also a promising future direction.
Human-in-the-loop computing has been identified as a potential way to bring deep learning-based applications for medical images closer to the clinic. Allowing end-users to give feedback and then using that feedback to iteratively refine the system can allow algorithms to better generalize to unseen data. Many deep learning-based medical image segmentation models have utilized this concept[35, 36, 37], but it is not commonly used in histology image search systems. In our study, we have shown that FISH can return interpretable semantic descriptors for both query and result slides, making it feasible to build a feedback loop into FISH whereby pathologists could agree or disagree with semantic descriptors to refine or expand the search without any additional training or fine-tuning. This may be especially useful in complex settings such as rare disease retrieval where finding additional data to improve search results may be impossible. By providing researchers and pathologists with a novel and efficient way of searching, sharing, and accessing knowledge and by leveraging human-in-the-loop computing, FISH shows promise in the seamless integration into the digital pathology workflow and demonstrates its potential role in medical education, research, and even the clinical setting.
FISH is a histology image search pipeline that addresses the scalability issues of speed, storage, and pixel-wise label scarcity. It builds upon a set of mosaics preprocessed from whole slide images without pixel-wise labels, saving storage and labelling cost, and achieves O(1) search speed by leveraging the discrete latent codes of a VQ-VAE, the Guided Search Algorithm, and the Ranking Algorithm. We present these essential components of FISH in this section.
Discrete Latent Code of VQ-VAE. The VQ-VAE is a variant of the VAE that introduces a training objective allowing a discrete latent code. Let $e \in \mathbb{R}^{K \times D}$ be the latent space (i.e., codebook), where $K$ is the number of discrete codewords and $D$ is the dimension of each codeword; we use a codebook of size $K = 128$ in our experiments. To decide the codeword for a given input $x$, an encoder encodes $x$ as $z_e(x)$. The final codeword $z_q(x)$ and the training objective are given by

$$z_q(x) = e_k, \quad k = \operatorname{arg\,min}_j \lVert z_e(x) - e_j \rVert_2,$$

$$L = \log p(x \mid z_q(x)) + \lVert \operatorname{sg}[z_e(x)] - e \rVert_2^2 + \beta \lVert z_e(x) - \operatorname{sg}[e] \rVert_2^2,$$

where $\beta$ is a hyperparameter and $\operatorname{sg}$ denotes the stop-gradient operation, which acts as the identity function in the forward pass while having zero gradient in the backward pass. The first term in the objective optimizes the encoder and decoder for good reconstruction, the second term updates the codebook, and the third term prevents the encoder's output from drifting too far from the latent space. The detailed architecture of our VQ-VAE is shown in Extended Data Figure 7. We reordered the codebook by the value of the first principal component and changed the latent codes accordingly, as we found that the reordered codebook provides a more semantic view of the original input image (Extended Data Figure 8).
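The vector-quantization step (the nearest-codeword assignment) can be sketched as follows; the toy codebook and inputs are illustrative, and the straight-through gradient trick needed for actual training is omitted:

```python
def quantize(z_e, codebook):
    # Assign each encoder output vector to its nearest codeword under L2
    # distance, returning the discrete indices and the quantized vectors.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    idxs = [min(range(len(codebook)), key=lambda k: dist2(v, codebook[k]))
            for v in z_e]
    return idxs, [codebook[k] for k in idxs]

# Toy codebook with K = 4 codewords of dimension 2 (the paper uses K = 128).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
z_e = [(0.1, -0.2), (0.9, 1.1), (0.4, 0.9)]
idxs, z_q = quantize(z_e, codebook)
print(idxs)  # [0, 3, 2]
```

The list of indices is the discrete latent code that FISH later converts into a single integer index per mosaic.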
Feature Extraction, Index Generation and Index Encoding. Each mosaic is represented by a tuple composed of a mosaic index and a mosaic texture feature. To obtain the index, we encode and re-map the latent code using the encoder and the reordered codebook of the VQ-VAE; the index is then computed from the latent code by the average-pooling, summation, and shift pipeline described above.
We insert each index into the vEB tree for fast search. Note that all operations on a vEB tree take $O(\log \log u)$ time, where $u$ is the universe size, chosen as the minimum integer large enough to hold every possible index. Since codeword values range from 0 to 127, the maximum possible summation at each level, and hence the largest index, is bounded, which determines the minimum $u$. Because $u$ is a constant that depends only on the index-generation pipeline, the search performance is $O(1)$. To obtain the texture feature, we use a DenseNet to extract a feature from each patch at 20× magnification and then binarize it following a previously proposed algorithm.
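As a loose illustration of the index-generation idea, the sketch below pools a toy latent grid into a pyramid, sums each level, and packs the sums into disjoint digit ranges of one integer; the number of levels, digit widths, and rounding are assumptions rather than the paper's exact pipeline:

```python
def avg_pool_2x2(grid):
    # 2x2 average pooling on a square grid (side length assumed even).
    n = len(grid)
    return [[(grid[i][j] + grid[i][j + 1] + grid[i + 1][j] + grid[i + 1][j + 1]) / 4
             for j in range(0, n, 2)] for i in range(0, n, 2)]

def latent_to_index(grid, digits_per_level=6):
    # Summarise each pyramid level by its (rounded) sum and store each sum in
    # a disjoint digit range of a single integer via decimal shifts.
    index, shift = 0, 0
    while True:
        level_sum = int(round(sum(sum(row) for row in grid)))
        index += level_sum * 10 ** shift
        shift += digits_per_level
        if len(grid) == 1:
            break
        grid = avg_pool_2x2(grid)
    return index

# Toy 4x4 "latent code" of codeword ids in [0, 127].
z = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(latent_to_index(z))
```

Because codeword values are bounded, each level's sum fits in a fixed number of digits, which is what bounds the universe size of the resulting integers.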
In addition to creating the tuple representing each mosaic, we also build a hash table with the mosaic index as key and the mosaic's metadata as value. The metadata includes the texture feature, the name of the slide associated with the mosaic, the coordinates in the slide where the mosaic was cropped, the slide file format, and the diagnosis of the slide. Note that different mosaics can share the same key; in this case, the value is a list storing all of their metadata.
Guided Search Algorithm. Given a query slide represented as a set of mosaics, where each tuple is composed of the index of a mosaic and its texture feature, we apply Guided-Search to each tuple and return the corresponding results. Each result is a set of tuples consisting of the indices of the similar mosaics in the database and the associated information: the hamming distance between each retrieved mosaic and the query mosaic, along with metadata including the diagnosis and site of the slide the retrieved mosaic comes from, the position where it was cropped, and the slide file format.
The drawback of querying with only the exact index is that the integer index is sensitive to minor changes in the latent code: a mosaic that differs from another in only a few codewords can receive a very different index, placing the two mosaics far apart in the vEB tree. To address this issue, we create a set of candidate indices around the original index by repeatedly adding and subtracting an integer offset. We call the helper functions Forward-Search and Backward-Search to search the neighbor indices above and below the query index, respectively. Both functions only include neighbor indices whose hamming distance from the query is smaller than a threshold. The details of these algorithms are shown in Algorithms 1-3.
Results Ranking Algorithm. Our ranking function Ranking (Algorithm 4) takes the results from Guided-Search as input. The output is the top-5 similar slides for the query slide. The intuition of Ranking is to find the most promising mosaics based on their uncertainty. It relies on three helper functions: Uncertainty-Cal (Algorithm 5), Clean (Algorithm 6), and Filtered-By-Prediction (Algorithm 7).
Uncertainty-Cal (Algorithm 5) takes the search results as input and calculates the uncertainty of each mosaic's result list via entropy: the lower the entropy, the less uncertain the mosaic, and vice versa. The output is the entropy of each result list along with records that summarize the diagnosis occurrences and hamming distances of its elements. The disadvantage of counting occurrences naively in the entropy calculation is that the most frequent diagnosis in the anatomic site dominates the result and downplays the importance of the others. We introduce a weighted-occurrence approach to address this issue: it counts diagnosis occurrences by considering both the prevalence of the diagnosis in the given site and the position of the diagnosis in the retrieval results. The weight of each diagnosis in an anatomic site is the reciprocal of its number of slides, and the weights are normalized so that they sum to a constant. A diagnosis's final occurrence is the product of its normalized weight and the inverse of the position at which it appears in the result list, so the same diagnosis can receive different weighted occurrences depending on its position. By doing so, less frequent diagnoses and those with lower hamming distance (i.e., diagnoses close to the front of the retrieval results) gain more importance in the ranking process. After this stage, we summarize the results with three metadata structures to facilitate the subsequent processes:
(1) A nested hash table that stores the index of each query mosaic as key and its weighted diagnosis occurrence table as value.
(2) An array of tuples composed of the mosaic index, the entropy, the hamming distances of all retrieved mosaics, and the total number of retrieved mosaics for each result.
(3) An array that stores the total number of retrieved mosaics for each result.
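The weighted-occurrence entropy can be sketched as follows; the constant C, the reciprocal weighting, and the rank-based discount follow the description above, but the exact formulas and values are assumptions:

```python
import math
from collections import Counter

def weighted_entropy(retrieved, site_counts, C=10.0):
    # Entropy of the weighted diagnosis occurrences for one mosaic's
    # retrieval list. Rarer diagnoses in the site get larger weights
    # (reciprocal of their slide count, normalised to sum to C), and a hit
    # earlier in the list counts more (inverse of its 1-based rank).
    raw = {d: 1.0 / n for d, n in site_counts.items()}
    norm = C / sum(raw.values())
    weights = {d: w * norm for d, w in raw.items()}
    occ = Counter()
    for rank, diagnosis in enumerate(retrieved, start=1):
        occ[diagnosis] += weights[diagnosis] / rank
    total = sum(occ.values())
    probs = [v / total for v in occ.values()]
    return -sum(p * math.log(p) for p in probs), dict(occ)

site_counts = {"KIRC": 500, "KIRP": 100, "KICH": 50}   # toy site composition
ent_mixed, _ = weighted_entropy(["KIRC", "KIRP", "KIRC", "KICH"], site_counts)
ent_pure, _ = weighted_entropy(["KICH", "KICH", "KICH", "KICH"], site_counts)
print(ent_pure < ent_mixed)  # a pure retrieval list is less uncertain
```

Mosaics whose retrieval lists concentrate on one diagnosis get near-zero entropy and are treated as the most reliable voters in the ranking.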
Clean (Algorithm 6) aims to remove outliers and the mosaics that are less similar to the query. It takes the mosaic summaries from the previous stage as input, removing results whose length falls below the lower or above the upper quantile. In addition, we take the average of the mean hamming distance over the top-5 mosaics of each result as a threshold, using it to filter out results whose mean top-5 hamming distance exceeds it. After cleaning the results, we sort them by the uncertainty calculated from Uncertainty-Cal in ascending order.
We could simply return the slides at the front of the list sorted by uncertainty. However, the low uncertainty of the first several results could be caused by the domination of the most frequent diagnosis in the given anatomic site. For example, the most frequent diagnoses of the top-5 entries in the urinary site could be KIRC, BLCA, KIRP, KIRP, and KIRP. In this case, the query slide is better diagnosed as KIRP based on the majority vote, so the first and second entries, which reflect the diagnoses that dominate urinary-site cases, should not be considered during retrieval. We leverage Filtered-By-Prediction (Algorithm 7) to mitigate this issue. The function takes the summation of the diagnosis occurrences from the top-5 most certain results, uses the diagnosis with the maximum score as a pseudo ground-truth diagnosis, and then removes results whose maximum-occurrence diagnosis disagrees with the pseudo ground truth.
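A minimal sketch of this pseudo-label filtering step, assuming a plain majority vote over the top-5 mosaics' leading diagnoses (the paper sums weighted occurrence scores instead):

```python
from collections import Counter

def filter_by_prediction(mosaic_top_diagnoses):
    # mosaic_top_diagnoses: the maximum-occurrence diagnosis of each of the
    # top-5 most certain mosaics. The majority acts as a pseudo ground
    # truth; mosaics disagreeing with it are dropped before final ranking.
    pseudo = Counter(mosaic_top_diagnoses).most_common(1)[0][0]
    keep = [i for i, d in enumerate(mosaic_top_diagnoses) if d == pseudo]
    return pseudo, keep

pseudo, keep = filter_by_prediction(["KIRC", "BLCA", "KIRP", "KIRP", "KIRP"])
print(pseudo, keep)  # KIRP [2, 3, 4]
```

In the worked example above, the KIRC and BLCA mosaics are discarded even though they had low entropy, because they disagree with the KIRP pseudo label.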
To return the final results for a slide query, we take the slide name and its diagnosis pointed to by each surviving mosaic one by one. If the uncertainty of a mosaic is zero, we take all of its results; otherwise, we again use the hamming distance threshold to ignore results whose distance exceeds it. We sort the results first by uncertainty in ascending order and then, when the uncertainty is tied, by hamming distance in descending order.
Training details of VQ-VAE. We used a sampled version of the TCGA slide data from the first experiment (i.e., disease subtype retrieval in TCGA) to train our VQ-VAE. For each slide, we sampled ten 1024×1024 patches at 20× magnification. The model was trained with the Adam optimizer without weight decay and with AMSGrad enabled; all other Adam hyperparameters were left at their defaults. The loss weighting coefficient in the VQ-VAE objective was also set to 1.
Ablation Study. We conducted an ablation study on our ranking module to test the benefit of each function. Specifically, we compared the performance of the following four settings: (1) Naive: removing Clean and Filtered-By-Prediction and treating each diagnosis occurrence in the mosaic retrieval results equally (i.e., replacing the assignment in line 4 of Algorithm 5 with 1); (2) Weighted count: applying only Uncertainty-Cal to the ranking module; (3) Clean: applying Uncertainty-Cal and Clean to the ranking module; (4) Filter: applying all functions to the ranking module.
Visualization. We build a confusion matrix for each site, using each slide's ground-truth diagnosis along the x-axis and its predicted diagnosis along the y-axis. For the hamming distance matrix, we inspect the hamming distance between the query slide and each retrieved result one by one, adding the hamming distance to the associated diagnosis label and infinity to the others. Infinity here is defined as the hamming distance threshold plus 1, since the threshold is the maximum distance possible in our pipeline. The final hamming distance matrix is obtained by dividing by the total number of slides in the given anatomic site.
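The hamming distance matrix construction can be sketched as follows (a minimal illustration; the input layout and function name are assumptions):

```python
import numpy as np

def hamming_distance_matrix(queries, labels, threshold):
    """queries: list of (true_label, results) pairs, where results is a list
    of (retrieved_label, hamming_distance) pairs for one query slide.
    labels: ordered list of all diagnosis labels in the anatomic site."""
    idx = {d: i for i, d in enumerate(labels)}
    inf = threshold + 1  # "infinity": one past the maximum possible distance
    mat = np.zeros((len(labels), len(labels)))
    for true_label, results in queries:
        for retrieved_label, dist in results:
            row = np.full(len(labels), float(inf))
            row[idx[retrieved_label]] = dist  # real distance for the matched diagnosis
            mat[idx[true_label]] += row
    # Normalize by the total number of query slides in the site.
    return mat / max(len(queries), 1)
```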
For all experiments, we remove slides with the same patient ID as the query slide from the database (i.e., leave-one-patient-out evaluation). We report the mean majority vote accuracy over the top-k retrievals (mMV@k) instead of top-k accuracy over all instances in the data, as this metric is more suitable for the medical domain. We also use mean average precision at k (mAP@k) to further evaluate retrieval performance. Specifically,
$$\mathrm{mMV}@k=\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\hat{y}_i=y_i\right],\qquad \mathrm{mAP}@k=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n_i}\sum_{j=1}^{k}P_i(j)\,\mathbb{1}\left[y_{i,j}=y_i\right],$$
where $N$ is the number of slides, $y_i$ is the ground-truth diagnosis of slide $i$, and $\hat{y}_i$ is the predicted diagnosis of slide $i$, taken as the majority vote of its top-$k$ retrievals. $\mathbb{1}[\cdot]$ is an indicator function that outputs 1 if its two inputs are the same and 0 otherwise, $y_{i,j}$ is the diagnosis of the $j$-th retrieved slide, and $P_i(j)$ is the precision of the top $j$ retrievals for slide $i$. We use $n_i$ to denote the number of times the retrieved diagnosis matches the ground truth among the top-$k$ retrievals of slide $i$ (queries with $n_i=0$ contribute zero). Note that mAP@k is a more lenient metric than mMV@k, as a model can obtain a nonzero mAP@k by placing only one relevant slide in first place among the top-$k$ retrievals, while the mMV@k score is still zero in this case. Therefore, a higher mMV@k is more important in our application, but we still report mAP@k to quantify the model's ability to rank relevant slides higher. To compare fairly with the best results in the Yottixel paper, we use the same $k$ for all our experiments, except anatomic site retrieval, where $k$ is set to 10. In a few cases, the number of retrieval results can be less than $k$; we then consider a query correct if the correct retrievals still constitute a majority.
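Both metrics can be computed in a few lines; this sketch follows the definitions above, with the relevance-count normalization for mAP@k being our reading of the text:

```python
from collections import Counter

def mmv_at_k(true_labels, retrievals, k=5):
    """Mean majority-vote accuracy: fraction of queries whose top-k
    retrievals' majority diagnosis matches the ground truth."""
    correct = 0
    for y, ranked in zip(true_labels, retrievals):
        majority = Counter(ranked[:k]).most_common(1)[0][0]
        correct += (majority == y)
    return correct / len(true_labels)

def map_at_k(true_labels, retrievals, k=5):
    """Mean average precision at k, normalized by the number of relevant
    slides found among the top-k retrievals (zero if none are found)."""
    ap_sum = 0.0
    for y, ranked in zip(true_labels, retrievals):
        hits, ap = 0, 0.0
        for j, label in enumerate(ranked[:k], start=1):
            if label == y:
                hits += 1
                ap += hits / j  # precision at position j, counted at relevant positions
        ap_sum += ap / hits if hits else 0.0
    return ap_sum / len(true_labels)
```

For a query with one relevant slide in first place and nothing else relevant, mAP@k credits the query fully while mMV@k can still score it zero, matching the leniency discussed above.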
Computational Hardware and Software
We stored all whole slide images (WSIs), patches, segmentations, and mosaics across multiple disks with a total size of around 27 TB. Segmentation, patching, mosaic extraction, and search of WSIs were performed on CPU (AMD Ryzen Threadripper 3970X, 32 cores). VQ-VAE pretraining and feature extraction were performed on 4 NVIDIA 2080 Ti GPUs. The whole FISH pipeline was written in Python (version 3.7.0) with the following external packages: h5py (2.10.0), matplotlib (3.3.0), numpy (1.19.1), opencv-python, pillow (7.2.0), pandas (1.1.0), scikit-learn (0.23.1), seaborn (0.10), scikit-image (0.17.2), torchvision (0.6.0), tensorboard (2.3.0), and tqdm (4.48.0). We used PyTorch (1.5.0) for deep learning. All plots were created with matplotlib (version 3.2.2) and seaborn (version 0.10.1). The internal chart function in Google Slides was used to plot the pie chart.
There are three datasets in our slide-level retrieval experiments: the diagnostic slides in The Cancer Genome Atlas (TCGA), the Clinical Proteomic Tumor Analysis Consortium (CPTAC), and the BWH in-house data.
TCGA diagnostic slides. We downloaded all diagnostic slides from the TCGA website. To compare fairly with Yottixel, we used slides from the same 13 anatomic sites for anatomic site retrieval and the same 29 diagnoses for disease subtype retrieval. The detailed slide and patient numbers are reported in Extended Table 1.
CPTAC diagnostic slides. We downloaded the tumor tissue slides from the official website. There are 503 CPTAC-CCRC slides from 216 patients, 544 CPTAC-UCEC slides from 240 patients, 679 CPTAC-LUSC slides from 210 patients, 669 CPTAC-LUAD slides from 224 patients, and 283 CPTAC-SKCM slides from 93 patients. All slides are at 20× magnification.
BWH in-house dataset. In this cohort, each whole slide image is from a different patient. For the prostate data used in patch-level retrieval, we collected 23 slides at 20× magnification and annotated regions in each slide as Gleason pattern 3 (GP3), GP4, GP5, or normal. The detailed slide and patient numbers are reported in Extended Table 2.
Segmentation. We used the automatic segmentation tool in CLAM to generate the segmentation mask for each slide. The tool first applies a binary threshold to a downsampled whole slide image in the HSV color space to generate a binary mask, then refines the mask by median blurring and morphological closing to remove artifacts. After obtaining the approximate contours of the tissue, the tool filters out tissue contours and cavities based on area thresholds.
Patching. After segmentation, we cropped the tissue contours into non-overlapping patches at 20× magnification. For slides scanned at 40×, we first cropped patches at twice the target size and then downsampled them to obtain equivalent patches at 20×.
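The 40×-to-20× conversion amounts to a 2× spatial downsampling of each crop; a minimal sketch with block averaging (patch sizes here are illustrative, and block averaging stands in for whatever resampling filter the pipeline actually uses):

```python
import numpy as np

def downsample_2x(patch):
    """Downsample an HxWxC patch by a factor of 2 via 2x2 block averaging,
    e.g. a 2048x2048 crop from a 40x slide -> a 1024x1024 patch at 20x."""
    h, w, c = patch.shape
    assert h % 2 == 0 and w % 2 == 0
    # Group pixels into 2x2 blocks and average each block.
    blocks = patch.reshape(h // 2, 2, w // 2, 2, c).astype(np.float64)
    return blocks.mean(axis=(1, 3)).astype(patch.dtype)
```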
Mosaic generation. We followed the mosaic generation process proposed in the Yottixel paper. The algorithm first applies k-means clustering to the RGB features extracted from each patch. Within each cluster, we run k-means clustering again on the coordinates of the patches, setting the number of clusters to a fixed percentage of the cluster size. If the resulting number of clusters is less than 1 in the second stage, we take all coordinates within that cluster. Except for the number of clusters, we used the default Scikit-learn settings for k-means clustering. To obtain better-quality mosaics, we collected 101 patches each of debris/pen smudges and of tissue to train a logistic regression on local binary pattern histogram features, used to remove non-meaningful regions. We used the default settings from the Scikit-learn package for logistic regression and the rotation-invariant local binary pattern from the Scikit-image package. The number of histogram bins was set to 128.
The TCGA diagnostic whole slide images are available from the TCGA website, and the CPTAC data are available from the NIH Cancer Imaging Archive. The Kather100k data are available from the link provided in the paper. Reasonable requests for the in-house BWH whole slide and prostate data may be addressed to the corresponding author.
Code Availability. We implemented all our methods in Python, using PyTorch as the primary package for training the VQ-VAE. All scripts, checkpoints, preprocessed mosaics, and the pre-built database needed to reproduce the experiments in the paper are available at https://github.com/mahmoodlab/FISH. All source code is licensed under GNU GPLv3.
C.C. and F.M. conceived the study and designed the experiments. C.C. performed the experiments. C.C., M.Y.L., D.F.K.W., T.C., A.J.S., and F.M. analyzed the results. D.W. conducted the reader study. All authors wrote and approved the final paper.
This work was supported in part by internal funds from BWH Pathology, Google Cloud Research Grant and Nvidia GPU Grant Program and NIGMS R35GM138216 (F.M.). The content is solely the responsibility of the authors and does not reflect the official views of the National Institutes of Health, or the National Institute of General Medical Sciences.
The authors declare that they have no competing financial interests.
The study was approved by the Mass General Brigham (MGB) IRB office under protocol 2020P000233.