Advances in digital image processing and machine learning for digital pathology are showing practical results. Such techniques can assist pathologists in achieving higher accuracy and efficiency, and they lead to more reliable diagnoses by presenting computer-based second opinions to the clinician. Digital Pathology (DP) uses Whole Slide Imaging (WSI) as the basis for diagnosis. Unlike the traditional pathology workflow, in which tissue samples are inspected under a microscope and stored in physical archives, WSI digitizes glass slides into very high-resolution digital images (slides/scans). The introduction of such technologies has led to the development of numerous methods that combine machine learning and image processing to support the diagnostic workflow, which is labour-intensive, time-consuming, and subject to human error.

The digitization of biopsy samples has simplified parts of the analysis; however, it has also introduced several challenges. Only a few public digital datasets are available for machine-learning purposes. In addition, the existing datasets are generally unlabeled because manual delineation of regions of interest in digital images is tedious and costly. Moreover, DP methods suffer from image imperfections caused by the presence of artifacts and from the absence of accurate methods for tissue (foreground) extraction. Content-based image retrieval (CBIR) is considered a practical solution for processing unlabeled data. Retrieving similar cases from pathology archives alongside their treatment records may help pathologists write their reports with more confidence. Finally, the memory and computational requirements of WSI are problematic for the IT infrastructure of hospitals and clinics. Solutions that make image processing more memory-efficient and computationally less expensive are therefore desirable. This paper addresses the reduction of data dimensionality by clustering images in order to provide a compact representation of the scans for algorithmic processing. Our techniques are developed under the constraint of working with unlabeled data, a constraint that is motivated by the reality of the clinical workflow.
2 Related Works
Tissue examination under a microscope reveals important information for rendering an accurate diagnosis and thus providing effective treatment for different diseases. DP offers several opportunities and also presents challenges to image processing architectures. Presently, only a small fraction of glass slides are digitized, but even if WSI were more widely available, a number of technical issues would need to be addressed for its effective usage. One of the main challenges is data management and storage. Most importantly, the large dimensions of WSI files demand a large amount of memory and considerable computational power.
Content-Based Image Retrieval (CBIR) is an approach for finding images whose visual content is similar to that of a query image by searching a large archive of images. This is helpful in medical imaging and DP databases, where text annotations alone might be insufficient to describe an image precisely [25, 29]. Retrieving similar images requires a proper feature representation. In CBIR, both accuracy and fast search for similar images in large datasets are important. Therefore, various techniques for reducing the dimensionality of features are used to speed up CBIR systems. These techniques include principal component analysis (PCA), compact bilinear pooling, and fast approximate nearest neighbor search.

Image subsetting methods [1, 2] have been used to choose a small region of the whole slide image for computational analysis, reducing the size of the image while providing a better tissue representation. Other image subsetting algorithms use sparsity models for multi-channel representation and classification, and expectation maximization by logistic regression [26, 11]. Generally, the magnification is commonly used for many diagnostic tasks. Dividing whole slides into small patches (or tiles) of 256×256 to 1000×1000 pixels is also a common strategy for overcoming the large dimensionality of WSI data [23, 21]. This approach results in thousands of patches that must be analyzed individually. Low-resolution approaches are considerably faster; however, they may lose the local morphology. One possible solution is regional averaging, where a region is not considered a region of interest (ROI) unless it extends over multiple patches. On the other hand, this can cause small ROIs, such as small or isolated tumors, to be missed. Another solution is to analyze the complete image at low resolution and then refine the result on high-resolution patches by applying registration to each patch. In this manner, the local morphology is taken into account. However, one major downside is the significantly longer run time.

In this work, we propose unsupervised learning using handcrafted and deep features, followed by patch selection through Gaussian mixture models, to provide a more compact representation of digital slides for image indexing and search purposes.
3 Materials and Methods
Dataset and Data Preparation – We used the KimiaPath24 dataset for our experiments. This dataset contains 24 WSIs. The slides show diverse organs and tissue types with different texture patterns. The glass slides were captured by a digital scanner (TissueScope LE scanner by Huron Digital Pathology) in bright field using a 0.75 NA lens. The dataset contains 1325 test images (patches) of size 1000 × 1000 pixels (0.5 mm × 0.5 mm) from all 24 cases. Fig. 1 shows some example patches (the dataset can be downloaded online at http://kimia.uwaterloo.ca/kimia_lab_data_Path24.html).
All training and test patches are down-sampled to a lower pixel resolution so that they can be processed more easily during feature extraction. We patched the WSIs without overlap and then removed all patches with high background homogeneity. As a result, we created 27,055 training patches from the 24 WSIs. The presented dataset, comprising diverse body parts, may be suitable for intra-class search operations such as metastasis and floater detection.
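The patching step can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows and treats bright pixels as background; the names `white_level` and `max_background`, their default values, and the bright-pixel criterion itself are hypothetical stand-ins, since the homogeneity measure and threshold actually used are not specified here.

```python
def tile_origins(height, width, tile):
    """Top-left corners of the non-overlapping tile x tile patches that fit in the image."""
    return [(r, c)
            for r in range(0, height - tile + 1, tile)
            for c in range(0, width - tile + 1, tile)]


def background_fraction(image, r0, c0, tile, white_level=200):
    """Fraction of patch pixels at or above white_level (assumed to be background)."""
    bright = sum(1
                 for r in range(r0, r0 + tile)
                 for c in range(c0, c0 + tile)
                 if image[r][c] >= white_level)
    return bright / (tile * tile)


def tissue_patches(image, tile, max_background=0.9):
    """Keep only the patches whose background fraction does not exceed max_background."""
    height, width = len(image), len(image[0])
    return [(r, c)
            for r, c in tile_origins(height, width, tile)
            if background_fraction(image, r, c, tile) <= max_background]
```

Applied to a slide, `tissue_patches` returns the coordinates of the patches that would be passed on to feature extraction; everything else is discarded as background.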
Methodology – Fig. 3 illustrates our approach. We divided each whole slide image into many patches, extracted features, clustered the patches, and then selected a subset of patches to represent the scan. We applied image search to quantify the accuracy loss caused by this data reduction. To perform the search, we extracted the same features from the test set and compared them against the features of all training cases, using the Euclidean distance as a measure of (dis)similarity. The most similar patch is considered to be the output of the CBIR system.
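The search step amounts to a nearest-neighbor lookup under the Euclidean distance. The following is a minimal sketch of that idea, assuming the index is a plain list of `(scan_id, feature_vector)` pairs; the function name `retrieve` and this data layout are illustrative choices, not part of the original implementation.

```python
import math


def retrieve(query, index):
    """Return (scan_id, distance) of the indexed feature vector closest to the query.

    index is a list of (scan_id, feature_vector) pairs; similarity is measured
    with the Euclidean distance, so a smaller value means a better match.
    """
    best_id, best_vec = min(index, key=lambda item: math.dist(query, item[1]))
    return best_id, math.dist(query, best_vec)
```

For example, with `index = [("scan_a", [0.0, 0.0]), ("scan_b", [3.0, 4.0])]`, the query `[2.5, 4.0]` is matched to `"scan_b"`. A linear scan of this kind is adequate for small indices; its cost grows with the index size, which is why reducing the number of indexed patches matters.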
Convolutional Neural Networks (CNNs) can learn general features that are not specific to the dataset or task. The deeper layers are more specific to the task of the network. Many results indicate that features extracted from CNNs (deep features) are highly discriminative. These features are extracted from different layers of the CNN depending on the desired degree of specificity. Usually, the features are extracted from the last layer before the classification layer, which yields the most specific and most abstract features that can be reused for another task (dimensionality reduction, unsupervised learning, etc.). In this work, we used all 4096 outputs of the last layer before the classification layer of VGG16, a pre-trained network model with 16 layers [10, 9]. The second feature extraction method is handcrafted and uses the local binary patterns (LBP) algorithm [19, 28]. The resulting feature vectors are histograms of uniform, rotation-invariant patterns. The LBP vector is a concatenation of two histograms. The first uses a radius of 3 pixels and 24 neighboring pixels, resulting in 26 bins. The second uses a radius of 1 pixel and 8 neighboring pixels, resulting in 10 bins. The concatenated histogram therefore has 36 dimensions (bins) for each patch.
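The bin counts follow from the uniform, rotation-invariant coding: a pattern with at most two 0/1 transitions around the circle is coded by its number of ones (0 to P), and all remaining patterns share one extra code, giving P + 2 bins (10 for P = 8, 26 for P = 24). The sketch below illustrates this for P = 8 only. As a simplifying assumption it uses the 8 immediate neighbors without interpolation; the radius-3, 24-neighbor operator requires circular sampling and is not covered. All function names are illustrative.

```python
from itertools import product


def riu2_code(bits):
    """Map P thresholded neighbour bits (in circular order) to a code in 0..P+1."""
    p = len(bits)
    # Count 0/1 transitions around the circle; index -1 wraps to the last neighbour.
    transitions = sum(bits[i] != bits[i - 1] for i in range(p))
    # Uniform patterns are coded by their number of ones; all others share code P+1.
    return sum(bits) if transitions <= 2 else p + 1


# Offsets of the 8 immediate neighbours in circular order (radius 1, no interpolation).
NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_histogram_8(image):
    """Normalized 10-bin histogram of codes over the interior pixels of a
    grayscale image given as a list of rows (at least 3 x 3 pixels)."""
    hist = [0] * 10
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            centre = image[r][c]
            bits = [int(image[r + dr][c + dc] >= centre) for dr, dc in NEIGHBOURS_8]
            hist[riu2_code(bits)] += 1
    total = sum(hist)
    return [h / total for h in hist]


# Enumerating all 8-bit patterns confirms that P = 8 yields P + 2 = 10 distinct codes.
codes_8 = {riu2_code(bits) for bits in product((0, 1), repeat=8)}
```

Concatenating the 10-bin histogram with the corresponding 26-bin histogram for 24 neighbors gives the 36-dimensional descriptor described above.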
All patches of a scan are represented by two different sets of features for comparison, namely deep features and LBP histograms. We then train a self-organizing map (SOM) to cluster the patches of each scan. We do not know in advance how many clusters each scan may contain. Hence, each scan is split into the number of clusters found by the SOM algorithm, which ranged between 10 and 20. Parameter tuning is performed to shed light on variance and map size (see Fig. 2). We used Gaussian mixture models (GMMs) for patch selection. The number of representatives selected from each cluster is varied as a fraction of the total number of patches. It is important to point out that the deep features form a large feature vector (more than 4000 elements). Therefore, we used PCA to reduce the dimensionality of the deep features. We kept 95% of the variance, which yielded a new feature vector of 1078 elements. We also experimented with random patch selection, which provided slightly worse results than GMMs.
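The variance criterion used with PCA can be made concrete with a short sketch. It assumes the eigenvalues of the feature covariance matrix (the variance captured by each principal component) are already available and returns the smallest number of components whose cumulative share reaches the target; the function name and its interface are hypothetical.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading principal components whose eigenvalues
    account for at least `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= target:
            return k
    return len(ordered)
```

For instance, eigenvalues `[6, 3, 0.9, 0.1]` need three components to reach 95% of the variance, since the first two together explain only 90%.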
While trying to minimize the number of clusters and maximize the variance ratio, we observed that these two parameters are positively correlated. Fig. 2 shows how both vary with the map size: as the ratio increases, the number of clusters grows large. Based on empirical knowledge, a desirable number of clusters is less than 30. A map size of 20 may be regarded as suitable, as it provides a good compromise between the number of clusters and the inter-/intra-variance ratio. Changing the SOM’s learning rate did not have much impact. With these values, we obtain 18 clusters per scan on average. Some of the clusters contain few patches (less than 1% of the total number of patches). We merged each such cluster with its closest cluster (using the Euclidean distance), resulting in a smaller number of clusters. If important clusters are removed by merging, the GMM may still select their patches. However, one must keep in mind that the main purpose of CBIR systems is generally to recognize dominant tissue patterns and not to detect minute cellular details. The latter is a subject for detection and segmentation algorithms.
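The merging of small clusters can be sketched as follows. The sketch assumes that "closest" refers to the Euclidean distance between cluster centroids and that centroids of the large clusters are computed once, before any merging; both are assumptions made for illustration, as are the function names.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def merge_small_clusters(features, labels, min_fraction=0.01):
    """Reassign every cluster holding fewer than min_fraction of all patches
    to the large cluster with the nearest centroid (Euclidean distance)."""
    members = defaultdict(list)
    for vec, lab in zip(features, labels):
        members[lab].append(vec)
    min_size = min_fraction * len(features)
    large = {lab: centroid(vecs) for lab, vecs in members.items()
             if len(vecs) >= min_size}
    if not large:
        return list(labels)
    # For each small cluster, find the label of the nearest large cluster.
    target = {}
    for lab, vecs in members.items():
        if lab not in large:
            centre = centroid(vecs)
            target[lab] = min(large, key=lambda big: math.dist(centre, large[big]))
    return [target.get(lab, lab) for lab in labels]
```

The function returns a new label list of the same length, so it can be applied directly to the cluster assignments of one scan before patch selection.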
The LBP descriptor outperformed deep features 3 times out of 4. Other (deeper) networks may perform better; however, they also require more resources. The LBP histogram has 36 bins, while the VGG16 feature vector has a length of 1078 (after application of PCA). Fig. 4 shows two examples of sample patches that the SOM groups together using deep features. Fig. 5 illustrates two sets of patches selected by GMMs from SOM clusters. The accuracy calculation compares the performance of the proposed method for image retrieval, using LBP and deep features separately, with the accuracy obtained using the full training data set (27,055 patches). We used the KimiaPath24 guidelines to calculate the patch-to-scan accuracy, the whole-scan accuracy, and the total accuracy. LBP’s performance improved as more data was selected, whereas for VGG16 features the performance improved with more selected data only in the case of random selection. Indeed, with GMM selection, performance decreased with more data. This might be due to the length of the VGG feature vectors. Fig. 6 gives a general overview of the performance evaluation, while Table 1 reports all accuracy measurements (_r and _g indicate random selection and GMM selection, respectively).
| GMM Selection | Features | Feature Length |
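The three accuracy measures can be sketched in a few lines. Following our reading of the KimiaPath24 guidelines, the patch-to-scan accuracy is the fraction of all test patches matched to the correct scan, the whole-scan accuracy is the mean of the per-scan fractions, and the total accuracy is their product. The function name and the list-based interface are assumptions made for this example.

```python
from collections import Counter


def kimia_accuracies(true_scans, retrieved_scans):
    """Return (patch_to_scan, whole_scan, total) accuracy.

    true_scans[i] is the scan a test patch comes from and retrieved_scans[i]
    is the scan of the patch returned for it by the retrieval system.
    """
    per_scan_total = Counter(true_scans)
    per_scan_correct = Counter(t for t, r in zip(true_scans, retrieved_scans) if t == r)
    # Fraction of all test patches that were matched to the right scan.
    patch_to_scan = sum(per_scan_correct.values()) / len(true_scans)
    # Mean over scans of the fraction of that scan's patches matched correctly.
    whole_scan = sum(per_scan_correct[s] / n
                     for s, n in per_scan_total.items()) / len(per_scan_total)
    return patch_to_scan, whole_scan, patch_to_scan * whole_scan
```

Because the whole-scan accuracy weights every scan equally, a method that performs well only on scans with many test patches scores lower on it than on the patch-to-scan accuracy.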
The performance of both LBP and deep features generally drops as a result of patch selection, a fact that can be taken into account during algorithm design. However, the run time and memory requirements can be reduced considerably, which is an advantage when dealing with large WSI archives. For CBIR systems in histopathology, retrieval of similar images is a major challenge because of the enormous size of the archives. Our experiments showed that, for algorithmic purposes such as image search, the size of the image index (i.e., the computed features) can be drastically reduced while keeping the relevant information and characteristics of each scan. Keeping 50% of the patches with the LBP descriptor and GMM selection reduces the index size and, as expected, the computational requirements by 50%, and reaches a CBIR accuracy of 65% (for the first match), only 4% less than feature extraction for the entire data.
References
[1] Adiga, U., Malladi, R., Fernandez-Gonzalez, R., de Solorzano, C.O.: High-throughput analysis of multispectral images of breast cancer tissue. IEEE Transactions on Image Processing 15(8), 2259–2268 (Aug 2006)
[2] Aiad, H.A., Abdou, A.G., Bashandy, M.A., Said, A.N., Ezz-Elarab, S.S., Zahran, A.A.: Computerized nuclear morphometry in the diagnosis of thyroid lesions with predominant follicular pattern. Ecancermedicalscience 3, 146 (Sep 2009), can-3-146[PII]
[3] Ali Sharif Razavian, Hossein Azizpour, J.S.S.C.: CNN features off-the-shelf: an astounding baseline for recognition (2014)
[4] AlZubaidi, A.K., Sideseq, F.B., Faeq, A., Basil, M.: Computer aided diagnosis in digital pathology application: Review and perspective approach in lung cancer classification. In: New Trends in Information & Communications Technology Applications (NTICT), 2017 Annual Conference on. pp. 219–224. IEEE (2017)
[5] Babaie, M., Kalra, S., Sriram, A., Mitcheltree, C., Zhu, S., Khatami, A., Rahnamayan, S., Tizhoosh, H.R.: Classification and retrieval of digital pathology scans: A new dataset. In: CVMI Workshop@ CVPR (2017)
[6] Babaie, M., Tizhoosh, H.R., Zhu, S., Shiri, M.: Retrieving similar x-ray images from big image data using radon barcodes with single projections. arXiv preprint arXiv:1701.00449 (2017)
[7] Chan, S.H., Zickler, T.E., Lu, Y.M.: Demystifying symmetric smoothing filters. CoRR abs/1601.00088 (2016)
[8] Cooper, L.A.D., Carter, A.B., Farris, A.B., Wang, F., Kong, J., Gutman, D.A., Widener, P., Pan, T.C., Cholleti, S.R., Sharma, A., Kurc, T.M., Brat, D.J., Saltz, J.H.: Digital pathology: Data-intensive frontier in medical imaging. Proceedings of the IEEE 100(4), 991–1003 (April 2012)
[9] Garcia-Gasulla, D., Parés, F., Vilalta, A., Moreno, J., Ayguadé, E., Labarta, J., Cortés, U., Suzumura, T.: On the behavior of convolutional nets for feature extraction 61 (03 2017)
[10] Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015)
[11] Hou, L., Samaras, D., Kurc, T.M., Gao, Y., Davis, J.E., Saltz, J.H.: Patch-based convolutional neural network for whole slide tissue image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2424–2433 (2016)
[12] Khatami, A., Babaie, M., Khosravi, A., Tizhoosh, H.R., Nahavandi, S.: Parallel deep solutions for image retrieval from imbalanced medical imaging archives. Applied Soft Computing 63, 197–205 (2018)
[13] LeCun, Y., Haffner, P., Bottou, L., Bengio, Y.: Object recognition with gradient-based learning. In: Shape, Contour and Grouping in Computer Vision. pp. 319–. Springer-Verlag, London, UK (1999)
[14] Lotz, J., Olesch, J., Müller, B., Polzin, T., Galuschka, P., Lotz, J.M., Heldmann, S., Laue, H., González-Vallinas, M., Warth, A., Lahrmann, B., Grabe, N., Sedlaczek, O., Breuhahn, K., Modersitzki, J.: Patch-based nonlinear image registration for gigapixel whole slide images. IEEE Trans. Biomed. Eng. 63(9), 1812–1819 (Sept 2016)
[15] Madabhushi, A., Lee, G.: Image analysis and machine learning in digital pathology: Challenges and opportunities 33 (07 2016)
[16] Marshall, B.: A brief history of the discovery of Helicobacter pylori pp. 3–15 (2016)
[17] Moriya, T., Roth, H.R., Nakamura, S., Oda, H., Nagara, K., Oda, M., Mori, K.: Unsupervised pathology image segmentation using representation learning with spherical k-means. In: Medical Imaging 2018: Digital Pathology. vol. 10581, p. 1058111. International Society for Optics and Photonics (2018)
[18] Nikolas Stathonikos, Mitko Veta, A.H., van Diest, P.J.: Going fully digital: Perspective of a Dutch academic pathology lab. Journal of Pathology Informatics 4(15) (2013)
[19] Ojala, T., Pietikäinen, M., Harwood, D.: A comparative study of texture measures with classification based on featured distributions. Pattern Recognition 29(1), 51–59 (1996)
[20] Onega, T., Weaver, D., Geller, B., Oster, N., N A Tosteson, A., Carney, P., Nelson, H., H Allison, K., O’Malley, F., J Schnitt, S., Elmore, J.: Digitized whole slides for breast pathology interpretation: Current practices and perceptions 27 (03 2014)
[21] Pitiot, A., Bardinet, E., Thompson, P., Malandain, G.: Piecewise affine registration of biological images for volume reconstruction 10, 465–83 (07 2006)
[22] Robboy, S.J., Altshuler, B.S., Chen, H.Y.: Retrieval in a computer-assisted pathology encoding and reporting system (CAPER). American Journal of Clinical Pathology 75(5), 654–661 (2016)
[23] Roberts, N., Magee, D., Song, Y., Brabazon, K., Shires, M., Crellin, D., Orsi, N.M., Quirke, R., Quirke, P., Treanor, D.: Toward routine use of 3D histopathology as a research tool. Am J Pathol 180(5), 1835–1842 (May 2012)
[24] Shaimaa Al-Janabi, et al.: Whole slide images for primary diagnostics of urinary system pathology: a feasibility study. Journal of Pathology Informatics 3(4), 91–96 (12 2014)
[25] Sridhar, A., Doyle, S., Madabhushi, A.: Content-based image retrieval of digitized histopathology in boosted spectrally embedded spaces. Journal of Pathology Informatics 6(1), 41 (2015)
[26] Srinivas, U., Mousavi, H.S., Monga, V., Hattel, A., Jayarao, B.: Simultaneous sparsity model for histopathological image representation and classification. IEEE Transactions on Medical Imaging 33(5), 1163–1179 (2014)
[27] Tizhoosh, H., Babaie, M.: Representing medical images with encoded local projections. IEEE Trans. Biomed. Eng. (2018)
[28] Topi, M., Timo, O., Matti, P., Maricor, S.: Robust texture classification by subsets of local binary patterns. In: 15th International Conference on Pattern Recognition. ICPR. vol. 3, pp. 935–938 vol.3 (Sept 2000)
[29] Yang, L., Jin, R., Mummert, L., Sukthankar, R., Goode, A., Zheng, B., Hoi, S., Satyanarayanan, M.: A boosting framework for visuality-preserving distance metric learning and its application to medical image retrieval 32, 30–44 (01 2010)
[30] Zhang, X., Liu, W., Dundar, M., Badve, S., Zhang, S.: Towards large-scale histopathological image analysis: Hashing-based image retrieval. IEEE Transactions on Medical Imaging 34(2), 496–506 (Feb 2015)