DeepMCAT: Large-Scale Deep Clustering for Medical Image Categorization

by Turkay Kart, et al.
Imperial College London

In recent years, the research landscape of machine learning in medical imaging has changed drastically from supervised to semi-, weakly- or unsupervised methods. This is mainly due to the fact that ground-truth labels are time-consuming and expensive to obtain manually. Generating labels from patient metadata might be feasible but it suffers from user-originated errors which introduce biases. In this work, we propose an unsupervised approach for automatically clustering and categorizing large-scale medical image datasets, with a focus on cardiac MR images, and without using any labels. We investigated the end-to-end training using both class-balanced and imbalanced large-scale datasets. Our method was able to create clusters with high purity and achieved over 0.99 cluster purity on these datasets. The results demonstrate the potential of the proposed method for categorizing unstructured large medical databases, such as organizing clinical PACS systems in hospitals.




1 Introduction

Highly curated labelled datasets have recently been emerging to train deep learning models for specific tasks in medical imaging. Thanks to these fully-annotated images, supervised training of convolutional neural networks (CNNs), either from scratch or by fine-tuning, has become a dominant approach for automated biomedical image analysis. However, the data curation process is often manual and labor-intensive, and it requires expert domain knowledge. This time-consuming procedure is simply not practical for every single task in medical imaging, and therefore automation is a necessity.

The first step of data curation in medical imaging is typically data cleaning, where the desired images are extracted from a hospital image database such as a PACS system. Due to the nature of such image databases in hospitals, these systems often record important attributes such as image sequences in an unstructured fashion as meta-data in the DICOM header of the images. Meta-data in the DICOM standard, the most widely adopted format for data storage in medical imaging, may seem a reliable option for automated annotation, but it is often incorrect, incomplete and inconsistent. This represents a major challenge for data curation. Gueld et al. [8] analyzed the quality of the DICOM tag Body Part Examined in 4 imaging modalities at Aachen University Hospital and found that, in 15% of the cases, the wrong information had been entered for the tag because of user-originated errors. Misra et al. [11] reported that labelling with user-defined meta-data containing inconsistent vocabulary may introduce human-reporting bias in datasets, which degrades the performance of deep learning models. Categorization can be even more difficult for images stored in other formats, e.g. NIfTI in neuroimaging, where meta-data is limited and/or simply not available for image categorization.

To categorize medical images in a realistic scenario, designing fully supervised methods would require prior knowledge about the data distribution of the entire database, accounting for long-tailed rare classes and, finally, devoting significant effort to accurately and consistently obtaining manual ground-truth. In this work, we propose a different paradigm that makes efficient use of abundant unlabelled data through unsupervised learning. Specifically, we demonstrate that large-scale datasets of cardiac magnetic resonance (CMR) images can be categorized with a generalizable clustering approach that uses basic deep neural network architectures. Our intuition is that categorization of unknown medical images can be achieved if clusters with high purity are generated from learned image features without any supervision. Our approach builds on a recent state-of-the-art method, DeepCluster [4].


Our main contributions are the following: (i) we show that pure clusters for CMR images can be obtained with a deep clustering approach; (ii) we investigate end-to-end training of the approach for both a class-balanced dataset and highly imbalanced data distributions, the latter being particularly relevant for medical imaging applications where diseases and abnormal cases can be rare; (iii) we discuss the design considerations and evaluation procedures to adapt deep clustering for medical image categorization. To the best of our knowledge, this is the first study to perform simultaneous representation learning and clustering for cardiac MR sequence/view categorization and to evaluate its performance on a large-scale imbalanced dataset (n = 192,272 images).

2 Related Work

A number of self-supervised and unsupervised methodologies have been explored to train machine learning models with abundant unlabelled data. In self-supervised learning (SSL), a pretext task is defined to train a model without ground-truth. While several studies have explored self-supervision [7, 3], domain expertise is typically needed to formulate a pretext task, unlike in our work. Similar to self-supervised learning, different strategies of unsupervised learning have been implemented with generative networks [6] and deep clustering [20] to learn visual features. In this study, we focus on unsupervised deep clustering approaches at large scale. Although this has been investigated in a number of studies for natural images [5, 4], attempts in medical imaging have explored it with only a limited amount of curated data, in contrast to our methodology.

Moriya et al. [12] extended the JULE framework [20] for simultaneously learning image features and cluster assignments on 3D patches of micro-computed tomography (micro-CT) images with a recurrent process. Perkonigg et al. [15] utilized a deep convolutional autoencoder with clustering, whose loss function is a sum of a reconstruction loss and a clustering loss, to predict marker patterns of image patches. Ahn et al. [1] implemented an ensemble of deep clustering methods based on K-means clustering. Pathan et al. [14] showed that clustering can be improved iteratively with joint training for segmentation of dermoscopic images. Maicas et al. [10] combined deep clustering with meta training for breast screening.

One related approach to our study is the "Looped Deep Pseudo-task Optimization" (LDPO) framework proposed by Wang et al. [19]. LDPO extracts image features with joint alternating optimization and refines clusters. It requires a pre-trained model (trained on medical or natural images) at the beginning to extract features from radiological images and then fine-tunes the model parameters by joint learning. Therefore, the LDPO framework starts with a priori information and a strong initial signal about input images. In contrast, our model is completely unsupervised and trained from scratch with no additional processing. In addition, we do not utilize any stopping criteria, which is another difference from LDPO [19].

3 Method

Our method builds upon the framework of DeepCluster [4]. The idea behind their approach is that a CNN with random parameters $\theta$ provides a weak signal about image features, sufficient to train a fully-connected classifier to an accuracy (12%) higher than chance (0.1%) [13]. DeepCluster [4] combines CNN architectures and clustering approaches, and it proposes a joint learning procedure. The joint training alternates between extracting image features by the CNN and generating pseudo-labels by clustering the learned features. It optimizes the following objective function for a training set $X = \{x_1, x_2, \dots, x_N\}$:

$$\min_{\theta, W} \; \frac{1}{N} \sum_{n=1}^{N} \ell\left(g_W\left(f_\theta(x_n)\right), y_n\right)$$

Here $g_W$ denotes a classifier parametrized by $W$, $f_\theta(x_n)$ denotes the features extracted from image $x_n$, $y_n$ denotes the pseudo-label for this image and $\ell$ denotes the multinomial logistic loss [4]. Pseudo-labels are updated with new cluster assignments at every epoch. To avoid trivial solutions where the output of the CNN is always the same, the images are uniformly sampled to balance the distribution of the pseudo-labels.
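The uniform sampling over pseudo-labels can be illustrated with a small sketch. It is not the authors' implementation: the function name `uniform_pseudo_label_sampler` and its parameters are hypothetical, and it assumes that "uniform" means each non-empty cluster is drawn with equal probability, so that each image is weighted by the inverse of its cluster size.

```python
import numpy as np

def uniform_pseudo_label_sampler(pseudo_labels, n_samples, seed=0):
    """Draw image indices so that every non-empty cluster is equally likely.

    Hypothetical helper: each image gets a weight inversely proportional to
    the size of the cluster (pseudo-label) it currently belongs to.
    """
    rng = np.random.default_rng(seed)
    pseudo_labels = np.asarray(pseudo_labels).ravel()
    clusters, counts = np.unique(pseudo_labels, return_counts=True)
    size_of = dict(zip(clusters.tolist(), counts.tolist()))
    weights = np.array([1.0 / size_of[c] for c in pseudo_labels.tolist()])
    weights /= weights.sum()  # normalize to a probability distribution
    # Sample with replacement so that small clusters can be over-sampled.
    return rng.choice(len(pseudo_labels), size=n_samples, replace=True, p=weights)

# Example: a 90/10 split between two clusters becomes roughly 50/50.
labels = np.array([0] * 90 + [1] * 10)
sampled = uniform_pseudo_label_sampler(labels, n_samples=10_000)
fraction_minor = float(np.mean(labels[sampled] == 1))
```

In a training loop, indices drawn this way would be regenerated after every re-clustering step, since the pseudo-labels change from epoch to epoch.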


In this study, we keep parts of DeepCluster [4], such as VGG-16 with batch normalization [18] as the deep neural architecture and K-means [9] as the clustering method, and then we adapt the rest for cardiac MR image categorization, as illustrated in Fig. 1. To begin with, we add an adaptive average pooling layer between the VGG's last feature layer and the classifier. In DeepCluster [4], PCA is performed for dimensionality reduction, which results in 256 dimensions, whereas we preserve the original features. These features are $\ell_2$-normalized before clustering. DeepCluster [4] feeds Sobel-filtered images to the CNN instead of raw images. In contrast, our method uses raw cardiac MR images in our experiments. We utilize heavy data augmentations including random rotation, resizing and cropping with random scale/aspect ratio for both training and clustering. Lastly, we normalize each image independently with z-scoring instead of using a global mean and standard deviation.
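The two normalization steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the small `eps` constants are assumptions, and the feature normalization is assumed to be the L2 (unit-length) normalization used in DeepCluster.

```python
import numpy as np

def zscore_image(image, eps=1e-8):
    """Normalize one image with its own mean and standard deviation."""
    image = np.asarray(image, dtype=np.float64)
    return (image - image.mean()) / (image.std() + eps)

def l2_normalize(features, eps=1e-12):
    """Scale each feature vector (one per row) to unit Euclidean length."""
    features = np.asarray(features, dtype=np.float64)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # Guard against division by zero for all-zero feature vectors.
    return features / np.maximum(norms, eps)

# Per-image z-scoring: statistics come from the image itself, not the dataset.
image = np.arange(12, dtype=np.float64).reshape(3, 4)
normalized_image = zscore_image(image)

# Feature normalization before K-means.
unit_features = l2_normalize(np.array([[3.0, 4.0], [0.0, 0.0]]))
```

Normalizing per image removes intensity-scale differences between acquisitions, which a single global mean and standard deviation would not account for.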

Figure 1: Entire processing pipeline of our method based on DeepCluster [4]

We utilize the UK Biobank cardiac MR dataset, which is open to researchers and contains tens of thousands of subjects. The whole dataset contains 13 image sequences/views, including short-axis (SA) cine, long-axis (LA) cine (2/3/4 chamber views), flow, SHMOLLI, etc. [16]. These images are in 2D, 2D + time or 3D + time. UK Biobank employs a consistent naming convention for different cardiac sequences and view-planes. We generated ground-truth labels using this naming convention and classified images into 13 categories [2]. To investigate the effect of class distribution on our methodology as well as the training stability, we designed three experiment settings using subsets of the entire dataset: (i) a subset of 3 well-balanced classes (LA 2 / 3 / 4 chamber views), (ii) a large dataset with high class imbalance across 13 classes, and (iii) a smaller dataset with high class imbalance across 13 classes. In these datasets, 2D images at t=0 were saved in PNG format for faster loading and training. If the images were in 3D + time, every single slice in the z direction at t=0 was saved. The totals were 47,637 images in dataset (i), 192,272 images in dataset (ii) and 23,943 images in dataset (iii). Example images are illustrated in Fig. 4, and the class distributions are reported in Table 2 in the supplementary material.
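The slice selection at t=0 can be sketched as below. The axis order is an assumption made for illustration, namely (x, y), (x, y, t) or (x, y, z, t); the source does not state how the arrays are laid out, and the function name is hypothetical.

```python
import numpy as np

def slices_at_first_frame(volume):
    """Return the 2D slices at t=0 from a 2D, 2D + time or 3D + time array.

    Assumed (hypothetical) axis order: (x, y), (x, y, t) or (x, y, z, t).
    """
    volume = np.asarray(volume)
    if volume.ndim == 2:
        return [volume]                      # plain 2D image
    if volume.ndim == 3:
        return [volume[:, :, 0]]             # 2D + time: keep the first frame
    if volume.ndim == 4:
        # 3D + time: keep every slice along z at the first frame
        return [volume[:, :, z, 0] for z in range(volume.shape[2])]
    raise ValueError(f"unsupported number of dimensions: {volume.ndim}")

# Example: a 3D + time stack with 5 slices and 20 frames yields 5 images.
stack = np.zeros((8, 8, 5, 20))
images = slices_at_first_frame(stack)
```

Each returned array could then be written to disk as a PNG with any image library.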

4 Results and Discussion

In our experiments, we followed a systematic analysis of the proposed methodology. We want to answer these four questions below:

  1. Is it feasible to categorize uncurated large-scale cardiac MR images based on their cluster assignments?

  2. How does the class balance affect deep clustering for medical images?

  3. How stable is training given that there is no clear stopping criterion?

  4. How should we interpret the evaluation metrics?

4.0.1 Experiment settings:

For training, we set the total number of epochs to 200. Our optimizer was stochastic gradient descent (SGD) with momentum 0.9 and weight decay of 1e-5. Our batch size was 256 and the initial learning rate was 0.05. In the literature, there is a large body of empirical evidence which indicates that over-segmentation improves the performance of a deep clustering method [4]. Based on this evidence, we set the number of clusters to 8 times the number of classes in the datasets, which corresponded to 24 for the dataset of 3 well-balanced classes and 104 for the datasets with 13 classes.
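The settings above can be collected in a small configuration object. The class name `TrainingConfig` and its field names are hypothetical; only the numeric values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the reported training settings."""
    n_classes: int
    epochs: int = 200
    batch_size: int = 256
    learning_rate: float = 0.05      # initial learning rate for SGD
    momentum: float = 0.9
    weight_decay: float = 1e-5
    over_clustering_factor: int = 8  # clusters per ground-truth class

    @property
    def n_clusters(self) -> int:
        # Over-segmentation: use more clusters than there are classes.
        return self.over_clustering_factor * self.n_classes

balanced = TrainingConfig(n_classes=3)     # dataset (i)
imbalanced = TrainingConfig(n_classes=13)  # datasets (ii) and (iii)
print(balanced.n_clusters, imbalanced.n_clusters)  # prints: 24 104
```

Note that the number of classes is used here only to reproduce the reported cluster counts; in a truly unlabelled setting it would have to be estimated or chosen generously.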

4.0.2 Evaluation metrics:

We used normalized mutual information (NMI) [17] and cluster purity (CP) [17] to evaluate the clustering quality of our models.

$$\mathrm{NMI}(A; B) = \frac{I(A; B)}{\sqrt{H(A)\, H(B)}}$$

Here $I(A; B)$ is the mutual information between $A$ and $B$, and $H$ is the entropy. For our experiments, we calculate two NMI values: NMI against the previous cluster assignments ($A_{t-1}$) and NMI against ground-truth labels.

$$\mathrm{CP}(A_t, B) = \frac{1}{N} \sum_{a \in A_t} \max_{b \in B} \left| a \cap b \right|$$

Here $N$ is the number of images, $A_t$ are the cluster assignments at epoch $t$ and $B$ is the ground-truth labels.
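Both metrics can be computed from a contingency table of cluster assignments against labels. The sketch below is illustrative only: it assumes the geometric-mean normalization of NMI used by DeepCluster (other normalizations, such as the arithmetic mean of the entropies, exist), and the function names are hypothetical.

```python
import numpy as np

def _contingency(a, b):
    """Count co-occurrences of the values in a (rows) and b (columns)."""
    _, a_idx = np.unique(np.asarray(a).ravel(), return_inverse=True)
    _, b_idx = np.unique(np.asarray(b).ravel(), return_inverse=True)
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (a_idx, b_idx), 1)
    return table

def _entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def nmi(a, b):
    """Mutual information normalized by sqrt(H(a) * H(b)) (assumed form)."""
    table = _contingency(a, b)
    n = table.sum()
    h_a = _entropy(table.sum(axis=1))
    h_b = _entropy(table.sum(axis=0))
    if h_a == 0.0 or h_b == 0.0:
        return 0.0  # convention chosen here for a degenerate partition
    joint = table / n
    outer = np.outer(table.sum(axis=1), table.sum(axis=0)) / n**2
    nz = joint > 0
    mi = float((joint[nz] * np.log(joint[nz] / outer[nz])).sum())
    return mi / np.sqrt(h_a * h_b)

def cluster_purity(clusters, labels):
    """Fraction of images that belong to the majority class of their cluster."""
    table = _contingency(clusters, labels)
    return float(table.max(axis=1).sum() / table.sum())

# Three pure clusters over two classes: purity is perfect, NMI is below 1
# because the clustering is finer than the labelling.
clusters = [0, 0, 1, 1, 2, 2]
labels = [0, 0, 0, 0, 1, 1]
purity = cluster_purity(clusters, labels)
score = nmi(clusters, labels)
```

The example shows why the two metrics are reported together: over-segmentation can leave purity at its maximum while NMI against the labels stays below 1.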

Accurate interpretation of our metrics, CP and NMI, is important. CP ranges from 0 to 1, where values near 0 indicate poor clusters and 1 indicates perfect clusters. As the number of clusters increases, CP generally tends to increase until every image forms its own cluster, which trivially yields perfect purity. In addition, we utilize NMI, which quantifies the information shared between cluster assignments and labels. If clustering is independent of the classes, i.e. random assignments, NMI has a value of 0. On the other hand, if we can form classes directly from cluster assignments, then NMI has a value of 1. The number of clusters also affects the NMI value, but normalization enables the comparison of clusterings [17]. In our experiments, we did not employ any stopping criteria; thus, we always used the last model. In addition, during the training, we did not use NMI between cluster assignments and ground-truth, or cluster purity, for validation.

Dataset  # of images  # of classes  # of clusters  NMI (t vs t-1)  NMI (t vs labels)  CP
(i)      47,637       3             24             0.675           0.519              0.997
(ii)     192,272      13            104            0.782           0.605              0.991
(iii)    23,943       13            104            0.745           0.609              0.994
Table 1: Performance of our method for different data configurations after 200 epochs

4.0.3 Discussion:

Metrics and loss progression throughout the training are given in Fig. 2. Results of our deep clustering method, which are calculated from features at the 200th epoch, are given in Table 1. Our method is able to reach a cluster purity above 0.99 for both class-balanced and imbalanced datasets, which shows the feasibility of the deep clustering pipeline to categorize large-scale medical images without any supervision or labels. The class imbalance does not affect overall performance, but balanced classes provide a more stable purity throughout the training. We also show that a relatively small dataset can be enough for efficient clustering with high cluster purity.

Figure 2: Training metrics of our method for different data configurations

Additionally, we want to extend the discussion about deep clustering in [4] to medical imaging in a realistic scenario. One major challenge in deep clustering is the lack of a stopping criterion. Supervised training with labelled data as a stopping criterion could be utilized, but this usually requires prior knowledge of the classes, which may not be available beforehand for an unstructured hospital database. Pre-defined threshold-based methods on evaluation metrics, e.g. NMI and purity from adjacent epochs [19], could be another option, but their robustness has yet to be proven. This is why it is important to investigate whether the training diverges. To this end, we trained on dataset (iii) for 1000 epochs to observe the training stability. As we can see from Fig. 3, although we observed some fluctuations in the metrics from time to time, they were stable throughout the training, which is similar to the observation in [4].

Figure 3: Training stability and metrics for 1000 epochs

Lastly, we observed that changes in NMI and CNN loss could indicate changes in clustering quality. Normally, we expect to see a steady increase in NMI and a steady decrease in CNN loss during the training. A sudden decrease in NMI and/or a sudden increase in CNN loss may be a sign that worse clusters are being generated. However, a steady decrease in CNN loss does not necessarily mean better cluster purity. Therefore, we think that it is beneficial to closely monitor not one but all metrics for unusual changes, as well as to consider other clustering metrics.
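One simple way to watch for such changes is to flag epochs where NMI drops or the loss rises by more than a chosen margin. The sketch below is only an illustration of that idea: the function name and both thresholds are hypothetical and are not taken from the experiments.

```python
def flag_unusual_epochs(nmi_history, loss_history, nmi_drop=0.05, loss_rise=0.1):
    """Return the epochs whose NMI fell or whose loss rose abruptly.

    Both thresholds are hypothetical and would need tuning per dataset.
    """
    flagged = []
    for t in range(1, min(len(nmi_history), len(loss_history))):
        nmi_fell = nmi_history[t - 1] - nmi_history[t] > nmi_drop
        loss_rose = loss_history[t] - loss_history[t - 1] > loss_rise
        if nmi_fell or loss_rose:
            flagged.append(t)
    return flagged

# Example with made-up values: epoch 2 shows an NMI drop, epoch 3 a loss jump.
nmi_values = [0.50, 0.60, 0.40, 0.65]
loss_values = [1.00, 0.80, 0.70, 0.95]
print(flag_unusual_epochs(nmi_values, loss_values))  # prints: [2, 3]
```

A flagged epoch is a prompt for inspection rather than a stopping rule, since a falling loss alone does not guarantee purer clusters.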

5 Conclusion

In this work, we propose an unsupervised deep clustering approach with end-to-end training to automatically categorize large-scale medical images without using any labels. We have demonstrated that our method is able to generate highly pure clusters (above 0.99) under both balanced and imbalanced class distributions. In future work, expanding the evaluation, adapting deep clustering approaches to other clinical tasks, and improving their robustness and generalizability are some of the interesting avenues that could be explored.

6 Acknowledgement

This work is supported by the UK Research and Innovation London Medical Imaging and Artificial Intelligence Centre for Value Based Healthcare. This research has been conducted using the UK Biobank Resource under Application Number 12579.


References

  • [1] E. Ahn, A. Kumar, D. Feng, M. Fulham, and J. Kim (2019) Unsupervised feature learning with k-means and an ensemble of deep convolutional neural networks for medical image classification. arXiv preprint arXiv:1906.03359.
  • [2] W. Bai, M. Sinclair, G. Tarroni, O. Oktay, M. Rajchl, G. Vaillant, A. M. Lee, N. Aung, E. Lukaschuk, M. M. Sanghvi, F. Zemrak, K. Fung, J. M. Paiva, V. Carapella, Y. J. Kim, H. Suzuki, B. Kainz, P. M. Matthews, S. E. Petersen, S. K. Piechnik, S. Neubauer, B. Glocker, and D. Rueckert (2018) Automated cardiovascular magnetic resonance image analysis with fully convolutional networks. J Cardiovasc Magn Reson 20 (1), pp. 65.
  • [3] W. Bai, C. Chen, G. Tarroni, J. Duan, F. Guitton, S. E. Petersen, Y. Guo, P. M. Matthews, and D. Rueckert (2019) Self-supervised learning for cardiac MR image segmentation by anatomical position prediction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 541–549.
  • [4] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149.
  • [5] M. Caron, P. Bojanowski, J. Mairal, and A. Joulin (2019) Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2959–2968.
  • [6] J. Donahue, P. Krähenbühl, and T. Darrell (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782.
  • [7] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
  • [8] M. O. Gueld, M. Kohnen, D. Keysers, H. Schubert, B. B. Wein, J. Bredno, and T. M. Lehmann (2002) Quality of DICOM header information for image categorization. In Medical Imaging 2002: PACS and Integrated Medical Information Systems: Design and Evaluation, E. L. Siegel and H. K. Huang (Eds.), Vol. 4685, pp. 280–287.
  • [9] J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.
  • [10] G. Maicas, C. Nguyen, F. Motlagh, J. C. Nascimento, and G. Carneiro (2020) Unsupervised task design to meta-train medical image classifiers. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1339–1342.
  • [11] I. Misra, C. L. Zitnick, M. Mitchell, and R. Girshick (2016) Seeing through the human reporting bias: visual classifiers from noisy human-centric labels. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2930–2939.
  • [12] T. Moriya, H. R. Roth, S. Nakamura, H. Oda, K. Nagara, M. Oda, and K. Mori (2018) Unsupervised segmentation of 3D medical images based on clustering and deep representation learning. In Medical Imaging 2018: Biomedical Applications in Molecular, Structural, and Functional Imaging, Vol. 10578, pp. 1057820.
  • [13] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
  • [14] S. Pathan and A. Tripathi (2020) Y-net: biomedical image segmentation and clustering. arXiv preprint arXiv:2004.05698.
  • [15] M. Perkonigg, D. Sobotka, A. Ba-Ssalamah, and G. Langs (2020) Unsupervised deep clustering for predictive texture pattern discovery in medical images. arXiv preprint arXiv:2002.03721.
  • [16] S. E. Petersen, P. M. Matthews, J. M. Francis, M. D. Robson, F. Zemrak, R. Boubertakh, A. A. Young, S. Hudson, P. Weale, S. Garratt, et al. (2015) UK Biobank's cardiovascular magnetic resonance protocol. Journal of Cardiovascular Magnetic Resonance 18 (1), pp. 1–7.
  • [17] H. Schütze, C. D. Manning, and P. Raghavan (2008) Introduction to information retrieval. Vol. 39, Cambridge University Press, Cambridge.
  • [18] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [19] X. Wang, L. Lu, H. Shin, L. Kim, M. Bagheri, I. Nogues, J. Yao, and R. M. Summers (2017) Unsupervised joint mining of deep features and image labels for large-scale radiology image categorization and scene recognition. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 998–1007.
  • [20] J. Yang, D. Parikh, and D. Batra (2016) Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156.

7 Supplementary Material

Class          Dataset (i)  Dataset (ii)  Dataset (iii)
AO             0            7859          982
FLOW           0            7782          971
FLOW MAG       0            7782          971
FLOW PHA       0            7782          971
LA (2 ch)      15868        7931          990
LA (3 ch)      15889        7943          992
LA (4 ch)      15880        7937          990
LVOT           0            7831          979
SA             0            83372         10339
SHMOLLI        0            7565          944
SHMOLLI T1MAP  0            7560          944
CINE TAG       0            23367         2926
Table 2: Class distributions for all datasets

Figure 4: Example cardiac images from the datasets