Machine Learning has revolutionized life science research, especially in Neuroimaging and Bioinformatics [gao2017, serra2018], for example by modeling interactions between whole-brain genomics and imaging [park2012, medland2014] and by identifying Alzheimer’s Disease (AD)-related proteins [zhao2019]. In particular, Deep Learning can achieve accurate computer-assisted diagnosis when large-scale annotated training samples are available. In Medical Imaging, unfortunately, preparing such massive annotated datasets is often infeasible [han2020AIAI, cheplygina2019]. To tackle this pervasive problem, researchers have proposed various data augmentation techniques, including Generative Adversarial Network (GAN)-based ones [goodfellow2014, FridAdar, Han2, Han3, han2019CIKM]; alternatively, Rauschecker et al. combined Convolutional Neural Networks (CNNs), feature engineering, and an expert-knowledge Bayesian network to derive brain Magnetic Resonance Imaging (MRI) differential diagnoses that approach neuroradiologists’ accuracy for diseases. However, even with these techniques, supervised learning still requires many images with pathological features, even for rare diseases, to make a reliable diagnosis; moreover, it can only detect specific pathologies that it has already learned. In this regard, just as physicians notice previously unseen anomalies using prior knowledge of healthy body structure, unsupervised anomaly detection methods that leverage only large-scale healthy images can discover and flag overlooked diseases when their generalization fails.
Towards this, researchers reconstructed a single medical image via GANs [Schlegl], AutoEncoders (AEs) [Uzunova], or a combination of both, since GANs can generate realistic images and AEs, especially Variational AEs (VAEs), can directly map data onto its latent representation [Chen]; then, unseen images were scored by comparing them with their reconstructions to discriminate a pathological image distribution (i.e., outliers either in the learned feature space or from high reconstruction loss). However, those single-image reconstruction methods mainly target diseases that are easy to detect from a single image even for non-expert human observers, such as glioblastoma on MR images [Chen] and lung cancer on Computed Tomography (CT) images [Uzunova]. Without considering continuity between multiple adjacent images, they cannot directly discriminate diseases composed of the accumulation of subtle anatomical anomalies, such as AD. Moreover, no study has so far shown how unsupervised anomaly detection is associated with disease stages, various (i.e., more than two types of) diseases, or multi-sequence MRI scans.
Therefore, this paper proposes the unsupervised Medical Anomaly Detection GAN (MADGAN), a novel two-step method using GAN-based multiple adjacent brain MRI slice reconstruction to detect various diseases at various stages on multi-sequence structural MRI (Fig. 1): (Reconstruction) Wasserstein loss with Gradient Penalty (WGAN-GP) [Gulrajani, han2018] + 100 loss, trained on healthy brain axial MRI slices to reconstruct the next ones, reconstructs unseen healthy/abnormal scans; the loss generalizes well only for unseen images with a similar distribution to the training images while the WGAN-GP loss captures recognizable structure; (Diagnosis) Average loss per scan discriminates them, comparing the ground truth/reconstructed slices; the loss clearly discriminates the healthy/abnormal scans as squared error becomes huge for outliers. Using Receiver Operating Characteristic (ROC) curves and their Areas Under the Curve (AUCs), we evaluate the diagnosis performance of AD on T1-weighted (T1) MRI scans, and of brain metastases/various diseases (e.g., small infarctions, aneurysms) on contrast-enhanced T1 (T1c) MRI scans. Using healthy T1 and healthy T1c scans for training, our Self-Attention (SA) MADGAN approach can detect AD at a very early stage, Mild Cognitive Impairment (MCI), with AUC , and AD at a late stage with AUC , while detecting brain metastases with AUC .
Our main contributions are as follows:
MRI Slice Reconstruction: This first multiple MRI slice reconstruction approach can reliably predict the next slices from the previous ones only for unseen images similar to training data by combining SAGAN and loss.
Unsupervised Anomaly Detection: This first unsupervised multi-stage anomaly detection reveals that, like physicians’ way of performing a diagnosis, massive healthy data can aid early diagnosis, such as of MCI, while also detecting late-stage disease much more accurately by discriminating with loss.
Various Disease Diagnosis: This first unsupervised various disease diagnosis can reliably detect the accumulation of subtle anatomical anomalies and hyper-intense enhancing lesions, such as AD and brain metastases on multi-sequence MRI scans.
Alzheimer’s disease diagnosis
Even though the clinical, social, and economic impact of early AD diagnosis, which is primarily associated with MCI detection [moscoso2019prediction], is of paramount importance [arvanitakis2019diagnosis], this diagnosis generally relies on subjective assessment by physicians (e.g., neurologists, geriatricians, and psychiatrists). Towards quantitative and reproducible approaches, many traditional supervised Machine Learning-based methods, which rely on handcrafted MRI-derived features, have been proposed in the literature [salvatore2015, nanni2019]. In this context, diffusion-weighted MRI tractography enables reconstructing the brain’s physical connections, which can subsequently be investigated by complex network-based techniques. Lella et al. [lella2020] employed whole-brain structural communicability as a graph-based metric to describe the AD-relevant disruption of brain connectivity. This approach achieved performance comparable to classic Machine Learning models (namely, Support Vector Machines, Random Forests, and Artificial Neural Networks) in terms of classification and feature importance analysis.
In recent years, Deep Learning has achieved outstanding performance by exploiting multiple levels of abstraction and descriptive embeddings in a hierarchy of increasingly complex features [lecun2015]: Liu et al. devised a semi-supervised CNN to significantly reduce the need for labeled training data [liu2014early]; for clinical decision-making tasks, Suk et al. integrated multiple sparse regression models (i.e., Deep Ensemble Sparse Regression Network) [suk2017]; Spasov et al. proposed a parameter-efficient CNN with 3D separable convolutions, combining dual learning and a specific layer to predict the conversion from MCI to AD within years [spasov2019]; different from CNN-based approaches, Parisot et al. used a semi-supervised Graph Convolutional Network trained on a subset of labeled nodes with diagnostic outcomes to represent sparse clinical data [parisot2018]. However, to the best of our knowledge, no existing work has conducted fully unsupervised anomaly detection for AD diagnosis, since capturing subtle anatomical differences between MCI and AD is challenging.
Unsupervised medical anomaly detection
Unsupervised disease diagnosis is challenging because it requires estimating healthy anatomy’s normative distributions only from healthy examples to detect outliers either in the learned feature space or from high reconstruction loss. The latest advances in Deep Learning, mostly GANs [goodfellow2014] and VAEs [kingma2013], have allowed for the accurate estimation of high-dimensional healthy distributions. Except for discriminative-boundary-based approaches such as [alaverdyan2020regularized], almost all unsupervised medical anomaly detection studies have leveraged reconstruction: as pioneering research, Schlegl et al. proposed AnoGAN to detect outliers in the learned feature space of the GAN [schlegl2017unsupervised]; then, the same authors presented fast AnoGAN, which can efficiently map query images onto the latent space [Schlegl]; since reconstruction-based models often suffer from many false positives, Chen et al. penalized large deviations between original/reconstructed images in glioma and stroke lesion detection on brain MRI [chen2020]. However, to the best of our knowledge, all previous studies are based on 2D/3D single image reconstruction, without considering continuity between multiple adjacent slices. Moreover, no existing work has investigated how unsupervised anomaly detection is associated with disease stages, various (i.e., more than two types of) diseases, or multi-sequence MRI scans.
Self-Attention GANs (SAGANs)
Zhang et al. proposed SAGAN, which deploys an SA mechanism in the generator/discriminator of a GAN to learn global and long-range dependencies for diverse image generation [zhang2019self]; for further performance improvement, they suggested applying the SA modules to large feature maps. SAGANs have shown great promise in various tasks, such as human pose estimation [wang2019improving, sharma2019robust], photo-realistic image de-quantization [zhang2020deep], and large-scale image generation [brock2018large]. This SAGAN trend also applies to Medical Imaging to extract multi-level features for better super-resolution/denoising and lesion characterization: to mitigate the problem of thin slice thickness, Kudo et al. and Li et al. applied the SA modules to GANs on CT and MRI scans, respectively [kudo2019virtual, li2020super]; similarly, in [lan2020sc], the authors proposed to fuse plane SA modules and depth SA modules for low-dose 3D CT denoising; Lan et al. synthesized multi-modal 3D brain images using an SA conditional GAN [lan2020sc]; Ali et al. incorporated SA modules into progressive growing of GANs to generate realistic and diverse skin lesion images for data augmentation [ali2019data]. However, to the best of our knowledge, no existing work has directly exploited the SAGAN for medical disease diagnosis.
AD MRI dataset: OASIS-3
We use a longitudinal dataset of / T1 brain axial MRI slices containing both normal aging subjects and AD patients, extracted from the Open Access Series of Imaging Studies-3 (OASIS-3) [lamontagne2018oasis]. The slices are zero-padded to reach pixels. Relying on the Clinical Dementia Rating (CDR) [Morris], a common clinical scale for the staging of dementia, the subjects comprise:
Unchanged CDR = 0: Cognitively healthy population;
CDR = 0.5: Very mild dementia (MCI);
CDR = 1: Mild dementia;
CDR = 2: Moderate dementia.
Since our dataset is longitudinal and the same subject’s CDRs may vary (e.g., CDR to CDR ), we only use scans with unchanged CDR to ensure that the scans are reliably healthy. As CDRs are not always assessed simultaneously with the MRI acquisition, we label MRI scans with the CDR assessed at the closest date. We only select brain MRI slices including the hippocampus/amygdala/ventricles among whole axial slices per scan to avoid over-fitting from AD-irrelevant information; atrophy of the hippocampus/amygdala/cerebral cortex and enlarged ventricles are strongly associated with AD, and thus they mainly affect the AD classification performance of Machine Learning [Ledig]. Moreover, we discard low-quality MRI slices. The remaining dataset is divided as follows:
Training set: Unchanged CDR ( subjects/ scans/ slices);
Test set: Unchanged CDR ( subjects/ scans/ slices),
CDR ( subjects/ scans/ slices),
CDR ( subjects/ scans/ slices),
CDR ( subjects/ scans/ slices).
The same subject’s scans are included in the same dataset. The datasets are strongly biased towards healthy scans similarly to MRI inspection in the clinical routine. During training for reconstruction, we only use the training set—structural MRI alone—containing healthy slices to conduct unsupervised learning. We do not use a validation set as our unsupervised diagnosis step is non-trainable.
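The labelling rules above (assigning each scan the CDR assessed at the closest date, and keeping only subjects whose CDR never changes as healthy) can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names, the tuple layout, and the example dates are hypothetical, and the healthy value CDR = 0 is taken from the header of Table 1.

```python
from datetime import date


def label_scan_with_closest_cdr(scan_date, cdr_assessments):
    """Return the CDR whose assessment date is nearest to the scan date.

    cdr_assessments is a list of (assessment_date, cdr_score) tuples
    for a single subject (hypothetical layout).
    """
    _, closest_cdr = min(
        cdr_assessments,
        key=lambda item: abs((item[0] - scan_date).days),
    )
    return closest_cdr


def has_unchanged_cdr(cdr_assessments, value=0.0):
    """True if every assessment of the subject equals `value`."""
    return all(cdr == value for _, cdr in cdr_assessments)


# Hypothetical subject whose CDR changes between two visits.
assessments = [(date(2010, 1, 10), 0.0), (date(2012, 3, 5), 0.5)]
scan_label = label_scan_with_closest_cdr(date(2011, 12, 1), assessments)
keep_as_healthy = has_unchanged_cdr(assessments)
```

In this example the scan is closer to the second visit, so it receives that visit's CDR, and the subject is excluded from the healthy group because the CDR changed over time.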
Brain metastasis and various disease MRI dataset
This paper also uses a non-longitudinal, heterogeneous dataset of /// T1c brain axial MRI slices, collected by the authors (National Center for Global Health and Medicine, Tokyo, Japan) and currently not publicly available due to ethical restrictions. The dataset contains healthy subjects, brain metastasis patients [rundo2018NC], and patients with various diseases different from brain metastases. The slices are resized to pixels. The various diseases include but are not limited to:
White matter lesions;
As with the T1 slices, we only select T1c slices including the hippocampus, amygdala, and ventricles, since a large portion of the various diseases also appears in the mid-brain. The remaining dataset is divided as follows:
Training set: Normal ( subjects/ scans/ slices);
Test set: Normal ( subjects/ scans/ slices),
Brain Metastases ( subjects/ scans/ slices),
Various Diseases ( subjects/ scans/ slices).
Since we cannot collect large-scale T1c scans from healthy subjects on the scale of the OASIS-3 dataset, during training for reconstruction, we use both T1/T1c training sets containing healthy slices simultaneously for knowledge transfer. In clinical practice, T1c MRI is well-established in detecting various diseases, including brain metastases [arvold2016updates], thanks to its high contrast in the enhancing region; however, the contrast agent is not suitable for screening studies. Accordingly, such inter-sequence knowledge transfer is valuable in computer-assisted MRI diagnosis. During testing, we make an unsupervised diagnosis on T1 and T1c scans separately.
MADGAN-based multiple adjacent brain MRI slice reconstruction
To model strong consistency in healthy brain anatomy (Fig. 1), in each scan, we reconstruct the next MRI slices from the previous ones using an image-to-image GAN (e.g., if a scan includes slices for , we reconstruct all possible setups: ; ; …; ). As Fig. 2 shows, our MADGAN uses a U-Net-like [Ronneberger, RundoUSEnet] generator with convolutional layers in encoders and deconvolutional layers in decoders, respectively, with skip connections, as well as a discriminator with decoders. We apply batch normalization to both convolution with Leaky Rectified Linear Unit (ReLU) and deconvolution with ReLU. Between the designated convolutional/deconvolutional layers and batch normalization layers, we apply SA modules [zhang2019self] for effective knowledge transfer via feature recalibration between T1 and T1c slices; we compare the MADGAN models with different numbers of SA modules: (i) no SA modules (i.e., MADGAN); (ii) 3 (red-contoured) SA modules (i.e., 3-SA MADGAN); (iii) 7 (red- and blue-contoured) SA modules (i.e., 7-SA MADGAN). To confirm how reconstructed slices’ realism and anatomical continuity affect medical anomaly detection, we also compare the MADGAN models with different loss functions: (i) WGAN-GP loss + 100 loss (i.e., MADGAN); (ii) WGAN-GP loss (i.e., MADGAN w/o loss).
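The sliding-window pairing of previous and next slices described at the start of this section can be sketched as below. The number of slices per window is missing from the text above; the default `k=3` is an assumption based on the later mention of "real/reconstructed 3 slices", and the function name is hypothetical.

```python
def adjacent_slice_pairs(num_slices, k=3):
    """List all (input_indices, target_indices) pairs in one scan.

    Each pair maps k consecutive slices to the k slices that follow them.
    k=3 is an assumed default, not a value confirmed by the text.
    """
    pairs = []
    for start in range(num_slices - 2 * k + 1):
        inputs = tuple(range(start, start + k))
        targets = tuple(range(start + k, start + 2 * k))
        pairs.append((inputs, targets))
    return pairs


# A toy scan with 8 slices yields 8 - 2 * 3 + 1 = 3 reconstruction setups.
pairs = adjacent_slice_pairs(8, k=3)
```

A scan with fewer than `2 * k` slices produces no pairs, because no window has a complete set of following slices.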
Each MADGAN training lasts for steps with a batch size of . We use learning rate for the Adam optimizer [kingma2014]. As with RGB images, we concatenate adjacent grayscale slices into channels. During training, the generator uses two dropout [srivastava2014dropout] layers with a rate of 0.5. We flip the discriminator’s real/synthetic labels once every three times for robustness. The framework is implemented in TensorFlow.
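To make the weighting in the generator objective concrete, the sketch below combines a WGAN-style adversarial term with a reconstruction term weighted by 100. The name of the reconstruction loss is missing from the text above; the mean absolute error used here is an assumption (a common choice in image-to-image GANs), and the function name and flat-list inputs are hypothetical simplifications of what would be tensor operations in the actual TensorFlow implementation.

```python
def generator_objective(critic_scores_fake, real_pixels, fake_pixels,
                        recon_weight=100.0):
    """Adversarial term plus weighted reconstruction term for the generator.

    The adversarial term is the negated mean critic score on reconstructed
    slices, as in WGAN; the gradient penalty only enters the critic's loss.
    The reconstruction term is a mean absolute error (assumed, see lead-in).
    """
    adversarial = -sum(critic_scores_fake) / len(critic_scores_fake)
    reconstruction = sum(
        abs(r - f) for r, f in zip(real_pixels, fake_pixels)
    ) / len(real_pixels)
    return adversarial + recon_weight * reconstruction


# Toy values: two critic scores and two pixel intensities.
loss_value = generator_objective([0.5, 1.5], [0.0, 1.0], [0.1, 0.8])
```

With a weight of 100, even a small per-pixel error dominates the adversarial term, which matches the described behavior of reconstructing well only for images similar to the training distribution.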
Unsupervised medical anomaly detection
During diagnosis, we use average loss per scan since squared error is sensitive to outliers and it significantly outperformed other losses (i.e., loss, Dice loss, Structural Similarity loss) in our preliminary paper [han2020CIBB]. To evaluate its unsupervised AD diagnosis performance on a T1 MRI test set, we show ROCs —along with the AUC values—between CDR vs (i) all the other CDRs; (ii) CDR ; (iii) CDR ; (iv) CDR . We also show the AUCs under different training steps (i.e., k, k, k, k, M steps) and confirm the effect of calculating average loss (among whole slices or continuous 10 slices exhibiting the highest loss) per scan. Moreover, we visualize pixelwise loss between real/reconstructed 3 slices, along with distributions of average loss per scan of CDR to know how disease stages affect its discrimination. In exactly the same manner, we evaluate the diagnosis performance of brain metastases/various diseases on a T1c MRI test set, showing ROCs/AUCs between normal vs (i) brain metastases + various diseases; (ii) brain metastases; (iii) various diseases.
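The two ways of computing the per-scan score compared above (averaging over all slices, or over the continuous 10 slices exhibiting the highest loss) can be sketched as follows. This assumes that a per-slice loss has already been computed between real and reconstructed slices, and it interprets "continuous 10 slices exhibiting the highest loss" as the contiguous window with the highest mean loss; both the interpretation and the function name are assumptions.

```python
def scan_anomaly_score(slice_losses, window=None):
    """Average per-slice loss over a scan.

    With window=None, all slices are averaged. Otherwise, the score is the
    highest mean loss over any contiguous run of `window` slices.
    """
    if not slice_losses:
        raise ValueError("slice_losses must not be empty")
    if window is None or window >= len(slice_losses):
        return sum(slice_losses) / len(slice_losses)
    return max(
        sum(slice_losses[i:i + window]) / window
        for i in range(len(slice_losses) - window + 1)
    )


# Toy per-slice losses for one scan (hypothetical values).
losses = [1.0, 2.0, 6.0, 4.0, 1.0]
whole_scan_score = scan_anomaly_score(losses)
top_window_score = scan_anomaly_score(losses, window=2)
```

The windowed variant emphasizes a localized run of poorly reconstructed slices, whereas the whole-scan average dilutes it across well-reconstructed slices.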
Reconstructed brain MRI slices
Fig. 3 illustrates example real T1 MRI slices from a test set and their reconstruction by MADGAN and 7-SA MADGAN. Similarly, Figs. 4 and 5 show example real T1c MRI slices and their reconstructions. Figs. 6 and 7 indicate distributions of average loss per scan on T1 and T1c scans, respectively. Thanks to loss’ good realism sacrificing diversity (i.e., generalizing well only for unseen images with a similar distribution to training images) and WGAN-GP loss’ ability to capture recognizable structure, the MADGAN can successfully capture T1-specific appearance and anatomical changes from the previous slices. Meanwhile, the 7-SA MADGAN tends to be less stable in keeping texture but more sensitive to abnormal anatomical changes due to the SA modules’ feature recalibration, resulting in moderately higher average loss than the MADGAN.
Since the models are trained only on healthy slices, as visualized by a superimposed Jet colormap, reconstruction of slices with higher CDRs tends to fail more often, especially around the hippocampus, amygdala, cerebral cortex, and ventricles, due to their insufficient atrophy after reconstruction. The T1c scans show much lower average loss than the T1 scans due to darker texture. Since most training images are T1 slices with brighter texture than the T1c slices, reconstruction quality clearly decreases on the T1c slices, which occasionally exhibit bright texture. Accordingly, reconstruction failure from anomaly contributes comparatively less to the average loss, especially when local small lesions, such as brain abscess and enhanced lesions, appear, unlike global big lesions, such as multiple cerebral infarction and blood component retention. However, the average loss remarkably increases on brain metastases scans due to their hyper-intensity, especially for the 7-SA MADGAN.
Unsupervised anomaly detection results
Figs. 8 and 9 show AUCs of unsupervised anomaly detection on T1 and T1c scans under different training steps, respectively. The AUCs generally increase as training progresses, but more SA modules require more training steps until convergence due to their feature recalibration—7-SA MADGAN might perform even better if we continue its training. All the best results in specific tasks, except for CDR vs CDR , are from the SA models (e.g., 7-SA MADGAN w/o loss under 900k steps: AUC in CDR vs CDR , 3-SA MADGAN under 300k steps: AUC in normal vs brain metastases, 3-SA MADGAN under 600k steps: AUC in normal vs various diseases); thus, whereas the SA models, which do not know the task to optimize in an unsupervised manner, perform unstably, we might use them similarly to supervised learning if we could obtain good parameters for a certain disease. Without loss, the AUCs tend to decrease, also accompanying large fluctuations; 7-SA MADGAN w/o loss performs well on the T1 scans but poorly on the T1c scans due to the instability.
Figs. 10 and 11 illustrate ROC curves and their AUCs on T1 and T1c scans under M training steps, respectively. Since brains with higher CDRs exhibit stronger anatomical atrophy than healthy brains, their AUCs against unchanged CDR increase remarkably as CDRs increase. MADGAN and 7-SA MADGAN both achieve good AUCs, especially for higher CDRs: the MADGAN obtains AUC in CDR vs CDR , respectively; the discrimination between healthy subjects and MCI patients (i.e., CDR vs CDR ) is extremely difficult even in a supervised manner [Ledig]. Whereas detecting various diseases is difficult in an unsupervised manner, the 7-SA MADGAN outperforms the MADGAN and achieves AUC in brain metastases detection. As Tables 1 and 2 show, the method used to calculate the average loss per scan (among whole slices or continuous 10 slices exhibiting the highest loss) has a limited effect.
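For readers who want to reproduce this kind of evaluation, the AUC between a healthy group and an abnormal group can be computed directly from per-scan scores. The sketch below uses the pairwise-ranking definition of AUC (the probability that a randomly chosen abnormal scan scores higher than a randomly chosen healthy one, with ties counted as one half); the function name and the toy scores are hypothetical and this is not the authors' evaluation code.

```python
def roc_auc(healthy_scores, abnormal_scores):
    """AUC via pairwise comparison of abnormal and healthy scan scores."""
    wins = 0.0
    for abnormal in abnormal_scores:
        for healthy in healthy_scores:
            if abnormal > healthy:
                wins += 1.0
            elif abnormal == healthy:
                wins += 0.5
    return wins / (len(abnormal_scores) * len(healthy_scores))


# Toy per-scan average losses (hypothetical values).
healthy = [0.1, 0.2, 0.3]
abnormal = [0.25, 0.4]
auc = roc_auc(healthy, abnormal)
```

Under this definition, an AUC below 0.5 means that the score tends to rank healthy scans above abnormal ones, which is one way to read the values well below 0.5 reported for 7-SA MADGAN w/o loss on the T1c scans.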
Discussion and conclusions
Using massive healthy data, our MADGAN-based multiple MRI slice reconstruction can reliably discriminate AD patients from healthy subjects for the first time in an unsupervised manner; to detect the accumulation of subtle anatomical anomalies, our solution leverages a two-step approach: (Reconstruction) loss generalizes well only for unseen images with a similar distribution to training images while WGAN-GP loss captures recognizable structure; (Diagnosis) loss clearly discriminates healthy/abnormal data as squared error becomes huge for outliers. Using healthy T1 MRI scans for training, our approach can detect AD at a very early stage, MCI, with AUC while detecting AD at a late stage with AUC . Accordingly, this first unsupervised anomaly detection across different disease stages reveals that, like physicians’ way of performing a diagnosis, large-scale healthy data can reliably aid early diagnosis, such as of MCI, while also detecting late-stage disease much more accurately.
To confirm its ability to also detect other various diseases, even on different MRI sequence scans, we are the first to investigate how unsupervised medical anomaly detection is associated with various diseases and multi-sequence MRI scans, respectively. Due to the different texture of T1/T1c slices, reconstruction quality clearly decreases on the data-sparse T1c slices, and thus reconstruction failure from anomaly contributes comparatively less to the average loss. Nevertheless, we generally succeed in distinguishing diseases that are hard to detect from those that are easy to detect in an unsupervised manner: it is hard to detect local small lesions, such as brain abscess and enhanced lesions, but it is easy to detect hyper-intense enhancing lesions, such as brain metastases (AUC ), especially for 7-SA MADGAN thanks to its feature recalibration. Our visualization of differences between real/reconstructed slices might play a key role in understanding and preventing various diseases, including rare diseases.
As future work, we will investigate more suitable SA modules in a reconstruction model, such as the Dual Attention Network, which captures feature dependencies in both spatial/channel dimensions [fu2019dual]; here, optimizing where to place the SA modules and how many to use is the most relevant aspect. We will validate combining new loss functions for both reconstruction/diagnosis, including sparsity regularization [zhou2020sparse], structural similarity [haselmann2018anomaly], and perceptual loss [tuluptceva2020anomaly]. Lastly, we plan to collect a larger number of healthy T1c scans to reliably detect and locate various diseases, including cancers and rare diseases. Integrating multi-modal imaging data, such as Positron Emission Tomography with specific radiotracers [rundoCMPB2017], might further improve disease diagnosis [brier2016], even when analyzed modalities are not always available [li2014multimodal].
List of abbreviations used
Area Under the Curve: AUC, AutoEncoder: AE, Alzheimer’s Disease: AD, Clinical Dementia Rating: CDR, Contrast-enhanced T1-weighted: T1c, Convolutional Neural Network: CNN, Computed Tomography: CT, Generative Adversarial Network: GAN, Magnetic Resonance Imaging: MRI, Medical Anomaly Detection Generative Adversarial Network: MADGAN, Mild Cognitive Impairment: MCI, Open Access Series of Imaging Studies-3: OASIS-3, Receiver Operating Characteristic: ROC, Rectified Linear Unit: ReLU, Self-Attention: SA, T1-weighted: T1, Variational AutoEncoder: VAE, Wasserstein loss with Gradient Penalty: WGAN-GP.
The authors declare that they have no competing interests.
Conceived the idea: CH, LR, ZAM, KM. Designed the code: CH, LR, ZAM. Collected the T1c dataset: TN. Implemented the code: CH. Performed the experiments: CH. Analyzed the results: CH, LR. Wrote the manuscript: CH, LR. Critically read the manuscript and contributed to the discussion of the whole work: KM, TN, ZAM, YS, SK, ES, HN, SS.
This research was partially supported both by AMED Grant Number JP18lk1010028 and The Mark Foundation for Cancer Research and Cancer Research UK Cambridge Centre [C9685/A25177]. Additional support has been provided by the National Institute of Health Research (NIHR) Cambridge Biomedical Research Centre. Zoltán Ádám Milacski was supported by Grant Number VEKOP-2.2.1-16-2017-00006. The OASIS-3 dataset has Grant Numbers P50 AG05681, P01 AG03991, R01 AG021910, P50 MH071616, U24 RR021382, and R01 MH56584.
| CDR = 0 vs | CDR = 0.5 + 1 + 2 | CDR = 0.5 | CDR = 1 | CDR = 2 |
|---|---|---|---|---|
| MADGAN (10 slices) | 0.764 | 0.745 | 0.793 | 0.830 |
| MADGAN w/o Loss | 0.693 | 0.689 | 0.699 | 0.711 |
| MADGAN w/o Loss (10 slices) | 0.705 | 0.697 | 0.717 | 0.736 |
| 3-SA MADGAN (10 slices) | 0.739 | 0.725 | 0.760 | 0.810 |
| 3-SA MADGAN w/o Loss | 0.728 | 0.715 | 0.748 | 0.785 |
| 3-SA MADGAN w/o Loss (10 slices) | 0.735 | 0.721 | 0.756 | 0.806 |
| 7-SA MADGAN (10 slices) | 0.764 | 0.743 | 0.798 | 0.835 |
| 7-SA MADGAN w/o Loss | 0.759 | 0.727 | 0.809 | 0.894 |
| 7-SA MADGAN w/o Loss (10 slices) | 0.746 | 0.710 | 0.803 | 0.868 |

| Normal vs | BM + VD | BM | VD |
|---|---|---|---|
| MADGAN (10 slices) | 0.769 | 0.905 | 0.607 |
| MADGAN w/o Loss | 0.688 | 0.773 | 0.586 |
| MADGAN w/o Loss (10 slices) | 0.696 | 0.778 | 0.597 |
| 3-SA MADGAN (10 slices) | 0.760 | 0.871 | 0.626 |
| 3-SA MADGAN w/o Loss | 0.677 | 0.749 | 0.589 |
| 3-SA MADGAN w/o Loss (10 slices) | 0.708 | 0.780 | 0.622 |
| 7-SA MADGAN (10 slices) | 0.776 | 0.917 | 0.608 |
| 7-SA MADGAN w/o Loss | 0.233 | 0.063 | 0.436 |
| 7-SA MADGAN w/o Loss (10 slices) | 0.234 | 0.091 | 0.405 |