I Introduction
According to [1], the economic costs of mental disorders are among the highest of all illnesses in terms of direct and indirect costs, impact on economic growth, and the statistical value of life. One essential tool for better understanding mental illness is noninvasive neuroimaging (e.g., structural magnetic resonance imaging (MRI)) combined with machine learning to learn brain structure.
Deep Learning has been integral to the successes of machine learning for numerous demanding real-world applications, e.g., state-of-the-art image classification [2] and self-driving cars [3]. While many of Deep Learning’s successes involve supervised learning, supervised approaches can fail when data annotation (e.g., labels) is limited or unavailable. When there is sufficient data, supervised models can not only perform well on holdout sets but also provide representations that generalize well to other supervised settings [4]. However, when there is insufficient data, a supervised learner tends to discriminate on low-level (e.g., pixel-level, trivial) information, which hurts generalization performance. A model that generalizes well needs to extract meaningful high-level information (e.g., a collection of important features at the input level). In order to address this, many successful applications of machine learning to neuroscience rely on unsupervised learning [5, 6, 7, 8] to extract representations of brain imaging data. These representations are then used as input to an off-the-shelf classifier (i.e., semi-supervised learning).
However, prior work on unsupervised learning of brain imaging data is either linear or weakly nonlinear [5, 6] or highly restrictive in parameterization [7], and does not represent a flexible methodology for learning representations.
In this work, we explore using Deep InfoMax (DIM) [9] to learn deep nonlinear representations of neuroimaging data as the output of a convolutional neural network. DIM works by maximizing the mutual information between a high-level feature vector and low-level feature maps of a highly flexible convolutional encoder network. It does so by training a second neural network to maximize a lower bound on a divergence (a probabilistic measure of difference) between the joint distribution and the product of marginals of the encoder input and output. The estimates provided by this second network can be used to maximize the mutual information of the features in the encoder with the input. Unlike popular unsupervised autoencoding approaches such as the VAE [10], DIM does not require a decoder, which significantly reduces the memory requirements of the model for volumetric data. We evaluate DIM by performing a downstream classification task between four groups: patients with stable and progressive MCI, patients with Alzheimer’s disease, and healthy controls, using only the resulting DIM representation as input to the classifier. We compare DIM to two convolutional networks with AlexNet-inspired [11] and ResNet-inspired [12] architectures trained with supervised learning. Under strict evaluation, we show performance comparable to the supervised methods and to previously reported [13, 14, 15, 16] classification performance.
II Materials and Methods
II-A Deep InfoMax
Let $X$ and $Y$ be the input and output variables of a neural network encoder $E_\psi$ with parameters $\psi$, where $\mathcal{X}$ and $\mathcal{Y}$ are its domain and range. We wish to find the parameters that maximize the following objective:

$$(\hat{\omega}, \hat{\psi}) = \operatorname*{arg\,max}_{\omega, \psi} \, \hat{\mathcal{I}}_{\omega}(X; E_\psi(X)), \qquad (1)$$

where $\hat{\mathcal{I}}_{\omega}$ is the mutual information estimate provided by a different network with parameters $\omega$, and $E_\psi(X)$ is the output of the encoder.
A parametric estimator for the mutual information can be found by training a statistics network to maximize a lower bound based on the Fenchel dual [17] or the Donsker–Varadhan representation [18, 19] of the Kullback–Leibler divergence. The Donsker–Varadhan-based estimator is a consistent, asymptotically unbiased estimator that has been shown to outperform nonparametric estimators, and it can also be used to improve deep generative models [19]. However, the KL divergence is unbounded, which can be problematic if the above estimators are used for training deterministic neural network encoders. [9] showed that using an estimator based on the Jensen–Shannon divergence (JSD) (i.e., simple binary cross-entropy) is more stable and works well in practice, and it has been shown that this estimator also yields a good estimator for mutual information [9, 20]:

$$\hat{\mathcal{I}}^{(\mathrm{JSD})}_{\omega,\psi}(X; E_\psi(X)) := \mathbb{E}_{\mathbb{P}}\!\left[-\mathrm{sp}\!\left(-T_{\psi,\omega}(x, E_\psi(x))\right)\right] - \mathbb{E}_{\mathbb{P}\times\tilde{\mathbb{P}}}\!\left[\mathrm{sp}\!\left(T_{\psi,\omega}(x', E_\psi(x))\right)\right], \qquad (2)$$

where $T_{\psi,\omega}$ is a statistics network with parameters $\omega$, $\mathrm{sp}(z) = \log(1 + e^{z})$ (the softplus function), and $x'$ is another input sampled from the data distribution independently from $x$. In addition, the Noise-Contrastive variant of the estimator (NCE) [21] was shown to work well in practice [9]:
$$\hat{\mathcal{I}}^{(\mathrm{NCE})}_{\omega,\psi}(X; E_\psi(X)) := \mathbb{E}_{\mathbb{P}}\!\left[T_{\psi,\omega}(x, E_\psi(x)) - \log \sum_{x' \in \tilde{X}} e^{T_{\psi,\omega}(x', E_\psi(x))}\right]. \qquad (3)$$

Here, $\tilde{X}$ is a set of samples containing a set $X'$ of negative samples drawn from the data distribution, such that there is exactly one positive example in $\tilde{X}$ ($x$ occurs exactly once).
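As a minimal illustration of the two objectives (a NumPy sketch, not the paper’s implementation: the statistics network is replaced by a precomputed matrix of scores between each representation and each candidate input, with positive pairs on the diagonal):

```python
import numpy as np

def softplus(z):
    # numerically stable softplus: sp(z) = log(1 + exp(z))
    return np.logaddexp(0.0, z)

def jsd_bound(scores):
    """JSD-based objective of eq. (2): diagonal entries of the score
    matrix are positive (joint) pairs, off-diagonal entries are
    negative (product-of-marginals) pairs."""
    n = scores.shape[0]
    pos = np.diag(scores)
    neg = scores[~np.eye(n, dtype=bool)]
    return np.mean(-softplus(-pos)) - np.mean(softplus(neg))

def nce_bound(scores):
    """NCE-style objective of eq. (3): each row holds the scores of one
    representation against every candidate input; the diagonal entry is
    the single positive sample in that row."""
    pos = np.diag(scores)
    lse = np.log(np.exp(scores).sum(axis=1))  # log-sum-exp over candidates
    return np.mean(pos - lse)

rng = np.random.default_rng(0)
T = rng.normal(size=(8, 8)) + 5.0 * np.eye(8)  # positives score higher
```

Boosting the diagonal (i.e., making positive pairs easier to discriminate from negatives) increases both objectives, which is what training the statistics network exploits.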
[9] showed that maximizing the mutual information between the complete input and output of an encoder is insufficient for learning good representations for downstream classification tasks, as this approach can still focus on lower-level “trivial” localized details. Instead, they show that maximizing the mutual information between the high-level representation and patches of an input image can achieve highly competitive results. The intuition is that this approach encourages the high-level representation to learn information that is shared across the input. It is suitable for many classification tasks, as we expect that class-discriminative features should be evident across many spatial locations of the input. For a convolutional encoder, the local DIM objective can be written in a compact form:
$$\max_{\omega, \psi} \, \frac{1}{M} \sum_{i=1}^{M} \hat{\mathcal{I}}_{\omega,\psi}\!\left(C^{(i)}_{\psi}(X); E_\psi(X)\right), \qquad (4)$$

where $C^{(i)}_{\psi}(X)$ is the $i$-th of $M$ feature map locations from the encoder (each with a limited receptive field corresponding to an input patch) at some intermediate layer of the network.
Due to the stronger performance of the AlexNet architecture (Section II-B) in our experiments (see Section IV), we used it as the encoder for the DIM method. We replaced the last linear layer of AlexNet with a layer producing the output representation.
To estimate mutual information using eq. (4) we used the encode-and-dot-product architecture from [9]. First, patches taken from the third convolutional layer of AlexNet were mapped using the convolutional encode-and-dot architecture of [9], and the representation was mapped using the linear encode-and-dot architecture of [9]. Then the flattened encoded mappings of patches and representations were combined using the dot product to efficiently create real and fake samples. A real sample is the dot product of a “local” patch mapping and its own “global” representation mapping, while a fake sample pairs a “local” patch mapping with the “global” representation of an unrelated input. Finally, we estimated the JSD-based loss of eq. (2) and the NCE-based loss of eq. (3) using these samples. Since NCE needs more negative samples to be competitive with JSD [9], all possible combinations between the patch and representation mappings were used in a similar way to create negative samples.
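The pairing scheme can be sketched as follows (a simplified NumPy illustration with hypothetical sizes; the encode-and-dot networks from [9] that would produce these embeddings are omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
batch, n_patches, dim = 4, 6, 16   # hypothetical sizes

# embeddings of "local" feature-map patches and "global" representations,
# both already mapped into a shared space of size `dim`
local_emb = rng.normal(size=(batch, n_patches, dim))
global_emb = rng.normal(size=(batch, dim))

# all dot products between every patch of sample i and the global
# representation of sample j: shape (batch, batch, n_patches)
scores = np.einsum('ipd,jd->ijp', local_emb, global_emb)

# "real" (positive) scores pair a patch with its own input's representation;
# "fake" (negative) scores pair it with another input's representation
pos = scores[np.arange(batch), np.arange(batch)]   # (batch, n_patches)
neg = scores[~np.eye(batch, dtype=bool)]           # (batch*(batch-1), n_patches)
```

Computing all cross-sample dot products at once is what makes negative sampling essentially free here, which is why every patch–representation combination can be used as a negative.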
To evaluate the performance of the representation learned by DIM, we trained three additional neural networks, using as input the features from the last convolutional layer, the first fully connected layer, and the final fully connected representation, which we call Conv, FC, and Z, respectively. Each classifier is composed of one fully connected layer with hidden units, dropout [22], batch normalization [23], and a ReLU [24] activation.

AlexNet

3D Conv  BN 3D  ReLU  MP 3D 
3D Conv  BN 3D  ReLU  MP 3D 
3D Conv  BN 3D  ReLU 
3D Conv  BN 3D  ReLU 
3D Conv  BN 3D  ReLU  MP 3D 
Linear  BN 1D  ReLU 
Linear  SoftMax  ArgMax 
ResNet 
3D Conv  BN 3D  ReLU  MP 3D 
Residual Layer 1 
BB0  2 x (3D Conv  BN 3D  ReLU) 
BB1  2 x (3D Conv  BN 3D  ReLU) 
Residual Layer 2 
BB0  3D Conv  BN 3D  ReLU 
BB0  3D Conv  BN 3D  ReLU 
BB0 downsample  3D Conv  BN 3D 
BB1  2 x (3D Conv  BN 3D  ReLU) 
Residual Layer 3 
BB0  3D Conv  BN 3D  ReLU 
BB0  3D Conv  BN 3D  ReLU 
BB0 downsample  3D Conv  BN 3D 
BB1  2 x (3D Conv  BN 3D  ReLU) 
MaxPool 3D 
Linear  BN 1D  ReLU 
Linear  SoftMax  ArgMax 
II-B Supervised baselines
As baselines, we considered two supervised convolutional networks, one based on a simplified AlexNet [11] architecture and the other on a ResNet [12] architecture. Both networks use convolutions and max pooling with volumetric kernels, batch normalization, ReLU activations, and two fully connected layers at the end (see Tab. I for details). The notation in Tab. I is: BN for batch normalization, BB for a basic block, and MP for max pooling; for convolutions, the parameters are the number of input and output channels, the kernel size, the stride, and the padding, respectively. Cross-entropy loss is used as the training objective.
II-C Regularization
For small datasets, it is common to penalize the model parameters by driving most of them to zero using $\ell_1$ regularization. Formally, this penalty is defined as:

$$\Omega(\theta) = \lambda \lVert \theta \rVert_1, \qquad (5)$$

where $\theta$ is the parameter vector of the model and $\lambda$ is a coefficient. The $\ell_1$ regularization imposes a sparse solution. This penalty is added to the JSD, NCE, and cross-entropy losses in different settings. For our experiments, we used a fixed value of $\lambda$.
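A minimal sketch of the penalty (pure NumPy, with a made-up coefficient; in our PyTorch setting the analogue is summing the absolute values of the model’s parameters and adding the result to the training loss):

```python
import numpy as np

def l1_penalty(params, lam):
    """Eq. (5): lam times the sum of absolute parameter values,
    accumulated over all parameter arrays of the model."""
    return lam * sum(np.abs(p).sum() for p in params)

# hypothetical parameter arrays (e.g., a weight matrix and a bias vector)
params = [np.array([[0.5, -1.0], [0.0, 2.0]]), np.array([0.25, -0.25])]
base_loss = 1.0                                   # stand-in for JSD/NCE/cross-entropy
total_loss = base_loss + l1_penalty(params, lam=0.01)
```

Because the penalty grows linearly in each parameter’s magnitude, gradient descent drives small weights exactly toward zero, which is what yields the sparse solutions used by the “Sparse” models.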
III Experiments
III-A Datasets and preprocessing
For the downstream classification task, the data was obtained from the ADNI database adni.loni.usc.edu (for up-to-date information, see www.adni-info.org). We use T1-weighted MRI images of subjects from four groups: patients with stable and progressive MCI, patients with Alzheimer’s disease, and healthy controls.
Structural MRI (sMRI) data was preprocessed to grey matter volume (modulated) maps using the SPM12 toolbox. To segment grey matter, the MRI images were spatially normalized and smoothed with a 6 mm full width at half maximum (FWHM) 3D Gaussian kernel. After quality control, two subjects from the ADNI dataset were excluded, and the remaining subjects form the final dataset.
III-B Experimental setup
III-B1 Data
The dataset was divided into cross-validation and holdout test sets using a stratified split. The cross-validation subjects were then split into five stratified folds.
For the AlexNet and ResNet architectures, we used simple data augmentation of the training dataset to reduce overfitting to the small number of annotated samples available. Our augmentation consisted of zero padding and random cropping along all dimensions, along with randomly flipping the input along each axis. The whole brain was included in the crop. For DIM, we did not use data augmentation, but we used zero padding to equalize the input size along all dimensions.
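The augmentation can be sketched as follows (a simplified NumPy version with a hypothetical pad width; the original implementation and the exact crop sizes, chosen so the whole brain survives the crop, are not reproduced here):

```python
import numpy as np

def augment(volume, pad=4, rng=None):
    """Zero-pad, randomly crop back to the original size, and randomly
    flip each axis with probability 0.5 (hypothetical parameters)."""
    rng = rng or np.random.default_rng()
    out_shape = volume.shape
    padded = np.pad(volume, pad)                       # zero padding on every axis
    corner = [rng.integers(0, 2 * pad + 1) for _ in out_shape]
    slices = tuple(slice(c, c + s) for c, s in zip(corner, out_shape))
    cropped = padded[slices]                           # random crop to original size
    for axis in range(cropped.ndim):
        if rng.random() < 0.5:                         # random flip per axis
            cropped = np.flip(cropped, axis=axis)
    return cropped

vol = np.arange(3 * 4 * 5, dtype=float).reshape(3, 4, 5)
aug = augment(vol, pad=2, rng=np.random.default_rng(0))
```

Padding before cropping lets the random crop translate the brain within the volume without ever cutting voxels that padding did not introduce.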
III-B2 Training
The models were trained using the AMSGrad [25] optimizer, with separate learning rates for the CNN models and for DIM, and with a fixed batch size, dropping the incomplete last batch. The supervised architectures were trained for a fixed number of epochs; DIM was first pretrained, and then the classifiers were trained for additional epochs on top of the frozen features from the encoder.
III-B3 Evaluation
III-B4 Implementation and hardware
IV Results
The final trained models used to evaluate performance were selected based on the best balanced accuracy, but from a checkpoint where the validation score was lower than the training score. We gave the model a burn-in period before applying this rule to deal with initial stochasticity. The model notations are as follows: Aug denotes augmentation of the training dataset; the first Sparse denotes a model trained with $\ell_1$ regularization; the second denotes a classifier trained on top of the frozen features from an encoder trained using $\ell_1$ regularization; SS stands for training an unsupervised model with an additional supervised loss from the classifier.
Table II reports the balanced accuracy rates, including mean and standard deviation, and the gap between mean values on cross-validation and holdout. Bold text distinguishes the best scores and the names of the models. The last column shows the p-value and statistic for the one-sided Wilcoxon test; bold p-values indicate acceptance of the null hypothesis. The test compares each method with the best model (Sparse AlexNet Aug) based on the five values of balanced accuracy on holdout; the alternative hypothesis is that the model Sparse AlexNet Aug is better. Fig. 1 highlights the distributions of the performance. With all modifications, ResNet shows lower performance on holdout than AlexNet. This is reasonable, since the capacity of the ResNet architecture is larger and the dataset is small. The Wilcoxon test also rejects H0, supporting the worse performance of ResNet. The performance of JSD Conv, JSD Conv SS, Sparse JSD Sparse Conv, AlexNet, and AlexNet Aug is statistically indistinguishable from that of Sparse AlexNet Aug. It follows that unsupervised DIM has performance comparable to the supervised methods.
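Balanced accuracy here is the mean of per-class recalls [26], which guards against the unevenly sized diagnostic groups inflating the score. A minimal sketch (assuming integer labels 0–3; not the scikit-learn [27] implementation used in practice):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, n_classes=4):
    """Mean of per-class recalls: each class contributes equally,
    regardless of how many subjects it contains."""
    recalls = []
    for c in range(n_classes):
        mask = (y_true == c)
        if mask.any():
            recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))

# toy example: one of three class-0 subjects is misclassified
y_true = np.array([0, 0, 0, 1, 2, 3])
y_pred = np.array([0, 0, 1, 1, 2, 3])
```

Plain accuracy on this toy example is 5/6, while balanced accuracy is (2/3 + 1 + 1 + 1)/4, showing how errors in a large class are weighted by that class’s own size rather than the whole sample.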
Among the DIM variants, JSD has higher scores than NCE. The lower scores of NCE can be explained by its requirement for a large number of negative samples during training to be competitive with JSD. Our dataset is not large enough to support the needed level of negative sampling.
The best score with convolutional features was obtained by an encoder and classifier trained with $\ell_1$ regularization, i.e., the Sparse JSD Sparse Conv model. For features from the fully connected layer, the JSD FC SS model, which uses the semi-supervised loss, was the best. However, Sparse JSD Sparse FC has similar results and a smaller mean gap, at the cost of a lower mean cross-validation score. For the smallest representation, Z, the semi-supervised model JSD Z SS gives the best performance, but similar results were obtained by the Sparse NCE Z model. The semi-supervised loss and $\ell_1$ regularization improved the models’ generalization by reducing the gap between cross-validation and holdout scores. The observed degradation in performance from Conv to FC to Z can be explained by the reduced capacity of the features; the $\ell_1$ regularization and dropout could also be adjusted. However, a more compact input representation can be of independent use, for example, for dimensionality reduction.
In previous studies, the best accuracy for the ResNet architecture in a 4-class sMRI classification task was reported in [13], while stacked autoencoders (SAE) [15] reported results for sMRI alone and for sMRI+PET, and DW-SMTL [16] for sMRI alone or for sMRI+PET+CSF. Our values are not directly comparable, since the evaluation differs. Our reproduced ResNet can be used as a proxy to estimate performance relative to this prior work; note, however, that it is not one of the best-performing methods in our study.

V Conclusions
This work proposes DIM, an unsupervised method for learning representations from structural neuroimaging data. Evaluation on the prediction of progression to Alzheimer’s disease demonstrates results comparable to supervised methods. In the future, we will scale up our experiments with increased sample sizes and address other diseases. Our future efforts will also focus on the multimodal fusion of brain imaging data [31] to increase the predictive strength of the model.
Acknowledgement
This study is supported by NIH grants R01EB020407, R01EB006841, P20GM103472, P30GM122734.
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
References
[1] S. Trautmann, J. Rehm, and H.-U. Wittchen, “The economic costs of mental disorders: Do our societies react appropriately to the burden of mental disorders?” EMBO Reports, p. e201642951, 2016.
[2] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 2261–2269.
[3] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun, “UPSNet: A unified panoptic segmentation network,” arXiv preprint arXiv:1901.03784, 2019.
[4] L. Zhang, T. Xiang, and S. Gong, “Learning a deep embedding model for zero-shot learning,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 3010–3019.
[5] V. D. Calhoun, T. Adali, G. D. Pearlson, and J. Pekar, “A method for making group inferences from functional MRI data using independent component analysis,” Human Brain Mapping, vol. 14, no. 3, pp. 140–151, 2001.
[6] R. D. Hjelm, V. D. Calhoun, R. Salakhutdinov, E. A. Allen, T. Adali, and S. M. Plis, “Restricted Boltzmann machines for neuroimaging: an application in identifying intrinsic networks,” NeuroImage, vol. 96, pp. 245–260, 2014.
[7] E. Castro, R. D. Hjelm, S. M. Plis, L. Dinh, J. A. Turner, and V. D. Calhoun, “Deep independence network analysis of structural brain imaging: application to schizophrenia,” IEEE Transactions on Medical Imaging, vol. 35, no. 7, pp. 1729–1740, 2016.
 [8] S. M. Plis, D. R. Hjelm, R. Salakhutdinov, E. A. Allen, H. J. Bockholt, J. D. Long, H. J. Johnson, J. S. Paulsen, J. A. Turner, and V. D. Calhoun, “Deep learning for neuroimaging: a validation study,” Frontiers in neuroscience, vol. 8, p. 229, 2014.
[9] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio, “Learning deep representations by mutual information estimation and maximization,” arXiv preprint arXiv:1808.06670, 2018.
[10] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [13] A. Abrol, M. Bhattarai, A. Fedorov, Y. Du, S. Plis, and V. Calhoun, “Deep residual learning for neuroimaging: An application to predict progression to alzheimer’s disease,” bioRxiv, p. 470252, 2018.
 [14] S. Vieira, W. H. Pinaya, and A. Mechelli, “Using deep learning to investigate the neuroimaging correlates of psychiatric and neurological disorders: Methods and applications,” Neuroscience & Biobehavioral Reviews, vol. 74, pp. 58–75, 2017.
[15] S. Liu, S. Liu, W. Cai, H. Che, S. Pujol, R. Kikinis, D. Feng, M. J. Fulham, and ADNI, “Multimodal neuroimaging feature learning for multiclass diagnosis of Alzheimer’s disease,” IEEE Transactions on Biomedical Engineering, vol. 62, no. 4, pp. 1132–1140, April 2015.
[16] H.-I. Suk, S.-W. Lee, D. Shen, A. D. N. Initiative et al., “Deep sparse multi-task learning for feature selection in Alzheimer’s disease diagnosis,” Brain Structure and Function, vol. 221, no. 5, pp. 2569–2587, 2016.
[17] S. Nowozin, B. Cseke, and R. Tomioka, “f-GAN: Training generative neural samplers using variational divergence minimization,” in Advances in Neural Information Processing Systems, 2016, pp. 271–279.
[18] M. D. Donsker and S. S. Varadhan, “Asymptotic evaluation of certain Markov process expectations for large time. IV,” Communications on Pure and Applied Mathematics, vol. 36, no. 2, pp. 183–212, 1983.
[19] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm, “MINE: mutual information neural estimation,” arXiv preprint arXiv:1801.04062, 2018.
 [20] B. Poole, S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker, “On variational lower bounds of mutual information.”
 [21] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
 [22] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[23] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[24] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[25] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
 [26] K. H. Brodersen, C. S. Ong, K. E. Stephan, and J. M. Buhmann, “The balanced accuracy and its posterior distribution,” in Pattern recognition (ICPR), 2010 20th international conference on. IEEE, 2010, pp. 3121–3124.
 [27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikitlearn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
 [29] rdevon/cortex: A machine learning library for pytorch. [Online]. Available: https://github.com/rdevon/cortex
 [30] rdevon/dim: Deep infomax (dim), or ”learning deep representations by mutual information estimation and maximization”. [Online]. Available: https://github.com/rdevon/DIM
 [31] V. D. Calhoun and J. Sui, “Multimodal fusion of brain imaging data: a key to finding the missing link (s) in complex mental illness,” Biological psychiatry: cognitive neuroscience and neuroimaging, vol. 1, no. 3, pp. 230–244, 2016.