1 Introduction
The brain is a vastly complex organ whose proper function relies on the simultaneous operation of multitudes of distinct biological processes. As a result, individual imaging techniques often capture only a single facet of the information necessary to understand a dysfunction or perform a diagnosis. As an illustration, structural MRI (sMRI) captures static but relatively precise anatomy, while fMRI measures the dynamics of the hemodynamic response, albeit with substantial noise. Brain imaging analyses based on a single modality have been shown to potentially lead to misleading conclusions calhoun2016multimodal ; plis2011effective , which is unsurprising given the fundamental differences in the information each modality measures and the fact that every modality is flawed on its own in some way.
To address the limitations of unimodal analyses, it is natural to turn to multimodal data to leverage a wealth of complementary information, which is key to enhancing our knowledge of the brain and developing robust biomarkers. Unfortunately, multimodal modeling is often a challenge, as finding points of convergence between different multimodal views of the brain is a nontrivial problem. We propose self-supervised approaches to improve our ability to model joint information between modalities, thereby allowing us to achieve the following three goals. Our primary goal in addressing multimodal modeling is to understand how to represent multimodal neuroimaging data by exploiting unique and joint information in two modalities. As the second goal, we want to understand the links between different modalities (e.g., T1-weighted structural measurements and resting-state functional MRI data). The final goal is to understand how to exploit co-learning baltruvsaitis2018multimodal , especially in case one modality is particularly hard to learn. By achieving these three goals in this work, we address three out of the five multimodal challenges baltruvsaitis2018multimodal : representation, alignment, and co-learning, leaving only generative translation and fused prediction for future work. Moreover, we present a general framework for self-supervised multimodal neuroimaging. The proposed approach can capitalize on the available joint information to show competitive performance relative to supervised methods. Our approach opens the door to additional data discovery. It enables characterizing subject heterogeneity in the context of imperfect or missing diagnostic labels and, finally, can facilitate visualization of complex relationships.
1.1 Related work
In neuroimaging, linear independent component analysis (ICA)
comon1994independent , and canonical correlation analysis (CCA) hotelling1992relations are commonly used for latent representation learning and intermodal link investigation. Joint ICA (jICA) moosmann2008joint performs ICA on the concatenated representations of each modality. jICA has been extended with multiset canonical correlation plus ICA (mCCA+ICA) sui2011discriminating and spatial CCA (sCCA) sui2010cca , which mitigate limitations of jICA or CCA applied separately. These jICA or jICA-adjacent methods all estimate a joint representation. Another approach, parallel ICA (paraICA)
liu2009combining , simultaneously learns independent components for fMRI and SNP data that maximize the correlation between specific multimodal pairs of columns in the mixing matrices of the different modalities. A recent improvement over paraICA, aNy-way ICA duan2020any , can scale to any number of modalities and requires fewer assumptions. Most of the available multimodal imaging analysis approaches, including those mentioned above, rely on linear decompositions of the data. However, recent work suggests the presence of nonlinearities in neuroimaging data that can be exploited by deep learning (DL) abrol2021deep . Likewise, correspondence between modalities is unlikely to be linear calhoun2016multimodal . These findings motivate the need for deep nonlinear models. Within neuroimaging, supervised DL models have mostly proven successful due to their ease of use. These models show largely unparalleled performance, mainly given an abundance of training data: data samples paired with corresponding labels. However, supervised models are prone to the shortcut learning phenomenon geirhos2020shortcut
: when a model homes in on trivial patterns in the training set that are sufficient to classify the available data but do not generalize to data unseen during training. Next, it has been shown that supervised models can memorize noisy labels
arpit2017closer , which are commonplace in healthcare pechenizkiy2006class ; rokham2020addressing . Furthermore, supervised methods have been shown to be data-inefficient CPCv2 , while labels in medical studies are costly and scarce. Finally, in many cases, diagnostic labels are based on self-reports and interviews and thus may not accurately reflect the underlying biology rokham2020addressing . Many of these problems can be addressed with unsupervised learning and, more recently, self-supervised learning (SSL)
dosovitskiy2014discriminative . In SSL, a model trains on a proxy task that does not require externally provided labels. SSL has been shown to improve robustness NEURIPS2019_a2b15837 and data-efficiency CPCv2 , and can outperform supervised approaches on image recognition tasks SWAV . In the early days of the current wave of unsupervised deep learning, common approaches were based on deep belief networks (DBNs)
srivastava2012learning ; plis2014deep , and deep Boltzmann machines (DBMs)
srivastava2012multimodal ; hjelm2014restricted ; SUK2014569 . However, DBNs and DBMs are difficult to train. Later, deep canonical correlation analysis (DCCA) dcca was introduced for multiview unsupervised learning. DCCA dcca and its successor, deep canonically correlated autoencoder (DCCAE)
DCCAE , are trained in a two-stage procedure. In the first stage, a neural network trains unimodally via layerwise pretraining or using an autoencoder. In the second stage, CCA is used to capture joint information between modalities. Due to the need for a decoder, the autoencoding approaches have high computational and memory requirements for full-brain data, which is why most brain segmentation models still work on 3D patches
fedorov2017end ; fedorov2017almost ; henschel2020fastsurfer . Among the many self-supervised learning approaches, we are specifically interested in methods that use maximization of mutual information, such as Deep InfoMax (DIM) DIM and contrastive predictive coding (CPC) CPCv1 . These methods extend naturally to modeling multimodal data fedorov2021self , compared to other self-supervised pretext tasks misra2020self (e.g., relative position doersch2015unsupervised , rotation gidaris2018unsupervised , colorization
zhang2016colorful ). The maximization of mutual information in these methods uses the predictive relationship between representations at different levels as a learning signal for training. Specifically, the learning signal in DIM DIM is the relationship between an intermediate representation of a convolutional neural network (CNN) and the whole-input representation. In CPC
CPCv1 , for example, this is done between the context and a future intermediate state. Both DIM and CPC have been successfully extended and applied unimodally for the prediction of Alzheimer’s disease from sMRI fedorov2019prediction , transfer learning with fMRI
mahmood2019transfer ; mahmood2020whole , and brain tumor segmentation, pancreas tumor segmentation, and diabetic retinopathy detection NEURIPS2020_d2dc6368 . In addition, these models do not reconstruct the input as part of their learning objective, unlike autoencoders. Forgoing reconstruction saves substantial compute and memory, especially for volumetric medical imaging applications. In multiview and multimodal settings, self-supervised learning has enabled state-of-the-art results on various computer vision problems via maximization of mutual information between different views of the same image
amdim ; cmc ; simclr , and multimodal data (e.g., visual, audio, text) miech2020end ; alayrac2020self ; Radford2021LearningTV ; CMIM . These SSL approaches capture the joint information between two corresponding distorted or augmented images, image-text, video-audio, or video-text pairs. In the multiview case amdim ; cmc ; simclr , the models learn transformation-invariant representations by capturing the joint information while discarding information unique to a transformation. In the case of multimodal data, the models learn modality-invariant demian representations, known as retrieval models. The same ideas have been extended to the multidomain scenario to learn domain-invariant feng2019self representations. However, in the case of neuroimaging, when one modality captures the anatomy and the other captures brain dynamics, the joint information alone will not be sufficient, due to the different information content that each of the modalities measures. We hypothesize that we additionally need to capture unique modality-specific information. Most of the described multiview, multidomain, and multimodal work can be viewed as coordinated representation learning baltruvsaitis2018multimodal . In coordinated representation learning, we learn separate representations for each view, domain, or modality, but the representations are coordinated through an objective function by optimizing a similarity measure with possible additional constraints (e.g., orthogonality in CCA). In this case, the objective function mainly captures joint information between the global latent representations of modalities that summarize the whole input. However, such a framework only considers a global-to-global relationship between modalities. To resolve this limitation, we can consider intermediate representations, namely local representations that capture local information about the input. That would allow us to capture local-to-local, local-to-global, and global-to-global multimodal relationships.
Previously, augmented multiscale DIM (AMDIM) amdim , cross-modal DIM (CM-DIM) sylvain2019locality ; sylvain2020zeroshot , and spatiotemporal DIM (ST-DIM) anand2019unsupervised used local intermediate representations of convolutional layers to capture multiscale relationships between multiple views, modalities, or time frames. Thus, we hypothesize that multiscale relationships between modalities can also be used to learn representations from multimodal neuroimaging data. To verify this, we extend the coordinated representation learning framework to a multiscale coordinated representation framework for multimodal data.
1.2 Contributions
First, we propose a multiscale coordinated framework as a family of models. This family of models is inspired by many published SSL approaches based on the maximization of mutual information, which we combine into a complete taxonomy. The family of methods within this taxonomy covers multiple inductive biases that can capture joint intermodal and unique intramodal information. In addition, it covers multiscale multimodal relationships in the input data.
Secondly, we provide a methodology to exhaustively evaluate the learned representations. We thoroughly investigate the models on the multimodal OASIS3 dataset OASIS3 by evaluating performance on two classification tasks, measuring the amount of joint information in representations between modalities, and interpreting the representations in brain voxel space.
Our results provide strong evidence that self-supervised models yield useful predictive representations for classifying a spectrum of Alzheimer’s phenotypes. We show that self-supervised models are able to uncover regions of interest, such as the hippocampus yang2022human , thalamus elvsaashagen2021genetic , parietal lobule greene2010subregions , and occipital gyrus yang2019study on T1, and the hippocampus 10.3389/fnagi.2018.00037 ; zhang2021regional , middle temporal gyrus hu2022brain , subthalamus and hypothalamus chan2021induction , and superior medial frontal gyrus cheung2021diagnostic on fALFF. Furthermore, self-supervised models can uncover multimodal links between anatomy and brain dynamics that are also supported by previous literature, such as the thalamus (on fALFF) and precuneus (on T1) cunningham2017structural , the precuneus (T1) and hippocampus (fALFF) kim2013hippocampus ; ryu2010measurement , and the precuneus (T1) and middle cingulate cortex (fALFF) rami2012distinct ; bailly2015precuneus .
2 Materials and Methods
In this section, we first describe the foundation of our framework. Then we describe the evaluation of the learned representations.
2.1 Overview of coordinated representation learning
To help the reader understand the foundation of our framework, we start with the idea of coordinated representation learning baltruvsaitis2018multimodal .
Let $x = \{x^m\}_{m=1}^{M}$ be an arbitrary sample of $M$ related input modalities collected from the same subject. Taken together, a set of $N$ samples comprises a multimodal dataset $D = \{x_i\}_{i=1}^{N}$. In our case, the modalities of interest are sMRI and rs-fMRI, represented as T1 and fALFF volumes, respectively; thus $M = 2$, and each $x^m$ is a 3D tensor. While intermodal relationships can be as complex as a concurrent acquisition of neuroimaging modalities, we pair sMRI and rs-fMRI simply based on the temporal proximity between sessions of the same subject from the OASIS3 OASIS3 dataset.
For each modality $m$, coordinated representation learning introduces an independent encoder $f_m$ that maps the input, $x^m$, to a lower-dimensional representation vector $z^m = f_m(x^m) \in \mathbb{R}^{d}$. Thus, each modality has a separate encoder and a corresponding representation. In this work, we parameterize the encoder for each modality with a volumetric deep convolutional neural network. To learn each encoder’s parameters, we optimize an objective function that incorporates the representation of each modality and coordinates them. The objective encourages each of the encoders to learn an encoding for its respective modality that is informed by the other modalities. This cross-modal influence is learned during training and is captured in the parameters of each encoder. Hence, a representation of an unseen subject’s modality will capture cross-modal influences without being contingent on the availability of the other modalities. One common choice is to coordinate representations across modalities by maximizing a similarity metric frome2013devise or the correlation via CCA dcca between representation vectors.
To summarize, in coordinated representation learning, modality-specific encoders learn to generate representations in a cross-coordinated manner guided by an objective function.
The main limitation of the coordinated representation framework is its exclusive focus on capturing joint information between modalities, rather than also capturing information that is exclusive to each modality. Thus, DCCA dcca and DCCAE DCCAE employ coordinated learning only as a secondary stage after pretraining the encoder. The first stage in these methods focuses on learning modality-specific information via layerwise pretraining of the encoder in DCCA, or pretraining of the encoder with an autoencoder (AE) in DCCAE. On the other hand, deep collaborative learning (DCL) DCL attempts to capture modality-specific information using a supervised objective for phenotypical information with respect to each modality, in addition to CCA.
An additional limitation that previous work has not considered is the use of intermediate representations in the encoder. Intermediate representations have been integral to the success of the U-Net architecture in biomedical image segmentation ronneberger2015u and brain segmentation tasks henschel2020fastsurfer , to achieving state-of-the-art results with self-supervised learning on natural image benchmarks with DIM DIM and AMDIM amdim , and to achieving near-supervised performance in Alzheimer’s disease progression prediction with self-supervised pretraining fedorov2019prediction .
In our work, we address these limitations by proposing a multiscale coordinated learning framework.
2.1.1 Multiscale coordinated learning
To motivate multiscale coordinated learning, we reintroduce intermediate representations and explain how they can benefit multimodal modeling.
Each encoder produces intermediate representations. Specifically, if the encoder is a convolutional neural network (CNN) with $L$ layers, each subsequent layer $l$ in the CNN represents a larger part of the original input data. Furthermore, each of these scales, which correspond to the depth of a layer, is an increasingly nonlinear transformation of the input and produces a more abstract representation of that input relative to the previous scales. The intermediate representations of layer $l$ are convolutional features $c^l \in \mathbb{R}^{M_l \times D_l}$, where $M_l$ is the number of locations in the convolutional features of layer $l$, and $D_l$ is the number of channels. These features are also often referred to as activation maps within the network. For example, if the input is a 3D cube, an activation map within a layer of the CNN will have size $F_l \times F_l \times F_l \times D_l$, and thus each of the intermediate representations will have $M_l = F_l^3$ locations. Each location in the intermediate representation has a receptive field araujo2019computing that captures a certain subset of the input sample. Each intermediate representation thus captures some of the input’s local information, while the latent representation ($z^m$) captures the input’s global information.

With two scales and two modalities, we can define multiscale coordinated learning based on four objectives, which are schematically shown in Figure 1. The Convolution-to-Representation (CR) objective captures modality-specific information as local-to-global intramodal interactions. The Cross Convolution-to-Representation (XX) objective captures joint intermodal local-to-global interactions between the local representations in one modality and the global representation in another modality. The Representation-to-Representation (RR) objective captures joint information between global intermodal representations as global-to-global interactions. The Convolution-to-Convolution (CC) objective captures joint information between local intermodal representations as local-to-local interactions.
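To make the notion of locations concrete, the following NumPy sketch (all shapes hypothetical) flattens a volumetric activation map into its local representation vectors, one per spatial position:

```python
import numpy as np

def local_representations(activation_map):
    """Flatten a volumetric CNN activation map with shape
    (channels, depth, height, width) into a matrix holding one
    local representation vector per spatial location."""
    channels = activation_map.shape[0]
    # every spatial position becomes one channels-dimensional vector
    return activation_map.reshape(channels, -1).T

# hypothetical layer output: 32 channels on a 6 x 6 x 6 spatial grid
fmap = np.random.randn(32, 6, 6, 6)
locs = local_representations(fmap)
print(locs.shape)  # (216, 32): 6**3 locations, each a 32-dim vector
```

Each row of the resulting matrix is one local representation whose receptive field covers a subset of the input volume.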
Thus, we can capture modalityspecific information and multimodal relationships at multiple scales. These extensions cover two of the previously mentioned limitations in the coordinated learning framework. Our extensions also allow us to define a full taxonomy of models that can be constructed based on these four principal interactions and to show how these compare to or supersede related work.
To construct an effective objective based on these multiscale coordinated interactions, first, we will define an estimator of mutual information. This estimator will be used to define each of the four objectives as a mutual information maximization problem that can be used to encourage the interactions between the corresponding representations. Lastly, we explain how one can construct an objective for a multiscale coordinated representation learning problem, based on a combination of the four basic objectives between global and local features, and, additionally, show how these compare to related work.
2.1.2 Mutual Information Maximization
To estimate the mutual information between random variables $u$ and $v$, we use a lower bound based on the noise-contrastive estimator (InfoNCE)
CPCv1 :
\begin{equation}
I(u; v) \ge \hat{I}(u; v) = \frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{f(u_i, v_i)}}{\sum_{j=1}^{N} e^{f(u_i, v_j)}},
\tag{1}
\end{equation}
where the samples construct pairs: $(u_i, v_i)$ sampled from the joint distribution (positive pair) and $(u_i, v_j)$, $j \neq i$, sampled from the product of the marginals (negative pair). The vectors $u$ and $v$ are $d$-dimensional representation vectors and can be local or global representations. The function $f$ in equation 1 is a scoring function that maps its input vectors to a scalar value and is supposed to reflect the goodness of fit. This function is also known as the critic function Tschannen2020On . The encoder is optimized to maximize the critic function for a positive pair and minimize it for negative pairs, such that $f(u_i, v_i) > f(u_i, v_j)$ for $j \neq i$. Our choice of the critic function is a scaled dot-product amdim , and is defined as:
\begin{equation}
f(u, v) = \frac{u^{\top} v}{\sqrt{d}}.
\tag{2}
\end{equation}
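As an illustration, the InfoNCE bound with a scaled dot-product critic can be sketched in a few lines of NumPy. This is a minimal sketch with batch-based negatives (as is common in practice); the variable names are ours, not from the paper:

```python
import numpy as np

def critic(za, zb):
    """Scaled dot-product critic: f(u, v) = u . v / sqrt(d)."""
    return za @ zb.T / np.sqrt(za.shape[1])

def infonce(za, zb):
    """InfoNCE lower bound on I(za; zb) for N paired d-dim vectors.
    Row i of za pairs with row i of zb (positive); all other rows in
    the batch act as negatives, giving an (N, N) score matrix whose
    diagonal holds the positive-pair scores."""
    scores = critic(za, zb)
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(np.mean(np.diag(log_softmax)))

rng = np.random.default_rng(0)
za = rng.normal(size=(8, 16))
bound_pos = infonce(za, za)                         # perfectly aligned pairs
bound_neg = infonce(za, rng.normal(size=(8, 16)))   # unrelated pairs
print(bound_pos > bound_neg)  # True: aligned pairs give a higher bound
```

The bound is always non-positive in this form and approaches zero as positive pairs become easy to discriminate from the in-batch negatives.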
2.1.3 Taxonomy for multiscale coordinated learning
Given the mutual information estimator, we can construct four basic objectives and then use those to construct a full taxonomy of interactions, which is shown in Figure 2. Each option within the taxonomy specifies a unique optimization objective. Notably, the first row of the figure shows the principal losses that we defined before: CR, XX, RR, and CC. The remaining parts of the taxonomy are constructed by composing the principal losses. For example, the fifth combination, RR-CC, is the sum of the two basic objectives RR and CC.
To discuss the options in the taxonomy, we first reintroduce some notation. A local representation $c^l_i$ is any arbitrary location $i$ in the convolutional features $c^l$ from a convolutional layer $l$ with $M_l$ locations. A location is represented as a $D_l$-dimensional vector, where $D_l$ is the number of channels of the convolutional activation map for layer $l$. The choice of the layer $l$ is a hyperparameter and can be guided by the following intuition. First, the chosen layer should not be too close to the last layer, or be the last layer, because it will capture information similar to the global representation. In this case, the local and global content of the input could be very similar, with almost or exactly the same receptive field. A single-layer difference between the local and global representations can still be used as a strategy for layerwise pretraining, as in Greedy InfoMax (GIM) lowe2019putting , but it is not meaningful for our purposes. Secondly, the chosen layer should not be too close to, or be, the first layer. This would lead to a local representation with a very small receptive field that only captures hyperlocal information about the input. The first layer essentially captures the intensity of a voxel, which has not worked well, as previously shown in DIM DIM . The global representation is the encoder’s latent representation $z$ that summarizes the whole input. The global representation is a $d$-dimensional vector, where $d$ is also a hyperparameter. With this global representation, we also define a $d$-dimensional space in which we compute scores with the critic function $f$. The local representation, however, is a $D_l$-dimensional vector. To overcome the difference in size, we add a local projection head $\phi_l$. This projection head takes the $D_l$-dimensional local representation from layer $l$ of the encoder and projects it to the $d$-dimensional space, so we can compute scores with $f$ in this $d$-dimensional space. This projection head is also parameterized by a neural network and is separate for each modality. In addition, we introduce a global projection head $\phi_g$. The local and global projection heads have been shown to improve training performance in DIM DIM and SimCLR simclr , respectively.
The first objective (CR), in the top left corner of Figure 2, trains two independent encoders, one for each modality, with a unimodal loss function that maximizes the mutual information between local and global representations. This objective directly implements the Deep InfoMax (DIM) DIM objective. The idea behind this approach is to maximize the information between the lowest and the highest scales of the encoder. In other words, the local representations are driven to be predictive of the global representation. The objective for an arbitrary layer $l$ and modality $m$ is defined as:
\begin{equation}
\mathcal{L}_{\mathrm{CR}}^{m} = \frac{1}{M_l} \sum_{i=1}^{M_l} \hat{I}\left(\phi_l(c^{l,m}_i);\, \phi_g(z^m)\right),
\tag{3}
\end{equation}
where the objective is defined for a single modality $m$ and has to be computed for each modality.
The CR objective can be extended to the multimodal case by measuring the intermodal mutual information between the local and global representations of modalities $a$ and $b$, respectively. We call this multimodal objective Cross Convolution-to-Representation (XX), and it is shown second from the top left in Figure 2. This objective has previously been used in the context of augmented multiscale DIM (AMDIM) amdim , cross-modal DIM (CM-DIM) sylvain2020zeroshot , and spatiotemporal DIM (ST-DIM) anand2019unsupervised . We define it as
\begin{equation}
\mathcal{L}_{\mathrm{XX}}^{a,b} = \frac{1}{M_l} \sum_{i=1}^{M_l} \hat{I}\left(\phi_l(c^{l,a}_i);\, \phi_g(z^b)\right),
\tag{4}
\end{equation}
where the objective is defined for a pair of modalities $a$ and $b$ and has to be computed for all possible pairs of modalities. In the case of symmetric coordinated fusion, the symmetry has to be preserved for modalities $a$ and $b$ by computing both $\mathcal{L}_{\mathrm{XX}}^{a,b}$ and $\mathcal{L}_{\mathrm{XX}}^{b,a}$, whereas for asymmetric fusion this is not the case.
The third elementary objective measures the mutual information between the global representation $z^a$ of one modality and the global representation $z^b$ of another. This objective is called Representation-to-Representation (RR) and is the third in the top row of Figure 2. This interaction has been used in much prior contrastive multiview work cmc ; simclr ; moco ; SWAV and in DCCA dcca . The RR objective is defined as
\begin{equation}
\mathcal{L}_{\mathrm{RR}}^{a,b} = \hat{I}\left(\phi_g(z^a);\, \phi_g(z^b)\right),
\tag{5}
\end{equation}
where the objective is defined for a pair of modalities $a$ and $b$.
The fourth elementary objective is similar to RR, but instead maximizes the mutual information between two intermodal local representations $c^{l,a}_i$ and $c^{l,b}_j$, where $i$ and $j$ are arbitrary locations in layer $l$ of modalities $a$ and $b$, respectively. This objective is called Convolution-to-Convolution (CC) and is shown as the fourth from the top left in Figure 2. The CC objective has been used in AMDIM amdim , CM-DIM sylvain2020zeroshot , and ST-DIM anand2019unsupervised . Due to the large number of possible pairs of locations between the activation maps in each encoder, we reduce the computational cost by sampling arbitrary locations, as proposed in AMDIM amdim . Thus, after sampling an arbitrary location from the convolutional activation map of one modality, we compute the objective in a similar way to XX, by treating the sampled location as the global representation:
\begin{equation}
\mathcal{L}_{\mathrm{CC}}^{a,b} = \frac{1}{M_l} \sum_{i=1}^{M_l} \hat{I}\left(\phi_l(c^{l,a}_i);\, \phi_l(c^{l,b}_j)\right),
\tag{6}
\end{equation}
where $j$ is a sampled location from the $M_l$ locations. The objective is defined for a pair of modalities $a$ and $b$.
By combining these four primary objectives, we can construct more complicated objectives, as shown in Figure 2. For example, the XX-CC objective for two modalities $a$ and $b$ can be written as
\begin{equation}
\mathcal{L}_{\mathrm{XX\text{-}CC}} = \mathcal{L}_{\mathrm{XX}}^{a,b} + \mathcal{L}_{\mathrm{XX}}^{b,a} + \mathcal{L}_{\mathrm{CC}}^{a,b} + \mathcal{L}_{\mathrm{CC}}^{b,a}.
\tag{7}
\end{equation}
The goal is to find parameters that maximize $\mathcal{L}_{\mathrm{XX\text{-}CC}}$. The objective is repeated with the modalities flipped to preserve the symmetry of XX and CC. Removing the symmetry is intuitively similar to guiding the representations of one modality by the representations of another, which may be interesting for future work on asymmetric fusion. The XX-CC objective coordinates representations locally with the CC objective on convolutional activation maps and coordinates representations across scales in the encoder with XX. The local representations of one modality should be predictive of the global and local representations of the other modality.
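As a sketch of how such a composite objective is assembled (simplified relative to the paper: we sample a single location per subject for both terms, and assume local and global features have already been projected to a shared d-dimensional space; all shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def infonce(za, zb):
    """InfoNCE bound with a scaled dot-product critic; row i of za
    pairs with row i of zb, other rows in the batch are negatives."""
    s = za @ zb.T / np.sqrt(za.shape[1])
    return float(np.mean(np.diag(s - np.log(np.exp(s).sum(1, keepdims=True)))))

N, d, M = 8, 16, 27  # batch size, shared dim, locations per activation map
# stand-ins for projected representations of modalities a (T1), b (fALFF)
z_a, z_b = rng.normal(size=(N, d)), rng.normal(size=(N, d))        # global
c_a, c_b = rng.normal(size=(N, M, d)), rng.normal(size=(N, M, d))  # local

# sample one location per subject (as in AMDIM) to keep CC tractable
j = rng.integers(M, size=N)
loc_a, loc_b = c_a[np.arange(N), j], c_b[np.arange(N), j]

# symmetric XX-CC: local-vs-other-global (XX) plus local-vs-local (CC),
# computed in both directions to preserve symmetry
L_xx = infonce(loc_a, z_b) + infonce(loc_b, z_a)
L_cc = infonce(loc_a, loc_b) + infonce(loc_b, loc_a)
L_xxcc = L_xx + L_cc  # maximize this during training
print(L_xxcc)
```

During actual training, the encoders and projection heads would be optimized by gradient ascent on this quantity.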
2.1.4 Baselines and other objectives
We compare our method to an autoencoder (AE), a deep canonically correlated autoencoder (DCCAE) DCCAE , and a supervised model. Each is a high-performing representative of the three main categories of alternative approaches to our framework. The AE and supervised models are trained separately for each modality, while the DCCAE is trained jointly on all modalities. By Supervised, we refer to a unimodal model trained to predict a target using cross-entropy loss.
In addition to defining a unified framework that covers multiple existing approaches, our taxonomy contains novel, previously unpublished combinations of the four objectives. One such approach combines the CR objective with the objective of the DCCAE, which we call CR-CCA. The CR objective allows us to train using modality-specific information, and the CCA objective aligns the representations between modalities. This leads to the following objective:
\begin{equation}
\mathcal{L}_{\mathrm{CR\text{-}CCA}} = \mathcal{L}_{\mathrm{CR}}^{a} + \mathcal{L}_{\mathrm{CR}}^{b} + \mathcal{L}_{\mathrm{CCA}}^{a,b},
\tag{8}
\end{equation}
where $\mathcal{L}_{\mathrm{CCA}}^{a,b}$ is the CCA objective between the global representations, as in DCCAE DCCAE .
A second novel approach combines the AE objective with our RR objective to create the RR-AE objective. The AE objective ensures the learning of modality-specific representations, and the RR objective enforces the alignment of representations across modalities, similar to the CCA objective in the DCCAE. The final objective of the RR-AE is as follows:
\begin{equation}
\mathcal{L}_{\mathrm{RR\text{-}AE}} = \mathcal{L}_{\mathrm{RR}}^{a,b} - \mathcal{L}_{\mathrm{AE}}^{a} - \mathcal{L}_{\mathrm{AE}}^{b},
\tag{9}
\end{equation}
where $\mathcal{L}_{\mathrm{AE}}^{m}$ is the mean squared reconstruction error for the AE with an additional decoder $g_m$ for modality $m$:
\begin{equation}
\mathcal{L}_{\mathrm{AE}}^{m} = \left\| x^m - g_m(f_m(x^m)) \right\|_2^2 .
\tag{10}
\end{equation}
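A toy sketch of the RR-AE objective with linear stand-ins for the encoders and decoders (all shapes and weights hypothetical), combining per-modality reconstruction error with an InfoNCE term between the global representations:

```python
import numpy as np

rng = np.random.default_rng(0)

def infonce(za, zb):
    """InfoNCE bound with a scaled dot-product critic over a batch."""
    s = za @ zb.T / np.sqrt(za.shape[1])
    return float(np.mean(np.diag(s - np.log(np.exp(s).sum(1, keepdims=True)))))

N, D, d = 8, 64, 16  # batch, input dim, latent dim
W_a = rng.normal(size=(d, D)) / np.sqrt(D)  # linear "encoder" weights
W_b = rng.normal(size=(d, D)) / np.sqrt(D)
x_a, x_b = rng.normal(size=(N, D)), rng.normal(size=(N, D))

z_a, z_b = x_a @ W_a.T, x_b @ W_b.T    # global representations
xhat_a, xhat_b = z_a @ W_a, z_b @ W_b  # tied-weight linear "decoders"

# RR-AE as a loss to minimize: reconstruction errors minus the RR bound
recon = np.mean((x_a - xhat_a) ** 2) + np.mean((x_b - xhat_b) ** 2)
loss_rr_ae = recon - infonce(z_a, z_b)
print(loss_rr_ae > 0)  # True: recon > 0 and the InfoNCE bound is <= 0
```

In the actual models the linear maps are replaced by volumetric CNN encoders and decoders, but the composition of the two terms is the same.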
The baseline schemes are shown in Figure 3.
2.2 Analysis of the model representation
It is important to validate the representations that the model learns, which in the proposed framework is done in three steps. The first step evaluates the representations using classification tasks with logistic regression. This step uses the features that are extracted by the encoder from the input. The goal of this evaluation is to ensure the discriminative power of the pretrained features. In the second step, we compute the similarity between representations to measure how much joint information has been captured by the model. The third and last step consists of two analyses that explore the relationship between the latent space and the brain space to assess voxelwise group differences based on saliency gradients for each of the $d$ dimensions of the representation.

2.2.1 Classification evaluation
To evaluate the discriminative performance of the representations captured by the model, we train logistic regression on frozen representations from the last layer of the encoder. Note that most selfsupervised learning algorithms evaluate the discriminative power of representations with a linear evaluation protocol based on linear probes linear_probes . We chose to use logistic regression, however, due to faster training times.
2.2.2 Alignment analysis
To evaluate the alignment between representations of different modalities, we use centered kernel alignment (CKA) CKA . CKA has been shown CKA to be effective at identifying correspondences between representations of networks with different initializations, compared to CCA-based similarity measures SVCCA ; PWCCA . CKA can be considered a normalized version of the Hilbert-Schmidt Independence Criterion (HSIC) gretton2005measuring . The linear CKA measure for a pair of modalities $a$ and $b$ is defined as:
\begin{equation}
\mathrm{CKA}(Z^a, Z^b) = \frac{\left\| {Z^b}^{\top} Z^a \right\|_F^2}{\left\| {Z^a}^{\top} Z^a \right\|_F \left\| {Z^b}^{\top} Z^b \right\|_F},
\tag{11}
\end{equation}
where $Z^m \in \mathbb{R}^{N \times d}$ is the matrix of $d$-dimensional representations for $N$ samples of modality $m$ (with mean-centered columns), and $\| \cdot \|_F$ is the Frobenius norm.
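A minimal NumPy implementation of linear CKA between two representation matrices (variable names are ours); note its characteristic invariance to orthogonal transformations of the representation space:

```python
import numpy as np

def linear_cka(Za, Zb):
    """Linear CKA between two (n_samples, dim) representation
    matrices; columns are mean-centered before comparison."""
    Za = Za - Za.mean(axis=0)
    Zb = Zb - Zb.mean(axis=0)
    cross = np.linalg.norm(Zb.T @ Za, 'fro') ** 2
    return cross / (np.linalg.norm(Za.T @ Za, 'fro')
                    * np.linalg.norm(Zb.T @ Zb, 'fro'))

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 16))
R, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix

cka_same = linear_cka(Z, Z @ R)  # ~1.0: invariant to orthogonal transforms
cka_diff = linear_cka(Z, rng.normal(size=(100, 16)))  # much lower
print(round(cka_same, 3), cka_same > cka_diff)
```

This invariance is what makes CKA suitable for comparing representations of networks trained from different initializations.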
Our results are only evaluated using CKA since we find it fedorov2021self to be the most robust to noise, which reinforces findings in previous literature CKA ; CKAstructure that suggest the same.
2.2.3 Saliency explanation of the representation in brain space
To explain the representations in brain space, we adapt the integrated gradients algorithm sundararajan2017axiomatic . We want to understand the representations rather than the saliency of a specific label. Hence we propose a simple adaptation. Instead of using a target variable, we compute gradients with respect to each dimension of the representation. This is done by setting the specific dimension in the vector to 1 and all other dimensions to 0.
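The adaptation above can be sketched as follows; this is a toy version with a linear "encoder" standing in for the CNN (so the gradient of each representation dimension is available in closed form), and all names are illustrative:

```python
import numpy as np

def integrated_gradients(grad_fn, x, dim, baseline=None, steps=50):
    """Integrated gradients attribution for one dimension `dim` of a
    representation vector: instead of a class logit, the gradient of
    z[dim] alone (the one-hot-on-the-representation adaptation) is
    integrated along the straight path from the baseline to x."""
    if baseline is None:
        baseline = np.zeros_like(x)
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.stack([grad_fn(baseline + a * (x - baseline), dim)
                      for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# toy linear "encoder" z = W x: the gradient of z[dim] w.r.t. x is W[dim]
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 10))
grad_fn = lambda x, dim: W[dim]

x = rng.normal(size=10)
attr = integrated_gradients(grad_fn, x, dim=2)
# completeness axiom: attributions sum to z[dim](x) - z[dim](baseline)
print(np.allclose(attr.sum(), W[2] @ x))  # True
```

For the actual volumetric encoders, `grad_fn` would be supplied by automatic differentiation, and the resulting attribution maps live in brain voxel space.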
3 Experiments and results
In this section, we present our findings on two classification tasks with the OASIS3 dataset, investigating the relative performance of self-supervised and fully supervised approaches. In addition, to understand the inductive biases of the different multimodal objectives, we calculate CKA to measure the joint information captured in representations between modalities. We then compare group differences between the supervised and the best self-supervised model from the first stage. Lastly, we explore multimodal links between T1 and fALFF.
The overall scheme of our experimental setup, shown in Figure 4, consists of initial pretraining followed by evaluation on the classification tasks, alignment analysis of the representations, and analysis of the representations through saliency in brain space.
3.1 Dataset
Here we validate our method on OASIS3 OASIS3 , a multimodal neuroimaging dataset that includes multiple Alzheimer’s disease phenotypes.
Each subject in this dataset is represented by a T1 volume and a fractional amplitude of low-frequency fluctuation (fALFF) zou2008improved volume, generated from T1w and resting-state fMRI (rs-fMRI) images, respectively. The purpose of the T1 volume is to account for the anatomy of the brain, while the fALFF volume captures the resting-state dynamics. Both T1 and fALFF volumes have previously been shown to be informative for studying not only Alzheimer’s disease he2007regional but also other conditions (e.g., chronic smoking wang2017altered ).
The T1w images were brain-masked with BET in FSL fsl (v 6.0.20), linearly transformed to MNI space, and subsampled to 3mm after preprocessing. 15 T1w images were discarded because they did not pass the initial visual quality check. The rs-fMRI images were registered to the first image using MCFLIRT in FSL fsl (v 6.0.20). The specific parameters for MCFLIRT are: a 3-stage search level (8mm, 4mm, 4mm); a 20mm field-of-view; 256 histogram bins (for matching); 6 degrees-of-freedom (DOF) for the transformation; a scaling factor of 6mm; normalized correlation across volumes as the cost function (smoothed to 1mm); and spline interpolation. The fALFF maps were computed using REST rest within the 0.01 to 0.1 Hz power band. The final volume size for both modalities is . Although it is not required to register the volumes in MNI space, we perform this registration to simplify the analysis and the interpretability of our method. Otherwise, we minimized the preprocessing of the data so that as much information as possible remains in the original data for the neural network to learn from. In addition, the subsampling to mm was done to reduce computational needs; applications at mm are left as future work.
Non-Hispanic Caucasian subjects are the largest cohort () in the dataset; thus, we selected ( HC, AD, unlabeled) non-Hispanic Caucasian subjects as the main set for pretraining. OASIS-3 OASIS3 contains a large number of subjects that are neither classified as AD nor can readily be called controls. These subjects belong to one of 21 diagnostic categories, including some forms of cognitive impairment, frontotemporal dementia (FTD), diffuse Lewy body disease (DLBD), and vascular dementia, drawn from a preclinical cohort with longitudinal follow-up. We combined all such subjects into a separate third class.
After matching, for each subject, the scans of the two modalities that are closest in date out of all available scans, the final dataset contains pairs. The pairs are split into stratified folds ( subjects ( pairs), ()), and a holdout set (). The number of pairs is greater than the number of subjects because some subjects have multiple scans; thus, we utilized more pairs during pretraining but used only one pair of images per subject in the final evaluation. For the 2-way classification we do not use unlabeled data, while for the 3-way classification we use the unlabeled data as a "noisy" phenotypic third class.
Before feeding the images into the neural network, the intensities of the T1 and fALFF volumes were normalized using min-max rescaling to the unit interval ([0, 1]). During pretraining, we augment the dataset with random crops of size 64 after reflective padding of size eight on all sides. The choice of preprocessing and augmentation was based on evaluations of the supervised baseline. We also considered histogram standardization, z-normalization, random flips, and a balanced data sampler balanced_data_sampler . However, the results were not substantially different; thus, to reduce the computational cost, we use only simple min-max rescaling and random crops.
3.2 Learning and evaluating representations
3.2.1 Pretraining
To train our model, schematically shown in Figure 4, we have to choose an architecture for the encoder and for the global and local projection heads. The local projection head is needed to project the local representation to a dimensional channel space so that the critic scores between the global and local representations are computed in the same space. The global projection head is needed for the optimization process: as shown by the authors of SimCLR simclr , the last projection to the representation can develop a low-rank condition that is beneficial for optimizing the objective but can be destructive to the representation.
For our encoder, we choose the architecture from deep convolutional generative adversarial networks (DCGAN) dcgan . This architecture provides a simple, fully convolutional structure and has a specialized decoder, which is important for the performance of generative approaches. We use volumetric convolutional layers for the experiments with the OASIS-3 neuroimaging dataset. Most of the hyperparameters are left as in the original work dcgan . We swapped the final tanh activation function in the decoder for a sigmoid because the intensities of the input images are scaled to the unit interval. The last layer projects the activations from the previous layer to the final dimensional representation vector, which we call global. All convolutional layers are initialized with Xavier uniform initialization xavier_uniform and a gain related to the activation function paszke2019pytorch . Each modality has its own encoder with the DCGAN architecture.
For the local projection head we choose an architecture similar to AMDIM amdim . The projection head is one ResNet block from the third layer of the DCGAN architecture with feature size . One path of the block consists of convolutional layers (kernel size , number of output and hidden channels , Xavier uniform initialization xavier_uniform ). The other path consists of a single convolutional layer (kernel size , number of output channels , initialized as the identity). The projection heads are individual for each modality, and we add them only if the model has a CC objective.
For the global projection head we follow SimCLR simclr . We perform a hyperparameter search over the number of hidden layers in the projection head for each model that can use one (except Supervised, AE, CC). We considered the following cases: no projection head, a linear projection head, and a projection head with 1, 2, or 3 hidden layers. The number of output dimensions in the projection layers equals .
In addition, following AMDIM amdim , we regularize the InfoNCE objective by penalizing the squared scores computed by the critic function as with , and by clipping the scores by with .
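A sketch of this regularization on a batch of cross-modal critic scores follows; the clipping constant and penalty weight below are illustrative placeholders chosen for the sketch, not the values used in our experiments.

```python
import numpy as np

def regularized_infonce(scores, clip_c=20.0, penalty_weight=4e-2):
    """InfoNCE over a batch of critic scores (positives on the diagonal),
    with two AMDIM-style stabilizers: a squared-score penalty and a soft
    tanh clip. The constants here are illustrative assumptions."""
    penalty = penalty_weight * np.mean(scores ** 2)
    clipped = clip_c * np.tanh(scores / clip_c)            # bounds scores to (-c, c)
    logits = clipped - clipped.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob)) + penalty
```

The soft tanh clip keeps the softmax well conditioned, while the squared-score penalty discourages the critic from inflating scores without improving discrimination.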
We train the models on the OASIS-3 dataset with the RAdam liu2019variance optimizer with learning rate (). The pretraining step in our framework is run for epochs. For each trained model, we saved checkpoints based on the best validation loss.
3.2.2 Evaluation
After the pretraining step, to evaluate the discriminative performance of the representations learned by the various objectives in our taxonomy, we perform two classification tasks. The first task is a binary classification of Alzheimer's disease (AD) vs. a healthy cohort (HC). The second task is a ternary classification with an additional phenotypical class; it is harder than the first because of the added noisy third class.
Logistic regression (from scikit-learn scikitlearn ) is used to evaluate the discriminative performance of the learned representations; it is trained on the global representation extracted with a pretrained encoder. The hyperparameters of the logistic regression were optimized with Optuna optuna_2019 for iterations, with selections based on the validation dataset. The search space is defined as follows: the inverse regularization strength is sampled log-uniformly from the interval , and for the elastic net penalty the mixing parameter is sampled uniformly from the unit interval . The logistic regression is trained with the SAGA solver defazio2014saga . We use ROC AUC and one-vs-one (OVO) ROC AUC Macro hand2001simple as the scoring functions for the hyperparameter search in binary and ternary classification, respectively. The OVO strategy for multiclass ROC AUC computes the average AUC over all possible pairwise combinations of classes and, with macro averaging, is insensitive to class imbalance scikitlearn . Classification is performed separately for each modality by training logistic regression on the representations extracted from that modality with the corresponding convolutional encoder.
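For concreteness, the OVO ROC AUC Macro metric can be sketched in a few lines of numpy following the Hand & Till construction; the assumption that the probability columns follow np.unique(y_true) order is ours.

```python
import numpy as np

def pairwise_auc(pos_scores, neg_scores):
    """Rank-based AUC: fraction of (pos, neg) pairs ranked correctly (ties 0.5)."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def ovo_roc_auc_macro(y_true, proba):
    """For every unordered class pair, average the two one-vs-one AUCs,
    then average over all pairs; insensitive to class imbalance."""
    classes = np.unique(y_true)
    pair_aucs = []
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            a, b = y_true == classes[i], y_true == classes[j]
            auc_ij = pairwise_auc(proba[a, i], proba[b, i])  # class i vs class j
            auc_ji = pairwise_auc(proba[b, j], proba[a, j])  # class j vs class i
            pair_aucs.append(0.5 * (auc_ij + auc_ji))
    return float(np.mean(pair_aucs))
```

In practice the same quantity is available as scikit-learn's roc_auc_score with multi_class="ovo" and average="macro".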
After running logistic regression for each of the ten checkpoints, we select the checkpoint with the maximum ROC AUC metric for the binary case and the maximum OVO ROC AUC Macro metric for the ternary case on the validation set. For the models that need a projection head during pretraining, we then choose the projection head based on cross-validation. Since multimodal models are paired, we select their checkpoints based on the average performance over both modalities, while for unimodal models we pair checkpoints by the number of training epochs, so that models trained equally long are paired together.
Lastly, the CKA alignment score is computed on the global representations to measure the joint information content between modalities, as a measure of the inductive bias of the training objective.
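For reference, linear CKA between two representation matrices reduces to a few lines; this is a sketch over extracted global representations, with rows as subjects and columns as representation dimensions.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape
    (n_subjects, n_dims); 1 means perfectly aligned subspaces."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```

A useful property for our purpose is that linear CKA is invariant to orthogonal transformations of either representation, so it measures shared information rather than a particular basis.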
3.2.3 Results
The classification results on the holdout test dataset are shown for both tasks in Figure 5. The performance is reported as the median and interquartile range (IQR) of the ROC AUC and one-versus-one (OVO) ROC AUC Macro hand2001simple metrics for the binary and ternary classification tasks, respectively. Additionally, we report CKA as the measure of joint information between representations of different modalities for each model.
Overall, the Supervised model outperforms the self-supervised models on T1 for both tasks: 86.2 (86.1–86.8) for 2-way classification and 72.9 (72.0–77.0) for 3-way classification. However, the performance gap on T1 is small, as the self-supervised multimodal model RR-XX-CC achieves 84.0 (83.0–84.9) for binary and 70.9 (67.9–72.1) for ternary classification.
Most self-supervised models achieve better classification performance than the Supervised model on the "noisier" fALFF in the 2-way classification task. For 3-way classification the gap is reduced, with the RR-XX model achieving 63.2 (59.1–67.8) versus 62.4 (58.7–70.1) for Supervised. This supports the benefits of multimodal learning, which can be seen as a regularization effect.
The unimodal autoencoder (AE) model performs well on the simple binary classification task; however, its performance drops significantly on the harder ternary task. Unimodal models such as CR and AE are outperformed by most of the multimodal models. Evidently, the multimodal extension of AE with CCA, DCCAE, can improve performance. However, DCCAE is outperformed by most of the self-supervised decoder-free models on the 2-way classification task and by XX-CC, RR-XX, RR-XX-CC, and XX on the 3-way classification task. Thus the proposed models achieve more robust performance while avoiding the computational cost of a decoder for each modality.
Overall, the proposed self-supervised models XX, XX-CC, RR-XX and RR-XX-CC perform robustly on both tasks and retain their ranking relative to the other models. Additionally, judging by the higher CKA alignment measure (0.63–0.73), these models capture joint information between modalities. There are other models (RR-AE, RR-CC and RR) that achieve higher CKA alignment yet are not as robust. We hypothesize that joint information alone does not explain the results; the architecture of the model also matters. Note that the XX, XX-CC, RR-XX and RR-XX-CC models capture a local-to-global relationship between modalities, while the RR-AE, RR-CC and RR models capture joint information only at the global-to-global or local-to-local representation level. Thus, given the empirical evidence in Figure 5, the local-to-global relationship XX is an essential building block for multimodal data because it allows us to capture complex multiscale relationships between modalities.
3.3 Interpretability
3.3.1 Explaining group differences between HC and AD
In this subsection, we explain the performance of the models by analyzing their saliency maps. As points of interest and comparison, we select the Supervised and RR-XX models. The Supervised model performs best with T1 input volumes and utilizes target labels; thus it is a solid baseline for analyzing group differences. The RR-XX model performs best in the ternary classification task and does particularly well on fALFF input volumes. We use these two models to generate saliency maps and interpret what the models have learned.
For each selected model, we compute integrated gradients sundararajan2017axiomatic along each dimension of the 64-dimensional representation and discard the negative gradients. After computing saliency gradients for each dimension of the latent representation, we apply brain masking, rescale the gradient values to the unit interval, and smooth them with a Gaussian filter (). We then perform a voxelwise two-sided Mann-Whitney U test and compute the rank-biserial correlation (RBC) as an effect size. After selecting the voxels with a , we find clusters with at least 200 voxels using 3dClusterize from AFNI cox1996afni . We then apply whereami from AFNI cox1996afni to match those clusters with the ROIs defined in the template used in the Neuromark pipeline du2020neuromark ; we call this template the Neuromark atlas in the rest of the text. To create the Neuromark atlas, the spatial ICA components of the Neuromark template were combined into an atlas by simple overlap, and the atlas was then added to the AFNI cox1996afni environment.
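The rank-biserial effect size used above can be sketched per voxel by direct pair counting; this is a minimal numpy version (the significance itself comes from the standard Mann-Whitney U test, e.g. scipy.stats.mannwhitneyu).

```python
import numpy as np

def rank_biserial(x, y):
    """Effect size tied to the Mann-Whitney U statistic:
    r = 2*U1/(n1*n2) - 1, in [-1, 1], where U1 counts (x, y) pairs with
    x > y (ties count 0.5). Positive r means x tends to exceed y."""
    diff = x[:, None] - y[None, :]
    u1 = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return 2.0 * u1 / diff.size - 1.0
```

In the voxelwise analysis, `x` and `y` would be the saliency values of one voxel across the HC and AD groups, applied independently at every voxel.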
For each cluster, we select only the top brain ROI based on the overlap, and we include only ROIs found in at least two folds. The overlap between the clustered saliencies and the ROIs is measured using the Dice coefficient. The final results are summarized over all dimensions in Figure 6, where we report the maximum Dice overlap for both models and both modalities.
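The Dice overlap between a saliency cluster and an atlas ROI can be sketched on binary volumes as follows.

```python
import numpy as np

def dice_overlap(cluster_mask, roi_mask):
    """Dice coefficient between two binary volumes: 2|A ∩ B| / (|A| + |B|)."""
    a, b = cluster_mask.astype(bool), roi_mask.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 0.0
```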
The results in Figure 6 suggest that higher discriminative performance is related to a sparser choice of ROIs. Specifically, the Supervised model appears sparser than the RR-XX model on T1 volumes, where its binary ROC AUC performance is , compared to . Conversely, RR-XX appears sparser than Supervised on fALFF data, where its 2-way ROC AUC performance is compared to .
Another result in Figure 6 is that the Supervised model has higher Dice overlap and stronger contrast than RR-XX. This suggests stronger localization of the Supervised model's saliency maps, which can be explained by its use of labels to learn the representation. Since a self-supervised model learns without labels, the lower contrast supports the idea that self-supervised learning produces a more general, less task-specific representation.
In addition, Figure 6 shows that on fALFF the models tend to capture information from the frontal lobe regions, and less from the posterior part of the brain, in their representations.
3.3.2 Detailed analysis of group differences with best selfsupervised and supervised models
In this analysis, we compare saliency maps of group differences for the Supervised and RR-XX models, specifically for the most discriminative dimensions of the representation vector. The discriminative dimensions are the dimensions with the highest and lowest beta coefficients in the trained logistic regression. After selecting the dimensions, we compute saliencies and perform a voxelwise test as in the previous subsection to obtain RBC maps with significant voxels.
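Selecting the discriminative dimensions from the fitted logistic regression can be sketched as follows; the function name is ours, and `betas` stands for the fitted coefficient vector over the representation dimensions.

```python
import numpy as np

def most_discriminative_dims(betas, k=1):
    """Return the indices of the k largest and the k smallest (most negative)
    logistic-regression coefficients, i.e. the dimensions that push the
    decision most strongly toward each class."""
    order = np.argsort(betas)
    return order[-k:][::-1].tolist(), order[:k].tolist()
```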
In Figure 7 we show RBC maps for the Supervised model, and in Figure 8 RBC maps for RR-XX, for the first fold only on the holdout test set.
The Supervised model has bigger clusters on T1, while the self-supervised RR-XX model has more local, smaller clusters. Given that Supervised performs better on T1 than RR-XX, these RBC maps might explain the performance gap in 2-way classification. However, given the reduced gap in 3-way classification, they might also indicate that the Supervised model overfits the task and uses more regions than needed.
As shown in Figure 8, the self-supervised RR-XX model is able to pick out discriminative regions on T1 that are supported by the literature, such as the hippocampus yang2022human , thalamus elvsaashagen2021genetic , parietal lobule greene2010subregions , and occipital gyrus yang2019study . The regions found for Supervised are also supported by the literature, such as the precuneus guennewig2021defining and anterior cingulate cortex yu2021human .
On fALFF, the Supervised and RR-XX models behave similarly, with more local, smaller clusters. The Supervised model highlights the precuneus, consistent with prior work focused on fALFF wang2021comparative . The RR-XX model highlights the hippocampus 10.3389/fnagi.2018.00037 ; zhang2021regional , middle temporal gyrus hu2022brain , subthalamus and hypothalamus chan2021induction , and superior medial frontal gyrus cheung2021diagnostic .
3.3.3 Exploring multimodal links
This section explores multimodal links between the T1 and fALFF modalities. To perform this analysis, we compute an asymmetric correlation matrix between all pairs of dimensions of the dimensional global representations of T1 and fALFF. We then select one ROI from the Neuromark atlas and find the dimension of the representation vector whose RBC-map cluster has the highest Dice overlap with this ROI. After finding this dimension, we find a second dimension from the other modality with the highest positive and negative correlations in the correlation matrix. We then connect the first ROI to each ROI captured by the second dimension with an edge weighted by the correlation value. We repeat the same procedure for each of the 53 ROIs in the Neuromark atlas and each modality.
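The construction of the cross-modal correlation matrix and the selection of the strongest edges can be sketched as follows; the function names are ours, and rows are subjects while columns are representation dimensions.

```python
import numpy as np

def cross_modal_correlation(h_t1, h_falff):
    """Asymmetric Pearson correlation matrix between every pair of
    representation dimensions across subjects: entry (i, j) correlates
    T1 dimension i with fALFF dimension j."""
    a = (h_t1 - h_t1.mean(0)) / h_t1.std(0)
    b = (h_falff - h_falff.mean(0)) / h_falff.std(0)
    return (a.T @ b) / len(h_t1)

def top_edges(corr, k=64):
    """(t1_dim, falff_dim) index pairs of the k entries with the largest
    absolute correlation, strongest first."""
    flat = np.argsort(np.abs(corr), axis=None)[::-1][:k]
    return [tuple(idx) for idx in np.stack(np.unravel_index(flat, corr.shape), axis=1)]
```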
The final summarization of the multimodal relationships is shown in Figure 9. Note that we show the top 64 edges with the largest absolute weights, and we focus specifically on the self-supervised multimodal model RR-XX. We also show the same diagram for the Supervised model, but in a restricted way: the correlations of the Supervised model are much lower, and the model learns its representations unimodally, so its relationships are more likely to be spurious. In the figure, we show ROIs for T1 on the left side in blue hues and for fALFF on the right side in red hues. Additionally, we group ROIs by the functional networks defined in the Neuromark atlas.
One positively correlated link (Pearson's ) found by RR-XX is thalamus 5 (fALFF) - precuneus 48 (T1), which has been associated with changes in consciousness cunningham2017structural . Another, negatively correlated link (Pearson's ) between precuneus 48 (T1) and hippocampus 37 (fALFF) has been associated with Alzheimer's disease kim2013hippocampus ; ryu2010measurement . The negatively correlated link (Pearson's ) between precuneus 48 (T1) and middle cingulate cortex 39 (fALFF) could be related to findings in rami2012distinct ; bailly2015precuneus .
Overall, the selfsupervised model RRXX can learn meaningful multimodal relationships that clinicians can further explore.
3.4 Hardware, reproducibility, and code
The experiments were performed using an NVIDIA V100. The code is implemented mainly using the PyTorch paszke2019pytorch and Catalyst catalyst frameworks. The code is available at github.com/Entodi/fusion for reproducibility and further exploration by the scientific community.
4 Discussion
4.1 Multiscale coordinated selfsupervised models
The proposed self-supervised multimodal multiscale coordinated models can capture useful representations and multimodal relationships in the data. Compared to existing unimodal (CR and AE) and multimodal (DCCAE) counterparts, these models achieve higher discriminative performance in ROC AUC on downstream tasks, while some of them capture higher joint information content between modalities as measured by CKA. Furthermore, these models produce representations that show competitive performance on T1 compared to the Supervised model and outperform it on fALFF.
We show strong empirical evidence that XX is the most important relationship to encourage when high discriminative performance is the goal. This result is evidence of the importance and existence of multiscale local-to-global multimodal relationships in functional and structural neuroimaging data, which the other relationships cannot capture.
However, not all multimodal variants from the taxonomy in Figure 2 result in robust and useful representations. Specifically, our experiments show that the CC relationship should not be used separately from other objectives: CC optimizes only the layers below the chosen one, because the last layer then behaves as a random projection. We include it only for a complete picture of the achievable classification performance with all objectives in the taxonomy.
The CCA-based objectives did not show results as good as the current literature would suggest. However, our taxonomy revealed that DCCA is related to SimCLR simclr , which led us to develop the RR model. While CCA maximizes correlation, SimCLR maximizes cosine similarity between representations of different modalities. The SimCLR objective has one more important difference: it performs an additional discrimination step on the cosine similarity scores. Thus it does two things simultaneously: it maximizes the similarity between modalities and it discriminates pairs based on similarity. The latter task is more challenging because it requires capturing richer information to classify pairs of representations from different modalities by similarity. In addition, the CCA objective is prone to numerical instability due to its implementation in DCCAE DCCAE ; RR does not have such issues. We recommend using "softer" optimization based on mutual information estimators with deep neural networks rather than the "exact" solutions based on linear algebra in DCCAE DCCAE .
While AE imposes additional computational complexity due to its decoder, it has not shown benefits for the discriminative performance of the model. Specifically, AE struggles with the ternary classification task. The multimodal models from our proposed taxonomy lack a volumetric decoder and thus have a reduced computational burden. These findings concur with the poor performance of autoencoders on datasets of natural images DIM . We hypothesize that autoencoders may require encoders and decoders of high capacity to achieve greater performance; however, this would considerably increase the difficulty of training large volumetric models.
4.2 Future Work
The models we have constructed in this work do not disentangle the representation into joint and unique modality-specific parts. The analysis of CKA versus downstream performance shows the existence of a joint subspace between modalities and indicates that a certain amount of joint information, as measured by CKA, is important for learning representations valuable for downstream tasks. Future work could consider models that explicitly represent the joint and unique factors. Related ideas have been explored for natural images when disentangling content and style von2021self ; lyu2021understanding and, similarly, for neural data with variational autoencoders liu2021drop .
In our analysis, we do not consider the family of multimodal generative variational models kingma2013auto . Currently, volumetric variational models are computationally expensive, and the field is under active development, with many models proposed recently. Including all possible models with all possible underlying technology was not our goal and would make the already extensive list of models hard to analyze. Future work may consider variational models under the same taxonomy for a fair comparison and detailed analysis of multimodal fusion applications.
More can also be done concerning the explainability of the models. Currently, a common choice for modeling neuroimaging data is a convolutional neural network (CNN) abrol2021deep . However, the straightforward application of CNNs leads to representations in which each dimension captures multiple ROIs. This effect complicates the analysis of cross-modal relationships: the multimodal links can only be measured as correlations between dimensions of the representations in different modalities, so the measured links relate dimensions rather than one ROI to another. Future work may consider ROI-based representations.
In addition, while we focus on unsupervised models that do not use group labels, we used the HC and AD groups to show group differences and identify ROIs in our analysis. However, the data may contain phenotypically small groups of patients that are not represented by the HC or AD groups. Group analysis is hard in such a scenario because we do not have the labels. Future work can therefore consider additional clustering of the representation to find such subgroups, which explainability methods can then analyze further.
5 Conclusions
In this work, we presented a novel multiscale coordinated framework for representation learning from multimodal neuroimaging data. We showed that self-supervised approaches can learn meaningful and useful representations that capture regions of interest with group differences without accessing group labels during the pretraining stage. We developed evaluation methodologies to assess the properties of the representations learned by models within the family through downstream task analysis, measurement of the joint subspace, and explainability evaluations.
We outperformed the previous unsupervised models AE and DCCAE on all classification tasks and modalities. In addition, our family of models does not require a decoder, which saves computational and memory resources. We can also outperform the Supervised model on fALFF, which suggests the future use of the proposed multimodal objectives for asymmetric fusion as a regularization technique. Further, our findings highlight the importance of the multiscale local-to-global multimodal relationship XX, which considerably improves performance and multimodal alignment over previous methods and within the proposed family of models. This result suggests that there exist multiscale relationships between the local structure and the global summary of the inputs in different modalities that have previously been neglected in multimodal representation learning.
The RR-XX model, selected based on the best classification performance and higher joint information content via CKA, was able to capture important regions of interest related to Alzheimer's disease such as the hippocampus yang2022human ; 10.3389/fnagi.2018.00037 ; zhang2021regional , thalamus elvsaashagen2021genetic , parietal lobule greene2010subregions , occipital gyrus yang2019study , middle temporal gyrus hu2022brain , subthalamus and hypothalamus chan2021induction , and superior medial frontal gyrus cheung2021diagnostic . Importantly, the RR-XX model is able to capture multimodal links between regions that are supported by the literature, such as thalamus-precuneus cunningham2017structural , precuneus-hippocampus kim2013hippocampus ; ryu2010measurement , and precuneus-middle cingulate cortex rami2012distinct ; bailly2015precuneus .
The showcased benefits of applying a comprehensive approach, evaluating a taxonomy of methods, and performing extensive qualitative and quantitative evaluation suggest that multimodal representation learning is a field with significant potential in neuroimaging, despite being in a nascent state. Our work lays a foundation for future robust and increasingly more interpretable multimodal models.
Acknowledgments
This work was funded by the National Institutes of Health (NIH) grants R01MH118695, RF1AG063153, 2R01EB006841, RF1MH121885, and the National Science Foundation (NSF) grant 2112455.
Data were provided by OASIS3: Principal Investigators: T. Benzinger, D. Marcus, J. Morris; NIH P50 AG00561, P30 NS09857781, P01 AG026276, P01 AG003991, R01 AG043434, UL1 TR000448, R01 EB009352. AV45 doses were provided by Avid Radiopharmaceuticals, a wholly owned subsidiary of Eli Lilly.
References
 (1) V. D. Calhoun, J. Sui, Multimodal fusion of brain imaging data: a key to finding the missing link (s) in complex mental illness, Biological psychiatry: cognitive neuroscience and neuroimaging 1 (3) (2016) 230–244.
 (2) S. M. Plis, M. P. Weisend, E. Damaraju, T. Eichele, A. Mayer, V. P. Clark, T. Lane, V. D. Calhoun, Effective connectivity analysis of fmri and meg data collected under identical paradigms, Computers in biology and medicine 41 (12) (2011) 1156–1165.
 (3) T. Baltrušaitis, C. Ahuja, L.P. Morency, Multimodal machine learning: A survey and taxonomy, IEEE transactions on pattern analysis and machine intelligence 41 (2) (2018) 423–443.
 (4) P. Comon, Independent component analysis, a new concept?, Signal processing 36 (3) (1994) 287–314.
 (5) H. Hotelling, Relations between two sets of variates, in: Breakthroughs in statistics, Springer, 1992, pp. 162–190.
 (6) M. Moosmann, T. Eichele, H. Nordby, K. Hugdahl, V. D. Calhoun, Joint independent component analysis for simultaneous eeg–fmri: principle and simulation, International Journal of Psychophysiology 67 (3) (2008) 212–221.
 (7) J. Sui, G. Pearlson, A. Caprihan, T. Adali, K. A. Kiehl, J. Liu, J. Yamamoto, V. D. Calhoun, Discriminating schizophrenia and bipolar disorder by fusing fmri and dti in a multimodal cca+ joint ica model, Neuroimage 57 (3) (2011) 839–855.
 (8) J. Sui, T. Adali, G. Pearlson, H. Yang, S. R. Sponheim, T. White, V. D. Calhoun, A cca+ ica based model for multitask brain imaging data fusion and its application to schizophrenia, Neuroimage 51 (1) (2010) 123–134.
 (9) J. Liu, G. Pearlson, A. Windemuth, G. Ruano, N. I. PerroneBizzozero, V. Calhoun, Combining fmri and snp data to investigate connections between brain function and genetics using parallel ica, Human brain mapping 30 (1) (2009) 241–255.
 (10) K. Duan, V. D. Calhoun, J. Liu, R. F. Silva, anyway independent component analysis, in: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), IEEE, 2020, pp. 1770–1774.
 (11) A. Abrol, Z. Fu, M. Salman, R. Silva, Y. Du, S. Plis, V. Calhoun, Deep learning encodes robust discriminative neuroimaging representations to outperform standard machine learning, Nature communications 12 (1) (2021) 1–17.
 (12) R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, F. Wichmann, Shortcut learning in deep neural networks, arXiv preprint arXiv:2004.07780.
 (13) D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al., A closer look at memorization in deep networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 233–242.

 (14) M. Pechenizkiy, A. Tsymbal, S. Puuronen, O. Pechenizkiy, Class noise and supervised learning in medical domains: The effect of feature extraction, in: 19th IEEE Symposium on ComputerBased Medical Systems (CBMS'06), IEEE, 2006, pp. 708–713.
 (15) H. Rokham, G. Pearlson, A. Abrol, H. Falakshahi, S. Plis, V. D. Calhoun, Addressing inaccurate nosology in mental health: A multilabel data cleansing approach for detecting label noise from structural magnetic resonance imaging data in mood and psychosis disorders, Biological Psychiatry: Cognitive Neuroscience and Neuroimaging 5 (8) (2020) 819–832.
 (16) O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, A. v. d. Oord, Dataefficient image recognition with contrastive predictive coding (2020) 4182–4192.
 (17) A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, T. Brox, Discriminative unsupervised feature learning with convolutional neural networks, Advances in neural information processing systems 27.
 (18) D. Hendrycks, M. Mazeika, S. Kadavath, D. Song, Using selfsupervised learning can improve model robustness and uncertainty, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlchéBuc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
 (19) M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, A. Joulin, Unsupervised learning of visual features by contrasting cluster assignments, in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, Vol. 33, Curran Associates, Inc., 2020, pp. 9912–9924.
 (20) N. Srivastava, R. Salakhutdinov, Learning representations for multimodal data with deep belief nets, in: International conference on machine learning workshop, Vol. 79, 2012, p. 3.
 (21) S. M. Plis, D. R. Hjelm, R. Salakhutdinov, E. A. Allen, H. J. Bockholt, J. D. Long, H. J. Johnson, J. S. Paulsen, J. A. Turner, V. D. Calhoun, Deep learning for neuroimaging: a validation study, Frontiers in neuroscience 8 (2014) 229.
 (22) N. Srivastava, R. R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, Advances in neural information processing systems 25.

 (23) R. D. Hjelm, V. D. Calhoun, R. Salakhutdinov, E. A. Allen, T. Adali, S. M. Plis, Restricted Boltzmann machines for neuroimaging: an application in identifying intrinsic networks, NeuroImage 96 (2014) 245–260.
 (24) H.-I. Suk, S.-W. Lee, D. Shen, Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis, NeuroImage 101 (2014) 569–582. doi:10.1016/j.neuroimage.2014.06.077.
 (25) G. Andrew, R. Arora, J. Bilmes, K. Livescu, Deep canonical correlation analysis, in: International conference on machine learning, PMLR, 2013, pp. 1247–1255.
 (26) W. Wang, R. Arora, K. Livescu, J. Bilmes, On deep multiview representation learning, in: International conference on machine learning, PMLR, 2015, pp. 1083–1092.
 (27) A. Fedorov, J. Johnson, E. Damaraju, A. Ozerin, V. Calhoun, S. Plis, End-to-end learning of brain tissue segmentation from imperfect labeling, in: 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, 2017, pp. 3785–3792.
 (28) A. Fedorov, E. Damaraju, V. Calhoun, S. Plis, Almost instant brain atlas segmentation for large-scale studies, arXiv preprint arXiv:1711.00457.
 (29) L. Henschel, S. Conjeti, S. Estrada, K. Diers, B. Fischl, M. Reuter, Fastsurfer – a fast and accurate deep learning based neuroimaging pipeline, NeuroImage 219 (2020) 117012.
 (30) R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, Y. Bengio, Learning deep representations by mutual information estimation and maximization, in: International Conference on Learning Representations, 2019.
 (31) A. v. d. Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748.
 (32) A. Fedorov, T. Sylvain, E. Geenjaar, M. Luck, L. Wu, T. P. DeRamus, A. Kirilin, D. Bleklov, V. D. Calhoun, S. M. Plis, Self-supervised multimodal domino: in search of biomarkers for Alzheimer’s disease, in: 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI), IEEE, 2021, pp. 23–30.

 (33) I. Misra, L. v. d. Maaten, Self-supervised learning of pretext-invariant representations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6707–6717.
 (34) C. Doersch, A. Gupta, A. A. Efros, Unsupervised visual representation learning by context prediction, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1422–1430.
 (35) S. Gidaris, P. Singh, N. Komodakis, Unsupervised representation learning by predicting image rotations, arXiv preprint arXiv:1803.07728.
 (36) R. Zhang, P. Isola, A. A. Efros, Colorful image colorization, in: European conference on computer vision, Springer, 2016, pp. 649–666.
 (37) A. Fedorov, R. D. Hjelm, A. Abrol, Z. Fu, Y. Du, S. Plis, V. D. Calhoun, Prediction of progression to Alzheimer’s disease with deep infomax, in: 2019 IEEE EMBS International conference on biomedical & health informatics (BHI), IEEE, 2019, pp. 1–5.
 (38) U. Mahmood, M. M. Rahman, A. Fedorov, Z. Fu, S. Plis, Transfer learning of fmri dynamics, arXiv preprint arXiv:1911.06813.
 (39) U. Mahmood, M. M. Rahman, A. Fedorov, N. Lewis, Z. Fu, V. D. Calhoun, S. M. Plis, Whole milc: generalizing learned dynamics across tasks, datasets, and populations, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2020, pp. 407–417.

 (40) A. Taleb, W. Loetzsch, N. Danz, J. Severin, T. Gaertner, B. Bergner, C. Lippert, 3d self-supervised methods for medical imaging, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, Vol. 33, Curran Associates, Inc., 2020, pp. 18158–18172. URL https://proceedings.neurips.cc/paper/2020/file/d2dc6368837861b42020ee72b0896182-Paper.pdf
 (41) P. Bachman, R. D. Hjelm, W. Buchwalter, Learning representations by maximizing mutual information across views, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
 (42) Y. Tian, D. Krishnan, P. Isola, Contrastive multiview coding, arXiv preprint arXiv:1906.05849.
 (43) T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: International conference on machine learning, PMLR, 2020, pp. 1597–1607.
 (44) A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, A. Zisserman, End-to-end learning of visual representations from uncurated instructional videos, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9879–9889.
 (45) J.-B. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Ramapuram, J. De Fauw, L. Smaira, S. Dieleman, A. Zisserman, Self-supervised multimodal versatile networks, Advances in Neural Information Processing Systems 33 (2020) 25–37.
 (46) A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: ICML, 2021.
 (47) T. Sylvain, F. Dutil, T. Berthier, L. Di Jorio, M. Luck, D. Hjelm, Y. Bengio, Cmim: Cross-modal information maximization for medical imaging, in: ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 1190–1194. doi:10.1109/ICASSP39728.2021.9414132.
 (48) K. Saito, Y. Mukuta, Y. Ushiku, T. Harada, Demian: Deep modality invariant adversarial network, arXiv preprint arXiv:1612.07976.
 (49) Z. Feng, C. Xu, D. Tao, Self-supervised representation learning from multi-domain data, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3245–3255.
 (50) T. Sylvain, L. Petrini, D. Hjelm, Locality and compositionality in zero-shot learning, arXiv preprint arXiv:1912.12179.
 (51) T. Sylvain, L. Petrini, D. Hjelm, Zero-shot learning from scratch (zfs): leveraging local compositional representations (2020). arXiv:2010.13320.
 (52) A. Anand, E. Racah, S. Ozair, Y. Bengio, M. Côté, D. Hjelm, Unsupervised state representation learning in atari, in: NeurIPS, 2019.
 (53) P. J. LaMontagne, T. L. Benzinger, J. C. Morris, S. Keefe, R. Hornbeck, C. Xiong, E. Grant, J. Hassenstab, K. Moulder, A. G. Vlassenko, M. E. Raichle, C. Cruchaga, D. Marcus, Oasis-3: Longitudinal neuroimaging, clinical, and cognitive dataset for normal aging and Alzheimer disease, medRxiv (2019). doi:10.1101/2019.12.13.19014902.
 (54) A. C. Yang, R. T. Vest, F. Kern, D. P. Lee, M. Agam, C. A. Maat, P. M. Losada, M. B. Chen, N. Schaum, N. Khoury, et al., A human brain vascular atlas reveals diverse mediators of Alzheimer’s risk, Nature 603 (7903) (2022) 885–892.
 (55) T. Elvsåshagen, A. Shadrin, O. Frei, D. van der Meer, S. Bahrami, V. J. Kumar, O. Smeland, L. T. Westlye, O. A. Andreassen, T. Kaufmann, The genetic architecture of the human thalamus and its overlap with ten common brain disorders, Nature communications 12 (1) (2021) 1–9.
 (56) S. J. Greene, R. J. Killiany, A. D. N. Initiative, et al., Subregions of the inferior parietal lobule are affected in the progression to Alzheimer’s disease, Neurobiology of aging 31 (8) (2010) 1304–1311.
 (57) H. Yang, H. Xu, Q. Li, Y. Jin, W. Jiang, J. Wang, Y. Wu, W. Li, C. Yang, X. Li, et al., Study of brain morphology change in Alzheimer’s disease and amnestic mild cognitive impairment compared with normal controls, General psychiatry 32 (2).

 (58) X. Liu, W. Chen, Y. Tu, H. Hou, X. Huang, X. Chen, Z. Guo, G. Bai, W. Chen, The abnormal functional connectivity between the hypothalamus and the temporal gyrus underlying depression in Alzheimer’s disease patients, Frontiers in Aging Neuroscience 10 (2018) 37. doi:10.3389/fnagi.2018.00037. URL https://www.frontiersin.org/article/10.3389/fnagi.2018.00037
 (59) F. Zhang, B. Hua, M. Wang, T. Wang, Z. Ding, J.-R. Ding, Regional homogeneity abnormalities of resting state brain activities in children with growth hormone deficiency, Scientific Reports 11 (1) (2021) 1–7.
 (60) Q. Hu, Y. Li, Y. Wu, X. Lin, X. Zhao, Brain network hierarchy reorganization in Alzheimer’s disease: A resting-state functional magnetic resonance imaging study, Human Brain Mapping.
 (61) D. Chan, H.-J. Suk, B. Jackson, N. Milman, D. Stark, S. Beach, L.-H. Tsai, Induction of specific brain oscillations may restore neural circuits and be used for the treatment of Alzheimer’s disease, Journal of Internal Medicine 290 (5) (2021) 993–1009.
 (62) E. Y. Cheung, Y. Shea, P. K. Chiu, J. S. Kwan, H. K. Mak, Diagnostic efficacy of voxel-mirrored homotopic connectivity in vascular dementia as compared to Alzheimer’s related neurodegenerative diseases—a resting state fmri study, Life 11 (10) (2021) 1108.
 (63) S. I. Cunningham, D. Tomasi, N. D. Volkow, Structural and functional connectivity of the precuneus and thalamus to the default mode network, Human Brain Mapping 38 (2) (2017) 938–956.
 (64) J. Kim, Y.-H. Kim, J.-H. Lee, Hippocampus–precuneus functional connectivity as an early sign of Alzheimer’s disease: A preliminary study using structural and functional magnetic resonance imaging data, Brain research 1495 (2013) 18–29.
 (65) S.-Y. Ryu, M. J. Kwon, S.-B. Lee, D. W. Yang, T.-W. Kim, I.-U. Song, P. S. Yang, H. J. Kim, A. Y. Lee, Measurement of precuneal and hippocampal volumes using magnetic resonance volumetry in Alzheimer’s disease, Journal of Clinical Neurology 6 (4) (2010) 196–203.
 (66) L. Rami, R. Sala-Llonch, C. Solé-Padullés, J. Fortea, J. Olives, A. Lladó, C. Pena-Gómez, M. Balasa, B. Bosch, A. Antonell, et al., Distinct functional activity of the precuneus and posterior cingulate cortex during encoding in the preclinical stage of Alzheimer’s disease, Journal of Alzheimer’s disease 31 (3) (2012) 517–526.
 (67) M. Bailly, C. Destrieux, C. Hommet, K. Mondon, J.-P. Cottier, E. Beaufils, E. Vierron, J. Vercouillie, M. Ibazizene, T. Voisin, et al., Precuneus and cingulate cortex atrophy and hypometabolism in patients with Alzheimer’s disease and mild cognitive impairment: Mri and 18f-fdg pet quantitative analysis using freesurfer, BioMed research international 2015.
 (68) A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, Devise: A deep visual-semantic embedding model, Advances in neural information processing systems 26.
 (69) W. Hu, B. Cai, A. Zhang, V. D. Calhoun, Y.-P. Wang, Deep collaborative learning with application to the study of multimodal brain development, IEEE Transactions on Biomedical Engineering 66 (12) (2019) 3346–3359.
 (70) O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 234–241.
 (71) A. Araujo, W. Norris, J. Sim, Computing receptive fields of convolutional neural networks, Distill (2019). https://distill.pub/2019/computing-receptive-fields. doi:10.23915/distill.00021.
 (72) M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, M. Lucic, On mutual information maximization for representation learning, in: International Conference on Learning Representations, 2020.
 (73) S. Löwe, P. O’Connor, B. Veeling, Putting an end to end-to-end: Gradient-isolated learning of representations, Advances in neural information processing systems 32.
 (74) K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
 (75) G. Alain, Y. Bengio, Understanding intermediate layers using linear classifier probes, arXiv preprint arXiv:1610.01644.
 (76) S. Kornblith, M. Norouzi, H. Lee, G. Hinton, Similarity of neural network representations revisited, in: International Conference on Machine Learning, PMLR, 2019, pp. 3519–3529.
 (77) M. Raghu, J. Gilmer, J. Yosinski, J. Sohl-Dickstein, Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, Inc., 2017.
 (78) A. S. Morcos, M. Raghu, S. Bengio, Insights on representational similarity in neural networks with canonical correlation, in: NeurIPS, 2018, pp. 5732–5741.
 (79) A. Gretton, O. Bousquet, A. Smola, B. Schölkopf, Measuring statistical dependence with Hilbert-Schmidt norms, in: International conference on algorithmic learning theory, Springer, 2005, pp. 63–77.
 (80) T. Nguyen, M. Raghu, S. Kornblith, Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth, in: International Conference on Learning Representations, 2021.
 (81) M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: International conference on machine learning, PMLR, 2017, pp. 3319–3328.
 (82) A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434.
 (83) Q.-H. Zou, C.-Z. Zhu, Y. Yang, X.-N. Zuo, X.-Y. Long, Q.-J. Cao, Y.-F. Wang, Y.-F. Zang, An improved approach to detection of amplitude of low-frequency fluctuation (alff) for resting-state fmri: fractional alff, Journal of neuroscience methods 172 (1) (2008) 137–141.
 (84) Y. He, L. Wang, Y. Zang, L. Tian, X. Zhang, K. Li, T. Jiang, Regional coherence changes in the early stages of Alzheimer’s disease: a combined structural and resting-state functional mri study, Neuroimage 35 (2) (2007) 488–500.
 (85) C. Wang, Z. Shen, P. Huang, H. Yu, W. Qian, X. Guan, Q. Gu, Y. Yang, M. Zhang, Altered spontaneous brain activity in chronic smokers revealed by fractional amplitude of low-frequency fluctuation analysis: a preliminary study, Scientific reports 7 (1) (2017) 1–7.
 (86) M. Jenkinson, P. Bannister, M. Brady, S. Smith, Improved optimization for the robust and accurate linear registration and motion correction of brain images, Neuroimage 17 (2) (2002) 825–841.
 (87) X.-W. Song, Z.-Y. Dong, X.-Y. Long, S.-F. Li, X.-N. Zuo, C.-Z. Zhu, Y. He, C.-G. Yan, Y.-F. Zang, Rest: a toolkit for resting-state functional magnetic resonance imaging data processing, PloS one 6 (9) (2011) e25031.
 (88) A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, arXiv preprint arXiv:1703.07737.

 (89) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 249–256.
 (90) A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., Pytorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems 32 (2019) 8026–8037.
 (91) L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, J. Han, On the variance of the adaptive learning rate and beyond, arXiv preprint arXiv:1908.03265.
 (92) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
 (93) T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: KDD, 2019.
 (94) A. Defazio, F. Bach, S. Lacoste-Julien, Saga: A fast incremental gradient method with support for non-strongly convex composite objectives, arXiv preprint arXiv:1407.0202.
 (95) D. J. Hand, R. J. Till, A simple generalisation of the area under the roc curve for multiple class classification problems, Machine learning 45 (2) (2001) 171–186.
 (96) R. W. Cox, Afni: software for analysis and visualization of functional magnetic resonance neuroimages, Computers and Biomedical research 29 (3) (1996) 162–173.
 (97) Y. Du, Z. Fu, J. Sui, S. Gao, Y. Xing, D. Lin, M. Salman, A. Abrol, M. A. Rahaman, J. Chen, et al., Neuromark: An automated and adaptive ica-based pipeline to identify reproducible fmri markers of brain disorders, NeuroImage: Clinical 28 (2020) 102375.
 (98) B. Guennewig, J. Lim, L. Marshall, A. N. McCorkindale, P. J. Paasila, E. Patrick, J. J. Kril, G. M. Halliday, A. A. Cooper, G. T. Sutherland, Defining early changes in Alzheimer’s disease from rna sequencing of brain regions differentially affected by pathology, Scientific Reports 11 (1) (2021) 1–15.
 (99) M. Yu, O. Sporns, A. J. Saykin, The human connectome in Alzheimer disease—relationship to biomarkers and genetics, Nature Reviews Neurology 17 (9) (2021) 545–563.
 (100) S.-M. Wang, N.-Y. Kim, D. W. Kang, Y. H. Um, H.-R. Na, Y. S. Woo, C. U. Lee, W.-M. Bahk, H. K. Lim, A comparative study on the predictive value of different resting-state functional magnetic resonance imaging parameters in preclinical Alzheimer’s disease, Frontiers in psychiatry 12.
 (101) S. Kolesnikov, Accelerated deep learning r&d, https://github.com/catalyst-team/catalyst (2018).
 (102) J. Von Kügelgen, Y. Sharma, L. Gresele, W. Brendel, B. Schölkopf, M. Besserve, F. Locatello, Self-supervised learning with data augmentations provably isolates content from style, Advances in Neural Information Processing Systems 34.
 (103) Q. Lyu, X. Fu, W. Wang, S. Lu, Understanding latent correlation-based multi-view learning and self-supervision: An identifiability perspective, in: International Conference on Learning Representations, 2021.
 (104) R. Liu, M. Azabou, M. Dabagia, C.-H. Lin, M. Gheshlaghi Azar, K. Hengen, M. Valko, E. Dyer, Drop, swap, and generate: A self-supervised approach for generating neural activity, Advances in Neural Information Processing Systems 34.
 (105) D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114.