Recently, there has been increasing recognition that multimodal imaging data fusion can exploit the complementary information across different data, leading to better performance in diagnosis and in the analysis of mechanisms. Conventional multimodal fusion often focuses on matrix decomposition approaches. Among these methods, canonical correlation analysis (CCA) has been widely used to integrate multimodal data by detecting linear cross-data correlations. However, CCA fails when data have complex nonlinear interactions. To capture complex cross-data associations, deep neural network (DNN) based models, e.g., deep CCA, have been developed, which employ deep networks to extract high-level cross-data associations. These methods can lead to improved performance in terms of prediction/diagnosis [1, 23].
Beyond diagnosis, it is also important to uncover hidden disease mechanisms. This requires the data analysis model to be interpretable, i.e., with explicit and interpretable data representations. However, DNN is composed of a large number of layers and each layer consists of several nonlinear transforms/operations, e.g., nonlinear activation and convolution, resulting in difficulties in interpreting its data representations. Moreover, the captured cross-data associations are not guaranteed to be relevant to the variable of interest, e.g., disease. Instead, the associations may result from interest-irrelevant signals, e.g., noise and background. Therefore, it is not clear how to use the captured associations for disease mechanism analysis.
To address these issues, we develop an interpretable DNN based multimodal fusion model, Grad-CAM guided convolutional collaborative learning (gCAM-CCL), which can perform automated diagnosis and result interpretation simultaneously. The gCAM-CCL model can generate interpretable activation maps indicating pixel-wise contributions of the inputs, enabling automated result interpretation. Moreover, the activation maps are class-specific, which can further promote class-difference analysis and biological mechanism analysis. In addition, the cross-data associations captured by gCAM-CCL are interest-related, e.g., disease-related. This is achieved by feeding the network representations to a collaborative layer which considers both cross-data interactions and the fitting to traits.
The rest of the paper is organized as follows. Section II describes the limitations of several existing multimodal fusion methods and how the proposed model addresses them. Section III presents the data collection and preprocessing procedures as well as the experiments and results of applying gCAM-CCL to an imaging genetic study. A brief discussion is given in Section IV.
II-A Multimodal data fusion: analyzing cross-data association
Classical multimodal data fusion methods often focus on cross-data matrix factorization. Among them, canonical correlation analysis (CCA) has been widely used in multi-view/omics studies [9, 10]. CCA aims to find the most correlated variable pairs, i.e., canonical variables, on which further association analysis can be performed.
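Specifically, given two data matrices $X_1 \in \mathbb{R}^{n \times r}$ and $X_2 \in \mathbb{R}^{n \times s}$ ($n$ represents the sample/subject size, and $r$, $s$ represent the feature/variable sizes in the two data sets), CCA seeks two optimal loading vectors $u_1 \in \mathbb{R}^{r \times 1}$ and $u_2 \in \mathbb{R}^{s \times 1}$ which maximize the Pearson correlation $\mathrm{corr}(X_1 u_1, X_2 u_2)$, as in Eq. (1).
$$(u_1^*, u_2^*) = \arg\max_{u_1, u_2} \; u_1^T \Sigma_{12} u_2 \quad \text{subject to} \quad u_1^T \Sigma_{11} u_1 = 1, \; u_2^T \Sigma_{22} u_2 = 1 \qquad (1)$$
where $\Sigma_{ij} = X_i^T X_j$ for $i, j \in \{1, 2\}$, with the columns of $X_1$ and $X_2$ centered.
Solving the optimization problem in Eq. (1) yields the most correlated canonical variable pair, i.e., $X_1 u_1^*$ and $X_2 u_2^*$. Further canonical variable pairs (with lower correlations) can be obtained subsequently by solving the extended optimization problem, as formulated in Eq. (2).
$$(U_1^*, U_2^*) = \arg\max_{U_1, U_2} \; \mathrm{Trace}(U_1^T \Sigma_{12} U_2) \quad \text{subject to} \quad U_1^T \Sigma_{11} U_1 = U_2^T \Sigma_{22} U_2 = I_k \qquad (2)$$
where $U_1 \in \mathbb{R}^{r \times k}$, $U_2 \in \mathbb{R}^{s \times k}$, and $k = \min(\mathrm{rank}(X_1), \mathrm{rank}(X_2))$.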
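The linear CCA problem described above can be illustrated numerically. The following is a minimal sketch, assuming NumPy and a standard solution route (whitening each view and taking the SVD of the whitened cross-covariance); the function name `linear_cca`, the small ridge term `reg`, and the synthetic data are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def linear_cca(X1, X2, k=1, reg=1e-8):
    """Return loading matrices (U1, U2) and the top-k canonical correlations."""
    X1 = X1 - X1.mean(axis=0)                      # center each view
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1.T @ X1 / (n - 1) + reg * np.eye(X1.shape[1])
    S22 = X2.T @ X2 / (n - 1) + reg * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / (n - 1)

    def inv_sqrt(S):
        # inverse matrix square root of a symmetric positive definite matrix
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    S11_is, S22_is = inv_sqrt(S11), inv_sqrt(S22)
    T = S11_is @ S12 @ S22_is                      # whitened cross-covariance
    A, s, Bt = np.linalg.svd(T)                    # singular values = canonical correlations
    U1 = S11_is @ A[:, :k]
    U2 = S22_is @ Bt.T[:, :k]
    return U1, U2, s[:k]

# Synthetic example: one shared latent signal hidden in two 3-feature views.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X1 = np.hstack([latent, rng.normal(size=(200, 2))])
X2 = np.hstack([rng.normal(size=(200, 2)),
                2.0 * latent + 0.01 * rng.normal(size=(200, 1))])
U1, U2, corrs = linear_cca(X1, X2, k=2)
```

In this toy setting the first canonical pair recovers the shared latent signal, while the second pair reflects only chance correlation between the noise features.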
CCA captures only linear associations and therefore requires that different data/views follow the same distribution. However, data from different modalities, e.g., fMRI imaging and genetic data, may follow different distributions and have different data structures. As a result, CCA fails to detect the association between heterogeneous data sets. To address this problem, Deep CCA (DCCA) was proposed by Andrew et al. to detect more complicated correlations. DCCA introduces a deep network representation before applying the CCA framework. Unlike linear CCA, which seeks the optimal canonical vectors $u_1, u_2$, DCCA seeks the optimal network representations $f_1(X_1), f_2(X_2)$, as shown in Eq. (3).
$$(f_1^*, f_2^*) = \arg\max_{f_1, f_2} \left\{ \max_{u_1, u_2} \frac{u_1^T f_1(X_1)^T f_2(X_2) u_2}{\|f_1(X_1) u_1\|_2 \, \|f_2(X_2) u_2\|_2} \right\} \qquad (3)$$
where $f_1, f_2$ are two deep networks applied to the two data matrices $X_1, X_2$.
The introduction of a deep network representation leads to a more flexible ability to detect both linear and nonlinear correlations. According to experiments on both speech data and handwritten digits data, DCCA’s representation was more effective in finding correlations than other methods, e.g., linear CCA and kernel CCA. Despite DCCA’s superior performance, the detected associations are not guaranteed to be relevant to any phenotype of interest, e.g., disease. Instead, the detected associations may be caused by irrelevant signals, e.g., background and noise. As a result, it is challenging to use the detected associations for further disease mechanism analysis.
II-B Deep collaborative learning (DCL): phenotype-related cross-data association
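To address the limitations of DCCA, we proposed a multimodal fusion model, deep collaborative learning (DCL), which can capture phenotype-related cross-data associations by enforcing additional fitting to the phenotype label, as formulated in Eq. (4).
$$(Z_1^*, Z_2^*) = \arg\max_{Z_1, Z_2} \left\{ \max_{U_1, U_2} \mathrm{Trace}(U_1^T \Sigma_{12} U_2) - \min_{\beta_1} \|Y - Z_1 \beta_1\|_2^2 - \min_{\beta_2} \|Y - Z_2 \beta_2\|_2^2 \right\} \qquad (4)$$
where $\Sigma_{ij} = Z_i^T Z_j$, subject to $U_1^T \Sigma_{11} U_1 = U_2^T \Sigma_{22} U_2 = I$; $Z_1 = f_1(X_1)$; $Z_2 = f_2(X_2)$; $f_1$, $f_2$ represent two deep networks; $Y$ represents phenotype or label data.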
As shown in Eq. (4), DCL seeks the optimal network representations that maximize cross-data correlations while fitting the phenotype. Compared to DCCA, DCL’s representation retains label-related information, which guarantees label/phenotype-related associations. In this way, further analysis of disease mechanisms can be performed and better classification performance can be achieved, according to the results reported for DCL. Moreover, DCL relaxes the requirement that the projections used for cross-data correlation and for label fitting have to be in the same direction, which allows phenotypical information and cross-data correlation to be represented more effectively.
With the ability to capture both cross-data associations and trait-related signals, DCL can exploit complementary information from multimodal data, as demonstrated in a brain imaging study. However, DCL uses deep networks to extract high-level features, which are difficult to interpret, resulting in obstacles to identifying significant features/biomarkers. As a result, DCL can only be used for classification/diagnosis rather than for exploring disease mechanisms, and consequently the medical impact of its applications is limited.
II-C Deep Network Interpretation: CAM based methods
Both DCCA and DCL use deep neural networks (DNN) for feature extraction. DNN employs a sequence of intermediate layers to extract high-level features. Each layer is composed of a number of complex operations, e.g., nonlinear activation, kernel convolution, batch normalization. DNN based models have found numerous successful applications in both computer vision and medical imaging fields, as a result of their superior ability to extract high-level features. However, the large number of layers and the complex/nonlinear operations in each layer bring about a difficulty in network explanation and feature identification. As a result, users may cast doubt on the reliability of the deep networks: whether deep networks make decisions based on the object of interest, or based on irrelevant/background information.
II-C1 Class Activation Mapping (CAM)
To make DNNs explainable, the Class Activation Mapping (CAM) method was proposed. CAM generates an activation map for each sample/image indicating pixel-wise contributions to the decision of interest, e.g., class label. Moreover, as its name suggests, CAM’s activation maps are class-specific, providing more discriminative information for further class-specific analysis. This dramatically helps build trust in deep networks: for correctly classified images/samples, CAM explains how the classification decision is made by highlighting the object of interest; for incorrectly classified images/samples, CAM illustrates why incorrect decisions are made by highlighting the misleading regions.
CAM’s activation maps are obtained by computing an optimal combination of intermediate feature maps. As feature maps only exist in convolutional layers, CAM can be applied only to convolutional neural networks (CNN). A weight coefficient is needed for each feature map to evaluate its importance to the decision of interest. However, for most CNN based models, this weight is not provided. To solve this problem, a re-training procedure is introduced, in which the feature maps are used directly by a newly introduced layer to re-conduct classification. The weights can then be calculated using the parameters in the introduced layer accordingly. The detailed CAM method is described as follows.
For a pre-trained CNN-based model, assume that a target feature map layer consists of $K$ channels/feature-maps $A^k \in \mathbb{R}^{h \times w}$ ($k = 1, \dots, K$), where $h, w$ represent the height and width of each feature map, respectively. CAM discards all the subsequent layers and then introduces a new layer (with softmax activation) to re-conduct classification using these feature maps $A^k$. A prediction score $y^c$ will then be calculated by the newly introduced layer for each class $c$, as formulated in Eq. (5).
$$y^c = \sum_{k=1}^{K} w_k^c \, \frac{1}{hw} \sum_{i=1}^{h} \sum_{j=1}^{w} A^k_{ij} \qquad (5)$$
where $w_k^c$ represents the weight coefficient of feature map $A^k$ for class $c$.
After that, class-specific activation maps can be generated by first combining the feature maps using the trained weights $w_k^c$ and then upsampling the result to project it onto the input image, as in Eq. (6).
$$M^c_{\mathrm{CAM}} = \mathrm{upsample}\Big( \sum_{k=1}^{K} w_k^c A^k \Big) \qquad (6)$$
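The two CAM steps above (class score from pooled feature maps, then a weighted and upsampled map) can be sketched as follows. This is a minimal illustration assuming NumPy, nearest-neighbour upsampling via `np.kron`, and toy feature maps and weights; the function names and the upsampling choice are assumptions made for the example only.

```python
import numpy as np

def cam_score(feature_maps, class_weights):
    """Class score: weighted sum of globally average-pooled feature maps."""
    pooled = feature_maps.mean(axis=(1, 2))            # one scalar per feature map
    return float(class_weights @ pooled)

def cam_map(feature_maps, class_weights, scale):
    """Activation map: weighted sum of feature maps, then upsampling."""
    combined = np.tensordot(class_weights, feature_maps, axes=1)   # shape (h, w)
    return np.kron(combined, np.ones((scale, scale)))  # nearest-neighbour upsampling (assumed)

A = np.arange(8, dtype=float).reshape(2, 2, 2)   # K = 2 toy feature maps of size 2 x 2
w_c = np.array([1.0, -0.5])                      # toy weights for one class
score = cam_score(A, w_c)
cam = cam_map(A, w_c, scale=2)                   # 4 x 4 map
```

Because the score is the spatial average of the combined map, each pixel of the map can be read as that location's contribution to the class score.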
The re-training procedure, however, is time-consuming, which limits CAM’s application. Moreover, classification accuracy may be sacrificed due to the modification of the model’s architecture, and consequently the accuracy of the activation maps will decrease.
II-C2 Gradient-weighted CAM (Grad-CAM)
To address the limitations of the CAM method, Gradient-weighted CAM (Grad-CAM) was proposed to compute activation maps without modifying the model’s architecture. Similar to CAM, Grad-CAM also needs a set of weight coefficients to combine the feature maps. These are obtained by first calculating the gradients of the decision of interest w.r.t. each feature map and then performing global average pooling on the gradients to get scalar weights. In this way, Grad-CAM avoids adding extra layers, and consequently both the model-retraining and the performance-decrease problems are avoided. Grad-CAM calculates the weights and the activation map as follows.
$$w_k^c = \frac{1}{hw} \sum_{i=1}^{h} \sum_{j=1}^{w} \frac{\partial y^c}{\partial A^k_{ij}} \qquad (7)$$
$$M^c_{\text{Grad-CAM}} = \mathrm{upsample}\Big( \mathrm{ReLU}\Big( \sum_{k} w_k^c A^k \Big) \Big) \qquad (8)$$
where $y^c$ represents the prediction score for class $c$, and $A^k_{ij}$ is the value at pixel $(i, j)$ of the $k$-th feature map, which has height $h$ and width $w$.
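The gradient-based weighting can be checked on a toy example in which the gradients are available in closed form. The sketch below assumes NumPy and a hypothetical scoring head (`toy_score`, a weighted mean of `tanh` activations) chosen only so that the gradients can be written analytically; upsampling is omitted, and the final ReLU follows the original Grad-CAM formulation.

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """feature_maps, grads: arrays of shape (K, h, w)."""
    weights = grads.mean(axis=(1, 2))                    # global average pooling of gradients
    cam = np.tensordot(weights, feature_maps, axes=1)    # weighted combination, shape (h, w)
    return weights, np.maximum(cam, 0.0)                 # ReLU on the combined map

def toy_score(A, v):
    """Hypothetical class score: sum_k v_k * mean(tanh(A^k))."""
    return float(v @ np.tanh(A).mean(axis=(1, 2)))

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4, 4))                           # K = 3 toy feature maps
v = np.array([1.0, -2.0, 0.5])
# Analytic gradient of toy_score with respect to every pixel of every feature map.
grads = v[:, None, None] * (1.0 - np.tanh(A) ** 2) / A[0].size
weights, cam = grad_cam(A, grads)
```

In a real network the gradients would come from automatic differentiation; no retraining or extra layer is needed, which is the practical advantage over CAM noted above.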
II-C3 Guided Grad-CAM: high resolution class-specific activation maps
Both CAM’s and Grad-CAM’s activation maps are coarse due to the upsampling procedure, as feature maps are normally smaller than the input images. This makes it difficult to identify small but important object features. Fine-grained visualization methods, e.g., guided backpropagation (BP) and deconvolution, can generate high resolution activation maps. These methods use backward projections which operate on layer-to-layer gradients. No upsampling is involved in these back-projection methods, and therefore high resolution activation maps can be obtained. Nevertheless, the resulting maps are not class-specific, which hinders their interpretation, especially in multi-class (more than 2) scenarios. To obtain activation maps that are both high resolution and class-specific, guided Grad-CAM was proposed, which incorporates guided BP into Grad-CAM. Guided Grad-CAM computes activation maps by performing a Hadamard product between the Grad-CAM map and the guided BP map, as formulated in Eq. (9).
$$M^c_{\text{Guided Grad-CAM}} = M^c_{\text{Grad-CAM}} \odot M_{\text{Guided BP}} \qquad (9)$$
where $M_{\text{Guided BP}}$ represents the map computed using the guided BP algorithm, and $\odot$ represents the Hadamard product operation. For example, given two arbitrary matrices $P, Q \in \mathbb{R}^{m \times n}$, their Hadamard product is defined as $(P \odot Q)_{ij} = P_{ij} Q_{ij}$.
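The fusion step is a plain element-wise product once both maps have the input resolution. The following minimal sketch assumes NumPy and uses toy arrays in place of real Grad-CAM and guided BP maps; the function name is illustrative.

```python
import numpy as np

def guided_grad_cam(grad_cam_map, guided_bp_map):
    """Hadamard (element-wise) product of a coarse class-specific map and a fine-grained map."""
    if grad_cam_map.shape != guided_bp_map.shape:
        raise ValueError("maps must have the same shape (upsample the Grad-CAM map first)")
    return grad_cam_map * guided_bp_map

# Toy coarse map: a 2 x 2 Grad-CAM map upsampled to 4 x 4 (blocks of constant value).
coarse = np.kron(np.array([[1.0, 0.0], [0.0, 2.0]]), np.ones((2, 2)))
# Toy fine-grained map standing in for a guided BP result.
fine = np.arange(16, dtype=float).reshape(4, 4)
fused = guided_grad_cam(coarse, fine)
```

Regions where the coarse map is zero are suppressed entirely, while the remaining regions keep the pixel-level detail of the fine-grained map.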
II-D Grad-CAM guided convolutional collaborative learning (gCAM-CCL)
For the purpose of interpretable multimodal fusion, we develop a new model, Grad-CAM guided convolutional collaborative learning (gCAM-CCL), which incorporates both guided BP and Grad-CAM methods into the DCL model. As shown in Fig. (1), gCAM-CCL first integrates two modality data using the collaborative networks, and then computes class-specific activation maps using Guided BP and Grad-CAM. In this way, gCAM-CCL can perform both automated classification/diagnosis and automated biomarker-identification as well as result interpretation simultaneously.
To be more specific, gCAM-CCL uses a 1D ConvNet to learn features from SNP data and a 2D ConvNet to learn features from brain imaging data. The outputs of the two ConvNets are flattened and then fused in the collaborative layer with the loss function in Eq. (11), which considers both cross-data associations and their fittings to the phenotype/label. After that, two intermediate layers are selected, from which the feature maps are combined using the gradient-based weights (Eqs. (7)-(8)), and class-specific Grad-CAM activation maps are generated accordingly. Meanwhile, fine-grained activation maps are computed by projecting the gradients back from the collaborative layer to the input layer using Guided BP. The obtained activation maps indicate pixel-wise contributions to the decision of interest, e.g., prediction, and significant biomarkers, e.g., brain FCs and genes, can be identified accordingly.
Compared to the DCL model, gCAM-CCL employs both a new architecture and a new loss function so as to incorporate Grad-CAM. As computing activation maps needs a layer of feature maps, gCAM-CCL replaces the multilayer perceptron (MLP) network with two ConvNets so that multi-channel feature maps can be obtained. This also benefits model training, as a ConvNet dramatically reduces the number of parameters by enforcing shared kernel weights.
Moreover, to compute class-specific activation maps, gradients w.r.t. each class are needed, as illustrated in Eq. (7). However, DCL uses external classifiers, e.g., support vector machine (SVM), and therefore no class information is provided in DCL’s gradients. To solve this problem, gCAM-CCL replaces external classifiers with an embedded softmax classifier so that class-specific gradients can be obtained.
Furthermore, ideal class-specific activation maps should highlight only the features relevant to the corresponding class, e.g., ’dog’ class. However, features related to other classes, e.g., fish-related features, may have strong but negative contributions to predicting ’dog’ class, resulting in noise features in the activation maps. To remove the noise features, we apply a ReLU function to the gradients, as shown in Eq. (10). The ReLU function ensures positive effects so that pixels with negative contributions can be filtered out.
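$$w_k^c = \frac{1}{hw} \sum_{i=1}^{h} \sum_{j=1}^{w} \mathrm{ReLU}\Big( \frac{\partial y^c}{\partial A^k_{ij}} \Big) \qquad (10)$$
where $y^c$ represents the prediction score for class $c$, and $A^k_{ij}$ is the value at pixel $(i, j)$ of the $k$-th feature map of size $h \times w$.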
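The effect of rectifying the gradients before pooling can be seen on a small numerical example. This is a minimal sketch assuming NumPy and hand-picked toy gradients; the function name is illustrative.

```python
import numpy as np

def positive_grad_weights(grads):
    """Feature-map weights from gradients of shape (K, h, w), keeping only positive gradients."""
    return np.maximum(grads, 0.0).mean(axis=(1, 2))

grads = np.array([[[0.4, -0.4], [0.2, -0.2]],      # mixed-sign gradients
                  [[-1.0, -1.0], [-1.0, -1.0]]])   # purely negative gradients
plain = grads.mean(axis=(1, 2))                    # pooling without rectification
rectified = positive_grad_weights(grads)           # pooling after ReLU
```

Without rectification the first feature map's positive and negative gradients cancel to a zero weight and the second map receives a negative weight; with rectification the first map keeps its positive evidence and the second map is dropped.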
In addition, as pointed out in Wang’s work, both DCCA and DCL include the sample size as a parameter in their loss functions, which complicates batch size tuning. In other words, their loss functions depend on the batch size because of a population-level correlation term. As a result, a large batch size is required, which makes batch size tuning and network training challenging. In this work, we propose a new loss function which resolves the batch-size dependence, as formulated in Eq. (11). As shown in Eq. (11), the population-level correlation term is replaced with a summation of sample-level losses. Moreover, the correlation term is replaced with a regression loss, i.e., cross-entropy loss, as it has been shown that optimizing the correlation term is equivalent to optimizing a regression loss.
where are the outputs of two ConvNets, as illustrated in Fig. (1).
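A loss of the kind described, i.e., a sum of sample-level cross-entropy terms, can be sketched as follows. The sketch assumes NumPy, binary labels, sigmoid outputs, one cross-entropy term per network-to-label fit, and one cross-entropy term between the two network outputs; these choices and all names are illustrative assumptions and may differ from the exact form of Eq. (11).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, q, eps=1e-12):
    """Per-sample binary cross-entropy with prediction p and (possibly soft) target q."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(q * np.log(p) + (1.0 - q) * np.log(1.0 - p))

def collaborative_loss(z1, z2, y):
    """Assumed form: label fit of each view plus agreement between the two views."""
    p1, p2 = sigmoid(z1), sigmoid(z2)
    per_sample = bce(p1, y) + bce(p2, y) + bce(p1, p2)
    return per_sample.sum()

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=8), rng.normal(size=8)      # toy outputs of the two networks
y = rng.integers(0, 2, size=8).astype(float)         # toy binary labels
full = collaborative_loss(z1, z2, y)
halves = (collaborative_loss(z1[:4], z2[:4], y[:4])
          + collaborative_loss(z1[4:], z2[4:], y[4:]))
```

Because the loss is a sum over samples, evaluating it on two half-batches gives the same total as evaluating it on the full batch, which is the batch-size independence that a population-level correlation term lacks.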
This batch-independent loss function is easier to extend to multi-class multi-view scenarios and the extended loss function is formulated as follows.
where represents the number of views, and represents the number of classes.
III Application to brain imaging genetic study
We apply the gCAM-CCL model to an imaging genetic study, in which brain FC data are integrated with single nucleotide polymorphism (SNP) data to classify low/high cognitive groups. Multiple brain regions of interest (ROIs) function as a group when performing a specific task, e.g., reading. Brain FC depicts the functional associations between different brain ROIs. On the other hand, genetic factors may also influence brain function, as brain dysfunction is heritable. Imaging-genetic integration enables exploring brain function from a more comprehensive view, which may further contribute to the study of normal and pathological brain mechanisms. The proposed gCAM-CCL model, which can perform automated diagnosis and feature interpretation, can be used to extract and analyze the complex interactions both within and between brain FC data and genetic data.
III-A Brain imaging data
Several brain fMRI modalities from the Philadelphia Neurodevelopmental Cohort (PNC)  were used in the experiments. PNC cohort is a large-scale collaborative study between the Brain Behavior Laboratory at the University of Pennsylvania and the Children’s Hospital of Philadelphia. It has a collection of multiple neuroimaging data, e.g., fMRI, and genomic data, e.g., SNPs, from adolescents aged from 8 to 21 years. Three types of fMRI data are available in PNC cohort: resting-state fMRI, emotion task fMRI, and nback task fMRI (nback-fMRI). As our work was focused on analyzing cognitive ability, only nback-fMRI, which was related to working memory and lexical processing, was used in the experiments. The duration of nback-fMRI scan was 11.6 minutes (231 TR), during which subjects were asked to conduct standard nback tasks.
SPM12 (http://www.fil.ion.ucl.ac.uk/spm/software/spm12/) was used to conduct motion correction, spatial normalization, and spatial smoothing. Movement artefacts (head motion effects) were removed via a regression procedure using a rigid body model (6 parameters: 3 translation and 3 rotation parameters), and the functional time series were band-pass filtered to the 0.01 Hz to 0.1 Hz frequency range, as significant signals are mainly concentrated at low frequencies. For quality control, we excluded high-motion subjects with translation > 2 mm or with signal-to-fluctuation-noise ratio (SFNR) < 275, following previous work. 264 regions of interest (ROIs) (containing 21,384 voxels) were extracted based on the Power coordinates with a sphere radius of 5 mm. For each subject, a $264 \times 264$ image was then obtained based on the ROI-ROI connections, which was used as the image input for the gCAM-CCL model.
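How an ROI-by-ROI connectivity image of this size can be formed from ROI time series is sketched below. The sketch assumes NumPy and Pearson correlation between ROI-averaged time series as the connectivity measure, which is a common choice but is an assumption here; the random array only stands in for real preprocessed data.

```python
import numpy as np

def functional_connectivity(roi_timeseries):
    """roi_timeseries: array (time points, ROIs) -> (ROIs, ROIs) Pearson correlation matrix."""
    return np.corrcoef(roi_timeseries, rowvar=False)

rng = np.random.default_rng(0)
ts = rng.normal(size=(231, 264))     # 231 TRs x 264 ROIs, synthetic placeholder data
fc = functional_connectivity(ts)     # 264 x 264 connectivity image
```

The resulting matrix is symmetric with a unit diagonal, so each off-diagonal pixel corresponds to one ROI-ROI connection.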
III-B SNP data
The genomic data were collected from 3 platforms, including the Illumina HumanHap 610 array, the Illumina HumanHap 500 array, and the Illumina Human Omni Express array. The three platforms generated 620k, 561k, and 731k SNPs, respectively. A common set of SNPs (313k) was extracted, and then PLINK was used to perform standard quality controls, including the Hardy-Weinberg equilibrium test for genotyping errors with p-value < 1e-5, extraction of common SNPs (MAF > 5%), and linkage disequilibrium (LD) pruning with a threshold of 0.9. After that, SNPs with missing call rates > 10% and samples with missing SNPs > 5% were removed. The remaining missing values were imputed by Minimac3 using the reference genome from the 1000 Genomes Project. In addition, only the SNPs within gene bodies were kept for further analysis, resulting in 98,804 SNPs in 14,131 genes.
As the study aimed to investigate the brain, we further narrowed down the scope to brain-expression-related SNPs. This was achieved using the expression quantitative trait loci (eQTL) data from the Genotype-Tissue Expression (GTEx) database (https://gtexportal.org/), a large-scale consortium studying tissue-specific gene regulation and expression. The GTEx data were collected from 53 different tissue sites from around 1000 subjects. Among the 53 tissue sites, 13 tissues were brain-related; they are listed in Table (I). A set of 108 SNP loci which showed a significant tissue regulation level (eQTL -8) in all 13 brain-relevant tissues was selected. In addition, SNPs in the top 100 brain-expressed genes were also selected based on the GTEx database. These procedures resulted in 750 SNP loci, which were used as the genetic input for the gCAM-CCL model.
|Brain amygdala||Brain nucleus accumbens|
|Brain caudate||Brain cerebellar hemisphere|
|Brain cerebellum||Brain frontal cortex|
|Brain cortex||Brain substantia nigra|
|Brain putamen||Brain anterior cingulate cortex|
|Brain spinal cord||Brain hypothalamus|
III-C Integrating brain imaging and genetic data: classification
The gCAM-CCL was then applied to integrate brain imaging data with SNPs data to classify subjects with low/high cognitive abilities. The wide range achievement test (WRAT) score, a measure of comprehensive cognitive ability, including reading, comprehension, math skills, etc., was used to evaluate the cognitive ability of each subject. The 854 subjects were divided into three classes: high cognitive/WRAT group (top 20% WRAT score), low cognitive/WRAT group (bottom 20% WRAT score), and middle group (the rest), following the procedures in work .
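The three-way grouping by score quantiles can be sketched as follows. This is a minimal illustration assuming NumPy, quantile cut-offs computed with `np.quantile`, and inclusive handling of ties at the cut-offs; the function name, the tie handling, and the synthetic scores are assumptions for the example.

```python
import numpy as np

def wrat_groups(scores, fraction=0.2):
    """Label each subject 'low', 'middle' or 'high' according to score quantiles."""
    scores = np.asarray(scores, dtype=float)
    lo_cut, hi_cut = np.quantile(scores, [fraction, 1.0 - fraction])
    labels = np.full(scores.shape, "middle", dtype=object)
    labels[scores <= lo_cut] = "low"      # bottom 20% of scores
    labels[scores >= hi_cut] = "high"     # top 20% of scores
    return labels

scores = np.arange(100)                   # synthetic placeholder scores 0..99
labels = wrat_groups(scores)
```

Only the subjects labelled 'low' and 'high' would then enter the binary classification task, with the middle group set aside.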
The gCAM-CCL model adopted a 1D convolutional neural network (CNN) to learn the interactions between alleles at different single-nucleotide polymorphism (SNP) loci. ConvNets have been widely used on sequencing and gene expression data [20, 27] to learn local genetic structures. According to these studies, 1D kernels with relatively large sizes are preferred. As a result, a kernel of size 31 and a kernel of size 15 were used. The detailed architecture of gCAM-CCL is listed in Table (IV). The data were partitioned into a training set (70%), a validation set (15%), and a test set (15%). The proposed gCAM-CCL model was trained on the training set; hyper-parameters were selected based on the loss on the validation set; and the classification performance was reported on the test set.
Mini-batch SGD was used to solve the optimization problem. Overfitting occurred due to the small sample size. To mitigate it, dropout was used, with the dropout probability of the middle layers set to 0.2, and early stopping was applied during network training. In addition, batch normalization was implemented after each layer to relieve the gradient vanishing/exploding problem resulting from ReLU activation. Computational experiments were conducted on a desktop with an Intel(R) Core(TM) i7-8700K CPU (@ 3.70GHz), 16 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti GPU (11 GB).
|Methods||Epochs||batch size||Activation||Learning rate||Decay rate||dropout||Momentum|
|gCAM-CCL||500||4||ReLU, Sigmoid||0.00001||Half per 200 epochs||0.2 (middle layers)||0.9|
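The learning-rate setting in the table above (initial rate 0.00001, halved every 200 epochs over 500 epochs) can be written as a simple step schedule. The sketch below assumes that the halving happens in discrete steps at epochs 200 and 400; the function name and this step-wise interpretation are assumptions.

```python
def learning_rate(epoch, base_lr=1e-5, halve_every=200):
    """Step decay: the rate is halved once per completed block of `halve_every` epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)

# Rates at a few representative epochs of a 500-epoch run.
schedule = [learning_rate(e) for e in (0, 199, 200, 400, 499)]
```

Under this reading the rate stays at 1e-5 for epochs 0 to 199, drops to 5e-6 for epochs 200 to 399, and to 2.5e-6 for the remaining epochs.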
For the purpose of comparison, several classical classifiers, e.g., SVM, random forest (RF), and decision tree, were implemented for classifying low/high WRAT groups. In addition, several deep network based classifiers were implemented, including CCL with external classifiers (SVM/RF) and a multilayer perceptron (MLP). The results of classifying high/low cognitive groups are shown in Table (III). From Table (III), gCAM-CCL outperforms both the conventional classifiers, e.g., SVM, and the regular deep network fusion method, in which the two data sets were concatenated as the input. This is consistent with the results of the DCL work, which also showed that the collaborative network can improve classification performance for multimodal data. Moreover, gCAM-CCL with its intrinsic softmax classifier achieved better classification performance than ’CCL+SVM’ and ’CCL+RF’. This may be due to the incorporation of the cross-entropy loss, i.e., Eq. (11), which helps the network learn the loss gradient more efficiently during back-propagation at each iteration.
|fMRI ConvNet||SNP ConvNet|
|Layer Name||Input Shape||Operations||Connects to||Layer Name||Input Shape||Operations||Connects to|
|f_conv1||(b, 1, 264, 264)||K, P, MP = 7, 3, 2||f_conv2||s_conv1||(b, 1, 750)||K, MP = 31, 6||s_conv2|
|f_conv2||(b, 16, 132, 132)||K, P, MP = 5, 2, 4||f_conv3||s_conv2||(b, 16, 120)||K, MP = 31, 6||s_conv3|
|f_conv3||(b, 32, 33, 33)||K, P, MP = 3, 1, 3||f_conv4||s_conv3||(b, 32, 15)||K = 15||s_flatten|
|f_conv4||(b, 32, 11, 11)||K = 11||f_flatten||s_flatten||(b, 64, 1)||-||collab_layer|
|f_flatten||(b, 64, 1)||-||collab_layer||-||-||-||-|
Notations: b (batch size), K (kernel size), P (padding), MP (maxpooling).
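The input shapes listed in the architecture table can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions, zero padding where no P is listed, and non-overlapping max pooling whose stride equals the pooling size; these conventions are assumptions that happen to reproduce the shapes in the table.

```python
def conv_pool_out(size, kernel, padding=0, pool=1):
    """Output length of a stride-1 convolution followed by non-overlapping max pooling."""
    conv = size + 2 * padding - kernel + 1
    return conv // pool

def trace(size, layers):
    """Return the spatial size before the first layer and after each (K, P, MP) layer."""
    sizes = [size]
    for kernel, padding, pool in layers:
        sizes.append(conv_pool_out(sizes[-1], kernel, padding, pool))
    return sizes

# (K, P, MP) per layer, read from the table; missing P is taken as 0 and missing MP as 1.
fmri_layers = [(7, 3, 2), (5, 2, 4), (3, 1, 3), (11, 0, 1)]
snp_layers = [(31, 0, 6), (31, 0, 6), (15, 0, 1)]

fmri_sizes = trace(264, fmri_layers)   # per spatial dimension of the 264 x 264 input
snp_sizes = trace(750, snp_layers)     # along the 750-SNP input vector
```

Both branches end with a spatial size of 1, so each ConvNet yields one value per output channel, which is what the flatten layers pass to the collaborative layer.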
|SNP rs #||Gene||SNP rs #||Gene|
|SNP rs #||Gene||SNP rs #||Gene|
|Pathway Name||Pathway Source||Set size||Contained||p-value||q-value|
|Eukaryotic Translation Elongation||Reactome||106||5||1.18E-06||1.65E-05|
|Peptide chain elongation||Reactome||101||4||3.12E-05||2.18E-04|
|Calcium Regulation in the Cardiac Cell||Wikipathways||149||4||1.48E-04||6.89E-04|
|Metabolism of proteins||Reactome||2008||11||3.96E-04||1.11E-03|
|Midbrain development||Gene Ontology||94||4||1.22E-05||1.53E-03|
|Site of polarized growth||Gene Ontology||167||4||1.25E-04||2.19E-03|
|Growth cone||Gene Ontology||165||4||1.20E-04||4.07E-03|
|Cellular catabolic process||Gene Ontology||2260||12||7.22E-05||4.55E-03|
|Pathways in cancer - Homo sapiens (human)||KEGG||526||5||2.39E-03||5.58E-03|
|Metabolism of amino acids and derivatives||Reactome||342||4||3.20E-03||6.40E-03|
|Pathway Name||Pathway Source||Set size||Contained||p-value||q-value|
|Regulation of neurotransmitter levels||Gene Ontology||335||9||6.77E-10||1.04E-07|
Transmission across Chemical Synapses
|Synaptic signaling||Gene Ontology||711||10||3.17E-08||2.43E-06|
|Insulin secretion - Homo sapiens (human)||KEGG||85||5||3.26E-07||5.06E-06|
|Organelle localization by membrane tethering||Gene Ontology||170||6||1.34E-07||6.84E-06|
|Regulation of synaptic plasticity||Gene Ontology||179||6||1.89E-07||7.21E-06|
|Membrane docking||Gene Ontology||179||6||1.82E-07||1.28E-05|
|Vesicle docking involved in exocytosis||Gene Ontology||45||4||5.11E-07||1.30E-05|
|Plasma membrane bounded cell projection part||Gene Ontology||1452||12||3.05E-07||1.34E-05|
|Cell projection part||Gene Ontology||1452||12||3.05E-07||1.37E-05|
|Neurotransmitter release cycle||Reactome||51||4||1.76E-06||1.37E-05|
|Synaptic Vesicle Pathway||Wikipathways||51||4||1.76E-06||1.37E-05|
|Secretion by cell||Gene Ontology||1493||12||4.13E-07||1.45E-05|
|Adrenergic signaling in cardiomyocytes - Homo sapiens (human)||KEGG||144||5||4.47E-06||2.31E-05|
|Gastric acid secretion - Homo sapiens (human)||KEGG||75||4||8.34E-06||3.70E-05|
|Neuron part||Gene Ontology||1713||12||1.83E-06||4.09E-05|
|Plasma membrane bounded cell projection||Gene Ontology||2098||13||2.22E-06||4.87E-05|
|Inferior Parietal Lobule (Inf_Pari)||Angular Gyrus (Angular)|
|Inferior Occipital Gyrus (Inf_Occi)||Fusiform Gyrus (fusiform)|
|Inferior Frontal Gyrus (Inf_Fron)||Cingulate Gyrus (Cingu)|
|Middle Occipital Gyrus (Mid_Occi)||Sub-Gyral (SubGyral)|
|Middle Frontal Gyrus (Mid_Fron)||Paracentral Lobule (ParaCetr)|
|Parahippocampal Gyrus (Parahippo)||Postcentral Gyrus (PostCetr)|
|Middle Temporal Gyrus (Mid_Temp)||Precuneus (Precun)|
|Superior Parietal Lobule (Sup_Pari)|
III-D Integrating brain imaging and genetic data: result interpretation
The class-specific activation maps for low WRAT group and high WRAT group were plotted in Figs. (2)-(3), respectively. From Fig. (2), the low WRAT group shows a relatively larger number of activated FCs, which contributed to making the ’low WRAT group’ decision. In comparison, the high WRAT group (Fig. (3)) shows a relatively smaller number of significant FCs, which contributed to the ’high WRAT group’ decision. This is further validated in the average histogram of the activation maps, i.e., Fig. (4). For the low WRAT group (Fig. (4)-left), a large portion of FCs were activated (high grey-scale value), while for high WRAT group (Fig. (4)-right), only a small portion of them were activated.
To identify significant brain FCs and SNPs, pixels whose gray value exceeded a fixed fraction of the maximum gray value were selected, following the instructions in previous work. After that, FCs and SNPs whose occurrence frequency across all subjects exceeded a threshold were further selected as significant FCs (see Figs. (5)-(6)) and SNPs (listed in Tables (V)-(VI)).
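The two-stage selection (per-subject thresholding relative to the maximum gray value, then an occurrence-frequency filter across subjects) can be sketched as follows. The sketch assumes NumPy; the threshold values `value_frac=0.5` and `freq_frac=0.6` are hypothetical placeholders chosen for the toy data, not the values used in the study.

```python
import numpy as np

def significant_features(maps, value_frac, freq_frac):
    """maps: (subjects, features) activation values; both thresholds are hypothetical."""
    maps = np.asarray(maps, dtype=float)
    per_subject_max = maps.max(axis=1, keepdims=True)
    active = maps > value_frac * per_subject_max     # per-subject thresholding
    frequency = active.mean(axis=0)                  # fraction of subjects where feature is active
    return np.flatnonzero(frequency > freq_frac), frequency

maps = np.array([[0.9, 0.1, 0.5],      # toy activation values: 3 subjects x 3 features
                 [0.8, 0.2, 0.1],
                 [1.0, 0.0, 0.6]])
selected, freq = significant_features(maps, value_frac=0.5, freq_frac=0.6)
```

In this toy case the first feature is active in every subject and the third in two of three, so both pass the frequency filter, while the second feature is never active.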
The identified brain FCs (ROI-ROI connections) and their corresponding ROIs are visualized in Fig. (5) and Fig. (6), respectively. For the high WRAT group (Fig. (5).b), three hub-ROIs (lingual gyrus, middle occipital gyrus, and inferior occipital gyrus) exhibited dominant ROI-ROI connections over the others. All three hub-ROIs are occipital-related. The lingual gyrus, also known as the medial occipitotemporal gyrus, plays an important role in visual processing [13, 12], object recognition, and word processing. The other two hubs, i.e., the middle and inferior occipital gyri, also play a role in object recognition. As shown in Fig. (5).b, the hub-ROIs also connect to several other ROIs, e.g., the cuneus and the parahippocampal gyrus. Among them, the cuneus receives visual signals and is involved in basic visual processing. The parahippocampal gyrus is related to encoding and recognition. These findings suggest that the three occipital gyri are first activated when processing visual and word signals during the WRAT test, and then several downstream processing ROIs, e.g., the parahippocampal gyrus, are activated for further complex encoding. As a result, strong FCs in these ROI-ROI connections may lead gCAM-CCL to select the high WRAT group.
For the low WRAT group (Fig. (5).a), no significant hub ROIs were identified. Instead, several previously reported task-negative regions, e.g., temporal-parietal regions and the cingulate gyrus, were identified. This indicates that the low WRAT group may be weaker in activating cognition-processing ROIs, so that task-negative regions are relatively more active, which leads gCAM-CCL to make the ’low WRAT group’ decision.
As seen in Fig. (5)a-b, a relatively larger number of FCs contributed to the low WRAT group than to the high WRAT group. Despite this, as shown in Table (III), the sensitivity is lower than the specificity, which means that the accuracy of classifying the low WRAT group is lower. This suggests that the identified FCs for the high WRAT group are relatively more discriminative, while those for the low WRAT group may contain more noisy FCs.
Gene enrichment analysis was conducted on the identified SNPs (Tables (V)-(VI)) using the ConsensusPathDB-human (CPDB) database (http://cpdb.molgen.mpg.de/), and the enriched pathways are listed in Tables (VII)-(VIII). Several neurotransmission-related pathways, e.g., regulation of neurotransmitter levels and synaptic signaling, are enriched in the genes identified for the high WRAT group. This suggests that the high WRAT group may have stronger neuronal signaling ability. Stronger neuronal signaling may benefit the daily training and development of ROI-ROI connections, which may further contribute to stronger cognitive ability. For the low WRAT group, several brain development and neuron growth related pathways, e.g., midbrain development and growth cone, were enriched, which suggests that the low WRAT group may have problems in brain/neuron development. This may further affect the ROI-ROI connections, leading to weaker cognitive ability.
In this work, we develop an interpretable deep multimodal fusion model, namely gCAM-CCL, which can perform automated classification/diagnosis and result interpretation. The gCAM-CCL model can generate activation maps which indicate pixel-wise contributions of the inputs, e.g., images and genetic vectors, by first calculating each feature map’s gradients and then pooling the gradients using global average pooling to obtain the weights that combine the feature maps. Moreover, the activation maps are class-specific, which further promotes class-difference analysis and biological mechanism analysis.
The proposed model was applied to an imaging-genetic study to classify low/high WRAT groups. Experimental results demonstrate gCAM-CCL’s superior performance in both classification and biological mechanism analysis. Based on the generated activation maps, a number of significant brain FCs and SNPs were identified. Among the significant FCs (ROI-ROI connections), three visual processing ROIs exhibited dominant ROI-ROI connections over the others. In addition, several signal encoding ROIs, e.g., the parahippocampal gyrus, showed connections to the three hub-ROIs. These findings suggest that during task-fMRI scans, object recognition related ROIs are first activated and then downstream ROIs get involved in further signal encoding. The results also suggest that the high cognitive group may have higher neurotransmitter signaling levels, while the low cognitive group may have problems in brain/neuron development, resulting from genetic-level differences. Given its performance in both classification and result interpretation, gCAM-CCL can find wide applications in multimodal integration and imaging-genetic studies.
The authors would like to thank the NIH (R01 GM109068, R01 MH104680, R01 MH107354, P20 GM103472, R01 EB020407, R01 EB006841, R01 MH121101, P20 R01 GM130447) and the NSF (#1539067) for partial support.
-  (2013) Deep canonical correlation analysis. In International Conference on Machine Learning, pp. 1247–1255. Cited by: §I, §II-A, §II-D.
-  (2016) Time-varying brain connectivity in fmri data: whole-brain data-driven approaches for capturing and characterizing dynamic states. IEEE Signal Processing Magazine 33 (3), pp. 52–66. Cited by: §III.
-  (2016) Next-generation genotype imputation service and methods. Nature genetics 48 (10), pp. 1284. Cited by: §III-B.
-  (1995) Characterizing dynamic brain responses with fmri: a multivariate approach. Neuroimage 2 (2), pp. 166–172. Cited by: §III-A.
-  (2001) The lateral occipital complex and its role in object recognition. Vision research 41 (10-11), pp. 1409–1422. Cited by: §III-D.
-  (2011) Default-mode and task-positive network activity in major depressive disorder: implications for adaptive and maladaptive rumination. Biological psychiatry 70 (4), pp. 327–333. Cited by: §III-D.
-  (1936) Relations between two sets of variates. Biometrika 28, pp. 321–377. Cited by: §I, §II-A.
-  (2019) Deep collaborative learning with application to multimodal brain development study. IEEE Transactions on Biomedical Engineering. Cited by: §I, §II-B, §II-B, §II-B, §II-D, §II-D, §III-C, §III-C.
-  (2017) Adaptive sparse multiple canonical correlation analysis with application to imaging (epi) genomics study of schizophrenia. IEEE Transactions on Biomedical Engineering 65 (2), pp. 390–399. Cited by: §II-A.
-  (2014) Correspondence between fmri and snp data by group sparse canonical correlation analysis. Medical image analysis 18 (6), pp. 891–902. Cited by: §II-A.
-  (2013) The genotype-tissue expression (gtex) project. Nature genetics 45 (6), pp. 580. Cited by: §III-B.
-  (1998) ERP and fmri measures of visual spatial selective attention. Human brain mapping 6 (5-6), pp. 383–389. Cited by: §III-D.
-  (2000) Differential effects of word length and visual contrast in the fusiform and lingual gyri during. Proceedings of the Royal Society of London. Series B: Biological Sciences 267 (1455), pp. 1909–1913. Cited by: §III-D.
-  (2014) Seeing scenes: topographic visual hallucinations evoked by direct electrical stimulation of the parahippocampal place area. Journal of Neuroscience 34 (16), pp. 5399–5405. Cited by: §III-D.
-  (2011) Functional network organization of the human brain. Neuron 72 (4), pp. 665–678. Cited by: Fig. 5, §III-A.
-  (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics 81 (3), pp. 559–575. Cited by: §III-B.
-  (2014) Dynamic connectivity states estimated from resting fmri identify differences among schizophrenia, bipolar disorder, and healthy control subjects. Frontiers in human neuroscience 8, pp. 897. Cited by: §III-A.
-  (2014) Neuroimaging of the philadelphia neurodevelopmental cohort. Neuroimage 86, pp. 544–553. Cited by: §III-A, §III-B.
-  (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §II-C2, §II-C3, §III-D.
-  (2016) DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics 32 (17), pp. i639–i648. Cited by: §III-C.
-  (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: §II-C3.
-  (2020) Neuroimaging-based individualized prediction of cognition and behavior for mental disorders and health: methods and promises. Biological Psychiatry. Cited by: §I.
-  (2015) On deep multi-view representation learning. In International Conference on Machine Learning, pp. 1083–1092. Cited by: §I, §II-D.
-  (2006) Wide range achievement test. Psychological Assessment Resources. Cited by: §III-C.
-  (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §II-C3.
-  (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: §II-C1.
-  (2018) Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nature genetics 50 (8), pp. 1171. Cited by: §III-C.