
Deep Dynamic Effective Connectivity Estimation from Multivariate Time Series

Recently, methods that represent data as a graph, such as graph neural networks (GNNs), have been successfully used to learn data representations and structures to solve classification and link prediction problems. The applications of such methods are vast and diverse, but most of the current work relies on the assumption of a static graph. This assumption does not hold for many highly dynamic systems, where the underlying connectivity structure is non-stationary and mostly unobserved. Using a static model in these situations may result in sub-optimal performance. In contrast, modeling changes in graph structure over time can provide information about the system whose applications go beyond classification. Most work of this type does not learn effective connectivity and focuses on cross-correlation between nodes to generate undirected graphs. An undirected graph cannot capture the direction of an interaction, which is vital in many fields, including neuroscience. To bridge this gap, we developed dynamic effective connectivity estimation via neural network training (DECENNT), a novel model to learn an interpretable directed and dynamic graph induced by the downstream classification/prediction task. DECENNT outperforms state-of-the-art (SOTA) methods on five different tasks and infers interpretable task-specific dynamic graphs. The dynamic graphs inferred from functional neuroimaging data align well with the existing literature and provide additional information. Additionally, the temporal attention module of DECENNT identifies the time-intervals in multivariate time series data that are crucial for the downstream predictive task.





I Introduction

Many classification/prediction problems can be solved by learning the underlying structure/pattern of the data and how its components are correlated with each other. Datasets from different fields are often represented as graphs, and graph networks [4700287, bruna2014spectral] were proposed to work on such datasets. Recently, methods such as graph neural networks (GNNs) have been extensively used to learn representations on graph-structured data [Bronstein_2017, hamilton2018representation, gilmer2017neural, PARISOT2018117]. GNNs update the representations of nodes with the help of different aggregation functions. The aggregation functions work through message passing, where a node receives messages from its neighbors, which are defined by edges. The representations can then be used for node classification, graph classification, or predicting edges between nodes, either by using an existing true graph structure or by learning the graph [monti2016geometric, velickovic2018graph, kipf2017semisupervised, gilmer2017neural, 10.1093/bioinformatics/bty294, zhang2018link, wang2019dynamic, kipf2018neural, Zitnik_2018]. Most of the existing work on these tasks (classification, link prediction) has been done on static graphs; e.g., [PARISOT2018117] creates a static graph based on representation and phenotype information of subjects, [Mahmood_2021] learns a static graph between brain regions, and [kipf2018neural] learns a static graph in an interacting system. In reality, many systems (social networks, brain connectivity, traffic, speech) change dynamically and cannot be completely represented by a static graph. We propose that learning a dynamic graph for such systems can increase our understanding of them and, depending on the task, may also yield higher classification performance. For example, learning the dynamic connectivity of the brain’s networks can help researchers to understand brain dynamics and the causes of brain disorders by learning how connectivity changes while performing tasks, or with age. Dynamic graphs for social network data can help to understand users’ patterns, peak traffic times, retention time, and many other vital aspects of the network.

Even though graph networks have excelled in many areas, we see three shortcomings in the current work on graph-structured data. 1) Most of the work, whether for classification (node or graph) or graph learning (link prediction), assumes that the graph structure of the data is available or easily created and thus works directly on graph-structured data [li2017gated, gilmer2017neural]. This assumption does not hold for many datasets across many fields. 2) The semi-supervised or unsupervised methods developed to learn the graph structure (link prediction) create embeddings/representations based on the learned graph structure, and use these embeddings either to predict future embeddings or to perform classification, where the loss is the error in prediction or classification [kipf2018neural, kazi2020differentiable, Mahmood_2021]. The problem with this approach is that the embeddings, not the graph, are used for prediction/classification. The learned structure is not tested directly, especially where the true graph structure is never available (e.g., brain functional connectivity), and is therefore unreliable. The unreliability increases in systems with relatively easy tasks and noiseless data (real or simulated). [kipf2018neural] shows that using a full graph leads to almost the same or better performance (in terms of loss) as the learned static graph, which calls the correctness/importance of the learned graph structure into question. 3) The graph structure is assumed to be static, which is highly unlikely in many datasets, such as a) social networks, where a node can join/leave at any time or create/drop an edge, and b) brain functional connectivity, where the connectivity between brain regions is always dynamic. [Xu2020Inductive] shows that using a static graph learning method for a dynamic system/graph can lead to lower classification performance, and [kipf2018neural] shows improved results by simply re-evaluating the learned static graph dynamically during testing. The improved performance on the relevant task is understandable, as the dynamic connectivity provides essential information about the system.

Classification gains from using a dynamic graph depend on the downstream task; e.g., a social network’s dynamic graph may not help much in predicting a user’s gender, but it can still provide additional information that is important in itself for understanding the dynamic system and how it works. For example, studies such as [article21, article22, CALHOUN2014262] show that dynamic functional connectivity (FC) exhibits recurring patterns that cannot be captured by static FC. Thus, we want to learn a dynamic and directed graph structure representing the time series data and use it for interpretation, understanding, and prediction tasks.

We present a novel method called dynamic effective connectivity estimation via neural network training (DECENNT). DECENNT is a semi-supervised method (unknown graph structure, known graph labels) that we use to 1) learn a dynamic directed graph structure using embeddings, for a better understanding of the underlying system, and 2) perform graph classification based on the learned graph structure/connectivity alone. We propose that high classification performance based on the learned graph structure alone, and not on the representations, greatly reduces the uncertainty regarding the usefulness of the learned structure and produces more useful and reliable graphs, which is the key objective of our study. These graphs can then be used for understanding the underlying system and interpreting the cause(s) of the classification/prediction.

Recently, graph structures have been used to represent the brain. Brain connectivity is highly dynamic and changes with the function being performed, so understanding these dynamics would help to understand the functionality and connectivity of the brain. Thus, we apply our model to learn the brain’s dynamic effective connectivity (EC) using functional magnetic resonance imaging (fMRI) data. fMRI is an imaging method that captures blood-oxygenation-level-dependent (BOLD) signals in the brain, an indirect measure of neural activity in brain regions.

In this study, we learn the dynamic EC of the brain and use it to 1) predict the presence of a disease or the gender of the subject and 2) learn the dynamics of brain networks’ connectivity related to the downstream task. Many recent studies learn or compute the FC of the brain [arslan2018graph, 10.3389/fnins.2020.00630, ktena2017distance, KTENA2018431, ma2019similarity] and use it to predict gender or disease/disorder [arslan2018graph, kazi2021iagcn, 10.3389/fnins.2020.00630, KTENA2018431, ma2019similarity] using GNNs or similar methods, but they have the limitations discussed above. Methods that incorporate dynamic FC (learned or computed) are also mostly window-based [DAMARAJU2014298, ARMSTRONG2016175, gadgil2021spatiotemporal, 10.1007/978-3-030-59861-7_1]: they partition the data into multiple windows, each consisting of data from multiple time-points. As the structure/connectivity can change at any time, we instead create an instantaneous dynamic structure, with one graph per time-point. Existing studies use a correlation matrix to represent FC. The symmetric correlation matrix does not capture effective connectivity and does not provide the direction of information flow, which is represented by directed edges.

We also incorporate temporal attention in our model to provide more interpretable results and to further understand how the brain functions. We apply our model both to resting-state fMRI (rs-fMRI), where the essential time-points (putative biomarkers) are not known, and to speech data. In the latter, we predict the presence of a specific target word in the speech and use the attention module to mark the time-points at which it occurs.

Contributions: Our study has the following contributions.

  1. Without access to the brain’s true E/FC structure, and using only fMRI data, we learn a directed connectivity structure of the brain that provides more detail than the existing literature. This removes the need for a separate method to compute connectivity before applying classification.

  2. Based on the learned dynamic EC, our model outperforms other SOTA methods in classification tasks (disorders, gender, and keyword detection) across multiple datasets and pre-processing pipelines.

  3. Our temporal attention module finds the essential time-points for the downstream task with very high accuracy and is stable/consistent across multiple trials. It improves classification performance and finds important bio-markers related to the downstream task. This in turn can lead to reducing the temporal dimensions and discarding time-points that are unrelated to the downstream task.

II DECENNT

We use the proposed DECENNT model to learn a dynamic directed graph structure/pattern for any multivariate time series data. The dynamic directed graph is essential to learning and understanding the system and can be used in different ways to perform classification on the downstream task. We learn a distinct dynamic graph $G$ for the complete time series, where $G$ is a set of $T$ graphs with $T$ being the total number of time-points of the time series. We define $G$ as $G = \{G_t : t = 1, \dots, T\}$ with $G_t = (V_t, E_t)$, where $V_t$ and $E_t$ represent the vertices and edges present at time-point $t$.

After computing the set $G$, we use our temporal attention module to focus on the important time-points and generate a single final graph representing the complete time series; this final graph is used for downstream classification. To create the embedding for each component at each time-point, we use a bidirectional long short-term memory (biLSTM) network [650093], which takes the time series of a component and produces its embeddings. To create the connectivity matrix (adjacency matrix/graph) between the components (nodes) at each time-point, we use a self-attention module [10.5555/3295222.3295349]. We explain both parts separately in the following sections. Fig. 1 shows the complete architecture of the model.

II-A biLSTM

biLSTMs have been used very successfully for time series data. An LSTM takes one input (e.g., a word) from a sequence (e.g., a sentence) at a time and provides an embedding for the data at each point in the sequence. The effectiveness of LSTMs comes from their memory and forget gates, which help the model learn relationships between inputs at different time-points. In time series data, e.g., a sentence, each input is not independent of previous or future values, so it is crucial to capture these relationships. As the effect of an input at one time-point on other inputs is unknown and can vary across time series and across components of the same time series, these relationships must be learned based on the downstream task. The working of the LSTM can be explained by the following equations, where $\sigma$ represents sigmoid activation, $b$ are the biases, and $\odot$ is the Hadamard product [Million07thehadamard].
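For reference, a standard LSTM cell consistent with the symbols named above can be written as follows. This is the textbook formulation, given here as an assumption: the gate names and the weight matrices $W$ and $U$ are illustrative and are not taken from the original equations.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

In this form, $x_t$ is the input at time-point $t$, $c_t$ is the cell (memory) state, and $h_t$ is the hidden state that serves as the embedding.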


Here $h_t$ represents the embedding for the input at time-point $t$. We use a biLSTM to create embeddings for each component, so each embedding combines the representations from the forward and backward passes. We apply the LSTM to each component individually, sharing the LSTM weights among the components: the time series of one component is given as input to the LSTM along with the hidden vector, and the LSTM returns the embeddings for that component. This allows us to later compute the connectivity matrix (links/edges) between the components/nodes.

III Self-Attention

Self-attention creates a new embedding for each input depending on the other embeddings in the sequence. It tries to find the relationship of each input with all other inputs, denoted by weights, and updates the embeddings accordingly. Self-attention can be considered a special case of a typical GNN with one layer/hop, where a node receives input from the neighbors that are one hop away. Because the self-attention module creates weights by learning the relationships between different embeddings, we use it to create a connectivity matrix between components at each time-point: the embeddings of all $N$ components ($N$ = total components) at a time-point are given as input to the self-attention module, which produces new embeddings and an $N \times N$ weight matrix. The self-attention module creates three embeddings, namely key ($K$), value ($V$), and query ($Q$), and uses them to create a new embedding for each input. The set of equations in (2) sums up the whole process. For simplicity, we omit the time index from these equations, which also use the transpose and concatenation operators.


Here the weight matrix is the connectivity matrix between components/nodes in the graph. We use this matrix, and not the embeddings, for downstream classification. As the true graph is unavailable in many applications and cannot be compared against directly, we propose that a connectivity matrix that itself leads to state-of-the-art classification performance is more reliable than one learned while classifying from the embeddings.
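As an illustration of how an attention weight matrix can double as a directed connectivity matrix, the sketch below applies standard scaled dot-product self-attention to the embeddings of all components at one time-point. It is a minimal sketch and not the authors' implementation: the scaling by the square root of the key dimension, the single attention head, the function and parameter names, and all sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_connectivity(h, w_q, w_k, w_v):
    """Return (new_embeddings, weights) for one time-point.

    h:   (n_components, d_in) embeddings of all components at one time-point
    w_*: (d_in, d_attn) projection matrices (hypothetical names)
    """
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])   # (n, n) pairwise scores
    weights = softmax(scores, axis=1)        # row i: how much i attends to each j
    return weights @ v, weights

rng = np.random.default_rng(0)
n_components, d_in, d_attn = 5, 8, 4         # illustrative sizes only
h = rng.normal(size=(n_components, d_in))
w_q, w_k, w_v = (rng.normal(size=(d_in, d_attn)) for _ in range(3))
new_h, weights = attention_connectivity(h, w_q, w_k, w_v)
```

Because the query and key projections differ, `weights[i, j]` generally differs from `weights[j, i]`, which is what makes the resulting matrix directed rather than symmetric like a correlation matrix.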

Fig. 1: DECENNT architecture using biLSTM, self-attention and temporal attention.

IV Temporal Attention

Since we get a set of connectivity matrices, one for each time-point, an easy and standard way to combine them is to average the matrices. However, not all time-points are equally crucial for the downstream task. Therefore, we introduce a new temporal attention mechanism to focus on important time-points. The temporal attention module is essential both for the downstream classification and for finding important time-points/biomarkers in the data, which makes it crucial to our model. We name our attention model global temporal attention (GTA); it attends to crucial time-points and is stable and consistent across randomly seeded trials.

IV-A GTA

To give the attention module a global view of the graph, we present GTA. The global view allows the model to learn how each connectivity matrix contributes to the global graph or structure of the data in the downstream task. We sum all the connectivity matrices and call the result the global view. We then compare each local connectivity matrix with the global view and use the resulting similarities to create the temporal attention vector.


Equation (3) gives the equations for GTA, which use the Hadamard product [Million07thehadamard] between matrices and concatenation, and which yield the final weight matrix.
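The GTA steps described above (sum the matrices into a global view, compare each local matrix with it, derive attention weights) can be sketched as follows. This is a simplified illustration under stated assumptions, not the method's actual equations: the learned scoring layers are replaced by a plain sum of the Hadamard product followed by a softmax, and the final graph is assumed to be the attention-weighted sum of the per-time-point matrices.

```python
import numpy as np

def global_temporal_attention(mats):
    """Collapse per-time-point connectivity matrices into one graph.

    mats: (T, n, n) array, one connectivity matrix per time-point.
    Returns (final_matrix, attention) where attention has shape (T,).
    """
    global_view = mats.sum(axis=0)                      # sum over time-points
    # Similarity of each local matrix to the global view via the Hadamard product.
    similarity = (mats * global_view).sum(axis=(1, 2))  # shape (T,)
    # Stand-in for the learned scoring layers: a plain softmax over time.
    shifted = similarity - similarity.max()
    attention = np.exp(shifted) / np.exp(shifted).sum()
    final = np.tensordot(attention, mats, axes=1)       # weighted sum, shape (n, n)
    return final, attention

rng = np.random.default_rng(0)
mats = rng.random(size=(6, 4, 4))   # 6 time-points, 4 components (illustrative)
final, attention = global_temporal_attention(mats)
```

With uniform attention this reduces to the plain average of the matrices, which is the baseline the section argues against.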

| Name | Category | Preprocessing | Parcellation | Subjects | Class 0 | Class 1 | CV Folds | TP |
|---|---|---|---|---|---|---|---|---|
| FBIRN | Schizophrenia | SPM12 | ICA | 311 | 151 | 160 | 4, 6, 18 | 157 |
| OASIS | Dementia | SPM12 | ICA | 912 | 651 | 261 | 4, 10 | 157 |
| ABIDE | Autism | SPM12 | ICA | 569 (TR=2) | 255 | 314 | 5, 10 | 140 |
| ABIDE | Autism | SPM12 | ICA | 869 | 398 | 471 | 5, 10 | 140 |
| HCP | Gender | SPM12 | ICA | 833 | 390 | 443 | 5, 15 | 980 |
| FBIRN | Schizophrenia | SPM12 | Schaefer 200 | 311 | 151 | 160 | 18 | 157 |
| HCP | Gender | Glasser | Schaefer 200 | 942 | 411 | 531 | 10 | 1200 |
| ABIDE | Autism | C-PAC | Schaefer 200 | 871 | 403 | 468 | 10 | 83-316 |

TABLE I: Details of the neuroimaging datasets used (TP: time-points). We tried different CV folds in our experiments, but that did not have a significant effect on the results. We report the results with the CV folds that match those of the studies we compare against.

V Experiments

This section reports the training process of our model, details about hyper-parameters, and datasets used.

V-A Training

We ran our experiments on an RTX 2080 GPU using PyTorch. The LSTM and the self-attention module, including its key, query, and value projections, all used the same hidden dimension. Both the LSTM and the self-attention module had only one layer; we tried multiple layers, but this helped neither classification performance nor interpretation. An MLP computed the temporal attention vector, with batch normalization after its first layer and ReLU activation between the MLP layers. A final two-layer MLP took the final graph as input and produced the logits for the binary classification problem. We used cross-entropy loss with the Adam optimizer. With $\theta$ representing all the parameters of the architecture, $\hat{y}$ the predictions, and $y$ the true labels, the loss combines the cross-entropy between $\hat{y}$ and $y$ with a regularization term on $\theta$ scaled by a regularization weight.

We reduced the learning rate by a constant factor when the validation loss reached a plateau. Early stopping with a fixed patience was used to stop training based on validation loss. For each dataset, to have a fair result, we perform n-fold cross-validation, with n depending on the size of the data, and 10 randomly seeded trials for each fold. We report the mean area under the receiver operating characteristic curve (AUC-ROC) and several other metrics to show classification performance. Some hyper-parameters were changed for the region-based experiments, with a separate setting for HCP.
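The early-stopping rule on validation loss mentioned above can be sketched in a few lines. This is a generic sketch, not the training code used for the experiments: the class name, the patience value, and the loss values are hypothetical and chosen only for illustration.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation losses and patience, for illustration only.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In PyTorch, the learning-rate reduction on a validation plateau is typically handled by `torch.optim.lr_scheduler.ReduceLROnPlateau`, which follows the same pattern of tracking the best loss and counting epochs without improvement.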

V-B Datasets

To test our model for a) classification, b) the learned connectivity matrix, and c) the learned temporal attention, we use five different datasets across two fields: neuroimaging and natural language processing (NLP). Refer to Tab. I for details of the neuroimaging datasets. The validation and test sets were kept the same size.

V-B1 NeuroImaging

The neuroimaging datasets can be further divided into two sub-tasks; brain disorder and gender prediction.

Disorder Prediction

The three datasets used for disorder prediction are FBIRN (Function Biomedical Informatics Research Network [keator2016function]; we use fBIRN phase III), release 1.0 of ABIDE (Autism Brain Imaging Data Exchange [di2014autism]), and release 3.0 of OASIS (Open Access Series of Imaging Studies [rubin1998prospective]), which we use to predict schizophrenia, autism, and dementia, respectively.

Gender Prediction

Healthy controls from the HCP [van2013wu] are used for gender prediction.


We use different brain parcellation techniques which can be divided into two sub-categories; ICA and region based. The preprocessing method applied depends on the parcellation technique used and the methods used in SOTA studies for the particular dataset.

ICA parcellation:

All experiments in this category used a fully automated independent component analysis (ICA) as the brain parcellation technique. We first preprocess the fMRI data using statistical parametric mapping (SPM12) within MATLAB 2020. Subjects were included in the analysis if their head motion stayed within the rotation and translation limits and their functional data provided near-full-brain successful normalization [fu2019altered]. For each subject, ICA components are estimated using the Neuromark template and used in experiments following the same procedure described in [fu2019altered]. We use the ICA time courses as input to the model. For ABIDE1, we conduct two ICA-based experiments: one using all subjects and one using only subjects with TR = 2 (Tab. I).

Region parcellation: SOTA methods use different preprocessing pipelines for the HCP and ABIDE datasets. For comparison with these methods, we preprocess these datasets following the existing studies. HCP [van2013wu] was first minimally pre-processed following [article], and then FIX-ICA based denoising was applied to reduce noise in the data [article10, Griffanti2014ICAbasedAR]. After denoising, subjects were discarded based on head motion following [10.3389/fnins.2020.00630], which results in 942 subjects (Tab. I). ABIDE1 [di2014autism] was pre-processed using C-PAC [article9], and 871 subjects (Tab. I) were selected following [ABRAHAM2017736, PARISOT2018117, CAO2021103015]. To divide the data into regions, we use the Schaefer [10.1093/cercor/bhx179] and Harvard Oxford (HO) [DESIKAN2006968] atlases, depending on the experiment. Refer to Tab. I for details about the datasets.

V-B2 NLP

To show the broad applicability of our method, we apply it to keyword detection in audio files. We choose this problem because it has many practical applications (e.g., virtual assistants in smartphones and robots). We use the Speech Commands Dataset [speechcommand] for predicting the occurrence of a keyword in speech. The audio files are combined with the background noise of a coffee shop [backgroundnoise] to make prediction harder. We use this dataset to test the temporal attention weights produced by our model, because the important time-points (the location of the word "cat" in the noise) are known. We treat this experiment as an analogue of classifying a brain disorder: the keyword "cat" can be thought of as the presence of a disease, and the background noise as the noise and other data present in the fMRI time courses.


For prediction, we collect audio samples of the keyword "cat" from [speechcommand], which has 1515 audio files for the keyword. To create "cat" class examples, we superimpose each one-second keyword audio onto a two-second background noise clip at a random location, resulting in a two-second audio consisting of background noise and the keyword. To make the problem difficult, we do the following: a) the keyword audio is mixed with the noise when creating the final audio, so the time-points where the word is added contain noise as well; b) before mixing, we match the amplitudes of the two audios by normalizing both (background and keyword) to the same scale; and c) we normalize the one-second sum of both files so that it does not have higher values than the rest of the audio file, which contains only noise. Because of a)-c), the model cannot perform classification based on amplitude. Furthermore, the model receives the mel-spectrogram of the audio files as input rather than the raw audio. Fig. 2 shows the mel-spectrograms of the three audio files. To create "no-cat" class examples, we use another 1515 two-second audio files containing only the background noise. Thus, we use 3030 audio files for the downstream task ("cat"/"no-cat" classification). For all 3030 audio files, we compute the mel-spectrogram, which converts each file into a matrix of size components × time-points. To test our model on multiple keywords, we create another dataset with the keyword "nine" following the same method.
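Steps a) to c) of the mixing procedure can be sketched as below. This is a sketch under assumptions rather than the original preprocessing code: peak normalization is assumed as the "same scale", the 16 kHz sampling rate is assumed, the function names are hypothetical, and synthetic signals stand in for the real keyword and coffee-shop recordings.

```python
import numpy as np

def peak_normalize(x):
    """Scale a signal so its largest absolute value is 1."""
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def superimpose(keyword, noise, rng):
    """Mix a keyword clip into a longer noise clip at a random offset.

    Returns (mixed, start) where start is the first sample of the keyword.
    """
    keyword, noise = peak_normalize(keyword), peak_normalize(noise)  # step b)
    start = int(rng.integers(0, len(noise) - len(keyword) + 1))
    stop = start + len(keyword)
    mixed = noise.copy()
    # Step a): the keyword is added on top of the noise, so the segment keeps its noise.
    # Step c): the summed segment is rescaled to the same peak as the rest of the clip.
    mixed[start:stop] = peak_normalize(mixed[start:stop] + keyword)
    return mixed, start

rng = np.random.default_rng(0)
sample_rate = 16000                                  # assumed sampling rate
noise = rng.normal(size=2 * sample_rate)             # stand-in for 2 s of background noise
keyword = np.sin(np.linspace(0, 200, sample_rate))   # stand-in for a 1 s keyword clip
mixed, start = superimpose(keyword, noise, rng)
```

The returned `start` gives the ground-truth keyword location, which is what later allows the temporal attention weights to be scored against known time-points.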

Fig. 2: Mel-spectrograms of 'background noise', 'cat audio', and 'superimposed audio'. The model receives the background and superimposed files as input.

VI Results

We show three different results, one for each of the paper’s contributions. We compare our results with SOTA DL methods [mahmood2019learnt, Mahmood_2020, Mahmood_2021, gadgil2021spatiotemporal, 10.3389/fnins.2020.00630, article7, 10.1093/cercor/bhz129, arslan2018graph, CAO2021103015, PARISOT2018117, KTENA2018431], depending on the task, and with ML methods such as support vector machine (SVM) and logistic regression (LR). To be fair to the other papers, we report the results directly as stated in those papers. Not all methods were applicable to every dataset/task, and for some methods the code or results were not available. All figures are generated using multiple test subjects across multiple randomly seeded trials. Our experiments show that our model beats SOTA methods on classification/prediction tasks but, more importantly, that our learned EC structures a) are similar to those in existing studies, b) provide knowledge not present in existing FC-based methods, c) capture the direction of connectivity, and d) identify important temporal biomarkers relevant to the downstream task.

Fig. 3: AUC comparison of the DECENNT model with four different methods (MILC [Mahmood_2020], STDIM [mahmood2019learnt], LR, SVM) over five different datasets on ICA time courses (refer to Section V-B1). Our method significantly outperforms SOTA methods. We performed the autism experiments with 869 subjects (all TRs) as well. As we do not have a pre-training step, we compare with the not-pre-trained (NPT) versions of MILC and STDIM. The input to the ML models was the same ICA time courses.
| Study | Our | [10.3389/fnins.2020.00630] | [10.1093/cercor/bhz129] | [arslan2018graph] | [gadgil2021spatiotemporal] | [article7] | Our | [Mahmood_2021] |
|---|---|---|---|---|---|---|---|---|
| Dataset | HCP | HCP | HCP | HCP | HCP | HCP | FBIRN | FBIRN |
| Subjects | 942 | 942 | 434 | 942 | 942 | 820 | 311 | 311 |
| Validation (CV folds) | 10 | 10 | 10 | 10 | 10 | 10 | 18 | 18 |
| AUC | 93.6 | NA | NA | NA | 85.45 | 88.125 | 82.5 | 78.8 |
| ACC | 86.0 | 84.6 | 68.7 | 83.98 | 76.95 | 79.9 | NA | NA |
| Precision | 87.2 | 86.19 | NA | 84.59 | NA | NA | NA | NA |
| Recall | 88.6 | 86.81 | NA | 87.78 | NA | NA | NA | NA |

Parcellation (only surviving entry, column not recoverable): Schaefer 400 + Fan 39.

TABLE II: Classification performance comparison of DECENNT with other DL methods on region-based data of the HCP and FBIRN datasets (refer to Section V-B1). Our DECENNT model outperforms all other methods in almost every metric. The best two scores are shown as bold and italic respectively. Note: As we use all the regions in the atlas, we report the mean accuracy for [10.1093/cercor/bhz129]. The results for GCN [arslan2018graph] on HCP data are reported by [10.3389/fnins.2020.00630], and the results for ST-GCN [gadgil2021spatiotemporal] are taken from [kim2021learning], which matches the number of subjects, parcellation, and number of CV folds. The accuracy (ACC) reported in the original ST-GCN paper [gadgil2021spatiotemporal] was obtained with a different number of subjects, ROIs, and CV folds, and by using test data for choosing the best-performing model.
| Method | Parcellation | Input | AUC |
|---|---|---|---|
| DECENNT | Schaefer | fMRI data | 0.70 |
| DECENNT | HO | fMRI data | 0.69 |
| [PARISOT2018117] | HO | fMRI + phenotypic data | |
| DeepGCN [CAO2021103015] | HO | fMRI + phenotypic data | |
| Metric Learning [KTENA2018431] | HO | fMRI data | 0.58 |

TABLE III: Comparison of AUC scores on the ABIDE1 region-based dataset (refer to Section V-B1). Existing methods use the Harvard Oxford (HO) parcellation. Unlike [PARISOT2018117, CAO2021103015], we use only fMRI data.

VI-A Classification

Our method is the best-performing model against SOTA methods, giving the highest AUC score on all the datasets used for classification (disorder, gender, speech) with ICA data. Even with region-based data, our model performs better than existing methods on the HCP and FBIRN datasets. As our model does not use phenotypic information about subjects, it lags behind [PARISOT2018117, CAO2021103015] on ABIDE. [PARISOT2018117] reports a decrease in AUC when different phenotypic information is used, which clearly shows the dependence on phenotypic data, and [KTENA2018431] reports a much lower AUC score using only fMRI data. Fig. 3 shows the classification results on ICA data. The machine learning methods fail due to the high data dimensionality and the relatively small number of subjects. Tab. II and Tab. III show the region-based classification results. We note that GIN [10.3389/fnins.2020.00630] uses test data for hyper-parameter tuning and early stopping, whereas we use validation data for both and use test data only to test the model. [kim2021learning] reports lower results for GIN [10.3389/fnins.2020.00630] (both ACC and AUC) when test data is not used as validation data.

VI-B Connectivity Matrix

We first present the learned connectivity for the relatively easier NLP task. Fig. 4 shows the difference in the learned ENC for the two keywords ('cat' and 'nine'). Fig. 4(a) shows high connectivity between higher channels, whereas Fig. 4(b) shows high connectivity for relatively lower channels, which follows the high-frequency sounds in 'cat' and the relatively lower-frequency sounds in 'nine'.

(a) Cat ENC
(b) Nine ENC
Fig. 4: ENC learned by our model for the keywords 'cat' and 'nine' superimposed with noise. We used a test fold of subjects and computed the mean ENC with 10 trials per subject. Our model accurately gives high attention values to medium-to-high channels for 'cat' and low-to-medium channels for 'nine' samples. A green box marks the region whose average values are compared with those outside it in Fig. 4(a) and Fig. 4(b). The x and y axes denote the frequency channels in hertz (Hz).

Next, we compare the connectivity matrix learned by our model on a neuroimaging dataset with the one given by Pearson product-moment correlation coefficients (PCC), which is probably the most popular method for computing a connectivity matrix. Fig. 5 shows that the two matrices are comparable, but our ENC in Fig. 5(a) is directed and provides additional details. We also see that our ENC in Fig. 5(a) has more inter-network connectivity, which is missing in Fig. 5(b). The effect of the visual (VI) network on other networks is seen only in Fig. 5(a). We group the ICA components according to [10.3389/fnsys.2011.00002] into seven domains based on anatomical and functional properties. 53 of the 100 components fall into the seven domains, and the rest are marked as noise. The connectivity matrix clearly shows that the components have high intra-domain connectivity, which matches the existing literature [10.3389/fnsys.2011.00002].

Fig. 5: (a) The connectivity matrix generated by our model for the FBIRN dataset. We used a test fold of subjects and computed the mean ENC over all subjects and trials. (b) The mean FNC of the same subjects generated by PCC. Both matrices are strikingly similar, which supports the correctness of the connectivity matrix learned by our model. To match the positive weights of our model, we normalize the FNC to the range 0 to 1 instead of -1 to 1.
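The PCC baseline in Fig. 5(b) can be sketched with `numpy.corrcoef`. This is an illustrative sketch, not the original analysis code: the linear rescaling `(r + 1) / 2` is an assumption about how the range -1 to 1 was mapped to 0 to 1, the function name is hypothetical, and random data stands in for the ICA time courses.

```python
import numpy as np

def pcc_connectivity(timecourses):
    """Pearson correlation between components, rescaled from [-1, 1] to [0, 1].

    timecourses: (n_components, n_timepoints) array.
    """
    corr = np.corrcoef(timecourses)   # rows are variables, so the result is (n, n)
    return (corr + 1.0) / 2.0         # assumed linear rescaling

rng = np.random.default_rng(0)
timecourses = rng.normal(size=(4, 157))   # 4 components, 157 time-points (illustrative)
fnc = pcc_connectivity(timecourses)
```

The result is symmetric by construction, which is why a PCC-based matrix cannot express the direction of an interaction.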

Furthermore, as our model learns ENC, we use Fig. 6 to show the importance of direction. Fig. 6 (left) shows directed edges between components, written as (source, target); for example, the edge (8,21) means the edge is from component 8 to component 21. It is observable that the components in the visual (VI) domain heavily affect components in the sensorimotor (SM) domain. The direction is reversed in Fig. 6 (right), and SM does not affect VI. A similar direction can be seen between cognitive control (CC) and SM. The presence of direction is of paramount importance and is missing from FNC. It can potentially help to pose and answer interventional questions about the data.

Fig. 6: Top 10 directed edges of FBIRN ENC from Fig. 5(a). The numbers represent the crucial components. The figure shows the direction of connectivity: visual (VI) affects other domains, and cognitive control (CC) affects sensorimotor (SM). Edges: VI → other: 79, other → VI: 25. CC → SM: 9, SM → CC: 3.
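Directed edge counts of this kind can be obtained from an asymmetric matrix with a few lines of code. The sketch below is illustrative only: the helper names, the convention that entry (i, j) is the edge from node i to node j, and the toy matrix and domain labels are all assumptions rather than details taken from the paper.

```python
import numpy as np
from collections import Counter

def top_directed_edges(enc, k):
    """Return the k largest off-diagonal entries as (source, target, weight).

    Assumes enc[i, j] is the weight of the edge from node i to node j.
    """
    w = np.array(enc, dtype=float)
    np.fill_diagonal(w, -np.inf)             # ignore self-connections
    flat = np.argsort(w, axis=None)[::-1][:k]  # indices of the k largest weights
    src, dst = np.unravel_index(flat, w.shape)
    return [(int(i), int(j), float(w[i, j])) for i, j in zip(src, dst)]

def count_domain_pairs(edges, domain_of):
    """Count edges per ordered (source domain, target domain) pair."""
    counts = Counter()
    for i, j, _ in edges:
        counts[(domain_of[i], domain_of[j])] += 1
    return counts

# Hypothetical 3-node example with made-up weights and domain labels.
enc = np.array([[0.0, 0.9, 0.8],
                [0.1, 0.0, 0.7],
                [0.2, 0.3, 0.0]])
domains = ["VI", "SM", "SM"]
edges = top_directed_edges(enc, k=3)
counts = count_domain_pairs(edges, domains)
```

Because the pairs are ordered, `counts[("VI", "SM")]` and `counts[("SM", "VI")]` are tallied separately, which is what makes an asymmetry between the two directions visible.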

VI-C Temporal Attention

Because rs-fMRI subjects are not performing any specific task at any time-point, no ground truth is available for which time-points are important. For this reason, we show keyword detection experiments, where the precise location of the keyword is known. Fig. 7 shows attention weights for 8 test subjects. The attended time-points match the time-points of the keyword. This is a significant result: it shows that the model can find important time-points even though the location of the keyword was never given to it. We compute statistics such as precision and recall for the temporal attention and report them in the caption of Fig. 7. To obtain the true labels, we assign label ’1’ to time-points where the ’cat’ audio is superimposed and label ’0’ to all other time-points. For the predicted labels, we assign label ’1’ to time-points with an attention value greater than 0 and label ’0’ to all other time-points. The statistics show that the model a) assigns high attention values to ’cat’ time-points, b) does not attend to ’non-cat’ time-points and c) does not attend to all ’cat’ time-points. Although we would have liked the model to attend to all ’cat’ time-points, we believe it does not for two reasons: 1) the 1-sec long ’cat’ audio files on average contain the ’cat’ sound for only 0.5 seconds or less, whereas we used the complete 1-sec span of time-points when creating Fig. 7 and the statistics; 2) the model may be looking for a part of the keyword ’cat’ that is distinct from the noise.

(a) All trials
(b) Averaged
Fig. 7: Fig. 7(a) shows the normalized temporal attention weights for the keyword detection task for 8 subjects, with 10 trials each. Fig. 7(b) shows the mean weights of the same subjects. The top red line marks the actual time-points of the keyword. Statistics: True Positive=56, False Positive=18, True Negative=331, False Negative=291, Precision=0.76, Sensitivity=0.16, Specificity=0.95.
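The reported statistics follow directly from the confusion counts in the caption. The sketch below shows one way to derive them, assuming the thresholding rule described in the text (attention greater than 0 is a positive prediction); the function names and the `threshold` parameter are illustrative and not taken from the paper.

```python
def confusion_counts(true_labels, attention, threshold=0.0):
    """Count TP, FP, TN, FN when a time-point is predicted positive
    if its attention value is strictly greater than `threshold`."""
    tp = fp = tn = fn = 0
    for y, a in zip(true_labels, attention):
        pred = 1 if a > threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def summary(tp, fp, tn, fn):
    """Precision, sensitivity (recall) and specificity from the counts."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Counts taken from the caption of Fig. 7.
stats = summary(tp=56, fp=18, tn=331, fn=291)
```

Rounded to two decimals, these counts give a precision of 0.76, a sensitivity of 0.16 and a specificity of 0.95, matching the caption. The low sensitivity corresponds to point c) in the text: the model attends to only a subset of the ’cat’ time-points.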

To further check the correctness of the time-points selected by our model and their effect on classification performance, we perform an experiment where, after training the model, we compute using only top attended time-points for training data to train an LR model and then use the top time-points for the test data to test the model. We perform the same experiment for the bottom values. Tab. IV shows the comparison on three brain datasets. The results show that the LR model achieves a high AUC score using only the top 5% of the time-points attended by the model. This indicates that a) not all time-points are important for classification in the downstream task and b) our model finds the important time-points. We use an LR model for this experiment to show that the usefulness of the learned top/bottom time-points is not limited to our model: an independent LR model achieves high classification performance using the top attended time-points and fails to learn from the bottom time-points. In our experiments, we also note up to drop in AUC when not using temporal attention.
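The selection step of this experiment can be sketched as below. This is only an illustration under stated assumptions: the function name, the array shapes, and the rounding rule for the number of kept time-points are hypothetical, and the features computed from the selected time-points and the downstream LR training are left out because the text does not specify them.

```python
import numpy as np

def select_time_points(x, attention, fraction=0.05, top=True):
    """Keep the most (or least) attended time-points of one sample.

    x: array of shape (time_points, features).
    attention: array of shape (time_points,) with one weight per time-point.
    Returns the selected rows in their original temporal order, plus indices.
    """
    n_keep = max(1, int(round(fraction * len(attention))))
    order = np.argsort(attention)            # ascending attention
    idx = order[-n_keep:] if top else order[:n_keep]
    idx = np.sort(idx)                       # restore temporal order
    return x[idx], idx

# Hypothetical toy sample: 20 time-points, 2 features, increasing attention.
x = np.arange(40).reshape(20, 2)
attention = np.linspace(0.0, 1.0, 20)
x_top, idx_top = select_time_points(x, attention, fraction=0.1, top=True)
x_bottom, idx_bottom = select_time_points(x, attention, fraction=0.1, top=False)
```

Applying the same function with `top=True` and `top=False` yields the two inputs compared in Tab. IV. Any classifier can then be trained on the reduced data, which is what makes the check independent of the model that produced the attention weights.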

Time-points   Model     AUC (one column per brain dataset)
100%          DECENNT   0.844   0.72    0.65
Top 5%        LR        0.835   0.713   0.642
Bottom 5%     LR        0.566   0.548   0.532
TABLE IV: AUC score comparison on brain datasets with ICA components, using all, the top 5%, or the bottom 5% of time-points.

VII Conclusion

Our model demonstrates the importance of learning dynamic temporal graphs for multivariate time series, which is currently missing from the existing literature. Using dynamic graphs, our model outperforms SOTA methods across five different tasks, showing that it is applicable across different fields and tasks. By learning the graph structure/connectivity matrix from the data, our model removes the need for external methods such as PCC and K-means. Our model learns a directed graph structure that provides more detail than a symmetric correlation matrix, which does not capture effective connectivity. As seen in the results, our learned EC matrices give the direction of connectivity between brain regions. The temporal attention module proves to be highly effective for classification. As shown in the paper, it provides stable attention weights and accurately finds the critical time-points for the downstream task. Both the self-attention and temporal attention modules produce stable, consistent attention values and increase classification performance across tasks. These attributes address the questions regarding the explainability of attention raised in [jain2019attention, wiegreffe2019attention]. Many systems across many fields are dynamic and lack an observed graph structure (e.g., brain functional networks, social networks, self-driving cars), which increases the need for methods like DECENNT. Temporal attention applied to brain connectivity can help find important biomarkers of a disorder or disease, which in turn help us understand the disorder and its causes. For future work, we plan to interpret the learned connectivity structures extensively and to examine how they differ between controls and patients and across multiple brain disorders. We also want to incorporate a form of spatial attention which, like temporal attention, could help identify essential nodes/components that are often unavailable in many fields. Finally, for each class of subject, we want to examine how ENC changes over time and whether and how the direction of information flow changes through time.