1 Introduction
Functional connectivity refers to the functionally integrated relationship between spatially separated brain regions [1]. Recent work has revealed that functional connectivity exhibits meaningful variations within the time series captured by resting-state fMRI (rs-fMRI) [2, 3]. As a consequence, a considerable amount of work has been directed at quantifying dynamic functional connectivity. A popular way of quantification [2, 4, 5]
is to first categorize the time-varying connectivity patterns of a subject or a population into several states. Then, the dwell time of each state and the transition probabilities across states are used to characterize functional dynamics and perform group analyses [4].

Most existing works identify major connectivity states (commonly observed states) by grouping the dynamic connectivity patterns into a few clusters [2, 4, 5]. However, some studies have indicated that there exist many minor states containing rare connectivity patterns that persist only briefly (i.e., with a small occupancy rate) [6]. These minor states often provide little merit to analysis because they may correspond to random individual brain variation, inaccurate connectivity patterns computed during state transitions, or rs-fMRI noise. Instead of merging minor states into the major ones [4, 5], recent work suggests disentangling minor from major states by modeling an infinite number of clusters [6, 7]. However, connectivity patterns in minor states may correspond to pure noise, so grouping them into clusters is not meaningful. In this paper, we address these concerns by developing a statistical framework where the patterns associated with major states are drawn from an informative distribution while we use a non-informative distribution for minor states.
Motivated by the truncated stick-breaking representation of Dirichlet processes [8, 9], our approach is guided by a Dirichlet prior when separating dynamic connectivity patterns into major and minor states. We define dynamic connectivity patterns by computing correlation matrices associated with sliding windows. We assume that the correlation matrices belonging to major states are generated by a non-linear process from a low-dimensional latent space, where the latent representations follow a Gaussian-mixture distribution. We assume that the rest of the correlation matrices, which correspond to minor states, are generated from a uniform distribution in the original space. To determine the optimal parameters of our model, we derive the variational lower bound of its log marginal probability and find the maximum of that lower bound by optimizing a variational autoencoder. As a result, our method, tGM-VAE, simultaneously achieves clustering and outlier detection: tGM-VAE clusters dynamic connectivity patterns associated with major states and treats the minor states as outliers. In this work, we first demonstrate that tGM-VAE achieves higher accuracy in defining major clusters and outliers than traditional Gaussian-mixture-based approaches when clustering synthetic data with ground truth. We then report that, for 150k correlation matrices derived from rs-fMRI scans of 593 adolescents in the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA), tGM-VAE identifies meaningful connectivity states and a significant effect of age on their mean dwell time.
In the following, we first review existing VAE-based clustering approaches in Section 2. In Section 3, we introduce the generative model of tGM-VAE and the variational lower bound of the resulting log marginal likelihood, and we reformulate tGM-VAE as a joint clustering and outlier-detection approach. We present our synthetic experiments and clinical data analysis in Section 4.
2 Related Work
Traditional approaches to clustering connectivity patterns are mostly based on Gaussian-mixture models [4, 5]. These methods usually require fitting probability distributions in a high-dimensional space, which is a challenging task. Moreover, it has been found that the underlying distributions of both fMRI measurements [6] and the derived correlation matrices [10] lie on a non-linear latent space. Therefore, modeling Gaussian mixtures in the original space is suboptimal.

Generative models used in connection with neural networks, such as variational autoencoders (VAEs), have recently attracted much attention for their capability of modeling latent representations of the data
[11]. In a VAE, the encoder approximates the intractable posterior distribution of the latent representation, and the decoder aims to reconstruct the observation from its latent representation. While the traditional VAE assumes that latent variables follow a single Gaussian prior, recent works adopt mixture models in the latent space for semi-supervised learning [12] and clustering [13]. Dilokthanakul et al. [13] construct a two-level latent space that allows for a multi-modal prior of latent variables, but this model exhibits over-regularization effects that require specific optimization procedures. Jiang et al. [14] explicitly define a generative process based on a mixture of Gaussians in the latent space, which achieves better clustering performance. Our tGM-VAE model is built upon a generative model similar to [14] to capture major states, but it also includes a non-informative distribution for modeling minor states.

Besides the above approaches for modeling a fixed number of clusters, Bayesian non-parametric models have been adopted to model an infinite number of clusters. The semi-supervised approach proposed in [15] uses multiple VAEs as a proxy of Gaussian-mixture models and automatically determines the number of VAEs by maximizing the reconstruction capability for the entire dataset. The stick-breaking construction [9] has also been adopted in the VAE framework for semi-supervised classification, where the latent representation is a set of truncated categorical weights. While this approach is not intrinsically built for clustering, the truncation strategy motivates us to use the last category (the remainder of the truncation) to capture all dynamic connectivity patterns that do not belong to major clusters. Contrary to the above two approaches, tGM-VAE only models the encoding/decoding process for major clusters and omits the latent representation for the remainder. This strategy is useful when the remainder corresponds to (a) minor clusters of no interest to analysis, so that modeling their latent representations is redundant; or (b) outliers whose latent representations are meaningless or do not form clusters.

3 Methods
3.1 The Generative Model
Let $X = \{x^{(1)}, \ldots, x^{(N)}\}$ be a training dataset with $N$ observations. Each $x^{(i)}$ represents a dynamic connectivity pattern, e.g., the upper-triangular part of an ROI-to-ROI correlation matrix derived from the rs-fMRI time series at a given sliding window [2]. We assume that each $x^{(i)}$ belongs to a state, which, in our proposed generative process, is encoded by the categorical variable $c^{(i)} \in \{1, \ldots, K+1\}$. The first $K$ categories represent the major states and $c = K+1$ corresponds to the remainder (minor states). $c$ is drawn from a categorical distribution $Cat(\pi)$, where $\pi$ belongs to the $K$-dimensional simplex and is generated from a Dirichlet prior with two parameters $Dir(\alpha, \beta)$. By construction, a single parameter $\alpha$ controls the portion of the $K$ major clusters indifferently, and $\beta$ separately controls the portion of the remainder via a stick-breaking procedure [8].

For simplicity, let $x$ denote $x^{(i)}$. We assume that when $c = k$ with $k \le K$, $x$ is generated from a latent representation $z$ through a non-linear process modeled by a neural network $f$ with parameters $\theta_f$: $p(x \mid z, c = k) = \mathcal{N}(f(z; \theta_f), \sigma^2 I)$, where $\sigma$ is the fixed standard deviation of noise. We further assume $z$ is drawn from a Gaussian distribution with mean $\mu_k$ and an identity covariance: $p(z \mid c = k) = \mathcal{N}(\mu_k, I)$ with $k \in \{1, \ldots, K\}$. In other words, the marginal distribution of $z$ follows a Gaussian mixture in the latent space. On the other hand, when $c = K+1$, we assume $x$ is simply drawn from a uniform distribution in a unit domain embedded in the original space containing all observations after normalization: $p(x \mid c = K+1) = 1$. Based on the above generative model with parameters $\theta = \{\theta_f, \mu_1, \ldots, \mu_K\}$, we have $p(x \mid c = k; \theta) = \int_z p(x \mid z, c = k; \theta)\, p(z \mid c = k)\, dz$ for $k \le K$, and $p(x \mid c = K+1) = 1$. The Bayesian graphical diagram of this model is given in Fig. 1a.

3.2 Variational Lower Bound
Given the training dataset $X$ and the two parameters $(\alpha, \beta)$ of the Dirichlet prior, the generative model parameters $\theta$ are determined by maximizing the marginal probability $p(X \mid \alpha, \beta; \theta)$. Assuming the $x^{(i)}$ are i.i.d., the log marginal probability can be written as:

$$\log p(X \mid \alpha, \beta; \theta) = \sum_{i=1}^{N} \log p(x^{(i)} \mid \alpha, \beta; \theta). \quad (1)$$
In the above equation, the log likelihood $\log p(x^{(i)} \mid \alpha, \beta; \theta)$ cannot be directly optimized, so variational inference is used to maximize its lower bound. Typically, lower bounds for graphical models are derived by approximating an intractable posterior on the latent variables $p(z, c \mid x)$ with a tractable function $q(z, c \mid x)$. Here we make the common mean-field assumption: $q(z, c \mid x) = q(z \mid x)\, q(c \mid x)$. When omitting the superscripts $(i)$ to simplify notation, the lower bound reads:

$$\log p(x \mid \alpha, \beta; \theta) = \log \sum_{c} \int_z p(x, z, c \mid \alpha, \beta; \theta)\, dz \quad (2)$$
$$\ge \mathbb{E}_{q(z, c \mid x)}\left[\log p(x, z, c \mid \alpha, \beta; \theta) - \log q(z, c \mid x)\right] \quad (3)$$
$$= \mathbb{E}_{q(z \mid x)\, q(c \mid x)}\left[\log p(x \mid z, c) + \log p(z \mid c) + \log p(c) - \log q(z \mid x) - \log q(c \mid x)\right] \quad (4)$$
$$= \sum_{k=1}^{K} q(c = k \mid x)\, \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z, c = k)\right] - \sum_{k=1}^{K} q(c = k \mid x)\, D_{KL}\!\left(q(z \mid x) \,\|\, p(z \mid c = k)\right) - D_{KL}\!\left(q(c \mid x) \,\|\, p(c)\right) \quad (5)$$
$$= -D_{KL}\!\left(q(c \mid x) \,\|\, p(c)\right) + \sum_{k=1}^{K} q(c = k \mid x)\, \mathcal{L}_k, \text{ with} \quad (6)$$
$$\mathcal{L}_k := \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z, c = k)\right] - D_{KL}\!\left(q(z \mid x) \,\|\, p(z \mid c = k)\right), \quad (7)$$

where the remainder category $c = K+1$ contributes no reconstruction or latent term because $\log p(x \mid c = K+1) = 0$ and no latent representation is modeled for it,
and where $D_{KL}(\cdot \,\|\, \cdot)$ denotes the KL divergence between two probability distributions.
Interpretation of the Lower Bound. $\mathcal{L}_k$ corresponds to the formulation of the traditional single-Gaussian VAE [11] with respect to the $k$-th cluster. Specifically, the first term of $\mathcal{L}_k$ encourages the decoded reconstruction of the latent variable $z$ to resemble the observation $x$. The term $D_{KL}(q(z \mid x) \,\|\, p(z \mid c = k))$ is commonly interpreted as a regularizer encouraging the approximate posterior to resemble the cluster-specific Gaussian prior $\mathcal{N}(\mu_k, I)$.

The right term of the lower bound (Eq. 6) sums the losses of single-Gaussian VAEs over the $K$ major clusters and weighs them by the cluster-assignment probability $q(c = k \mid x)$. Maximizing this term improves the encoding/decoding capability for patterns in major states while encouraging their latent variables to form clusters. The left term of the lower bound corresponds to the KL divergence between $q(c \mid x)$ and $p(c)$ and encourages the posterior categorical distribution to approximate the categorical prior. It is important to note that latent representations are only modeled for the $K$ major clusters but not for the remainder. The portion of the remainder is controlled by the left term of the lower bound.
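With identity covariances on both $q(z \mid x)$ and the cluster priors, every term of this lower bound has a simple form. The following NumPy sketch evaluates Eq. (6) for a single observation, using a one-sample Monte-Carlo estimate of the reconstruction term; the function names and interfaces are our own illustration, not the authors' implementation.

```python
import numpy as np

def kl_identity_gaussians(mu_q, mu_p):
    # KL( N(mu_q, I) || N(mu_p, I) ) reduces to 0.5 * ||mu_q - mu_p||^2
    return 0.5 * np.sum((mu_q - mu_p) ** 2)

def lower_bound(x, x_recon, mu_x, mu_k, q_c, p_c, sigma=1.0):
    """Single-sample lower bound of Eq. (6), up to additive constants.

    x       : observation (D,)
    x_recon : decoded reconstruction f(z) for one latent sample z (D,)
    mu_x    : encoder mean mu(x) of q(z|x) (d,)
    mu_k    : cluster means, shape (K, d)
    q_c     : approximate posterior q(c|x), shape (K+1,); last entry = remainder
    p_c     : categorical prior p(c), shape (K+1,)
    """
    K = mu_k.shape[0]
    # Per-cluster single-Gaussian VAE loss L_k; the uniform remainder
    # contributes log 1 = 0 and has no latent term.
    recon = -np.sum((x - x_recon) ** 2) / (2.0 * sigma ** 2)
    L = np.array([recon - kl_identity_gaussians(mu_x, mu_k[k]) for k in range(K)])
    kl_cat = np.sum(q_c * np.log(q_c / p_c))  # KL( q(c|x) || p(c) )
    return -kl_cat + np.sum(q_c[:K] * L)
```

The helper exploits that $D_{KL}(\mathcal{N}(\mu_q, I) \,\|\, \mathcal{N}(\mu_p, I)) = \tfrac{1}{2}\|\mu_q - \mu_p\|^2$ when both covariances are the identity.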
3.3 Reformulation
In this section, we reformulate our model to demonstrate, by reorganizing the lower bound of Eq. 6, that tGM-VAE can be interpreted as a joint outlier-detection and clustering framework. Given the generative process described in Section 3.1, the categorical variable $c$ can be constructed by first differentiating the major clusters from the remainder. Let $v$ denote a Bernoulli variable generated by $Bern(1 - \eta)$, where $\eta$ defines the portion of the remainder and is controlled by $\beta$. When $v = 1$ (i.e., for major clusters), a cluster-assignment variable $t \in \{1, \ldots, K\}$ is drawn from a categorical distribution $Cat(\pi')$, where $\pi'$ follows $Dir(\alpha)$. This construction also involves the two parameters $(\alpha, \beta)$. The graphical diagram of this model is given in Fig. 1b.

For posterior inference, different $q$ functions are constructed for the reformulated generative process. Let $q(v \mid x)$ denote the approximate posterior of assigning $x$ to either the major clusters or the remainder, and let $q(t \mid x)$ denote the major-cluster assignment given $v = 1$. Then the posteriors $q(c = k \mid x)$ in Section 3.2 become
$$q(c = k \mid x) = q(v = 1 \mid x)\, q(t = k \mid x) \quad \text{for } k \le K, \quad (8)$$
$$q(c = K+1 \mid x) = q(v = 0 \mid x). \quad (9)$$
Replacing the terms in Eq. 6 with Eqs. 8 and 9 leads to the following lower bound

$$\mathcal{L} = q(v = 1 \mid x)\, \mathcal{L}_{GMM}(x) - D_{KL}\!\left(q(v \mid x) \,\|\, p(v)\right), \quad (10)$$
where $\mathcal{L}_{GMM}(x) = -D_{KL}(q(t \mid x) \,\|\, p(t)) + \sum_{k=1}^{K} q(t = k \mid x)\, \mathcal{L}_k$ is exactly the formulation of the Gaussian-mixture VAE with $K$ clusters [14]. From Eq. (10) we can see that $q(v = 1 \mid x)$ essentially gives the probability of $x$ being an inlier. Data with high inlier probability are then clustered by $q(t \mid x)$, while the right term in Eq. (10) regularizes the portion of outliers with parameter $\beta$. In practice, we use an additional weight to balance the two types of losses in Eq. (10), a common practice in VAE frameworks [13, 16].
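As a sketch, the reformulated objective of Eq. (10) can be written as a loss to be minimized; the `weight` argument plays the role of the additional balancing weight mentioned above, and the names and sign convention are our own:

```python
import numpy as np

def bernoulli_kl(q1, p1):
    # KL( Bern(q1) || Bern(p1) ), with q1 = q(v=1|x), p1 = p(v=1)
    q0, p0 = 1.0 - q1, 1.0 - p1
    return q1 * np.log(q1 / p1) + q0 * np.log(q0 / p0)

def truncated_loss(q_v1, L_gmm, p_v1, weight=1.0):
    """Negative lower bound of Eq. (10): the Gaussian-mixture VAE loss is
    applied only with the probability q(v=1|x) of x being an inlier, while
    the Bernoulli KL term regularizes the estimated portion of outliers."""
    return -(q_v1 * L_gmm) + weight * bernoulli_kl(q_v1, p_v1)
```

A sample confidently assigned to the remainder ($q(v=1 \mid x) \approx 0$) thus contributes almost nothing to the reconstruction/clustering part of the loss.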
3.4 Clustering Correlation Matrices
The design of our VAE network is based on the above inference procedure. More specifically, all the approximate posteriors are modeled by neural networks. Similar to the traditional VAE [11], $q(z \mid x)$ is modeled by an encoder network (Fig. 1c, red blocks), which encodes the posterior as a multivariate Gaussian with an identity covariance $\mathcal{N}(\mu(x), I)$. While allowing for a diagonal or full covariance are both reasonable practices, we simply rely on the non-linear neural network to capture the covariance structure, and we only use the mean $\mu(x)$ to capture the clustering effects in the latent space. The encoder has 3 densely connected hidden layers with tanh activation. The dimensions of the 3 layers are determined by $D$, the leading "power of two" that is smaller than the input dimension (e.g., $D = 64$ for a 15×15 correlation matrix with 105 upper-triangular elements). The decoder network has the inverse structure of the encoder and uses an MSE reconstruction loss. For the optimization of these two networks, the SGVB estimator and the reparameterization trick are adopted [11].

Contrary to previous work [14, 17], we also use neural networks to model the categorical posteriors $q(v \mid x)$ and $q(t \mid x)$ (Fig. 1c, orange blocks). Their first two layers are shared with the encoder of $q(z \mid x)$, and the last layer is densely connected with softmax activation. This construction rigorously reflects the structure of the generative model described in Section 3.3 (Fig. 1b) and allows for two separate mechanisms: detecting outliers with $q(v \mid x)$ and assigning clusters with $q(t \mid x)$. By comparison, a single neural network for $q(c \mid x)$ would be obtained from the model described in Section 3.1 (Fig. 1a), but this network would treat the clusters and outliers indifferently.
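The architecture described above can be sketched in plain NumPy as follows; the weights are random and the layer widths only illustrate the shared-trunk design with separate softmax heads for $q(v \mid x)$ and $q(t \mid x)$ (a structural sketch under our assumptions, not the trained network):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    # Randomly initialized dense layer (weights, bias)
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

D_in, D, d, K = 105, 64, 3, 5     # 15x15 matrix -> 105 inputs; latent dim 3
W1, b1 = dense(D_in, D)           # shared trunk: first two hidden layers
W2, b2 = dense(D, D // 2)
W3, b3 = dense(D // 2, D // 4)    # third encoder layer (mean branch only)
Wmu, bmu = dense(D // 4, d)       # q(z|x) mean; covariance fixed to identity
Wv, bv = dense(D // 2, 2)         # q(v|x): inlier vs. remainder head
Wt, bt = dense(D // 2, K)         # q(t|x): major-cluster assignment head

def encode(x):
    h2 = np.tanh(np.tanh(x @ W1 + b1) @ W2 + b2)   # shared layers
    mu = np.tanh(h2 @ W3 + b3) @ Wmu + bmu          # latent mean mu(x)
    q_v = softmax(h2 @ Wv + bv)                     # outlier detector
    q_t = softmax(h2 @ Wt + bt)                     # cluster assignment
    return mu, q_v, q_t
```

The decoder (not shown) would mirror the mean branch in reverse, mapping the latent sample back to the 105 correlation values.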
4 Results
tGM-VAE was first validated and compared to traditional clustering approaches in synthetic experiments, where rs-fMRI series and time-varying correlation matrices were simulated according to a ground-truth state sequence. We measured, in particular, the accuracy of tGM-VAE in estimating connectivity states. Then, tGM-VAE was used to cluster the 150k correlation matrices obtained from the rs-fMRI scans of 593 adolescents in the NCANDA study [18]. The relation between the age of a subject and the mean dwell time of the connectivity states was finally examined.
4.1 Synthetic Experiments
Data Simulation. We followed the simulation procedure presented in [2, 6] by first generating a state sequence of 50000 time points associated with 10 connectivity states, among which 5 states were major states. The transition probability from the $j$-th state to the $k$-th state was set according to the Kronecker delta function $\delta_{jk}$ plus a small random perturbation. This process led to self-transition probabilities varying between 0.9 and 0.95, and cross-state transition probabilities between 1e-4 and 0.05. The mean dwell time of a state (the average time that a state continuously persists before switching to another state) varied between 8 and 15 time points. The occupancy rate of a major state (the percentage of the sequence occupied by that state) varied between 8% and 30%, and the total occupancy of the 5 minor states varied between 5% and 10%. These metrics are similar to those reported for real rs-fMRI data in [6].
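A hypothetical version of this state-sequence simulation, with dominant self-transitions and a dwell-time summary, might look as follows (the exact perturbation scheme of the transition matrix is our assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_states(T=50000, n_states=10):
    # Transition matrix with dominant self-transitions (0.9 - 0.95) and
    # randomly weighted off-diagonal (cross-state) probabilities.
    P = rng.uniform(size=(n_states, n_states))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)
    stay = rng.uniform(0.9, 0.95, size=n_states)
    P = P * (1.0 - stay[:, None])
    np.fill_diagonal(P, stay)
    seq = np.empty(T, dtype=int)
    seq[0] = rng.integers(n_states)
    for t in range(1, T):
        seq[t] = rng.choice(n_states, p=P[seq[t - 1]])
    return seq, P

def mean_dwell_time(seq):
    # Average length of runs during which the state stays constant
    change = np.flatnonzero(np.diff(seq)) + 1
    runs = np.diff(np.concatenate(([0], change, [len(seq)])))
    return runs.mean()
```

With self-transition probabilities in the 0.9-0.95 range, runs of roughly 10-20 time points emerge, in line with the dwell times quoted above.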
Next, a connectivity pattern was simulated for each state. In the first experiment, we assumed that there were 15 regions of interest (ROIs) in the brain, so each state was associated with a 15×15 matrix, known as the community matrix [6]. For the $k$-th state, a 1D loading vector $u_k$ with entries in $\{-1, 0, 1\}$ (representing positive, negative, or no activation of each ROI) was randomly generated. Then, the community matrix was computed as $u_k u_k^T$ [2].

Afterwards, synthetic rs-fMRI signals at each time point were randomly sampled from a Gaussian distribution with the covariance being the state-specific community matrix at that time point. Gaussian noise of standard deviation 0.1 was further added to the synthetic rs-fMRI series. Finally, dynamic correlation matrices were generated using a sliding window of length 11. These different steps are summarized in Fig. 2.
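The steps above can be sketched as follows, assuming the community matrix is the rank-one outer product of the loading vector (our reading of the procedure in [2, 6]):

```python
import numpy as np

rng = np.random.default_rng(2)

def community_matrix(n_roi=15):
    u = rng.choice([-1, 0, 1], size=n_roi)   # per-ROI loading in {-1, 0, 1}
    return np.outer(u, u)                     # rank-one community matrix u u^T

def simulate_series(seq, covs, noise_sd=0.1):
    # Sample each time point from the state-specific covariance, then add noise
    n_roi = covs[0].shape[0]
    x = np.stack([rng.multivariate_normal(np.zeros(n_roi), covs[s]) for s in seq])
    return x + rng.normal(scale=noise_sd, size=x.shape)

def sliding_corr(x, win=11):
    # Dynamic correlation matrix for each sliding window of length `win`
    return np.stack([np.corrcoef(x[t:t + win].T) for t in range(len(x) - win + 1)])
```

Because the community matrix is only positive semi-definite, the added observation noise is what keeps every ROI's sample variance strictly positive when correlations are computed.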
Clustering Accuracy. tGM-VAE clustered the dynamic correlation matrices into 5 major states. The parameters were set such that they corresponded to an accurate estimate of the portion of the remainder, a rather non-informative Dirichlet prior, and a strong regularization on the portion of the remainder. Fig. 3 presents the 3D latent space associated with tGM-VAE. Only the 5 major states are displayed, as the latent representations of the remainder were not modeled. We can observe that the latent representations were reasonably clustered by states, thanks to the Gaussian-mixture modeling in the latent space [9, 13, 14].
To associate the 5 estimated clusters with the 5 ground-truth major states, the correlation matrices in an estimated cluster were first averaged and linked to the closest community matrix with respect to the Frobenius norm. As there was no interest in differentiating minor connectivity states, the clustering accuracy was measured with respect to 6 classes (5 clusters + remainder). tGM-VAE was compared with three other clustering approaches, as indicated in Fig. 4. Both the Gaussian-Mixture Model (GMM) and the Gaussian-Mixture VAE (GMVAE) clustered the entire dataset into 5 clusters (merging minor states into major ones); the non-parametric Dirichlet Process (DP) Gaussian-mixture approach modeled an infinite number of clusters, so the 5 largest clusters estimated by DP were considered major states and the rest was considered the remainder. The clustering accuracy of these approaches was 68.4% (GMM), 69.0% (DP), 74.8% (GMVAE) and 78.5% (tGM-VAE). Fig. 3b shows the estimated state sequences produced by tGM-VAE (most accurate) and GMM (least accurate). We observe that the two VAE-based methods produced significantly higher clustering accuracy than the two traditional Gaussian-mixture methods (GMM and DP). This improvement indicates that the modeling of latent representations and the associated non-linear generative processes, as provided by the VAE framework, was helpful in analyzing correlation matrices. Moreover, the truncation of tGM-VAE accurately captured the minor states and provided a 3.7% improvement over GMVAE, whereas explicitly clustering minor states was a less effective strategy (DP yielded only a 0.6% improvement over GMM).
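A minimal sketch of this matching and scoring procedure (the helper names are ours):

```python
import numpy as np

def match_clusters(cluster_means, community_mats):
    """Link each estimated cluster (mean correlation matrix) to the closest
    ground-truth community matrix under the Frobenius norm."""
    return [min(range(len(community_mats)),
                key=lambda s: np.linalg.norm(m - community_mats[s], 'fro'))
            for m in cluster_means]

def state_accuracy(pred, truth):
    # Accuracy over the 6 classes (5 major clusters + remainder)
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float(np.mean(pred == truth))
```

Matching by mean matrix rather than by label overlap avoids the usual label-permutation ambiguity of clustering evaluation.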
Next, the above comparison was repeated for different simulation settings (Fig. 4). To demonstrate that tGM-VAE can generalize to brain parcellations of different scales, the number of ROIs was varied between 10 and 50, which covers the typical range used in existing analyses of functional dynamics [2, 4, 6, 7]. In all settings the two VAE-based approaches produced more accurate clusterings, and tGM-VAE was the most accurate approach. This was also the case when the standard deviation of the noise in the synthetic rs-fMRI time series was varied between 0.05 and 1. Another important parameter in the analysis of functional dynamics (albeit not specific to clustering approaches) is the length of the sliding window used for computing correlation matrices. Previous works often use a window longer than the mean dwell time of the connectivity states in order to reliably compute correlation values, but this strategy can fail to differentiate dynamic connectivity patterns across neighboring states because a long window often covers multiple state transitions. While the analysis of window length is not the focus of the present work, our experimental results (Fig. 4c) indicate that choosing a window longer than the mean dwell time does not guarantee accurate clustering.
Note that the shallow neural networks tested here are a simplification choice, not a limitation of the method. Further exploration of the network structure could lead to better results for tGM-VAE. For instance, setting the dimension of the latent space larger than 3 would produce higher accuracy for large correlation matrices (Fig. 4d).
4.2 The NCANDA Dataset
We applied tGM-VAE to the rs-fMRI data of 593 normal adolescents (age 12-21; 284 boys and 309 girls) from the NCANDA study [18] to investigate dynamic connectivity states in young brains. The rs-fMRI time series were preprocessed using the publicly available pipeline described in the NCANDA study [18]. For each subject, functional time series were extracted from 45 cerebral regions (averaged bilaterally) as defined by the SRI24 atlas [19]. Dynamic correlation matrices of size 45×45 were then derived for each subject based on a sliding-window approach [2] and improved by a linear shrinkage operation [20]. As mentioned, there is no consensus on the optimal length of the sliding window. In the present work, we selected the length that produced the largest number of strong correlations (absolute value > 0.5) to maximize the information contained in the training data. Our experiments suggest that the optimum was achieved at 10 time points (22 s) regardless of the parcellation used to produce the correlation matrices (Fig. 6). Afterwards, a total of 153,587 matrices were derived for the entire cohort and clustered by tGM-VAE into 5 major states [4]. The dimension of the latent space was set to 6. Other parameters were set as in the synthetic experiments. Fig. 5 shows the mean correlation matrices associated with the 5 major states detected by tGM-VAE and visualizes their graph structures. These 5 states correspond to well-known functional networks: the auditory network (State 1), the limbic and thalamostriatal network (State 2), the visual network (State 3), the salience network (State 4) and the default mode network (State 5).
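For illustration, a generic linear shrinkage toward the identity and the window-selection criterion can be sketched as follows; the fixed shrinkage intensity below is a placeholder for the MMSE-optimal intensity of [20]:

```python
import numpy as np

def shrink_correlation(S, lam=0.1):
    """Linear shrinkage of a sample correlation matrix toward the identity:
    (1 - lam) * S + lam * I. Ideally lam would be chosen by an MMSE
    criterion as in Chen et al. [20]; a fixed value is used here only
    for illustration."""
    S = np.asarray(S, dtype=float)
    return (1.0 - lam) * S + lam * np.eye(S.shape[0])

def count_strong(corrs, thresh=0.5):
    # Number of off-diagonal correlations with |r| > thresh, used to pick
    # the sliding-window length that maximizes information content.
    mask = ~np.eye(corrs.shape[-1], dtype=bool)
    return int((np.abs(corrs[..., mask]) > thresh).sum())
```

`count_strong` would be evaluated over all windowed correlation matrices for each candidate window length, and the length with the largest count retained.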
Based on the clustering results, the state sequence was recovered for each subject and the mean dwell time over all states was computed. A group analysis was then performed to investigate the aging effect on the mean dwell time. First, sex and scanner type were removed as confounding factors from the mean dwell time using regression analysis [18, 21]. The residuals were then correlated with age, resulting in a significant positive correlation (one-tailed $p = .0006$, Fig. 6). This age-related increase of mean dwell time could also be observed when the analysis was repeated with the dimension of the latent space varying between 3 and 7. These results indicate that each connectivity state tends to persist longer in older adolescents, which converges with the current concept of neurodevelopment that the variation of dynamic functional connectivity declines with age [22].

5 Conclusion
In this paper, we have presented a novel joint clustering and outlier-detection approach to identify functional connectivity states. Our model, tGM-VAE, introduces for the first time a truncated Gaussian-mixture model in the variational autoencoder framework. This approach allows us to cluster data corrupted by noise, outliers and minor clusters of no interest to analysis. We used tGM-VAE to extract major functional connectivity states from resting-state fMRI scans and to characterize their dynamics. We showed that modeling latent representations of correlation matrices improves clustering accuracy compared to traditional Gaussian-mixture approaches and that our truncation strategy is useful in disentangling minor from major connectivity states. In the future, we will expand our framework to improve the modeling of state transitions.
References
 [1] Buckner, R., Krienen, F., Yeo, B.: Opportunities and limitations of intrinsic functional connectivity MRI. Nat Neurosci. 16(7) (2013) 832–837
 [2] Allen, E., Damaraju, E., Plis, S., Erhardt, E., Eichele, T., Calhoun, V.: Tracking whole-brain connectivity dynamics in the resting state. Cerebral Cortex 24(3) (2014) 663–676
 [3] Zalesky, A., Fornito, A., Cocchi, L., Gollo, L.L., Breakspear, M.: Time-resolved resting-state brain networks. PNAS 111(28) (2014) 10341–10346
 [4] Damaraju, E., Allen, E., Belger, A., et al.: Dynamic functional connectivity analysis reveals transient states of dysconnectivity in schizophrenia. NeuroImage: Clinical 5 (2014) 298–308
 [5] Yu, Q., Erhardt, E.B., Sui, J., et al.: Assessing dynamic brain graphs of time-varying connectivity in fMRI data: Application to healthy controls and patients with schizophrenia. NeuroImage 107 (2015) 345–355
 [6] Taghia, J., Ryali, S., et al.: Bayesian switching factor analysis for estimating time-varying functional connectivity in fMRI. NeuroImage 155 (2017) 271–290
 [7] Nielsen, S., Madsen, K., Røge, R., Schmidt, M.N., Mørup, M.: Nonparametric modeling of dynamic functional connectivity in fMRI data. In: Machine Learning and Interpretation in Neuroimaging Workshop. (2015)
 [8] Blei, D.M., Jordan, M.I.: Variational inference for Dirichlet process mixtures. Bayesian Analysis 1(1) (2006) 121–144
 [9] Nalisnick, E., Smyth, P.: Stick-breaking variational autoencoders. In: ICLR. (2017)
 [10] Zhao, Q., Kwon, D., Pohl, K.M.: A Riemannian framework for longitudinal analysis of resting-state functional connectivity. In: MICCAI. (2018)
 [11] Kingma, D., Welling, M.: Auto-encoding variational Bayes. In: ICLR. (2013)
 [12] Kingma, D.P., Rezende, D.J., Mohamed, S., Welling, M.: Semi-supervised learning with deep generative models. In: NIPS. (2014)
 [13] Dilokthanakul, N., Mediano, P.A., Garnelo, M.: Deep unsupervised clustering with Gaussian mixture variational autoencoders. Preprint (2017), arxiv.org/abs/1611.02648
 [14] Jiang, Z., Zheng, Y., Tan, H., Tang, B., Zhou, H.: Variational deep embedding: An unsupervised and generative approach to clustering. In: IJCAI. (2017)
 [15] Abbasnejad, E., Dick, A.R., van den Hengel, A.: Infinite variational autoencoder for semi-supervised learning. In: CVPR. (2017)
 [16] Higgins, I., Matthey, L., Pal, A., et al.: beta-VAE: Learning basic visual concepts with a constrained variational framework. In: ICLR. (2017)
 [17] Ebbers, J., et al.: Hidden Markov model variational autoencoder for acoustic unit discovery. In: Interspeech. (2017)
 [18] Müller-Oehring, E., Kwon, D., Nagel, B., Sullivan, E., et al.: Influences of age, sex, and moderate alcohol drinking on the intrinsic functional architecture of adolescent brains. Cerebral Cortex 28(3) (2017) 1049–1063
 [19] Rohlfing, T., Zahr, N., Sullivan, E., Pfefferbaum, A.: The SRI24 multichannel atlas of normal adult human brain structure. Hum Brain Mapp. 31(5) (2010) 798–819
 [20] Chen, Y., Wiesel, A., Eldar, Y.C., Hero, A.O.: Shrinkage algorithms for MMSE covariance estimation. IEEE Trans. on Sign. Proc. 58(10) (2010)
 [21] Pfefferbaum, A., Kwon, D., et al.: Altered brain developmental trajectories in adolescents after initiating drinking. American Journal of Psychiatry. 175(4) (2017) 370–380
 [22] Chen, Y., Wang, W., et al.: Age-related decline in the variation of dynamic functional connectivity: A resting state analysis. Front Aging Neurosci. 9(23) (2017)