Mutual independence is a key concept in statistics whose goal is to characterize the structural relationships between variables. As a consequence, a fundamental problem is to be able to determine from a sample of finite size whether some subsets of variables are mutually independent or not. To this end, several approaches have been proposed in the literature. For two- and three-way contingency tables, one can use the minimum discrimination information statistic against the null hypothesis of independence
Mutual independence is a key concept in statistics whose goal is to characterize the structural relationships between variables. As a consequence, a fundamental problem is to be able to determine from a sample of finite size whether some subsets of variables are mutually independent or not. To this end, several approaches have been proposed in the literature. For two- and three-way contingency tables, one can use the minimum discrimination information statistic against the null hypothesis of independence[1, Chap. 8, §2 and §3.1]; in the multidimensional case, there is the chi-squared test for independence [2, §23.8]. For multivariate normal distributions, one can resort to the likelihood ratio criterion [3, Chap. 9] or, again, the minimum discrimination information statistic [1, pp. 306–307] against the null hypothesis of independence. Such approaches, however, have a major drawback, in that one needs two and only two competing models, one being nested into the other one and used to generate a null distribution for a statistic of interest, usually under the asymptotic assumption of large sample size; if the statistic is beyond a certain threshold, the null assumption is rejected, but usually with no hint about the validity of the second model, not to mention the true underlying structure of dependence. In the case of several competing models, or when there is no hint regarding the underlying structure of mutual independence, the above-mentioned methods do not provide any way to approach the problem. As a consequence, these methods have a very restricted scope of application.
In the present study, we propose to change the investigation of mutual independence from a hypothesis-driven task that can only be applied in very specific cases to a blind and automated search within patterns of mutual independence. To this end, we develop a general framework for a data-driven investigation of all potential patterns of mutual independence. More specifically, we propose to treat the issue as one of model comparison that we solve in a Bayesian framework. A first step in this direction was proposed by  , who used Bayesian model comparison to quantify the probability for two discrete variables to be independent. A second step was performed by , who showed that a Bayes factor, comparing two models with and without independence, provided a relevant measure of similarity for agglomerative hierarchical clustering in the case of two subvectors of a multivariate normal distribution. We here propose to go further and perform a full probabilistic exploration of the independence pattern underlying the data. Such an approach heavily relies on the one-to-one mapping that exists between a pattern of mutual independence between , but requires a class of its own. After specification of a model of dependence, the Bayesian machinery makes it possible, from a theoretical perspective, to calculate the posterior probability of any partition (corresponding to any pattern of mutual independence) given some data. From a practical point of view, it allows the resulting posterior distribution on the set of all partitions (corresponding to the set of all patterns of mutual independence) to be explicitly computed (if possible), or otherwise approximated through a numerical sampling scheme.
, who used Bayesian model comparison to quantify the probability for two discrete variables to be independent. A second step was performed by
, who showed that a Bayes factor, comparing two models with and without independence, provided a relevant measure of similarity for agglomerative hierarchical clustering in the case of two subvectors of a multivariate normal distribution. We here propose to go further and perform a full probabilistic exploration of the independence pattern underlying the data. Such an approach heavily relies on the one-to-one mapping that exists between a pattern of mutual independence betweenvariables and a partition of . Comparing models of mutual independence is then equivalent to comparing partitions, which can be seen as a clustering problem and solved in the general framework of Bayesian model-based clustering [6, 7, 8, 9]. The present problem, however, can be solved by neither of the two main classes of clustering methods, namely the clustering of mixture models, e.g., [10, 11, 12, 13, 14], and the clustering of functional data or curves, e.g., [15, 16, 17, 18, 19, 20]
, but requires a class of its own. After specification of a model of dependence, the Bayesian machinery makes it possible, from a theoretical perspective, to calculate the posterior probability of any partition (corresponding to any pattern of mutual independence) given some data. From a practical point of view, it allows the resulting posterior distribution on the set of all partitions (corresponding to the set of all patterns of mutual independence) to be explicitly computed (if possible), or otherwise approximated through a numerical sampling scheme.
The outline of the article is the following. We first expose the problem and provide a theoretical treatment in the form of Bayesian model comparison. We then investigate the particular cases of multivariate normal and discrete distributions, providing asymptotic expressions for the log posterior distributions which turn out to be compatible with the Bayes information criterion , the likelihood ratio criterion  and the minimum discrimination information statistic . We propose a general Markov chain Monte Carlo (MCMC) algorithm to numerically approximate the posterior distribution on the space of all patterns of mutual independence. We then demonstrate the interest of the method on synthetic data as well as two real datasets, showing the unique insight provided by this approach. The discussion sums it up and rises some outstanding issues.
We start by proposing a quick description of the problem (Section 2.1) and its interpretation in a Bayesian framework in terms of model comparison (Section 2.2). We consider the special cases of multivariate normal distributions (Section 2.3) and discrete distributions (Section 2.4). In Section 2.5, we propose an efficient sampling scheme to explore the set of all partitions.
2.1 Mutual independence and partitions
Let be a -dimensional variate following a distribution with parameter . By definition, if can be decomposed into mutually independent subvectors , then can be decomposed as the product
The definition of a decomposition of into mutually independent subvectors is equivalent to the choice of a partition of , that is, disjoint and non-empty subsets (or blocks) ’s of whose union is equal to . A partition is denoted by concatenating the subsets composing it separated by the sign “”, e.g., for the partition of into . Let be the set of all partitions of . For instance, has 2 elements, while has 5. More generally, the total number of partitions of a set with elements is given by the th Bell number [22, 23, 24, 25, 26].
A note on notations
Consider the partition of into blocks. Using the language of partitions, we should denote by the -dimensional subvector of defined by , and, if , …, are mutually independent, then should be decomposed as the product
For the sake of simplicity, we will stick to and instead of and , respectively. Note however that these notations implicitly refer to a partition .
2.2 Model comparison
Consider a partition that decomposes into mutually independent subvectors , being of size . Given a realization of , we quantify the relevance of by its posterior probability . According to Bayes’ updating rule, this quantity yields
is the prior probability for
is the prior probability for; it characterizes what is known about before the data are available. In the following, it will be set as uniform on ; issues related to selecting the prior are discussed in Section 6.4. is the likelihood. It expresses how the data are related to the model. In the present case, it is also called marginal model likelihood, as it is obtained after removing the effect of the model parameters (see below). Finally, is the posterior probability of . It summarizes all the information that is available regarding after acquisition of the data.
Marginal model likelihood
As the distribution is assumed to be a function of some parameter , the marginal model likelihood can be obtained by marginalization of the usual model likelihood, yielding in the assumption where holds
In this expression, is equal to
If can itself be partitioned into subvectors, , each subvector characterizing the parameters of , and assuming prior independence of these parameters,
the marginal likelihood reads
Often we are under the assumption that the data are composed of independent and identically distributed (i.i.d.) realizations of , that is, . While this assumption does not simplify much the theoretical expression of the posterior distribution, it usually provides key simplifications in practical cases, as we will see in the next two particular cases of multivariate normal distributions and multivariate discrete distributions.
2.3 Multivariate normal distributions
We here consider the specific case where the data are multivariate normal. More specifically, let be a -dimensional multivariate normal variable with known mean and unknown covariance matrix . Under , is block-diagonal, each block corresponding to a subset of size . Given a dataset of i.i.d. realizations of and the corresponding sample sum-of-square matrix
and with conjugate (i.e., inverse-Wishart) prior for the covariance matrix, can be calculated in closed form and yields 
where is the prior (diagonal) scale matrix, its th block, the determinant of a matrix, the prior degree of freedom,
the prior degree of freedom,, and the inverse of a normalization constant
Incorporating the expression of from Eq. (12) into Bayes’ rule, Eq. (3), and taking into account the fact that does not depend on the model of dependence, and so is part of the normalization constant, we obtain for the posterior distribution
Note that if the mean is unknown, this calculation is still valid, with replaced by the sample mean
and the degree of freedom by .
Let be the sample covariance matrix corresponding to block . Asymptotically (), the log of the marginal likelihood can be expressed as 
plus a term that is proportional to and, hence, does not depend on , and terms that are . In this equation, the first term corresponds to the part of the maximum-likelihood that does depend on (see §1.1 of online supplement), while the second term is the BIC penalization for the dimension of the problem , i.e., of the form
In the case where we compare a model of full dependence with another nested model with mutually independent subvectors , the log-ratio of marginal model likelihoods yields
The first term of the right-hand side corresponds to the minimum discrimination information statistic against the null hypothesis of independence in the case of a multivariate normal distribution [1, Chap. 12, §3.6] or, equivalently, to the log of the likelihood ratio criterion for testing independence between sets of variates [3, Chap. 9]. This result can easily be generalized to any pair of nested models.
2.4 Cross-classified multinomial distributions
We now consider the case of a -dimensional discrete distribution, also coined cross-classified multinomial distribution [27, §7.1]. To this aim, let be a -dimensional discrete multivariate variable, such that each takes values in set with cardinality . For each block , we also define the set with cardinality . Under , can be decomposed into mutually independent variables , the model is parameterized by multidimensional parameters , and the likelihood reads
where is the probability to have . Given a dataset of i.i.d. realizations of and under the usual assumption of a Dirichlet prior with parameter for each , the marginal model likelihood has a simple expression (see §2.1 of online supplement), leading to a posterior probability of
where is the number of times that we observe for , and the usual Gamma function.
For , let , so that . Set also . Then the log posterior can be asymptotically expressed as (see §2.2 of online supplement)
where is the classical entropy function associated with , that is,
The first term in the asymptotic expression of , Eq. (2.4 ), corresponds to the maximum-likelihood estimate, while the second term is the BIC penalization for a model with
), corresponds to the maximum-likelihood estimate, while the second term is the BIC penalization for a model withparameters (corresponding to the -dimensional parameter together with the additional constraint ).
As for the multivariate normal case, consider the case where we compare a model of full dependence with another nested model with mutually independent subvectors . The log-ratio of marginal model likelihoods then yields
The first term of the right-hand side generalizes the minimum discrimination information statistic against the null hypothesis of independence in the case of discrete two- and three- way contingency tables [1, Chap. 8, §2 and §3.1]. This result can easily be generalized to any pair of nested models.
2.5 Sampling scheme
As mentioned above, the total number of partitions of a set with elements is given by the th Bell number . The first six Bell numbers are 1, 2, 5, 15, 52, and 203. The growth rate of is given by (see §3.1 of online supplement)
which is faster than exponential and slower than factorial. For instance, we have , while is larger than . As a consequence, exhaustive examination of all potential partitions quickly becomes intractable. To circumvent this issue, it is possible to resort to Markov chain Monte Carlo (MCMC) sampling, a powerful tool that is widely used in Monte Carlo integration but also in Bayesian data analysis to generate samples that, under certain conditions, will approximate a distributions of interest [28, Chap. 5]. In our case, we use it to generate a sample of partitions from our posterior distribution. Following [29, p. 322], we proceed according to the following sampling scheme:
From these partitions, use importance resampling to draw partitions (that is, sample partitions from the partitions without replacement with a probability of sampling each partition proportional to its posterior probability, see [29, §10.5]).
Use these partitions as starting points to run independent parallel chains of samples.
For the sampling itself,  proposed a Gibbs sampling approach (henceforth coined Gibbs) that sequentially scans all the elements one by one and considers moves from the current partition to any other partition differing from the current one in only that one element. Such an algorithm has the advantage of considering only a limited number of potential partitions at each step. Each step scans through the variables and, for each variable, there are as many options as there are partitions, which is limited by the number of variables; it is therefore , and the number of partitions considered is . However, this algorithm, by only moving one element at a time, is expected to generate highly correlated states and precludes large changes. To improve convergence, we also considered parallel tempering (PT) and a sampling scheme that can be conceptualized as an implementation of 2-way stochastic hierarchical clustering (2wSHC).
2.5.1 Parallel tempering
To allow for an exploration of the hypothesis space that is less likely to get trapped in (or around) local maxima, we can resort to parallel tempering [28, §10.4], see also  or . Parallel tempering is a sampling scheme that runs several dependent sequences with different target probabilities and allows state swaps between sequences with a certain probability. In our case, if the posterior probability for a model involving partition is denoted by , we set the target probabilities of the sequences as
We set , so that corresponds to the original distribution, while the ’s, , correspond to increasingly flattened versions of it. In that sense, parallel tempering is similar to simulated annealing, but with the advantage of respecting the detailed balance equation and, therefore, ensuring convergence towards the target distributions. At each step , the algorithm chooses between swapping (with probability ) and updating (with probability ). Swapping is proposed between , the current state of sequence , and , the current state of sequence , uniformly on and accepted with probability
2.5.2 2-way stochastic hierarchical clustering
An alternative approach to  is to resort to a method that also relies on MCMC but, at each step, considers as potential new states the current partition as well as all partitions that are obtained by either the merging of two blocks of the current partition or the division of one block of the current partition into two blocks. For a partition of into blocks , there are partitions that can be obtained by merging, and
that can be obtained by division, where is the Stirling number of the second kind, i.e., the number of partitions of a set with elements in blocks. It can furthermore be shown that (see §3.2 of online supplement). Such an algorithm can be seen as the stochastic exploration of a hierarchy through a simultaneous combination of agglomerative (bottom-up) and divisive (top-down) hierarchical clustering, by considering moving up the current state (through merging) or down (through division)—whence the term “2-way”, see also Fig. 1. From an algorithmic perspective, all merged partitions can be obtained in a straightforward manner, while divided partitions can be obtained using an algorithm proposed by , see also [26, §184.108.40.206]. This algorithm has the advantage of allowing for larger moves compared to . However, it also has two downsides. First, the structure of the discrete space might still make it hard to escape local maxima, all the more that their probabilities can become very large with increasing and . Besides, the number of potential partitions considered at each step quickly increases with the number of variables and may considerably slow down the sampling scheme.
2.5.3 Our approach
Practically, we implemented a sampler based on parallel tempering where swapping states is performed with probability , element-wise Gibbs sampling with probability , and sampling with the 2-way stochastic hierarchical clustering with probability . See Fig. 2 for an algorithmic description of the sampling scheme. As specified in Table 1, this approach includes as special cases Gibbs (, and ), 2wSHC (, and ), Gibbs+2wSHC (, and ), Gibbs+PT (, and ), 2wSHC+PT (, and ), and Gibbs+2wSHC+PT (, and ). Note that, due to the parallel tempering algorithm, each of the independent parallel chain is itself composed of dependent sequences.
3 Simulation study
We now consider a simulation study with variables, corresponding to potential partitions. The low dimension of the problem allows for an exhaustive calculation of all posterior probabilities on the solution space.
For , we considered partitions with an increasing number of blocks (). For a given value of , we performed 500 simulations. For each simulation, the 6 variables were randomly partitioned into clusters, all partitions having equal probability of occurrence [35, Chap. 12], . For a given partition of , we generated 300 i.i.d. samples following a distribution structured as in Eq. (1).
In a first batch of simulations (“Gaussian data”), each was set to either a univariate (if the size of was equal to 1) or multivariate (if ) normal distribution with mean and covariance matrix sampled according to a Wishart distribution with degrees of freedom and scale matrix the identity matrix and then rescaled to a correlation matrix. Such a sampling scheme on
degrees of freedom and scale matrix the identity matrix and then rescaled to a correlation matrix. Such a sampling scheme ongenerated correlation matrices with uniform marginal distributions for all correlation coefficients .
For the second batch of simulations (“non-Gaussian data”), we kept all ’s generated in the first batch but set each to a Student- distribution with degres of freedom, location parameter and scale matrix . was set in (the normal case would correspond to ).
We have to deal with multivariate normal distributions, and the corresponding resolution method involves two hyperparameters, namely a degree of freedom
We have to deal with multivariate normal distributions, and the corresponding resolution method involves two hyperparameters, namely a degree of freedomand a scale matrix (see Section 2.3). We here considered three alternative approaches to set these hyperparameters. The first approach, BayesOptim, sets the degree of freedom to the lowest value that still corresponds to a well-defined distribution, that is , and a diagonal scale matrix that optimizes the marginal model likelihood of Eq (12) in the case of 6 mutually independent variables . An alternative approach, BayesCorr, works with the sample correlation matrix instead of the sample covariance matrix. One can then set the number of degrees of freedom to and the scale matrix to the identity matrix. As mentioned in the previous paragraph, the corresponding prior distribution yields uniform marginal distributions for the correlation coefficients . A third approach, Bic, computes the posterior probability using the BIC approximation, which does not involve hyperparameters and is also insensitive to the fact that the input is the covariance matrix or the correlation matrix.
To assess the quality of the inference process, we considered three quantities: the posterior probability of the true underlying partition, the ratio between this posterior probability and the probability of the maximum a posteriori (MAP) partition, and the entropy of the posterior distribution in log 203. The posterior probability of the true underlying partition gives an absolute sense of the quality of the inference process as a consequence of both the information contained in the data and the quality of the method used. The two other quantities help to disentangle the relative contribution of the two factors (information contained in the data and method used). The entropy of the posterior distribution in log 203 yields values ranging from 0 for a degenerate posterior distribution (one partition has a posterior probability of 1, while all other partitions have a posterior probability of 0) to 1 for a uniform posterior distribution (all partitions have posterior probability of ). It is an indicator of the (lack) of information contained in the data. As to the ratio between the posterior probability of the true model and the probability of the MAP partition, it gives a sense of how far away the inference process is from picking the true model as the preferred model.
3.3.1 Gaussian data
We first compared the probability distributions obtained for the three methods applied to the Gaussian data (see Fig.
We first compared the probability distributions obtained for the three methods applied to the Gaussian data (see Fig.3 or §4.1 of online supplement). There was a relative correspondence between values, with a relationship that varied depending on the number of blocks () in the synthetic data.
The posterior probability of the true model increased with the number of samples and decreased with the number of clusters (Fig. 4, top left panel). Globally, the inference process tended to detect the true model as one of the most probable ones (Fig. 4, top right and bottom right panels). The difference lay in the entropy of the posterior distributions (Fig. 4, bottom left panel). For cluster, it quickly tended toward 0, indicating a very localized distribution and, hence, a precise inference. By contrast, increasing led to both an increase in entropy for a given sample size, but also a decrease in the speed at which entropy decreased with the sample size.
All in all, we found that the number of clusters in the simulated data had a dramatic influence on the inference process, in that it was all the harder to confirm the existence of a specific model that the given model had many blocks of mutually independent variables. Such a behavior in disfavor of models with sparse covariance matrices is further discussed in Section 6.6.
3.3.2 Non-Gaussian data
Analysis of the non-Gaussian data with the same methods led to several changes (see §4.2 of online supplement). First, the relationship between probabilities computed with BayesOptim and BayesCorr in the non-Gaussian case remained similar to the Gaussian case regardless of the degrees of freedom. By contrast, Bic behaved quite differently from BayesOptim, with a difference that decreased when the degrees of freedom increased to become quite similar to the Gaussian case for .
Besides, the medians of the four quantities used to assess the behavior of our approach were found to be quite similar to the Gaussian case. However, they also exhibited much more variability than in the Gaussian case, with again a variability that decreased with increasing to become quite similar to the Gaussian case for .
4 HIV study data
In this section, we consider a toy example of mutual independence extraction. The problem was already analyzed elsewhere [39, 40, 5]. As for the simulation study, its low dimension ( variables and potential partitions) allows for an exhaustive calculation of all posterior probabilities on the solution space.
The data originates from a study investigating early diagnosis of HIV infection in children from HIV positive mothers . The variables are related to various measures on blood and its components: and immunoglobin G and A, respectively; the platelet count; , lymphocyte B and T4, respectively; and the T4/T8 lymphocyte ratio. The observed correlation matrix is given in Table 2. According to , discussion with experts suggested the existence of a strong association between variables and as well as between variables , , and . Using conditional independence graphs,  found that the values of partial correlation between and other variables had probability around zero, which led him to hypothesize that his original model was over-parameterized. Still with conditional independence graphs,  found that the links between and the five other variables had low probability of existence and that no single graphical model was able to accurately account for the data, hinting that models of conditional independence graphs might be too refined for that specific dataset.  found that various hierarchical clustering methods tended to cluster and as well as and ; variable tended to be cluster with or depending on the clustering method.
We first computed the exact probability distribution of all potential partitions using the three variants BayesOptim, BayesCorr, and Bic mentioned in the simulation study. We then focused on BayesOptim. Regarding the quantities of interest, we first computed the probabilities for all potential partitions. From there, we computed the relevances associated to all subsets of . For a subset of , the relevance of is the probability to find as a block in the partitioning of ; it is calculated as the sum of all probabilities associated with partitions for which is a block. Finally, we computed the probability corresponding to the following two expert statements:
“There is a strong association between and ”: posterior probability for and to be partitioned in the same block regardless of the rest.
“There is a strong association between , , and ”: posterior probability for , , and to be partitioned in the same block regardless of the rest.
The three methods (BayesOptim, BayesCorr, and Bic) yielded similar results. The four most probable patterns of mutual independence were found to be the same in the same order (, , , and ) with a good agreement as to the weight of these models (see Table 3). These four models accounted for more than 99% of the probability distribution. Importantly, these four models are not all nested in one another (e.g., and ; and ). The fact that some of them were found to be nested in one another ( nested in and ; nested in ) was extracted by the analysis and not imposed a priori.
Relevances were computed for all subsets of in the case of BayesOptim (see §5 of online supplement). The most important features were that: 4 had a relevance of 0.994 and 12356 a relevance of 0.852. This means that the probability to find 4 in a block all by itself was equal to 0.994, while the probability to find block 12356 was equal to 0.852. The fact that the relevance of 4 was found to be larger than that of 12356 means that in some cases we found a partition with a block containing 4 alone but where 12356 was split, e.g., and .
The probability for and to be found in the same block was found to be very close to 1 (its negation had probability ), decomposed as follows for the most probable cases: within block 12356 with probability 0.852, within block 12 with probability 0.134, within block 126 with probability , and within block 124 with probability . The probability for , and to be found in the same block was equal to 0.989, decomposed as: within 12356 with probability 0.852, and within 356 with probability 0.136.
5 fMRI functional connectivity analysis
5.1 Materials and methods
Functional magnetic resonance imaging (fMRI) is an imaging modality that makes it possible to dynamically and non-invasively follow metabolic and hemodynamic consequences of brain activity. fMRI functional connectivity analysis is a field that investigates the organization of brain networks in resting-state fMRI data by extracting clusters of brain regions whose spontaneous activities are highly correlated [42, 43, 44]. We applied our approach to a real resting-state fMRI data composed of 205 time samples recorded for 82 brain regions in a young healthy subject (subject #1 of ). Since both assumptions of variables following a multivariate normal distribution and of i.i.d. realizations are quite common in the field, we resorted to the results of Section 2.3; to speed up the process, we used the BIC approximation of Eq. (16). We ran the analyses with Matlab on a desktop PC with Intel Xeon Silver 4114, 2.2 GHz, 10 cores and 64 GB RAM.
5.1.2 Numerical sampling
A system with variables can be associated with different partitions, preventing explicit computation of all the corresponding posterior probabilities. We therefore resorted to the approximate sampling scheme discussed in Section 2.5. More specifically, we ran 5 distinct sampling schemes: Gibbs, 2wSHC, Gibbs+2wSHC (with ), Gibbs+PT (with and ), 2wSHC+PT (with and ), and Gibbs+2wSHC+PT (with , and ). All sampling schemes were run with initial samples according to a uniform distribution, parallel chains, and samples for each chain. To account for burn-in, we discarded the first half of the samples and computed our quantities of interest on the second half.
We observed that 2wSHC tended to visit a limited number of states, but each visited state was associated with many proposal states. To speed up the calculation, we kept a database of states associated with more than proposal states that were visited in the last 200 steps. At each iteration, the procedure checked if the current state was in the database; if so, it used the values already computed instead of computing them over. This significantly sped up the process, but also increased the memory load.
To assess algorithmic variability, all sampling schemes were run 10 times, for a total of 50 runs.
5.1.3 Comparing algorithms
We assessed convergence of each of the 50 runs by quantifying between-chain heterogeneity with the average distance between estimated probabilities. More specifically, assume that partitions appeared during the sampling regardless of the chain. Let be the frequency of the th partition in chain , the frequency of the th partition when all chains are pooled, the vector of frequency estimates for chain
the vector of frequency estimates for chain, and the vector of frequency estimates from the whole sample. Between-chain heterogeneity was measured as the average of the distances between the frequencies observed in each chain and the frequencies observed in the whole sample, that is,
This quantity is greater than 0, and equal to 0 only when all chains yield the same estimates. To assess within- and between-algorithm variability, we also computed the distance between probability estimates obtained from the 50 runs.
All algorithms were quite computationally demanding. Gibbs, Gibbs+2wSHC, and Gibbs+PT ran smoothly for all 10 repetitions with the above parameters. 2wSHC failed twice out of ten repetitions. Gibbs+2wSHC+PT failed 10 times out of 10 with the original parameters; with a reduced chain length of , it failed once out of 10. As to 2wSHC+PT, we were not able to run it, even for a chain length as low as . All failures were due to insufficient memory to store the database of previous states visited and corresponding probabilities (see Section 5.1.2), either because the current state was related to too many potential states through 2wSHC, or because the global database was too large.
Computational times are summarized in Fig. 5. Gibbs was the fastest algorithm. Allowing for steps of 2wSHC added an extra burden. Running parallel tempering had two opposite effects. First, 7 parallel steps were run for each chain, adding computational burden in terms of time and memory load. However, about half of the time, a step only consisted of testing if two states could be swapped, which is faster than computing several probabilities. The two effect made that Gibbs+PT had the fastest steps and Gibbs+2wSHC+PT the longest steps.
The level of convergence varied quite widely between algorithms, as summarized in Fig. 6. Gibbs and 2wSHC performed quite badly, as all chains of a given algorithm converged to different states (characterized by a heterogeneity of ). Convergence of Gibbs was improved by adding steps of 2wSHC and parallel tempering. After samples, Gibbs+2wSHC+PT was the algorithm that provided chains with the best convergence level. However, taking time into account, Gibbs+PT was able to provide a better convergence level with in less time.
These differences in convergence were confirmed when considering the probabilities estimated from the 50 runs (see Fig. 7 and Table 4). The 10 repetitions of Gibbs and 2wSHC essentially converged to different partitions, leading to distance values close to 2. The estimates given by Gibbs+2wSHC were less dissimilar to one another. Gibbs+PT and Gibbs+2wSHC+PT generated estimates that were consistent across repetitions, and relatively similar between algorithms.
In the present manuscript, we first exposed the problem of extracting patterns of mutual independence, translated it in terms of comparing partitions, and proposed a theoretical treatment in the form of Bayesian model comparison. We investigated two particular cases: multivariate normal distributions and cross-classified multinomial distributions, showing that the Bayesian solutions relate to the likelihood ratio criterion, the minimum discrimination information criterion and the Bayes information criterion in the asymptotic case of large sample size. We proposed a general sampling scheme on the set of all partitions which combines Gibbs sampling, a stochastic 2-way exploration of a hierarchy, and parallel tempering. We finally demonstrated the interest of the method on synthetic data as well as two real datasets, showing the unique insights provided by this approach.
From a theoretical perspective, we believe that the present approach opens a new avenue for the investigation of mutual independence. Compared to existing tools (likelihood ratio criterion, minimum discrimination information statistic), the main advantage of our method is that it does not require to restate the problem in the form of two competing nested models to be compared in the asymptotic case of large sample size, but allows for a full data-driven exploration of the patterns of mutual independence. It also allows for a variety of questions to be formulated regarding any feature induced by the potential pattern of mutual independence, such as the number of blocks, the specific relationship between two variables, or the relationship of one variable to all other variables.
6.1 Mutual independence extraction and clustering
As mentioned in the introduction, the problem faced in the present study belongs to the general area of clustering, but does not fall into either of two of the main subareas that are the clustering of mixture models and the clustering of functional data and curves. For instance, a classical way to perform cluster analysis of the HIV data would be to consider the children as objects and the six variables as features. One could then want to cluster the 107 objects (children) into subgroups of objects that share similar profiles in terms of the 6 features. In this context, a common assumption is that (i) each observation in the sample may arise from any of a small number of different distributions (one per cluster), (ii) the objects to be classified (children) are independent from one another given the cluster they belong to, and (iii) the distribution corresponding to each cluster is characterized by a set of parameters whose dimension is fixed (and, in particular, does not depend on the number of elements in the corresponding class). This is not the issue that the current study tried to address. Instead, it considered the 6 variables as objects, each object having 107 features (the realizations over children), and tried to classify them according to their
As mentioned in the introduction, the problem faced in the present study belongs to the general area of clustering, but does not fall into either of two of the main subareas that are the clustering of mixture models and the clustering of functional data and curves. For instance, a classical way to perform cluster analysis of the HIV data would be to consider the children as objects and the six variables as features. One could then want to cluster the 107 objects (children) into subgroups of objects that share similar profiles in terms of the 6 features. In this context, a common assumption is that (i) each observation in the sample may arise from any of a small number of different distributions (one per cluster), (ii) the objects to be classified (children) are independent from one another given the cluster they belong to, and (iii) the distribution corresponding to each cluster is characterized by a set of parameters whose dimension is fixed (and, in particular, does not depend on the number of elements in the corresponding class). This is not the issue that the current study tried to address. Instead, it considered the 6 variables as objects, each object having 107 features (the realizations over children), and tried to classify them according to theirrelationships with one another. In this respect, our problem shares similarities with the clustering of functional data or curves, e.g., [15, 16, 17, 18, 19, 20]. However, while curve clustering usually assumes a certain form of temporal dependence, we here assumed independent and identically distributed realizations of a variable. This is the reason why the clustering of variables based on their pattern of independence belongs to neither class mentioned above and defines a class of its own.
Another possibility would be to consider biclustering . Biclustering has a natural translation in terms of partitions and Bayesian model comparison, but these models usually relate the values taken by the variables across examples. We are therefore not quite sure how it would apply to models of mutual independence.
6.2 Bayesian analysis, asymptotic approximation and model selection criteria
In two common cases—multivariate normal distributions (Section 2.3) and cross-classified multinomial distributions (Section 2.4)—we showed that the log of the posterior distribution could be asymptotically approximated by a criterion that is the sum of a maximum-likelihood ratio and a BIC correction for model complexity. In the simulation study (Section 3), we showed that the method based on the asymptotic approximation (coined Bic) gave results that were similar to the posterior distribution (see in particular Fig. 3, bottom). Similar results were found in the HIV study data (Section 4; see in particular Table 3). Retrospectively, one could have thought of the BIC criterion as a valid approach to perform blind extraction of mutual independence patterns. Yet, we are not aware of any reference advocating such an approach. It is only after the full Bayesian analysis we conducted in a general case, its application to multivariate normal distributions, and an investigation of the asymptotic behavior that the BIC appeared as a potential solution to the problem. As a matter of fact, we believe that considering extraction of mutual independence patterns as one of model comparison is a major contribution of the present manuscript.
6.3 Sampling scheme
From a practical point of view, the sampling scheme is a key factor for a full exploration of all potential patterns of mutual independence. In the present study, we combined a Gibbs sampling scheme that sequentially scans elements (Gibbs), a stochastic 2-way algorithm for hierarchical clustering (2wSHC), and parallel tempering (PT). We observed that an algorithm solely based on Gibbs or 2wSHC generated chains that were slow to converge and had very limited mixing. By contrast, mixing Gibbs and 2wSHC as well as introduction of PT allowed for a better exploration of the state space (quantified in terms of between-chain heterogeneity).
Our algorithm could be modified by completing the initial (uniform) random sampling in the set of all partitions with a step consisting of running the agglomerative hierarchical clustering on the data  and then base the importance resampling step on the visited states. This potential improvement was not carried through in the present study for the following reason. We used mixing of the chains as a measure of convergence. The validity of such a monitoring could be challenged in the case of chains starting from seeds that are close to one another from the algorithm’s perspective. Indeed, by construction, the various states of the hierarchical clustering are rather close to one another from the perspective of 2wSHC (at most steps). As a consequence, using seeds from the agglomerative hierarchical clustering might have artificially accelerated convergence for 2wSHC and biased our convergence analysis in favor of 2wSHC. By contrast, we hoped that feeding the chains with seeds sampled uniformly would make between-chain similarity a good indicator of convergence.
Besides the sampling algorithm proposed here, another approach might be to use the generalized Swendsen–Wang sampler (SWC) or generalized Gibbs sampler , even though such approaches are expected to work best on relatively sparse connectivity graphs—by contrast, in our case, the graph would be fully connected.
In the cases that we considered, the problem was simplified by the fact that all parameters could be integrated out and the marginal likelihood computed explicitly. When this is not possible, one has to deal with a joint posterior distribution of the partition and corresponding parameters, whose dimensions vary depending on the partition. To perform numerical sampling of such a distribution, one could consider resorting to reversible jump MCMC (RJ-MCMC) and, more precisely, split-and-merge [47, 48, 49], which would have to be adapted to deal with the specificity of the problem at hand.
6.4 Prior distributions
Setting prior distributions is a key and touchy issue that one usually has to deal with when performing a Bayesian analysis. Here, we faced the problem at two different levels: for the competing models of mutual independence, , and for the model parameters, e.g., for the multivariate normal distribution and for the cross-classified multinomial distribution.
In the present manuscript, the priors for the parameters were set as conjugate priors for the sake of convenience. While this choice is not fully satisfying, it has the advantage of allowing for closed form expressions for the marginal model likelihoods. The choice of a prior for
In the present manuscript, the priors for the parameters were set as conjugate priors for the sake of convenience. While this choice is not fully satisfying, it has the advantage of allowing for closed form expressions for the marginal model likelihoods. The choice of a prior foris further discussed in . More generally, consistency requires that parameters found in different models be assigned the same prior. One way to do this is to define the parameter’s prior distribution for the model with no mutual independence and then derive all other distributions through marginalization (see again discussion of ). In any case, the importance of the prior distributions decreases when the data size increases. In the limiting case of large sample size, the asymptotic expressions do not depend on the priors set for the model parameters, as we found for multivariate normal distributions and cross-classified multinomial distributions.
Regarding the prior probabilities for the competing models of mutual independence, we set them to a uniform prior. The consequences of setting such a uniform prior on the set of all partitions need to be investigated, for the structure of this set and its size might induce undesired properties. For instance, in the simulation study and the HIV study data, setting a uniform prior entailed that the probability to have a partition with 1, 2, 3, 4, 5, or 6 blocks was given by , 0.153, 0.443, 0.3202, , and , respectively. With increasing , the difference in probability increases dramatically. For , the prior probability for a partition to have a number of blocks in the range is given by , see also .
Now, depending on the information available for a given problem, one might wish to set different priors. Various models have been proposed for partitions. A general family of priors is the so-called product partition models [41, 31]. Also, specific features might be desirable for the prior distribution. For instance, in the case where the assignment of the labels to the variables is arbitrary, it would make sense to require the prior distribution to be exchangeable [50, §2]—see also §3.3 of online supplement. By contrast, other features might be rejected, such as consistency as defined in —see also §3.4 of online supplement.
Finally, it might be of interest to incorporate expert knowledge into the analysis. In a Bayesian framework, expert statements can easily be incorporated in the form of prior information, that is, by selecting a prior distribution over potential partition models that respect the prior knowledge. Information regarding the number of blocks as well as preferred or forbidden partitions may be relatively easily to model. In other cases (e.g., preferential connectivity patterns in the case of brain networks observed in fMRI, Section 5), translating specific information into prior probability might turn out to be a real challenge.
6.5 Dimension of the problem
The previous discussion on priors shows that the complexity of the model, and, hence, the ease with which it can be solved, is influenced by two factors: the set of all potential patterns of mutual independence and the model parameters corresponding to each pattern of mutual independence. For variables, the set of all potential partitions of a given set of variables into mutually independent components is of size , which is a function of only. As to the dimension of the model parameter space, it depends on the underlying model of mutual independence. For instance, in the case of multivariate normal distributions, a covariance matrix is fully specified by parameters; of these, one could argue that only (correlation coefficients) have a decisive influence on the pattern of mutual independence. Patterns of mutual independence, by setting some correlation coefficients to 0, reduce this number. In the case of a cross-classified multinomial distribution, the maximal number of parameters is given by . For instance, for binary variables (
. For instance, for binary variables (), we have a maximum of parameters, which is , but grows slower than according to (26). For a given data size, the number of model parameters has an influence on how well these parameters can be estimated and, as a consequence, our capacity to discriminate between models.
6.6 The difficulty to extract sparse models
In the simulation study (Section 3 ), we observed that the number of clusters underlying the data had a dramatic influence on the inference process, in that it was all the harder to confirm the existence of a specific model that the given model had many groups of mutually independent variables. An explanation for such a behavior, which can be related to the common issue of overfitting in statistics and machine learning, can be seen in the BIC approximation. Consider for instance the multivariate normal case (Section
), we observed that the number of clusters underlying the data had a dramatic influence on the inference process, in that it was all the harder to confirm the existence of a specific model that the given model had many groups of mutually independent variables. An explanation for such a behavior, which can be related to the common issue of overfitting in statistics and machine learning, can be seen in the BIC approximation. Consider for instance the multivariate normal case (Section2.3). The Bayes factor between a “reference” partition and another model associated with a sparser covariance matrix (henceforth shortened as “sparser model”), Equation (18), can be expressed as
is positive and tends to its theoretical counterpart , which is a multivariate generalization of mutual information [51, 52]. When the data originates from , and the Bayes factor is positive, tending to as , a clear indication that is to be preferred over . By contrast, when the data originate from , will tend to . In the most favorable case, where is indeed equal to 0 (which is actually of probability 0 on real data), the Bayes factor will be negative, but going to at the much slower rate of . We suspect that this behavior might not be specific to the simulation performed but more general, making extraction of sparse models particularly challenging.
6.7 Mutually exclusive and exhaustive models
In a Bayesian analysis, a key strategy when comparing competing hypotheses is the use of mutually exclusive and exhaustive propositions or models [53, §2.4]. This is the case for our competing patterns of mutual independence. In particular, the set of all partitions with the finer-than relationship forming a lattice [54, p. 15], we have to explicitly state that selection of one model precludes the selection of another, nested model to ensure mutual exclusion. This should be contrasted to the classical definition of mutual independence where such exclusion is not mentioned. For instance, for the simulation study and the HIV study data analyzed in the present manuscript with , assuming a pattern of mutual independence of means that the distribution for can be decomposed as
but does not specify anything regarding . In particular, we could further have
which would entail a pattern of mutual independence of . This standard definition of mutual independence is in line with the prior distributions set on the model parameters, which theoretically do not prevent such situations to occur. In the example above, the prior probability distribution for all correlation coefficients between and to be equal to 0 is not equal to 0. To achieve mutual exclusion of models, we should exclude for each prior distribution parameter values that are compatible with a more refined model. Such an approach is rarely carried out. Instead, many analyses using Bayesian model comparison (e.g., polynomial approximations of increasing degree) rely on parameter spaces that are embedded in one another. A theoretical justification of this approach can be found in . Practically, the consequence of this issue is rather limited. In the above example, the probability to obtain with a full correlation matrix is equal to 0. Therefore, a model with a full correlation matrix exhibits almost surely no pattern of mutual independence.
6.8 Covariance matrix vs. full dataset
In the special case of multivariate normal distributions (Section 2.3), we performed our inference based solely on the sample covariance matrix. The rationale for this is that, from a theoretical point of view, mutual independence is a property of the covariance matrix, which has to be block diagonal. As a consequence, it is often thought that the sole statistic of interest is the sample covariance matrix (see for instance the HIV study, Section 4), and we wanted our approach to be applicable to such cases. Note however that, to do so, we used a formula that is only valid when the number of samples is larger than the dimension of the problem, . When this is not the case, we need to use the whole data set and perform Bayesian inference on
and perform Bayesian inference oninstead of as was done here. This implies, for instance, using non-degenerate prior distributions for and , e.g., conjugate priors [29, §3.6]. The corresponding analysis was performed in §1.2 of online supplement, yielding a posterior probability that tends to the one found in Section 2.3 when a hyperparameter tends to 0.
6.9 Mutual independence extraction and graphical models
A point to discuss is the relationship between the extraction of patterns of independence and graphical models. Graphical models are graphical representations of variables that conveniently encode relations of dependence and independence between variables. There are for instance conditional independence graphs  and covariance graphs . Conditional independence graphs are usually considered in the context of multivariate normal distributions (graphical Gaussian models) or discrete distributions (graphical log-linear models), while covariance graphs are mostly used in the context of multivariate normal distributions. Conditional independence graphs are more refined than models of mutual independence. For instance, for variables, there are potential models of mutual independence but models of conditional independence graphs, with and . In conditional independence graphs, mutual independence has a natural translation: mutually independent clusters of variables are disjoint maximal connected components of the graph. But conditional independence graphs can code for much more than connected components. As such, conditional independence graphs are much more flexible models. One could therefore contemplate using tools from the field of graphical models to estimate the patterns of mutual independence. However, we do not expect this direction to provide efficient solutions. First, the problem of inferring the pattern of a graphical model is itself challenging, generally requiring specific assumptions and numerical approximations. For instance, in the simplest case of multivariate normal distributions, the structure of conditional independence is strongly related to that of the precision (or concentration) matrix, that is, the inverse of the covariance matrix. Such a matrix is particularly complex to estimate, as it involves a matrix inversion. The maximum-likelihood is complex to obtain under constraint of a given conditional independence graph (e.g., using the iterative proportional fitting algorithm); nondecomposable graphical models are particularly hard to deal with. Then, once performed, such an analysis provides information regarding the underlying pattern of conditional independence, a large part of which is irrelevant for mutual independence and would have to be discarded by marginalization. We therefore expect mutual independence extraction to be more robust than graphical model procedures. Also, one could think of cases (e.g., multivariate Student-t distributions) where models of mutual independence could be compared, whereas graphical mod