Graphical models are widely used in the analysis of multivariate data, providing a convenient and interpretable way to study relationships among potentially large numbers of variables. They are key tools in modern statistics and machine learning and play an important role in diverse applications. Undirected graphical models are used in a wide range of settings, including, among others, systems biology, omics, deep phenotyping[see, e.g. dobra2004, Finegold11, monti2014]wei2007, verzelen2009, stadler2013, stadler2015, perrakis2021].
A large part of the graphical models literature has focused on the case in which either only continuous variables or only discrete variables are present. Pertaining to the former case, Gaussian graphical models have been extensively studied, including in the high-dimensional regime [see among others Meinshausen06, Friedman08, Banerjee08, Lam09, Yuan10, Ravikumar11, Cai11]
. In such models, it is assumed that the observed random vector follows a multivariate Gaussian distribution and the graph structure of the model is given by the zero-pattern in the inverse covariance matrix. Generalizations for continuous, non-Gaussian data have also been studied[Miyamura06, Liu09, Finegold11]. In the latter case, discrete graphical models – related to Ising-type models in statistical physics – have also been extensively studied [see e.g. wainwright2006, ravikumar2010].
However, in many applications it is common to encounter data that is mixed
with respect to variable type, i.e. where the data vector includes components that are of different types (e.g. continuous-Gaussian, continuous-non-Gaussian, count, binary etc.). Such “column heterogeneity" (from the usual convention of samples in rows and variables in columns) is the rule rather than the exception. For instance, in statistical genetics the construction of regulatory networks using expression profiling of genes may involve jointly analyzing gene expression levels alongside categorical phenotypes. Similarly, diagnostic data in many medical applications may contain continuous measurements such as blood pressure as well as discrete information about disease status or pain levels for example. In analysing such data, it is often of interest to estimate a joint multivariate graphical model spanning the various variable types. In practice, this is sometimes done usingad hoc pipelines and data transformations. However, in graphical modelling, since the model output is intended to be scientifically interpretable and involves statements about properties such as conditional independence between variables, the use of ad hoc workflows without an understanding of the resulting estimation properties is arguably problematic.
There have been three main lines of work that tackle high-dimensional graphical modelling for mixed data. The earliest approach is conditional Gaussian modelling of a mix of categorical and continuous data [Lauritzen96] as treated by Cheng17, Lee15. A second approach is to employ neighborhood selection which amounts to separate modeling of conditional distributions for each variable given all others [see e.g. Chen15, Yang14, Yang19]. A third approach uses latent Gaussian models with a key recent reference being the paper of Fan17 who proposed a latent Gaussian copula model for mixed data. The generative structure in their work posits that the discrete data is obtained from latent continuous variables thresholded at certain (unknown) levels. However, Fan17
consider only a mix of binary and continuous data, and do not allow for more general combinations (including counts or ordinal variables) as found in many real-world applications.
This third approach will be in the focus of this paper, which aims to provide a simple framework for working with latent Gaussian copula models in order to analyze general mixed data. To do so, we combine classical ideas concerning polychoric and polyserial correlations with approaches from the high-dimensional graphical models and copula literature. As we discuss below, this provides an overall framework that is scalable, general, and straightforward from the user’s point of view.
Already in the early 1900s, Pearson1900, Pearson13 worked on the foundations of these ideas in form of the tetrachoric and biserial correlation coefficients. From these arose the maximum likelihood estimators (MLEs) for the general version of these early ideas, namely the polychoric and the polyserial correlation coefficients.
One drawback of these original measures is that they have been proposed in the context of latent Gaussian variables. A richer distributional family is the nonparanormal proposed by Liu09 as a nonparametric extension to the Gaussian family. A random vector is a member of the nonparanormal family when is Gaussian, where is a set of univariate monotone transformation functions. Moreover, if the ’s are differentiable on top of being monotone, the nonparanormal family is equivalent to the Gaussian copula family. As the polychoric and polyserial correlation assume that observed discrete data are generated from latent continuous variables, they in fact adhere to a latent copula approach.
We propose two estimators of the latent correlation matrix which can subsequently be plugged in to existing routines to estimate the precision matrix such as the graphical LASSO (glasso) [Friedman08], CLIME [Cai11], or the graphical Dantzig selector [Yuan10]. The first one is appropriate under a latent Gaussian model and simply unifies the aforementioned MLEs. The second one is more general and is applicable under the latent Gaussian copula model. Both approaches can deal with discrete variables with general numbers of levels. We show that both estimators exhibit favorable theoretical properties and include simulation as well as real data results. Thus, the main contributions of the paper are as follows:
We make the observation that incorporating polychoric and polyserial correlations into the latent Gaussian copula framework provides an elegant, simple and effective method for graphical modelling for fully general mixed data.
We provide theoretical results concerning the behaviour of the proposed estimators, including in the high-dimensional setting. These concentration results demonstrate that the procedures proposed are statistically valid.
We study the estimators empirically, via a range of simulations as well as an example using real phenotyping data of mixed type (from the UK Biobank). These results illustrate how the proposed methods can be used in practice and demonstrate that performance is often close to an oracle that is given access to true latent data.
Taken together, our results provide users a way to carry out graphical modelling of mixed data that is statistically sound and practically easy-to-use, involving no more overhead than standard high-dimensional Gaussian graphical modelling approaches and in particular no need to manually specify any model components (such as bridge functions) that are specific to the variable types.
The remainder of this paper is organized as follows. In Sections 2 and 3 we present the estimators based on polychoric and polyserial correlations, including theoretical guarantees in terms of concentration inequalities. In Section 4 we describe the experimental setup used to test the proposed approaches on simulated data with the results themselves appearing in Section 5. Section 6 showcases an illustrative empirical application using real data from the UK Biobank. We close with conclusions and directions to our R package hume which allows users to readily implement the herein developed methods.
2 Background and model set-up
The goal of this paper is to learn undirected graphical model structure for general mixed and high-dimensional data. To this end, we extend the Gaussian copula model[Liu09, Liu12, Xue12] so as to allow inclusion of any type of discrete and continuous data.
Definition 2.1 (latent Gaussian copula model for general mixed data).
Assume we have a mix of ordinal and continuous variables, i.e. where denotes -dimensional possibly ordered discrete variables and represents -dimensional continuous variables. Then, satisfies the latent Gaussian copula model, if there exists a corresponding -dimensional random vector of latent continuous variables s.t. where is the mean vector and the correlation matrix and a set of monotone univariate functions.
where represent some thresholds. For simplicity, we denote and , and the number of levels of .
Then we say that the -dimensional vector satisfies the latent Gaussian copula model, i.e. where is a collection of thresholds. If , then satisfies the latent Gaussian model, i.e. .
Let denote the latent precision matrix. Then, as shown in Liu09, the zero-pattern of under the latent Gaussian copula model still encodes the conditional independencies of the latent continuous variables. Thus, the underlying undirected graph is represented by just as for the parametric normal. Note that the latent Gaussian copula model for general mixed data in Definition 2.1 agrees with that of Quan18 and of Feng19. The problem phrased by Fan17 is a special case of Definition 2.1. A more detailed comparison between both approaches can be found in Section 3. Nominal discrete variables need to be transformed to a dummy system.
For the remainder of the paper assume we have an independent -sample of the -dimensional vector . We estimate by considering the corresponding entries separately i.e. the couples . Consequently, we have to keep in view three possible cases depending on the couple’s variable types, respectively:
Case I: Both and are continuous, i.e. .
Case II: is discrete and is continuous, i.e. and . By symmetry the case where is continuous and is ordinal is identical to case II.
Case III: Both and are discrete, i.e. .
2.1 Maximum-likelihood estimation under the latent Gaussian model
At the outset, we examine each of the three cases under the latent Gaussian model. Let us start with case I, where both and are continuous. This corresponds to the regular Gaussian graphical model set-up discussed thoroughly for instance in Ravikumar11. Hence, the estimator for when both and are continuous is:
Definition 2.2 (Estimator of ; Case I).
Let denote the sample mean of . The estimator of the correlation matrix is defined by:
for all .
Clearly, this is simply the Pearson product-moment correlation coefficient which of course coincides with the maximum likelihood estimator (MLE) for the bivariate normal couple.
Turning to case II, let be ordinal and be continuous. We are interested in the product-moment correlation between two jointly Gaussian variables, where is not directly observed but only the ordered categories (see Eq. (1)). This is called the polyserial correlation [Olsson82]. The likelihood and log-likelihood of the -sample are defined by:
denotes the joint probability ofand and the marginal density of the Gaussian variable . For notational simplicity the subscripts in and will be omitted. MLEs are obtained by differentiating the log-likelihood in Eq. (3) with respect to the unknown parameters and setting the partial derivatives to zero and solving the system of equations for , and .
Definition 2.3 (Estimator of ; Case II).
Recall the log-likelihood in Eq. (3). The estimator of the correlation matrix is defined by:
for all .
Regularity conditions ensuring consistency and asymptotic efficiency, as well as asymptotic normality can be verified to hold here [Cox74].
Lastly, consider case III, where both and are ordinal. Consider the probability of an observation with and :
where and and denotes the standard bivariate density with correlation . Then, as outlined by Olsson79 the likelihood and log-likelihood of the -sample are defined as:
where is a constant and denotes the observed frequency of and in a sample of size . Differentiating the log-likelihood, setting it to zero, and solving for the unknown parameters yields the estimator for for case III:
Definition 2.4 (Estimator of ; Case III).
Recall the log-likelihood in Eq. (6). The estimator of the correlation matrix is defined by:
for all .
Regularity conditions ensuring consistency and asymptotic efficiency, as well as asymptotic normality can again be verified to hold here [Wallentin17].
Summing up, under the latent Gaussian model is a consistent and asymptotically efficient estimator for the underlying latent correlation matrix . Corresponding concentration results are derived in Section 3.5.
3 Latent Gaussian Copula Models
Fan17 propose a special case of the latent Gaussian copula model in Definition 2.1 where they consider a mix of binary and continuous variables. In the spirit of the nonparanormal SKEPTIC [Liu12] they avoid estimating the monotone transformation functions directly by making use of rank correlation measures such as Kendall’s tau or Spearman’s rho. These measures are invariant under monotone transformations and for case I there exists a well-known mapping between Kendall’s tau and Spearman’s rho and the underlying Pearson correlation coefficient . Consequently, the main contribution of Fan17
is the derivation of corresponding bridge functions for cases II and III. When pondering the general mixed case, they recommend binarizing all ordinal variables.
This thought has been taken up by Feng19 who propose to first binarize all ordinal variables to form preliminary estimators and subsequently combine them meaningfully by some weighted aggregate.
In an attempt to work on generalizations regarding the bridge functions, Quan18 extended the binary latent Gaussian copula model to the setting where a mix of continuous, binary, and ternary variables is present. However, a considerable drawback of this procedure becomes apparent. While for the binary-continuous mix three bridge functions were needed – one for each case – the number of mappings increases for each discrete variable with distinct state space. Indeed, a mix of continuous variables and discrete variables with say different state spaces already requires bridge functions.
For this reason, we take a different avenue to the latent Gaussian copula model for general mixed data where the discrete variables are allowed to have any number of states. In this approach, the number of cases we need to consider remains exactly three as already introduced in the previous section.
3.1 Nonparanormal Case I
For case I, the mapping between and the population versions of Spearman’s rho and Kendall’s tau is well known [Liu09]. Here we make use of Spearman’s rho with and
denoting the cumulative distribution functions (CDFs) ofand , respectively. Then . In practice, we use the sample estimate
with corresponding to the rank of among and [similar setup as Liu12]. From this we obtain the following estimator:
Definition 3.1 (Estimator of ; Case I nonparanormal).
The estimator of the correlation matrix is defined by:
for all .
3.2 Nonparanormal Case II
For case II, matters become more involved. In order to make use of the rank-based approach regarding the nonparanormal model the ML procedure can no longer be applied as we do not observe the continuous variable in its Gaussian form. Instead, we will proceed by suitably modifying other approaches that address the Gaussian case through more direct examination of the relationship between and the point polyserial correlation [Bedrick92, Bedrick96].
In what follows, in the interest of readability we omit the index in the monotone transformation functions but explicitly allow them to vary among the . According to Defintion 2.2, we have the following Gaussian conditional expectation
where we can assume w.l.o.g. that . After multiplying both sides with the discrete variable we move it into the expectation on the left hand side of the equation. This is permissible as is a function of , i.e.
Now let us take again the expectation on both sides, rearrange and expand by , yielding
where is the product-moment correlation between the Gaussian (unobserved) variable and the observed discretized variable .
All that remains is to find sample versions of each of the three components in Eq. (10). Let us start with the expectation in the denominator . By assumption and therefore w.l.o.g. for all . Consequently, we have:
where denotes the standard normal density. Whenever the ordinal states are consecutive integers we have . Based on this derivation it is straightforward to give an estimate of , once estimates of the thresholds have been formed (see Section 3.4 for more details). Let us turn to the numerator of Eq. (10
). The standard deviation ofdoes not require any special treatment, and we simply use in order to be able to treat discrete variables with a general number of states. However, the product moment correlation is inherently more challenging as it involves the (unobserved) back-transformed version of the continuous variables. Therefore, we proceed to estimate the back-transformation.
To this end, consider the marginal distribution function of , namely such that . In this setting, Liu09
propose to evaluate the quantile function of the standard normal at a Winsorized version of the empirical distribution function. This is necessary as the standard Gaussian quantile function diverges quickly when evaluated at the boundaries of theinterval. More precisely, consider , where is again the quantile function of the standard normal and is a Winsorization operator, i.e. . The truncation constant can be chosen in several ways. Liu09 propose to use
in order to control the bias-variance trade-off. Thus, equipped with an estimator for the transformation functions, the product moment correlation is obtained the usual way, i.e.
where and . The resulting estimator is a double-two-step estimator of the mixed couple and .
Definition 3.2 (Estimator of ; Case II nonparanormal).
The estimator of the correlation matrix is defined by:
for all .
3.3 Nonparanormal Case III
Lastly, let us turn to case III where both and are discrete but they might differ in their respective state spaces. In the previous section the ML procedure could no longer be applied because we do not observe the continuous variable in its Gaussian form. In case III however, we only observe the discrete variables generated by the latent scheme outlined in Definition 2.2. Due to the monotonicity of the transformation functions the ML procedure for case III from Section 2.1 can still be applied i.e.
Definition 3.3 (Estimator of ; Case III nonparanormal).
The estimator of the correlation matrix is defined by:
for all .
In summary, the estimator under the latent Gaussian copula model is a simple but important tool for flexible mixed graph learning. By using ideas from polyserial and polychoric correlation measures, we not only have an easy-to-calculate estimator but also overcome the issue of finding bridge functions between all different kinds of discrete variables.
3.4 Threshold estimation
The unknown threshold parameters play a key role in linking the observed discrete to the latent continuous variables. Therefore, being able to form an accurate estimate of the is crucial for both the likelihood-based procedures as well as the nonparanormal estimator outlined above.
We start by highlighting that we set the model up such that for each there exists a constant such that for all , i.e. the estimable thresholds are bounded away from infinity. Let us define the cumulative probability vector . Then, by Eq. (1), it is easy to see that
denotes the cumulative distribution function of a standard normal random variable. From this equation it is immediately clear that the thresholds satisfy. Consequently, when forming sample estimates of the unknown thresholds we replace the cumulative probability vector by its sample equivalent, namely
and plug it into the identity, i.e. for . The following lemma assures that these threshold estimates can be formed with high accuracy.
Consider the event . The following bound holds for all and for some Lipschitz constant
The proof of Lemma 3.1 is given in Section 4 of the Supplementary Materials. All herein developed methods are applied in a two-step fashion. We stress this by denoting the estimated thresholds as in the ensuing theoretical results.
3.5 Concentration results
We start by stating the following assumptions:
For all , . In other words, there exists a constant such that .
For there exists a constant such that for any and for all
are not critical points of the respective log-likelihood functions.
The log-likelihood functions have a finite number of critical points.
Every critical point that is different from is non-degenerate.
All joint and conditional states of the discrete variables have positive probability.
Assumption 3.3 assures that the likelihood functions under the latent Gaussian model behave in a “nice” way. This is again a requirement that resembles a mild technical assumption. The following theorem, relies on Mei18 and requires four conditions that are verified to hold in Section 2 of the Supplementary Materials. We note that a similar approach has been employed by Anne19 in the context of zero-inflated Gaussian data under double truncation.
Case I of the latent Gaussian model deals only with observed Gaussian variables and concentration results can be retrieved for example from Ravikumar11. Having treated the latent Gaussian model, we now turn to the nonparanormal extension.
In principle the three cases will have to be considered again.
Case I: When both random variables are continuous concentration results follow immediately from Liu12 who make use of Hoeffding’s inequalities for -statistics.
Case II: For the case where one variable is discrete and the other one continuous, we present concentration results below.
Case III: When both variables are discrete we make the important observation that Theorem 3.2 above still applies and needs not be altered. We do not observe the continuous variables directly but only their discretized versions. As a consequence, the threshold estimates remain valid under the monotone transformation functions and so does the polychoric correlation.
The following theorem provides concentration properties for case II.
The proof of the theorem is given in Section 5 of the Supplementary Materials. With regards to the scaling of the dimension in terms of sample size the ensuing corollary follows immediately.
For some known constant independent of and we have
3.6 Estimating the precision matrix
Similar to Fan17, we plug our estimate of the sample correlation matrix into existing routines for estimating . In particular, we employ the graphical lasso (glasso) estimator [Friedman08], i.e.
where is a regularization parameter. As exhibits at worst the same theoretical properties as established in Liu09, convergence rate and graph selection results follow immediately.
We do not penalize diagonal entries of and therefore have to make sure that is at least positive semidefinite to establish convergence in Eq. (16). Hence, we need to project into the cone of positive semidefinite matrices [compare also Liu12, Fan17]. In practice, we use an efficient implementation of the alternating projections method proposed by Higham88.
In order to select the tuning parameter in Eq. (16) Foygel10 introduce an extended BIC (eBIC) in particular for Gaussian graphical models establishing consistency in higher dimensions under mild asymptotic assumptions. We consider
where governs penalization of large graphs. Furthermore, represents the cardinality of the edge set of a candidate graph on nodes and denotes the corresponding maximized log-likelihood [see Foygel10, for more details] which in turn depends on from Eq. (16).
In practice, first one retrieves a small set of models over a range of penalty parameters (called glasso path). Then, we calculate the eBIC for each of the models in the path and select the one with the minimal value.
4 Experimental Setup
In order to numerically assess the accuracy of our mixed graph estimation approach we begin with a simulation study in which the estimators can be assessed in a gold-standard fashion and compared against oracles.
To facilitate comparison, we follow a similar data-generating strategy to Fan17. We split the experiments into three parts. First, we consider a binary mixed case and benchmark the performance of our approach against Fan17. Second, we generate a mix of binary-ternary-continuous data. Although Quan18 do not report any numerical results in their simulation study, we compare our approach to their extension of Fan17. Third, from we generate discrete data with arbitrary numbers of levels and compare performance with the latent continuous oracle. The results for the binary-ternary-continuous data simulations as well as a detailed description of the simulation setup can be found in Section 6
of the Supplementary Materials. We set the dimension to for sample size and choose such that the number of edges is roughly equal to the dimension – except for , where we allow for edges in accordance with Fan17.
Performance metrics. To assess performance, we report the mean estimation error as evaluated by the Frobenius norm. Furthermore, we consider graph recovery metrics. To this end, we define the number of true positives and false positives depending on the glasso path as the number of nonzero lower off-diagonal elements that agree both in and and the number of nonzero lower off-diagonal elements in that are actually zero in , respectively. The true positive rate and the false positive rate are defined as and , respectively. Lastly, we consider the area under the curve (AUC) where a value of corresponds to random guessing of the presence of an edge and a value of corresponds to perfect error-free recovery of the underlying latent graph (in the rank sense of ROC analysis).
5 Simulation Results
5.1 Simulation results: binary-continuous mix
We start by considering a mix of binary and continuous variables generated as outlined in Section 4 in order to be able to compare results with those of Fan17. For this purpose, Table 1 reports mean estimation errors under the different regimes. When for all , we recover the latent Gaussian and when the latent Gaussian copula model.
The oracle estimator in the third column of Table 1 corresponds to estimating from Definition 2.2 based on realization of the latent data . Next in column four, the binary indicates the nonparanormal estimator proposed by Fan17. The next two columns, namely and indicate the ML approach and the general mixed estimator developed in Sections 2.1 and 3, respectively.
simulation runs. Standard errors in brackets
As expected, the estimation error for and in the latent Gaussian setting is almost identical for and our proposed nonparanormal estimator . Additionally, performs best in this case. Compared to the oracle estimator, all three approaches exhibit very little loss in accuracy that arises due to the binarization. Looking at graph recovery in terms of FPR, TPR, and AUC the picture is similar.
Turning to the nonparanormal setting with under and