MGM
Source code for mixed graphical model estimation
While graphical models for continuous data (Gaussian graphical models) and discrete data (Ising models) have been extensively studied, there is little work on graphical models linking both continuous and discrete variables (mixed data), which are common in many scientific applications. We propose a novel graphical model for mixed data, which is simple enough to be suitable for high-dimensional data, yet flexible enough to represent all possible graph structures. We develop a computationally efficient regression-based algorithm for fitting the model by focusing on the conditional log-likelihood of each variable given the rest. The parameters have a natural group structure, and sparsity in the fitted graph is attained by incorporating a group lasso penalty, approximated by a weighted ℓ_1 penalty for computational efficiency. We demonstrate the effectiveness of our method through an extensive simulation study and apply it to a music annotation data set (CAL500), obtaining a sparse and interpretable graphical model relating the continuous features of the audio signal to categorical variables such as genre, emotions, and usage associated with particular songs. While we focus on binary discrete variables, we also show that the proposed methodology can be easily extended to general discrete variables.
Graphical models have proven to be a useful tool for representing the conditional dependency structure of multivariate distributions. The undirected graphical model in particular, sometimes also referred to as the Markov network, has drawn a lot of attention over the past decade. In an undirected graphical model, nodes in the graph represent the variables, while an edge between a pair of variables indicates that they are dependent conditional on all other variables. The vast majority of the graphical models literature has focused on either the multivariate Gaussian model (Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Rothman et al., 2008; d’Aspremont et al., 2008; Rocha et al., 2008; Ravikumar et al., 2008; Lam and Fan, 2009; Peng et al., 2009; Yuan, 2010; Cai et al., 2011; Friedman et al., 2008), or the Ising model for binary and discrete data (Höfling and Tibshirani, 2009; Ravikumar et al., 2010; Guo et al., 2010). The properties of these models are by now well understood and studied in both the classical and the high-dimensional settings. Both these models deal only with variables of one kind – all continuous variables in the Gaussian model or all binary variables in the Ising model (extensions of the Ising model to general discrete data, while possible in principle, are rarely used in practice). In many applications, however, data sources are complex and varied, and frequently result in mixed types of data, with both continuous and discrete variables present in the same dataset. In this paper, we focus on graphical models for this type of mixed data (mixed graphical models).
The conditional Gaussian distribution was originally proposed by Lauritzen and Wermuth (1989) (see also Lauritzen, 1996) to model mixed data and has become the foundation of most developments on this topic. In the original paper, Lauritzen and Wermuth (1989) define a general form of the conditional Gaussian density and characterize the connection between the model parameters and the conditional associations among the variables. The model is fitted by maximum likelihood, but the number of parameters grows exponentially with the number of variables, which renders this approach unsuitable for high-dimensional problems arising in many modern applications. Much more recently, Lee and Hastie (2012) and Fellinghauer et al. (2011) have studied the mixed graphical model (simultaneously and independently of the present work), under settings that can be viewed as simplified special cases of our proposal. A more detailed discussion of these papers is postponed to Section 6.

In this paper, we propose a simplified version of the conditional Gaussian distribution which reduces the number of parameters significantly yet maintains flexibility. To fit the model in a high-dimensional setting, we impose a sparsity assumption on the underlying graph structure and develop a node-based regression approach with the group lasso penalty (Yuan and Lin, 2006), since edges in the mixed graphical model are associated with groups of parameters. The group lasso penalty in itself is not computationally efficient, and we develop a much faster weighted ℓ_1 approximation to the group penalty which is of independent interest. The simulation results show promising model selection performance in terms of estimating the true graph structure under high-dimensional settings.
To start with, we give a brief introduction to the conditional Gaussian distribution and its Markov properties following Lauritzen (1996).
Conditional Gaussian (CG) density: let z = (x, y) be a mixed random vector, where x is a q-dimensional discrete sub-vector and y is a p-dimensional continuous sub-vector. The conditional Gaussian density is defined as

f(x, y) = exp{ g(x) + h(x)'y − (1/2) y'K(x)y },   (1)

where g(x), h(x), and K(x) are the canonical parameters of the distribution. The following equations connect the canonical parameters in (1) to the moments p(x) = P(X = x), ξ(x) = E(y | x), and Σ(x) = Cov(y | x):

p(x) = (2π)^{p/2} det(K(x))^{−1/2} exp{ g(x) + (1/2) h(x)'K(x)^{−1}h(x) },
ξ(x) = K(x)^{−1}h(x),   Σ(x) = K(x)^{−1},

and the conditional distribution of y given x is N(K(x)^{−1}h(x), K(x)^{−1}).
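These moment relations can be sketched in code. The snippet below assumes the standard CG parameterization above, with canonical parameters h(x) and K(x) for a fixed configuration x of the discrete variables; the function name is illustrative, not from the paper.

```python
import numpy as np

def cg_conditional_moments(h, K):
    """Given canonical parameters h(x) and K(x) of the CG density
    f(x, y) ∝ exp{g(x) + h(x)'y − y'K(x)y/2}, return the conditional
    mean K(x)^{-1} h(x) and covariance K(x)^{-1} of y given x."""
    Sigma = np.linalg.inv(K)
    return Sigma @ h, Sigma

# Toy check with a diagonal precision matrix.
h = np.array([2.0, -1.0])
K = np.diag([2.0, 4.0])
mu, Sigma = cg_conditional_moments(h, K)
```

For a diagonal K the conditional covariance is just the elementwise reciprocal of the diagonal, which makes the mapping easy to verify by hand.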
The next theorem relates the graphical Markov property of the model to its canonical parameters and serves as the backbone of the subsequent analysis.
Represent the canonical parameters from (1) by the following expansions,

g(x) = Σ_{d⊆Δ} λ_d(x),   h(x) = Σ_{d⊆Δ} η_d(x),   K(x) = Σ_{d⊆Δ} Φ_d(x),   (2)

where Δ denotes the index set of the discrete variables and the functions indexed by d only depend on x through x_d = (x_j : j ∈ d). Then a CG distribution is Markovian with respect to a graph G if and only if the density has an expansion that satisfies

λ_d ≡ 0 unless d is complete in G,
η_d^k ≡ 0 unless d ∪ {k} is complete in G,
Φ_d^{kl} ≡ 0 unless d ∪ {k, l} is complete in G,

where η_d^k is the k-th element of η_d, Φ_d^{kl} is the (k, l)-th element of Φ_d, and a subgraph is called complete if it is fully connected.
The rest of the paper is organized as follows. Section 2 introduces the simplified mixed graphical model which has just enough parameters to cover all possible graph structures, proposes an efficient estimation algorithm for the model, and discusses theoretical guarantees for the proposed method under the high-dimensional setting. Section 3 uses several sets of simulation studies to evaluate the model selection performance and compare some alternative choices of regularization. In Section 4, the proposed model is applied to a music annotation data set CAL500 with binary labels and continuous audio features. In Section 5, we describe the generalization of the model from binary to discrete variables. Finally, we conclude with discussion in Section 6.
Our main contribution is a simplified but flexible conditional Gaussian model for mixed data. Model fitting is based on maximizing the conditional log-likelihood of each variable given the rest for computational tractability. This leads to penalized regression problems with an overlapping group structure of the parameters, the natural solution to which is to fit separate regressions with an overlapping group lasso penalty. This is computationally quite expensive, so we approximate the overlapping group lasso penalty by an appropriately rescaled weighted ℓ_1 penalty.
Recall that we partition the random vector z = (x, y) into the binary part x = (x_1, …, x_q), x_j ∈ {0, 1}, and the continuous part y = (y_1, …, y_p). We propose to consider the conditional Gaussian distribution with the density function

f(x, y; Θ) = (1/C(Θ)) exp{ Σ_j γ_j x_j + Σ_{j<j'} γ_{jj'} x_j x_{j'} + Σ_k (η_k + Σ_j η_{jk} x_j) y_k − (1/2) Σ_{k,l} (φ_{kl} + Σ_j φ_{jkl} x_j) y_k y_l },   (3)

where the diagonal elements φ_{jkk} are all 0 and C(Θ) is the normalizing constant.

Note that the density is explicitly defined via the expanded terms in (2), but the canonical parameters can be obtained immediately by summing up the corresponding terms. This model simplifies the full conditional Gaussian distribution (1) in two ways: first, it omits all interaction terms between the binary variables of order higher than two, and second, it models the conditional precision matrix and the canonical mean vector of the Gaussian variables as linear functions of the binary variables instead of allowing their dependence on higher-order interactions of the binary variables. These simplifications reduce the total number of parameters from order 2^q p^2 in the full model to order q^2 + q p^2. On the other hand, this model is the simplest CG density among those allowing for varying conditional covariance that can represent all possible graph structures, since it includes interactions between all the continuous and discrete variables and thus allows for a fully connected graph, an empty graph, and everything in between. The fact that it allows both the conditional mean and the conditional covariance of y given x to depend on x adds flexibility.
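The parameter savings can be made concrete with a back-of-the-envelope count. This is one plausible accounting under the stated simplifications (pairwise binary terms, a canonical mean and precision linear in the binary variables, and a zeroed x-dependent precision diagonal); the paper's exact bookkeeping may differ.

```python
def full_cg_params(p, q):
    # Unrestricted CG: for each of the 2^q discrete cells, a scalar g(x),
    # a p-vector h(x), and a symmetric p x p precision matrix K(x).
    return 2 ** q * (1 + p + p * (p + 1) // 2)

def simplified_params(p, q):
    # Simplified model: binary main effects and pairwise interactions,
    # a canonical mean linear in x, and a precision matrix linear in x
    # with the x-dependent diagonal terms set to zero.
    binary = q + q * (q - 1) // 2
    mean = p * (q + 1)
    precision = p * (p + 1) // 2 + q * p * (p - 1) // 2
    return binary + mean + precision
```

Even for modest sizes (say p = 5 continuous and q = 10 binary variables), the full model requires tens of thousands of parameters while the simplified model needs a few hundred.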
Given sample data, directly maximizing the log-likelihood is impractical due to the normalizing constant C(Θ). The conditional likelihood of one variable given the rest, however, has a much simpler form and is easy to maximize. Hence, we focus on the conditional log-likelihood of each variable and fit separate regressions to estimate the parameters, much in the spirit of the neighborhood selection approach proposed by Meinshausen and Bühlmann (2006) for the Gaussian graphical model and by Ravikumar et al. (2010) for the Ising model. To specify the conditional distributions, let x_{−j} denote x with its j-th coordinate removed, and y_{−k} denote y with its k-th coordinate removed. Then the conditional distribution of x_j given (x_{−j}, y) is described by

log [ P(x_j = 1 | x_{−j}, y) / P(x_j = 0 | x_{−j}, y) ] = γ_j + Σ_{j'≠j} γ_{jj'} x_{j'} + Σ_k η_{jk} y_k − (1/2) Σ_{k≠l} φ_{jkl} y_k y_l.   (4)

Since the conditional log-odds in (4) is linear in the parameters, maximizing this conditional log-likelihood can be done via fitting a logistic regression with x_{−j}, y, and the pairwise products y_k y_l as predictors and x_j as response. For the continuous variables, the conditional distribution of y_k given (x, y_{−k}) is Gaussian with variance 1/(φ_{kk} + Σ_j φ_{jkk} x_j); since the φ_{jkk} are all 0 in (3), the conditional variance of y_k does not depend on x. Rewriting the conditional mean gives

y_k = η̃_k + Σ_j η̃_{jk} x_j − Σ_{l≠k} ( φ̃_{kl} + Σ_j φ̃_{jkl} x_j ) y_l + ε_k,   ε_k ∼ N(0, 1/φ_{kk}),   (5)

where the redefined parameters marked with a tilde are the original ones divided by the same constant φ_{kk} within each regression. Again, the conditional mean of y_k is linear in the parameters, which can be estimated via ordinary linear regression with x, y_{−k}, and the products x_j y_l as predictors and y_k as response. Based on Theorem 1, the following equivalences hold:

x_j ⊥ x_{j'} | rest ⟺ γ_{jj'} = 0;
x_j ⊥ y_k | rest ⟺ η_{jk} = 0 and φ_{jkl} = 0 for all l;
y_k ⊥ y_l | rest ⟺ φ_{kl} = 0 and φ_{jkl} = 0 for all j.   (6)
This means that each edge involving a continuous variable depends on a vector of parameters: the edge between x_j and y_k on (η_{jk}, φ_{jk1}, …, φ_{jkp}), and the edge between y_k and y_l on (φ_{kl}, φ_{1kl}, …, φ_{qkl}). To encourage sparsity of the edge set under high-dimensional settings, a natural choice here would be the group penalty proposed by Yuan and Lin (2006) for the group lasso. The groups are pre-determined by the parameter vectors corresponding to each edge. Denoting the loss function (negative conditional log-likelihood) for the logistic regression of x_j by ℓ_j and that for the linear regression of y_k by ℓ̃_k, we estimate the parameters by optimizing the following criteria separately, for j = 1, …, q and k = 1, …, p:

min_θ ℓ_j(θ) + λ Σ_g ‖θ_g‖_2,   (7)
min_θ̃ ℓ̃_k(θ̃) + λ Σ_g ‖θ̃_g‖_2,   (8)

where λ is the tuning parameter and the sums run over the groups g of parameters appearing in the corresponding regression. We use the same tuning parameter for all regressions to simplify tuning, but they can also be tuned separately if the computational cost is not prohibitive. Another reason to use a single tuning parameter is to simplify the treatment of overlapping groups of parameters from different regressions (see more on this below). Note that in the linear regression, the parameters in (5) denoted with a tilde are proportional to the original parameters. The original parameters can be recovered by multiplying the estimates by φ_{kk} = 1/Var(ε_k), which can be estimated from the mean squared error of the linear regression.
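The rescaling step at the end can be sketched as follows. This is a hedged illustration, not the paper's code: the tilde parameters from the node-wise linear regression are the originals divided by the conditional precision of the response, so dividing by the estimated residual variance recovers them.

```python
import numpy as np

def recover_params(beta_tilde, residuals):
    """The node-wise linear regression estimates the "tilde" parameters,
    i.e. the originals divided by the conditional precision of the
    response. Estimate sigma^2 = 1/precision by the mean squared
    residual of the fit, then rescale the estimates by 1/sigma^2."""
    sigma2_hat = np.mean(np.asarray(residuals) ** 2)
    return np.asarray(beta_tilde) / sigma2_hat
```

For example, residuals with mean square 4 imply an estimated conditional precision of 1/4, so every tilde estimate is scaled down by 4 to recover the original parameters.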
Although the optimization problems (7) and (8) each have the form of regular group lasso regressions, they cannot be jointly solved by existing group lasso algorithms, because the groups of parameters involved in each regression overlap. Specifically, in the logistic regression for x_j, the parameter φ_{jkl} belongs to both the group for the edge (x_j, y_k) and the group for the edge (x_j, y_l), and thus determines both edges; each group shares one parameter with each of the other groups. Similarly, in the linear regression for y_k, the parameter φ̃_{jkl} belongs to the groups for both the edge (x_j, y_k) and the edge (y_k, y_l), and affects both. This overlapping pattern creates additional difficulties in using the group penalty to perform edge selection. The overlapping group lasso problem has received limited attention from a computational point of view. Yuan et al. (2011) recently proposed an algorithm for solving the overlapping group lasso problem, but it is computationally intensive for high-dimensional data. Instead of optimizing the overlapping group penalty directly, we look for a surrogate penalty with similar properties that is easier to optimize and thus more suitable for a high-dimensional setting.

The weighted ℓ_1 penalty provides an upper bound for the group penalty, since for any vector v, ‖v‖_2 ≤ ‖v‖_1. For the logistic regression (7), for example, we have

Σ_k ‖(η_{jk}, φ_{jk1}, …, φ_{jkp})‖_2 ≤ Σ_k |η_{jk}| + 2 Σ_{k<l} |φ_{jkl}|.

The surrogate on the right penalizes the overlapping parameters twice as much as the other parameters, which makes intuitive sense: incorrectly estimating an overlapping parameter as non-zero adds two wrong edges, while incorrectly estimating a parameter unique to one group as non-zero adds only one wrong edge.
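The bound and the double-weighting rule can be checked numerically. The snippet below is a minimal sketch: `overlap_weights` is a hypothetical helper that derives the ℓ_1 weight of each coordinate from the number of groups containing it.

```python
import numpy as np

def overlap_weights(groups, dim):
    """Count, for each coordinate, how many groups contain it; using these
    counts as l1 weights penalizes a parameter shared by two overlapping
    groups twice as much as a parameter unique to one group."""
    w = np.zeros(dim)
    for g in groups:
        for j in g:
            w[j] += 1.0
    return w

# ||v||_2 <= ||v||_1 for any v, so a weighted l1 sum upper-bounds the
# sum of overlapping group l2 norms.
v = np.array([3.0, -4.0, 1.0])
l2, l1 = np.linalg.norm(v, 2), np.linalg.norm(v, 1)
```

With groups (θ_1, θ_2) and (θ_2, θ_3), the shared coordinate θ_2 gets weight 2 and the unique coordinates get weight 1, matching the surrogate above.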
To further illustrate why this upper bound provides a good approximation to the feasible region, consider the following toy example. Let the parameter vector be θ = (θ_1, θ_2, θ_3), with groups g_1 = (θ_1, θ_2) and g_2 = (θ_2, θ_3). The optimization problems for the overlapping group lasso penalty and its surrogate boil down to optimizing the same loss function on different feasible regions (for the same tuning parameter). Figure 1 compares the two feasible regions, C_1 = {θ : ‖(θ_1, θ_2)‖_2 + ‖(θ_2, θ_3)‖_2 ≤ 1} and C_2 = {θ : |θ_1| + 2|θ_2| + |θ_3| ≤ 1}. Since both the logistic loss and the least squares loss are smooth convex functions, the solutions will lie at singular points of the feasible regions C_1 and C_2. Note that C_2 is not only a subset of C_1 but contains all four singular points of C_1, namely (±1, 0, 0) and (0, 0, ±1), which are also singular points of C_2. Thus in this example, the optimum will be chosen from exactly the same set of singular points regardless of the penalty used. For higher-dimensional θ, it still holds that C_2 is a subset of C_1, but it may not contain all of the singular points of C_1 (this depends on the group structure). While that means we may not have exactly the same optimal points, the approximation can still be good enough to identify the same non-zero groups, and this is what we found in practice.
With the group penalty replaced by the weighted ℓ_1 surrogate, we solve the following regression problems separately, as an approximation to the original problems (7) and (8), to obtain the parameter estimates.

Logistic regression with weighted ℓ_1 penalty: for j = 1, …, q,

min_θ ℓ_j(θ) + λ ( Σ_{j'≠j} |γ_{jj'}| + Σ_k |η_{jk}| + 2 Σ_{k<l} |φ_{jkl}| ).   (9)

Linear regression with weighted ℓ_1 penalty: for k = 1, …, p,

min_θ̃ ℓ̃_k(θ̃) + λ ( Σ_j |η̃_{jk}| + Σ_{l≠k} |φ̃_{kl}| + 2 Σ_j Σ_{l≠k} |φ̃_{jkl}| ).   (10)
Since we are estimating the parameters in separate regressions, all parameters determining edges are estimated at least twice. This problem is common to all neighborhood selection approaches based on separate regressions, and is usually solved by taking either the largest or the smallest (in absolute value) of the estimates. For the mixed graphical model, we found that taking the maximum of absolute values results in somewhat better model selection, and that is what we use throughout the paper as the final estimate. To fit both types of regressions with a weighted ℓ_1 penalty, we use the Matlab package glmnet of Friedman et al. (2010).
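The combination rule can be sketched in a few lines. This is an illustrative helper (the name is ours, not the paper's) for merging two estimates of the same edge parameter by keeping the signed value with the larger magnitude.

```python
import numpy as np

def combine_estimates(a, b):
    """Each edge parameter is estimated in two separate regressions;
    keep, elementwise, the (signed) estimate with the larger absolute
    value, as described in the text."""
    a, b = np.asarray(a), np.asarray(b)
    return np.where(np.abs(a) >= np.abs(b), a, b)
```

For example, combining estimates (0.5, −2.0, 0.0) and (−1.0, 1.0, 0.0) yields (−1.0, −2.0, 0.0).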
Model selection consistency in high-dimensional regression problems is equivalent to estimating the non-zero parameters as non-zero. Correctly identifying the sign of non-zero parameters (in addition to correctly identifying them as non-zero) is referred to as sign consistency, while the usual convergence of all the estimated parameters is referred to as norm consistency. Regression with the ℓ_1 penalty has been shown to possess both sign consistency and norm consistency under certain regularity conditions, of which the most important is the irrepresentable condition (see Meinshausen and Bühlmann (2006); Zhao and Yu (2006); Van de Geer and Bühlmann (2009) for linear regression and Ravikumar et al. (2010); Van de Geer (2008) for logistic regression). To show consistency of our method, we only need to require the main assumptions of the existing results to hold on a rescaled version of the original design matrix for each regression. Since we fit weighted ℓ_1-penalized regressions with fixed pre-determined weights, we can treat the parameters multiplied by the weights as new parameters, and rescale the design matrix for each regression by dividing each column by the corresponding weight. This converts the problem to a standard ℓ_1-penalized regression, and consistency is thus guaranteed by the existing results cited above.
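The column-rescaling trick described above can be sketched directly; the function name is illustrative. Dividing column t of the design matrix by w_t makes a standard ℓ_1 fit on the rescaled matrix equivalent to a weighted ℓ_1 penalty Σ_t w_t |β_t| on the original problem, with the original coefficients recovered as β_t = β̃_t / w_t.

```python
import numpy as np

def rescale_for_weighted_l1(X, weights):
    """Divide each column of the design matrix by its l1 weight, so that
    the rescaled coefficients beta_tilde = weights * beta carry a plain
    l1 penalty; predictions are unchanged because
    X @ beta == X_tilde @ (weights * beta)."""
    weights = np.asarray(weights, dtype=float)
    return np.asarray(X, dtype=float) / weights[np.newaxis, :]
```

The invariance of the fitted values under the reparameterization is what makes the two problems equivalent.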
This section includes simulation studies evaluating model selection performance under different settings and comparing alternative choices of penalties. The results are summarized in ROC curves, where we plot the true positive count (TP) against the false positive count (FP), and the true positive rate (TPR) against the false positive rate (FPR), for both parameters and edges over a fine grid of tuning parameters. Let θ and θ̂ denote the true parameter vector and the fitted parameter vector respectively (without the intercept terms in the regressions); the parameter-based quantities are defined as follows:

TP = #{t : θ_t ≠ 0, θ̂_t ≠ 0},   FP = #{t : θ_t = 0, θ̂_t ≠ 0},
TPR = TP / #{t : θ_t ≠ 0},   FPR = FP / #{t : θ_t = 0}.
The quantities based on the true edge set and the estimated edge set are defined in a similar fashion, with parameter sets replaced by edge sets.
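The parameter-based quantities above translate directly into code; this is a small sketch with an illustrative function name.

```python
import numpy as np

def roc_counts(theta_true, theta_hat):
    """TP/FP counts and TPR/FPR rates based on the support of the true
    and fitted parameter vectors (intercepts excluded upstream)."""
    true_nz = np.asarray(theta_true) != 0
    est_nz = np.asarray(theta_hat) != 0
    tp = int(np.sum(true_nz & est_nz))
    fp = int(np.sum(~true_nz & est_nz))
    tpr = tp / max(int(np.sum(true_nz)), 1)
    fpr = fp / max(int(np.sum(~true_nz)), 1)
    return tp, fp, tpr, fpr
```

Sweeping the tuning parameter and recording these four numbers at each value traces out the ROC curves reported in the figures.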
In the first simulation, we fix the numbers of continuous and discrete variables and the sample size, and vary the maximum node degree (i.e., the largest number of edges incident to any node) in the graph while maintaining the total number of edges fixed at 80. This results in graph structures with varying “uniformity”: given the total number of edges in a graph, the smaller the maximum node degree, the more uniform the graph. A chain graph, for example, is the most extreme case of a uniform graph.
The data are generated as follows. First, we generate the underlying graph structure given a maximum node degree: for the case where the maximum node degree is 2, we use a chain graph on the first 81 nodes; when the maximum node degree is 6, we generate a random graph from the Erdos-Renyi model. To enforce the maximum degree constraint, we simply regenerate the adjacency matrix if any degree exceeds 6, which does not require many attempts since the graphs are very sparse under this setting. For the maximum node degree of 10, we assign 10 edges to the first node and randomly generate the other 70 edges from the Erdos-Renyi model, thus creating a sparse graph with a hub. The first q nodes in the adjacency matrix are taken to correspond to the binary variables. Given the true graph, we set all parameters corresponding to absent edges to 0. The other (non-zero) discrete parameters are set to 1 or −1 with equal probability, and the non-zero off-diagonal elements of the conditional precision matrix K(x) to 2 or −2 with equal probability. The diagonal elements of K(x) are chosen so that K(x) is positive definite for all possible x's. We then generate the discrete variables x based on the probabilities p(x) in (1). Since we use the exact discrete probabilities rather than MCMC methods to generate the binary data, the storage requirements prohibit taking a very large number of binary variables in simulations; this does not affect real data, however, and the method works well with many binary variables in Section 4. Finally, for each x we generate the continuous variables from a multivariate Gaussian distribution with mean ξ(x) and covariance Σ(x) as defined by (1).

The results in Figure 2 contain four ROC curves for both parameters and edges across a fine grid of tuning parameters, recorded both in percentage terms (FPR and TPR) and in terms of counts (FP and TP), to give a better sense of the results on the real scale of the problem. The cut-off point for the rate-based FPR is chosen at the point after which the curve changes little, and the range of the count-based FP is chosen to approximately match the range of FPR. The results show that as the maximum node degree increases, the model selection performance deteriorates, even though the total number of edges remains fixed.
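The rejection step used to enforce the degree cap can be sketched as follows; the function name and the specific sizes in the example are illustrative, not the paper's exact configuration.

```python
import numpy as np

def er_graph_with_degree_cap(n_nodes, n_edges, max_degree, rng):
    """Draw a graph with exactly n_edges edges uniformly at random
    (Erdos-Renyi style) and regenerate until every node degree is at
    most max_degree, mirroring the rejection step described above."""
    pairs = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)]
    while True:
        A = np.zeros((n_nodes, n_nodes), dtype=int)
        for k in rng.choice(len(pairs), size=n_edges, replace=False):
            i, j = pairs[k]
            A[i, j] = A[j, i] = 1
        if A.sum(axis=0).max() <= max_degree:
            return A

# 80 edges over 200 nodes is sparse, so the cap is rarely violated
# and the rejection loop terminates quickly.
A = er_graph_with_degree_cap(200, 80, 6, np.random.default_rng(1))
```

Because the expected degree is well below the cap, few regenerations are needed, matching the observation in the text.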
Next, we vary the degree of sparsity of the true graphs. The variable dimensions and the sample size are fixed as before. The number of edges is set to one of three increasing values, while the degree of every node is at most 3. The true graph is again generated from the Erdos-Renyi model with the specified number of edges. The results are shown in Figure 3. Even though the underlying graphs become more dense as the number of edges increases, as long as the maximum degree is controlled, the percentage-based ROC curves for both edges and parameters stay nearly the same. The count-based ROC curves are not directly comparable in this case because the numbers of parameters and edges in the true model vary.
In this simulation, we compare our proposed method (denoted by weighted ℓ_1) with two other penalized regression approaches using the edge ROC curves. The first alternative (denoted by simple ℓ_1), discussed by Fellinghauer et al. (2011), is to fit separate ℓ_1-regularized regressions of each variable on the others, without including any interaction terms. The second alternative (denoted by regular ℓ_1) replaces the weighted ℓ_1 penalty in our method with the regular ℓ_1 penalty, ignoring the grouping patterns and the overlaps between groups. We consider two settings relevant for this comparison: true graphs with and without complete subgraphs.

When the underlying graph does not contain any complete subgraphs, which can easily happen if it is very sparse, all the overlapping parameters are zero. From (6), we can see that each edge is then represented by a single unique parameter: a binary–binary edge by its pairwise interaction coefficient, a binary–continuous edge by its linear interaction coefficient, and a continuous–continuous edge by the corresponding entry of the conditional precision matrix. All the interaction terms in regressions (4) and (5) then vanish. We follow the set-up of the first simulation, where the maximum node degree takes the values 2, 6, and 10, and the total number of edges is fixed at 80. If a generated true graph contains a complete subgraph, we discard it and generate a new one. Figure 4 shows that in each subplot, simple ℓ_1 and weighted ℓ_1 perform similarly, and both outperform the regular ℓ_1. The results are to be expected: for a true model with no interaction terms, the simple ℓ_1, which excludes interaction terms automatically, should perform best. Our weighted ℓ_1 approach penalizes the interaction terms twice as much as the other parameters, which allows it to achieve similarly good performance. The generic regular ℓ_1 penalty fails to capture the group structure among the parameters and performs noticeably worse than the other two methods. We conclude that if the underlying model is believed to be very sparse, simple ℓ_1 does well by not including the interaction terms; our weighted ℓ_1 method does equally well even with the ineffective interactions included; and the generic regular ℓ_1, which ignores the group structure, is the least accurate of the three.
In this part, we study the case where the true graph contains fully connected subgraphs. Specifically, we make the first 20 nodes completely connected and leave the remaining nodes unconnected. This gives approximately 190 edges and 650 non-zero parameters. To obtain comparable signal-to-noise ratios and comparable ROC curves in this setting, we adjust the sample size and decrease the non-zero parameter values by a factor of 10. For graphs with complete subgraphs, there are non-zero parameters in overlapping groups, which implies that the true model includes some interaction terms in (4) and (5).

Further, we consider two types of true models compatible with the graph structure above: in model I, both main and interaction effects in (4) and (5) are non-zero, while model II has main effects only, with all interaction effects zero. Figure 5 shows the results. When interaction terms are present in the true model, the regular ℓ_1 penalty and our weighted ℓ_1 penalty perform similarly, and better than the simple ℓ_1 penalty, as one would expect. When only main effects are present, all three methods show similar performance, with the weighted ℓ_1 and simple ℓ_1 approaches having a slight advantage for reasons similar to those in Section 3.2.1.

Taken together, the simulation results indicate that the weighted ℓ_1 penalty approach does a good job of balancing the need to penalize interactions more than main effects (because of double-counting) against the need to keep them in the model. Without prior information on whether to expect complete subgraphs in the network, the weighted ℓ_1 penalty appears to be the safest choice.
Music annotation has been studied by researchers in many areas, including audio signal processing, information retrieval, multi-label classification, and others. Music annotation data sets usually consist of two parts: “labels”, typically assigned by human experts, contain the categorical semantic descriptions of the piece of music (emotions, genre, vocal type, etc.); and “features”, continuous variables extracted from the time series of the audio signal itself using well developed signal processing methods. Representing these mixed variables by a Markov graph would allow us to understand how these different types of variables are associated with each other. For example, one can ask which rhythm and timbre features are associated with certain genres. We apply our method to the publicly available music data set CAL500 (Turnbull et al., 2008), from the Mulan database (Tsoumakas et al., 2011), to find the conditional dependence patterns among the mixed variables.
The CAL500 dataset consists of 502 popular western music tracks (including both English-language songs and instrumental music) composed within the last 55 years by 502 different artists. The collection covers a large range of acoustic variations and music genres, and the labeling of each song is supervised by at least three individuals. For each song, the label part is a semantic vocabulary of 149 tags represented by a 149-dimensional binary vector indicating the presence of each annotation. Specifically, the labels are partitioned into six categories: emotions (36 total), genres (31), instruments (24), song characteristics (27), usages (15), and vocal types (16). The continuous features of the music are based on the short-time Fourier transform (STFT) and are calculated by sliding a half-overlapping, 23ms time window over the song's digital audio file. A detailed description of the feature extraction procedure can be found in Tzanetakis and Cook (2002). For each 23ms analysis window, the following continuous features are extracted to represent the audio file: the spectral centroid, a measure of the 'brightness' of the music texture, with higher values indicating brighter music with more high frequencies; spectral flux, a measure of the amount of local spectral change; zero crossings, a measure of the noisiness of the signal; and the first MFCC coefficient (Logan, 2000), representing the amplitude of the music, which comes from a two-step transformation designed to capture the spectral structure. Every 512 consecutive 23ms short frames are then grouped into 1s-long texture windows, from which the following summary statistics of the four features defined above are calculated and used as the final continuous part of the data: overall mean, mean of the standard deviations of each texture window, standard deviation of the means of each texture window, and standard deviation of the standard deviations of each texture window.
In our analysis, we omitted labels assigned to fewer than 3% of the songs and kept only the first MFCC coefficient, since it can be interpreted as the overall amplitude of the audio signal, while the other coefficients are not readily interpretable. We also standardized the continuous variables. This resulted in a dataset with 502 observations, 118 discrete variables, and 16 continuous variables.
We applied our method coupled with stability selection (Meinshausen and Bühlmann, 2010) to identify the underlying Markov graph for the purposes of exploratory data analysis, which is the primary use of graphical models. To perform stability selection, we ran our algorithm 100 times on randomly drawn sub-samples of the data and kept only the edges that were selected at least 90 times. The results are shown in Figure 6. The continuous timbre features are represented by squares labeled 1-16 and the binary variables are represented by circles labeled 1-118. Each color represents a category of variables as shown in the legend. The graph contains a number of interesting and intuitive connections. The continuous variables that represent the audio signal features are quite densely connected among themselves, which is to be expected. More interesting edges can be found between continuous features and expert labels. The average amplitude of the music (square 4) is connected with the genre “alternative rock” (circle 37) and the instrument “Synthesizer” (circle 72). The noisiness of the music (square 13) is associated with “negative feelings” (circle 87), which seems reasonable and has also been reported by Blumstein et al. (2012). We also find connections between short-period amplitude variation (square 8) and popular likeable songs (circle 84) as well as not tender or soft emotions (circle 34). Moreover, there are some interesting patterns within the group of binary labels, which allow us to infer connections between different emotions, genres, instruments, usages and so on. For example, songs with positive feelings (circle 86) are connected to piano (circle 67), happy and not happy emotions (circles 17 and 18, highly negatively correlated), songs with high energy (circle 82), and optimistic emotions (circle 27).
We also find edges connecting fast and not fast tempo music (circles 78 and 79) to classic rock (circle 38), songs with high energy (circle 82) and very danceable and not danceable songs (circles 97 and 98). Likeable or popular songs (circle 84) are associated with usages such as driving (circle 101) and reading (circle 107), which makes intuitive sense.
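The stability selection procedure used here can be sketched generically. This is a hedged illustration: `fit_edges` stands in for any routine (such as our penalized regressions at a fixed tuning parameter) that returns an estimated edge set, and the names are ours.

```python
import numpy as np

def stability_select(data, fit_edges, n_runs, subsample_frac, threshold, rng):
    """Refit on random subsamples and keep the edges selected in at
    least `threshold` of `n_runs` runs (the analysis above uses 100
    runs, keeping edges selected at least 90 times)."""
    counts = {}
    n = len(data)
    m = int(subsample_frac * n)
    for _ in range(n_runs):
        sub = data[rng.choice(n, size=m, replace=False)]
        for e in fit_edges(sub):
            counts[e] = counts.get(e, 0) + 1
    return {e for e, c in counts.items() if c >= threshold}
```

An edge that survives nearly all subsample refits is far less likely to be an artifact of a single tuning-parameter choice, which is why this wrapper is well suited to exploratory graph estimation.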
To extend our model to the general case where the discrete variables can take more than two values, we modify model (3) as follows:

f(x, y; Θ) = (1/C(Θ)) exp{ Σ_j γ_j(x_j) + Σ_{j<j'} γ_{jj'}(x_j, x_{j'}) + Σ_k (η_k + Σ_j η_{jk}(x_j)) y_k − (1/2) Σ_{k,l} (φ_{kl} + Σ_j φ_{jkl}(x_j)) y_k y_l },   (11)

where each x_j takes integer values from 1 to L_j; γ_{jj'}(·,·) is a discrete function taking L_j × L_{j'} possible values, η_{jk}(·) and φ_{jkl}(·) are discrete functions taking L_j possible values, and γ_j(·) is a discrete function with L_j values. For identifiability, we set the value of each function at the reference level to 0, e.g., γ_j(1) = 0 and γ_{jj'}(1, ·) = γ_{jj'}(·, 1) = 0. The correspondence between the parameters and the edges is then given by

x_j ⊥ x_{j'} | rest ⟺ γ_{jj'} ≡ 0;
x_j ⊥ y_k | rest ⟺ η_{jk} ≡ 0 and φ_{jkl} ≡ 0 for all l;
y_k ⊥ y_l | rest ⟺ φ_{kl} = 0 and φ_{jkl} ≡ 0 for all j.   (12)
The generalized model can be fitted with separate regressions based on the conditional likelihood of each variable. The parameters in (12) still have group structure, which calls for using the group lasso penalty as in (7) and (8). The overlapping structure is more complex in this case, and we use the upper bound approximation as in (9) and (10) to obtain the final estimates. Specifically, we minimize the following criteria separately:
Logistic regression with weighted ℓ_1 penalty: for j = 1, …, q;

Linear regression with weighted ℓ_1 penalty: for k = 1, …, p.
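In practice, fitting these regressions requires expanding each general discrete variable into indicator columns; the sketch below assumes a reference-coding convention (level 1 as reference, matching the identifiability constraint of setting reference-level terms to 0), and the helper name is ours.

```python
import numpy as np

def dummy_encode(x_col, n_levels):
    """Encode a discrete variable with levels 1..n_levels as
    n_levels - 1 indicator columns, with level 1 as the reference;
    the columns for one variable form a natural parameter group."""
    X = np.zeros((len(x_col), n_levels - 1))
    for i, v in enumerate(x_col):
        if v > 1:
            X[i, v - 2] = 1.0
    return X
```

The indicator columns belonging to one discrete variable (and its interactions) then play the role that single parameters played in the binary case, enlarging the groups accordingly.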
We have proposed a new graphical model for mixed (continuous and discrete) data, which is particularly suitable for high-dimensional data. While the general conditional Gaussian model goes back to Lauritzen and Wermuth (1989), it is not appropriate for high-dimensional data, and there is little previous work on mixed graphical models that can scale to modern applications. While our model is substantially simpler than the original general model, it scales much better. Given that graphical models are primarily a tool for exploratory data analysis, this is a reasonable trade-off, and the ability to explore conditional dependence relationships between large numbers of discrete and continuous variables will hopefully be of use to practitioners in a range of application domains.
We have recently become aware of two new developments on this topic, both derived in parallel with and independently of this manuscript. Lee and Hastie (2012) assume a more restricted version of the conditional Gaussian density with constant conditional covariance for all the continuous variables. Their model can be viewed as a special case of ours in (3), in which all the interaction terms in the conditional precision matrix are 0, which can be too restrictive for some applications. They considered both a maximum likelihood and a maximum pseudo-likelihood approach with a group lasso penalty for general discrete variables; when the discrete variables are all binary, this is equivalent to using the regular ℓ_1 penalty, one of the alternatives we compared in the simulations of Section 3.2. Lee and Hastie (2012) did not focus on high-dimensional applications. Another recent paper, Fellinghauer et al. (2011), applied random forests and stability selection to fit regressions of each variable on the rest, without specifying a generative model for the mixed data. In our framework, this would amount to taking the separate regression approach to a simplified conditional Gaussian model as in Lee and Hastie (2012), except with the regressions fitted using random forests coupled with stability selection. As our simulations show, such an approach performs well when the true graph is very sparse and has no complete subgraphs; when the true graph has complete subgraphs, our method is expected to outperform a separate regression approach.