A parallel algorithm for penalized learning of the multivariate exponential family from data of mixed types

12/06/2018 ∙ by Diederik S. Laman Trip, et al. ∙ VUmc 0

Computational efficient evaluation of penalized estimators of multivariate exponential family distributions is sought. These distributions encompass amongst others Markov random fields with variates of mixed type (e.g. binary and continuous) as special case of interest. The model parameter is estimated by maximization of the pseudo-likelihood augmented with a convex penalty. The estimator is shown to be consistent. With a world of multi-core computers in mind, a computationally efficient parallel Newton-Raphson algorithm is presented for numerical evaluation of the estimator alongside conditions for its convergence. Parallelization comprises the division of the parameter vector into subvectors that are estimated simultaneously and subsequently aggregated to form an estimate of the original parameter. This approach may also enable efficient numerical evaluation of other high-dimensional estimators. The performance of the proposed estimator and algorithm are evaluated in a simulation study, and the paper concludes with an illustration of the presented methodology in the reconstruction of the conditional independence network from data of an integrative omics study.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

With the increasing capacity for simultaneous measurement of an individual’s many traits, networks have become an omnipresent visualization tool to display the cohesion among these traits. For instance, the cellular regulatory network portraits the interactions among molecules like mRNAs and/or proteins. Statistically, a network captures the relationships among variates implied by a joint probability distribution describing the simultaneous random behavior of the variates. These variates may be of different type, representing – for example – traits with continuous, count, or binary state spaces. Generally, the relationship network is unknown and is to be reconstructed from data. To this end we present methodology that learns the network from data with variates of mixed types.

In literature a collection of variates of mixed type is mostly modelled by a pairwise Markov random field (MRF) distribution (which is a special case of multivariate exponential family distributions defined in Section 2

). A Markov random field is a set of random variables

that satisfy certain conditional independence properties which are specified by an undirected graph. This is made more precise by introduction of the relevant notions. A graph is a pair with a finite set of vertices or nodes and a collection of edges that join node pairs. In an undirected graph any edge is undirected, i.e. is an unordered pair implying that . A subgraph with and is a clique if is complete, i.e. all nodes are directly connected to all other nodes. The neighborhood of a node , denoted , is the collection of nodes in that are adjacent to : . The closed neighborhood is simply and denoted by . Now let be a -dimensional random vector. Represent each variate of with a node in a graph with . Node names thus index the elements of . Let , and be exhaustive and mutually exclusive subsets of . Define the random vectors , and by restricting the -dimensional random vector to the elements of , and , respectively. Then and are conditionally independent given random vector , written as , if and only if their joint probability distribution factorizes as . The random vector satisfies the local Markov property with respect to a graph if for all . Hence, knowledge of the variates associated with the neighborhood of node blocks the flow of information between the random variables and . Graphically, conditioning on the neighbors of detaches from . A Markov Random Field (or undirected graphical model) is a pair consisting of an undirected graph with associated random variables that satisfy the local Markov property with respect to (cf. Lauritzen, 1996). For strictly positive probability distributions of and by virtue of the Hammersley-Clifford theorem (Hammersley and Clifford, 1971) the local Markov property may be assessed through the factorization of the distribution in terms of clique functions, i.e. functions of variates that correspond to a clique’s nodes of the associated graph .

In this work we restrict ourselves to cliques of size at most two. Thus, only pairwise interactions between the variates of are considered. In the Ising model, a MRF for a collection of binary variates this amounts to – due to its lattice layout – the study of only nearest-neighbor interactions (Ravikumar et al., 2010). Although restrictive, many higher-order interactions can be approximated by pairwise interactions (Gallagher et al., 2011). Under the restriction to pairwise interactions and the assumption of a strictly positive distribution, the probability distribution can be written as:


with log-normalizing constant or log-partition function and pairwise log-clique functions . The pairwise MRF distribution , and therefore the graphical structure, is fully known once the log-clique functions are specified. In particular, nodes are connected by an edge whenever as the probability distribution of would then not factorize in terms of the variates, and , constituting this clique.

The strictly positive MRF distribution (1

) with pairwise interactions will be studied here as special case of interest, while our results are derived for the more general exponential family. This is hampered by the complexity of the log-partition function. Although analytically known for, e.g. the multivariate normal distribution, it is – in general – computationally not feasible to evaluate. For example, the partition function of the Ising model with

variates sums over all spin configurations. Indeed, the partition function is computationally intractable for MRFs that have variables with a finite state space (Welsh, 1993, Höfling and Tibshirani, 2009), or more generally for MRFs with variables of mixed type (Lee and Hastie, 2013).

The contribution of this work to existing literature is two-fold. First, we present machinery for estimation of the mixed variate graphical model with a quadratic (ridge or ) penalty. Therefore this study extends the existing framework of sparse network estimation by allowing for estimation of dense networks. This is motivated by the fact that the dominant paradigm of sparsity is not necessarily valid in all fields of application. In particular, in molecular biology this paradigm has recently come under fire (Boyle et al., 2017), and more dense (graphical) structures are advocated. Moreover, the severe degree of sparsity required by the theoretical resulting underpinning estimation procedures may lead to too stringent inference on the underlying graph, potentially oversimplifying. Therefore, an estimation procedure might be more appropriate, especially in molecular biology applications, as it makes no sparsity assumption.

The other contribution is to be found in the efficient algorithm for the evaluation of the presented estimator. This exploits the high degree of parallelization allowed by modern computing systems. We developed a Newton-Raphson procedure that uses full second-order information with comparable computational complexity to existing methods that use only limited second-order information. Our approach translates to other high-dimensional estimators that may profit in their numerical evaluation.

The paper is structured as follows. First, the Markov random field distribution for variates of mixed types is recapped as a special case of the more general exponential family. For application purposes, parameter constraints that ensure a well-defined Markov random field are recapped. Next, the exponential family model parameters, and thereby that of the MRF, are estimated by convex pseudo-likelihood maximization, which is shown to yield a consistent estimator. The estimator is numerically evaluated by a form of the Newton-Raphson algorithm. This algorithm is parallelized to exploit the multi-core capabilities of modern computing systems, and conditions that ensure convergence of the algorithm are identified. This work concludes with a demonstration of the estimation procedure by re-analyzing data from an integrative omics study.

1.1 Related work

There is extensive literature on (learning) graphical models. Here we point to the sources relevant for the remainder. Attention originally focussed on a set of variates with a continuous state space following a multivariate normal distribution and thus giving rise to the Gaussian graphical model. The penalized estimation approaches of this model either maximize the likelihood augmented with a penalty (be it or , (Banerjee et al., 2008, Friedman et al., 2008, van Wieringen and Peeters, 2016), or comprise node-wise penalized regressions (Meinshausen and Bühlmann, 2006, Ravikumar et al., 2011). The Ising model is the classic case for graphical models with discrete variables. Its estimation is prohibited by the computationally intractability of its partition function, which is generally the case for distributions of variates with a finite state space (e.g. Bernoulli). This may be circumvented by a stringent sparsity assumption (Höfling and Tibshirani, 2009, Lee et al., 2007). Alternatively, an approximation to the partition function gives rise to approaches that amount to node-wise regressions (Ravikumar et al., 2010, Wainwright et al., 2007, Friedman et al., 2010

). This approach is extended to node-conditional distributions of multinomial or Poisson distributed variates (

Jalali et al., 2011, Allen and Liu, 2013). A final alternative would be to maximize the penalized pseudo-loglikelihood (Höfling and Tibshirani, 2009, building on Besag, 1974), which has the advantage over the node-wise regression of producing a single parameter estimate for the whole graphical structure instead of multiple (possible contradictory) local ones. Moreover, reported simulations indicate that the resulting estimate is close to its penalized maximum likelihood counterpart (Höfling and Tibshirani, 2009). Originating in Besag (1974), work on graphical models of mixed type focusses on the derivation the MRF distribution by the specification of the node-conditional distributions. Lauritzen (1996) and Lauritzen and Wermuth (1989) produce the first examples for combinations of Bernoulli and Gaussian variates. Recently, Yang et al. (2012)

extended this work and specified the node-conditional distribution of each variate as an univariate exponential family member, from which a corresponding joint MRF distribution was derived. These univariate exponential families include multi-parameter members with all but one parameter constant (e.g. Gaussian variables with constant variance). Subsequently,

Yang et al. (2012) estimate the MRF parameter by a limited information approach exploiting the node-conditional distributions. Later Yang et al. (2014) generalized their work on MRF distributions to allow variates of mixed type with node-conditional distribution of any univariate exponential family member. An penalized pseudo-likelihood estimator for this model, limited to a combination of Bernoulli and Gaussian variates, was put forward by Lee and Hastie (2013). In parallel effort, Chen et al. (2015) derived a similar joint MRF distribution that also includes multi-parameter exponential family members (e.g. Gaussian variables with unknown mean and variance), and specify conditions on its parameter space. However, after model specification, both Yang et al. (2014) and Chen et al. (2015) only consider variates of at most two different types in their estimation procedure.

2 Model

This section describes the graphical model for data of mixed types. In its most general form our model is any exponential family distribution, motivated for application purposes by the functional form of the general Markov random field distribution (1

). Within the exponential family the model is first specified variate-wise, conditionally on all other variates. The parametric form of this conditionally formulated model warrants that the implied joint distribution of the variates is also an exponential family member. This correspondence between the variate-wise and joint model parameters endow the former (by way of zeros in the parameter) with a direct relation to conditional independencies between variate pairs, thus linking it to the underlying graph. Finally, to ensure the proposed distribution is well-defined, parameter constraints are discussed.

The multivariate exponential family is a rather broad class of probability distributions describing the joint random behaviour of a set of variates (possibly of mixed type), and encompasses many distributions for variates with a continuous, count and binary outcome space. All distributions share the following functional form:

where is a non-negative base measure, is the natural or canonical parameter (as for any distribution it specifies all parameters of that distribution), the sufficient statistic (as the likelihood of depends only on through ), and , the log-partition function or the normalization factor, which ensures is indeed a probability distribution. The log-partition function needs to be finite to ensure a well-defined distribution. For specific choices of and , standard distributions are obtained. Theoretical results presented in Sections (3) and (4) are stated for the multivariate exponential family, and thereby apply to all encompassing distributions. To provide for the envisioned practical purpose of reconstruction of the conditional dependence graph (as illustrated in Section 6) we require and outline next a Markov random field in which the variates follow a particular exponential family member conditionally. Note that this is thus a special case of the delineated class of exponential family distributions, as will be obvious from the parametric form of the Markov random field distribution.

Following Besag (1974) and Yang et al. (2014) the probability distribution of each individual variate of of conditioned on all remaining variates is assumed to be a (potentially distinct) univariate exponential family member. Its (conditional) distribution is:


Theorem (2) below specifies the joint distribution for graphical models of variates that have a conditional distribution as in Display (2). In particular, it states that there exists a joint distribution of such that is a Markov random field if and only if each variate depends conditionally on the other variates through a linear combination of their univariate sufficient statistics. It is analogous to Theorem (1) by Yang et al. (2014).


Consider a -variate random variable . Assume the distributions of each variate , , conditionally on the remaining variates to be an exponential family member as in (2). Let be a graph which decomposes into , the set of cliques of size at most two. Finally, is a -dimensional parameter matrix with support matching the edge structure of . Then, the following assumptions are equivalent:

  • For , the natural parameter of the variate-wise conditional distribution (2) is:

  • There exists a joint distribution of such that is a Markov random field.

Moreover, by either assumption the joint distribution of is:


The theorem above differs from the formulation in Yang et al. (2014) in the sense that Theorem (2) is restricted to pairwise interactions (i.e. cliques of size at most two). Moreover, here no constraints on base measures are specified.

For the reconstruction of the graph underlying the Markov random field, the edge set is captured by the parameter : nodes are connected by a direct edge if and only if (by the Hammersley-Clifford theorem, Lauritzen, 1996). This gives a simple parametric criterion to assess local Markov (in)dependence. Moreover, the parameter can be interpreted as an interaction parameter between variables and .

We refer to the distribution from Display (4) as the pairwise (joint) MRF distribution. After normalization of (4), the joint distribution is fully specified by sufficient statistics and base measures of the exponential family members. For practical and illustrative puroposes, the remainder will feature only four common exponential family members, the GLM family

: the Gaussian (with constant variance), exponential, Poisson and Bernoulli distributions. For instance, let

be a collection of variates that are Bernoulli distributed conditionally on the other variables. Then and , and the resulting distribution is the Ising model. Similarly, when the variates of are conditionally Gaussians, and produce the Gaussian graphical model.

The joint distribution formed from the variate-wise conditional distributions need not be well-defined for arbitrary parameter choices. In order for to be well-defined, the log-normalizing constant needs to be finite. For example, for the Gaussian graphical model, a special case of the pairwise MRF distribution under consideration, this is violated when the covariance matrix is singular. Theorem (2) specifies the constraint on parameter that ensure a well-defined pairwise MRF distribution when the variates of are GLM family members conditionally.


Let a -dimensional random variable follow a pairwise MRF distribution (4), while individually its variates adhere to either a Gaussian, exponential, Poisson or Bernoulli distribution conditionally on the remaining variates of . Then is well-defined if and only if satisfies the constraints from Table (1).

The same table with specific constraints for the GLM family can be found in Chen et al. (2015). More general results for exponential family members may be derived, see e.g. Yang et al. (2014).

The parameter constraints for a well-defined are restrictive on the structure of graph and the admissible interactions. As the graph is implicated by the support by , the constraints of Table (1) imply that the nodes corresponding to conditionally Gaussian random variables cannot be connected to the nodes representing exponential and/or Poisson random variables. Moreover, when and are assumed to be Poisson and/or exponential random variables conditionally on the other variates, their interaction can only be negative. However, these restrictions can be relaxed by modeling data with, for example, a truncated poisson distribution (Yang et al., 2014), or, while possibly not resulting in a well-defined joint distribution, might even be ignored altogether for network estimation (Chen et al., 2015).

Finally, notice that the pairwise joint MRF distribution is well-defined on a convex parameter space (see Supplementary Material A). This implies that, when limiting ourselves to the class of pairwise joint MRF distributions for the GLM family with the constraints from Table (1), parameter estimation amounts to a concave optimization problem with a convex domain.

Bernoulli Poisson Gaussian Exponential
Table 1: Parameters constraints for a well-defined pairwise joint MRF distribution (4). Here, refers to the submatrix of

that correspond to variates that are conditionally Gaussian distributed, while

refers the subset of variates with a conditional Bernoulli distribution. In additional to the constraints in the table the rate parameters of variates with a conditional exponential distribution are required to be negative:


3 Estimation

The parameter of the multivariate exponential family distribution

is now to be learned from (high-dimensional) data. Straightforward maximization of the penalized log-likelihood is impossible due to the fact that the log-partition function cannot be evaluated in practice. Inspired by the work of

Besag (1974) and Höfling and Tibshirani (2009) this is circumvented by the replacement of the likelihood by the pseudo-likelihood comprising the variate-wise conditional distributions. We show that the penalized pseudo-likelihood estimator of the exponential family model parameter is – under certain mild conditions – consistent. Finally, we present an algorithm for the numerical evaluation of this proposed estimator that efficiently addresses the computational complexity. Both results carry over to the pairwise MRF parameter as special case of the multivariate exponential family.

Consider a sample of -variate random variables all drawn from . The associated (sample) pseudo-likelihood is a composite likelihood of all variate-wise conditional distributions averaged over the observations:


The penalized pseudo-likelihood augments this by a strictly convex, continuous penalty function with penalty parameter , hence . Then the penalized pseudo-likelihood estimator of is given by:


When is proportional to the sum of the square of the elements of the parameter, with the Frobenius norm, it is referred to as the ridge penalty. With the ridge penalty, the estimator (6) is called the ridge pseudo-likelihood estimator. We show that the penalized pseudo-likelihood estimator (6) is consistent in the traditional sense, i.e. a regime of fixed dimension and an increasing sample size . The motivation for such a traditional consistency result is three-fold.

First, it suffices for practical purposes in the ‘omics-field’ where the system’s dimensions are known and fixed at the outset of the analysis. Second, almost all existing high-dimensional results (cf. e.g. Lee and Hastie, 2013, Chen et al., 2015, Lee et al., 2015) show ‘sparsistency’ – a contraction of sparse and consistency – which amounts to the probability of correct (sparse) network structure selection tending to one as . Sparsistency thus addresses the selection property of an estimator but not the (limiting) quality of the estimator in terms of the parameter values. In this light sparsistency results have – understandably – only been proven for -penalized (or a generalization thereof) estimators (e.g., Lee et al., 2015). Under the assumptions of i)

strong convexity of the loss function and

ii) irrepresentability of the model (for details see Lee et al., 2015), sparsistency is proven for sparse models (with the maximum vertex degree such that , Chen et al., 2015), with a minimal parameter value (in the absolute sense) for those parameters being nonzero, and a minimum sample size dependent on the dimension and the degree of sparsity (Lee et al., 2015). These assumptions and requirements are hard – if not impossible – to verify for any practical case at hand (Bento and Montanari, 2009, Anandkumar et al., 2012), thus questioning the usefulness of sparsistency results (Lee and Hastie, 2013). Moreover, for the proposed ridge estimator, which does not select, sparsistency appears not to be the relevant concept.

Third, an information theoretic analysis shows that learning graphical models, even for simple cases such as the Ising or Gaussian graphical model, requires at least samples (in words, is at least of order ) as (Das et al., 2012). For example, when , one finds that . This, when solved for , yields: . For the maximum vertex degree would then be bounded by , which rules out dense networks. Therefore, there is no hope for successful graphical model selection in the high-dimensional setting () when the maximum vertex degree satisfies , which can be assumed to be the case for molecular biology applications or scale-free and dense networks in general.

Hence, the current focus on traditional consistency which serves as a minimal requirement of a novel estimator.


Let be -variate draws from a exponential family distribution . Temporarily supply and with an index to explicate their sample size dependence. Then the penalized pseudo-likelihood estimator maximizing the penalized pseudo-likelihood is consistent, i.e., as if,

  • The parameter space is compact and such that is well-defined for all ,

  • can be bounded by a polynomial, for constants and ,

  • The penalty function is strict convex, continuous, and the penalty parameter converges in probability to zero: as .


Refer to Supplementary Material A.

Theorem (3) thus warrants – under some conditions – the convergence of the penalized pseudo-likelihood estimator as the sample size increases . These conditions require a compact parameter space, a common assumption in the field of graphical models (Lee et al., 2015). Theorem (3) holds in general for any multivariate exponential family distribution and is therefore generally applicable with the pairwise joint MRF distribution as special case. Here, is assured to be well-defined for the GLM family when complies with the constraints from Table (1) of Theorem (2), hence we obtain the following Corollary,


Let be -variate draws from a well-defined pairwise joint MRF distribution with parameter . The ridge pseudo-likelihood estimator that maximizes the ridge-penalized pseudo-likelihood is consistent, i.e., as , if the parameter space is compact, and the penalty parameter converges in probability to zero: as .


Refer to Supplementary Material A.

4 Algorithm

Maximization of the ridge pseudo-likelihood presents a convex optimization problem (a concave pseudo-likelihood and convex parameter space). To this end we present a parallel block coordinate Newton Raphson algorithm for numerical evaluation of the penalized pseudo-likelihood estimator . We show that this algorithm yields a sequence of updated parameters that converge to and terminates after a finite number of iterations.

The strict concavity of the optimization problem (6) and the smoothness of permit the application of the Newton-Raphson algorithm to find the estimate. The Newton-Raphson algorithm starts with an initial guess and – motivated by a Taylor series approximation – updates this sequentially. This generates a sequence that converges to (Vuik et al., 2006). However, the Newton-Raphson algorithm requires inversion of the Hessian matrix and is reported to be slow for the pseudo-likelihood (Lee and Hastie, 2013, Chen et al., 2015): it has computational complexity for variates. Instead of a naive implementation of the Newton-Raphson algorithm to solve (6), the remainder of this section describes a block coordinate approach (inspired by Y. and Yin, 2013) that speeds up the evaluation of the estimator by exploiting the structure of the pseudo-likelihood and splitting the optimization problem (6) into multiple simpler subproblems. The novelty of this parallel block coordinate Newton-Raphson algorithm is necessary to answer to the ever increasing size of data sets and make optimal use of available multi-core processing systems. Finally, in contrast to other pseudo-likelihood approaches by Höfling and Tibshirani (2009) or Lee and Hastie (2013), the presented approach allows for the use of all second-order information (i.e. the Hessian) without much increase in computational complexity and the benefit of potentially faster convergence.

In order to describe the block coordinate approach some notation is introduced. Define , the number of unique parameters of . The set of unique parameter indices is denoted by and use or as shorthand for the -dimensional vector of unique parameters . Furthermore, write for , the -dimensional vector of all unique parameters of that correspond to the -th variate. Note that, consequently, for the corresponding and have parameter(s) of in common. Finally, let be the -dimensional submatrix of the Hessian limited to the elements that relate to the -th variate, i.e.: .

In our block coordinate approach we maximize the penalized pseudo-likelihood with respect to the parameter subvector for while all other parameters are temporarily kept constant at their current value. Per block we maximize by means of the Newton-Raphson algorithm starting from the initial guess and updating the current parameter value by through:


Block-wise the procedure converges to the optimum, that is, the maximum of given the other parameters of . Sequential application of the block coordinate approach is – by the concavity of – then guaranteed to converge to the desired estimate. Sequential application of the block coordinate approach may be slow and is ran here in parallel for all simultaneously. This yields . But as some elements of and map to the same element of , multiple estimates of the latter are thus available. Hence, the results of each parallel iteration need to be combined in order to provide a single update of the full estimate . This update of should increase and iteratively solve the concave optimization problem (6). We find such an update in the direction of the sum of the block-wise updates of . A well-chosen step size in this direction then provides a suitable update of .

Algorithm (1) gives a pseudo-code description of the parallel block coordinate Newton-Raphson algorithm, while Figure (1) visualizes the combination of the block-wise estimates. Theorem (4) states that Algorithm (1) converges to the penalized pseudo-likelihood estimator and terminates. Note that Theorem (4) is a rather general result for the penalized pseudo-likelihood estimator of exponential family distributions. As special case, the same result follows for the pairwise joint MRF distribution with the GLM family.


Let be independent draws from a -variate exponential family distribution . Assume that the parameter space of is compact. Let be the unique global maximum of the penalized pseudo-likelihood . Then, for any initial parameter , threshold and sufficiently large multiplier , Algorithm (1) terminates after a finite number of iterations and generates a sequence of parameters that converge to .


Refer to Supplementary Material B.

The presented Algorithm (1) balances computational complexity, convergence rate and optimal use of available information. The algorithm terminates after a finite number of iterations and one iteration, i.e. lines 3 to 9, has computational complexity when run in parallel. Moreover, Algorithm (1) uses all available second-order information (the Hessian of ) and its convergence rate is at least linear. However, the convergence rate is quadratic when the multiple updates for each parameter are identical.

The pseudo-likelihood method has previously been reported to be computationally intensive with slow algorithms (Chen et al., 2015). For instance, the computational complexity of pseudo-likelihood maximization is per iteration for a naive implementation of the Newton-Raphson algorithm. Comparable work therefore uses either the pseudo-likelihood or a node-wise regression. When maximizing the pseudo-likelihood, existing methods use a diagonal Hessian or an approximation thereof, or only first-order information (Höfling and Tibshirani, 2009, Lee and Hastie, 2013). Such approaches achieve linear convergence at best and have a computational complexity of at least per iteration as the gradient of the pseudo-likelihood must be evaluated. Alternatively, the computational complexity of node-wise regression methods is per iteration for existing algorithms, which could be optimized to with a parallel implementation. However, node-wise regression methods estimate each parameter twice and subsequently need to aggregate their node-wise estimates. This aggregated estimate does not exhibit quadratic convergence. Moreover, these node-wise estimates are potentially contradictory and their quality depends on the type of the variable (Chen et al., 2015). In summary, we expect Algorithm (1) to perform no worse than other pseudo-likelihood maximization approaches, since its computational complexity of is comparable or better than existing methods and all available second-order information is used.

Finally, in Theorem (4) and for the pairwise joint MRF distribution with the GLM family, the condition on the multiplier may be relaxed, thereby increasing the step size of the parameter update and the convergence speed of Algorithm (1). As an example for the ridge penalty, in Supplementary Material C we present a lower bound on which warrants, when , an increase of the penalized pseudo-likelihood at each iteration of Algorithm (1). However, this lower bound is computationally demanding to evaluate as it depends on and . Therefore, this result from Supplementary Material C mainly motivates the use of an smaller than . This might be implemented by initiating the Algorithm (1) with . Then, whenever the algorithm does not decrease the error , the employed is reset to (say) twice its value. Eventually, the multiplier then becomes sufficiently large to ensure an improvement of the loss function and, thus, guarantees the convergence of Algorithm (1).

Figure 1: Schemata of the parameter updating by Algorithm (1). First, the subvectors are updated to novel (line 5 of Algorithm 1). These updates are interspersed with zero’s to form the -dimensional vectors (line 6 of Algorithm 1). Finally, the ’s are averaged weightedly to produce the update of (line 8 of Algorithm 1).

input :  data matrix ; exponential family members; initial parameter ; penalty parameter ; threshold ; step size . output : sequence . 1 initialize , . 2 while   do 3       for  do in parallel 4             calculate the gradient and Hessian . 5             compute a single Newton-Raphson update of . formulate the update as a -dimensional vector by:

6       end synchronize 7      define the parameter estimate . 8       assess error and . 9 end while Algorithm 1 Pseudocode of the parallel block coordinate Newton-Raphson algorithm for evaluation of the penalized pseudo-likelihood estimator.

4.1 Implementation

4.1.1 Algorithm

Algorithm (1) was implemented in C++ using the OpenMP API that supports multithreading with a shared memory. For convenience of the user, the algorithm was wrapped in an R-package as extension for the R statistical computing software, together with some extensions such as -fold cross-validation and a Gibbs sampler to draw samples from the pairwise joint MRF distribution. The package will be made publicly available on GitHub. The boundary constraints on the parameter space (e.g. Table (1) for the GLM family) are implemented using an additional convex and twice-differentiable penalty whenever one of the boundary constraints is violated.

4.1.2 Cross-validation

The penalty parameter of the ridge pseudo-likelihood estimator is selected using -fold cross-validation. This amounts to dividing the samples over exhaustive and mutually exclusive groups. The samples from all-but-one of these groups are used to compute the estimator for a given choice of the penalty parameter. The performance of the obtained estimator is evaluated on the samples of the left-out group. This process is repeated times, leaving each group out once, but using the same penalty parameter value throughout. The resulting performances are averaged and this average is the estimated performance of the estimator for the employed penalty parameter value. The performance is assessed for a grid of penalty parameters usually spanning multiple orders of magnitude (e.g. ). The value that yields the best estimated performance is considered optimal and used to obtain the final, optimal estimator. In the remainder the performance is evaluated with the mean squared prediction error (MSPE) and unless stated otherwise.

4.1.3 Sampling

To draw data from the full pairwise MRF distribution may be difficult. But as the conditional distributions of each variate given the others are known explicitly and of relatively simple functional form, a Gibbs sampler to draw samples from the pairwise joint MRF distribution is easily constructed. It requires sampling from univariate exponential family members. Details of the resulting Gibbs sampler are immediate from its pseudo-code given in Supplementary Material C. In the remainder this algorithm is applied with a burn-in period of length and a thinning factor , such that samples are selected from the chain iterations apart ensuring independence among samples (Chen et al., 2015).

5 Simulations

We evaluate the performance of the parallel block coordinate Newton-Raphson Algorithm (1) for the numerical evaluation of the penalized pseudo-likelihood together with the quality of the resulting estimator in a numerical study with synthetic data. Throughout we use the convex and twice differentiable ridge penalty . Hence, it is the quality of the ridge pseudo-likelihood estimator of that is studied. The diagonal of the parameter matrix is left unpenalized (as recommended by Höfling and Tibshirani, 2009). Throughout this section, the cross-validated ridge pseudo-likelihood estimator of is learned with Algorithm (1) using threshold and multiplier unless stated otherwise.

5.1 Performance illustration

We illustrate the performance of algorithm and estimator within a simulation assuming a lattice graph , thus following Yang et al. (2014), Lee and Hastie (2013), and Chen et al. (2015). The chosen graph’s layout, depicted in Figure (2), represents the most general setting encompassed by the outlined theory in which each GLM family member is present with an equal number of variates. The interactions obey the parameter restrictions reported in Table (1), thus ensuring a well-defined pairwise joint MRF distribution. The resulting lattice graph for nodes has edges. The corresponding pairwise MRF distribution has unique parameters and allows for edges. Hence, the constructed network is dense in the sense that it contains of all possible edges. Moreover, the nodes have an average degree of , while graphical model selection fails high-dimensionally when the maximum vertex degree is larger than (Das et al., 2012). The employed lattice graph thus represents a setting where previous work on (sparse) graphical models with data of mixed types fails when the sample size is small (relative to the number of parameters). To ensure the resulting pairwise MRF distribution adheres to the described lattice graph , we choose its parameter as follows:

Consequently, as this choice is in line with Table (1), the pairwise MRF distribution is well-defined and all edges share the same edge weight (as captured by the nonzero off-diagonal elements of ). Moreover, the variance of all node-conditional Gaussian variables was fixed at . With the distribution fully specified data are generated with the Gibbs sampler (described in Supplementary Material C) using the node-conditional distributions derived from .

Figure 2: Top panel: The synthetic simulation lattice graph . Nodes with variates distributed in accordance with the GLM family members Bernoulli, Gaussian, Poisson and exponential are represented by a pictogram representing (the shape of) their distribution. Bottom panel: The estimated lattice graph for . Nodes have a pictogram representing (the shape of) the distribution of the associated variate. For visualization purposes only edges corresponding to parameters whose absolute value is within the largest are depicted. Correctly identified edges are highlighted in bold.

First we consider the case . Note that, although , the number of to-be-estimated parameters exceeds . From this generated data we calculate the cross-validated ridge pseudo-likelihood estimator of using Algorithm (1). To visualize the estimated distribution with as a network only the edges with the largest (in the absolute sense) weights are displayed. Here the largest 36 off-diagonal parameters are selected for visualization (Figure 2), which would – ideally – amount to a reproduction of the underlying lattice graph when is close (in some sense) to . The resulting network contains of the edges in the lattice graph (cf. Figure 2), also when averaged over multiple replicates (Figure 3). Therefore on average of largest estimated parameters correspond to the nonzero off-diagonal elements of . Strikingly, few of the edges between ‘Bernoulli nodes’ are selected, indicating that their parameter estimates are relatively small. This hints at the need for a larger sample size for the good recovery of parameters representing interactions among Bernoulli variates.

Next, we evaluate the performance of Algorithm (1) on data for drawn from the same ‘lattice graph’ distribution as described above. To this end, we compare the error of the cross-validated ridge pseudo-likelihood estimator to its unpenalized counterpart and the averaged node-wise regression coefficients as baseline (whenever the sample size allowed for the evaluation of the latter two). Here the error is defined as the Frobenius norm of the difference between the parameter and its estimate , and analogous for the other two. Figure (3) shows the error of each estimator as function of the sample size. The error of the ridge pseudo-likelihood estimator decreases slowly with the sample size in the low-dimensional regime as expected, while a sharp increase of its error of is observed in a high-dimensional setting. In the low-dimensional regime the error of the ridge pseudo-likelihood is generally on a par with its unpenalized counterpart and the baseline node-wise regression. More refined, both the ridge and unpenalized pseudo-likelihood outperform the baseline averaged node-wise regression for all sample sizes. A full information and simultaneous parameter estimation approaches are thus preferable. Finally, the proposed ridge pseudo-likelihood estimator clearly shows better performance in the sample domain of (say) . Hence, regularization aids (in the sense of error minimization) when the dimension approaches or exceeds the sample size .

Notice that our algorithm terminates within iterations for every single run of this lattice network simulation. Obviously, the number of iterations to termination of the algorithm decreases when the penalty increases. To appreciate the computational efficiency of Algorithm (1), notice that one iteration of the naive Newton-Raphson implementation has computational complexity compared to of Algorithm (1). This permits our algorithm iterations before exceeding the computational complexity of one iteration of the naive implementation. A further numerical investigation (not detailed, but summarized in Figure 3) shows that convergence of Algorithm (1) is linear but the convergence rate scales linearly with the choice for (motivated by the results presented in Supplementary Material C). Moreover, the presented algorithm was found to always terminate within iterations. In summary, the total computational complexity of Algorithm (1) to termination is similar to a single iteration of a naive implementation.

Figure 3: Top left panel:

ROC curve, i.e. the number of true positive vs. false positive edges, of the edge selection process based on the corresponding largest (in absolute sense) parameter values. The average over 30 replicates is presented. The error bars stretch one the standard deviation in both directions from the average.

Top right panel: Scaling of the estimator errors with the sample size. Included are the cross-validated ridge pseudo-likelihood (red), its unpenalized counterpart (blue) and the ‘averaged’ node-wise regression coefficients (green). The number of replicates is inversely proportional to the sample size ( replicates for , and subsequently replicates). Error bars denote a confidence interval for the mean. Bottom left panel: Scaling of the required iterations for Algorithm (1) to converge with multiplier . Simulations were run with the lattice graph for and . The gray dotted line represents . The average of replicates is shown with error bars denoting the standard deviation. Bottom right panel: Scaling of the precision matrix estimator error with the sample size. Included are the ridge pseudo-likelihood (red), its unpenalized counterpart (blue), the ridge likelihood (green) and its unpenalized counterpart (yellow). The number of replicates is inversely proportional to the sample size ( replicates for , and subsequently replicates). Error bars denote the standard deviation.

5.2 Comparison

Finally, in another effort to compare the performance of the proposed ridge pseudo-likelihood estimator, we consider the Gaussian graphical model and compare with the ridge precision estimator (van Wieringen and Peeters, 2016). Assuming all variates being jointly normal, , the latter estimates through ridge penalized likelihood maximization. The ridge pseudo-likelihood too estimates , but does so in a limited information approach. Here we compare the quality of these full and limited approaches in silico. To this end define a three-banded precision matrix with a unit diagonal, for , for , for , and all other entries equal to zero. Here we set , and draw a sample of various sizes from the thus defined multivariate normal . From these data both the likelihood and pseudo-likelihood estimators are evaluated.

We compare performance by means of the error, defined as , to their unpenalized analogues. Figure (3) shows the error of each estimator as function of the sample size. In the low-dimensional regime, the errors of all estimators are very close and decrease slowly with the sample size as expected. In the high-dimensional regime (), expectedly, the penalized estimators clearly outperform their unpenalized counterparts as can be witnessed from their diverging error when approaches . Within the penalized estimators, the ridge likelihood appears to outperform the ridge pseudo-likelihood slightly. This is probably to the full information usage of the likelihood. More importantly, the performance (as measured by the error of the precision matrix) of the penalized pseudo-likelihood estimator is comparable to that of the penalized likelihood estimator. This corroborates the results of previous simulation studies into the (lasso) penalized pseudo-likelihood (Lee and Hastie, 2013, Höfling and Tibshirani, 2009).

6 Application

We illustrate our proposed methodology by means of learning the pairwise MRF distribution of (part of the) cellular regulatory network from omics data. The illustration uses the publicly available non-silent somatic mutation and gene expression data from the invasive breast carcinoma study of the Cancer Genome Atlas (TCGA) project (The Cancer Genome Atlas Network, 2012). The gene expression data comprises the RNA sequencing profiles of patients with invasive breast carcinomas (BRCA) and are obtained via the brcadat dataset included in the R-package XMRF (Wan et al., 2016a). The thus obtained data set includes the expression levels (normalized mRNA read counts) of genes listed in the Catalogue of Somatic Mutations in Cancer (COSMIC) (Forbes et al., 2015). Subsequently, the RNA sequencing data are preprocessed as described in Allen and Liu (2013) using the processSeq function from the XMRF package. This results in expression levels that – presumably – can be modeled with a Poisson distribution. Then, the top genes with largest variance in expression levels across patients are retained. Next, the mutation data of BRCA patients are obtained from the TCGA project and filtered with the getTCGA-function from the R-package TCGA2STAT (Wan et al., 2016b). The binary data indicate the presence of a mutation in the coding region of a gene. Furthermore, only mutation data of genes with mutations present in at least of the patients are included. Finally, the mutation and expression data sets are merged at the gene level and for common patients using the OMICSBind function from the TCGA2STAT-package. The resulting final data matrix contains variates comprising the non-silent somatic mutations of genes (binary) and the expression level of genes (counts) from BRCA patients. The genes CDH and GATA have both mutation and expression data measured.

The pairwise joint MRF distribution with node-conditional Bernoulli and Poisson distributions for mutation and expression variates, respectively, was fitted to the data described above using Algorithm (1) to find the proposed ridge pseudo-likelihood estimator. The penalty parameter is selected with -fold cross-validation using a grid search over a range of penalty parameters to minimize the mean square prediction error.

Figure 4: Left panel: Heatmap of the ridge pseudo-likelihood estimator of the invasive breast carcinoma gene network. Non-silent somatic mutations in genes are labeled red, gene expression levels are labeled yellow. Positive interactions between genes are shaded red, negative interactions are shaded blue. Algorithm (1) was run using threshold and terminated within iterations for each run. Right panel: Network visualization of the ridge pseudo-likelihood estimator of the BRCA gene network. Shown are the connected components of the network with edges whose absolute edge weight is among the largest in absolute sense (disconnected nodes are omitted). Nodes in the network are labeled with the gene name, where either non-silent somatic mutations (binary, red) or gene expression levels (counts, yellow) were measured.

The resulting estimate is visualized by means of a heatmap (Figure 4). Interaction parameters between genes’ expression levels are always negative as may be expected (Table 1). Moreover, the majority of strongest interactions (in absolute sense) relate gene expression levels to somatic mutations. Finally, we summarized the ridge pseudo-likelihood estimator as a network for interpretation purposes (Figure 4). The network includes only the edges with the largest (in the absolute sense) weights. These plotted edges include two types of interactions, interactions among genes’ expression levels and those between one genes’ expression levels and anothers somatic mutations. The majority of these edges involve the non-silent somatic mutation in TP, which is a well-known tumor suppressor gene (Levine et al., 1991). These edges relate a mutation in TP to a change in the expression level of genes that have been causally implicated with cancer (Forbes et al., 2015), thereby confirming the role of TP as guardian of the genome (Lane, 1992). Another biomarker with a large node degree is a non-silent mutation in PIKCA, a known oncogene that is often mutated in breast cancers (Cizkova et al., 2012). Otherwise noteworthy are genes CDH and GATA, the only included with both mutation and expression data. The molecular levels of these two genes are connected by an edge in the displayed network, which indicates that their expression levels are related to a mutation in their DNA template. Finally, the network contains no edges between the binary mutation variates, which may have been expected as somatic mutations may co-occur but are generally believed to be unrelated.

7 Conclusion

We presented methodology for the estimation of multivariate exponential family distributions. As special case of interest, the employed class of distributions encompasses the pairwise Markov random field that describes stochastic relations among variates of various types.

The model parameters are estimated by means of penalized pseudo-likelihood maximization to account for collinearity and an excess of variates (relative to the sample size). This estimator was shown to be consistent under mild conditions. Our algorithm allows for computational efficiency on multi-core systems and accommodates for a large number of variates. The algorithm was shown to converge and terminate. Finally, the performance of the algorithm and penalized estimator was studied by means of a simulation study, and our methodology was demonstrated with an application to an integrative omics study using data from various molecular levels (and types), which yielded biological sensible results.

Envisioned extensions of the presented ridge pseudo-likelihood estimator allow – amongst others – for variate type-wise penalization. Technically, this is a minor modification of the algorithm but brings about the demand for an efficient penalty parameter selection procedure. Furthermore, when quantitative prior information of the parameter is available it may be of interest to accommodate shrinkage to nonzero values.

Foreseeing a world with highly parallelized workloads, our algorithm provides a first step towards a theoretical framework that allows for efficient parallel evaluation of (high-dimensional) estimators. Usually and rightfully most effort concentrates on the mathematical optimization of the computational aspects of an algorithm. Once that has reached its limits, parallelization may push further. This amounts to simultaneous estimation of parts of the parameter followed by careful – to ensure convergence – recombination to construct a full updated parameter estimate. Such parallel algorithms may bring about a considerable computational gain. For example, in the presented case this gain was exploited to incorporate full second-order information without inferior computational complexity compared to existing algorithms.


  • Allen and Liu [2013] G I Allen and Z Liu. A local poisson graphical model for inferring networks from sequencing data. IEEE Transactions on Nanobioscience, 12(3):189–198, 2013.
  • Anandkumar et al. [2012] A Anandkumar, V Y F Tan, F Huang, and A S Willsky. High-dimensional structure estimation in Ising models: Local separation criterion. Annals of Statistics, 40(3):1346–1375, 2012.
  • Banerjee et al. [2008] O Banerjee, L El Ghaoui, and A DAspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data.

    Journal of Machine Learning Research

    , 9:485–516, 2008.
  • Bento and Montanari [2009] J Bento and A Montanari. Which graphical models are difficult to learn? In Advances in Neural Information Processing Systems, pages 1303–1311, 2009.
  • Besag [1974] J Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), pages 192–236, 1974.
  • Boyle et al. [2017] E A Boyle, Y I Li, and J K Pritchard. An expanded view of complex traits: from polygenic to omnigenic. Cell, 169(7):1177–1186, 2017.
  • Chen et al. [2015] S Chen, D M Witten, and A Shojaie. Selection and estimation for mixed graphical models. Biometrika, 102(1):47–64, 2015.
  • Cizkova et al. [2012] M Cizkova, A Susini, S Vacher, G Cizeron-Clairac, C Andrieu, K Driouch, E Fourme, R Lidereau, and I Bieche. PIK3CA mutation impact on survival in breast cancer patients and in ERa, PR and ERBB2-based subgroups. Breast Cancer Research, 14(1), 2012.
  • Das et al. [2012] A K Das, P Netrapalli, S Sanghavi, and S Vishwanath. Learning Markov graphs up to edit distance. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pages 2731–2735. IEEE, 2012.
  • Forbes et al. [2015] S A Forbes, D Beare, P Gunasekaran, K Leung, N Bindal, H Boutselakis, M Ding, S Bamford, C Cole, S Ward, C Y Kok, M Jia, T De, J W Teague, M R Stratton, U McDermott, and P J Campbell. COSMIC: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Research, 43:805–811, 2015.
  • Friedman et al. [2008] J Friedman, T Hastie, and R Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432–441, 2008.
  • Friedman et al. [2010] J Friedman, T Hastie, and R Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
  • Gallagher et al. [2011] A C Gallagher, D Batra, and D Parikh. Inference for order reduction in Markov random fields. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1857–1864. IEEE, 2011.
  • Hammersley and Clifford [1971] J M Hammersley and P Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.
  • Höfling and Tibshirani [2009] H. Höfling and R. Tibshirani. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. Journal of Machine Learning Research, 10:883–906, 2009.
  • Jalali et al. [2011] A Jalali, P Ravikumar, V Vasuki, and S Sanghavi. On learning discrete graphical models using group-sparse regularization. In

    Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics

    , pages 378–387, 2011.
  • Lane [1992] D P Lane. Cancer. p53, guardian of the genome. Nature, 358:15–16, 1992.
  • Lauritzen [1996] S Lauritzen. Graphical Models. Oxford University Press, 1996.
  • Lauritzen and Wermuth [1989] S L Lauritzen and N Wermuth. Graphical models for associations between variables, some of which are qualitative and some quantitative. Annals of Statistics, 17(1):31–57, 1989.
  • Lee and Hastie [2013] J Lee and T Hastie. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics, 24(1):230–253, 2013.
  • Lee et al. [2015] J.D. Lee, Y. Sun, and J. Taylor. On model selection consistency of m-estimators. Electronic Journal of Statistics, 9, 2015.
  • Lee et al. [2006] S-I Lee, H Lee, P Abbeel, and A Y Ng.

    Efficient L1-regularized logistic regression.

    In AAAI, volume 6, pages 401–408, 2006.
  • Lee et al. [2007] S-I Lee, V Ganapathi, and D Koller. Efficient structure learning of markov networks using l1-regularization. In Advances in Neural Information processing systems, pages 817–824, 2007.
  • Levine et al. [1991] A J Levine, J Momand, and C A Finlay. The p53 tumour suppressor gene. Nature, 351(6326):453–6, 1991.
  • Meinshausen and Bühlmann [2006] N Meinshausen and P Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436–1462, 2006.
  • Ravikumar et al. [2010] P Ravikumar, M Wainwright, and J Lafferty. High-dimensional Ising model selection using L1-regularized logistic regression. Annals of Statistics, 38(3):1287–1319, 2010.
  • Ravikumar et al. [2011] P Ravikumar, M J Wainwright, G Raskutti, and B Yu. High-dimensional covariance estimation by minimizing L1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
  • The Cancer Genome Atlas Network [2012] The Cancer Genome Atlas Network. Cancer Genome Atlas Research Network. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61–70, 2012.
  • van Wieringen and Peeters [2016] W N van Wieringen and C F W Peeters. Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics & Data Analysis, 103:284–303, 2016.
  • Vuik et al. [2006] C Vuik, P van Beek, F Vermolen, and J van Kan. Numerieke Methoden voor Differentiaalvergelijkingen. VSSD, 2006.
  • Wainwright et al. [2007] M J Wainwright, J D Lafferty, and P K Ravikumar. High-dimensional graphical model selection using l1-regularized logistic regression. In Advances in Neural Information Processing Systems, pages 1465–1472, 2007.
  • Wan et al. [2016a] Y Wan, G I Allen, Y Baker, E Yang, P Ravikumar, M L Anderson, and Z Liu. XMRF: an R package to fit Markov networks to high-throughput genetics data. BMC Systems Biology, 10(69), 2016a.
  • Wan et al. [2016b] Y Wan, G I Allen, and Z Liu. TCGA2STAT: simple TCGA data access for integrated statistical analysis in R. Bioinformatics, 32(6):952–954, 2016b.
  • Welsh [1993] D J A Welsh. Complexity: Knots, Colourings and Counting. Cambridge University Press, 1993.
  • Y. and Yin [2013] Xu Y. and W Yin.

    A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion.

    SIAM Journal of Imaging Sciences, 6(3):1758—-1789, 2013.
  • Yang et al. [2012] E Yang, G Allen, Z Liu, and P K Ravikumar. Graphical models via generalized linear models. In Advances in Neural Information Processing Systems, pages 1358–1366, 2012.
  • Yang et al. [2014] E Yang, Y Baker, P Ravikumar, G Allen, and Z Liu. Mixed graphical models via exponential families. In Artificial Intelligence and Statistics, pages 1042–1050, 2014.