Stable Graphical Models

Stable random variables are motivated by the central limit theorem for densities with (potentially) unbounded variance and can be thought of as natural generalizations of the Gaussian distribution to skewed and heavy-tailed phenomena. In this paper, we introduce stable graphical (SG) models, a class of multivariate stable densities that can also be represented as Bayesian networks whose edges encode linear dependencies between random variables. One major hurdle to the extensive use of stable distributions is the lack of a closed-form analytical expression for their densities. This makes penalized maximum-likelihood-based learning computationally demanding. We establish theoretically that the Bayesian information criterion (BIC) can asymptotically be reduced to the computationally more tractable minimum dispersion criterion (MDC) and develop StabLe, a structure learning algorithm based on MDC. We use simulated datasets for five benchmark network topologies to empirically demonstrate how StabLe improves upon ordinary least squares (OLS) regression. We also apply StabLe to microarray gene expression data for lymphoblastoid cells from 727 individuals belonging to eight global population groups. We establish that StabLe improves test set performance relative to OLS via ten-fold cross-validation. Finally, we develop SGEX, a method for quantifying differential expression of genes between different population groups.


1 Introduction

Stable distributions have found applications in modeling several real-life phenomena (Berger and Mandelbrot, 1963; Mandelbrot, 1963; Nikias and Shao, 1995; Gallardo et al., 2000; Achim et al., 2001) and have robust theoretical justification in the form of the generalized central limit theorem (Feller, 1968; Nikias and Shao, 1995; Nolan, 2013). Several special instances of multivariate generalizations of stable distributions have also been described in the literature (Samorodnitsky and Taqqu, 1994; Nolan and Rajput, 1995). Multivariate stable densities have previously been applied to modeling wavelet coefficients with bivariate α-stable distributions (Achim and Kuruoglu, 2005), inferring parameters for linear models of network flows (Bickson and Guestrin, 2011) and stock market fluctuations (Bonato, 2012).

In this paper, we describe α-stable graphical (α-SG) models, a new class of multivariate stable densities that can be represented as directed acyclic graphs (DAGs) with arbitrary network topologies. We prove that these multivariate densities also correspond to linear regression-based Bayesian networks and establish a model selection criterion that is asymptotically equivalent to the Bayesian information criterion (BIC). Using simulated data for five benchmark network topologies, we empirically show how α-SG models improve structure and parameter learning performance for linear regression networks with additive heavy-tailed noise.

One motivation for the present work comes from potential applications to computational biology, especially in genomics, where Bayesian network models of gene expression profiles are a popular tool  (Friedman et al., 2000; Ben-Dor et al., 2000; Friedman, 2004). A common approach to network models of gene expression involves learning linear regression-based Gaussian graphical models. However, the distribution of experimental microarray intensities shows a clear skew and may not necessarily be best described by a Gaussian density (Section 3.2). Another aspect of microarray intensities is that they represent the average mRNA concentration in a population of cells. Assuming the number of mRNA transcripts within each cell to be independent and identically distributed, the generalized central limit theorem suggests that the observed shape should asymptotically (for large population size) approach a stable density (Feller, 1968; Nikias and Shao, 1995; Nolan, 2013). Univariate stable distributions have previously been used to model gene expression data (Salas-Gonzalez et al., 2009a, b) and it is therefore natural to consider multivariate α-stable densities as models for mRNA expression for larger sets of genes. In Section 3.2 we provide empirical evidence to support this reasoning. We further develop α-stable graphical (α-SG) models for quantifying differential expression of genes from microarray data belonging to phase III of the HapMap project (International HapMap 3 Consortium and others, 2010; Montgomery et al., 2010; Stranger et al., 2012).

The rest of the paper is structured as follows: Section 2.1 describes the basic notation and background concepts for Bayesian networks and stable densities. Section 2.2 introduces α-SG models and establishes that these models are Bayesian networks that also represent multivariate stable distributions with finite spectral measures. Section 2.3 establishes the equivalence of the popular but (in this case) computationally challenging Bayesian information criterion (BIC) for structure learning and the computationally more tractable minimum dispersion criterion (MDC), for all α-SG models that represent symmetric densities. Furthermore, we establish how data samples from any α-SG model can be combined to generate samples from a partner symmetric α-SG model with identical network topology and regression coefficients. Using these theoretical results, we design StabLe, an efficient algorithm that combines ordering-based search (OBS) (Teyssier and Koller, 2005) for structure learning with the iteratively re-weighted least squares (IRLS) algorithm (Byrd and Payne, 1979) for learning the regression parameters via least $\ell_p$-norm estimation. Finally, in Section 3 we implement the structure and parameter learning algorithms on simulated and expression microarray data sets.

2 Methods

In this section we develop the theory and algorithms for learning α-SG models from data. First, we discuss some well-established results for Bayesian networks and α-stable densities.

2.1 Background

We begin with an introduction to Bayesian network models (Pearl, 1988) for the joint probability distribution of a finite set of random variables $\mathbf{X} = \{X_1, \dots, X_n\}$. A Bayesian network $(\mathcal{G}, \Theta)$ is specified by a directed acyclic graph (DAG) $\mathcal{G}$, whose vertices represent random variables in $\mathbf{X}$, and a set of parameters $\Theta$ that determine the conditional probability distribution $P(X_i \mid \mathbf{Pa}_i)$ for each variable $X_i$ given the state of its parents $\mathbf{Pa}_i$ in $\mathcal{G}$ (Koller and Friedman, 2009). We will overload the symbols $X_i$ and $\mathbf{X}$ to represent both sets of random variables and their instantiations. The directed acyclic graph implies a factorization of the joint probability density into terms representing each variable and its parents (called a family) such that:

$$P(\mathbf{X}) = \prod_{i=1}^{n} P\!\left(X_i \mid \mathbf{Pa}_i\right) \qquad (1)$$

The dependence of $X_i$ on $\mathbf{Pa}_i$ is usually specified by an appropriately chosen family of parametrized probability densities for the random variables, such as Gaussian or log-Normal. In this paper, we will use multivariate stable densities to model the random variables in $\mathbf{X}$. The primary motivation for modeling continuous random variables using stable distributions comes from the generalization of the central limit theorem to distributions with unbounded variance (Feller, 1968; Nikias and Shao, 1995). In the limit of large $N$, sums of $N$ independent, identically distributed random variables approach a stable density. A formal definition for stable random variables can be provided in terms of the characteristic function (the Fourier transform of the density function).

A stable random variable $X \sim S_\alpha(\beta, \gamma, \delta)$ is defined for each $\alpha \in (0, 2]$, $\beta \in [-1, 1]$, $\gamma > 0$ and $\delta \in \mathbb{R}$. The probability density is implicitly specified by a characteristic function:

$$\mathbb{E}\left[e^{i\theta X}\right] = \begin{cases} \exp\left(-\gamma |\theta|^\alpha \left[1 - i\beta\,\mathrm{sign}(\theta)\tan\frac{\pi\alpha}{2}\right] + i\delta\theta\right) & \alpha \neq 1 \\ \exp\left(-\gamma |\theta| \left[1 + i\beta\,\tfrac{2}{\pi}\,\mathrm{sign}(\theta)\ln|\theta|\right] + i\delta\theta\right) & \alpha = 1 \end{cases}$$

The parameters $\alpha$, $\beta$, $\gamma$ and $\delta$ will be called the characteristic exponent, skew, dispersion and location, respectively. Unfortunately, the density does not have a closed-form analytical expression except for the three well-known stable distributions (Figure 1 and Table 1).

Except for the Gaussian case, the asymptotic (large $x$) behavior of univariate α-stable densities shows Pareto or power law tails (Lévy, 1925). The following lemma formalizes this observation (Samorodnitsky and Taqqu, 1994; Nolan, 2013).

Lemma 1

If $X \sim S_\alpha(\beta, \gamma, \delta)$ with $0 < \alpha < 2$, then as $x \to \infty$,

$$P(X > x) \sim C_\alpha\, \frac{1 + \beta}{2}\, \gamma\, x^{-\alpha}, \qquad C_\alpha = \left(\int_0^\infty u^{-\alpha} \sin u \, du\right)^{-1}$$

Figure 1: The three instances of analytically known univariate α-stable densities: Lévy (solid blue curves), Cauchy (dashed green curves) and Normal (dot-dashed red curves).
Distribution | Density | Support | Stable parameters
Lévy | $f(x) = \sqrt{\frac{\sigma}{2\pi}}\, \frac{e^{-\sigma/(2(x-\delta))}}{(x-\delta)^{3/2}}$ | $x \in (\delta, \infty)$ | $\alpha = 1/2,\ \beta = 1$
Cauchy | $f(x) = \frac{\sigma}{\pi\left(\sigma^2 + (x-\delta)^2\right)}$ | $x \in \mathbb{R}$ | $\alpha = 1,\ \beta = 0$
Normal | $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\delta)^2/(2\sigma^2)}$ | $x \in \mathbb{R}$ | $\alpha = 2,\ \beta = 0$

Table 1: Closed-form analytical expressions for the Lévy, Cauchy and Normal densities (conventional scale $\sigma$, location $\delta$) and the corresponding α-stable parameters.
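As a quick numerical illustration of Table 1, the sketch below checks the Cauchy and Normal rows against SciPy's generic stable density. This assumes scipy.stats.levy_stable, whose parameterization uses a scale $\sigma$ rather than the dispersion $\gamma = \sigma^\alpha$; for $\alpha = 2$ a stable scale $\sigma$ corresponds to a Gaussian standard deviation of $\sqrt{2}\,\sigma$.

  import numpy as np
  from scipy.stats import levy_stable, norm, cauchy

  x = np.linspace(-5.0, 5.0, 11)
  sigma = 1.5

  # alpha = 1, beta = 0: Cauchy with scale sigma.
  p_cauchy = levy_stable.pdf(x, 1.0, 0.0, loc=0.0, scale=sigma)
  print(np.abs(p_cauchy - cauchy.pdf(x, scale=sigma)).max())        # ~ 0

  # alpha = 2, beta = 0: Gaussian with standard deviation sqrt(2) * sigma.
  p_norm = levy_stable.pdf(x, 2.0, 0.0, loc=0.0, scale=sigma)
  print(np.abs(p_norm - norm.pdf(x, scale=np.sqrt(2.0) * sigma)).max())  # ~ 0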

A word on the notation used throughout this paper: we will use the symbol $\|\mathbf{x}\|_p$ to represent the $\ell_p$ norm of a vector $\mathbf{x}$. The $\ell_p$ norm of a vector of $N$ instantiations of a random variable $X$ is related to the moment $\mathbb{E}[|X|^p]$, since $\|\mathbf{x}\|_p^p = \sum_{j=1}^{N} |x^{(j)}|^p \approx N\, \mathbb{E}[|X|^p]$ for large $N$. For heavy-tailed α-stable densities, one convenient method for parameter estimation is via fractional lower order moments (FLOM) $\mathbb{E}[|X|^p]$ with $p < \alpha$ (Hardin Jr, 1984; Nikias and Shao, 1995). Later, we will discuss FLOM-based parameter learning in greater detail (Section 2.4.1).

2.2 α-Stable Graphical Models

We can now introduce Bayesian network models constructed from stable densities that have compact representations for the characteristic function. Univariate α-stable densities can be generalized to multivariate stable distributions, which are defined as follows (Samorodnitsky and Taqqu, 1994).

An $n$-dimensional multivariate stable distribution over $\mathbf{X} \in \mathbb{R}^n$ is defined by an $\alpha \in (0, 2]$, a location vector $\boldsymbol{\delta} \in \mathbb{R}^n$ and a spectral measure $\Lambda$ over the $(n-1)$-dimensional unit sphere $\mathbb{S}_n = \{\mathbf{s} \in \mathbb{R}^n : \|\mathbf{s}\|_2 = 1\}$, such that the characteristic function is

$$\mathbb{E}\left[e^{i\boldsymbol{\theta}^\top \mathbf{X}}\right] = \exp\left(-\int_{\mathbb{S}_n} \psi_\alpha\!\left(\boldsymbol{\theta}^\top \mathbf{s}\right) \Lambda(d\mathbf{s}) + i\boldsymbol{\theta}^\top \boldsymbol{\delta}\right)$$

where $\psi_\alpha(u) = |u|^\alpha\left(1 - i\,\mathrm{sign}(u)\tan\frac{\pi\alpha}{2}\right)$ for $\alpha \neq 1$ and $\psi_1(u) = |u|\left(1 + i\,\tfrac{2}{\pi}\,\mathrm{sign}(u)\ln|u|\right)$.

An α-stable graphical (α-SG) model $M = (\mathcal{G}, \Theta)$ is a probability distribution over $\mathbf{X} = \{X_1, \dots, X_n\}$ such that

$$X_i = \sum_{X_k \in \mathbf{Pa}_i} w_{ik} X_k + Z_i, \qquad Z_i \sim S_\alpha(\beta_i, \gamma_i, \delta_i)$$

where $\mathbf{Pa}_i$ are the parent nodes of $X_i$ in the directed acyclic graph $\mathcal{G}$, the noise variables $Z_i$ are mutually independent, and $\Theta = \{(w_{ik}, \beta_i, \gamma_i, \delta_i) : i = 1, \dots, n\}$ describes the distribution parameters.

It is straightforward to see that $M$ is indeed a Bayesian network.
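To make the definition concrete, the following sketch draws samples from a toy α-SG model by ancestral sampling: each variable is a weighted sum of its parents plus independent symmetric stable noise. The graph, weights and dispersions are illustrative choices only, and SciPy's levy_stable (with scale $\sigma = \gamma^{1/\alpha}$) stands in for the stable noise generator.

  import numpy as np
  from scipy.stats import levy_stable

  rng = np.random.default_rng(0)
  alpha, N = 1.5, 10_000
  parents = {0: [], 1: [(0, 0.8)], 2: [(1, -0.5)]}   # child -> [(parent, weight)]
  gamma = {0: 1.0, 1: 0.5, 2: 0.25}                  # noise dispersions gamma_i

  X = np.zeros((N, 3))
  for i in range(3):  # indices already follow a topological ordering
      noise = levy_stable.rvs(alpha, 0.0, scale=gamma[i] ** (1.0 / alpha),
                              size=N, random_state=rng)
      X[:, i] = sum(w * X[:, k] for k, w in parents[i]) + noise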

Lemma 2

$M = (\mathcal{G}, \Theta)$ in Definition 2.2 represents a Bayesian network.

Let $\mathbf{Z} = \{Z_1, \dots, Z_n\}$. First note that every directed acyclic graph can be used to infer an ordering (not necessarily unique) on the variables in $\mathbf{X}$ such that all parents of each variable have a lower order than the variable itself. Suppose we index each variable with its order in an ordering compatible with the DAG, such that $X_i$ has order $i$. The proof rests on the fact that the transformation matrix from $\mathbf{Z}$ to $\mathbf{X}$ for such a graph is lower triangular, with each diagonal entry equal to 1. Since the determinant of a triangular matrix equals the product of its diagonal entries, the Jacobian for the transformation (the determinant of the transformation matrix) is $|J| = 1$. Furthermore, since the noise variables $Z_i$ are independent of each other,

$$P(\mathbf{X}) = P(\mathbf{Z})\,|J| = \prod_{i=1}^{n} P(Z_i) = \prod_{i=1}^{n} P\!\left(X_i \mid \mathbf{Pa}_i\right)$$

Hence, $M$ is a Bayesian network. Before establishing the fact that an α-SG model is a multivariate stable density in the sense of Definition 2.2, we prove the following result (proof provided in Appendix A):

Lemma 3

Every $n$-dimensional distribution with a characteristic function of the form

$$\mathbb{E}\left[e^{i\boldsymbol{\theta}^\top \mathbf{X}}\right] = \exp\left(-\sum_{k=1}^{K} \gamma_k\, \psi_\alpha\!\left(\boldsymbol{\theta}^\top \mathbf{v}_k\right) + i\boldsymbol{\theta}^\top \boldsymbol{\delta}\right), \qquad \mathbf{v}_k \in \mathbb{R}^n,\ \gamma_k > 0$$

represents a multivariate stable distribution with a finite spectral measure $\Lambda$, i.e. a measure concentrated on a finite set of points on the unit sphere $\mathbb{S}_n$.

We are now in a position to establish that α-SG models imply a multivariate stable density with a spectral measure concentrated on a finite number of points over the unit sphere.

Lemma 4

Every α-SG model represents a multivariate stable distribution with a finite spectral measure of the form in Lemma 3.

We will prove the lemma by induction. First, observe that every Bayesian network can be used to assign an ordering (not unique) such that all parents of each variable precede it. As before, we will use such an ordering to index each random variable in $\mathbf{X}$, such that $X_n$ has no descendants. The base case of the lemma, $n = 1$, is clearly true, since $X_1 = Z_1$ is a univariate stable variable. Assume that the lemma is true for all Bayesian networks with $n - 1$ random variables. Then for any Bayesian network with $n$ random variables, substituting $X_n = \sum_{X_k \in \mathbf{Pa}_n} w_{nk} X_k + Z_n$ into the characteristic function $\mathbb{E}[e^{i\boldsymbol{\theta}^\top \mathbf{X}}]$ factors it into the characteristic function of $(X_1, \dots, X_{n-1})$, evaluated at a linearly transformed argument, multiplied by the characteristic function of the independent stable noise $Z_n$.

Since, by assumption, the characteristic function of $(X_1, \dots, X_{n-1})$ has the form in Lemma 3, the product again has that form.

Therefore, $P(\mathbf{X})$ represents an $n$-dimensional multivariate stable distribution with a finite spectral measure (Lemma 3). Therefore, by induction, every α-SG model represents a multivariate stable distribution with a finite spectral measure of the form in Lemma 3.

2.3 Learning α-SG Models

It is straightforward to use the characterization of stable random variables in Definition 2.1 to verify the following well-known properties (Samorodnitsky and Taqqu, 1994).

Property 1

If $X_1 \sim S_\alpha(\beta_1, \gamma_1, \delta_1)$ and $X_2 \sim S_\alpha(\beta_2, \gamma_2, \delta_2)$ are independent stable random variables, then $X_1 + X_2 \sim S_\alpha(\beta, \gamma, \delta)$, with

$$\beta = \frac{\beta_1 \gamma_1 + \beta_2 \gamma_2}{\gamma_1 + \gamma_2}, \qquad \gamma = \gamma_1 + \gamma_2, \qquad \delta = \delta_1 + \delta_2$$

Property 2

If $X \sim S_\alpha(\beta, \gamma, \delta)$ and $a \in \mathbb{R} \setminus \{0\}$, then for $\alpha \neq 1$,

$$aX \sim S_\alpha\!\left(\mathrm{sign}(a)\,\beta,\ |a|^\alpha \gamma,\ a\delta\right)$$

while for $\alpha = 1$, $aX \sim S_1\!\left(\mathrm{sign}(a)\,\beta,\ |a|\gamma,\ a\delta - \tfrac{2}{\pi}\, a \gamma \beta \ln|a|\right)$.
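A quick Monte Carlo sanity check of Property 1 for the symmetric case (a sketch, again assuming SciPy's levy_stable with scale $\sigma = \gamma^{1/\alpha}$): the empirical characteristic function of $X_1 + X_2$ should match the stable characteristic function with combined dispersion $\gamma_1 + \gamma_2$.

  import numpy as np
  from scipy.stats import levy_stable

  rng = np.random.default_rng(1)
  alpha, g1, g2, N = 1.4, 1.0, 2.0, 200_000
  x1 = levy_stable.rvs(alpha, 0.0, scale=g1 ** (1 / alpha), size=N, random_state=rng)
  x2 = levy_stable.rvs(alpha, 0.0, scale=g2 ** (1 / alpha), size=N, random_state=rng)
  s = x1 + x2

  theta = np.linspace(0.1, 2.0, 5)
  ecf = np.exp(1j * np.outer(theta, s)).mean(axis=1)    # empirical CF of the sum
  cf = np.exp(-(g1 + g2) * np.abs(theta) ** alpha)      # gamma = g1 + g2, beta = 0
  print(np.abs(ecf - cf).max())                         # small, up to Monte Carlo error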

A popular method for structure learning in Bayesian network models is based on the Bayesian information criterion (BIC), which is also equivalent to the minimum description length (MDL) principle (Schwarz, 1978; Heckerman et al., 2000).

Given a data set $D = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$, the Bayesian information score for a Bayesian network $M = (\mathcal{G}, \Theta)$ is defined as

$$\mathrm{BIC}(M : D) = \sum_{j=1}^{N} \log P\!\left(\mathbf{x}^{(j)} \mid \mathcal{G}, \Theta\right) - \frac{|\Theta|}{2} \log N$$

The Bayesian information criterion (BIC) selects the Bayesian network that maximizes this score over the space of all directed acyclic graphs $\mathcal{G}$ and parameters $\Theta$.

The major stumbling block in using stable densities is that there is no known closed-form analytical expression for them (apart from the special cases representing the Gaussian, Cauchy and Lévy distributions). This makes BIC-based inference computationally demanding due to the likelihood term in the score. One main contribution of this paper is an efficient method for learning the network structure and parameters of α-SG models. The next lemma establishes a new result that is useful in efficiently solving the learning problem.

Lemma 5

Given a data set $D = \{x^{(1)}, \dots, x^{(N)}\}$ generated from a stable random variable $X \sim S_\alpha(\beta, \gamma, \delta)$,

$$\sum_{j=1}^{N} \log P\!\left(x^{(j)} \mid \alpha, \beta, \gamma, \delta\right) = -\frac{N}{\alpha} \log \gamma + L(\tilde{D})$$

Since $D$ includes samples from a stable distribution, by definition, performing the change of variable

$$\tilde{x} = \frac{x - \delta}{\gamma^{1/\alpha}} \qquad (2)$$

we get the standard form density $S_\alpha(\beta, 1, 0)$ using Property 2. Furthermore, samples from the transformed data set $\tilde{D} = \{\tilde{x}^{(1)}, \dots, \tilde{x}^{(N)}\}$ are distributed according to the standard density $f_{\alpha, \beta}$, i.e. the density of $S_\alpha(\beta, 1, 0)$.

This implies that if we know the parameters $\alpha$, $\beta$, $\gamma$ and $\delta$ for the density generating $D$, the log-likelihood decomposes as in the statement of the lemma, where $L(\tilde{D})$ is defined by

$$L(\tilde{D}) = \sum_{j=1}^{N} \log f_{\alpha, \beta}\!\left(\tilde{x}^{(j)}\right) \qquad (3)$$

Here $x^{(j)}$ and $\tilde{x}^{(j)}$ are related via Equation 2 for all $j$. Note that since the transformed variables are samples from $f_{\alpha, \beta}$, we have the following asymptotic result (by the law of large numbers) for large $N$:

$$\frac{1}{N} L(\tilde{D}) \longrightarrow -H(\alpha, \beta)$$

where $H(\alpha, \beta)$ is the entropy of the corresponding standard stable random variable. As things stand, the entropy of stable random variables in the standard form is just as difficult to compute as the original log-likelihood, and the previous lemma has merely transformed one intractable quantity into another. However, there is an important class of models for which we can ignore the entropy term during structure learning. These multivariate distributions have the special property that every linear combination of their random variables is distributed as a stable distribution with the same $\alpha$ and $\beta$. One scenario where this is true is when every noise term is symmetric, i.e. $\beta_i = 0$ for all $i$. This special case is important since we later show (Lemma 8) that every α-SG model can be easily transformed into a partner symmetric α-SG model with identical network topology and regression coefficients. For all practical purposes, learning the structure of symmetric α-SG models is therefore effectively the same as learning the structure of arbitrary α-SG models.

Lemma 6

Given a symmetric α-stable graphical model for the variables in $\mathbf{X}$, every linear combination $\mathbf{w}^\top \mathbf{X}$ is a symmetric stable random variable with the same characteristic exponent $\alpha$ and skew $\beta = 0$.

The dispersion and skewness for the projection $\mathbf{w}^\top \mathbf{X}$ of any $n$-dimensional stable random density are given by (Samorodnitsky and Taqqu, 1994)

$$\gamma(\mathbf{w}) = \int_{\mathbb{S}_n} \left|\mathbf{w}^\top \mathbf{s}\right|^\alpha \Lambda(d\mathbf{s}), \qquad \beta(\mathbf{w}) = \frac{1}{\gamma(\mathbf{w})} \int_{\mathbb{S}_n} \left|\mathbf{w}^\top \mathbf{s}\right|^\alpha \mathrm{sign}\!\left(\mathbf{w}^\top \mathbf{s}\right) \Lambda(d\mathbf{s})$$

Since $M$ represents a symmetric α-stable graphical model, Lemma 4 implies that its spectral measure is finite and symmetric, so the integral in the numerator of $\beta(\mathbf{w})$ vanishes and $\beta(\mathbf{w}) = 0$ for every $\mathbf{w}$.

We are now in a position to present the main contribution of this paper: an alternative criterion for model selection that is both computationally efficient and comes with robust theoretical guarantees (Lemma 7). The criterion is called the minimum dispersion criterion (MDC) and is a penalized version of a technique previously used in the signal processing literature for designing filters for heavy-tailed noise (Stuck, 1978).

Given a data set $D$, the penalized dispersion score for a Bayesian network $M = (\mathcal{G}, \Theta)$ is defined as

$$\mathrm{DS}(M : D) = -\frac{N}{\alpha} \sum_{i=1}^{n} \log \gamma_i - \frac{|\Theta|}{2} \log N$$

where $\gamma_i$ is the dispersion of the residual noise $Z_i = X_i - \sum_{X_k \in \mathbf{Pa}_i} w_{ik} X_k$. The minimum dispersion criterion (MDC) selects the Bayesian network that maximizes this score over the space of all directed acyclic graphs $\mathcal{G}$ and parameters $\Theta$.
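A minimal sketch of this score as written above (dispersion_score is a hypothetical helper; the residual dispersions are assumed to have been estimated already, e.g. via the FLOM estimator of Section 2.4.1):

  import numpy as np

  def dispersion_score(gammas, alpha, n_params, N):
      # Penalized dispersion score DS(M : D) for estimated family dispersions
      # `gammas`, characteristic exponent `alpha`, total parameter count
      # `n_params` and sample size `N`.
      return -(N / alpha) * np.sum(np.log(gammas)) - 0.5 * n_params * np.log(N)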

Lemma 7

Given a data set $D$ generated by a symmetric α-stable graphical model, the minimum dispersion criterion is asymptotically equivalent to the Bayesian information criterion over the search space of all symmetric α-stable graphical models.

First consider the contribution to the BIC score from each family (i.e., each random variable and its parents) separately. Let $\Theta$ be any arbitrary set of regression coefficients for a candidate network $\mathcal{G}$. Note that the coefficients need not be the true regression coefficients and $\mathcal{G}$ need not be the true network. We will use the notation $x_i^{(j)}$ for the instantiation of $X_i$ in sample $j$. Since $D$ includes samples from a symmetric α-stable graphical model, Lemma 6 implies that each residual $z_i^{(j)} = x_i^{(j)} - \sum_{X_k \in \mathbf{Pa}_i} w_{ik}\, x_k^{(j)}$ is a sample from a symmetric stable density with the same $\alpha$. Therefore, using Lemma 5,

$$\mathrm{BIC}(M : D) = -\frac{N}{\alpha} \sum_{i=1}^{n} \log \gamma_i + \sum_{i=1}^{n} L(\tilde{Z}_i) - \frac{|\Theta|}{2} \log N$$

where, as in Equation 3, $z_i^{(j)}$ and $\tilde{z}_i^{(j)}$ are related by the transformation in Equation 2.

Since the asymptotic limit $\frac{1}{N} \sum_i L(\tilde{Z}_i) \to -n H(\alpha, 0)$ is independent of the candidate network structure $\mathcal{G}$ and regression parameters $\Theta$, we get the result that for any pair of networks $M_1$ and $M_2$,

$$\mathrm{BIC}(M_1 : D) - \mathrm{BIC}(M_2 : D) \longrightarrow \mathrm{DS}(M_1 : D) - \mathrm{DS}(M_2 : D)$$

Therefore, asymptotically, MDC is equivalent to BIC when the data is generated by a symmetric α-SG model. We now show how samples from any stable graphical model can be combined to yield samples from a partner symmetric stable graphical model with identical parameters and network topology. This transformation was earlier used by Kuruoglu (2001) in order to estimate parameters of skewed univariate stable densities. We should point out that this symmetrization procedure has the drawback that the symmetrized data set has half the sample size.

Lemma 8

Every α-SG model can be associated with a symmetric α-SG model with identical skeleton and regression parameters.

Given a data set $D = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}$ representing any α-SG model $M = (\mathcal{G}, \Theta)$, consider a resampled data set $\tilde{D}$ with variable instantiations

$$\tilde{\mathbf{x}}^{(j)} = \mathbf{x}^{(2j-1)} - \mathbf{x}^{(2j)}, \qquad j = 1, \dots, \lfloor N/2 \rfloor$$

These 'bootstrapped' data samples represent independent instantiations of the random variables $\tilde{X}_i = X_i - X_i'$, where $X_i'$ is an independent copy of $X_i$. Similarly, we may use the regression parameters to define resampled noise variables:

$$\tilde{Z}_i = \tilde{X}_i - \sum_{X_k \in \mathbf{Pa}_i} w_{ik} \tilde{X}_k$$

We now make two observations:

  1. If $Z_i \sim S_\alpha(\beta_i, \gamma_i, \delta_i)$, then using Properties 1 and 2, $\tilde{Z}_i \sim S_\alpha(0, 2\gamma_i, 0)$.

  2. The transformed noise variables $\tilde{Z}_i$ are independent of each other.

But these conditions define an α-SG model (Definition 2.2). Therefore, by Lemma 2, the resampled data is distributed according to a Bayesian network $\tilde{M} = (\mathcal{G}, \tilde{\Theta})$ with the same skeleton and regression coefficients, and with symmetric noise $\tilde{Z}_i \sim S_\alpha(0, 2\gamma_i, 0)$.
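In practice, the symmetrization of Lemma 8 amounts to differencing disjoint pairs of samples; a minimal sketch (symmetrize is a hypothetical helper name):

  import numpy as np

  def symmetrize(D):
      # Pairwise-difference the rows of an (N, n) data matrix, yielding
      # N // 2 samples from the partner symmetric alpha-SG model (Lemma 8).
      # Note the halved sample size.
      N = (D.shape[0] // 2) * 2
      return D[0:N:2] - D[1:N:2]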

2.4 The StabLe Algorithm

In this section we describe StabLe, an algorithm for learning the structure and parameters of α-SG models (Algorithm 1). The first step of StabLe is to center and symmetrize the entire data matrix in terms of the variables $\tilde{X}_i$, as described in Lemma 8. This is followed by estimating the global parameter $\alpha$ using the method of log statistics (Kuruoglu, 2001). Finally, structure learning is performed by a modified version of the ordering-based search (OBS) algorithm (Section 2.4.2). The details of the parameter estimation and structure learning algorithms are discussed next.

  Input: data matrix $D$, number of random restarts Nreps
  Output: α-SG model $M = (\mathcal{G}, \Theta)$ over $\mathbf{X}$
  $\tilde{D} \leftarrow$ Symmetrize($D$)                // Symmetrize the data as per Lemma 8
  Estimate $\alpha$ from $\tilde{D}$                    // Use log-statistics, Equation 4
  Initialize $M^* \leftarrow \emptyset$, $S^* \leftarrow -\infty$
  for i = 1 to Nreps do
     Initialize a random ordering $\sigma$
     $(M, S) \leftarrow$ OBS($\tilde{D}, \alpha, \sigma$)    // Ordering-based search, Algorithm 4
     if $S > S^*$ then
        $(M^*, S^*) \leftarrow (M, S)$
     end if
  end for
  return $M^*$
Algorithm 1 StabLe
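For concreteness, here is a direct Python transcription of this outer loop (a sketch only: stable_learn and ordering_based_search are hypothetical names, OBS itself (Algorithm 4) is stubbed out as a caller-supplied function, and the symmetrization and $\alpha$ estimate inline the steps described in this section):

  import numpy as np

  def stable_learn(D, n_reps, ordering_based_search, seed=None):
      # Random-restart outer loop of StabLe (Algorithm 1).
      # `ordering_based_search(D_sym, alpha, order)` stands in for OBS
      # (Algorithm 4) and should return a (model, score) pair.
      rng = np.random.default_rng(seed)
      N = (D.shape[0] // 2) * 2
      D_sym = D[0:N:2] - D[1:N:2]                        # symmetrize (Lemma 8)
      v = np.var(np.log(np.abs(D_sym) + 1e-12), axis=0)  # log statistics
      alpha = float(np.mean(np.maximum((6.0 / np.pi ** 2) * v - 0.5, 1e-12) ** -0.5))
      best_model, best_score = None, -np.inf
      for _ in range(n_reps):
          order = rng.permutation(D.shape[1])            # random ordering restart
          model, score = ordering_based_search(D_sym, alpha, order)
          if score > best_score:
              best_model, best_score = model, score
      return best_model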

2.4.1 Parameter Learning

First, we describe the algorithms StabLe uses to estimate the characteristic exponent $\alpha$ from the data matrix $D$, as well as the dispersion parameters $\gamma_i$ and regression coefficients $w_{ik}$ for any given directed acyclic graph $\mathcal{G}$.

Estimating the global parameter $\alpha$:

Log statistics can be used to estimate the characteristic exponent $\alpha$ from the centered and symmetrized variables in $\tilde{D}$ (Kuruoglu, 2001).

Algorithm:

Since every linear combination of the variables in $\tilde{\mathbf{X}}$ has the same $\alpha$, if we define $Y_i = \log\left|\tilde{X}_i\right|$, then (Kuruoglu, 2001)

$$\mathrm{Var}(Y_i) = \frac{\pi^2}{6}\left(\frac{1}{\alpha^2} + \frac{1}{2}\right) \quad \Longrightarrow \quad \hat{\alpha} = \left(\frac{6}{\pi^2}\,\widehat{\mathrm{Var}}(Y_i) - \frac{1}{2}\right)^{-1/2} \qquad (4)$$
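A minimal implementation of this estimator (a sketch; estimate_alpha is a hypothetical name, and averaging the per-column estimates and clipping to the valid range $(0, 2]$ are illustrative choices not specified above):

  import numpy as np

  def estimate_alpha(D_sym, eps=1e-12):
      # Log-statistics estimate of alpha (Equation 4) from centered,
      # symmetrized data: Var(log|X|) = (pi**2 / 6) * (1/alpha**2 + 1/2).
      v = np.var(np.log(np.abs(D_sym) + eps), axis=0)    # per-column Var(log|X|)
      inv_alpha_sq = np.maximum((6.0 / np.pi ** 2) * v - 0.5, eps)
      return float(np.clip(np.mean(inv_alpha_sq ** -0.5), 0.1, 2.0))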
Estimating the dispersion $\gamma_i$ and regression parameters $w_{ik}$:

If $\gamma_i(\mathbf{w}_i)$ is the dispersion parameter for the distribution of the residual $X_i - \mathbf{w}_i^\top \mathbf{Pa}_i$, then the minimum dispersion criterion selects regression parameters

$$\hat{\mathbf{w}}_i = \arg\min_{\mathbf{w}_i}\, \gamma_i(\mathbf{w}_i)$$

Minimum dispersion regression coefficients are estimated using a connection between the $\ell_p$-norm of a stable random variable and the dispersion parameter $\gamma$ (Zolotarev, 1957; Kuruoglu, 2001).

Lemma 9

If $X \sim S_\alpha(0, \gamma, 0)$ and $0 < p < \alpha$, then

$$\mathbb{E}\left[|X|^p\right] = C(p, \alpha)\, \gamma^{p/\alpha}$$

where

$$C(p, \alpha) = \frac{2^{p+1}\, \Gamma\!\left(\frac{p+1}{2}\right) \Gamma\!\left(-\frac{p}{\alpha}\right)}{\alpha \sqrt{\pi}\, \Gamma\!\left(-\frac{p}{2}\right)}$$

Therefore, to within the constant term $C(p, \alpha)$, minimizing $\gamma_i$ is identical to minimizing the $\ell_p$-norm of the residuals for any $0 < p < \alpha$.
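Lemma 9 yields a plug-in dispersion estimate from the empirical $\ell_p$ moment; a sketch (flom_dispersion is a hypothetical name, and the default $p = \alpha/2$ is an illustrative choice):

  import numpy as np
  from scipy.special import gamma as gamma_fn

  def flom_dispersion(z, alpha, p=None):
      # FLOM dispersion estimate for centered symmetric stable samples z,
      # using E|X|**p = C(p, alpha) * dispersion**(p / alpha), 0 < p < alpha.
      p = 0.5 * alpha if p is None else p
      C = (2.0 ** (p + 1) * gamma_fn((p + 1) / 2) * gamma_fn(-p / alpha)
           / (alpha * np.sqrt(np.pi) * gamma_fn(-p / 2)))
      return (np.mean(np.abs(z) ** p) / C) ** (alpha / p)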

  Input: $N$-dimensional vector $\mathbf{y}$ of instantiations of the child node $X_i$, $N \times k$ matrix $A$ of instantiations of the parent set $\mathbf{Pa}_i$, tolerance $\epsilon$ and $p < \alpha$
  Output: $k$-dimensional vector $\mathbf{w}$ of regression coefficients
  Initialize $\mathbf{w}$ with OLS coefficients
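The remainder of the routine follows the classical IRLS scheme for least-$\ell_p$ regression; below is a minimal sketch under stated assumptions (OLS initialization as above; the $|r|^{p-2}$ reweighting with a small floor is the standard device, but the exact update and stopping rule here are illustrative rather than the paper's algorithm):

  import numpy as np

  def irls_lp(y, A, p, eps=1e-6, max_iter=100):
      # Least-l_p regression coefficients via iteratively re-weighted least
      # squares: approximately minimize sum(|y - A @ w|**p) for 0 < p < 2.
      w = np.linalg.lstsq(A, y, rcond=None)[0]         # OLS initialization
      for _ in range(max_iter):
          r = y - A @ w
          weights = np.maximum(np.abs(r), 1e-8) ** (p - 2.0)
          W = A * weights[:, None]                     # diag(weights) @ A
          w_new = np.linalg.solve(A.T @ W, W.T @ y)    # weighted normal equations
          if np.max(np.abs(w_new - w)) < eps:
              return w_new
          w = w_new
      return w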