1 Introduction
Stable distributions have found applications in modeling several real-life phenomena (Berger and Mandelbrot, 1963; Mandelbrot, 1963; Nikias and Shao, 1995; Gallardo et al., 2000; Achim et al., 2001) and have robust theoretical justification in the form of the generalized central limit theorem (Feller, 1968; Nikias and Shao, 1995; Nolan, 2013). Several special instances of multivariate generalizations of stable distributions have also been described in the literature (Samorodnitsky and Taqqu, 1994; Nolan and Rajput, 1995). Multivariate stable densities have previously been applied to modeling wavelet coefficients with bivariate stable distributions (Achim and Kuruoglu, 2005), to inferring parameters for linear models of network flows (Bickson and Guestrin, 2011) and to modeling stock market fluctuations (Bonato, 2012).
In this paper, we describe stable graphical (SG) models, a new class of multivariate stable densities that can be represented as directed acyclic graphs (DAGs) with arbitrary network topologies. We prove that these multivariate densities also correspond to linear regression-based Bayesian networks and establish a model selection criterion that is asymptotically equivalent to the Bayesian information criterion (BIC). Using simulated data for five benchmark network topologies, we empirically show how SG models improve structure and parameter learning performance for linear regression networks with additive heavy-tailed noise.
One motivation for the present work comes from potential applications to computational biology, especially in genomics, where Bayesian network models of gene expression profiles are a popular tool (Friedman et al., 2000; Ben-Dor et al., 2000; Friedman, 2004). A common approach to network models of gene expression involves learning linear regression-based Gaussian graphical models. However, the distribution of experimental microarray intensities shows a clear skew and may not be best described by a Gaussian density (Section 3.2). Another aspect of microarray intensities is that they represent the average mRNA concentration in a population of cells. Assuming the numbers of mRNA transcripts within individual cells to be independent and identically distributed, the generalized central limit theorem suggests that the observed shape should asymptotically (for large population size) approach a stable density (Feller, 1968; Nikias and Shao, 1995; Nolan, 2013). Univariate stable distributions have previously been used to model gene expression data (Salas-Gonzalez et al., 2009a,b) and it is therefore natural to consider multivariate stable densities as models for mRNA expression for larger sets of genes. In Section 3.2 we provide empirical evidence to support this reasoning. We further develop stable graphical (SG) models for quantifying differential expression of genes from microarray data belonging to phase III of the HapMap project (International HapMap 3 Consortium and others, 2010; Montgomery et al., 2010; Stranger et al., 2012).
The rest of the paper is structured as follows: Section 2.1 describes the basic notation and background concepts for Bayesian networks and stable densities. Section 2.2 introduces SG models and establishes that these models are Bayesian networks that also represent multivariate stable distributions with finite spectral measures. Section 2.3 establishes the equivalence of the popular but (in this case) computationally challenging Bayesian information criterion (BIC) for structure learning and the computationally more tractable minimum dispersion criterion (MDC), for all SG models that represent symmetric densities. Furthermore, we establish how data samples from any SG model can be combined to generate samples from a partner symmetric SG model with identical network topology and regression coefficients. Using these theoretical results we design StabLe, an efficient algorithm that combines ordering-based search (OBS) (Teyssier and Koller, 2005) for structure learning with the iteratively reweighted least squares (IRLS) algorithm (Byrd and Payne, 1979) for learning the regression parameters via least $\ell_p$ norm estimation. Finally, in Section 3 we apply the structure and parameter learning algorithms to simulated and expression microarray data sets.

2 Methods
In this section we develop the theory and algorithms for learning SG models from data. First, we discuss some well-established results for Bayesian networks and stable densities.
2.1 Background
We begin with an introduction to Bayesian network models (Pearl, 1988) for the joint probability distribution of a finite set of random variables $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$. A Bayesian network $\mathcal{B} = (\mathcal{G}, \theta)$ is specified by a directed acyclic graph (DAG) $\mathcal{G}$, whose vertices represent the random variables in $\mathbf{X}$, and a set of parameters $\theta$ that determine the conditional probability distribution $P(X_i \mid \mathbf{Pa}_i)$ for each variable $X_i$ given the state of its parents $\mathbf{Pa}_i$ in $\mathcal{G}$ (Koller and Friedman, 2009). We will overload the symbols $X_i$ and $\mathbf{Pa}_i$ to represent both sets of random variables and their instantiations. The directed acyclic graph implies a factorization of the joint probability density into terms representing each variable and its parents (called a family) such that:

(1)  $P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathbf{Pa}_i)$
The dependence of $X_i$ on $\mathbf{Pa}_i$ is usually specified by an appropriately chosen family of parametrized probability densities for the random variables, such as the Gaussian (Normal) density. In this paper, we will use multivariate stable densities to model the random variables in $\mathbf{X}$. The primary motivation for modeling continuous random variables using stable distributions comes from the generalization of the central limit theorem to distributions with unbounded variance (Feller, 1968; Nikias and Shao, 1995): in the limit of large $N$, all suitably normalized sums of $N$ independent, identically distributed random variables approach a stable density. A formal definition of stable random variables can be provided in terms of the characteristic function (the Fourier transform of the density function).
A stable random variable $X \sim S(\alpha, \beta, \gamma, \delta)$ is defined for each $\alpha \in (0, 2]$, $\beta \in [-1, 1]$, $\gamma > 0$ and $\delta \in \mathbb{R}$. The probability density is implicitly specified by a characteristic function:

$E\left[e^{itX}\right] = \begin{cases} \exp\left(-\gamma |t|^{\alpha} \left[1 - i\beta\, \mathrm{sign}(t) \tan\frac{\pi\alpha}{2}\right] + i\delta t\right) & \alpha \neq 1 \\ \exp\left(-\gamma |t| \left[1 + i\beta\, \mathrm{sign}(t) \frac{2}{\pi} \log|t|\right] + i\delta t\right) & \alpha = 1 \end{cases}$
The parameters $\alpha$, $\beta$, $\gamma$ and $\delta$ will be called the characteristic exponent, skew, dispersion and location, respectively. Unfortunately, the density does not have a closed-form analytical expression except for the three well-known stable distributions (Figure 1 and Table 1).
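Although the density lacks a closed form, samples can be drawn directly from the characteristic-function parameterization. The following sketch is a minimal pure-Python implementation of the Chambers-Mallows-Stuck method for the symmetric standard case ($\beta = 0$, $\gamma = 1$, $\delta = 0$); it is our own illustration, and the function name is not from the paper.

```python
import math
import random

def sample_sas(alpha, n, seed=0):
    """Draw n samples from a standard symmetric alpha-stable density
    (beta = 0, dispersion 1, location 0) via Chambers-Mallows-Stuck."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        v = math.pi * (rng.random() - 0.5)   # Uniform(-pi/2, pi/2)
        w = -math.log(1.0 - rng.random())    # Exponential(1)
        x = (math.sin(alpha * v) / math.cos(v) ** (1.0 / alpha)
             * (math.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha))
        samples.append(x)
    return samples

# alpha = 1 recovers the Cauchy case (the formula reduces to tan(v));
# alpha = 2 recovers the Gaussian case.
cauchy_like = sample_sas(1.0, 1000)
```

For $\alpha = 1$ the exponent $(1-\alpha)/\alpha$ vanishes and the generator reduces to $\tan(V)$, the standard Cauchy sampler.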
Except for the Gaussian case, the asymptotic (large $x$) behavior of univariate stable densities shows Pareto or power-law tails (Lévy, 1925). The following lemma formalizes this observation (Samorodnitsky and Taqqu, 1994; Nolan, 2013).
Lemma 1
If $X \sim S(\alpha, \beta, \gamma, \delta)$ with $\alpha < 2$, then $P(X > x) \sim C_{\alpha} (1 + \beta)\, \gamma\, x^{-\alpha}$ as $x \to \infty$, where $C_{\alpha}$ is a constant that depends only on $\alpha$.
Distribution | $\alpha$ | $\beta$ | Support
Lévy | $1/2$ | $1$ | $x \in (\delta, \infty)$
Cauchy | $1$ | $0$ | $x \in (-\infty, \infty)$
Normal | $2$ | $0$ | $x \in (-\infty, \infty)$
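Lemma 1 can be checked numerically in the Cauchy case, where the tail probability has the closed form $P(X > x) = 1/2 - \arctan(x)/\pi$ and the power-law prediction is $1/(\pi x)$. The short sketch below is our own illustration, not part of the paper:

```python
import math

# Exact Cauchy tail probability vs. the x^{-alpha} power-law prediction
# of Lemma 1 (alpha = 1, beta = 0, gamma = 1): the ratio tends to 1.
for x in (10.0, 100.0, 1000.0):
    exact = 0.5 - math.atan(x) / math.pi
    power_law = 1.0 / (math.pi * x)
    print(x, exact, power_law, exact / power_law)
```

At $x = 1000$ the two already agree to better than one part in a thousand.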
A word on the notation used throughout this paper. We will use the symbol $\|v\|_p$ to represent the $\ell_p$ norm of a vector $v$. The $\ell_p$ norm of a vector of $N$ instantiations of a random variable $X$ is related to the fractional moment $E|X|^p$. For heavy-tailed stable densities, one convenient method for parameter estimation is via fractional lower order moments (FLOM) $E|X|^p$ for $0 < p < \alpha$ (Hardin Jr, 1984; Nikias and Shao, 1995). Later, we will discuss FLOM-based parameter learning in greater detail (Section 2.4.1).

2.2 Stable Graphical Models
We can now introduce Bayesian network models constructed from stable densities that have compact representations for the characteristic function. Univariate stable densities can be generalized to multivariate stable distributions, which are defined as follows (Samorodnitsky and Taqqu, 1994).
An $n$-dimensional multivariate stable distribution over $\mathbf{X} = (X_1, \ldots, X_n)$ is defined by a characteristic exponent $\alpha$, a location vector $\boldsymbol{\delta} \in \mathbb{R}^n$ and a finite spectral measure $\Gamma$ over the $(n-1)$-dimensional unit sphere $S_n = \{\mathbf{s} : \|\mathbf{s}\|_2 = 1\}$, such that the characteristic function (for $\alpha \neq 1$) is

$E\left[e^{i \mathbf{t} \cdot \mathbf{X}}\right] = \exp\left(-\int_{S_n} |\mathbf{t} \cdot \mathbf{s}|^{\alpha} \left[1 - i\, \mathrm{sign}(\mathbf{t} \cdot \mathbf{s}) \tan\frac{\pi\alpha}{2}\right] \Gamma(d\mathbf{s}) + i\, \mathbf{t} \cdot \boldsymbol{\delta}\right)$
A stable graphical (SG) model $\mathcal{B} = (\mathcal{G}, \theta)$ is a probability distribution over $\mathbf{X}$ such that

$X_i = \sum_{X_j \in \mathbf{Pa}_i} b_{ij} X_j + \epsilon_i,$

where $\mathbf{Pa}_i$ are the parent nodes of $X_i$ in the directed acyclic graph $\mathcal{G}$, the noise variables $\epsilon_i \sim S(\alpha, \beta_i, \gamma_i, \delta_i)$ are mutually independent, and $\theta$ describes the distribution parameters $\{b_{ij}, \beta_i, \gamma_i, \delta_i\}$.
It is straightforward to see that an SG model $\mathcal{B}$ is indeed a Bayesian network.
Lemma 2
An SG model $\mathcal{B}$ as in Definition 2.2 represents a Bayesian network.
Let $\mathbf{X} = \{X_1, \ldots, X_n\}$. First note that every directed acyclic graph can be used to infer an ordering (not necessarily unique) on the variables in $\mathbf{X}$ such that all parents of each variable have a lower order than the variable itself. Suppose we index each variable with its order in an ordering compatible with the DAG, so that $X_i$ has order $i$. The proof rests on the fact that the transformation matrix from $(\epsilon_1, \ldots, \epsilon_n)$ to $(X_1, \ldots, X_n)$ for such a graph is lower triangular, with each diagonal entry equal to 1. Since the determinant of a triangular matrix equals the product of its diagonal entries, the Jacobian of the transformation (the determinant of the transformation matrix) is $|J| = 1$. Furthermore, since the noise variables $\epsilon_i$ are independent of each other, the joint density factorizes into the product of the family terms.
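The change-of-variables step can be written out explicitly; the following is a sketch using the regression-coefficient notation $b_{ij}$ (notation assumed):

```latex
X_i = \sum_{X_j \in \mathbf{Pa}_i} b_{ij} X_j + \epsilon_i
\quad\Longrightarrow\quad
p(x_1, \ldots, x_n)
= |J| \prod_{i=1}^{n} p_{\epsilon_i}\!\Big( x_i - \sum_{X_j \in \mathbf{Pa}_i} b_{ij} x_j \Big)
= \prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_i), \qquad |J| = 1 .
```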
Hence, $\mathcal{B}$ is a Bayesian network. Before establishing the fact that an SG model is a multivariate stable density in the sense of Definition 2.2, we prove the following result (the proof is provided in Appendix A):
Lemma 3
Every $n$-dimensional distribution with a characteristic function of the form

$E\left[e^{i \mathbf{t} \cdot \mathbf{X}}\right] = \exp\left(\sum_{j=1}^{m} \left[-\gamma_j |\mathbf{a}_j \cdot \mathbf{t}|^{\alpha} \left(1 - i\beta_j\, \mathrm{sign}(\mathbf{a}_j \cdot \mathbf{t}) \tan\frac{\pi\alpha}{2}\right)\right] + i\, \mathbf{t} \cdot \boldsymbol{\delta}\right)$

represents a multivariate stable distribution with a finite spectral measure $\Gamma$ concentrated on the points $\pm\, \mathbf{a}_j / \|\mathbf{a}_j\|_2$.
We are now in a position to establish that SG models imply a multivariate stable density with a spectral measure concentrated on a finite number of points over the unit sphere.
Lemma 4
Every SG model represents a multivariate stable distribution with a finite spectral measure of the form in Lemma 3.
We will prove the lemma by induction on $n$. First, observe that every Bayesian network can be used to assign an ordering (not unique) to the variables such that all parents of a variable precede it. As before, we will use such an ordering to index each random variable in $\mathbf{X}$, so that $X_n$ has no descendants. The base case of the lemma, $n = 1$, is clearly true, since $X_1 = \epsilon_1$ is a univariate stable variable. Assume that the lemma is true for all Bayesian networks with $n - 1$ variables. Then for any Bayesian network with $n$ random variables, the characteristic function factorizes into the contribution of the family of $X_n$ and that of the remaining $n - 1$ variables.
Since the lemma holds for the first $n - 1$ variables by assumption, the full characteristic function is again of the form in Lemma 3.
Therefore, $\mathbf{X}$ represents an $n$-dimensional multivariate stable distribution with a finite spectral measure (Lemma 3). Therefore, by induction, every SG model represents a multivariate stable distribution with a finite spectral measure of the form in Lemma 3.
2.3 Learning SG Models
It is straightforward to use the characterization of stable random variables in Definition 2.1 to verify the following well-known properties (Samorodnitsky and Taqqu, 1994).
Property 1
If $X_1 \sim S(\alpha, \beta_1, \gamma_1, \delta_1)$ and $X_2 \sim S(\alpha, \beta_2, \gamma_2, \delta_2)$ are independent stable random variables, then $X_1 + X_2 \sim S(\alpha, \beta, \gamma_1 + \gamma_2, \delta_1 + \delta_2)$, with $\beta = \dfrac{\beta_1 \gamma_1 + \beta_2 \gamma_2}{\gamma_1 + \gamma_2}$.
Property 2
If $X \sim S(\alpha, \beta, \gamma, \delta)$ and $a \in \mathbb{R} \setminus \{0\}$, then $aX \sim S(\alpha,\ \mathrm{sign}(a)\beta,\ |a|^{\alpha}\gamma,\ a\delta)$ for $\alpha \neq 1$ (for $\alpha = 1$ the location acquires an additional term $-\frac{2}{\pi} \beta \gamma\, a \log|a|$).
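Property 1 is easy to check by simulation in the Cauchy case ($\alpha = 1$, $\beta = 0$), since for a symmetric Cauchy density the 0.75 quantile equals the dispersion. The sketch below is our own illustration (names are not from the paper): it sums two independent Cauchy variables with dispersions 1 and 2 and recovers a dispersion near $1 + 2 = 3$.

```python
import math
import random

rng = random.Random(1)

def cauchy(gamma):
    """One sample from a symmetric Cauchy density with dispersion gamma."""
    return gamma * math.tan(math.pi * (rng.random() - 0.5))

n = 20000
sums = sorted(cauchy(1.0) + cauchy(2.0) for _ in range(n))
# For a symmetric Cauchy with dispersion g, the 0.75 quantile equals g.
dispersion_est = sums[int(0.75 * n)]
print(dispersion_est)  # should be close to 1 + 2 = 3
```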
A popular method for structure learning in Bayesian network models is based on the Bayesian information criterion (BIC), which is also equivalent to the minimum description length (MDL) principle (Schwarz, 1978; Heckerman et al., 2000).
Given a data set $D$ with $N$ samples, the Bayesian information score for a Bayesian network $\mathcal{B} = (\mathcal{G}, \theta)$ is defined as

$S_{BIC}(\mathcal{G}, \theta) = \log P(D \mid \mathcal{G}, \theta) - \frac{d}{2} \log N,$

where $d$ is the number of free parameters in $\theta$. The Bayesian information criterion (BIC) selects the Bayesian network that maximizes this score over the space of all directed acyclic graphs $\mathcal{G}$ and parameters $\theta$.
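As a concrete, tractable instance of this score, the sketch below computes the BIC for a trivial one-node network with a Gaussian family, using the penalized form $\log P(D \mid \mathcal{G}, \hat\theta) - \frac{d}{2}\log N$ described above. This is our own illustration; the function name is not from the paper.

```python
import math

def bic_gaussian(data):
    """BIC score for a one-node network: Gaussian with ML mean and variance."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    # Log-likelihood at the ML estimates for a Gaussian sample
    loglik = -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)
    d = 2  # free parameters: mean and variance
    return loglik - 0.5 * d * math.log(n)

print(bic_gaussian([0.1, -0.2, 0.3, 0.05, -0.1]))
```

A tighter data spread yields a higher likelihood and hence a higher score, as the penalty term is unchanged.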
The major stumbling block in using stable densities is the fact that they have no known closed-form analytical expression (apart from the special cases representing the Gaussian, Cauchy and Lévy distributions). This makes BIC-based inference computationally demanding due to the marginal likelihood term $\log P(D \mid \mathcal{G}, \theta)$. One main contribution of this paper is an efficient method of learning the network structure and parameters for SG models. The next lemma establishes a new result that is useful in efficiently solving the learning problem.
Lemma 5
Given a data set $D = \{x^{(1)}, \ldots, x^{(N)}\}$ generated from a stable random variable $X \sim S(\alpha, \beta, \gamma, \delta)$,

$\log P(D \mid \alpha, \beta, \gamma, \delta) = -\frac{N}{\alpha} \log \gamma + \sum_{s=1}^{N} \log f_{\alpha, \beta}\left(z^{(s)}\right),$

where $f_{\alpha, \beta}$ is the density of the standard form $S(\alpha, \beta, 1, 0)$ and $z^{(s)}$ is given by Equation 2.
Since $D$ includes samples from a stable distribution, by definition, performing the change of variable

(2)  $z^{(s)} = \dfrac{x^{(s)} - \delta}{\gamma^{1/\alpha}}$

we get the standard form density using Property 2. Furthermore, samples from the transformed data set $Z = \{z^{(1)}, \ldots, z^{(N)}\}$ are distributed according to the standard density $S(\alpha, \beta, 1, 0)$.
This implies that if we know the parameters $\gamma$ and $\delta$ for the density generating $D$, the log-likelihood can be computed as

(3)  $\log P(D \mid \alpha, \beta, \gamma, \delta) = -\frac{N}{\alpha} \log \gamma + \sum_{s=1}^{N} \log f_{\alpha, \beta}\left(z^{(s)}\right),$

where $f_{\alpha, \beta}$ is the density of the standard form $S(\alpha, \beta, 1, 0)$.
Here $z^{(s)}$ and $x^{(s)}$ are related via Equation 2 for all $s$. Note that since the transformed variables are samples from $S(\alpha, \beta, 1, 0)$, we have the following asymptotic result for large $N$:

$\frac{1}{N} \sum_{s=1}^{N} \log f_{\alpha, \beta}\left(z^{(s)}\right) \to -H(\alpha, \beta),$

where $H(\alpha, \beta)$ is the entropy of the corresponding standard stable random variable. As things stand, the entropy of stable random variables in the standard form is just as difficult to compute as the original log-likelihood, and the previous lemma has merely transformed one intractable quantity into another. However, there is an important class of models where we can ignore the entropy term during structure learning. These multivariate distributions have the special property that every linear combination of their random variables is distributed as a stable distribution with the same $\alpha$ and $\beta$. One scenario where this is true is when the noise terms are symmetric, i.e. $\beta_i = 0$ for all $i$. This special case is important since we later show (Lemma 8) that every SG model can easily be transformed into a partner symmetric SG model with identical network topology and regression coefficients. For all practical purposes, learning the structure of symmetric SG models is effectively the same as learning the structure of arbitrary SG models.
Lemma 6
Given a symmetric stable graphical model for the variables in $\mathbf{X}$, every linear combination $\mathbf{w} \cdot \mathbf{X}$ is a symmetric stable random variable with the same characteristic exponent $\alpha$.
The dispersion and skewness of the projection $\mathbf{w} \cdot \mathbf{X}$ of any $n$-dimensional stable random density are given by (Samorodnitsky and Taqqu, 1994)

$\gamma(\mathbf{w}) = \int_{S_n} |\mathbf{w} \cdot \mathbf{s}|^{\alpha}\, \Gamma(d\mathbf{s}), \qquad \beta(\mathbf{w}) = \frac{\int_{S_n} |\mathbf{w} \cdot \mathbf{s}|^{\alpha}\, \mathrm{sign}(\mathbf{w} \cdot \mathbf{s})\, \Gamma(d\mathbf{s})}{\gamma(\mathbf{w})}.$
Since $\mathcal{B}$ represents a symmetric stable graphical model, Lemma 4 implies that its spectral measure is symmetric (concentrated on pairs of antipodal points with equal weights), so $\beta(\mathbf{w}) = 0$ for every $\mathbf{w}$.
We are now in a position to present the main contribution of this paper: an alternative criterion for model selection that is both computationally efficient and comes with robust theoretical guarantees (Lemma 7). The criterion, called the minimum dispersion criterion (MDC), is a penalized version of a technique previously used in the signal processing literature for designing filters for heavy-tailed noise (Stuck, 1978).
Given a data set $D$ with $N$ samples, the penalized dispersion score for a Bayesian network $\mathcal{B} = (\mathcal{G}, \theta)$ is defined as

$S_{MDC}(\mathcal{G}, \theta) = -\frac{N}{\alpha} \sum_{i=1}^{n} \log \gamma_i - \frac{d}{2} \log N,$

where $\gamma_i$ is the dispersion of the noise variable for the family of $X_i$ and $d$ is the number of free parameters. The minimum dispersion criterion (MDC) selects the Bayesian network that maximizes this score over the space of all directed acyclic graphs $\mathcal{G}$ and parameters $\theta$.
Lemma 7
Given a data set $D$ generated by a symmetric stable graphical model, the minimum dispersion criterion is asymptotically equivalent to the Bayesian information criterion over the search space of all symmetric stable graphical models.
First consider the contribution to the BIC score from each family (i.e., each random variable and its parents) separately. Let $\theta$ be any arbitrary set of regression coefficients for a candidate network $\mathcal{G}$. Note that the coefficients need not be the true regression coefficients and $\mathcal{G}$ need not be the true network. We will use the notation $x_i^{(s)}$ for the instantiation of $X_i$ in sample $s$. Since $D$ includes samples from a symmetric stable graphical model, Lemma 6 implies that each residual $x_i^{(s)} - \sum_j b_{ij} x_j^{(s)}$ is a sample from a symmetric stable density with the same $\alpha$. Therefore, using Lemma 5,
where, as in Equation 3, $z_i^{(s)}$ and $x_i^{(s)}$ are related by the transformation in Equation 2.
Since the entropy term is independent of the candidate network structure $\mathcal{G}$ and regression parameters $\theta$, we get the result that for any pair of networks $(\mathcal{G}_1, \theta_1)$ and $(\mathcal{G}_2, \theta_2)$, the difference in their BIC scores converges to the difference in their penalized dispersion scores.
Therefore, asymptotically, MDC is equivalent to BIC when the data are generated by a symmetric SG model. We now show how samples from any stable graphical model can be combined to yield samples from a partner symmetric stable graphical model with identical parameters and network topology. This transformation was earlier used by Kuruoglu (2001) to estimate parameters of skewed univariate stable densities. We should point out that the procedure has the drawback that the symmetrized data set has half the sample size.
Lemma 8
Every SG model can be associated with a symmetric SG model with identical skeleton and regression parameters.
Given a data set $D$ with $2N$ samples representing any SG model $\mathcal{B}$, consider a resampled data set $\tilde{D}$ with variable instantiations

$\tilde{x}_i^{(s)} = x_i^{(2s)} - x_i^{(2s-1)}, \qquad s = 1, \ldots, N.$

These 'bootstrapped' data samples represent independent instantiations of the random variables $\tilde{X}_i = X_i^{(1)} - X_i^{(2)}$, where $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ are independent copies of $\mathbf{X}$. Similarly, we may use the regression parameters to define resampled noise variables:

$\tilde{\epsilon}_i = \tilde{X}_i - \sum_{X_j \in \mathbf{Pa}_i} b_{ij} \tilde{X}_j = \epsilon_i^{(1)} - \epsilon_i^{(2)}.$
We now make two observations:

1. If $\epsilon_i \sim S(\alpha, \beta_i, \gamma_i, \delta_i)$, then using Property 1 (and Property 2 with $a = -1$), $\tilde{\epsilon}_i = \epsilon_i^{(1)} - \epsilon_i^{(2)} \sim S(\alpha, 0, 2\gamma_i, 0)$.

2. The transformed noise variables $\tilde{\epsilon}_i$ are independent of each other.
But these conditions define a symmetric SG model (Definition 2.2). Therefore, by Lemma 2, the resampled data set $\tilde{D}$ is distributed according to a Bayesian network with the same DAG and regression coefficients, and with symmetric stable noise $\tilde{\epsilon}_i \sim S(\alpha, 0, 2\gamma_i, 0)$.
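The pairing construction in the proof can be sketched as follows: differences of consecutive sample pairs symmetrize any i.i.d. noise, at the cost of halving the sample size. This is our own illustration; variable and function names are not from the paper.

```python
import math
import random

def symmetrize(samples):
    """Pair up consecutive samples and take differences: if the inputs are
    i.i.d. S(alpha, beta, gamma, delta), the outputs are i.i.d.
    S(alpha, 0, 2*gamma, 0) -- symmetric, with half the sample size."""
    return [samples[2 * k] - samples[2 * k + 1]
            for k in range(len(samples) // 2)]

# Illustration with a heavily skewed density: a Levy (alpha = 1/2, beta = 1)
# variable can be drawn as 1/Z^2 for Z standard normal.
rng = random.Random(2)
levy = [1.0 / rng.gauss(0.0, 1.0) ** 2 for _ in range(10000)]
sym = symmetrize(levy)
sym.sort()
print(sym[len(sym) // 2])  # median near 0 once symmetrized
```

The raw Lévy samples are strictly positive, while the symmetrized samples are centered on zero, as Lemma 8 requires.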
2.4 The StabLe Algorithm
In this section we describe StabLe, an algorithm for learning the structure and parameters of SG models (Algorithm 1). The first step of StabLe is to center and symmetrize the entire data matrix in terms of the variables $\tilde{X}_i$, as described in Lemma 8. This is followed by estimating the global parameter $\alpha$ using the method of log statistics (Kuruoglu, 2001). Finally, structure learning is performed by a modified version of the ordering-based search (OBS) algorithm (Section 2.4.2). The details of the parameter estimation and structure learning algorithms are discussed next.
2.4.1 Parameter Learning
First, we describe the algorithms StabLe uses to estimate the characteristic exponent $\alpha$ from the data matrix $D$, as well as the parameters $\gamma_i$ and $b_{ij}$ for any given directed acyclic graph $\mathcal{G}$.
Estimating the global parameter $\alpha$:
Log statistics can be used to estimate the characteristic exponent $\alpha$ from the centered and symmetrized variables in $\tilde{D}$ (Kuruoglu, 2001).
Algorithm:
Since every linear combination of variables in $\tilde{\mathbf{X}}$ has the same $\alpha$, if we define $Y = \log|\tilde{X}|$ for any symmetrized variable $\tilde{X}$, then

(4)  $\mathrm{Var}(Y) = \frac{\pi^2}{6} \left(\frac{1}{\alpha^2} + \frac{1}{2}\right),$

which can be inverted to estimate $\alpha$ from the sample variance of $Y$.
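A minimal sketch of this log-statistics step, using the variance relation $\mathrm{Var}(\log|X|) = \frac{\pi^2}{6}(\frac{1}{\alpha^2} + \frac{1}{2})$ for symmetric stable variables (Kuruoglu, 2001); the function name is our own.

```python
import math
import random

def estimate_alpha(samples):
    """Estimate the characteristic exponent of centered, symmetrized samples
    from the sample variance of log|X| (log-statistics method)."""
    logs = [math.log(abs(x)) for x in samples if x != 0.0]
    n = len(logs)
    mean = sum(logs) / n
    var = sum((y - mean) ** 2 for y in logs) / n
    # Invert Var(log|X|) = (pi^2/6) * (1/alpha^2 + 1/2)
    return 1.0 / math.sqrt(max(var * 6.0 / math.pi ** 2 - 0.5, 1e-12))

# Check on Cauchy (alpha = 1) samples:
rng = random.Random(3)
cauchy = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(20000)]
print(estimate_alpha(cauchy))  # close to 1
```

As a sanity check, the Gaussian case gives $\mathrm{Var}(\log|Z|) = \pi^2/8$, which the formula correctly inverts to $\alpha = 2$.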
Estimating the dispersions $\gamma_i$ and regression parameters $b_{ij}$:
If $\gamma_i(b_i)$ is the dispersion parameter for the distribution of the residual $\tilde{X}_i - \sum_{X_j \in \mathbf{Pa}_i} b_{ij} \tilde{X}_j$, then the minimum dispersion criterion selects the regression parameters $\hat{b}_i = \arg\min_{b_i} \gamma_i(b_i)$.
Minimum dispersion regression coefficients are estimated using a connection between the $\ell_p$ norm of a stable random variable and the dispersion parameter (Zolotarev, 1957; Kuruoglu, 2001).
Lemma 9
If $X \sim S(\alpha, 0, \gamma, 0)$ and $0 < p < \alpha$, then

$E|X|^p = C(p, \alpha)\, \gamma^{p/\alpha},$

where $C(p, \alpha)$ is a constant that depends only on $p$ and $\alpha$.
Therefore, to within the constant term $C(p, \alpha)$, minimizing the dispersion $\gamma_i$ is identical to minimizing the $\ell_p$ norm of the residuals for any $0 < p < \alpha$.
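A minimal sketch of least $\ell_p$ regression by iteratively reweighted least squares for the one-parent case; the weight update $w_s = |r_s|^{p-2}$ is the standard IRLS rule for $\ell_p$ minimization, and the function name and data are our own illustration, not the paper's implementation.

```python
def irls_lp(x, y, p=1.1, iters=50, eps=1e-8):
    """Fit y ~ b*x by minimizing sum |y - b*x|^p (0 < p < alpha) via IRLS."""
    # Ordinary least squares as the starting point
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    for _ in range(iters):
        # Reweight by |residual|^(p-2), floored at eps to avoid division by 0
        w = [max(abs(yi - b * xi), eps) ** (p - 2.0) for xi, yi in zip(x, y)]
        b = (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
             / sum(wi * xi * xi for wi, xi in zip(w, x)))
    return b

# Heavy-tailed noise: one gross outlier barely moves the l_p fit,
# whereas it pulls the least-squares slope far from the truth.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.0, 8.1, 50.0]   # true slope 2, last point corrupted
print(irls_lp(x, y))
```

By Lemma 9, driving down the $\ell_p$ norm of the residuals in this way also drives down the dispersion of the residual noise, which is exactly what the minimum dispersion criterion asks for.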