## 1 Introduction

Density estimation has long been a subject of paramount interest in statistics. Many parametric, nonparametric, and semiparametric methods have been developed in the literature. Assuming a known distribution family with a succinct representation and interpretable parameters, the parametric approach is in general statistically and computationally efficient [Kendall1987]. However, the parametric assumption may be too restrictive for some applications. The nonparametric approach, on the other hand, does not assume a specific form for the density function and allows its shape to be determined by the data. Methods such as kernel estimation [parzen1962estimation, silverman2018density], local likelihood estimators [loader1996local], and smoothing splines [gu2013smoothing] work well for low-dimensional multivariate density functions. When the dimension is moderate to large, existing nonparametric methods break down quickly due to the curse of dimensionality and/or computational limitations. duong2007ks pointed out that kernel density estimation is not applicable to random variables of dimension higher than six. To reduce the computational burden, jeon2006effective and gu2013nonparametric developed pseudo likelihood methods for smoothing spline density estimation. However, our experience indicates that the computation becomes almost infeasible when the dimension is higher than twelve. Consequently, contrary to the univariate case, flexible methods for multivariate density estimation are rather limited when the dimension is large. Recent work using piecewise constant and Bayesian partitions represents a major breakthrough in this area [lu2013multivariate, liu2014multivariate, li2016density]. Nevertheless, these methods can handle moderate dimensions only, lead to non-smooth density estimates, and cannot be used to investigate conditional relationships.

Some semiparametric methods have been proposed to take advantage of the parsimony of parametric models and the flexibility of nonparametric modeling. Semiparametric copula models consist of nonparametric marginal distributions and parametric copula functions [genest1995]. Projection pursuit density estimation overcomes the curse of dimensionality by representing the joint density as a product of smooth univariate functions of carefully selected linear combinations of variables [friedman1984projection]. The regularized derivative expectation operator (rodeo) method assumes the joint density equals a product of a parametric component and a nonparametric function of an unknown subset of variables [liu2007sparse]. Other semiparametric/nonparametric methods for density estimation include mixture models [richardson1997bayesian], forest density [liu2011forest], density trees [Ram11], and geometric density estimation [Dunson16]. All existing semiparametric/nonparametric methods have strengths and limitations. We will develop a new semiparametric procedure for multivariate density estimation that explores the sparse graph structure in the parametric part of the model.

Graphical models are used to characterize conditional relationships between variables, with a wide range of applications in the natural sciences, social sciences, and economics [lauritzen1996graphical, fan2016overview, friedman2008sparse]. The Gaussian graphical model (GGM) is one of the most popular models, in which conditional independence is reflected in the zero entries of the precision matrix [friedman2008sparse]. The resulting structure from a GGM can be erroneous when the true distribution is far from Gaussian. The dependence structure of non-Gaussian data received little attention until recent years. Gaussian and elliptical graphical models robustified against possible outliers were studied by miyamura2006robust, finegold2011robust, vogel2011elliptical, and sun2012robust. Graphical models based on generalized linear models were proposed by lee2007efficient, hofling2009estimation, ravikumar2010high, allen2012log, and yang2012graphical. Nonparametric and semiparametric approaches have also been considered. jeon2006effective and gu2013nonparametric applied SS ANOVA density models to estimate graphs (see Section 3 for details). The computation of this nonparametric approach becomes prohibitive for large dimensions. liu2009nonparanormal, liu2012high, and xue2012regularized developed an elegant nonparanormal model, which assumes that there exists a monotone transformation of each variable such that the joint distribution after transformation is multivariate Gaussian. Then any established estimation method for the GGM can be applied to the transformed variables. Other semiparametric/nonparametric methods include graphical random forests [fellinghauer2013stable], regularized score matching [lin2018methods], and kernel partial correlation [oh2017graphical].

The goal of this article is to build a semiparametric model that combines the GGM with the SS ANOVA density model. We are interested in both density and graph estimation. The remainder of the article is organized as follows. In Section 2 we introduce the semiparametric density model and methods for estimation and computation. We propose methods for graph estimation in Section 3. Section 4 presents theoretical properties of our methods in terms of both density and graph estimation. In Section 5 we evaluate our method using simulation studies. In Section 6 we present applications to two real datasets. Some technical details are gathered in the Appendix.

## 2 Density Estimation with SS ANOVA and cGGM

### 2.1 Semiparametric Density Models with SS ANOVA and cGGM

Consider the density estimation problem in which we are given a random sample of a random vector , and we wish to estimate the density function of . Let where is a -dimensional random vector for which the density function will be modeled nonparametrically and collects elements for which the conditional density will be modeled parametrically. The joint density function can be decomposed into two components:

(1)

We will model and using SS ANOVA models and cGGMs respectively. We now provide details of these models.

Assume where which is an arbitrary set. To deal with the positivity and unity constraints of a density function, we consider the logistic transform where is referred to as the logistic density function [gu2013smoothing]. We construct a model space for using the tensor product of reproducing kernel Hilbert spaces (RKHS). The SS ANOVA decomposition of functions in the tensor product RKHS can be represented as

(2)

where ’s are main effects, ’s are two-way interactions, and the rest are higher order interactions involving more than two variables. Higher order interactions are often removed in (2) for more tractable estimation and inference. An SS ANOVA model for the logistic density function assumes that belongs to an RKHS which contains a subset of components in the SS ANOVA decomposition (2). For a given SS ANOVA model, terms included in the model can be regrouped and the model space can be expressed as

(3)

where is a finite dimensional space collecting all functions that are not going to be penalized, and are orthogonal RKHS’s with reproducing kernels (RK) for . Details about the SS ANOVA model can be found in gu2013smoothing and wang2011smoothing.
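Because the symbols of display (2) did not survive extraction, the generic form of the SS ANOVA decomposition is sketched below in the standard notation of gu2013smoothing for a d-variate logistic density; the symbol names here are ours, not the paper's:

```latex
% Generic SS ANOVA decomposition of a d-variate function
% (standard notation; the paper's own symbols are elided above):
\eta(x_1,\dots,x_d)
  \;=\; C \;+\; \sum_{j=1}^{d} \eta_j(x_j)
        \;+\; \sum_{j<k} \eta_{jk}(x_j,x_k)
        \;+\; \cdots
```

Here \(C\) is a constant, the \(\eta_j\) are main effects, and the \(\eta_{jk}\) are two-way interactions; side conditions (e.g. integral constraints) make the terms identifiable.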

We assume a cGGM for . Specifically, we assume that where is a precision matrix and is a matrix that parameterizes the conditional relationship between and [sohn2012joint, wytock2013sparse, yuan2014partial]. We note that the negative log likelihood function is convex under this parameterization. An alternative assumption [yin2011sparse] may be used to model the conditional density where the negative log likelihood function is biconvex in and rather than jointly convex.
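As a concrete illustration of the convex cGGM parameterization, the sketch below computes the implied conditional mean and covariance. Since the paper's symbols are elided here, the names `Lambda` (precision matrix of the parametric block) and `Theta` (cross-block coupling matrix) and their shapes are our assumptions, following the convention of sohn2012joint and wytock2013sparse:

```python
import numpy as np

def cggm_conditional_params(Lambda, Theta, z):
    """Conditional mean and covariance under the convex cGGM
    parameterization y | z ~ N(-Lambda^{-1} Theta^T z, Lambda^{-1}).
    Lambda: (q, q) positive definite precision matrix.
    Theta:  (p, q) cross-coupling matrix; z: (p,) conditioning vector.
    Names and shapes are illustrative assumptions."""
    Sigma = np.linalg.inv(Lambda)   # conditional covariance
    mu = -Sigma @ Theta.T @ z       # conditional mean, linear in z
    return mu, Sigma
```

Under this parameterization the negative log likelihood is jointly convex, which is what makes the backfitting updates in Section 2.3 tractable.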

We will refer to the proposed semiparametric model as the combined smoothing spline and conditional Gaussian graphical (cSScGG) model. The cSScGG model is closely related to the semiparametric kernel density estimation (SKDE) proposed by hoti2004semiparametric, which considered the same decomposition as in (1). Given an iid sample , , hoti2004semiparametric estimated using the kernel density, , and using the conditional Gaussian density with mean and covariance . Specifically, they estimated and covariance by and respectively, where , is the symmetric Gaussian kernel function, , and , , and are bandwidths. Selection of bandwidths can be difficult, and the estimation of the conditional mean and covariance can be poor when the dimension of is large. The authors focused on the classification problem and set in their simulations to make the computation feasible. Under these weights the estimated conditional density does not depend on at all. In contrast, we model using a cGGM, which allows us to explore sparsity in the conditional dependence structure. In addition, the domain in our model is an arbitrary set, while the domain in the SKDE method is a subset of . While we focus on continuous in this paper, the discrete case is a natural extension of the current work.
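To make the contrast with SKDE concrete, here is a minimal sketch of a kernel-weighted conditional mean of the type SKDE builds on. The single bandwidth `h` and the function names are simplifying assumptions; hoti2004semiparametric use separate bandwidths per block:

```python
import numpy as np

def gaussian_kernel_weights(X, x0, h):
    """Normalized weights w_i proportional to a symmetric Gaussian
    kernel K((x_i - x0)/h). X: (n, p) sample; x0: (p,) query point."""
    d2 = np.sum((X - x0) ** 2, axis=1) / h ** 2
    w = np.exp(-0.5 * d2)
    return w / w.sum()

def kernel_conditional_mean(X, Y, x0, h):
    """Locally weighted estimate of E[y | x = x0]: a stand-in for the
    SKDE conditional mean (bandwidth handling deliberately simplified)."""
    w = gaussian_kernel_weights(X, x0, h)
    return w @ Y
```

When the weights are forced to be uniform (as in the simulations described above), this estimate no longer depends on the query point, which is exactly the limitation noted in the text.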

### 2.2 Penalized Likelihood Estimation

A cSScGG model consists of three parameters: and matrices and where is an RKHS given in (3) and is positive definite. Given an iid sample , , let , , , , and . Denote

(4)

(5)

as the negative log pseudo likelihood and negative log likelihood functions based on and samples respectively, where some constants are ignored and is a known density for the pseudo likelihood [gu2013smoothing]. The function is continuous, convex and Fréchet differentiable [jeon2006effective], and the function is jointly convex in and .

We estimate , and as minimizers of the penalized likelihood:

(6)

where is a semi-norm in that penalizes departure from the null space , denotes the elementwise -norm, denotes the elementwise -norm on off-diagonal entries, and indicates positive definiteness of . Together, and encourage sparsity for the cGGM. We allow different tuning parameters for different penalties.

Note that the first part of the penalized likelihood depends on only and the second part depends on and only. Therefore, we can compute the penalized likelihood estimates by solving two optimization problems separately:

(7)

and

(8)

As in gu2013smoothing, we approximate the solution of (7) by a linear combination of basis functions in and a random subset of representers. Then the estimate can be calculated using the Newton-Raphson algorithm. The smoothing parameter is selected as the minimizer of an approximated cross-validation estimate of the Kullback-Leibler (KL) divergence. Details can be found in gu2013smoothing, gu2013nonparametric, and Luothesis. In the next section we propose a new computational method for solving (8).

### 2.3 Backfitting Algorithm for cGGM

Instead of updating and simultaneously as in sohn2012joint, wytock2013sparse and yuan2014partial, we will consider a backfitting procedure to update them iteratively until convergence. We use the subscript to denote quantities calculated at iteration and to denote the -th element of a matrix .

At iteration , with being fixed at , (8) reduces to the minimization of a quadratic function plus an penalty. Therefore, without needing to calculate the Hessian matrix, can be updated efficiently using the coordinate descent algorithm. The gradient . Denote as the covariance matrix. Then the th element is updated by

(9)

where , , , and is the soft-thresholding operator with threshold .
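Since the symbols in (9) are elided, the sketch below shows the generic template that such an update follows: coordinate descent on a quadratic plus an l1 penalty, with each coordinate solved in closed form by soft-thresholding. The quadratic `(A, b)` and all names are illustrative assumptions, not the paper's actual gradient and covariance quantities:

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator S_t(x) = sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def cd_lasso_quadratic(A, b, lam, n_sweeps=100):
    """Coordinate descent for min_x 0.5*x'Ax - b'x + lam*||x||_1
    (A positive definite). Each coordinate update is a closed-form
    soft-threshold, so no Hessian factorization is needed."""
    x = np.zeros_like(b)
    for _ in range(n_sweeps):
        for j in range(len(b)):
            # partial residual: remove coordinate j's own contribution
            r = b[j] - A[j] @ x + A[j, j] * x[j]
            x[j] = soft_threshold(r, lam) / A[j, j]
    return x
```

The closed-form per-coordinate solve is why no Hessian needs to be computed for this step, as noted above.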

To update at iteration , we consider the approximate conditional distribution where both and in the conditional mean are fixed at their estimates from the -th iteration. The resulting negative log likelihood is

(10)

where terms independent of are dropped. We update by

(11)

As in hsieh2011sparse, we will find the Newton direction by approximating using a quadratic function. Based on the second-order Taylor expansion of at where and ignoring terms independent of , we consider

where and are gradient and Hessian matrices with respect to respectively, and represents the Kronecker product. The Newton direction for (11) can be written as the solution of the following regularized quadratic function [hsieh2011sparse]

(12)

Equation (12) can be solved efficiently via the coordinate descent algorithm. Specifically, let be the initial value, and be the update at iteration . Then at iteration , the th element of is updated by

(13)

where , and . Denote the penalized objective function at the -th iteration as . We adopt Armijo's rule [armijo1966minimization] to find the step size . Specifically, with a constant decrease rate (typically ), step sizes for are tried until the smallest such that

where is the backtracking termination threshold. After the step size is calculated, we update .
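The line search just described can be sketched generically as follows; the objective `f`, the decrease rate `beta`, and the sufficient-decrease constant `sigma` are placeholder names, since the paper's specific objective and constants are elided:

```python
def armijo_step(f, x, direction, grad_dot_dir, beta=0.5, sigma=1e-4,
                max_tries=30):
    """Backtracking (Armijo) line search: try step sizes 1, beta,
    beta^2, ... until the objective decreases by at least
    sigma * alpha * <gradient, direction>. grad_dot_dir must be
    negative for a descent direction. Generic sketch only."""
    f0 = f(x)
    alpha = 1.0
    for _ in range(max_tries):
        if f(x + alpha * direction) <= f0 + sigma * alpha * grad_dot_dir:
            return alpha  # sufficient decrease achieved
        alpha *= beta
    return alpha
```

Because the Newton direction already solves the regularized quadratic model (12), step size 1 is usually accepted near convergence and the backtracking loop exits immediately.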

When , we use the maximum likelihood estimates and of and as initial values for and respectively [yin2011sparse]. In the high dimensional case when is not invertible, we use the identity and zero matrix as initial values for and respectively. The regularized Newton step (12) via the coordinate descent algorithm described above is the most computationally expensive part of the algorithm. Despite its efficiency for lasso-type problems, updating all variables in is costly. To alleviate this problem, we divide the parameter set into an active set and a free set. As in hsieh2011sparse and wytock2013sparse, at the th iteration of the algorithm, we only update and over the active set defined by

(14)

As the active set is relatively small due to sparsity induced by the regularization, this strategy provides a substantial speedup.
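The active-set rule in (14) is elided, but in hsieh2011sparse-style algorithms an entry is typically marked active when it is currently nonzero or its gradient violates the l1 stationarity condition; a hedged sketch under that assumption:

```python
import numpy as np

def active_mask(Theta, grad, lam):
    """Boolean mask of entries to update: current nonzeros, plus zero
    entries whose gradient magnitude exceeds the penalty lam (these
    would become nonzero at the next update). All remaining entries
    stay fixed for the iteration, which yields the speedup.
    Names and the exact rule are illustrative assumptions."""
    return (Theta != 0) | (np.abs(grad) > lam)
```

Entries outside the mask are provably optimal at zero for the current quadratic model, so skipping them loses nothing.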

Tuning parameters and determine the sparsity of and . As a general selection tool, leave-one-out or k-fold cross-validation can be used to select these tuning parameters. Leave-one-out cross-validation (LOOCV) can be computationally intensive, and various approximations have been proposed in the literature. lian2011shrinkage and vujavcic2015computationally derived generalized approximate cross-validation (GACV) scores for selecting a single tuning parameter in the GGM. The BIC and k-fold CV have been used to select a single tuning parameter in the cGGM [yin2011sparse, sohn2012joint, wytock2013sparse, yuan2014partial, lee2012simultaneous]. LOOCV has not been used for the cGGM as it requires fitting the model times, which is computationally intensive. To the best of our knowledge, there are no computationally efficient alternatives to LOOCV in the current cGGM literature. We propose a new criterion, Leave-One-Out KL (LOOKL), for selecting and involved in (8) as minimizers of

(15)

where , , , , , , , , , , , , , , , and . The derivation is deferred to Appendix A. Note that the GACV in lian2011shrinkage and the KLCV in vujavcic2015computationally are special cases of LOOKL with . In the penalized case, we ignored the partial derivatives corresponding to the zero elements in and , as in lian2011shrinkage, and showed that the LOOKL score remains the same. More details can be found in Luothesis. Therefore, we conjecture that the proposed score is more appropriate for density estimation than for model selection.
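Whatever score is used (LOOKL here, GACV or BIC elsewhere), the selection step itself is a minimization over a grid of tuning parameters. The sketch below shows only that outer loop, with the score supplied as a callable, since the LOOKL formula itself is elided above:

```python
import itertools

def select_tuning(score, lam1_grid, lam2_grid):
    """Return the (lam1, lam2) pair minimizing a selection score.
    `score` is a user-supplied callable, e.g. a LOOKL-type criterion
    evaluated on the model fitted at those tuning parameters
    (hypothetical stand-in; the actual score is defined in the text)."""
    return min(itertools.product(lam1_grid, lam2_grid),
               key=lambda p: score(*p))
```

The point of an approximate criterion like LOOKL is that `score` can be evaluated from a single fit per grid point, rather than the refits that exact LOOCV would require.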

We note that the proposed backfitting procedure and the LOOKL method for selecting tuning parameters are new for the cGGM. When and are simultaneously updated using a second-order Taylor expansion over all parameters [wytock2013sparse], an expensive computation of the large Hessian matrix of size is required in each iteration. In contrast, our approach forms a second-order approximation of a function of , which requires a Hessian matrix of size . The remaining set of parameters in can be updated easily using the simple coordinate descent algorithm. Moreover, compared to the method in mccarter2016large, our backfitting algorithm eliminates the need for computing the large matrix in time. Note that we always require to be positive definite after each iteration, so the algorithm still has complexity flops due to the Cholesky factorization.

Some off-the-shelf packages are utilized to solve the optimization problems. Specifically, we use QUIC [hsieh2014quic] for updating , and gss [gu2014smoothing] for computing the smoothing spline estimate of . We write R code for updating using (9). We note that other penalties such as the smoothly clipped absolute deviation (SCAD) [fan2001variable] and the adaptive lasso [zou2006adaptive] may be used to replace the penalty in the estimation of and . Details can be found in Luothesis.

## 3 Graph Estimation with cSScGG Models

In Section 2 we proposed the cSScGG model as a flexible framework for estimating the multivariate density in high-dimensional settings. In terms of the graph structure, the edges among are identified by , and edges between and are identified by [sohn2012joint, wytock2013sparse, yuan2014partial]. The remaining task is the identification of conditional independence within variables , which is the target of this section.

We have assumed that the model space for the logistic density contains a subset of components in the SS ANOVA decomposition (2). The interactions are often truncated to overcome the curse of dimensionality and reduce the computational cost. As in gu2013smoothing and gu2013nonparametric, in this section we consider the SS ANOVA model with all main effects and two-way interactions as the model space for . We note that the SS ANOVA model allows pairwise nonparametric interactions, as opposed to the linear interactions in the GGM. gu2013smoothing and gu2013nonparametric proposed the squared error projection for assessing the importance of each interaction term and subsequently identifying edges. However, we cannot apply their method directly to to identify edges within , since the cGGM for also includes interaction terms among variables in .

The logarithm of the joint density can be written as

where is a constant independent of and . The main challenge in identifying conditional independence among comes from the fact that brings in an extra term, , into the interactions among . Let

(16)

where . Define the functional

(17)

and denote as . Let where collects functions whose contribution to the overall model is in question. The squared error projection of in is [gu2013smoothing]

(18)

can be regarded as a proxy of the symmetrized KL divergence [gu2013smoothing]. Assuming , it is easy to check that . Then the ratio reflects the importance of functions in . The quantity is readily computable; details for computing the squared error projection are given in Appendix B.

For any pair of variables and , consider the decomposition where is the subspace consisting of two-way interactions between and , and contains all functions in except the two-way interactions between and . Note that where . Compute the projection ratio in which is the squared error projection of in . The ratio indicates the importance of interactions between and , and we will add the interactions to the additive model sequentially according to the descending order of ’s.

Consider the space decomposition . We start with being the subspace spanned by all main effects. We calculate the projection ratio of in as where is the squared error projection in . If is larger than a threshold, is deemed important and we move the interaction with the largest from to . We then calculate the projection ratio with the updated and . The projection ratio decreases each time we move an interaction from to . Finally the process stops when falls below a cut-off value, at which time we denote the corresponding as . Let and remove the edge between and if . In our implementations, the cut-off value is set to be .
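The forward procedure just described can be sketched as a greedy loop; `pair_ratio` and `overall_ratio` below are hypothetical stand-ins for the squared error projection ratios, whose formulas are elided above:

```python
def select_interactions(pair_ratio, overall_ratio, cutoff):
    """Greedy edge-screening sketch: starting from the additive
    (main-effects-only) model, repeatedly move the two-way interaction
    with the largest pairwise ratio into the model, stopping once the
    overall projection ratio of the current model falls below `cutoff`.
    pair_ratio: dict mapping pair (i, j) -> importance ratio.
    overall_ratio: callable mapping the selected list -> ratio."""
    selected = []
    for pair in sorted(pair_ratio, key=pair_ratio.get, reverse=True):
        if overall_ratio(selected + [pair]) < cutoff:
            break  # projection ratio dropped below the cut-off
        selected.append(pair)
    return selected
```

Pairs never selected correspond to removed edges, mirroring the rule stated above for deleting the edge between two variables.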

To summarize, the conditional independences among , between and , and among are characterized by the zero elements in , and , respectively. The whole procedure for edge identification is illustrated in Figure 1.

## 4 Theoretical Analysis

We list notations, assumptions, and theoretical results only. Proofs are given in Appendix C.

### 4.1 Notations and Assumptions

Given a matrix , let , and denote the operator norm, operator norm, and Frobenius norm respectively, where represents the largest eigenvalue of . We assume that where and are the true parameters. Let , , , , , where is the th column of , denote the Hessian matrix evaluated at the true parameters, and . Let be the maximum number of non-zeros in any column of , which represents the maximum degree of in the graph. Denote and ; then . In the following theoretical analysis, we assume that and . Similar arguments apply to the case of . The objective function can be rewritten as

(19)

We make the following assumptions.

###### Assumption 1.

(Underlying Model) where has the maximum degree .

###### Assumption 2.

(Restricted Convexity) For any , let denote the nonzero indices of the -th column of (i.e., the edges between and ). We have , where denotes the smallest eigenvalue and represents the matrix with columns of indexed by .

###### Assumption 3.

(Mutual incoherence) Let denote the support set of in vector form where denotes the indicator function of whether an element is zero. Let denote the complement of . We have for some , where and represent the and sub-matrices of with entries in and respectively.

###### Assumption 4.

(Control of eigenvalues) There exist constants , such that .

Assumption 1 specifies the true underlying model. Assumption 2 ensures that the solution of the optimization problem (19) is restricted to the active set (nonzero entries in and ); this condition is also used in wainwright2009sharp and wytock2013sparse. Assumption 3 limits the influence that edges in the inactive set () can have on edges in the active set (), and Assumption 4 bounds the eigenvalues of the precision matrix. Define and .

###### Assumption 5.

is completely continuous with respect to and .

Under Assumption 5, there exists such that , , and , where is the Kronecker delta and is referred to as the eigenvalues of with respect to . Denote the Fourier series expansion of as where are the Fourier coefficients. Let where and .

###### Assumption 6.

(a) The eigenvalues of with respect to satisfy for some and when is sufficiently large.

(b) There exist constants , and such that holds uniformly for in a convex set around containing and , , and for any and .

(c) There exists some such that .

### 4.2 Asymptotic Consistency of the Estimated Parameters

The following theorem provides the estimation error bound and edge selection accuracy.