Smoothing Spline Semiparametric Density Models

01/10/2019 ∙ by Jian Shi, et al. ∙ 0

Density estimation plays a fundamental role in many areas of statistics and machine learning. Parametric, nonparametric and semiparametric density estimation methods have been proposed in the literature. Semiparametric density models are flexible in incorporating domain knowledge and uncertainty regarding the shape of the density function. Existing literature on semiparametric density models is scattered and lacks a systematic framework. In this paper, we consider a unified framework based on the reproducing kernel Hilbert space for modeling, estimation, computation and theory. We propose general semiparametric density models for both a single sample and multiple samples which include many existing semiparametric density models as special cases. We develop penalized likelihood based estimation methods and computational methods under different situations. We establish joint consistency and derive convergence rates of the proposed estimators for both the finite dimensional Euclidean parameters and an infinite-dimensional functional parameter. We validate our estimation methods empirically through simulations and an application.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Density estimation plays a fundamental role in many areas of statistics and machine learning. The estimated density functions are useful for model building and diagnostics, inference, prediction and clustering. Traditionally there are two distinct approaches for density estimation: maximum likelihood method within a parametric family and nonparametric methods. Parametric models are in general parsimonious with meaningful and interpretable parameters

[pearson1902systematic, pearson1902systematic2, kendall1946advanced, fisher1997absolute]. Nonparametric methods based on minimal assumptions are in general more flexible [Silverman, izenman1991review, Gu2013bk]. Often in practice it is desirable to model some components of the density function parametrically while leaving other components unspecified. Many semiparametric density models have been proposed for different purposes, including the partially linear semiparametric density models [yang2009penalized], the mixture models [olkin1987semiparametric, bordes2006semiparametric, ma2015flexible], mixture models with exponential tilts for multiple samples [Anderson1972, Qin1999, ZouFineYandell2002, Tan2009], combinations of parametric and nonparametric functions [hjort1995nonparametric, efron1996using], semiparametric models for Bayesian testing [lenk2003bayesian], and transformation models including the location-scale family distributions [wand1991transformations, potgieter2012nonparametric]. However, a systematic framework with a broad formulation of the semiparametric density models is lacking. In this paper, we propose a general framework that includes the above mentioned semiparametric density models as special cases. We develop a smoothing spline based estimation method for the general model and prove the asymptotic results based on our estimation method. The general framework provides a unified platform for the developments of estimation, computation, and theory.

For a single random sample, ,

, from a common probability density

on a general domain , we consider the following general semiparametric density model

(1)

where is a known function of given and , which will be referred to as the logistic transformation of . The parameter and the nonparametric function are unknown and need to be estimated. Often certain conditions, which depend on the form of , are necessary to make model (1) identifiable. We assume that model (1) is identifiable, and discuss identifiability conditions for specific models in the following sections.

Many existing semiparametric models are special cases of model (1). olkin1987semiparametric proposed a mixture of a parametric and a nonparametric density function, namely, , where is a known density function up to parameters , is an unknown weight parameter, and is a nonparametric density function. They showed that the semiparametric density estimate provides a compromise between the parametric and nonparametric estimates. It is a special case of model (1) with , where and are the parameters and is the nonparametric function. hjort1995nonparametric proposed a density estimation procedure by starting out with a parametric density estimate , and then multiplying with a nonparametric kernel type estimate of a correction function . It was shown that their semiparametric density model can perform better than a nonparametric fit when the true density is in the neighborhood of the initial parametric density. Their model is a special case of model (1) with , where is the parameter and is the nonparametric function. efron1996using proposed a specially designed exponential family for density estimation, , where

is a carrier density and estimated by kernel density estimation,

is a

-dimensional vector of sufficient statistics,

is a -dimensional vector of parameters, and is a normalizing parameter making integrate to 1 over . The proposed method matches the estimated expectation of with its sample expectation. The model is a special case of model (1) with , where is the carrier density given by nonparametric function , and and are the parameters. lenk2003bayesian proposed a flexible semiparametric model for Bayesian testing of , where is a vector of nonconstant functions, is a zero mean, second-order Gaussian process with bounded, continuous covariance function, and is a known dominating measure on the support . The semiparametric model allows the predictive distribution to deviate from the parametric family. If the parametric family is inadequate, the semiparametric predictive density coherently adapts to the data. yang2009penalized also used the logistic transformation of density function as in lenk2003bayesian with modeled as an unknown smooth function. The model in yang2009penalized is a special case of model (1) with . wand1991transformations considered density estimation of data with local features, and proposed a data transformation technique so that global smoothing is appropriate. This transformation model is a special case of model (1) with where is the transformation, is the parameter and is the nonparametric function.

In the case of multiple samples, assume there are independent groups, and in each group , there are iid observations such that on domain . We consider the following general semiparametric density model

(2)

where , the logistic transformation of , is a known function of the parameter and the nonparametric function . We are interested in the estimation of , , and ultimately the overall density function from the observations.

potgieter2012nonparametric considered a two sample transformation model. Suppose that and , and ’s have the same distribution as ’s after a certain invertible transformation parametrized by , denoted as . They considered the nonparametric estimation of the density function based on asymptotic likelihood, and showed that the estimators are often near optimal when compared to fully parametric methods. The model is a special case of our model (2) with , and , where is the logistic transformation of . Anderson1972, Qin1999, ZouFineYandell2002, and Tan2009 considered exponential tilt mixture models for biased sampling and case control studies, where multiple samples are collected and each sample follows an exponential tilt mixture model, that is, for sample , , and . We can see that this model is also a special case of model (2) with , where is the parameter and is the logistic transformation of .

We consider the smoothing spline based estimation method for models (1) and (2), that is, we assume that , where is a reproducing kernel Hilbert space (RKHS). Our work builds upon the existing literature and extends it to include semiparametric nonlinear density models. We develop computation methods based on profile penalized likelihood and backfitting, and joint asymptotic theory of the parametric and nonparametric components. For the single sample case, smoothing spline based nonparametric density estimation has been considered by many authors, including silverman_1982, CoxOsullivan, and Gu2013bk. yang2009penalized extended such methods for nonparametric models to a semiparametric density estimation model, where is linear in both the parametric and nonparametric components and , as discussed above. Great progress has been made toward joint asymptotic theory in semiparametric models recently, with the seminal work by cheng2015joint and follow-up papers [ZhaoChengLiu16, ChaoVolgushevCheng17]. Prior to this, semiparametric asymptotic theory focused on the convergence of the parametric component, whereas the nonparametric component was usually considered as a nuisance parameter [Bickelbk, Tsiatisbk, Kosorokbk]

. We note that one approach is to extend the framework for joint asymptotics of the semiparametric linear regression model in cheng2015joint to the semiparametric linear density model, and the same joint rate of convergence as derived by yang2009penalized can be obtained. This approach relies heavily on the extension of the reproducing kenel of

, where the nonparametric component lives, to the product parameter space using the linear form of in and . However, such an extension for a general nonlinear function is nontrivial to obtain. In order to develop joint asymptotics for model (1) with general and for model (2) with general , we extend an approach from gu_qiu_1993 for nonparametric density estimation to the general nonlinear semiparametric model through a first order linearization technique motivated by the study of the nonlinear nonparametric regression model in o’sullivan_1990. In particular, we specify the additional assumptions needed for our general framework and we introduce an appropriate metric in which the rate of convergence is derived for the joint estimator. Additional assumptions include, for example, smoothness criteria for and the existence of certain bounded linear operators related to the first Fréchet partial derivatives of , all of which are redundant when is linear in . In addition to the rate of convergence of the joint estimator, we also obtain the convergence rate of the parametric component in the standard norm as well as the convergence rate of the overall density function in the symmetrized Kullback-Leibler distance.

In Section 3, we consider models (1) and (2) in three scenarios and develop different computation procedures. Asymptotic properties of the proposed estimators are considered in Section 4. Simulations are conducted to evaluate the proposed estimation procedures in Section 5. Section 6 shows applications to real dataset.

2 Some notations

We begin by introducing some notations that will be used throughout the rest of the paper. Unless specified otherwise, let denote any vector, whose the th element is . The standard norm of is denoted . We also denote any matrix , where represent the th entry of the matrix. If there exist positive constants , such that , we write . We use the notation to represent the expectation taken over the joint sample distribution, whose density function is . For simplicity, we will also sometimes use to represent the pair , i.e., and .

Denote the product parameter space as . Let be the Fréchet partial differential operator with respect to , and let be the Fréchet partial differential operator with respect to . If and are any (real) Banach spaces, represents the space of bounded linear operators from to . Note that for any function , the Fréchet partial derivatives of are maps and by definition.

3 Penalized likelihood estimation

In this section, we will focus on model (2). Model (1) is a special case with . We propose to estimate and in (2) as the minimizer of the following penalized likelihood with respect to :

(3)

where the first part of (3) is the joint negative log likelihood, is a square seminorm penalty, and is the smoothing parameter. Assume that is a -dimensional space with basis functions , then , where is an RKHS with reproducing kernel (RK) denoted by .

The minimizer of the penalized likelihood (3) does not fall in a finite dimensional space. In the following, we consider estimation of the model (2) under the following three scenarios: additive, general, and transformation models.

3.1 The additive model

The model (2) is additive when is the sum of a term involving the parameters and the nonparametric component, that is,

(4)

where is a known function of with unknown parameters . The model proposed by efron1996using, lenk2003bayesian, yang2009penalized, and hjort1995nonparametric are special cases of with . In particular, the first three of them considered the linear additive model in which , where . The model proposed in ZouFineYandell2002 is an example of the nonlinear additive model with , where and .

For the additive semiparametric density model (4), the penalized likelihood objective (3) becomes the following:

(5)

We propose a profiled penalized likelihood based approach to optimize (5). First, with a fixed , optimizing (5) with respect to is equivalent to optimizing the following weighted penalized likelihood:

(6)

where . As in Gu (2013), we approximate the solution by

(7)

where is a random sample of , , , , and .

For and , let , be the diagonal matrix whose th diagonal entry is where , be the matrix whose )th entry is for , and be the matrix whose th entry is for . Let be the matrix whose th entry is . Define for , , , and . These notations extend in the obvious way to vectors and matrices of functions.

We plug (7) back into (6) and compute minimizers and using the Newton iterative method. Let , where and are and vectors calculated in the previous step of the iteration. Denote

and note is the transpose of . We can also define , , and similarly to the first three equations above by replacing by and by . Similar to Gu2013bk, the Newton updating equation is

(8)

At convergence of the Newton method, we denote the estimate of with fixed and as . Similar to Gu2013bk, the smoothing parameter is selected by optimizing the relative Kullback-Leibler (KL) distance between the density associated with the estimate, , and the true density corresponding to for ,

(9)

Following similar derivations in Gu2013bk, an approximated cross-validation estimate of the relative Kullback-Leibler distance is

(10)

where , , , , and . We may set a larger to prevent under-smoothing, such as as in Gu2013bk.

The optimal smoothing parameter for a given , , is defined as the that minimizes the cross validation criterion (10). The corresponding nonparametric function estimate is denoted as for simplicity. Plugging into (5), we have the profiled penalized likelihood

(11)

Then the estimate of , , is the minimizer of (11). The minimization is achieved by the line search algorithm in nelder1965simplex. The final estimate of is .

In fact, for a given , minimizing the cross validation criterion (10) to obtain and minimizing (5) to obtain can be easily implemented by utilizing the ssden function in the R package gss with the option bias to specify the weights [gss].

3.2 The general nonlinear model

In this section, we consider the general form of in model (2) that is not additive. For example, when , a mixture model of two densities assumes , where and . Originally considered by olkin1987semiparametric, this model is often referred to as the two-component mixture model. Different estimation methods and applications can be found in bordes2006semiparametric and ma2015flexible.

We will extend the Gauss-Newton procedure for the estimation of and to the general nonlinear semiparametric density model. For fixed , let be the current estimate of . We assume that the Fréchet derivative of with respect to at exists, and we denote it by . Approximating by its first order Taylor expansion at and , we have

(12)

where .

Through this linearization, the nonlinear semiparametric density is approximated by an additive model as in Section 3.1 but with the evaluational functional replaced by the bounded linear functional . Note that the method in Section 3.1 can be extended to bounded linear functionals with replaced by and replaced by , .

In summary, for the general nonlinear semiparametric density model, there are two loops of iterations. In the inner loop, for a fixed , we estimate by the extended Gaussian-Newton iteration until convergence, where smoothing parameters are estimated at each iteration. In the outer loop, the profile penalized likelihood (11) is optimized with respect to .

3.3 The transformation model

Model (2) is called a transformation model if

(13)

where ’s are known differentiable and invertible functions with unknown parameters . When is fixed, the logistic density of is . Then the problem reduces to nonparametric density estimation with different weights. The two sample location-scale family density model [potgieter2012nonparametric] belongs to this case with and , where and are the location and scale parameters and .

We note that the profiled penalized likelihood method developed for the general nonlinear model in Section 3.2 applies to the transformation model. Nevertheless, the following backfitting procedure works better for the transformation model. First, provide initial value for the parameter . Next we iterate between the following two steps until convergence:

  1. Based on current estimates , transform the data using and we have . With the transformed data, update by minimizing the penalized likelihood

    (14)

    with the weight function .

  2. Based on current estimate of , update as the MLE based on the likelihood of .

The minimization of the weighted penalized likelihood (14) is solved by the Newton method proposed in Gu2013bk.

4 Joint asymptotic theory

In this section, we develop a joint asymptotic theory for the penalized likelihood estimators for models (1) and (2).

For , is a density function over the sample space of the form given by (2). We consider and a known function , where , is the space of functions on

with finite second moment with respect to the measure given by the true density

, and is the true parameter. The constant functions have been removed for identifiability. In general, need not be one-to-one on . However, for identifiability, we require that for all , is one-to-one in a neighborhood of the true parameter. We specify the precise assumptions on this neighborhood in Section 4.2.2.

Recall the penalized likelihood for estimating the parameter given by (3), and denote it as . Let . If the joint negative log likelihood is convex, under certain conditions (see Theorem  in Gu2013bk), the existence of the penalized semiparametric estimator given by

is guaranteed. In general, the functions need not satisfy these conditions, so for our analysis on joint consistency, we assume that is convex and achieves a unique local minimum at in . We study the consistency of this estimator and obtain the rate of convergence of to the true parameters as and . In our analysis, we may drop the subscript when for convenience.

Note that by adapting the methods in cox_1988, CoxOsullivan, and o’sullivan_1990, we have established local existence and uniqueness of . A detailed proof can be found in the Appendix.

4.1 Outline of the proof of consistency

By our assumption for , the estimator is taken to be the unique solution of

in . To study the asymptotic behavior of the estimator , we follow a similar outline as in the nonparametric case [gu_qiu_1993]. We first introduce an approximation of , denoted , which is the minimizer of

(15)

where , , and . We see that for each , is almost like a quadratic approximation of at , ignoring terms independent of and terms involving second derivatives of . Since and are quadratic functionals, one can check that is convex with respect to , and attains its minimum at if and only if is the solution for the system

Using a quadratic approximation of the penalized likelihood to attain an approximation of the estimator is a common intermediate step in the study of convergence of smoothing spline estimators (see silverman_1982, CoxOsullivan, o’sullivan_1990, and gu_qiu_1993). In particular, when , our definition of can be seen as a generalized version of the quadratic approximation given by equation (5.2) in gu_qiu_1993.

In Section 4.2, we define a norm related to and , in which the rate of convergence for can be derived. For readability, we restrict our analysis to the special case when and establish the rate of convergence for in Section 4.3. Together with a bound for the approximation error , the rate of convergence for then follows by the triangle inequality (Section 4.4). In addition, we also establish the rate of convergence of the estimate in terms of the symmetrized Kullback-Leibler distance, which is defined as

(16)

Section 4.5 states results for without proofs since the proofs are straightforward extensions of those in Sections 4.3 and 4.4.

4.2 Norm and assumptions

In this section, we introduce an appropriate norm in which the convergence rate of our estimator is measured and we provide some assumptions needed for our analysis.

4.2.1 Parameter space

Assumption 1.

The penalty is a square seminorm in with a finite-dimensional null space . Therefore, extends to a square seminorm on , and its null space is again finite-dimensional. Denote by the semi-inner product associated with the seminorm . We also assume that .

Assumption 2.

For , there are bounded linear operators and , with zero nullspaces, which satisfy the following conditions:

  1. Suppose is a real Hilbert space of functions, equipped with norm . For any , there exist positive constants , such that

  2. For any satisfying and for any , there exists a positive constant such that

    (17)

By Assumptions 1 and 2, we see that is an inner product on , and its induced norm, denoted by , is complete on . One can also see that can be represented by the vector of functions . We may use to denote the linear operator or its vector form, i.e., . Denote by the matrix in which the th entry is , and note that one can write