In the past three decades, flexible and parsimonious additive partially linear models (APLMs) have been extensively studied and widely used in many statistical applications, including biology, econometrics, engineering, and social science. Examples of recent work on APLMs include Liang et al. (2008), Liu et al. (2011), Ma and Yang (2011), Wang et al. (2011), Ma et al. (2013), Wang et al. (2014) and Lian et al. (2014). APLMs are natural extensions of classical parametric models with good interpretability and are becoming increasingly popular in data analysis.
Suppose we observe $\{(Y_i, \mathbf{Z}_i, \mathbf{X}_i)\}_{i=1}^{n}$. For subject $i$, $Y_i$ is a univariate response, $\mathbf{Z}_i=(Z_{i1},\ldots,Z_{id})^{\top}$ is a $d$-dimensional vector of covariates that may be linearly associated with the response, and $\mathbf{X}_i=(X_{i1},\ldots,X_{ip})^{\top}$ is a $p$-dimensional vector of continuous covariates that may have nonlinear associations with the response. We assume $\{(Y_i, \mathbf{Z}_i, \mathbf{X}_i)\}_{i=1}^{n}$ is an i.i.d. sample from the distribution of $(Y, \mathbf{Z}, \mathbf{X})$, satisfying the following model:
$$Y_i=\mu+\sum_{l=1}^{d}\alpha_l Z_{il}+\sum_{k=1}^{p}\phi_k(X_{ik})+\varepsilon_i,\qquad (1)$$
where $\mu$ is the intercept, $\alpha_l$, $l=1,\ldots,d$, are unknown regression coefficients, $\phi_k(\cdot)$, $k=1,\ldots,p$, are unknown smooth functions, and each $\phi_k$ is centered with $E\{\phi_k(X_k)\}=0$ to make model (1) identifiable. The $\mathbf{X}$ is a $p$-dimensional vector of zero-mean covariates having density $f_{\mathbf{X}}$ with a compact support. Without loss of generality, we assume that each covariate $X_k$ can be rescaled into the interval $[0,1]$. The $\varepsilon_i$ terms are i.i.d. random errors with mean zero and variance $\sigma^2$.
The APLM is particularly convenient when $\mathbf{Z}$ is a vector of categorical or discrete variables; in this case, the components of $\mathbf{Z}$ enter the linear part of model (1) automatically, and the continuous variables usually enter the model nonparametrically. In practice, we might have reasons to believe that some of the continuous variables should enter the model linearly rather than nonparametrically. A natural question is how to determine which continuous covariates have a linear effect and which have a nonlinear effect. If the choice of linear components is correctly specified, then the biases in the estimation of these components are eliminated and root-$n$ convergence rates can be obtained for the linear coefficients. However, such prior knowledge is rarely available, especially when the number of covariates is large. Thus, structure identification, or linear and nonlinear detection, is an important step in the process of building an APLM from high-dimensional data.
When the number of covariates in the model is fixed, structure identification in additive models (AMs) has been studied in the literature. Zhang et al. (2011)
proposed a penalization procedure to identify the linear components in AMs in the context of smoothing-spline ANOVA. They demonstrated the consistency of the model structure identification and established the convergence rate of the proposed method, specifically under the tensor product design. Huang et al. (2012b) proposed another penalized semiparametric regression approach, using a group minimax concave penalty to identify the covariates with linear effects. They showed consistency in determining the linear and nonlinear structure in covariates, and obtained the convergence rate of the nonlinear function estimators and asymptotic properties of the linear coefficient estimators; however, they did not perform variable selection at the same time.
For high-dimensional AMs, Lian et al. (2015) proposed a double penalization procedure to distinguish covariates that enter the nonparametric and parametric parts and to identify significant covariates simultaneously. They demonstrated the consistency of the model structure identification, and established the convergence rate of nonlinear function estimators and asymptotic normality of linear coefficient estimators. Despite the nice theoretical properties, their method heavily relies on the local quadratic approximation in Fan and Li (2001), which is incapable of producing naturally sparse estimates. In addition, employing the local quadratic approximation can be extremely expensive because it requires the repeated factorization of large matrices, which becomes infeasible when the number of covariates is very large.
Note that all the aforementioned papers (Zhang et al., 2011; Huang et al., 2012b; Lian et al., 2015) about structure identification focus on the AM with continuous explanatory variables. However, in many applications, a canonical partitioning of the variables exists. In particular, if there are categorical or discrete explanatory variables, as in the case of the SAM data studies (see the details in Section 5) and in many genome-wide association studies, we may want to keep discrete explanatory variables separate from the other design variables and let discrete variables enter the linear part of the model directly. In addition, if there is some prior knowledge of certain parametric forms for some specific covariates, such as a linear form, we may lose efficiency if we simply model all the covariates nonparametrically.
The above practical and theoretical concerns motivate our further investigation of the simultaneous variable selection and structure selection problem for flexible and parsimonious APLMs, in which the features of the data suitable for parametric modeling are modeled parametrically and nonparametric components are used only where needed. We consider the setting where both the dimension of the linear components and the dimension of the nonlinear components are ultra-high. We propose an efficient and stable penalization procedure for simultaneously identifying linear and nonlinear components, removing insignificant predictors, and estimating the remaining linear and nonlinear components. We prove that the proposed Sparse Model Identification, Learning and Estimation (referred to as SMILE) procedure is consistent. We propose an iterative group coordinate descent approach to solve the penalized minimization problem efficiently. Our algorithm is very easy to implement because it only involves simple arithmetic operations with no complicated numerical optimization steps, matrix factorizations, or inversions. In one simulation example, it takes less than one minute to complete the entire model identification and variable selection process on a regular PC.
After variable selection and structure detection, we would like to provide an inferential tool for the linear and nonparametric components. The spline method is fast and easy to implement; however, its rate of convergence is established only in the mean-square sense, and there is no asymptotic distribution or uniform convergence, so no measures of confidence can be assigned to the estimators. In this paper, we propose a two-step spline-backfitted local-linear smoothing (SBLL) procedure for APLM estimation, model selection and simultaneous inference for all the components. In the first stage, we approximate the nonparametric functions $\phi_k(\cdot)$, $k=1,\ldots,p$, with undersmoothed constant spline functions. We perform model selection for the APLM using a triple penalized procedure to select important variables and identify the linear vs. nonlinear structure for the continuous covariates, which is crucial for obtaining efficient estimators of the nonzero components. We show that the proposed model selection and structure identification for both parametric and nonparametric terms are consistent, and the estimators of the nonzero linear coefficients and nonzero nonparametric functions are both $L_2$-norm consistent. In the second stage, we refit the data with covariates selected in the first step using higher-order polynomial splines to achieve root-$n$ consistency of the coefficient estimators in the linear part, and apply a one-step local-linear backfitting to the projected nonparametric components obtained from the refitting. Asymptotic normality for both the linear coefficient estimators and the nonlinear component estimators, as well as simultaneous confidence bands (SCBs) for all nonparametric components, are provided.
The rest of the paper is organized as follows. In Section 2, we describe the first-stage spline smoothing and propose a triple penalized regularization method for simultaneous model identification and variable selection. The theoretical properties of selection consistency and rates of convergence for the coefficient estimators and nonparametric estimators are developed. Section 3 introduces the spline-backfitted local-linear estimators and SCBs for the nonparametric components. The performance of the estimators is assessed by simulations in Section 4 and illustrated by application to the SAM data in Section 5. Some concluding remarks are given in Section 6. Section A of the online Supplemental Materials evaluates the effect of different smoothing parameters on the performance of the proposed method. Technical details are provided in Section B of the Supplemental Materials.
2.1 Model Setup
In the following, the functional form (linear vs. nonlinear) for each continuous covariate in model (1) is assumed to be unknown. In order to decide the form of $\phi_k$, for each $k=1,\ldots,p$, we can decompose $\phi_k(x)$ into a linear part and a nonlinear part: $\phi_k(x)=\beta_k x+g_k(x)$, where $g_k(\cdot)$ is some unknown smooth nonlinear function (see Assumption (A1) in Appendix E.1). For model identifiability, we assume that $E\{\phi_k(X_k)\}=0$, $E\{g_k(X_k)\}=0$, and $E\{X_k g_k(X_k)\}=0$. The first two constraints, $E\{\phi_k(X_k)\}=0$ and $E\{g_k(X_k)\}=0$, are required to guarantee identifiability for the APLM. The constraint $E\{X_k g_k(X_k)\}=0$ ensures there is no linear form in the nonlinear function $g_k$. Note that these constraints are also in accordance with the definition of the nonlinear contrast space in Zhang et al. (2011), which is a subspace of the orthogonal decomposition of an RKHS. In the following, we assume the $Y_i$ values are centered so that we can express the APLM in (1) without an intercept parameter as
$$Y_i=\mathbf{Z}_i^{\top}\boldsymbol{\alpha}+\sum_{k=1}^{p}\left\{\beta_k X_{ik}+g_k(X_{ik})\right\}+\varepsilon_i.\qquad (2)$$
In the following, we define predictor variable $Z_l$ as irrelevant in model (2) if and only if $\alpha_l=0$, and $X_k$ as irrelevant if and only if $\beta_k=0$ and $g_k(x)=0$ for all $x$ on its support. A predictor variable is defined as relevant if and only if it is not irrelevant. Suppose that only an unknown subset of predictor variables is relevant. We are interested in identifying such subsets of relevant predictors consistently while simultaneously estimating their coefficients and/or functions.
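As a quick empirical illustration of the decomposition $\phi_k(x)=\beta_k x+g_k(x)$, the following sketch (illustrative numpy code with hypothetical names, not part of the proposed procedure) splits a sampled smooth function into its best linear part and a remainder satisfying the sample analogues of the centering and no-linear-trend constraints:

```python
import numpy as np

def decompose(phi_vals, x):
    """Split sampled phi(x) into beta*x plus a remainder g that is
    empirically centered (mean(g) = 0) and carries no linear trend
    (mean(x * g) = 0) -- the sample analogues of the constraints."""
    A = np.column_stack([np.ones_like(x), x])
    (intercept, beta), *_ = np.linalg.lstsq(A, phi_vals, rcond=None)
    g = phi_vals - intercept - beta * x
    return beta, g

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 1000)
beta, g = decompose(np.sin(2.0 * np.pi * x) + 3.0 * x, x)
```

Because least-squares residuals are orthogonal to the regressors, the extracted remainder automatically satisfies the empirical versions of the two constraints on $g_k$.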
For covariates $\mathbf{Z}$, we define the index set of relevant linear predictors
$$\mathcal{A}=\{l=1,\ldots,d:\alpha_l\neq 0\}.$$
For a continuous covariate $X_k$, we say it is a linear covariate if $\beta_k\neq 0$ and $g_k(x)=0$ for all $x$ on its support, and a nonlinear covariate if $g_k\not\equiv 0$. Explicitly, we define the following index sets for $\mathbf{X}$:
$$\mathcal{L}=\{k:\beta_k\neq 0,\ g_k\equiv 0\},\qquad \mathcal{N}=\{k:g_k\not\equiv 0\}.$$
Note that the active nonlinear index set $\mathcal{N}$ for $\mathbf{X}$ can be decomposed as $\mathcal{N}=\mathcal{LN}\cup\mathcal{PN}$, where $\mathcal{LN}=\{k:\beta_k\neq 0,\ g_k\not\equiv 0\}$ is the index set for covariates whose linear and nonlinear terms in (2) are both nonzero, and $\mathcal{PN}=\{k:\beta_k=0,\ g_k\not\equiv 0\}$ is the index set of purely nonlinear covariates for $\mathbf{X}$.
Therefore, the model selection problem for model (2) is equivalent to the problem of identifying the index sets $\mathcal{A}$, $\mathcal{L}$, $\mathcal{N}$, $\mathcal{LN}$, and $\mathcal{PN}$. To achieve this, we propose to minimize
$$\frac{1}{2n}\sum_{i=1}^{n}\Big\{Y_i-\mathbf{Z}_i^{\top}\boldsymbol{\alpha}-\sum_{k=1}^{p}\big(\beta_k X_{ik}+g_k(X_{ik})\big)\Big\}^{2}+\sum_{l=1}^{d}p_{\lambda_1}(|\alpha_l|)+\sum_{k=1}^{p}p_{\lambda_2}(|\beta_k|)+\sum_{k=1}^{p}p_{\lambda_3}(\|g_k\|),$$
where $\|g_k\|$ denotes the $L_2$ norm of $g_k$, and $p_{\lambda_1}(\cdot)$, $p_{\lambda_2}(\cdot)$ and $p_{\lambda_3}(\cdot)$ are penalty functions explained in detail in Section 2.3. The tuning parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ decide the complexity of the selected model and go to $0$ as $n$ increases to $\infty$. The smoothness of the predicted nonlinear functions is controlled by the spline approximation described in Section 2.2.
2.2 Spline Basis Approximation
We approximate the smooth functions in (2) by polynomial splines for their simplicity in computation. For example, for each $k=1,\ldots,p$, let $0=t_0<t_1<\cdots<t_{N_n}<t_{N_n+1}=1$ be knots that partition $[0,1]$ into $N_n+1$ subintervals. The space of polynomial splines of order $q\geq 1$ consists of functions $s$ satisfying (i) the restriction of $s$ to the subintervals $[t_J,t_{J+1})$, $J=0,\ldots,N_n-1$, and $[t_{N_n},1]$, is a polynomial of degree $q-1$ (or less); (ii) for $q\geq 2$, $s$ is $q-2$ times continuously differentiable on $[0,1]$. Below we denote by $b_{k,J}(\cdot)$ the basis functions of this spline space.
To ensure that the basis functions are centered and standardized, we consider the following normalized first-order B-splines, referred to as piecewise constant splines. We define for any $x\in[0,1]$ the piecewise constant B-spline function as the indicator function of the equally spaced subintervals of $[0,1]$ with length $h=1/(N_n+1)$, that is,
$$b_{k,J}(x)=I\{x\in[Jh,(J+1)h)\},\qquad J=0,1,\ldots,N_n,\qquad (3)$$
with the last subinterval taken to be closed, $[N_n h,1]$.
Define the following centered spline basis:
$$b_{k,J}^{c}(x)=b_{k,J}(x)-E\{b_{k,J}(X_k)\},$$
with the standardized version given for any $x\in[0,1]$ by
$$B_{k,J}(x)=\frac{b_{k,J}^{c}(x)}{\sqrt{\operatorname{Var}\{b_{k,J}^{c}(X_k)\}}},\qquad (4)$$
so that $E\{B_{k,J}(X_k)\}=0$ and $E\{B_{k,J}^{2}(X_k)\}=1$. In practice, we use the empirical distribution of $X_k$ to perform the centering and scaling in the definitions of $b_{k,J}^{c}$ and $B_{k,J}$.
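A minimal sketch of the empirically centered and standardized piecewise constant basis (illustrative numpy code, with a hypothetical helper name): each raw basis function is the indicator of one equally spaced subinterval, and the centering and scaling use the empirical distribution of the observed covariate values, as described above.

```python
import numpy as np

def std_const_basis(x, n_sub):
    """Standardized piecewise constant B-spline basis on [0, 1].

    Each raw column is the indicator of one of n_sub equally spaced
    subintervals (right edge closed); columns are then centered and
    scaled using the empirical mean and standard deviation of x."""
    idx = np.minimum((x * n_sub).astype(int), n_sub - 1)
    b = np.zeros((x.size, n_sub))
    b[np.arange(x.size), idx] = 1.0          # indicator basis b_{k,J}
    bc = b - b.mean(axis=0)                  # empirical centering
    return bc / bc.std(axis=0)               # empirical standardization

B = std_const_basis(np.linspace(0.01, 0.99, 200), 5)
```

By construction, each column has sample mean zero and sample second moment one, mirroring the population constraints on $B_{k,J}$.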
We approximate the nonparametric function $g_k(x)$, $k=1,\ldots,p$, using the above normalized piecewise constant splines:
$$g_k(x)\approx \sum_{J=0}^{N_n}\gamma_{k,J}B_{k,J}(x)=\mathbf{B}_k(x)^{\top}\boldsymbol{\gamma}_k,$$
where $\mathbf{B}_k(x)=\{B_{k,0}(x),\ldots,B_{k,N_n}(x)\}^{\top}$ and $\boldsymbol{\gamma}_k=(\gamma_{k,0},\ldots,\gamma_{k,N_n})^{\top}$ is a vector of the spline coefficients. By using the centered constant spline basis functions, we can guarantee that the fitted nonlinear components are empirically centered, and that their derivatives vanish except at the location of the knots.
2.3 Adaptive Group LASSO Regularization
We use the adaptive LASSO (Zou, 2006) and the adaptive group LASSO (Huang et al., 2010) for variable selection and estimation. Other popular choices include methods based on the smoothly clipped absolute deviation penalty (Fan and Li, 2001) or the minimax concave penalty (Zhang, 2010). Specifically, we start with group LASSO estimators $(\widetilde{\boldsymbol{\alpha}},\widetilde{\boldsymbol{\beta}},\widetilde{\boldsymbol{\gamma}})$ obtained from the following minimization:
$$\min_{\boldsymbol{\alpha},\boldsymbol{\beta},\boldsymbol{\gamma}}\ \frac{1}{2n}\sum_{i=1}^{n}\Big\{Y_i-\mathbf{Z}_i^{\top}\boldsymbol{\alpha}-\sum_{k=1}^{p}\big(\beta_k X_{ik}+\mathbf{B}_k(X_{ik})^{\top}\boldsymbol{\gamma}_k\big)\Big\}^{2}+\lambda_{1}\sum_{l=1}^{d}|\alpha_l|+\lambda_{2}\sum_{k=1}^{p}|\beta_k|+\lambda_{3}\sum_{k=1}^{p}\|\boldsymbol{\gamma}_k\|_{2}.\qquad (6)$$
Then, let $\widehat{w}_{l}=|\widetilde{\alpha}_{l}|^{-1}$, $\widehat{v}_{k}=|\widetilde{\beta}_{k}|^{-1}$, and $\widehat{u}_{k}=\|\widetilde{\boldsymbol{\gamma}}_{k}\|_{2}^{-1}$, where by convention $1/0=\infty$. The adaptive group LASSO objective function is defined as
$$\frac{1}{2n}\sum_{i=1}^{n}\Big\{Y_i-\mathbf{Z}_i^{\top}\boldsymbol{\alpha}-\sum_{k=1}^{p}\big(\beta_k X_{ik}+\mathbf{B}_k(X_{ik})^{\top}\boldsymbol{\gamma}_k\big)\Big\}^{2}+\lambda_{1}\sum_{l=1}^{d}\widehat{w}_{l}|\alpha_l|+\lambda_{2}\sum_{k=1}^{p}\widehat{v}_{k}|\beta_k|+\lambda_{3}\sum_{k=1}^{p}\widehat{u}_{k}\|\boldsymbol{\gamma}_k\|_{2}.\qquad (7)$$
The adaptive group LASSO estimators are the minimizers of (7), denoted by $(\widehat{\boldsymbol{\alpha}},\widehat{\boldsymbol{\beta}},\widehat{\boldsymbol{\gamma}})$.
The selected model structure is defined by
$$\widehat{\mathcal{A}}=\{l:\widehat{\alpha}_l\neq 0\},\qquad \widehat{\mathcal{L}}=\{k:\widehat{\beta}_k\neq 0,\ \widehat{\boldsymbol{\gamma}}_k=\mathbf{0}\},\qquad \widehat{\mathcal{N}}=\{k:\widehat{\boldsymbol{\gamma}}_k\neq\mathbf{0}\}.$$
The spline estimators of each nonlinear component function are
$$\widehat{g}_k(x)=\mathbf{B}_k(x)^{\top}\widehat{\boldsymbol{\gamma}}_k,\qquad k=1,\ldots,p.$$
Accordingly, the spline estimators for the original component functions $\phi_k$'s are $\widehat{\phi}_k(x)=\widehat{\beta}_k x+\widehat{g}_k(x)$.
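The adaptive group LASSO criterion can be minimized by simple group coordinate descent, the kind of arithmetic-only update the SMILE algorithm relies on. The sketch below is a simplified illustration (hypothetical function names), under the assumption that each design block is standardized so that $X_g^{\top}X_g=nI$, in which case every group update reduces to an exact block soft-thresholding:

```python
import numpy as np

def adaptive_weights(init_groups):
    """w_g = 1/||initial group estimate||, with the convention 1/0 = inf:
    groups eliminated by the first-stage group LASSO stay excluded."""
    norms = np.array([np.linalg.norm(np.atleast_1d(g)) for g in init_groups])
    with np.errstate(divide="ignore"):
        return np.where(norms > 0, 1.0 / norms, np.inf)

def group_soft_threshold(z, t):
    """argmin_b 0.5 * ||b - z||^2 + t * ||b||_2 (block soft-thresholding)."""
    nz = np.linalg.norm(z)
    return np.zeros_like(z) if nz <= t else (1.0 - t / nz) * z

def group_cd(y, blocks, lam, w, n_sweeps=100):
    """Group coordinate descent for
        (1/2n) * ||y - sum_g X_g b_g||^2 + lam * sum_g w_g * ||b_g||_2,
    assuming each block satisfies X_g' X_g = n * I (standardized basis)."""
    n = y.size
    b = [np.zeros(X.shape[1]) for X in blocks]
    r = y - sum(X @ bg for X, bg in zip(blocks, b))   # current residual
    for _ in range(n_sweeps):
        for g, X in enumerate(blocks):
            r = r + X @ b[g]                          # put group g back
            t = lam * w[g] if lam > 0 else 0.0        # 0 * inf treated as 0
            b[g] = group_soft_threshold(X.T @ r / n, t)
            r = r - X @ b[g]
    return b
```

The update for each group is closed-form, so a full sweep costs only a handful of matrix-vector products; with an infinite adaptive weight, a group zeroed out in the first stage can never re-enter.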
The following theorems establish the asymptotic properties of the adaptive group LASSO estimators. Theorem 1 shows the proposed method can consistently distinguish nonzero components from zero components. Theorem 2 gives the convergence rates of the estimators. We only state the main results here. To facilitate the development of the asymptotic properties, we assume the following sparsity condition:
(Sparsity) The numbers of nonzero components $|\mathcal{A}|$, $|\mathcal{L}|$ and $|\mathcal{N}|$ are fixed, and there exist positive constants $c_{\alpha}$, $c_{\beta}$ and $c_{g}$ such that $\min_{l\in\mathcal{A}}|\alpha_l|\geq c_{\alpha}$, $\min_{k\in\mathcal{L}\cup\mathcal{LN}}|\beta_k|\geq c_{\beta}$, and $\min_{k\in\mathcal{N}}\|g_k\|\geq c_{g}$.
In the following, to avoid confusion, we use $\boldsymbol{\alpha}^{0}$ and $\boldsymbol{\beta}^{0}$ to denote the true parameters in model (2), and $g_k^{0}$ to denote the true nonlinear functions in model (2). Let $\boldsymbol{\alpha}^{0}=\{(\boldsymbol{\alpha}^{0}_{(1)})^{\top},(\boldsymbol{\alpha}^{0}_{(2)})^{\top}\}^{\top}$, where $\boldsymbol{\alpha}^{0}_{(1)}$ consists of all nonzero components of $\boldsymbol{\alpha}^{0}$, and $\boldsymbol{\alpha}^{0}_{(2)}=\mathbf{0}$ without loss of generality; similarly, let $\boldsymbol{\beta}^{0}=\{(\boldsymbol{\beta}^{0}_{(1)})^{\top},(\boldsymbol{\beta}^{0}_{(2)})^{\top}\}^{\top}$, where $\boldsymbol{\beta}^{0}_{(1)}$ consists of all nonzero components of $\boldsymbol{\beta}^{0}$, and $\boldsymbol{\beta}^{0}_{(2)}=\mathbf{0}$ without loss of generality.
Suppose that Assumptions (A1), (A2)–(A6) in Appendix E.1 hold. Then
3 Two-stage SBLL Estimator and Inference
After model selection, our next step is to conduct statistical inference for the nonparametric component functions of those important variables. Although the one-step penalized estimation in Section 2.3 can quickly identify the nonzero nonlinear components, the asymptotic distribution is not available for the resulting estimators.
To obtain estimators whose asymptotic distribution can be used for inference, we first refit the data using the selected model,
$$Y_i=\sum_{l\in\widehat{\mathcal{A}}}\alpha_l Z_{il}+\sum_{k\in\widehat{\mathcal{L}}}\beta_k X_{ik}+\sum_{k\in\widehat{\mathcal{N}}}\phi_k(X_{ik})+\varepsilon_i.\qquad (8)$$
We approximate the smooth functions in (8) by the polynomial splines introduced in Section 2.2. Let $\Phi^{(q)}$ be the space of polynomial splines of order $q$, and let $\Phi^{(q),0}=\{s\in\Phi^{(q)}:E\{s(X_k)\}=0\}$. Working with $\Phi^{(q),0}$ ensures that the spline functions are centered; see for example Xue and Yang (2006); Wang and Yang (2007); Wang et al. (2014). Let $\{B_{k,J}(\cdot),\ J=1,\ldots,J_n\}$ be a set of standardized spline basis functions for $\Phi^{(q),0}$ with dimension $J_n$, so that $E\{B_{k,J}(X_k)\}=0$ and $E\{B_{k,J}^{2}(X_k)\}=1$. Specifically, if $q=1$, $B_{k,J}$ is the standardized piecewise constant spline function defined in (4).
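To make the refitting step concrete, here is a small illustrative simulation (a hypothetical setup with made-up coefficients, not the SAM analysis): after selection, the retained linear covariates and a centered constant-spline basis for the retained nonlinear covariate are assembled into a single design matrix and refit by least squares.

```python
import numpy as np

# Hypothetical post-selection refit: two covariates kept in the linear
# part and one continuous covariate kept in the nonlinear part.
rng = np.random.default_rng(1)
n = 400
Z = rng.normal(size=(n, 2))                 # selected linear covariates
X = rng.uniform(size=n)                     # selected nonlinear covariate
y = Z @ np.array([1.5, -2.0]) + np.sin(2.0 * np.pi * X)

# First-stage basis (q = 1): centered indicators of 10 equal subintervals.
idx = np.minimum((X * 10).astype(int), 9)
B = np.zeros((n, 10))
B[np.arange(n), idx] = 1.0
B -= B.mean(axis=0)                         # empirical centering

D = np.column_stack([Z, B])                 # refit design matrix
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
alpha_hat, gamma_hat = coef[:2], coef[2:]   # linear and spline parts
```

Because the centered indicator columns sum to zero, the design is rank-deficient by one; `lstsq` returns the minimum-norm solution, and the linear coefficients are still uniquely determined.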
We propose a one-step backfitting in which refitted pilot spline estimators from the first stage are followed by local-linear estimators. The refitted spline coefficients are defined as
$$(\widehat{\boldsymbol{\alpha}}^{\mathrm{R}},\widehat{\boldsymbol{\beta}}^{\mathrm{R}},\widehat{\boldsymbol{\gamma}}^{\mathrm{R}})=\operatorname*{arg\,min}\ \sum_{i=1}^{n}\Big\{Y_i-\sum_{l\in\widehat{\mathcal{A}}}\alpha_l Z_{il}-\sum_{k\in\widehat{\mathcal{L}}}\beta_k X_{ik}-\sum_{k\in\widehat{\mathcal{N}}}\sum_{J=1}^{J_n}\gamma_{k,J}B_{k,J}(X_{ik})\Big\}^{2}.$$
Then the refitted spline estimator of the nonlinear functions is
$$\widehat{\phi}^{\mathrm{R}}_k(x)=\sum_{J=1}^{J_n}\widehat{\gamma}^{\mathrm{R}}_{k,J}B_{k,J}(x),\qquad k\in\widehat{\mathcal{N}}.$$
Next we establish the asymptotic normal distribution for the parametric estimators. To make $\boldsymbol{\alpha}$ estimable at the $\sqrt{n}$ rate, we need a condition to ensure that $\mathbf{Z}$ and $\mathbf{X}$ are not functionally related. Define $\mathcal{H}$ as the Hilbert space of theoretically centered additive functions. For any $l=1,\ldots,d$, let $\pi_l$ be the coordinate mapping that maps $\mathbf{Z}$ to its $l$-th component, so that $\pi_l(\mathbf{Z})=Z_l$, and let $\operatorname{Proj}_{\mathcal{H}}\pi_l$ be the orthogonal projection of $\pi_l$ onto $\mathcal{H}$. Let $z_l^{*}=\pi_l-\operatorname{Proj}_{\mathcal{H}}\pi_l$. Similarly, for any $k=1,\ldots,p$, let $\varpi_k$ be the coordinate mapping that maps $\mathbf{X}$ to its $k$-th component, so that $\varpi_k(\mathbf{X})=X_k$, and let $\operatorname{Proj}_{\mathcal{H}}\varpi_k$
be the orthogonal projection of $\varpi_k$ onto $\mathcal{H}$. Let $x_k^{*}=\varpi_k-\operatorname{Proj}_{\mathcal{H}}\varpi_k$. Define the vectors $\mathbf{z}^{*}=(z_1^{*},\ldots,z_d^{*})^{\top}$ and $\mathbf{x}^{*}=(x_1^{*},\ldots,x_p^{*})^{\top}$.
The proof of Theorem 3 is similar to the proofs of Liu et al. (2011) and Li et al. (2018), and is thus omitted. If the error variance $\sigma^2$ and the projections defined above are given, the asymptotic covariance matrix of the linear coefficient estimators can be consistently estimated by its sample analogue, with the relevant quantity given in (E.16) in the Supplemental Materials. In practice, we replace the unknown population quantities with their empirical counterparts to obtain the corresponding estimate.
In the selection step, we estimate the active index sets consistently; that is, the probability that the selected sets equal the true sets tends to one. On this event, the refitted estimator of the linear coefficients is root-$n$ consistent according to Theorem 3. Since the event is shown to have probability tending to one, we can conclude that the estimator is also unconditionally root-$n$ consistent.
Denote by $K(\cdot)$ a continuous kernel function, and let $K_h(u)=K(u/h)/h$ be a rescaling of $K$, where $h$ is usually called the bandwidth. Next, we define the spline-backfitted local-linear (SBLL) estimator of $\phi_k$ based on the pseudo-responses $\widehat{Y}_{ik}$, which attempts to mimic the would-be SBLL estimator of $\phi_k$ based on the unobservable "oracle" responses if they were available:
$$\widehat{\phi}^{\mathrm{SBLL}}_{k}(x)=(1,0)\left\{\mathbf{D}_k(x)^{\top}\mathbf{W}_k(x)\mathbf{D}_k(x)\right\}^{-1}\mathbf{D}_k(x)^{\top}\mathbf{W}_k(x)\widehat{\mathbf{Y}}_{k},$$
where $\widehat{\mathbf{Y}}_{k}=(\widehat{Y}_{1k},\ldots,\widehat{Y}_{nk})^{\top}$, with the pseudo-responses as defined in (12); and the weight and "design" matrices are
$$\mathbf{W}_k(x)=\operatorname{diag}\left\{K_h(X_{1k}-x),\ldots,K_h(X_{nk}-x)\right\},\qquad \mathbf{D}_k(x)=\begin{pmatrix}1&X_{1k}-x\\ \vdots&\vdots\\ 1&X_{nk}-x\end{pmatrix}.$$
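For illustration, the local-linear building block of the SBLL estimator can be coded in a few lines (a generic sketch with an Epanechnikov kernel; the function name and kernel choice are ours, not prescribed by the procedure):

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local-linear fit at x0: weighted least squares on (1, x - x0)
    with kernel weights K_h(x_i - x0) = K((x_i - x0)/h) / h."""
    u = (x - x0) / h
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0) / h  # Epanechnikov
    D = np.column_stack([np.ones_like(x), x - x0])                  # "design" matrix
    WD = w[:, None] * D
    coef = np.linalg.solve(D.T @ WD, WD.T @ y)
    return coef[0]          # intercept = estimate of the function at x0
```

A local-linear smoother reproduces linear functions exactly, which is why it has smaller boundary bias than the local-constant (Nadaraya-Watson) smoother and is the natural choice for the backfitting step.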